Deep Neural Network in Python

I have written a neural network in Python, focused on adaptability and performance. I want to use it to dive deeper into the field. I am far from being an expert in neural networks, and the same goes for Python. I do not want to use TensorFlow, since I really want to understand how a neural network works.



My questions are:



  • How can I increase the performance? At the moment it takes days to train the network.

The code runs on a single core, but since each iteration of the loop over the batch is independent of the others, it can be parallelized.

  • How can I parallelize the loop over the batch?

I found some tutorials on parallel loops in Python, but I could not apply them to my problem.



Here is my tested code with some pseudo training data:



from numpy import random, zeros, array, dot
from scipy.special import expit
import time

def sigma(x):
    return expit(x)

def sigma_prime(x):
    # Derivative of the logistic sigmoid: sigma(x)*(1 - sigma(x))
    u = expit(x)
    return u - u*u

def SGD(I, L, batch_size, eta):

    images = len(L)

    # Pre-activations, one column vector per layer after the input
    z = [zeros((layer_size[l], 1)) for l in range(1, nn_size)]

    # Activations, one column vector per layer including the input
    a = [zeros((layer_size[l], 1)) for l in range(nn_size)]

    # Ground truth: one-hot encoding of the labels
    y = zeros((images, layer_size[-1]))
    for i in range(images):
        y[i, L[i]] = 1.0

    while (1):

        t0 = time.time()

        # Create a random batch of example indices (sampled with replacement)
        batch = random.randint(0, images, batch_size)

        # Gradient accumulators for weights and biases
        dW = [zeros((layer_size[l+1], layer_size[l])) for l in range(nn_size-1)]
        db = [zeros((layer_size[l], 1)) for l in range(1, nn_size)]

        for i in batch:
            # Feedforward
            a[0] = array([I[i]]).T
            for l in range(nn_size-1):
                z[l] = dot(W[l], a[l]) + b[l]
                a[l+1] = sigma(z[l])

            # Backpropagation
            delta = (a[nn_size-1] - array([y[i]]).T) * sigma_prime(z[nn_size-2])
            # Note: the next two lines are equivalent, so the output-layer
            # weight gradient is accumulated twice; the accepted answer's
            # vectorized version reproduces this with an explicit factor of 2.
            dW[nn_size-2] += dot(delta, a[nn_size-2].T)
            dW[nn_size-2] += delta.dot(a[nn_size-2].T)
            db[nn_size-2] += delta
            for l in reversed(range(nn_size-2)):
                delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
                dW[l] += dot(delta, a[l].T)
                db[l] += delta

        # Update weights and biases with the averaged batch gradient
        for l in range(nn_size-1):
            W[l] += -eta * dW[l] / batch_size
            b[l] += -eta * db[l] / batch_size

        print(time.time() - t0)

input_size = 1000
output_size = 10

layer_size = [input_size, 30**2, 30**2, 30**2, output_size]

nn_size = len(layer_size)

# Weights
W = [random.randn(layer_size[l+1], layer_size[l]) for l in range(nn_size-1)]

# Biases
b = [random.randn(layer_size[l], 1) for l in range(1, nn_size)]

# Some random training data with labels
size_training_data = 1000
# Random data I of size "size_training_data" x "input_size"
I = random.rand(size_training_data, input_size)
# One label per training example (the original passed input_size here,
# which happens to equal size_training_data)
L = random.randint(0, 10, size_training_data)

batch_size = 100
eta = 0.1
SGD(I, L, batch_size, eta)


The output shows the time needed for one batch of size batch_size.
– Samuel (asked Mar 23 at 8:11, edited Mar 27 at 21:03)

  • Could you make it clearer what the printed output means, as well as what the variables mean?
    – Solomon Ucko
    Mar 27 at 2:08











  • @SolomonUcko Done!
    – Samuel
    Mar 27 at 21:04










  • Also, could you explain what all the variables are?
    – Solomon Ucko
    Mar 27 at 21:35










  • @SolomonUcko I do not think the meaning of the variables is important, and this is not the right place to explain how neural nets work. The operations are simple, but the performance is poor; one does not have to understand the variables to improve the performance.
    – Samuel
    Mar 27 at 21:39










  • Good point. I'm not really sure how you can optimize it. I was thinking that it would be easier to optimize it if I could understand what each thing is supposed to do. I did notice that while (1): should probably be while True: and that you could probably use expit instead of sigma, as opposed to creating an 'alias'.
    – Solomon Ucko
    Mar 27 at 21:46
1 Answer
Your epochs are slow because you iterate over each example in the batch and calculate the gradients in a Python for loop. The key to speeding this up is realizing that you perform the same operations on every example in the batch, so you can stack the examples into a matrix and calculate the gradients for all of them in a single set of matrix operations.



Let's break that down. For a single example, you start with a feature vector, a column of size (1000, 1). It is linearly transformed by multiplying it by a weight matrix of size (900, 1000), resulting in a (900, 1) vector, to which a bias vector of size (900, 1) is added. This is then non-linearly transformed, which does not affect the dimensions, giving the first hidden layer of size (900, 1).



This is 900 hidden nodes for the first example.



However, since we perform the same operations on every example in the batch, we can stack the 100 examples to form a matrix of size (100, 1000) instead of (1, 1000), and take the dot product of this input matrix with the transpose of the weight matrix, (1000, 900), for a resulting matrix of size (100, 900). Add the bias, reshaped to (1, 900), which numpy automatically broadcasts to (100, 900) (it's the same bias vector stacked 100 times), and apply the non-linear transform for a final matrix of size (100, 900). That is 900 hidden nodes for each of the 100 examples.
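
As a quick sanity check of those shapes (a minimal sketch with random placeholder arrays, not the actual network):

import numpy as np
from scipy.special import expit as sigma

X = np.random.rand(100, 1000)    # 100 stacked examples
W0 = np.random.randn(900, 1000)  # first weight matrix, shaped as in the question
b0 = np.random.randn(900, 1)     # first bias, a column vector

H = sigma(X.dot(W0.T) + b0.T)    # (100,1000) @ (1000,900) + (1,900) -> (100,900)
print(H.shape)                   # (100, 900)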



This can be applied to each hidden layer in the network.



Factoring the original code:



for i in batch: 
    # Feedforward
    a[0] = array([I[i]]).T
    for l in range(nn_size-1):
        z[l] = dot(W[l], a[l]) + b[l]
        a[l+1] = sigma(z[l])

    # Backpropagation
    delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
    dW[nn_size-2] += dot(delta, a[nn_size-2].T)
    dW[nn_size-2] += delta.dot(a[nn_size-2].T)
    db[nn_size-2] += delta
    for l in reversed(range(nn_size-2)):
        delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
        dW[l] += dot(delta, a[l].T)
        db[l] += delta


into matrix math form:



# (assumes numpy is additionally imported as "import numpy as np")
a[0] = I[batch]
for l in range(nn_size-1):
    z[l] = a[l].dot(W[l].T) + b[l].T
    a[l+1] = sigma(z[l])

delta = (a[nn_size-1] - y[batch]) * sigma_prime(z[nn_size-2])
# The factor of 2 mirrors the doubled accumulation in the original loop
dW[nn_size-2] += 2*delta.T.dot(a[nn_size-2])
db[nn_size-2] += np.sum(delta.T, axis=1, keepdims=True)

for l in reversed(range(nn_size-2)):
    delta = delta.dot(W[l+1]) * sigma_prime(z[l])
    dW[l] += a[l].T.dot(delta).T
    db[l] += np.sum(delta.T, axis=1, keepdims=True)


When calculating gradients over a batch, taking either the sum or the average of the per-example gradients works, but Andrew Ng suggests using the average over the batch, as explained in his course and here: https://stats.stackexchange.com/questions/183840/sum-or-average-of-gradients-in-mini-batch-gradient-decent



In this case, since you already divide the accumulated gradients by the batch size in the update step, summing the gradients over the batch amounts to averaging them.
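
As a trivial check of that equivalence (the array shapes here are placeholders):

import numpy as np

grads = np.random.randn(100, 9, 10)  # hypothetical per-example gradients
print(np.allclose(grads.mean(axis=0), grads.sum(axis=0) / 100))  # True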



With the original for-loop implementation, over 10 epochs, each epoch takes anywhere between 1.75s and 2.25s, with an average of 1.91s per epoch.



With the matrix implementation, over 100 epochs, each epoch takes between 0.06s and 0.25s, with an average of 0.08s per epoch.



Parallelizing SGD is also somewhat non-trivial, because SGD is an inherently sequential algorithm. Imagine you are standing on top of a hill, trying to find the quickest way down without being able to see it. The only way down is to actually take steps downwards and re-evaluate your options at each step. Even if there were multiple clones of you, they would all be standing at the same spot on the hill, so the information gained would be limited. Nevertheless, there are techniques for parallel SGD (a rough sketch of one follows the references below), and if you are determined to pursue that, I would suggest reading some papers on the topic:



Cheng 2017: Weighted parallel SGD for distributed unbalanced-workload training system https://arxiv.org/abs/1708.04801



Zinkevich 2010: Parallelized Stochastic Gradient Descent http://martin.zinkevich.org/publications/nips2010.pdf



Or for a higher level overview:
http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent
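
To make the idea concrete, here is a minimal sketch of one-shot parameter averaging in the spirit of Zinkevich et al., not the paper's exact algorithm: each worker runs SGD independently on its own shard of the data, and the resulting parameter vectors are averaged at the end. The toy least-squares problem and all names below are illustrative only.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def sgd_on_shard(args):
    # Plain SGD on one shard; gradient of the squared error 0.5*(x.w - y)^2
    X, y, eta, epochs = args
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            w -= eta * (X[i].dot(w) - y[i]) * X[i]
    return w

def parallel_sgd(X, y, workers=4, eta=0.01, epochs=5):
    # Shard the data, run SGD on each shard in a separate process,
    # then average the resulting parameter vectors.
    shards = [(Xs, ys, eta, epochs)
              for Xs, ys in zip(np.array_split(X, workers), np.array_split(y, workers))]
    with ProcessPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(sgd_on_shard, shards))
    return np.mean(results, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    w_true = rng.normal(size=10)
    y = X.dot(w_true) + 0.01 * rng.normal(size=1000)
    print(np.allclose(parallel_sgd(X, y), w_true, atol=0.1))  # averaged w is close to w_true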






– mochi (answered Mar 29 at 1:42, edited Mar 29 at 2:56; accepted, +100 bounty)