Deep Neural Network in Python

I have written a neural network in Python, focused on adaptability and performance. I want to use it to dive deeper into the field. I am far from being an expert in neural networks, and the same goes for Python. I do not want to use TensorFlow, since I really want to understand how a neural network works.



My questions are:



  • How can I increase the performance? At the moment it takes days to train the network.

The code runs on a single core, but since each iteration of the loop over the batch is independent of the others, it can be parallelized.

  • How can I parallelize the loop over the batch?

I found some tutorials on parallel loops in Python, but I could not apply them to my problem.



Here is my tested code with some pseudo training data:



from numpy import random, zeros, array, dot
from scipy.special import expit
import time

def sigma(x):
    return expit(x)

def sigma_prime(x):
    # Derivative of the logistic sigmoid: sigma(x)*(1 - sigma(x))
    u = expit(x)
    return u - u*u

def SGD(I, L, batch_size, eta):

    images = len(L)

    # Pre-activations, one column vector per layer after the input
    z = [zeros((layer_size[l], 1)) for l in range(1, nn_size)]

    # Activations, one column vector per layer including the input
    a = [zeros((layer_size[l], 1)) for l in range(nn_size)]

    # Ground truth: one-hot encoding of the labels
    y = zeros((images, layer_size[-1]))
    for i in range(images):
        y[i, L[i]] = 1.0

    while (1):

        t0 = time.time()

        # Create a random batch of example indices (sampled with replacement)
        batch = random.randint(0, images, batch_size)

        # Gradient accumulators for weights and biases
        dW = [zeros((layer_size[l+1], layer_size[l])) for l in range(nn_size-1)]
        db = [zeros((layer_size[l], 1)) for l in range(1, nn_size)]

        for i in batch:
            # Feedforward
            a[0] = array([I[i]]).T
            for l in range(nn_size-1):
                z[l] = dot(W[l], a[l]) + b[l]
                a[l+1] = sigma(z[l])

            # Backpropagation
            delta = (a[nn_size-1] - array([y[i]]).T) * sigma_prime(z[nn_size-2])
            # Note: the next two lines are equivalent, so the output-layer
            # weight gradient is accumulated twice; the accepted answer's
            # vectorized version reproduces this with an explicit factor of 2.
            dW[nn_size-2] += dot(delta, a[nn_size-2].T)
            dW[nn_size-2] += delta.dot(a[nn_size-2].T)
            db[nn_size-2] += delta
            for l in reversed(range(nn_size-2)):
                delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
                dW[l] += dot(delta, a[l].T)
                db[l] += delta

        # Update weights and biases with the averaged batch gradient
        for l in range(nn_size-1):
            W[l] += -eta * dW[l] / batch_size
            b[l] += -eta * db[l] / batch_size

        print(time.time() - t0)

input_size = 1000
output_size = 10

layer_size = [input_size, 30**2, 30**2, 30**2, output_size]

nn_size = len(layer_size)

# Weights
W = [random.randn(layer_size[l+1], layer_size[l]) for l in range(nn_size-1)]

# Biases
b = [random.randn(layer_size[l], 1) for l in range(1, nn_size)]

# Some random training data with labels
size_training_data = 1000
# Random data I of size "size_training_data" x "input_size"
I = random.rand(size_training_data, input_size)
# One label per training example (the original passed input_size here,
# which happens to equal size_training_data)
L = random.randint(0, 10, size_training_data)

batch_size = 100
eta = 0.1
SGD(I, L, batch_size, eta)


The output shows the time needed for one batch of size batch_size.
– Samuel (asked Mar 23 at 8:11, edited Mar 27 at 21:03)

  • Could you make it clearer what the printed output means, as well as what the variables mean?
    – Solomon Ucko
    Mar 27 at 2:08











  • @SolomonUcko Done!
    – Samuel
    Mar 27 at 21:04










  • Also, could you explain what all the variables are?
    – Solomon Ucko
    Mar 27 at 21:35










  • @SolomonUcko I do not think the meaning of the variables is important, and this is not the right place to explain how neural nets work. The operations are simple, but the performance is poor; one does not have to understand the variables to improve the performance.
    – Samuel
    Mar 27 at 21:39










  • Good point. I'm not really sure how you can optimize it. I was thinking that it would be easier to optimize it if I could understand what each thing is supposed to do. I did notice that while (1): should probably be while True: and that you could probably use expit instead of sigma, as opposed to creating an 'alias'.
    – Solomon Ucko
    Mar 27 at 21:46
1 Answer
Your epochs are slow because you iterate over each example in the batch and calculate the gradients in a Python for loop. The key to speeding this up is realizing that you perform the same operations on every example in the batch, so you can stack the examples into a matrix and calculate the gradients for all of them in a single set of matrix operations.



Let's break that down. For a single example, you start with a feature vector, a column of size (1000, 1). It is linearly transformed by multiplying it by a weight matrix of size (900, 1000), resulting in a (900, 1) vector, to which a bias vector of size (900, 1) is added. This is then non-linearly transformed, which does not affect the dimensions, giving the first hidden layer of size (900, 1).



This is 900 hidden nodes for the first example.



However, since we perform the same operations on every example in the batch, we can stack the 100 examples to form a matrix of size (100, 1000) instead of (1, 1000), and take the dot product of this input matrix with the transpose of the weight matrix, (1000, 900), for a resulting matrix of size (100, 900). Add the bias, reshaped to (1, 900), which numpy automatically broadcasts to (100, 900) (it's the same bias vector stacked 100 times), and apply the non-linear transform for a final matrix of size (100, 900). That is 900 hidden nodes for each of the 100 examples.
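
As a quick sanity check of those shapes (a minimal sketch with random placeholder arrays, not the actual network):

import numpy as np
from scipy.special import expit as sigma

X = np.random.rand(100, 1000)    # 100 stacked examples
W0 = np.random.randn(900, 1000)  # first weight matrix, shaped as in the question
b0 = np.random.randn(900, 1)     # first bias, a column vector

H = sigma(X.dot(W0.T) + b0.T)    # (100,1000) @ (1000,900) + (1,900) -> (100,900)
print(H.shape)                   # (100, 900)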



This can be applied to each hidden layer in the network.



Factoring the original code:



for i in batch: 
    # Feedforward
    a[0] = array([I[i]]).T
    for l in range(nn_size-1):
        z[l] = dot(W[l], a[l]) + b[l]
        a[l+1] = sigma(z[l])

    # Backpropagation
    delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
    dW[nn_size-2] += dot(delta, a[nn_size-2].T)
    dW[nn_size-2] += delta.dot(a[nn_size-2].T)
    db[nn_size-2] += delta
    for l in reversed(range(nn_size-2)):
        delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
        dW[l] += dot(delta, a[l].T)
        db[l] += delta


into matrix math form:



# (assumes numpy is additionally imported as "import numpy as np")
a[0] = I[batch]
for l in range(nn_size-1):
    z[l] = a[l].dot(W[l].T) + b[l].T
    a[l+1] = sigma(z[l])

delta = (a[nn_size-1] - y[batch]) * sigma_prime(z[nn_size-2])
# The factor of 2 mirrors the doubled accumulation in the original loop
dW[nn_size-2] += 2*delta.T.dot(a[nn_size-2])
db[nn_size-2] += np.sum(delta.T, axis=1, keepdims=True)

for l in reversed(range(nn_size-2)):
    delta = delta.dot(W[l+1]) * sigma_prime(z[l])
    dW[l] += a[l].T.dot(delta).T
    db[l] += np.sum(delta.T, axis=1, keepdims=True)


When calculating gradients over a batch, taking either the sum or the average of the per-example gradients works, but Andrew Ng suggests using the average over the batch, as explained in his course and here: https://stats.stackexchange.com/questions/183840/sum-or-average-of-gradients-in-mini-batch-gradient-decent



In this case, since you already divide the accumulated gradients by the batch size in the update step, summing the gradients over the batch amounts to averaging them.
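
As a trivial check of that equivalence (the array shapes here are placeholders):

import numpy as np

grads = np.random.randn(100, 9, 10)  # hypothetical per-example gradients
print(np.allclose(grads.mean(axis=0), grads.sum(axis=0) / 100))  # True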



With the original for-loop implementation, over 10 epochs, each epoch takes anywhere between 1.75s and 2.25s, with an average of 1.91s per epoch.



With the matrix implementation, over 100 epochs, each epoch takes between 0.06s and 0.25s, with an average of 0.08s per epoch.



Parallelizing SGD is also somewhat non-trivial, because SGD is an inherently sequential algorithm. Imagine you are standing on top of a hill, trying to find the quickest way down without being able to see it. The only way down is to actually take steps downwards and re-evaluate your options at each step. Even if there were multiple clones of you, they would all be standing at the same spot on the hill, so the information gained would be limited. Nevertheless, there are techniques for parallel SGD (a rough sketch of one follows the references below), and if you are determined to pursue that, I would suggest reading some papers on the topic:



Cheng 2017: Weighted parallel SGD for distributed unbalanced-workload training system https://arxiv.org/abs/1708.04801



Zinkevich 2010: Parallelized Stochastic Gradient Descent http://martin.zinkevich.org/publications/nips2010.pdf



Or for a higher level overview:
http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent
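
To make the idea concrete, here is a minimal sketch of one-shot parameter averaging in the spirit of Zinkevich et al., not the paper's exact algorithm: each worker runs SGD independently on its own shard of the data, and the resulting parameter vectors are averaged at the end. The toy least-squares problem and all names below are illustrative only.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def sgd_on_shard(args):
    # Plain SGD on one shard; gradient of the squared error 0.5*(x.w - y)^2
    X, y, eta, epochs = args
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            w -= eta * (X[i].dot(w) - y[i]) * X[i]
    return w

def parallel_sgd(X, y, workers=4, eta=0.01, epochs=5):
    # Shard the data, run SGD on each shard in a separate process,
    # then average the resulting parameter vectors.
    shards = [(Xs, ys, eta, epochs)
              for Xs, ys in zip(np.array_split(X, workers), np.array_split(y, workers))]
    with ProcessPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(sgd_on_shard, shards))
    return np.mean(results, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    w_true = rng.normal(size=10)
    y = X.dot(w_true) + 0.01 * rng.normal(size=1000)
    print(np.allclose(parallel_sgd(X, y), w_true, atol=0.1))  # averaged w is close to w_true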






– mochi (answered Mar 29 at 1:42, edited Mar 29 at 2:56; accepted, +100 bounty)