Deep Neural Network in Python
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
6
down vote
favorite
I have written a neural network in Python and focused on adaptability and performance. I want to use it to dive deeper into that field. I am far from being an expert in neural networks and the same goes for Python. I do not want to use Tensorflow since I really want to understand how a neural network works.
My questions are:
- How can I increase the performance? At the moment it takes days to train the network
The code runs on a single core. But since every loop over the batches run independently it can be parallelized.
- How can I parallelize the loop over the batches?
I found some tutorials on parallel loops in Python but I could not apply it to my problem.
Here is my tested code with some pseudo training data:
from numpy import random, zeros, array, dot
from scipy.special import expit
import time
def sigma(x):
return expit(x)
def sigma_prime(x):
u = expit(x)
return u-u*u
def SGD(I, L, batch_size, eta):
images = len(L)
# Pre-activation
z = [zeros((layer_size[l],1)) for l in range(1,nn_size)]
# Activations
a = [zeros((layer_size[l],1)) for l in range(nn_size)]
# Ground truth
y = zeros((images, layer_size[-1]))
for i in range(images):
y[i,L[i]] = 1.0
while (1):
t0 = time.time()
# Create random batch
batch = random.randint(0,images,batch_size)
dW = [zeros((layer_size[l+1], layer_size[l])) for l in range(nn_size-1)]
db = [zeros((layer_size[l],1)) for l in range(1, nn_size)]
for i in batch:
# Feedforward
a[0] = array([I[i]]).T
for l in range(nn_size-1):
z[l] = dot(W[l], a[l]) + b[l]
a[l+1] = sigma(z[l])
# Backpropagation
delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += dot(delta, a[nn_size-2].T)
dW[nn_size-2] += delta.dot(a[nn_size-2].T)
db[nn_size-2] += delta
for l in reversed(range(nn_size-2)):
delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
dW[l] += dot(delta, a[l].T)
db[l] += delta
# Update Weights and Biases
for l in range(nn_size-1):
W[l] += - eta * dW[l] / batch_size
b[l] += - eta * db[l] / batch_size
print(time.time() - t0)
input_size = 1000
output_size = 10
layer_size = [input_size, 30**2, 30**2, 30**2, output_size]
nn_size = len(layer_size)
layer_size = layer_size
# Weights
W = [random.randn(layer_size[l+1],layer_size[l]) for l in range(nn_size-1)]
# Bias
b = [random.randn(layer_size[l],1) for l in range(1,nn_size)]
# Some random training data with label
size_training_data = 1000
# Random data I of size "size_training_data" x "input_size"
I = random.rand(size_training_data, input_size)
# Label for all training data
L = random.randint(0,10, input_size)
batch_size = 100
eta = 0.1
SGD(I, L, batch_size, eta)
The output shows the time needed for one batch of size batch_size
.
python performance neural-network
 |Â
show 1 more comment
up vote
6
down vote
favorite
I have written a neural network in Python and focused on adaptability and performance. I want to use it to dive deeper into that field. I am far from being an expert in neural networks and the same goes for Python. I do not want to use Tensorflow since I really want to understand how a neural network works.
My questions are:
- How can I increase the performance? At the moment it takes days to train the network
The code runs on a single core. But since every loop over the batches run independently it can be parallelized.
- How can I parallelize the loop over the batches?
I found some tutorials on parallel loops in Python but I could not apply it to my problem.
Here is my tested code with some pseudo training data:
from numpy import random, zeros, array, dot
from scipy.special import expit
import time
def sigma(x):
return expit(x)
def sigma_prime(x):
u = expit(x)
return u-u*u
def SGD(I, L, batch_size, eta):
images = len(L)
# Pre-activation
z = [zeros((layer_size[l],1)) for l in range(1,nn_size)]
# Activations
a = [zeros((layer_size[l],1)) for l in range(nn_size)]
# Ground truth
y = zeros((images, layer_size[-1]))
for i in range(images):
y[i,L[i]] = 1.0
while (1):
t0 = time.time()
# Create random batch
batch = random.randint(0,images,batch_size)
dW = [zeros((layer_size[l+1], layer_size[l])) for l in range(nn_size-1)]
db = [zeros((layer_size[l],1)) for l in range(1, nn_size)]
for i in batch:
# Feedforward
a[0] = array([I[i]]).T
for l in range(nn_size-1):
z[l] = dot(W[l], a[l]) + b[l]
a[l+1] = sigma(z[l])
# Backpropagation
delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += dot(delta, a[nn_size-2].T)
dW[nn_size-2] += delta.dot(a[nn_size-2].T)
db[nn_size-2] += delta
for l in reversed(range(nn_size-2)):
delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
dW[l] += dot(delta, a[l].T)
db[l] += delta
# Update Weights and Biases
for l in range(nn_size-1):
W[l] += - eta * dW[l] / batch_size
b[l] += - eta * db[l] / batch_size
print(time.time() - t0)
input_size = 1000
output_size = 10
layer_size = [input_size, 30**2, 30**2, 30**2, output_size]
nn_size = len(layer_size)
layer_size = layer_size
# Weights
W = [random.randn(layer_size[l+1],layer_size[l]) for l in range(nn_size-1)]
# Bias
b = [random.randn(layer_size[l],1) for l in range(1,nn_size)]
# Some random training data with label
size_training_data = 1000
# Random data I of size "size_training_data" x "input_size"
I = random.rand(size_training_data, input_size)
# Label for all training data
L = random.randint(0,10, input_size)
batch_size = 100
eta = 0.1
SGD(I, L, batch_size, eta)
The output shows the time needed for one batch of size batch_size
.
python performance neural-network
Could you make it clearer what what it's printing means as well as what the variables mean?
â Solomon Ucko
Mar 27 at 2:08
@SolomonUcko Done!
â Samuel
Mar 27 at 21:04
Also, could you explain what all the variables are?
â Solomon Ucko
Mar 27 at 21:35
@SolomonUcko I do not think that the meaning of variables are important. I think this not the right place to explain how neural nets work. The operations are simple but the performance is poor. One does not have to understand the variables to improve the performance.
â Samuel
Mar 27 at 21:39
Good point. I'm not really sure how you can optimize it. I was thinking that it would be easier to optimize it if I could understand what each thing is supposed to do. I did notice thatwhile (1):
should probably bewhile True:
and that you could probably useexpit
instead ofsigma
, as opposed to creating an 'alias'.
â Solomon Ucko
Mar 27 at 21:46
 |Â
show 1 more comment
up vote
6
down vote
favorite
up vote
6
down vote
favorite
I have written a neural network in Python and focused on adaptability and performance. I want to use it to dive deeper into that field. I am far from being an expert in neural networks and the same goes for Python. I do not want to use Tensorflow since I really want to understand how a neural network works.
My questions are:
- How can I increase the performance? At the moment it takes days to train the network
The code runs on a single core. But since every loop over the batches run independently it can be parallelized.
- How can I parallelize the loop over the batches?
I found some tutorials on parallel loops in Python but I could not apply it to my problem.
Here is my tested code with some pseudo training data:
from numpy import random, zeros, array, dot
from scipy.special import expit
import time
def sigma(x):
return expit(x)
def sigma_prime(x):
u = expit(x)
return u-u*u
def SGD(I, L, batch_size, eta):
images = len(L)
# Pre-activation
z = [zeros((layer_size[l],1)) for l in range(1,nn_size)]
# Activations
a = [zeros((layer_size[l],1)) for l in range(nn_size)]
# Ground truth
y = zeros((images, layer_size[-1]))
for i in range(images):
y[i,L[i]] = 1.0
while (1):
t0 = time.time()
# Create random batch
batch = random.randint(0,images,batch_size)
dW = [zeros((layer_size[l+1], layer_size[l])) for l in range(nn_size-1)]
db = [zeros((layer_size[l],1)) for l in range(1, nn_size)]
for i in batch:
# Feedforward
a[0] = array([I[i]]).T
for l in range(nn_size-1):
z[l] = dot(W[l], a[l]) + b[l]
a[l+1] = sigma(z[l])
# Backpropagation
delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += dot(delta, a[nn_size-2].T)
dW[nn_size-2] += delta.dot(a[nn_size-2].T)
db[nn_size-2] += delta
for l in reversed(range(nn_size-2)):
delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
dW[l] += dot(delta, a[l].T)
db[l] += delta
# Update Weights and Biases
for l in range(nn_size-1):
W[l] += - eta * dW[l] / batch_size
b[l] += - eta * db[l] / batch_size
print(time.time() - t0)
input_size = 1000
output_size = 10
layer_size = [input_size, 30**2, 30**2, 30**2, output_size]
nn_size = len(layer_size)
layer_size = layer_size
# Weights
W = [random.randn(layer_size[l+1],layer_size[l]) for l in range(nn_size-1)]
# Bias
b = [random.randn(layer_size[l],1) for l in range(1,nn_size)]
# Some random training data with label
size_training_data = 1000
# Random data I of size "size_training_data" x "input_size"
I = random.rand(size_training_data, input_size)
# Label for all training data
L = random.randint(0,10, input_size)
batch_size = 100
eta = 0.1
SGD(I, L, batch_size, eta)
The output shows the time needed for one batch of size batch_size
.
python performance neural-network
I have written a neural network in Python and focused on adaptability and performance. I want to use it to dive deeper into that field. I am far from being an expert in neural networks and the same goes for Python. I do not want to use Tensorflow since I really want to understand how a neural network works.
My questions are:
- How can I increase the performance? At the moment it takes days to train the network
The code runs on a single core. But since every loop over the batches run independently it can be parallelized.
- How can I parallelize the loop over the batches?
I found some tutorials on parallel loops in Python but I could not apply it to my problem.
Here is my tested code with some pseudo training data:
from numpy import random, zeros, array, dot
from scipy.special import expit
import time
def sigma(x):
return expit(x)
def sigma_prime(x):
u = expit(x)
return u-u*u
def SGD(I, L, batch_size, eta):
images = len(L)
# Pre-activation
z = [zeros((layer_size[l],1)) for l in range(1,nn_size)]
# Activations
a = [zeros((layer_size[l],1)) for l in range(nn_size)]
# Ground truth
y = zeros((images, layer_size[-1]))
for i in range(images):
y[i,L[i]] = 1.0
while (1):
t0 = time.time()
# Create random batch
batch = random.randint(0,images,batch_size)
dW = [zeros((layer_size[l+1], layer_size[l])) for l in range(nn_size-1)]
db = [zeros((layer_size[l],1)) for l in range(1, nn_size)]
for i in batch:
# Feedforward
a[0] = array([I[i]]).T
for l in range(nn_size-1):
z[l] = dot(W[l], a[l]) + b[l]
a[l+1] = sigma(z[l])
# Backpropagation
delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += dot(delta, a[nn_size-2].T)
dW[nn_size-2] += delta.dot(a[nn_size-2].T)
db[nn_size-2] += delta
for l in reversed(range(nn_size-2)):
delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
dW[l] += dot(delta, a[l].T)
db[l] += delta
# Update Weights and Biases
for l in range(nn_size-1):
W[l] += - eta * dW[l] / batch_size
b[l] += - eta * db[l] / batch_size
print(time.time() - t0)
input_size = 1000
output_size = 10
layer_size = [input_size, 30**2, 30**2, 30**2, output_size]
nn_size = len(layer_size)
layer_size = layer_size
# Weights
W = [random.randn(layer_size[l+1],layer_size[l]) for l in range(nn_size-1)]
# Bias
b = [random.randn(layer_size[l],1) for l in range(1,nn_size)]
# Some random training data with label
size_training_data = 1000
# Random data I of size "size_training_data" x "input_size"
I = random.rand(size_training_data, input_size)
# Label for all training data
L = random.randint(0,10, input_size)
batch_size = 100
eta = 0.1
SGD(I, L, batch_size, eta)
The output shows the time needed for one batch of size batch_size
.
python performance neural-network
edited Mar 27 at 21:03
asked Mar 23 at 8:11
Samuel
1839
1839
Could you make it clearer what what it's printing means as well as what the variables mean?
â Solomon Ucko
Mar 27 at 2:08
@SolomonUcko Done!
â Samuel
Mar 27 at 21:04
Also, could you explain what all the variables are?
â Solomon Ucko
Mar 27 at 21:35
@SolomonUcko I do not think that the meaning of variables are important. I think this not the right place to explain how neural nets work. The operations are simple but the performance is poor. One does not have to understand the variables to improve the performance.
â Samuel
Mar 27 at 21:39
Good point. I'm not really sure how you can optimize it. I was thinking that it would be easier to optimize it if I could understand what each thing is supposed to do. I did notice thatwhile (1):
should probably bewhile True:
and that you could probably useexpit
instead ofsigma
, as opposed to creating an 'alias'.
â Solomon Ucko
Mar 27 at 21:46
 |Â
show 1 more comment
Could you make it clearer what what it's printing means as well as what the variables mean?
â Solomon Ucko
Mar 27 at 2:08
@SolomonUcko Done!
â Samuel
Mar 27 at 21:04
Also, could you explain what all the variables are?
â Solomon Ucko
Mar 27 at 21:35
@SolomonUcko I do not think that the meaning of variables are important. I think this not the right place to explain how neural nets work. The operations are simple but the performance is poor. One does not have to understand the variables to improve the performance.
â Samuel
Mar 27 at 21:39
Good point. I'm not really sure how you can optimize it. I was thinking that it would be easier to optimize it if I could understand what each thing is supposed to do. I did notice thatwhile (1):
should probably bewhile True:
and that you could probably useexpit
instead ofsigma
, as opposed to creating an 'alias'.
â Solomon Ucko
Mar 27 at 21:46
Could you make it clearer what what it's printing means as well as what the variables mean?
â Solomon Ucko
Mar 27 at 2:08
Could you make it clearer what what it's printing means as well as what the variables mean?
â Solomon Ucko
Mar 27 at 2:08
@SolomonUcko Done!
â Samuel
Mar 27 at 21:04
@SolomonUcko Done!
â Samuel
Mar 27 at 21:04
Also, could you explain what all the variables are?
â Solomon Ucko
Mar 27 at 21:35
Also, could you explain what all the variables are?
â Solomon Ucko
Mar 27 at 21:35
@SolomonUcko I do not think that the meaning of variables are important. I think this not the right place to explain how neural nets work. The operations are simple but the performance is poor. One does not have to understand the variables to improve the performance.
â Samuel
Mar 27 at 21:39
@SolomonUcko I do not think that the meaning of variables are important. I think this not the right place to explain how neural nets work. The operations are simple but the performance is poor. One does not have to understand the variables to improve the performance.
â Samuel
Mar 27 at 21:39
Good point. I'm not really sure how you can optimize it. I was thinking that it would be easier to optimize it if I could understand what each thing is supposed to do. I did notice that
while (1):
should probably be while True:
and that you could probably use expit
instead of sigma
, as opposed to creating an 'alias'.â Solomon Ucko
Mar 27 at 21:46
Good point. I'm not really sure how you can optimize it. I was thinking that it would be easier to optimize it if I could understand what each thing is supposed to do. I did notice that
while (1):
should probably be while True:
and that you could probably use expit
instead of sigma
, as opposed to creating an 'alias'.â Solomon Ucko
Mar 27 at 21:46
 |Â
show 1 more comment
1 Answer
1
active
oldest
votes
up vote
6
down vote
accepted
The reason why your epochs are so slow is because you are iterating over each example in the batch and calculating the gradients in a for loop. The key to speeding this up is realizing that you are performing the same operations over every example in the batch, and so you can stack your examples into a matrix and calculate the gradients for all examples in a single matrix operation.
Let's break that down. For a single example, you start with a feature vector of size (1000,) and that is linearly transformed by multiply it by a weight matrix of size (1000,900) resulting in a (900,1) vector that is then added with a bias vector of size (900,1). This is then non-linearly transformed, which does not affect the dimensions, to result in the first hidden layer of size (900,1).
This is 900 hidden nodes for the first example.
However, since we are performing the same operations to every example in the batch, we can stack the 100 examples to form a matrix of size (100,1000) instead of (1,1000) then take the dot product of this input matrix with the transpose of the weight matrix, (1000,900) for a resulting matrix of (100,900). Add the bias (1,900), which is broadcasted automatically in numpy to a matrix of size (100,900) (it's the same bias vector stacked 100 times) and apply the non-linear transform for a final matrix of size (100,900). This is 900 hidden nodes each for 100 examples.
This can be applied to each hidden layer in the network.
Factoring the original code:
for i in batch:
# Feedforward
a[0] = array([I[i]]).T
for l in range(nn_size-1):
z[l] = dot(W[l], a[l]) + b[l]
a[l+1] = sigma(z[l])
# Backpropagation
delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += dot(delta, a[nn_size-2].T)
dW[nn_size-2] += delta.dot(a[nn_size-2].T)
db[nn_size-2] += delta
for l in reversed(range(nn_size-2)):
delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
dW[l] += dot(delta, a[l].T)
db[l] += delta
into matrix math form:
a[0] = I[batch]
for l in range(nn_size-1):
z[l] = a[l].dot(W[l].T) + b[l].T
a[l+1] = sigma(z[l])
delta = (a[nn_size-1] - y[batch]) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += 2*delta.T.dot(a[nn_size-2])
db[nn_size-2] += np.sum(delta.T, axis=1, keepdims=True)
for l in reversed(range(nn_size-2)):
delta = delta.dot(W[l+1]) * sigma_prime(z[l])
dW[l] += a[l].T.dot(delta).T
db[l] += np.sum(delta.T, axis=1, keepdims=True)
When calculating gradients over batches, taking the sum or average of the gradients over the examples both work, but Andrew Ng suggests using the average over the batch as explained in his course and here: https://stats.stackexchange.com/questions/183840/sum-or-average-of-gradients-in-mini-batch-gradient-decent
In this case, since you divide the gradients by batch size, you can just sum the gradients over the batch.
With the original for loop implementation over 10 epochs, each epoch takes anywhere between 1.75 - 2.25s with an average of 1.91s per epoch.
With the matrix implementation over 100 epochs, each epoch takes between 0.06 - 0.25s with an average of 0.08s per epoch.
Parallelizing sgd is also somewhat non-trivial as SGD is a sequential algorithm. For example, imagine that you are standing on the top of a hill and are trying to find the quickest way down (without being able to see it). The only way to make your way down the hill is by actually taking steps downwards and re-evaluating your options at each step. Even if there were multiple clones of you, all your clones would be standing at the same spot on the hill and so the information gain would be limited. Nevertheless, there are techniques for parallel sgd and so if you are determined on pursuing that, I would suggest you read some papers on the topic:
Cheng 2017: Weighted parallel SGD for distributed unbalanced-workload training system https://arxiv.org/abs/1708.04801
Zinkevich 2010: Parallelized Stochastic Gradient Descent http://martin.zinkevich.org/publications/nips2010.pdf
Or for a higher level overview:
http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
6
down vote
accepted
The reason why your epochs are so slow is because you are iterating over each example in the batch and calculating the gradients in a for loop. The key to speeding this up is realizing that you are performing the same operations over every example in the batch, and so you can stack your examples into a matrix and calculate the gradients for all examples in a single matrix operation.
Let's break that down. For a single example, you start with a feature vector of size (1000,) and that is linearly transformed by multiply it by a weight matrix of size (1000,900) resulting in a (900,1) vector that is then added with a bias vector of size (900,1). This is then non-linearly transformed, which does not affect the dimensions, to result in the first hidden layer of size (900,1).
This is 900 hidden nodes for the first example.
However, since we are performing the same operations to every example in the batch, we can stack the 100 examples to form a matrix of size (100,1000) instead of (1,1000) then take the dot product of this input matrix with the transpose of the weight matrix, (1000,900) for a resulting matrix of (100,900). Add the bias (1,900), which is broadcasted automatically in numpy to a matrix of size (100,900) (it's the same bias vector stacked 100 times) and apply the non-linear transform for a final matrix of size (100,900). This is 900 hidden nodes each for 100 examples.
This can be applied to each hidden layer in the network.
Factoring the original code:
for i in batch:
# Feedforward
a[0] = array([I[i]]).T
for l in range(nn_size-1):
z[l] = dot(W[l], a[l]) + b[l]
a[l+1] = sigma(z[l])
# Backpropagation
delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += dot(delta, a[nn_size-2].T)
dW[nn_size-2] += delta.dot(a[nn_size-2].T)
db[nn_size-2] += delta
for l in reversed(range(nn_size-2)):
delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
dW[l] += dot(delta, a[l].T)
db[l] += delta
into matrix math form:
a[0] = I[batch]
for l in range(nn_size-1):
z[l] = a[l].dot(W[l].T) + b[l].T
a[l+1] = sigma(z[l])
delta = (a[nn_size-1] - y[batch]) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += 2*delta.T.dot(a[nn_size-2])
db[nn_size-2] += np.sum(delta.T, axis=1, keepdims=True)
for l in reversed(range(nn_size-2)):
delta = delta.dot(W[l+1]) * sigma_prime(z[l])
dW[l] += a[l].T.dot(delta).T
db[l] += np.sum(delta.T, axis=1, keepdims=True)
When calculating gradients over batches, taking the sum or average of the gradients over the examples both work, but Andrew Ng suggests using the average over the batch as explained in his course and here: https://stats.stackexchange.com/questions/183840/sum-or-average-of-gradients-in-mini-batch-gradient-decent
In this case, since you divide the gradients by batch size, you can just sum the gradients over the batch.
With the original for loop implementation over 10 epochs, each epoch takes anywhere between 1.75 - 2.25s with an average of 1.91s per epoch.
With the matrix implementation over 100 epochs, each epoch takes between 0.06 - 0.25s with an average of 0.08s per epoch.
Parallelizing sgd is also somewhat non-trivial as SGD is a sequential algorithm. For example, imagine that you are standing on the top of a hill and are trying to find the quickest way down (without being able to see it). The only way to make your way down the hill is by actually taking steps downwards and re-evaluating your options at each step. Even if there were multiple clones of you, all your clones would be standing at the same spot on the hill and so the information gain would be limited. Nevertheless, there are techniques for parallel sgd and so if you are determined on pursuing that, I would suggest you read some papers on the topic:
Cheng 2017: Weighted parallel SGD for distributed unbalanced-workload training system https://arxiv.org/abs/1708.04801
Zinkevich 2010: Parallelized Stochastic Gradient Descent http://martin.zinkevich.org/publications/nips2010.pdf
Or for a higher level overview:
http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent
add a comment |Â
up vote
6
down vote
accepted
The reason why your epochs are so slow is because you are iterating over each example in the batch and calculating the gradients in a for loop. The key to speeding this up is realizing that you are performing the same operations over every example in the batch, and so you can stack your examples into a matrix and calculate the gradients for all examples in a single matrix operation.
Let's break that down. For a single example, you start with a feature vector of size (1000,) and that is linearly transformed by multiply it by a weight matrix of size (1000,900) resulting in a (900,1) vector that is then added with a bias vector of size (900,1). This is then non-linearly transformed, which does not affect the dimensions, to result in the first hidden layer of size (900,1).
This is 900 hidden nodes for the first example.
However, since we are performing the same operations to every example in the batch, we can stack the 100 examples to form a matrix of size (100,1000) instead of (1,1000) then take the dot product of this input matrix with the transpose of the weight matrix, (1000,900) for a resulting matrix of (100,900). Add the bias (1,900), which is broadcasted automatically in numpy to a matrix of size (100,900) (it's the same bias vector stacked 100 times) and apply the non-linear transform for a final matrix of size (100,900). This is 900 hidden nodes each for 100 examples.
This can be applied to each hidden layer in the network.
Factoring the original code:
for i in batch:
# Feedforward
a[0] = array([I[i]]).T
for l in range(nn_size-1):
z[l] = dot(W[l], a[l]) + b[l]
a[l+1] = sigma(z[l])
# Backpropagation
delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += dot(delta, a[nn_size-2].T)
dW[nn_size-2] += delta.dot(a[nn_size-2].T)
db[nn_size-2] += delta
for l in reversed(range(nn_size-2)):
delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
dW[l] += dot(delta, a[l].T)
db[l] += delta
into matrix math form:
a[0] = I[batch]
for l in range(nn_size-1):
z[l] = a[l].dot(W[l].T) + b[l].T
a[l+1] = sigma(z[l])
delta = (a[nn_size-1] - y[batch]) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += 2*delta.T.dot(a[nn_size-2])
db[nn_size-2] += np.sum(delta.T, axis=1, keepdims=True)
for l in reversed(range(nn_size-2)):
delta = delta.dot(W[l+1]) * sigma_prime(z[l])
dW[l] += a[l].T.dot(delta).T
db[l] += np.sum(delta.T, axis=1, keepdims=True)
When calculating gradients over batches, taking the sum or average of the gradients over the examples both work, but Andrew Ng suggests using the average over the batch as explained in his course and here: https://stats.stackexchange.com/questions/183840/sum-or-average-of-gradients-in-mini-batch-gradient-decent
In this case, since you divide the gradients by batch size, you can just sum the gradients over the batch.
With the original for loop implementation over 10 epochs, each epoch takes anywhere between 1.75 - 2.25s with an average of 1.91s per epoch.
With the matrix implementation over 100 epochs, each epoch takes between 0.06 - 0.25s with an average of 0.08s per epoch.
Parallelizing sgd is also somewhat non-trivial as SGD is a sequential algorithm. For example, imagine that you are standing on the top of a hill and are trying to find the quickest way down (without being able to see it). The only way to make your way down the hill is by actually taking steps downwards and re-evaluating your options at each step. Even if there were multiple clones of you, all your clones would be standing at the same spot on the hill and so the information gain would be limited. Nevertheless, there are techniques for parallel sgd and so if you are determined on pursuing that, I would suggest you read some papers on the topic:
Cheng 2017: Weighted parallel SGD for distributed unbalanced-workload training system https://arxiv.org/abs/1708.04801
Zinkevich 2010: Parallelized Stochastic Gradient Descent http://martin.zinkevich.org/publications/nips2010.pdf
Or for a higher level overview:
http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent
add a comment |Â
up vote
6
down vote
accepted
up vote
6
down vote
accepted
The reason why your epochs are so slow is because you are iterating over each example in the batch and calculating the gradients in a for loop. The key to speeding this up is realizing that you are performing the same operations over every example in the batch, and so you can stack your examples into a matrix and calculate the gradients for all examples in a single matrix operation.
Let's break that down. For a single example, you start with a feature vector of size (1000,) and that is linearly transformed by multiply it by a weight matrix of size (1000,900) resulting in a (900,1) vector that is then added with a bias vector of size (900,1). This is then non-linearly transformed, which does not affect the dimensions, to result in the first hidden layer of size (900,1).
This is 900 hidden nodes for the first example.
However, since we are performing the same operations to every example in the batch, we can stack the 100 examples to form a matrix of size (100,1000) instead of (1,1000) then take the dot product of this input matrix with the transpose of the weight matrix, (1000,900) for a resulting matrix of (100,900). Add the bias (1,900), which is broadcasted automatically in numpy to a matrix of size (100,900) (it's the same bias vector stacked 100 times) and apply the non-linear transform for a final matrix of size (100,900). This is 900 hidden nodes each for 100 examples.
This can be applied to each hidden layer in the network.
Factoring the original code:
for i in batch:
# Feedforward
a[0] = array([I[i]]).T
for l in range(nn_size-1):
z[l] = dot(W[l], a[l]) + b[l]
a[l+1] = sigma(z[l])
# Backpropagation
delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += dot(delta, a[nn_size-2].T)
dW[nn_size-2] += delta.dot(a[nn_size-2].T)
db[nn_size-2] += delta
for l in reversed(range(nn_size-2)):
delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
dW[l] += dot(delta, a[l].T)
db[l] += delta
into matrix math form:
a[0] = I[batch]
for l in range(nn_size-1):
z[l] = a[l].dot(W[l].T) + b[l].T
a[l+1] = sigma(z[l])
delta = (a[nn_size-1] - y[batch]) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += 2*delta.T.dot(a[nn_size-2])
db[nn_size-2] += np.sum(delta.T, axis=1, keepdims=True)
for l in reversed(range(nn_size-2)):
delta = delta.dot(W[l+1]) * sigma_prime(z[l])
dW[l] += a[l].T.dot(delta).T
db[l] += np.sum(delta.T, axis=1, keepdims=True)
When calculating gradients over batches, taking the sum or average of the gradients over the examples both work, but Andrew Ng suggests using the average over the batch as explained in his course and here: https://stats.stackexchange.com/questions/183840/sum-or-average-of-gradients-in-mini-batch-gradient-decent
In this case, since you divide the gradients by batch size, you can just sum the gradients over the batch.
With the original for loop implementation over 10 epochs, each epoch takes anywhere between 1.75 - 2.25s with an average of 1.91s per epoch.
With the matrix implementation over 100 epochs, each epoch takes between 0.06 - 0.25s with an average of 0.08s per epoch.
Parallelizing sgd is also somewhat non-trivial as SGD is a sequential algorithm. For example, imagine that you are standing on the top of a hill and are trying to find the quickest way down (without being able to see it). The only way to make your way down the hill is by actually taking steps downwards and re-evaluating your options at each step. Even if there were multiple clones of you, all your clones would be standing at the same spot on the hill and so the information gain would be limited. Nevertheless, there are techniques for parallel sgd and so if you are determined on pursuing that, I would suggest you read some papers on the topic:
Cheng 2017: Weighted parallel SGD for distributed unbalanced-workload training system https://arxiv.org/abs/1708.04801
Zinkevich 2010: Parallelized Stochastic Gradient Descent http://martin.zinkevich.org/publications/nips2010.pdf
Or for a higher level overview:
http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent
The reason why your epochs are so slow is because you are iterating over each example in the batch and calculating the gradients in a for loop. The key to speeding this up is realizing that you are performing the same operations over every example in the batch, and so you can stack your examples into a matrix and calculate the gradients for all examples in a single matrix operation.
Let's break that down. For a single example, you start with a feature vector of size (1000,) and that is linearly transformed by multiply it by a weight matrix of size (1000,900) resulting in a (900,1) vector that is then added with a bias vector of size (900,1). This is then non-linearly transformed, which does not affect the dimensions, to result in the first hidden layer of size (900,1).
This is 900 hidden nodes for the first example.
However, since we are performing the same operations to every example in the batch, we can stack the 100 examples to form a matrix of size (100,1000) instead of (1,1000) then take the dot product of this input matrix with the transpose of the weight matrix, (1000,900) for a resulting matrix of (100,900). Add the bias (1,900), which is broadcasted automatically in numpy to a matrix of size (100,900) (it's the same bias vector stacked 100 times) and apply the non-linear transform for a final matrix of size (100,900). This is 900 hidden nodes each for 100 examples.
This can be applied to each hidden layer in the network.
Factoring the original code:
for i in batch:
# Feedforward
a[0] = array([I[i]]).T
for l in range(nn_size-1):
z[l] = dot(W[l], a[l]) + b[l]
a[l+1] = sigma(z[l])
# Backpropagation
delta = (a[nn_size-1]-array([y[i]]).T) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += dot(delta, a[nn_size-2].T)
dW[nn_size-2] += delta.dot(a[nn_size-2].T)
db[nn_size-2] += delta
for l in reversed(range(nn_size-2)):
delta = dot(W[l+1].T, delta) * sigma_prime(z[l])
dW[l] += dot(delta, a[l].T)
db[l] += delta
into matrix math form:
a[0] = I[batch]
for l in range(nn_size-1):
z[l] = a[l].dot(W[l].T) + b[l].T
a[l+1] = sigma(z[l])
delta = (a[nn_size-1] - y[batch]) * sigma_prime(z[nn_size-2])
dW[nn_size-2] += 2*delta.T.dot(a[nn_size-2])
db[nn_size-2] += np.sum(delta.T, axis=1, keepdims=True)
for l in reversed(range(nn_size-2)):
delta = delta.dot(W[l+1]) * sigma_prime(z[l])
dW[l] += a[l].T.dot(delta).T
db[l] += np.sum(delta.T, axis=1, keepdims=True)
When calculating gradients over batches, taking the sum or average of the gradients over the examples both work, but Andrew Ng suggests using the average over the batch as explained in his course and here: https://stats.stackexchange.com/questions/183840/sum-or-average-of-gradients-in-mini-batch-gradient-decent
In this case, since you divide the gradients by batch size, you can just sum the gradients over the batch.
With the original for loop implementation over 10 epochs, each epoch takes anywhere between 1.75 - 2.25s with an average of 1.91s per epoch.
With the matrix implementation over 100 epochs, each epoch takes between 0.06 - 0.25s with an average of 0.08s per epoch.
Parallelizing sgd is also somewhat non-trivial as SGD is a sequential algorithm. For example, imagine that you are standing on the top of a hill and are trying to find the quickest way down (without being able to see it). The only way to make your way down the hill is by actually taking steps downwards and re-evaluating your options at each step. Even if there were multiple clones of you, all your clones would be standing at the same spot on the hill and so the information gain would be limited. Nevertheless, there are techniques for parallel sgd and so if you are determined on pursuing that, I would suggest you read some papers on the topic:
Cheng 2017: Weighted parallel SGD for distributed unbalanced-workload training system https://arxiv.org/abs/1708.04801
Zinkevich 2010: Parallelized Stochastic Gradient Descent http://martin.zinkevich.org/publications/nips2010.pdf
Or for a higher level overview:
http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent
edited Mar 29 at 2:56
answered Mar 29 at 1:42
mochi
1,0397
1,0397
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f190274%2fdeep-neural-network-in-python%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Could you make it clearer what what it's printing means as well as what the variables mean?
â Solomon Ucko
Mar 27 at 2:08
@SolomonUcko Done!
â Samuel
Mar 27 at 21:04
Also, could you explain what all the variables are?
â Solomon Ucko
Mar 27 at 21:35
@SolomonUcko I do not think that the meaning of variables are important. I think this not the right place to explain how neural nets work. The operations are simple but the performance is poor. One does not have to understand the variables to improve the performance.
â Samuel
Mar 27 at 21:39
Good point. I'm not really sure how you can optimize it. I was thinking that it would be easier to optimize it if I could understand what each thing is supposed to do. I did notice that
while (1):
should probably bewhile True:
and that you could probably useexpit
instead ofsigma
, as opposed to creating an 'alias'.â Solomon Ucko
Mar 27 at 21:46