An artificial neural network is modeled by biological neural networks like ones in the human brain. It is a type of machine learning structure that via a set of algorithms allows the computer to learn.  We wanted to implement a tool that supports working with generic Artificial Neural Network architectures. The point was not to use frameworks that are available in the market but to dig into the math that is behind modern Artificial Intelligence. We believe that in order to truly understand something as complex as ANN-s, you need to understand the theory and implementation that is behind the frameworks such as Keras and PyTorch.  The main goal was to break the big complex problem into small, simple, and understandable pieces. As Albert Einstein once said: ”If you can’t explain it simply, you don’t understand it well enough.”.  

We don’t do sales, but given the circumstances and the severity of the situation, we decided to change that. Don’t be fooled, this sale isn’t meant for profit and it’s most definitely not planned.  This sale is here to help people who want to become better, learn new skills and be more productive than ever before. Our book offers are on a 50% sale.

So, what we have implemented an open-source custom deep learning framework – EduNN, inspired by Keras and PyTorch. In general, the goal was to implement an educational platform, using which we and our students can learn and experiment with neural networks. The complete framework code can be found here. We hope that you will like our approach and that you will try out our framework. Any feedback is welcomed as we look for ways to improve our solution.

EduNN Implementation

We decided to implement the library using Python, as it is the most popular technology choice when it comes to data science. Because weights and biases are represented as matrices, operations over them are easily performed thanks to the NumPy module.  The library consists of a Model that represents the main component, Neural Network that acts as a container of parameters, Initializers, Operators, Losses, Optimizers, and Regularizers.  

The model is implemented as Feed-Forward Multilayer Perceptron (MLP), which means that each node (neuron) from every layer is connected to each node in the previous and following layer. It supports only the Supervised Learning paradigm, in which the desired outputs are known and the model is trained to predict future outcomes (output) depending on given (input) data.

Model

This is the main component used to bind all components together. It defines a public API through which users can interact with the library, define the network’s architecture and train the network.  

model=Model([[4], [2, "relu"], [3, "softmax"]], loss_function='cross_entropy', learning_rate=0.001)

Training the model is done by passing a training data, together with desired (target) outputs.

model.train("gradient_descent", inputs=np.array(inputs), \
			labels=np.array(labels), epochs=500, batch_size=1, shuffle_data=True)
 
model.train("genetic_algorithm", np.array(x), np.array(y), epochs=50, shuffle_data=True, \
		      batch_size=10, elitism_num=3, population_size=20, mutation_probability=0.01)

Initializers

They provide the initial values for the model parameters at the start of training. Initialization plays an important role in training deep neural networks, as bad parameter initialization can lead to slow or no convergence. There are many ways one can initialize the network weights like small random weights drawn from the normal distribution. Our implementation initializes weights to random values between 0 and 1 and biases to 0. 

self._nn = Initializer.init(layers, loss_function, learning_rate)

Operators

They are the basic building blocks of any neural network. Operators are vector-valued functions that transform the data. Some commonly used operators are layers like linear, convolution, and pooling, and activation functions like ReLU and Sigmoid.  Since the library does not support Convolutional Neural Networks, the only type of operators that are implemented is activation functions. They are used to normalize the output of each neuron in the desired range. Implemented operators and their ranges are: Sigmoid (0, 1), ReLu [0, inf), SoftMax (0, 1), TanH (-1, 1),  SoftPlus (0, inf), Gaussian (0, 1], Sinusoid [-1, 1] and BinaryStep {0, 1}.

 Example of sigmoid function:

def function(x):
   return 1 / (1 + np.exp(-x))
 
def derivative(x):
   return np.multiply(x, (1 - x))

Losses

In math, losses are closed-form and differentiable mathematical expressions that are used as surrogates for the optimization objective of the problem at hand. Loss functions provide feedback on how the training is going. They map a vector of values to a number that represents the quality of the network at a certain moment of training.  

The cost or loss function has an important job in that it must faithfully distill all aspects of the model down into a single number in such a way that improvements in that number are a sign of a better model.”  

Implemented losses are Mean Squared Error (MSE), Mean Absolute Error (MAE), Cross-Entropy (CE) and Binary Cross-Entropy (BCE).

def MeanSquaredError(targets: np.array, predictions: np.array):
   return np.sum(np.power((targets - predictions), 2))
 
def MeanAbsoluteError(targets: np.array, predictions: np.array):
   return np.sum(np.abs(targets - predictions))
 
def CrossEntropy(targets: np.array, predictions: np.array):
   indices = np.argmax(predictions, axis=1).astype(int)
   probability = targets[np.arange(len(targets)), indices]
   log = np.log(probability)
   loss = -1.0 * np.sum(log) / len(log)
   return loss
 
def BinaryCrossEntropy(targets: np.array, predictions: np.array):
   n = predictions.shape[0]
   ce = -np.sum(targets * np.log(predictions)) / n
   return ce

Regulizers

This component provides the necessary control mechanism to avoid overfitting and promote generalization. L2 weight regularization is implemented that makes the weights sparser and uniform respectively.

def regularize(self, weights: list, weight_gradients: list):
   for i in range(len(weights)):
       weight_gradients[i] += 0.01 * weights[i]
   return weight_gradients

Optimizers

Optimizers provide the necessary recipe to update model parameters with respect to the optimization objective. In this context, the optimization objective is to minimize the error function that depends on the network’s parameters (weights and biases). The parameters are considered optimal when the error function reaches its global minimum.  

Gradient Descent and Genetic Algorithm are two optimizers used for fitting the model.

Gradient Descent

For network training using gradient descent, we implemented the following steps:   

  1. Creating a mini-batch – In this step, the portion of data is taken from the dataset, which is used in the present iteration. The size of the portion is equal to the batch size option that the user provides.  
def _create_mini_batch(self, iteration, inputs, labels):
   inputs_batch = np.array(inputs[iteration:iteration + self._batch_size])
   labels_batch = np.array(labels[iteration:iteration + self._batch_size])
   return inputs_batch, labels_batch
  1. Forward pass and Calculating Loss  The feed-forward function is used to calculate the output of each layer. Outputs are stored in the python list and are later used for calculating the gradients. Next, the value of loss function is calculated based on the output of the output layer. The loss function is retrieved from losses dictionary.
def feed_forward(self):
   layers = []
   activated = None
   for i in range(len(self._layer_sizes) - 1):
       if i == 0:
           batch = self._inputs_batch
       else:
           batch = activated
       activation = self._operators[i].function
       z = np.dot(batch, self._weights[i]) + self._biases[i]
       activated = activation(z)
       layers.append(activated)
   return layers
  1. Backward pass – In this step, the error is being calculated and passed backward through the network. After the backward pass, deltas have been calculated (the portion of error) for each layer. It is important to apply the derivative of activation function to each delta because of the chain rule. Deltas of each layer are stored in a list.  
def _backward_pass(self, layers, labels_batch):
   delta = None
   deltas = []
   for i in range(len(layers) - 1, -1, -1):
       is_output_layer = i == len(layers) - 1
       if is_output_layer:
           delta = (layers[i] - labels_batch) / self._batch_size
       else:
           delta = np.dot(delta, self._weights[i + 1].T)
           if isinstance(self._nn.get_operator(i), ReLU):
               delta = self._nn.get_operator(i).derivative(delta, layers[i])
           else:
               delta = self._nn.get_operator(i).derivative(layers[i])
       deltas.append(delta)
   return deltas
  1. Backpropagation – This is the main step in which the gradients are calculated based on deltas and the outputs of the layer from the feed-forward pass.  
def _backpropagation(cls, inputs_batch, deltas, layers):
   w_grad = np.dot(inputs_batch.T, deltas[len(deltas) - 1])  # forward * backward
   b_grad = np.sum(deltas[len(deltas) - 1], axis=0, keepdims=True)
   w_gradients = [w_grad]
   b_gradients = [b_grad]
   k = 0
   for i in range(len(layers) - 2, -1, -1):
       w_grad = np.dot(layers[k].T, deltas[i])  # forward * backward
       w_gradients.append(w_grad)
       b_grad = np.sum(deltas[i], axis=0, keepdims=True)
       b_gradients.append(b_grad)
       k += 1
   return w_gradients, b_gradients

5. Adjusting Weights and Biases – This is the step in which parameters are being updated according to the calculated gradients. The learning rate determines how big the update actually is. If the learning rate is too big it would diverge and never find the minimum. If it is too small, it would take too long to find the minimum.  

def _update_network(self):
   self._nn.set_weights_and_biases(self._weights, self._biases)

Genetic Algorithm 

Genetic Algorithm is based on the principle of Darwin’s evolution theory. Each population represents a certain number of individuals. Every individual is a solution to the problem (neural network). They consist of a set of genes (parameters), which is described in the Python list. During every epoch, each individual in the population is modified to create a better solution for a given problem. Following these steps, the network is trained using the genetic algorithm:   

  1. Creating an initial population – Genetic Algorithm creates an initial population of individuals. Each individual represents one instance of the Neural Network with random generated weights and biases.
def _generate_population(self):
   self._population = []
   for _ in range(self._population_size):
       nn = Initializer.init(self._layers, self._loss, self._learning_rate)
       self._population.append(nn)
  1. Calculating fitness score  – After the initialization, the first thing to be done is to calculate fitness scores using a fitness function. Our implementation of the fitness function is based on calculating how many times feed-forward pass output and given labels are the same. Each time they match, the score is incremented by one. After iterating through the whole data set, the resulting score is divided by the length of the input. This way we get the percentage of the correctly predicted outputs. 
def _calculate_fitness(nn: NeuralNetwork, inputs_: list, labels_: list):
   scored = 0
   for j in range(len(inputs_)):
       inputs = inputs_[j]
       data_label = labels_[j]
       prediction = nn.predict(inputs)
       predicted_index = np.argmax(prediction)
       label_index = np.argmax(data_label)
       if predicted_index == label_index:
           scored += 1
   return scored / len(inputs_)
  1. Elitism – After we’ve calculated fitness, we choose 2 individuals with the highest fitness and pass them directly to the children list, while deleting them from the current population. Removing units is done by iterating through the population and checking if the index is in the self.best scores list. If it isn’t in the list, we add it to the population for further modifications, otherwise, we delete its score from the total score. This ensures that the population size will stay the same 
def _elitism(self):
   self._children = []
 
   self._best_scores = sorted(range(len(self._scores)),\
					    key=lambda sub: self._scores[sub])[-self._elitism_num:]

   for i in range(len(self._population)):
       if i in self._best_scores:
           weights, biases = self._population[i].get_weights_and_biases()
           self._children.append([weights, biases])
   self._remove_units()
  1. Selection – Selection is implemented using NumPy random.choice method, where we provided a self. scores list whose elements sum up to one. Self. scores are calculated for each individual, by dividing each score(output from self. calculate fitness) with total score sum(self. score sum). Selection returns a pair of two individuals(parents).
def _selection(self):
   return np.random.choice(len(self._population), 2, p=self._scores)
  1. Crossover – After selection, parameters from parents are put into the child’s genes using a random method with a 50% probability for each gene from one parent to be used. After the process, parameters are appended to the offspring(self. children).
  1. Mutation – A user provides a probability for mutation, and according to that number, a random gene is being multiplied with some random float number between 0.01 and 0.5. Those boundaries are selected because the more drastic change in an individual’s genes could lead to divergence.

After these steps are finished, a new population is generated, using self.children list, where all updated weights and biases are stored. This is one epoch, and it is repeated until the defined number of epochs is reached. The best-fitting individual from the last population is the result of the algorithm.  

Training and Tests

Iris dataset

This is the first dataset we’ve trained our network on. This dataset includes three iris species, with 50 samples each, and also some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other. Columns that this dataset contains are Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species. The last one, Species, are representing the names of the flowers, and that is used as our targeting label. Because it is the type of string, we had to convert it into number form, so we could compare results from the predictions. Scores are: 

  • Using Gradient Descent 97% of accuracy after 500 epochs
  • Using Genetic Algorithm 87% of accuracy after 700 epochs

Optimal network architecture:

model=Model([[4], [2, "relu"], [3, "softmax"]], loss_function='cross_entropy', learning_rate=0.001)
 
model.train("gradient_descent", inputs=np.array(inputs), labels=np.array(labels), epochs=500, \
								  batch_size=1, shuffle_data=True)
          precision  recall   f1-score   support
 
    0       1.00      1.00      1.00        50
    1       1.00      0.92      0.96        50
    2       0.93      1.00      0.96        50
 
    accuracy                           0.97       150
   macro avg       0.98      0.97      0.97       150
weighted avg       0.98      0.97      0.97       150

Too big learning rate:

model = Model([[4], [2, "relu"], [3, "softmax"]], loss_function='cross_entropy', learning_rate=0.5)
 
model.train("gradient_descent", inputs=np.array(inputs), labels=np.array(labels), epochs=500, \
								batch_size=1, shuffle_data=True)
          precision  recall   f1-score   support
 
    0       0.00      0.00      0.00        50
    1       0.33      1.00      0.50        50
    2       0.00      0.00      0.00        50
 
    accuracy                           0.33       150
   macro avg       0.11      0.33      0.17       150
weighted avg       0.11      0.33      0.17       150

Too many hidden layers:

model=Model([[4], [4, "relu"], [2, "relu"], [3, "softmax"]], loss_function='cross_entropy', \
								              learning_rate=0.001)
 
model.train("gradient_descent", inputs=np.array(inputs), labels=np.array(labels), epochs=500, \ 
								   batch_size=1, shuffle_data=True)
          precision  recall   f1-score   support
 
     0       0.33      1.00      0.50        50
     1       0.00      0.00      0.00        50
     2       0.00      0.00      0.00        50
 
    accuracy                           0.33       150
   macro avg       0.11      0.33      0.17       150
weighted avg       0.11      0.33      0.17       150

Not enough epochs:

model=Model([[4], [2, "relu"], [3, "softmax"]], loss_function='cross_entropy', learning_rate=0.001)
 
model.train("gradient_descent", inputs=np.array(inputs), labels=np.array(labels), epochs=100, \ 
								   batch_size=1, shuffle_data=True)
          precision  recall   f1-score   support
 
     0       1.00      1.00      1.00        50
     1       0.49      0.96      0.65        50
     2       0.00      0.00      0.00        50
 
    accuracy                           0.65       150
   macro avg       0.50      0.65      0.55       150
weighted avg       0.50      0.65      0.55       150

Bigger batch size:

model=Model([[4], [2, "relu"], [3, "softmax"]], loss_function='cross_entropy', learning_rate=0.001)
 
model.train("gradient_descent", inputs=np.array(inputs), labels=np.array(labels), epochs=100, \ 
								  batch_size=50, shuffle_data=True)
          precision  recall   f1-score   support
 
     0       0.93      1.00      0.96        50
     1       0.94      0.30      0.45        50
     2       0.61      0.98      0.75        50
 
    accuracy                           0.76       150
   macro avg       0.83      0.76      0.72       150
weighted avg       0.83      0.76      0.72       150

Heart disease dataset

This problem can be treated as a regression problem, but in this case, it is implemented as a classification problem. The output is boolean, either the person has or does not have heart disease. In the actual dataset, there are 76 features, but only 14, the most influential factors are chosen, including age, sex, blood pressure, maximum heart rate achieved, chest pain type, thalassemia… According to those parameters, we can train the network to predict if the individual has or does not have any type of heart disease. Our network scored: 

  • Using Gradient Descent 91% of accuracy after 1000 epochs. 
  • Using Genetic Algorithm 68% of accuracy after 50 epochs.

Optimal network architecture:

model=Model([[13], [6, "relu"], [2, "softmax"]], loss_function="cross_entropy", learning_rate=0.005)
 
model.train("genetic_algorithm", np.array(x), np.array(y), epochs=50, shuffle_data=True, \ 
		      batch_size=10, elitism_num=3, population_size=20, mutation_probability=0.01)
          precision  recall   f1-score   support
 
     0       0.76      0.61      0.68        165
     1       0.62      0.77      0.69        138
 
    accuracy                           0.68       303
   macro avg       0.69      0.69      0.68       303
weighted avg       0.70      0.68      0.68       303

MNIST Handwritten Digits dataset

 

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image. Each training example is a 28×28 pixel grayscale image. Each pixel is a number from 0-255. To normalize the input values to have a smaller difference, we divided each pixel by 255, so that the values are from 0 to 1. The images are reshaped into a 1×784 vector. We did not train this dataset using the Genetic Algorithm because it would take too much time to train the network. Each individual in the population would have to iterate through 60k of training examples in order to calculate fitness. Using Gradient Descent we got 93% accuracy after 100 epochs. The problem with this is that MLP does not perform well on image datasets. Because it doesn’t have the ability to find patterns in an image, but just to determine the output based on the value of activated pixels. As a consequence, the user could not get valid predictions for the drawn number. The following example is implemented using React.js on the frontend and Flask on the server-side. The user can send an HTTP POST request to route “/predict” and the application will respond with the Model’s prediction for the provided input.

@app.route('/predict', methods=['POST'])
def predict():
   data = request.get_json()
   model = Model.from_file("../../saved/users/" + data['network_name'] + ".pck")
   input_array = np.array(data['input_array'])
   prediction = model.predict(input_array)
   response_data = {'prediction': prediction.tolist(), 'code': 'SUCCESS'}
   return make_response(jsonify(response_data), 200)

Drawing is performed in HTML canvas, which reads pixels and their values and then sends data to the backend. You can see the model’s prediction for the drawn number and the percentage for model confidence in the predicted number.

Conclusion

This project helped us understand the background of neural networks, and how they work under the hood. From training, we’ve learned what the optimal network architecture looks like, for training using either Gradient Descent or Genetic Algorithm. Each input argument, such as learning rate, batch size or a number of epochs, has an impact on the time spent training and the quality of the results. The main goal is to find the best proportion of those two factors.

Knowing that representability and trainability are two main attributes that describe the network, layer structure has a huge impact on the network’s performance. The network is “representable” if it can represent the problem with a certain level of complexity. The higher the number of layers, the more complex problems network can represent. The problem comes with us being able to train the network. Too many layers and a number of nodes in each layer can lead to slow convergence or high computational power demands.

The learning rate determines how big of a step is taken towards the global optimum. If the step is too big, the network might never converge because it would “jump over” the optimum. On the other hand, too small learning rate leads to slow convergence and long time training. So, the conclusion is that there is no “one size fits all”, but the user must experiment with different architectures to find the best solution for the problem.

The Gradient Descent algorithm has shown better results than the Genetic Algorithm. It takes less time to train the neural network, converges faster and requires significantly less computing power. While being slower, GAs are more suited for multi-criteria problems.

Thank you for reading!

 

Luka Bjelica

Luka Bjelica

Student

Luka Bjelica, a second-year Software Engineering student in the Faculty of Technical Sciences, University of Novi Sad.

Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.

Svetozar Vulin

Svetozar Vulin

Student

Svetozar Vulin, a second-year Software Engineering student in the Faculty of Technical Sciences, University of Novi Sad.

Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.