An artificial neural network (ANN) is modeled on biological neural networks such as those in the human brain. It is a type of machine learning structure that, through a set of algorithms, allows a computer to **learn**. We wanted to implement a tool that supports working with **generic** Artificial Neural Network architectures. The point was not to use the frameworks that are already available, but to dig into the math behind modern Artificial Intelligence. We believe that in order to truly **understand** something as complex as ANNs, you need to understand the theory and implementation behind frameworks such as Keras and PyTorch. The main goal was to break a big, complex problem into small, simple, and understandable pieces. As a saying often attributed to Albert Einstein goes: “If you can’t explain it simply, you don’t understand it well enough.”

So, we have implemented an **open-source** custom deep learning framework, **EduNN**, inspired by Keras and PyTorch. The goal was to build an **educational** platform that we and our students can use to learn about and experiment with neural networks. The complete framework code can be found **here**. We hope that you will like our approach and that you will try out our framework. Any feedback is welcome as we look for ways to improve our solution.

## EduNN Implementation

We decided to implement the library in *Python*, as it is one of the most popular technology choices in **data science**. Because weights and biases are represented as matrices, operations on them are easy to perform with the *NumPy* module. The library consists of a *Model*, which is the main component, a *Neural Network* that acts as a container for the parameters, and *Initializers*, *Operators*, *Losses*, *Optimizers*, and *Regularizers*.

The model is implemented as a **Feed-Forward Multilayer Perceptron (MLP)**, which means that each node (neuron) in a layer is connected to every node in the previous and the following layer. It supports only the **Supervised Learning** paradigm, in which the desired outputs are known and the model is trained to predict outcomes (*output*) from the given (*input*) data.
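To make the idea of a fully connected feed-forward pass concrete, here is a minimal NumPy sketch. It is not EduNN's actual code: the function names, the layer sizes, and the choice of sigmoid for every layer are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activation=sigmoid):
    """Propagate a batch x of shape (n_samples, n_inputs) through every layer."""
    a = x
    for W, b in zip(weights, biases):
        # Every neuron sees every output of the previous layer (dense connection).
        a = activation(a @ W + b)
    return a

# Hypothetical architecture: 4 inputs, one hidden layer of 5 neurons, 3 outputs.
rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3]
weights = [rng.random((n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

output = forward(rng.random((2, 4)), weights, biases)
print(output.shape)
```

Each weight matrix has one row per neuron in the previous layer and one column per neuron in the next layer, which is exactly the "every node connected to every node" structure of an MLP.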

### Model

This is the main **component**, which binds all the other components together. It defines a public API through which users can interact with the library, define the network’s architecture, and train the network.

Training the model is done by passing in training data together with the desired (target) outputs.

### Initializers

They provide the initial values for the model parameters at the start of training. Initialization plays an **important** role in training deep neural networks, as bad parameter initialization can lead to slow or no convergence. There are many ways to initialize the network weights, for example with small random values drawn from a normal distribution. Our implementation initializes weights to random values between 0 and 1 and biases to 0.
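The initialization scheme described above (uniform random weights between 0 and 1, zero biases) could look roughly like the following sketch. The function name `init_parameters` and the `layer_sizes` argument are hypothetical and are not taken from the EduNN code.

```python
import numpy as np

def init_parameters(layer_sizes, seed=None):
    """Return (weights, biases) for a fully connected network.

    layer_sizes: e.g. [4, 8, 3] means 4 inputs, 8 hidden neurons, 3 outputs.
    """
    rng = np.random.default_rng(seed)
    weights, biases = [], []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights.append(rng.random((n_in, n_out)))  # uniform values in [0, 1)
        biases.append(np.zeros(n_out))             # biases start at 0
    return weights, biases

weights, biases = init_parameters([4, 8, 3], seed=42)
print([W.shape for W in weights])
```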

### Operators

They are the **basic** building blocks of any neural network. Operators are vector-valued functions that transform the data. Some commonly used operators are layers such as linear, convolution, and pooling, and activation functions such as ReLU and Sigmoid. Since the library does not support **Convolutional Neural Networks**, the only operators implemented are activation functions. They map the output of each neuron into a desired range. The implemented operators and their ranges are: *Sigmoid* (0, 1), *ReLU* [0, inf), *SoftMax* (0, 1), *TanH* (-1, 1), *SoftPlus* (0, inf), *Gaussian* (0, 1], *Sinusoid* [-1, 1] and *BinaryStep* {0, 1}.

Example of sigmoid function:
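As a minimal sketch (not the library's own implementation), the sigmoid and a few of the other listed activations can be written with NumPy as follows. The function names, and the derivative helper that gradient descent would need, are assumptions chosen for this example.

```python
import numpy as np

def sigmoid(z):
    """Range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid, used by the chain rule during training."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    """Range [0, inf)."""
    return np.maximum(0.0, z)

def softplus(z):
    """Range (0, inf); a smooth approximation of ReLU."""
    return np.log1p(np.exp(z))

def gaussian(z):
    """Range (0, 1], with the maximum at z = 0."""
    return np.exp(-z ** 2)

def softmax(z):
    """Each output is in (0, 1) and the outputs sum to 1 along the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), relu(z), softmax(z))
```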

### Losses

In math, losses are **closed-form** and **differentiable** mathematical expressions that are used as surrogates for the optimization objective of the problem at hand. Loss functions provide feedback on how the training is going. They map a vector of values to a number that represents the quality of the network at a certain moment of training.

“*The cost or loss function has an important job in that it must faithfully distill all aspects of the model down into a single number in such a way that improvements in that number are a sign of a better model.*”

The implemented losses are *Mean Squared Error* (MSE), *Mean Absolute Error* (MAE), *Cross-Entropy* (CE) and *Binary Cross-Entropy* (BCE).
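The four losses can be sketched in a few lines of NumPy. This is an illustrative version only: the function names and the clipping constant `EPS` are assumptions, and the library's own code may differ.

```python
import numpy as np

EPS = 1e-12  # assumed small constant that keeps log() away from 0

def mse(y_true, y_pred):
    """Mean Squared Error."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def cross_entropy(y_true, y_pred):
    """Cross-Entropy for one-hot targets and predicted class probabilities."""
    y_pred = np.clip(y_pred, EPS, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))

def binary_cross_entropy(y_true, y_pred):
    """Binary Cross-Entropy for 0/1 targets and predicted probabilities."""
    y_pred = np.clip(y_pred, EPS, 1.0 - EPS)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

print(mse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
```

Every function maps a whole vector of predictions to a single number, which is what lets the optimizer compare two sets of parameters.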

### Regularizers

This component provides the control mechanism needed to avoid **overfitting** and promote **generalization**. L2 weight regularization is implemented, which penalizes large weights and keeps them small and more uniform.
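A common formulation of L2 regularization adds the penalty (λ/2)·Σw² to the loss, which contributes λ·w to the gradient of each weight. The sketch below assumes that formulation. The names `l2_penalty`, `l2_gradient` and `lam` are hypothetical, and the scaling used in EduNN may differ.

```python
import numpy as np

def l2_penalty(weights, lam):
    """Extra loss term: (lam / 2) * sum of all squared weights."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def l2_gradient(W, lam):
    """Gradient of the penalty with respect to one weight matrix."""
    return lam * W

W = np.array([[1.0, 2.0], [3.0, 4.0]])
print(l2_penalty([W], lam=0.1))
```

Because the gradient is proportional to the weight itself, large weights are pulled toward zero more strongly than small ones.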

### Optimizers

Optimizers provide the necessary recipe to update model parameters with respect to the optimization **objective**. In this context, the optimization objective is to **minimize** the error function that depends on the network’s parameters (*weights* and *biases*). The parameters are considered optimal when the error function reaches its **global minimum**.

Gradient Descent and Genetic Algorithm are two optimizers used for fitting the model.

#### Gradient Descent

For network training using gradient descent, we implemented the following steps:

1. **Creating a mini-batch** – In this step, a portion of the dataset is taken for use in the current iteration. Its size is equal to the batch size that the user provides.

2. **Forward pass and calculating loss** – The mini-batch is propagated forward through the network, and the loss is computed from the network’s output and the target outputs.

3. **Backward pass** – In this step, the error is calculated and passed backward through the network. After the backward pass, a delta (the portion of the error) has been calculated for each layer. It is important to apply the derivative of the activation function to each delta because of the chain rule. The deltas of each layer are stored in a list.

4. **Backpropagation** – This is the main step, in which the gradients are calculated from the deltas and the layer outputs from the feed-forward pass.

5. **Adjusting weights and biases** – This is the step in which the parameters are updated according to the calculated gradients. The learning rate determines how big the update is. If the learning rate is too large, training can diverge and never find the minimum. If it is too small, finding the minimum takes too long.
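The five steps above can be sketched as a small training loop. This is a simplified illustration, not the EduNN implementation: it assumes one hidden layer, sigmoid activations, a squared-error loss, and a made-up toy dataset, and all names (`train`, `hidden`, `lr`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, W1, b1, W2, b2):
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)

def train(X, y, hidden=4, lr=0.5, epochs=200, batch_size=16, seed=0):
    rng = np.random.default_rng(seed)
    W1, b1 = rng.random((X.shape[1], hidden)), np.zeros(hidden)
    W2, b2 = rng.random((hidden, y.shape[1])), np.zeros(y.shape[1])
    history = [np.mean((predict(X, W1, b1, W2, b2) - y) ** 2)]  # loss before training
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            # 1. Create a mini-batch.
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            # 2. Forward pass (the loss is 0.5 * squared error).
            a1 = sigmoid(xb @ W1 + b1)
            a2 = sigmoid(a1 @ W2 + b2)
            # 3. Backward pass: deltas include the sigmoid derivative a * (1 - a).
            delta2 = (a2 - yb) * a2 * (1 - a2)
            delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
            # 4. Backpropagation: gradients from deltas and layer inputs.
            grad_W2, grad_b2 = a1.T @ delta2 / len(xb), delta2.mean(axis=0)
            grad_W1, grad_b1 = xb.T @ delta1 / len(xb), delta1.mean(axis=0)
            # 5. Adjust weights and biases, scaled by the learning rate.
            W2 -= lr * grad_W2
            b2 -= lr * grad_b2
            W1 -= lr * grad_W1
            b1 -= lr * grad_b1
        history.append(np.mean((predict(X, W1, b1, W2, b2) - y) ** 2))
    return history

# Toy data (an assumption for this example): label is 1 when x0 + x1 > 1.
data_rng = np.random.default_rng(1)
X = data_rng.random((200, 2))
y = (X.sum(axis=1, keepdims=True) > 1.0).astype(float)
history = train(X, y)
print(history[0], history[-1])
```

The `history` list records the loss on the full dataset before training and after every epoch, so it can be used to check that training is actually reducing the error.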

#### Genetic Algorithm

The **Genetic Algorithm** is based on the principles of *Darwin’s theory of evolution*. A population consists of a certain number of individuals. Every individual is a candidate solution to the problem (a neural network). Each individual consists of a set of **genes** (parameters), which are stored in a *Python* list. During every epoch, the individuals in the population are modified in order to create better solutions to the given problem. The network is trained using the genetic algorithm in the following steps:

**Creating an initial population** – The Genetic Algorithm creates an initial population of individuals. Each individual represents one instance of the neural network with randomly generated weights and biases.

**Calculating fitness score** – After initialization, the first step is to calculate fitness scores using a fitness function. Our fitness function counts how many times the output of the feed-forward pass matches the given label. Each time they match, the score is incremented by one. After iterating through the whole dataset, the resulting score is divided by the number of inputs. This gives the fraction of correctly predicted outputs.

**Elitism** – After calculating fitness, we choose the 2 individuals with the highest fitness and pass them directly to the children list, while removing them from the current population. Removal is done by iterating through the population and checking whether each index is in the *self.best_scores* list. If it is not in the list, we keep the individual in the population for further modification; otherwise, we subtract its score from the total score. This ensures that the population size stays the same.

**Selection** – Selection is implemented using NumPy’s *random.choice* method, to which we pass the *self.scores* list, whose elements sum to one. Each entry of *self.scores* is calculated by dividing an individual’s score (the output of *self.calculate_fitness*) by the total score (*self.score_sum*). Selection returns a pair of individuals (the parents).

**Crossover** – After selection, the parents’ parameters are combined into the child’s genes: each gene is taken from one parent or the other with 50% probability. The resulting parameters are then appended to the offspring (*self.children*).

**Mutation** – The user provides a mutation probability, and with that probability a random gene is multiplied by a random float between 0.01 and 0.5. These bounds were chosen because a more drastic change in an individual’s genes could lead to divergence.

After these steps are finished, a new population is generated from the *self.children* list, where all updated weights and biases are stored. This is one epoch, and it is repeated until the defined number of epochs is reached. The best-fitting individual from the last population is the result of the algorithm.
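The steps above can be sketched as follows. This is a simplified illustration rather than the framework's code: individuals are flat NumPy arrays instead of Python lists, the function names are hypothetical, and a tiny constant is added to the scores so that selection still works when some fitness values are zero.

```python
import numpy as np

def accuracy_fitness(predicted, labels):
    """Fraction of predictions that match the labels."""
    return np.mean(np.asarray(predicted) == np.asarray(labels))

def select_parents(population, fitness, rng):
    """Pick two different parents with probability proportional to fitness."""
    scores = fitness + 1e-9            # assumed guard against all-zero fitness
    probs = scores / scores.sum()      # probabilities must sum to one
    i, j = rng.choice(len(population), size=2, replace=False, p=probs)
    return population[i], population[j]

def crossover(parent_a, parent_b, rng):
    """Each gene comes from either parent with 50% probability."""
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def mutate(genes, mutation_prob, rng):
    """With the given probability, scale one random gene by a factor in [0.01, 0.5)."""
    genes = genes.copy()
    if rng.random() < mutation_prob:
        k = rng.integers(len(genes))
        genes[k] *= rng.uniform(0.01, 0.5)
    return genes

def next_generation(population, fitness, mutation_prob, rng, n_elite=2):
    """Build a new population of the same size (one epoch)."""
    order = np.argsort(fitness)[::-1]                # best individuals first
    elite_idx, rest_idx = order[:n_elite], order[n_elite:]
    children = [population[i].copy() for i in elite_idx]   # elitism
    rest = [population[i] for i in rest_idx]
    rest_fitness = fitness[rest_idx]
    while len(children) < len(population):
        a, b = select_parents(rest, rest_fitness, rng)
        children.append(mutate(crossover(a, b, rng), mutation_prob, rng))
    return children

rng = np.random.default_rng(0)
population = [rng.random(5) for _ in range(6)]       # 6 individuals, 5 genes each
fitness = np.array([0.2, 0.9, 0.5, 0.7, 0.1, 0.4])   # made-up fitness scores
new_population = next_generation(population, fitness, mutation_prob=0.1, rng=rng)
print(len(new_population))
```

In a real run, `fitness` would come from evaluating each individual's network on the training data, for example with `accuracy_fitness`.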

## Training and Tests

### Iris dataset

This is the first dataset we trained our network on. It includes three iris species, with 50 samples each, along with several properties of each flower. One species is linearly separable from the other two, but the other two are not linearly separable from each other. The columns in this dataset are *Id*, *SepalLengthCm*, *SepalWidthCm*, *PetalLengthCm*, *PetalWidthCm*, and *Species*. The last one, *Species*, contains the names of the flowers and is used as our **target label**. Because it is a string, we had to convert it into numeric form so that we could compare it with the predictions. The scores are:

- Using Gradient Descent: 97% accuracy after 500 epochs
- Using Genetic Algorithm: 87% accuracy after 700 epochs
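The conversion of the string *Species* column into numbers, mentioned earlier in this section, can be done in several ways. The sketch below uses `np.unique` and is only one possible approach. The species strings are assumed values written in the style of the common CSV version of the dataset, not read from the actual file.

```python
import numpy as np

# Assumed example values for the Species column.
species = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"])

# classes holds the sorted unique names; labels holds the integer index of each row.
classes, labels = np.unique(species, return_inverse=True)

# One-hot encoding: one output neuron per class.
one_hot = np.eye(len(classes))[labels]

print(labels)
print(one_hot)
```

Integer labels are convenient for comparing predictions with targets, while the one-hot form matches a network with one output neuron per species.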

Optimal network architecture:

Learning rate too large:

Too many hidden layers:

Not enough epochs:

Bigger batch size:

### Heart disease dataset

This problem can be treated as a **regression** problem, but in this case it is implemented as a **classification** problem. The output is boolean: the person either has or does not have heart disease. The actual dataset has 76 features, but only the 14 most influential ones are used, including age, sex, blood pressure, maximum heart rate achieved, chest pain type, and thalassemia. Based on those features, we can train the network to predict whether an individual has any type of heart disease. Our network scored:

- Using Gradient Descent: 91% accuracy after 1000 epochs
- Using Genetic Algorithm: 68% accuracy after 50 epochs

Optimal network architecture:

### MNIST Handwritten Digits dataset

The **MNIST** database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image. Each example is a 28×28 pixel grayscale image, and each pixel is a number from 0 to 255. To bring the input values into a smaller range, we divided each pixel by 255, so that the values lie between 0 and 1. The images are then reshaped into 1×784 vectors.

We did not train on this dataset using the *Genetic Algorithm* because it would take too much time: each individual in the population would have to iterate through all 60,000 training examples in order to calculate its fitness. Using *Gradient Descent*, we got 93% accuracy after 100 epochs. The problem is that an MLP does **not** perform well on image datasets, because it cannot find patterns in an image and only determines the output based on the values of the activated pixels. As a consequence, the user could not reliably get valid predictions for a drawn number.

The following example is implemented using *React.js* on the frontend and *Flask* on the server side. The user can send an **HTTP POST** request to the route “/predict”, and the application will respond with the model’s prediction for the provided input.

Drawing is performed on an HTML canvas; the pixel values are read from it and sent to the backend. You can see the model’s prediction for the drawn number, along with the model’s confidence in that prediction.
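As a minimal sketch of the preprocessing described earlier in this section (scaling pixels to the range 0 to 1 and flattening each 28×28 image into a 784-element vector), assuming the images arrive as a NumPy array. The function name `preprocess` and the random stand-in images are hypothetical.

```python
import numpy as np

def preprocess(images):
    """Scale pixel values from 0-255 to 0-1 and flatten each image to 784 values."""
    images = np.asarray(images, dtype=np.float64)
    return images.reshape(len(images), 28 * 28) / 255.0

# Random stand-in data with the same shape and value range as MNIST images.
fake_images = np.random.default_rng(0).integers(0, 256, size=(3, 28, 28))
X = preprocess(fake_images)
print(X.shape)
```

The same function would need to be applied to the pixels sent from the canvas, so that the input at prediction time is scaled the same way as the training data.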

## Conclusion

This project helped us understand the **background** of neural networks and how they work under the hood. From training, we learned what the **optimal** network architecture looks like when training with either Gradient Descent or the Genetic Algorithm. Each input argument, such as the learning rate, the batch size, or the number of epochs, has an impact on the training time and on the **quality** of the results. The main goal is to find the best trade-off between those two factors.

Representability and trainability are the two main attributes that describe a network, so the layer structure has a huge impact on the network’s performance. The network is “**representable**” if it can represent a problem of a certain level of complexity. The higher the number of layers, the more complex the problems the network can represent. The difficulty lies in being able to train the network. Too many layers, or too many nodes in each layer, can lead to **slow** convergence or **high** computational power demands.

The learning rate determines how big a step is taken towards the global optimum. If the step is too big, the network might never converge because it would “jump over” the **optimum**. On the other hand, a learning rate that is too small leads to slow convergence and long training times. So, the conclusion is that there is no “one size fits all”; the user must **experiment** with different architectures and settings to find the best solution for the problem.
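The effect of the learning rate can be seen even on a one-dimensional toy problem. The sketch below minimizes f(w) = w², whose gradient is 2w, using three learning rates. The function and the specific values are assumptions chosen only to illustrate the behavior, not settings from our experiments.

```python
def gradient_descent(lr, steps=50, w0=5.0):
    """Minimize f(w) = w**2 starting from w0, using the gradient 2 * w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

good = gradient_descent(lr=0.1)         # converges towards the minimum at w = 0
too_big = gradient_descent(lr=1.1)      # overshoots further on every step and diverges
too_small = gradient_descent(lr=0.001)  # moves in the right direction, but very slowly

print(good, too_big, too_small)
```

With each update, `w` is multiplied by `(1 - 2 * lr)`, so the iteration converges only when that factor has an absolute value below 1, and it converges slowly when the factor is close to 1.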

The Gradient Descent algorithm has shown better results than the Genetic Algorithm: it takes less time to train the neural network, converges faster, and requires significantly less computing power. Although they are slower, GAs are better suited to **multi-criteria** problems.

Thank you for reading!

#### Luka Bjelica

Student

**Luka Bjelica, a second-year Software Engineering student in the Faculty of Technical Sciences, University of Novi Sad.**

**Rubik’s Code** is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the **services** we provide.

#### Svetozar Vulin

Student

Svetozar Vulin, a second-year Software Engineering student in the Faculty of Technical Sciences, University of Novi Sad.
