# Neural Networks Explained

The artificial neural network concept is inspired by the way the human brain solves complex tasks. A neural network is a set of connected input/output units in which each connection has an associated weight. The network learns by adjusting these weights so that the model predicts the correct class label.

Neural networks are a vast, complex and mathematical topic, but this article tries to give a basic understanding of the process and terms used to build an NN model.

A neural network consists of the following components – an input layer, one or more hidden layers, an output layer, a set of weights and biases between each layer, and an activation function for each hidden layer. If the network has more than one hidden layer, we call it a deep artificial neural network. There are two types of neural networks – supervised NN (used for prediction) and unsupervised NN (used for pattern recognition).

Steps to build a standard artificial neural network (ANN) / multilayer perceptron (MLP) model:

- Start with the input layer and forward-propagate the patterns of the training data through the network to generate an output.
- Based on the output, calculate the error that we want to minimize using a loss function.
- Backpropagate the error: find its derivative with respect to each weight and update the model.
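The three steps above can be sketched as a tiny one-hidden-layer MLP in NumPy. The layer sizes, learning rate, iteration count and XOR-style toy data are illustrative choices of mine, not prescriptions from this article:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # toy targets (XOR)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # hidden-layer weights and biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # output-layer weights and biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5      # learning rate (illustrative)
losses = []
for _ in range(5000):
    # 1. forward-propagate the training patterns to generate an output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2. calculate the error we want to minimize (SSE cost)
    losses.append(float(np.sum((y - out) ** 2)))
    # 3. backpropagate: chain rule gives the error's derivative w.r.t. each
    #    weight, and we update the model by stepping against that gradient
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)
```

Repeating the three steps drives the cost down over the iterations, which is the whole training loop in miniature.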

The **weights** that connect variables in a neural network are partially analogous to the coefficients in a regression model and can be used to describe the importance of variables. We optimize an objective function, such as the Sum of Squared Errors (SSE) cost function, to find the optimal weights of the model. The bias is a constant term added at the summation junction.
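As a concrete illustration of a single unit's summation junction (the input values, weights and bias below are made up):

```python
# Weighted sum of inputs plus a bias constant at the summation junction.
inputs = [0.5, -1.2, 3.0]
weights = [0.8, 0.1, -0.4]
bias = 2.0

# z = 0.8*0.5 + 0.1*(-1.2) + (-0.4)*3.0 + 2.0
z = sum(w * x for w, x in zip(weights, inputs)) + bias
```

The activation function is then applied to `z` to produce the unit's output.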

Calculating the predicted output y is known as feedforward, while updating the weights and biases is known as backpropagation. The objective of the algorithm is to find the set of weights and biases that minimizes the loss function.

An **activation function** brings non-linearity into the network, adding the ability to learn complex data. We can use any activation function (sigmoid, tanh) in an ANN as long as it is differentiable. The sigmoid ranges between 0 and 1; tanh ranges between -1 and +1. However, the logistic (sigmoid) function can be problematic for highly negative input, since the output saturates close to 0. The sigmoid is a natural choice when the output is binary.
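A quick sketch of these ranges and of sigmoid saturation (the function name and sample inputs are my own choices):

```python
import math

def sigmoid(z):
    # logistic function: squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

sig_mid = sigmoid(0.0)     # midpoint of the sigmoid's (0, 1) range
tanh_mid = math.tanh(0.0)  # midpoint of tanh's (-1, +1) range
sig_sat = sigmoid(-10.0)   # highly negative input: output saturates near 0
```

Near saturation the sigmoid's slope is almost zero, which is why highly negative inputs slow learning.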

**Backpropagation** is a widely used algorithm to train artificial neural networks. It is a computationally efficient way to compute the partial derivatives of the cost function, and those derivatives are then used to learn the weight coefficients that parametrize the ANN.

**Gradient descent** is an optimization algorithm used to find the weights that minimize the cost function. It climbs down the cost surface toward a (possibly local) cost minimum, where the step size is determined by the value of the learning rate.
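A minimal sketch of gradient descent on a one-dimensional toy cost, f(w) = (w - 3)², whose minimum sits at w = 3 (the cost, learning rate and step count are illustrative, not from the article):

```python
def grad(w):
    # derivative of the toy cost f(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0    # starting weight
lr = 0.1   # learning rate: sets the step size of each descent step
for _ in range(100):
    w -= lr * grad(w)  # step downhill, against the gradient
```

Each update moves `w` a fraction of the way toward the minimum, so after enough steps it converges to 3; a larger learning rate takes bigger steps but can overshoot.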

Mean squared error and cross-entropy are the two main types of loss function (also called cost or error function) used to train an NN model; the network requires a loss function to calculate its error.
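Both losses can be computed by hand on toy values (the labels and predicted probabilities below are made up for illustration):

```python
import math

y_true = [1, 0, 1]        # toy binary labels
y_pred = [0.9, 0.2, 0.8]  # toy predicted probabilities

# Mean squared error: average of the squared differences
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Binary cross-entropy: heavily penalizes confident wrong predictions
bce = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
           for t, p in zip(y_true, y_pred)) / len(y_true)
```

Cross-entropy pairs naturally with sigmoid/softmax outputs, while MSE is the usual choice for regression targets.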

A network that is too small tends to underfit, while one that is too large tends to overfit.

**Underfitting** – The model is not complex enough to capture the pattern in the training data and consequently does not perform well on test data. The model has high bias, which can be due to its simplicity.

**Overfitting** – The model performs well on training data but not on test data. The model has high variance, which can be due to a large number of parameters or excessive complexity.

To check whether a model has an overfitting issue, simply compare the training-sample accuracy with the test-sample accuracy. If the training accuracy is considerably higher than the test accuracy, the model is overfitted.

We use **regularization** to find a good bias-variance trade-off. Regularization introduces additional information to penalize extreme parameter values. A common strategy is to build a large network and then apply regularization schemes, such as L2 regularization (also known as L2 shrinkage or weight decay) or dropout, to prevent overfitting. In addition, the popular approaches to regularized linear regression are Ridge Regression, the Least Absolute Shrinkage and Selection Operator (LASSO) and Elastic Net.
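A sketch of how the L2 penalty shrinks weights toward zero (the weight values, regularization strength `lam` and learning rate are illustrative; the data-gradient term is omitted to isolate the penalty's effect):

```python
weights = [4.0, -3.0, 0.5]
lam, lr = 0.1, 0.01  # regularization strength and learning rate (assumed)

def l2_penalty(ws):
    # the extra term added to the cost: lam * sum of squared weights
    return lam * sum(w ** 2 for w in ws)

before = l2_penalty(weights)
# One weight-decay step: the penalty's gradient (2 * lam * w) shrinks
# every weight by a constant factor each update.
weights = [w - lr * 2 * lam * w for w in weights]
after = l2_penalty(weights)
```

Because large weights are penalized quadratically, the decay pulls extreme parameter values down the hardest, which is exactly the trade-off regularization is after.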

We apply **cross-validation** techniques – holdout cross-validation and k-fold cross-validation – to find optimal hyperparameter values and to get reliable estimates of the model's generalization performance. The holdout method separates the data into three parts – training, validation and test. We first fit the model on the training set, then change the hyperparameters and repeatedly evaluate the model's performance on the validation set. Once the hyperparameter values are tuned, we estimate the model's generalization performance on the test data. In k-fold cross-validation, we repeat the holdout method k times on k subsets of the training data: we randomly split the training data into k folds without replacement, use k-1 folds for model training and the remaining fold for performance evaluation, and then compute the average performance of the model. This yields a more accurate and robust estimate. In addition, empirical evidence shows that a good standard value of k is 10.
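A sketch of the k-fold split described above, with k = 10; the per-fold evaluation is a hypothetical placeholder rather than a real model fit, and a real run would shuffle the indices before splitting:

```python
n, k = 100, 10            # sample count and fold count (illustrative)
indices = list(range(n))  # shuffle these first in a real run
fold_size = n // k
scores = []
for i in range(k):
    # 1 fold held out for evaluation, the remaining k-1 folds for training
    test_idx = indices[i * fold_size:(i + 1) * fold_size]
    train_idx = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
    # placeholder "score": a real run would train on train_idx, score on test_idx
    scores.append(len(train_idx) / n)
avg_score = sum(scores) / k  # average performance across the k folds
```

Every sample lands in the evaluation fold exactly once, and the averaged score is the cross-validated performance estimate.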

If a neural network model is not giving the results you expect, create an ensemble of neural network models (multiple NN models) and combine their predictive power. An ensemble of weak learners with low correlation between them can outperform an ensemble whose members are highly correlated.
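A minimal sketch of combining an ensemble by averaging member outputs (the member predictions below are hypothetical probabilities, not outputs of trained models):

```python
# Each inner list is one member model's predicted probabilities
# for the same three examples.
member_preds = [
    [0.9, 0.2, 0.6],
    [0.7, 0.1, 0.7],
    [0.8, 0.3, 0.4],
]

# Average the members' predictions example-by-example...
ensemble = [sum(p) / len(member_preds) for p in zip(*member_preds)]
# ...then threshold the averaged probabilities to get class labels.
labels = [1 if p >= 0.5 else 0 for p in ensemble]
```

Averaging works best when the members' errors are weakly correlated, so that individual mistakes cancel out rather than reinforce each other.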

**Convolutional neural networks** (CNNs) are used for image classification, while **recurrent neural networks** (RNNs) are used for modelling sequential data (e.g. time series). RNNs are designed for modelling sequences: remembering past information and processing new events accordingly.

Deep learning libraries available in Python include TensorFlow and Keras; in R, neuralnet and nnet.

While building a neural network model, I observed a key point I would like to share: dropping correlated input variables worsened the results in terms of precision and recall, but gave stable results across validation samples. A model built with the correlated variables had better but unstable results. Worth investigating further!