## DNN Regression using MATLAB with a strange Gradient Descent

Note that in this post, the gradient descent used is not the conventional one. This version is up only for trial.

I was at SimTech A*STAR working on a laser simulation project when I tried to incorporate some machine learning into it. Software installations were difficult to get approved, so I decided to write a Deep Neural Network regression from scratch in the medium available there, MATLAB. It was a fun project in its own right, since I got to implement some features myself, including k-fold cross validation and an adaptive learning rate for the gradient descent.

In this code, adaptive_batch_H_cross_validation_gradient_descent considers every single data point in a batch, then averages over all of them when the network is actually updated in a step. In k-fold validation, the algorithm again averages all k training sessions, weighted by inverse square. Unfortunately, it does not use the GPU.

The code can be obtained here, while the manual can be found here. The manual is more complete than the exposition here and uses quite a number of LaTeX-generated equations, so I urge you to read it instead. Enjoy!

In the manual, you will see an example, tutorial1.m, that uses the following neural network. The true equations are set up so that they can be solved exactly by this model, which uses sigmoid activation functions in the hidden layers and a linear activation at the output layer (since it is a regression problem).

To obtain the value of h11 above, we calculate the sum S11 = x1*W_(h11,x1) + x2*W_(h11,x2) + x3*W_(h11,x3) + bias_11, as shown below, and then h11 = f(S11), where f is the chosen activation function, W_(h11,xN) is the weight for each input xN, and bias_11 is the bias. This process is repeated for every perceptron, from h12 to h22. The outputs y1 and y2 are computed similarly, but with a linear activation function, i.e. f(x) = x, which is just the identity.

The values of the weights and biases are adjusted through (a variant of) gradient descent. Let us define
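The forward computation just described can be sketched in a few lines of MATLAB. This is only an illustration, not the code from tutorial1.m: the 3-2-2-2 layer sizes are inferred from the perceptron names (h11 to h22, y1, y2), and every weight, bias and input value, as well as the variable names W1, b1 and so on, are made up for the example.

```matlab
% Forward pass through a small fully connected network.
% ASSUMPTIONS: 3 inputs, two hidden layers of 2 perceptrons each, 2 outputs;
% all numeric values below are arbitrary illustration values.
sigmoid    = @(s) 1 ./ (1 + exp(-s));   % hidden-layer activation f
linear_act = @(s) s;                    % output activation f(x) = x

x  = [0.5; -1.0; 2.0];                  % inputs x1, x2, x3

W1 = [0.1 0.2 0.3; -0.4 0.5 -0.6];      % row 1 holds W_(h11,x1..x3)
b1 = [0.05; -0.05];                     % bias_11, bias_12
W2 = [0.7 -0.8; 0.9 1.0];               % weights into h21, h22
b2 = [0.0; 0.1];
W3 = [1.5 -0.5; 0.25 0.75];             % weights into y1, y2
b3 = [0.2; -0.2];

S1 = W1 * x + b1;        % S1(1) is S11 = x1*W_(h11,x1) + x2*W_(h11,x2) + x3*W_(h11,x3) + bias_11
h1 = sigmoid(S1);        % h11, h12
S2 = W2 * h1 + b2;
h2 = sigmoid(S2);        % h21, h22
y  = linear_act(W3 * h2 + b3);   % y1, y2

disp(y);
```

Writing each layer as a matrix-vector product computes all the sums S of that layer at once; the per-perceptron formula in the text is one row of that product.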

1. lr : learning rate
2. W_next : the vector storing all the weights and biases in the next step of the gradient descent
3. W_now : the vector storing all the weights and biases in the current step of the gradient descent

then one step in the gradient descent is W_next = W_now - lr * grad(MSE(W_now)), where grad is the gradient operator and MSE is the mean squared error.
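To make the update rule concrete, here is a minimal sketch of a single step on a toy model with only two parameters. The model y = W(1)*x + W(2), the data, and the helper names predict, mse and grad_mse are all assumptions made for this example; the actual code works on the full vector of network weights and biases.

```matlab
% One conventional gradient descent step: W_next = W_now - lr * grad(MSE(W_now)).
% ASSUMPTION: a toy linear model stands in for the network.
lr     = 0.01;                 % learning rate
x      = [1; 2; 3; 4];
y_true = 2 * x + 1;            % targets generated by slope 2, intercept 1
W_now  = [0; 0];               % [slope; intercept], current step

predict  = @(W) W(1) * x + W(2);
mse      = @(W) mean((predict(W) - y_true) .^ 2);
% Analytic gradient of the MSE with respect to [W(1); W(2)]
grad_mse = @(W) [mean(2 * (predict(W) - y_true) .* x); ...
                 mean(2 * (predict(W) - y_true))];

W_next = W_now - lr * grad_mse(W_now);   % the update from the text

fprintf('MSE before: %.4f, after: %.4f\n', mse(W_now), mse(W_next));
```

With a small enough lr, the MSE at W_next is lower than at W_now; too large a learning rate can overshoot, which is one motivation for an adaptive learning rate.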

For 1000 data points, we can set, for example, a batch size of 10. Then for 1 epoch (the current version only runs 1 epoch), there will be 1000/10 = 100 batches, each producing one averaged gradient descent update. Each averaged update is the average of the gradient descents over every point in the batch, so with a batch size of 10 we average 10 gradient descents. That is the gist of the code.
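The batch arithmetic above can be sketched as follows. This is a hedged illustration on a toy linear model, not the routine from the code: the data, the fixed learning rate and the pseudo-shuffled visiting order are assumptions, and the adaptive learning rate and cross validation are left out. It also reads "averaging 10 gradient descents" as averaging the 10 per-point gradients taken at the same weights, which gives the same update as averaging the 10 per-point steps.

```matlab
% One epoch of batch-averaged gradient descent: 1000 points, batch size 10.
% ASSUMPTIONS: toy model y = W(1)*x + W(2), fixed learning rate, made-up data.
n_points   = 1000;
batch_size = 10;
n_batches  = n_points / batch_size;     % 1000/10 = 100 updates per epoch
lr         = 0.1;

order  = mod((0:n_points-1) * 7, n_points) + 1;  % fixed permutation (7 is coprime to 1000)
x_all  = linspace(-1, 1, n_points)';
x      = x_all(order);
y_true = 2 * x + 1;

W = [0; 0];                              % [slope; intercept]
mse = @(W) mean((W(1) * x + W(2) - y_true) .^ 2);
initial_mse = mse(W);

n_updates = 0;
for b = 1:n_batches
    idx = (b - 1) * batch_size + (1:batch_size);
    grad_sum = zeros(2, 1);
    for i = idx
        err = W(1) * x(i) + W(2) - y_true(i);             % residual of one point
        grad_sum = grad_sum + [2 * err * x(i); 2 * err];  % per-point gradient
    end
    W = W - lr * grad_sum / batch_size;  % one averaged update per batch
    n_updates = n_updates + 1;
end
final_mse = mse(W);

fprintf('updates: %d, MSE %.4f -> %.6f\n', n_updates, initial_mse, final_mse);
```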

We can see that the losses improve over the course of training, as shown in the following figure. The different markings (x, o, and overlapping, hence invisible, red x marks) show the k-th cross validation. You can watch this in real time by running the code with plot_loss="dynamic". As the losses decrease, the model should predict the output better. In this tutorial, the outputs are y1 and y2. The following figures plot real values against predicted values, so predictions are correct when the points lie along the y=x line. Figures (A) and (B) show the initial and final predictions for y1; again, you can watch the prediction improve in real time in the code. Figure (C) shows the final prediction for y2.