Toy Model - TF/Keras - Architecture Intuition


Now, we are going to search for the best architecture by observing the prediction accuracy.

If you ran the code we built here, you saw that everything compiles, but the network is useless. Now it is time to add more layers and activation functions and to define a better loss function. But where to start?

First we need to see how the network is minimizing the loss function across the epochs. As shown in the example here, we can visualize this in a 2D graphic with the support of Matplotlib.

As you can see, the accuracy measure of the example above cannot be calculated because we are working with regression. In this kind of problem, there are other metrics to use instead, like the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), and the Mean Relative Error (MRE). To compute them in TF, write:

  • model.compile(metrics=[tf.keras.metrics.RootMeanSquaredError()], ...)
  • acc = model_history.history['root_mean_squared_error']
  • val_acc = model_history.history['val_root_mean_squared_error']
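Putting those pieces together, here is a minimal sketch of the training and plotting code, assuming the model, x_train and y_train built in the previous post:

  import tensorflow as tf
  import matplotlib.pyplot as plt

  # 'model', 'x_train' and 'y_train' are assumed to come from the previous post.
  model.compile(loss="mse", optimizer="adam",
                metrics=[tf.keras.metrics.RootMeanSquaredError()])
  model_history = model.fit(x_train, y_train, epochs=50, validation_split=0.2)

  # Loss and RMSE across the epochs, for the training and validation data.
  loss = model_history.history["loss"]
  val_loss = model_history.history["val_loss"]
  acc = model_history.history["root_mean_squared_error"]
  val_acc = model_history.history["val_root_mean_squared_error"]

  plt.plot(loss, label="loss")
  plt.plot(val_loss, label="val_loss")
  plt.plot(acc, label="RMSE")
  plt.plot(val_acc, label="val_RMSE")
  plt.xlabel("epoch")
  plt.legend()
  plt.show()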

Which will lead to a graphic like this:

Even though this behaviour is expected, we can see from the prediction output that none of them are correct. A loss of 0.005 and an RMSE of 0.075 are still large values: if our goal is to compute areas with a precision of 0.00001, then, since the MSE is roughly the square of the typical error, it must be around (1e-5)^2 = 1e-10.

To get a better hint of whether the loss is low enough, we can calculate the percentage of predictions that are correct within the desired numeric precision. If 95% of the predictions are within 0.00001 of the expected value, for instance, then the network works well enough. The implemented code is in the function "accuracy()" here.
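For reference, a minimal sketch of the idea behind that function (the actual accuracy() implementation is the one linked above; the 1e-5 threshold is just the example precision):

  import numpy as np

  # Fraction of predictions that fall within the desired precision of the
  # expected value. This is a sketch of the idea; see the linked accuracy().
  def accuracy(y_true, y_pred, precision=1e-5):
      y_true = np.asarray(y_true).ravel()
      y_pred = np.asarray(y_pred).ravel()
      return np.mean(np.abs(y_true - y_pred) < precision)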


The Loss Function and Metrics

Before adding layers, we can examine the loss function, since there is more established knowledge about loss functions than about choosing an architecture.

We are trying to predict areas with a precision of at least 1e-5, so our loss function must give low values for high-precision predictions and high values for low-precision ones. The first functions we can think of are (a small sketch of their formulas follows the list):

  • Mean Absolute Error (MAE):
    • Uses the same scale as the data being measured (wiki); if all outputs and dataset values are small, MAE will be small in every condition.
    • Is the average of the absolute residuals. Good for datasets with outliers [2].
  • Mean Squared Error (MSE):
    • Penalizes larger absolute differences more heavily, so it is more sensitive to outliers, but not to noise or bad predictions [1].
    • Is the variance of the residuals [2].
  • Root Mean Squared Error (RMSE):
    • The standard deviation of the residuals [2].
    • A lower deviation means the model is more general [1].
  • Mean Relative Error (MRE):
    • Considers the residuals relative to the magnitude of the expected values.
    • Evaluates the accuracy of each prediction individually, therefore it is good for measuring the precision of all outputs.
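To make these definitions concrete, here is a small NumPy sketch of the four measures (the relative-error formula normalized by the expected value is my reading of MRE; TF's implementations may differ in details):

  import numpy as np

  def mae(y_true, y_pred):   # mean of the absolute residuals
      return np.mean(np.abs(y_true - y_pred))

  def mse(y_true, y_pred):   # mean of the squared residuals
      return np.mean((y_true - y_pred) ** 2)

  def rmse(y_true, y_pred):  # square root of the MSE
      return np.sqrt(mse(y_true, y_pred))

  def mre(y_true, y_pred):   # residuals relative to the magnitude of the expected value
      return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))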

By these definitions, MRE seems the most suitable choice for our loss function, which is the quantity to be minimized. The others can be computed as metrics:

  • model.compile(loss=relative_error, optimizer="adam", metrics=[tf.keras.metrics.RootMeanSquaredError(), "mse", "mae"])

OBS: There is a precompiled TF function for MRE, but it needs a predefined normalizer. The normalizer needed here is the expected area of each prediction, so it varies per sample, and we can't use the TF function.
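Instead, the custom relative_error loss used in the compile() call above can be written directly as a small TF function. A sketch (the small epsilon is my addition to avoid division by zero):

  import tensorflow as tf

  def relative_error(y_true, y_pred):
      # Residual divided by the expected value, averaged over the batch.
      return tf.reduce_mean(tf.abs(y_true - y_pred) / (tf.abs(y_true) + 1e-12))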


Adding Layers and Activation Functions

Next, we are going to add layers. Since our current model can't encode the input very well, we may increase the dimension of the hidden layer. Also, since the triangle area is limited between 0.01 and 0.5, we can add a sigmoid activation function in the output layer, so we ensure it is always positive (there is no real need to do this, so it is optional). A sketch of such a variant is shown below.
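The input shape of 6 values (the three 2D vertices) is my assumption about the dataset; relative_error is the custom loss sketched in the previous section:

  import tensorflow as tf

  model = tf.keras.Sequential([
      tf.keras.Input(shape=(6,)),                      # assumed input: three 2D vertices
      tf.keras.layers.Dense(32),                       # wider hidden layer, no activation yet
      tf.keras.layers.Dense(1, activation="sigmoid"),  # keeps the output in (0, 1), optional
  ])
  model.compile(loss=relative_error, optimizer="adam",
                metrics=[tf.keras.metrics.RootMeanSquaredError(), "mse", "mae"])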

I ran tests with one hidden layer of 12, 32 and 100 neurons for 50 epochs, which did not improve the predictions significantly. Adding more layers with the same number of neurons made the net converge faster to the lowest loss value (MSE = 0.005, or MRE = 1), but it did not improve any further. In fact, the hand-made accuracy computation I suggested for this problem goes to zero.

If we look at the outputs, their values are really close to each other (when using MSE as the loss function), as if the net had computed the mean area of the training data. We can presume that all the neurons are contributing, which is not helping. So it is time to set the activation functions.

Starting with the sigmoid function, we allow each neuron to produce an impact or not, since its output ranges from 0 to 1. However, the accuracy and the loss didn't improve for any of the neuron counts, so we can imagine that the neurons were already doing that.

Another activation is the hyperbolic tangent (tanh), which ranges from -1 to 1. It is said to be more suitable for multi-layer nets; I am not sure why, but it works. Looking at the table below, for 50 epochs, there are two good architectures:

TABLE
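For reference, a tanh-based architecture of this kind would be built like this (a sketch; the layer sizes here are placeholders, not the exact values from the table):

  import tensorflow as tf

  model = tf.keras.Sequential([
      tf.keras.Input(shape=(6,)),                      # assumed input: three 2D vertices
      tf.keras.layers.Dense(32, activation="tanh"),
      tf.keras.layers.Dense(32, activation="tanh"),
      tf.keras.layers.Dense(1, activation="sigmoid"),  # area stays between 0 and 1
  ])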




Deciding the Best Architecture

Now it's time to analyze our metrics.



[1]: Study Tonight

[2]: Medium

 

