  • Aditi Chegu

Recognizing Badly Drawn Dragons: How Do Neural Networks Work?

I take my own ability to guess that a drawing is of a dragon, out of more than 300 possible subjects, quite lightly. But when machines do it, I am blown away, because someone had to rigorously break down the process of understanding images and reconstruct it for machines to be able to do the same.


In 2016, Google created “Quick, Draw!”, a game that requires its neural network to guess the subject of its players’ drawings. Similar to our brains, this neural network takes inputs from the image and passes signals on to the neurons in its network. The information passes through the network, after which a final prediction is given. In this case, the prediction was that my drawing was of a dragon.


Neural networks are made up of layers of neurons. The neurons in different layers are connected by “weights”, parameters that define how strongly each input should count when it is passed on. Each neuron also has a “bias”, which shifts the threshold past which the neuron is activated. Together, these layers of neurons are responsible for transforming the information received from the input into the required output.
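As a rough sketch of what a single neuron does with its weights and bias, here is a small Python example. The sigmoid activation, the function names and all the numbers are my own assumptions for illustration, not what “Quick, Draw!” actually uses.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical inputs (say, three pixel intensities) and weights.
inputs = [0.5, 0.1, 0.9]
weights = [0.8, -0.4, 0.3]

# A lower bias makes the neuron harder to activate, a higher one easier.
reluctant = neuron_output(inputs, weights, bias=-2.0)
eager = neuron_output(inputs, weights, bias=2.0)
print(reluctant, eager)
```

With the same inputs and weights, only the bias decides whether this neuron ends up below or above the halfway mark of 0.5.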


When a neural network is first created, it has no information about the task at hand. So, it assigns arbitrary values to all the weights and biases. That is a recipe for chaos, to say the least. With this random assignment of values and lack of training, the output of the network for a given task is usually garbage. But, much like students learn from their mistakes in exam reviews and correct what they do wrong, neural networks can also improve their performance by tweaking their weights and biases to provide more accurate results next time.


"This sucks, try harder next time" is too vague as advice for improvement, so we first have to assign a concrete value to the quality of the network's performance. This value should quantify the error between the expected prediction and the actual prediction for a training example. Essentially, it is the “loss” of the neural network for that particular training example. For example, if given a drawing of a dragon, a bad neural network might return a 0.4 probability for the image being that of a dragon and a 0.3 probability each for it being a fish or a bird. To quantify this inaccuracy, the loss can be written as the sum of the squares of the differences between the expected probabilities and the actual probabilities.

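To make the dragon example concrete, here is a minimal sketch of that sum-of-squares loss. I am assuming the expected probabilities are 1 for dragon and 0 for fish and bird; the function name is made up for illustration.

```python
def squared_error(expected, predicted):
    """Sum of the squared differences between expected and predicted values."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted))

# Order of the classes: dragon, fish, bird.
expected = [1.0, 0.0, 0.0]   # assumption: the drawing really is a dragon
predicted = [0.4, 0.3, 0.3]  # the bad network's guess from the text

loss = squared_error(expected, predicted)
print(loss)  # approximately 0.54, i.e. 0.6^2 + 0.3^2 + 0.3^2
```

A perfect prediction of 1 for dragon and 0 for everything else would give a loss of 0.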
After the neural network has been run on enough training data, the losses for the individual examples can be collected. Each loss shows how poorly the network performs on a specific example given its parameters, the weights and biases. The average of all these losses therefore expresses the performance of the network as a whole. A high average loss means that the network is inaccurate, whereas a low one means that it is accurate. So, to improve the accuracy of the network, the average loss needs to be minimized.
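Averaging the losses could look something like this. The three training examples are invented for illustration, and the helper names are my own.

```python
def squared_error(expected, predicted):
    """Sum of the squared differences between expected and predicted values."""
    return sum((e - p) ** 2 for e, p in zip(expected, predicted))

def average_loss(examples):
    """Mean of the per-example losses over (expected, predicted) pairs."""
    return sum(squared_error(e, p) for e, p in examples) / len(examples)

# Hypothetical (expected, predicted) pairs; classes are dragon, fish, bird.
examples = [
    ([1.0, 0.0, 0.0], [0.4, 0.3, 0.3]),  # unsure about a dragon
    ([0.0, 1.0, 0.0], [0.1, 0.8, 0.1]),  # fairly confident about a fish
    ([0.0, 0.0, 1.0], [0.0, 0.0, 1.0]),  # spot on about a bird
]

mean_loss = average_loss(examples)
print(mean_loss)  # approximately 0.2
```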


Because we know that the losses are dependent on weights and biases, we use them to help us reduce the loss. To do this, we start by defining some function which returns the average loss when all the values of the weights and biases are taken as inputs. Then, we have to change the values of these parameters to minimize the loss as much as possible.


To start, suppose there is only one parameter as the input, a weight, and one output, the loss. The loss function then tells us the loss for each value of the weight. Plotted as a graph, the loss is on the y-axis and the weight is on the x-axis.



We are at a certain point on this loss function depending on the arbitrary weight assigned, and to minimize the loss, that point should be moved closer to a local minimum. To move the point downwards, we need to know the direction and magnitude by which it should be moved. Both come from the gradient (slope) of the function at that particular point. If the gradient is negative, the loss falls as the weight increases, so the point should be moved to the right. Similarly, if the gradient is positive, the point should be moved to the left. The steeper the gradient, the bigger the step. Repeating this process to find a local minimum of the loss function is called “gradient descent”.
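Here is a minimal sketch of gradient descent with a single weight. The loss function (a simple parabola with its minimum at a weight of 3), the starting weight and the step size, often called the learning rate, are all assumptions chosen to keep the example small.

```python
def loss(w):
    """Hypothetical one-parameter loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the loss above with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting weight
learning_rate = 0.1  # scales how far each step moves

first_gradient = gradient(w)  # negative here, so the first step moves right

for _ in range(100):
    # Step against the gradient: negative gradient -> w grows, positive -> w shrinks.
    w -= learning_rate * gradient(w)

print(w, loss(w))  # w ends up very close to 3, the loss very close to 0
```

Subtracting the gradient handles both cases from the paragraph above in one line, and the step automatically shrinks as the curve flattens near the minimum.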

In this example, there is only one input that varies the average loss. In reality, however, millions of inputs influence it. Graphing them, with one axis for each parameter and another for the output, would require a graph with millions of dimensions, which is slightly inconvenient. However, the process of gradient descent can still be executed by using some rather clever calculus.


As I said before, each parameter in the loss function needs to be changed in some way to edge towards the least possible loss, and for that we need the direction and magnitude of the change. Loss functions often have millions of parameters (variables) as inputs. So, to find the direction and magnitude of change for one parameter, we take the derivative of the loss function with respect to that one parameter while keeping all the others constant. By doing this for every parameter, we are finding the “partial derivatives” of the loss function.
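To show what "keeping all the others constant" means, here is a small sketch that estimates partial derivatives numerically, by nudging one parameter at a time. The two-parameter loss and the helper names are invented for illustration, and this nudging approach is only a stand-in to demonstrate the idea.

```python
def loss(w, b):
    """Hypothetical loss with two parameters: one weight and one bias."""
    return (2.0 * w + b - 5.0) ** 2

def partial_w(w, b, h=1e-5):
    """Nudge only w, keep b fixed, and measure how the loss changes."""
    return (loss(w + h, b) - loss(w - h, b)) / (2 * h)

def partial_b(w, b, h=1e-5):
    """Nudge only b, keep w fixed, and measure how the loss changes."""
    return (loss(w, b + h) - loss(w, b - h)) / (2 * h)

dw = partial_w(1.0, 1.0)  # close to -8
db = partial_b(1.0, 1.0)  # close to -4
print(dw, db)
```

Notice that each partial derivative here costs two evaluations of the loss, so the work grows with the number of parameters.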


Done manually, or even with a computer, for that matter, calculating all the partial derivatives one at a time with the process we learn in high school would be the equivalent of a death sentence. There is simply too much to compute.


To make this less cumbersome, some exploitation of mathematics is necessary.


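Differentiation can be an extremely mechanical process once the basic rules are set in place, and that is game changing. By using a more mechanical method, “automatic differentiation”, we can calculate the partial derivatives exactly (up to floating-point rounding). In its reverse mode, the cost of getting all of them is only a small multiple of the cost of evaluating the loss once, however many input parameters there are. The other ways of differentiating (numeric, symbolic, manual) tend to become either inexact or unwieldy at this scale.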


In automatic differentiation, the loss function is deconstructed to give its constituent operations and functions. This deconstruction can be represented as a “computational graph” where all the nodes represent operations and functions and the edges show how the nodes are related.

The process is made of two phases: a forward pass and a backward pass. In the forward pass, the function’s inputs are entered into the graph and the value at each node is evaluated. In the backward pass, the chain rule is applied node by node, from the output back to the inputs, to piece together the derivative of the output with respect to every parameter.
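The two passes can be sketched in a few dozen lines of Python. This is a toy version under my own assumptions (a hypothetical `Node` class that only supports addition, subtraction and multiplication), not how any real library implements it.

```python
class Node:
    """One node of the computational graph: a value plus its gradient."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Pairs of (parent node, local derivative of this node w.r.t. that parent).
        self.parents = parents

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Backward pass: apply the chain rule from the output back to the inputs."""
    order, seen = [], set()

    def visit(node):
        # Depth-first walk so that every node comes after its parents.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local_derivative in node.parents:
            parent.grad += local_derivative * node.grad


# Forward pass: building the expression evaluates the value at every node.
w, x, b, y = Node(2.0), Node(3.0), Node(1.0), Node(10.0)
error = w * x + b - y   # prediction minus target: 7 - 10 = -3
loss = error * error    # squared error: 9

backward(loss)
print(loss.value, w.grad, b.grad)  # 9.0 -18.0 -6.0
```

One backward pass fills in the gradient for every input at once, which is what makes this approach attractive when there are many parameters.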


From the partial derivatives, we now know the gradient of the loss with respect to each parameter. For parameters with large gradients, small changes can cause huge decreases in the loss, and for parameters with small gradients, small changes cause only small decreases. Making these adjustments to the weights and biases, each in the direction opposite to its gradient, results in a reduced loss and an increase in accuracy.
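Putting the pieces together, a tiny training loop might look like the sketch below. Everything in it is an assumption made for illustration: a "network" with just one weight and one bias, made-up data following y = 2x + 1, a fixed random seed and a hand-picked learning rate. The gradients are written out by hand here to keep the example short.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def average_loss(w, b):
    """Mean squared error of the predictions w * x + b over the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(w, b):
    """Partial derivatives of the average loss with respect to w and b."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

# Arbitrary starting values, as for a freshly created network.
w = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)
initial_loss = average_loss(w, b)

learning_rate = 0.05
for _ in range(2000):
    dw, db = gradients(w, b)
    # Each parameter moves against its own gradient, scaled by the learning rate.
    w -= learning_rate * dw
    b -= learning_rate * db

final_loss = average_loss(w, b)
print(initial_loss, final_loss, w, b)
```

After training, the weight and bias should have moved from their random starting values to roughly 2 and 1, and the average loss should be far smaller than it was at the start.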


Because the direction and magnitude of each gradient descent step can be calculated with great efficiency using processes like automatic differentiation, the neural network becomes better at distinguishing dragons from all the other drawings, with greater certainty and much faster.


I've always found the idea of emulating how we think with mathematical rigour, especially through gradient descent, and using it in tasks like data sorting and generating automatic responses, really beautiful. It is a reminder that mathematics models the world around us and improves our ability to cope with it.

