Activation Functions 🔥
Have you ever wondered why we have to use activation functions in a neural network? Why are there so many types of activation functions available? Why does choosing the right one matter for model accuracy, and how does it affect hardware resource consumption?
Hello Everyone,
I hope all of you are doing well.
Before we talk about why activation functions are important, let’s make sure we understand what linear and non-linear functions are.
Linear Functions
Linear functions are very straightforward and follow a straight line when plotted on a graph. Simply put, they have a constant rate of change: as one variable changes, the other changes by a fixed amount.
An example of a linear function is y = 2x + 3, where the relationship between x and y is a straight line with a slope of 2 and a y-intercept of 3.
A linear activation function would essentially make the entire neural network a linear combination of its inputs.
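To see why, here's a minimal sketch (the layer sizes and random numbers are made up purely for illustration): two stacked layers with no activation in between collapse into a single, equivalent linear layer.

```python
import numpy as np

# Sketch: two "layers" with no activation in between collapse
# into one linear transformation (illustrative sizes and values).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # a batch of 4 inputs with 3 features

W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

two_layers = (x @ W1 + b1) @ W2 + b2         # layer 1 then layer 2, no activation
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)   # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))    # True: stacking gained us nothing
```

No matter how many such layers you stack, the result is still just one linear function of the input.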
Non-Linear Functions
Non-linear functions are more complex and don’t form a straight line on a graph. They have varying rates of change, which means the relationship between variables can be more complicated.
An example of a non-linear function:
Quadratic functions like y = x² form a U-shaped curve.
Non-linear activation functions introduce non-linearity into the neural network, which allows it to learn complex, non-linear relationships between inputs and outputs.
Let’s understand this with a simple example. Think of a neural network as a somewhat tricky function that takes some inputs and gives us some results. Imagine we have a bunch of data points shown as blue dots on a graph (given below), and we want a straight line to predict where the blue dots lie. But guess what? It’s pretty tough for that straight line to get most of the dots right, because it can’t reach the ones above or below it (look at the red straight line in the graph below).
Now, if we make the guessing machine a bit smarter and allow it to curve a little, like the green line in the graph, it fits our data much better than the red line.
I think this is the easiest example I can provide. I recommend you spend a bit of time looking at this graph and noticing the difference between the lines. This is the main idea behind adding non-linear functions, i.e. activation functions, to a neural network.
Used numpy.polyfit to generate the graph.
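The figure itself isn't reproduced here, but here's a rough sketch of how a similar comparison can be made with numpy.polyfit. The data points are made up, since the exact data behind the graph isn't shown.

```python
import numpy as np

# Illustrative data: noisy, clearly non-linear points.
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.size)

line_coeffs = np.polyfit(x, y, deg=1)    # best straight line (like the red line)
curve_coeffs = np.polyfit(x, y, deg=2)   # quadratic fit (like the green curve)

line_fit = np.polyval(line_coeffs, x)
curve_fit = np.polyval(curve_coeffs, x)

# The curve tracks the points far better than the straight line.
print("line MSE: ", np.mean((y - line_fit) ** 2))
print("curve MSE:", np.mean((y - curve_fit) ** 2))
```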
Sigmoid Function
The Sigmoid function has a distinctive S-shaped curve, which is why it’s called “sigmoid.” It takes a real number as input and returns an output between 0 and 1.
It’s defined by the formula: sigmoid(x) = 1 / (1 + e^(-x))
As x becomes more positive, the sigmoid approaches 1, and as x becomes more negative, it approaches 0.
In the graph above, you might have noticed that when the x-values go below about -5, the function’s output becomes approximately zero, and when they go above +5, the output becomes approximately one. This means that for x-values less than -5 or greater than +5, we can’t recover the exact value of x from the output: the function simply gives us roughly zero for everything from negative infinity to -5 and roughly one for everything from +5 to positive infinity. So, outside the range between -5 and +5, we lose information about the x-values once they pass through the function.
Another important point is that this function maps all negative inputs to values between 0.0 and 0.5, and all positive inputs to values between 0.5 and 1.0. The output is never negative, so there is no direct indication of the input’s sign in the output value; the best we can do is an educated guess by checking whether the output is below 0.5.
This might not seem like a big issue, but in complex neural networks it’s a significant limitation, because the restricted 0-to-1 range means you lose information about both the magnitude of the input x and its sign.
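Here's a tiny NumPy sketch of the sigmoid that makes the saturation easy to see; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Far from zero the output saturates: very different inputs produce
# almost the same output, which is the information loss described above.
for x in [-10.0, -5.0, -1.0, 0.0, 1.0, 5.0, 10.0]:
    print(f"x = {x:6.1f} -> sigmoid(x) = {sigmoid(x):.6f}")
```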
Why is this a significant limitation?
Let’s understand this with a very simple graph. The derivative of the sigmoid function is: sigmoid'(x) = sigmoid(x) · (1 − sigmoid(x))
Pay attention to this graph: the left axis is the y-axis (the derivative value) and the bottom one is the x-axis (the input).
During the training phase of a neural network, we mainly depend on the gradients of the loss with respect to the weights to decide how much each weight should change to minimize the loss. However, as we can see in the graph, for values from negative infinity to about -7.5 and from +7.5 to positive infinity, the derivative is approximately zero. This means there is an extremely small gradient, or effectively none at all. Consequently, the updates for those weights become almost zero, so they stop changing. This situation is known as the “vanishing gradient” problem: the gradient signal that is passed backward through the network becomes very close to zero, and when gradients are that small, the model learns very slowly.
The vanishing gradient problem becomes more pronounced when many sigmoid activations are stacked on top of each other, because the gradient shrinks further each time it is propagated backward through a layer.
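A small sketch of the derivative makes both points concrete; the sample inputs and the ten-layer loop are just illustrative numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is essentially zero once |x| is large ...
for x in [-10.0, -7.5, 0.0, 7.5, 10.0]:
    print(f"x = {x:6.1f} -> sigmoid'(x) = {sigmoid_grad(x):.6f}")

# ... and because backpropagation multiplies these factors layer by layer,
# stacking many sigmoids shrinks the signal quickly (here, x = 2 at every layer).
grad = 1.0
for layer in range(10):
    grad *= sigmoid_grad(2.0)
print("gradient after 10 sigmoid layers:", grad)
```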
Tanh (Hyperbolic tangent)
It’s similar to the sigmoid function but has a range between -1 and 1. Like the sigmoid, the tanh function has an S-shaped curve. It means that as the input gets larger, the output approaches 1, and as the input becomes more negative, the output approaches -1.
It’s defined by the formula: tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))
As we discussed, the sigmoid function loses the direct positive or negative sign of the input in its output value. Tanh mitigates this issue: for inputs from 0 down to negative infinity the output runs from 0 down to -1 (saturating near -1 once the input goes below about -2.5), and for inputs from 0 up to positive infinity the output runs from 0 up to 1 (saturating near +1 past about +2.5). In other words, it’s zero-centered: the output is centered around 0 when the input is close to 0, and its sign matches the sign of the input.
However, it also suffers from the “vanishing gradient” problem, similar to the sigmoid. As we can see in the graph above, the output really only changes between about -2.5 and +2.5.
Derivative of Tanh
The derivative of tanh is tanh'(x) = 1 − tanh²(x). The “vanishing gradient” problem shows up again: as we can see in the graph, for values from negative infinity to about -3.5 and from +3.5 to positive infinity, the derivative is approximately zero. This means there is an extremely small gradient, or effectively none at all, so the updates for those weights become almost zero and they stop changing.
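Here's a quick NumPy sketch of tanh and its derivative; the sample inputs are arbitrary.

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

# Zero-centered: negative inputs give negative outputs, positive give positive,
# but the gradient still vanishes once |x| grows past roughly 3.5.
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"x = {x:5.1f} -> tanh(x) = {np.tanh(x):+.4f}, tanh'(x) = {tanh_grad(x):.4f}")
```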
ReLU (Rectified Linear Unit)
ReLU is a simple yet widely used activation function.
ReLU(x) = max(0, x)
It takes a real number as input and produces an output. If the input is greater than zero, the ReLU function returns the same value. If the input is less than or equal to zero, ReLU returns zero as the output. It effectively “deactivates” negative values by turning them into zero. It is computationally efficient and easy to implement.
As we can observe in the graph provided above, to the right of x = 0 the output simply equals the input. However, when the input to the ReLU function is negative, the output is zero, and so is the gradient. A ReLU-based neuron is considered “dead” if it outputs zero for every input it receives. Once a neuron becomes “dead,” it remains inactive and no longer contributes to the learning process, making it essentially useless.
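A minimal NumPy sketch of ReLU; the sample inputs are arbitrary.

```python
import numpy as np

def relu(x):
    """ReLU: pass positive values through unchanged, clamp negatives to zero."""
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))   # negatives become 0, positives pass through: 0, 0, 0, 0.5, 3
```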
Derivative of ReLU
The formula for the derivative of ReLU is also very simple.
dy/dx = { 1 if x > 0, else 0 }
As we can see in the graph, although ReLU mitigates the vanishing gradient problem for positive inputs, it still suffers from it for negative inputs, where the gradient is exactly zero. This can slow down or hinder the training of deep networks, especially if a large portion of the neurons’ inputs end up negative.
However, to address some of these disadvantages, variations of ReLU have been developed, such as Leaky ReLU and Parametric ReLU (PReLU), which attempt to improve upon the limitations of the standard ReLU function.
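As a rough illustration (the slope of 0.01 below is a common default, not something fixed by the article), Leaky ReLU keeps a small, non-zero slope for negative inputs so the gradient never becomes exactly zero, and PReLU simply makes that slope a learnable parameter.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative inputs keep a small, non-zero slope (and gradient)."""
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))   # negatives are scaled down instead of being zeroed out
```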
Conclusion
As we’ve seen in this article, every activation function has its own advantages and disadvantages. Choosing the right activation function often depends on the specific problem and network architecture, and experimenting with different activations usually leads to better results.
Don’t forget to follow me :)
Cheers!!