Building a Neural Network from Scratch for Digit Recognition on MNIST — Part 2

21 min readJan 3, 2024

Ever wondered about the inner workings of everyday neural networks? Wonder no more! In part two of our series, we unravel the process, guiding you through building your own neural network from scratch (No Pytorch, No Numpy). Witness it transform into a proficient MNIST digit recognizer.

Hello everyone,

I’m Suraj Singh Bisht, and welcome back to my machine-learning journey. Before we continue, I assume you’ve already read the first part of this series, where I explained Gradient Descent and the core concept behind backpropagation. If you haven’t read it yet, it’s an important one. You can find it here:

Mastering Gradient Descent: Math, Python, and the Magic Behind Machine Learning

Implementing a Python class to auto-compute gradient values like PyTorch tensor

medium.com

Introduction

In this article, our goal is to develop and train a neural network using the MNIST dataset to recognize numeric digits from 28x28 pixel grayscale images. We won’t rely on external libraries like PyTorch or NumPy; instead, we’ll build everything from scratch and train it from the ground up.

The primary aim of this part is to gain a deep understanding of how a neural network functions. We’ll explore the hidden complexities within neural networks, the challenges associated with training them, and their inherent limitations. Additionally, we’ll delve into the mathematics of neural networks, activation functions, and other fundamental terminology essential for grasping the concept.

The article will follow this structured flow:

Understanding Neural Networks

What is a neural network?
Understanding neurons & layers
Introduction to tensors

2. Activation Functions

Explanation of activation functions
Their advantages and disadvantages
Introduction to the softmax layer

3. Loss Function

4. Backpropagation & Training

Avoiding Overfitting and Underfitting

5. Data Sourcing & Approach to Model Training

Recommended approach to train a model
Understanding Performance Limitations

6. Practical Implementation in Python

Loading a dataset
Creating a neural network from scratch
Training the model

This organized flow will help readers grasp the fundamentals of neural networks and walk them through building and training a neural network from scratch in Python.

Neural Networks

Jupyter notebook (**3xNeuron Input Layer**, **4xNeuron Hidden Layer**, **2xNeuron Output Layer**)

What is a neural network?

A neural network is a computational model inspired by the way the human brain processes information. It consists of interconnected nodes called neurons, organized into layers. Each neuron performs a simple computation, and these computations are layered to process and transform data.

In our case, we are aiming to build a neural network to recognize handwritten digits. we would have an input layer that takes pixel values of an image, a hidden layer(s) that processes this information, and an output layer that gives the predicted digit (0 to 9).

For example: If we have an image of the digit “3,” the input layer takes pixel values, the hidden layers process them, and the output layer predicts “3” as the result. Neurons in each layer have weights that determine the strength of their connections, and the network adjusts these weights during training to improve accuracy.

Understanding neurons & layers

In a neural network context, neurons in a neural network takes inputs, apply weights to these inputs, compute a weighted sum, pass it through an activation function, and produce an output. The weights are adjusted during training to enable the network to learn and make accurate predictions or classifications in various tasks. Neurons work together in layers to process and transform data, making neural networks capable of solving a wide range of problems, from image recognition to natural language understanding.

Neurons compute a weighted sum of their inputs. This is done by multiplying each input by its corresponding weight and summing up these products. Mathematically, this operation can be represented as follows:

Sum = (Input1 * Weight1) + (Input2 * Weight2) + … + (InputN * WeightN)

After the weighted sum is computed, an activation function is applied to it. The activation function introduces non-linearity into the neuron’s output, allowing the neural network to learn complex patterns and relationships in the data. Common activation functions include the sigmoid function, ReLU (Rectified Linear Unit), and tanh (hyperbolic tangent).

Let’s understand this by an example:

x1, x2, …, xn are the inputs to the neuron.
w1, w2,…,wn are the corresponding weights for each input.
b be the bias term.
z be the weighted sum of inputs plus the bias term.
a be the output of the neuron after applying the activation function.

The weighted sum, z, is calculated as follows:

z = [( x1 ⋅ w1) + (x2 ⋅ w2) + … + (xn⋅wn )]+b

Next, the output, a, is obtained by applying an activation function f to z:

a=f(z)

Common activation functions include the sigmoid function, ReLU (Rectified Linear Unit), and tanh. Each of these functions has a specific mathematical expression, but the general idea is that they introduce non-linearity into the neuron’s output.

The ReLU (Rectified Linear Unit) activation function is defined as

f(z)=max(0,z)

This concludes the core idea of a neural network. Please ensure that everything discussed above is theoretically understood.

Simple Python Code Example to Understand

import numpy as np

# Input data
input_data = np.array([2.0, 3.0, 1.0])

# Weights and bias (for simplicity, manually initialized)
weights = np.array([0.5, -0.2, 0.8])
bias = 0.1

# Calculate the weighted sum
weighted_sum = np.dot(input_data, weights) + bias

# Apply the ReLU activation function
def relu(x):
    return max(0, x)

output = relu(weighted_sum)

# Display the results
print("Weighted Sum:", weighted_sum)
print("Output after ReLU:", output)

Layers

Layers are the building blocks that organize and structure the flow of information processing. Neural networks typically consist of multiple layers, each with a specific role in transforming and learning from data.

Input Layer: To receive the raw data
Hidden Layer: Intermediate layers between the input and output layers.
Output Layer: The final layer of the neural network.

The depth and architecture of a neural network, including the number of hidden layers and neurons in each layer, can vary depending on the complexity of the task. Deeper networks with more layers can learn complex patterns and representations from data, while shallower networks may be suitable for simpler tasks.

Introduction to tensors

Tensors are a fundamental data structure for storing and manipulating numerical data efficiently. They are multi-dimensional arrays used to represent data in various machine learning and deep learning frameworks, including TensorFlow and PyTorch.

For example, using numpy in Python, we can represent a tensor matrix like this.

Tensors can have different numbers of dimensions, which determine their rank or order.

Scalar (Rank 0 Tensor): A scalar is a single numerical value, such as a single number like 5 or -2.7.
Vector (Rank 1 Tensor): A vector is a one-dimensional array of numbers. It represents a list of values, such as [1, 2, 3], where each value is indexed by its position in the array.
Matrix (Rank 2 Tensor): A matrix is a two-dimensional array of numbers arranged in rows and columns.
Higher-Rank Tensors: Tensors can have more than two dimensions. For example, a rank 3 tensor is a three-dimensional array, and so on. Higher-rank tensors are used to represent data with more complex structures, such as color images (rank 3).

Tensors offer a consistent way to represent data of various complexities and facilitate essential operations such as addition, multiplication, and division, as well as more intricate transformations like matrix multiplications. These operations can be executed efficiently on different hardware types like CPUs, GPUs, or TPUs while optimizing memory usage to minimize overhead.

In simple terms, tensors empower you to carry out computations on extensive datasets in parallel, greatly accelerating both training and inference in neural networks. Modern hardware, including GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units), are finely tuned for tensor operations. Frameworks like TensorFlow and PyTorch are designed to harness the power of these hardware accelerators for efficient tensor computations.

Furthermore, tensors play a crucial role in automatic differentiation, a fundamental component of gradient-based optimization algorithms. During the training of neural networks, gradients of the loss function concerning model parameters are computed through a process called backpropagation. Tensors store both the forward pass (computation) and backward pass (gradient) information, enabling efficient gradient calculations using the principles of the chain rule.

I’ve previously explained this concept in the first part of this series.

Activation Functions

Activation functions introduce non-linearity into the model. Without non-linearity, a neural network would be equivalent to a linear regression model, which has limited expressive power. Non-linearity enables neural networks to learn complex patterns and relationships in data. Neural networks with non-linear activation functions can approximate a wide range of complex functions, making them suitable for a variety of tasks, including image classification, natural language processing, and more.

The three most commonly used activation functions in neural networks: are the sigmoid function, the hyperbolic tangent (tanh) function, and the rectified linear unit (ReLU).

Sigmoid Function

Mathematical Representation:

It Produces outputs in the range (0, 1), mapping any real number input to a sigmoid-shaped curve. Sigmoid functions were historically popular for binary classification problems because they squish the output into a probability-like range.

Advantage: They have interpretable outputs, making it easy to understand the model’s confidence in a prediction.

Disadvantage: Suffers from the vanishing gradient problem. Outputs are not zero-centered, which can slow down convergence during training. plus, Computationally expensive compared to ReLU and its variants.

Hyperbolic Tangent (tanh) Function

Mathematical Representation:

tanh produces output values in the range (-1, 1), centered around zero. Smooth and continuous, similar to the sigmoid function but with a different range.

Advantages: Similar to the sigmoid function but less prone to the vanishing gradient problem.

Disadvantage: Still suffers from the vanishing gradient problem to some extent, plus computationally more expensive than ReLU.

Rectified Linear Unit (ReLU)

Mathematical Representation:

ReLU outputs the input value if it’s positive; otherwise, it outputs zero. Simple, and computationally efficient.

Advantage: Addresses the vanishing gradient problem by allowing non-zero gradients for positive inputs.

Disadvantages: Suffers from Dying ReLU Problem.

As you may have noticed, I’ve mentioned some terms in the section about advantages and disadvantages that haven’t been introduced yet. Let me provide explanations for these terms to ensure a clear understanding of both the advantages and disadvantages.

Vanishing Gradient Problem: The gradients during backpropagation become extremely small for deep networks. This can lead to slow or ineffective training.
Dying ReLU Problem: ReLU activation functions can lead to neurons becoming inactive if they consistently output zero. This can result in a loss of representational capacity in the network.
Non-zero-Centered Outputs: Sigmoid and Tanh functions do not produce zero-centered outputs, which can affect the convergence speed when using optimization algorithms that rely on gradients.

Introduction to the softmax layer

A softmax layer is used at the output stage, to convert raw scores or logits into a probability distribution over multiple classes. The main purpose of a softmax layer is to transform the raw output scores from the previous layer (often called logits) into probabilities. These probabilities represent the likelihood or confidence of the input belonging to each class. This transformation is essential for making meaningful predictions in multi-class classification tasks.

The softmax function is mathematically defined as follows:

For a given set of raw scores or logits z1, z2,…, zn, the softmax function computes the probability P(y=i) for each class i as follows:

Here’s what each component represents:

These probabilities represent the likelihood or confidence of the input belonging to each class. This transformation is essential for making meaningful predictions in multi-class classification tasks.

Example.

Let’s say the network’s logits for an input image are as follows:

Logit for “Cat”: 2.5
Logit for “Dog”: 1.0
Logit for “Fish”: 0.8

We want to use the softmax layer to calculate the probabilities of the input belonging to each class.

Using the softmax function, we can calculate the probabilities as follows:

Mathematical Calculation:

Result

Probability of Cat: 0.7113
Probability of Dog: 0.1587
Probability of Fish: 0.1299

The same example is in Python code using the NumPy library.

import numpy as np

# Raw logits for each class
logits = np.array([2.5, 1.0, 0.8])

# Calculate softmax probabilities
exp_logits = np.exp(logits)
softmax_probs = exp_logits / np.sum(exp_logits)

# Display the results
classes = ["Cat", "Dog", "Fish"]
for i, class_name in enumerate(classes):
    print(f"Probability of {class_name}: {softmax_probs[i]:.4f}")

Loss Function

A loss function, also known as a cost function, is a mathematical function that quantifies the dissimilarity between predicted values (output of a model) and actual target values (ground truth). During training, our goal is to minimize the loss of function. Optimization algorithms, such as gradient descent, adjust model parameters (weights and biases) to minimize the loss. A lower loss indicates a better-performing model on the given task.

The choice of a loss function depends on the specific problem and the nature of the data. Different loss functions are suitable for regression, classification, and other tasks.

But In this tutorial, I’m gonna focus on Mean Squared Error (MSE) only.

Mean Squared Error (MSE):

MSE is a common loss function used for regression problems. It measures the average squared difference between predicted values and actual target values.

Derivative of MSE:

To update model parameters during training, we need to calculate the derivative (gradient) of the MSE with respect to each parameter. For a simple linear regression model with parameters w (weights) and b (bias), the derivatives are calculated as follows:

Simple Example of MSE:

Let’s say we have a regression problem where we want to predict house prices. We have a dataset with three houses and their actual sale prices (Ytrue) and predicted sale prices (Ypred) as follows:

House 1: (Ytrue) = 300,000, (Ypred) = 280,000
House 2: (Ytrue) = 450,000, (Ypred) = 480,000
House 3: (Ytrue) = 200,000, (Ypred) = 220,000

Using the MSE formula, we can calculate the MSE for this dataset:

= 1/3 ((280000−300000)**2 + (480000−450000)**2 +(220000−200000) **2 )

= 566666666.6666666

So, the MSE for this dataset is approximately 566666666.666, indicating the average squared error between predicted and true house prices.

Same Example in Python Code :

import numpy as np

# True house prices
true_prices = np.array([300000, 450000, 200000])

# Predicted house prices
predicted_prices = np.array([280000, 480000, 220000])

# Calculate Mean Squared Error (MSE)
mse = np.mean((predicted_prices - true_prices)**2)

# Display the MSE
print("Mean Squared Error (MSE):", mse)

Backpropagation & Training

Backpropagation, short for “backward propagation of errors,” is a supervised learning algorithm that plays a central role in updating the model’s parameters (weights and biases) to minimize a defined loss function. It involves two main steps:

Forward Pass: Input data is fed into the neural network, and computations are performed layer by layer, propagating through the network to produce predictions or output values. These predictions are compared to the actual target values to compute the loss.
Backward Pass (Backpropagation): The gradients of the loss with respect to each model parameter are computed by applying the chain rule of calculus. These gradients indicate how much the loss would change with small adjustments to each parameter. The gradients are then used to update the model’s parameters through optimization techniques like gradient descent.

In simple terms, it effectively calculates how the loss function’s gradients change concerning each network parameter. These gradients act as guides for determining both the direction and extent of updates to the parameters. By having this gradient information, the network can adjust its parameters in a way that minimizes the loss. This continuous process of parameter updates enables the network to learn from data and enhance its performance gradually.

This topic has already been addressed in the earlier section of this series, so please ensure you have a solid grasp of it.

Overfitting

Overfitting occurs when a model learns the training data too well, capturing noise and random fluctuations in the data rather than the underlying patterns. As a result, the model performs exceptionally well on the training data but poorly on new, unseen data.

Underfitting

Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It fails to learn the training data effectively, resulting in poor performance both on the training data and new data.

To Avoid Overfitting and Underfitting

Collect more data that can help reduce overfitting, especially when the model complexity is high. Additional data provides a better representation of the true underlying patterns.
Examine learning curves (training and validation error over time) to detect signs of overfitting or underfitting. Adjust model complexity based on the curves.
Do cross-validation to assess model performance. Cross-validation helps estimate how well the model generalizes to unseen data and can detect overfitting or underfitting.

Data Sourcing & Approach to Model Training

There are numerous public datasets available for various machine-learning tasks. They offer a wide range of datasets for different domains. You can download these datasets for free and use them for your projects.

In some cases, you may need to purchase or license data from third-party providers. This is common for specialized datasets or proprietary data. Ensure that you have the right to use the data for your intended purposes.

Recommended approach to train a model

The recommended approach to model training depends on the specific problem, dataset, and resources available. However, there are some common steps and best practices to train models effectively:

Data Preprocessing: Preprocessing your data. This includes handling missing values, encoding categorical variables, and scaling/normalizing numerical features.
Data Splitting: Split your dataset into training, validation, and test sets. The training set is used to train the model, the validation set is used for hyperparameter tuning and model evaluation during training, and the test set is used for final model evaluation.
Loss Function Selection: Choose an appropriate loss function that aligns with your task. Common loss functions include Mean Squared Error (MSE) for regression and categorical cross-entropy for classification.

Understanding Performance Limitations

Performance limitations refer to the factors and challenges that can affect the performance and effectiveness of trained models. These limitations can arise from various sources and can impact the model’s ability to generalize well to new, unseen data. Here are a few common performance limitations:

Data Quality and Noise: Noisy or low-quality data can introduce errors into the training process, affecting the model’s performance. Data preprocessing and cleaning are crucial to address this limitation.
Model Complexity: Overly complex models, can lead to underfitting and long training times. Finding the right balance between model complexity and performance is essential.
Hardware-Resource Constraints: Limited computational resources, memory, or hardware can limit the size and complexity of models that can be used effectively.

Practical Implementation in Python

Let’s begin coding. First, import all the necessary libraries for our code.

import matplotlib.pyplot as plt # only required for visualization
import array
import struct
import random
import torch
import math
from datetime import datetime

DataHolder: This class serves as a utility for extracting, managing, and visualizing data from the MNIST dataset. It provides methods to load images and labels from binary files, retrieve image-label pairs, display individual images with labels using `matplotlib`, and initialize these datasets during object creation. The class simplifies the process of working with the MNIST dataset, making it easier to handle the dataset.

# Dataset : https://www.kaggle.com/code/hojjatk/read-mnist-dataset/notebook
class DataHolder:
    def __init__(self):
        self.image_dimension = (28, 28)
        self.train_img = self.extract_images("Dataset/train-images.idx3-ubyte")
        self.train_label = self.extract_labels("Dataset/train-labels.idx1-ubyte")
        self.test_img = self.extract_images("Dataset/t10k-images.idx3-ubyte")
        self.test_label = self.extract_labels("Dataset/t10k-labels.idx1-ubyte")
        
    def extract_labels(self, path):
        res = []
        with open(path, 'rb') as fp:
            # extract header
            _, size = struct.unpack("!II", fp.read(4*2))
            res = array.array("B", fp.read(size))
        return res
    
    def extract_images(self, path):
        res = []
        with open(path, 'rb') as fp:
            magic_code, size, rows, cols = struct.unpack("!IIII", fp.read(4*4))
            dim = [size, rows, cols]
            buffer_count = cols*rows
            image_data = array.array("B", fp.read())
            for index in range(size):
                raw_arr = image_data[index*buffer_count:(index*buffer_count)+buffer_count]
                res.append(raw_arr)
        return res
                
        
    def get_set(self, index, test=False):
        if test:
            return self.test_img[index].tolist(), self.test_label[index]
        
        return self.train_img[index].tolist(), self.train_label[index]
    
    def get_img(self, *args, **kwargs):
        img_arr, label = self.get_set(*args, **kwargs)
        image = []
        for w in range(28):
            image.append(img_arr[w*28:(w+1)*28])
        return image, label
    
    def prev_img(self, *args, **kwargs):
        image, label = self.get_img(*args, **kwargs)
        plt.imshow(image, cmap='gray')
        plt.axis('off') 
        plt.title(label)
        plt.show()
        return image, label

TensorValue: This class represents a mathematical tensor value with various mathematical operations and gradient calculation capabilities. I have already explained this piece of code in the previous part of this series. This class computes the gradient values for all the operations, allowing for backpropagation in a computational graph.

class TensorValue:    
    def __init__(self, num, ops= None, left=None, right=None):
        self.num = float(num)
        self.ops = ops
        self.left = left
        self.right = right
        self.grad = 0.0
        self.leaf = True
        
        if self.left or self.right:
            self.leaf = False
    
    def __add__(self, y):
        return self.__class__(self.num + getattr(y, "num", y), ops="+", left=self, right=y)
    
    def __radd__(self,  y):
        return self.__add__(y)
    
    def __sub__(self, y):   
        return self.__add__(-y)
    
    def __rsub__(self,  y):
        return self.__add__(-y)
    
    def __mul__(self, y):     
        return self.__class__(self.num * getattr(y,"num", y), ops="*", left=self, right=y)
    
    def __rmul__(self,  y):
        return self.__mul__(y)
    
    def relu(self):
        return self.__class__(max(0, self.num), ops="rl", left=self)
    
    def exp(self):
        return self.__class__(math.exp(self.num), ops="exp", left=self)

    def __neg__(self):
        return self.__mul__(-1)
    
    def __truediv__(self, y):        
        return self.__mul__(y**-1)
    
    def __rtruediv__(self,  y):
        return self.__mul__(y**-1)
    
    def __pow__(self, y):
        return self.__class__(self.num ** getattr(y, "num", y), ops="pow", left=self, right=y)
    
    def __repr__(self):
        return f"{self.__class__.__name__}({self.num})<{id(self)}>"
    
    
    def flush_gradient(self):
        if not self.leaf:
            self.grad = 0.0
        if isinstance(self.left, self.__class__):
            self.left.flush_gradient() 
        if isinstance(self.right, self.__class__):
            self.right.flush_gradient()
    
    def backward(self):
        self.grad = 1.0
        return self.calculate_gradient_backward()
        
    def calculate_gradient_backward(self,):
        r = getattr(self.right, "num", self.right)
        l = getattr(self.left, "num", self.left)
    
        # the derivative of f(x, y) = x + y with respect to x is simply 1
        if self.ops=="+":
            self.left.grad += (self.grad * 1.0)
            if isinstance(self.right, self.__class__):
                self.right.grad += (self.grad * 1.0)
                
        # the derivative of f(a, b) = a * b with respect to 'a' is 'b'.
        elif self.ops=="*":
            self.left.grad += (r * self.grad)
                
            if isinstance(self.right, self.__class__):
                self.right.grad += (l * self.grad)
        
        elif self.ops=="rl":
            self.left.grad += (int(self.num > 0) * self.grad)
        
        elif self.ops=="exp":
            self.left.grad += (self.num * self.grad)
                
        # the derivative of f(a, b) = a^b with respect to 'a' is 'b * a^(b-1)'
        elif self.ops=="pow":
            self.left.grad += ((r * (l ** (r - 1.0))) * self.grad)
                
            if isinstance(self.right, self.__class__):
                self.right.grad += ((l * (r ** (l - 1.0))) * self.grad)
        
    
        if isinstance(self.left, self.__class__):
            self.left.calculate_gradient_backward()
            self.left.flush_gradient()
            
        if isinstance(self.right, self.__class__):
            self.right.calculate_gradient_backward()
            self.right.flush_gradient()

Neuron Class: This class represents a single neuron. It is initialized with a list of weights, a bias, and an optional gradient. Each weight and the bias are represented as `TensorValue` objects. It provides methods to get the values of weights, bias, and the number of weights. The `activation` method applies a rectified linear unit (ReLU) activation function to a given value. The `feed` method computes the output of the neuron by summing the products of its weights and input values, applying the activation function, and adding the bias.
LinearLayer Class: This class represents a layer of neurons. It is initialized with a list of neurons and an optional label and provides methods to get information about the layer, including the label, the number of neurons, the number of weights per neuron, and the values of neurons. The `feed` method computes the outputs of all neurons in the layer by applying the `feed` method of each neuron to the input array.

These classes are designed to facilitate the creation and operation of individual neurons and layers in a neural network, making it easier to build and work for our tutorial.

class Neuron:
    def __init__(self, weights, bias=1, grad=0):
        self.data = [TensorValue(i) for i in weights]
        self.bias = TensorValue(bias)
        self.wcount = len(self.data)
        self.grad = grad
    
    def get_values(self):
        return {"data":[i.num for i in self.data], "bias":self.bias.num, "wcount":self.wcount}
    
    def activation(self, val):
        return val.relu()
    
    def feed(self, arr):
        
        return self.activation(sum([weight*num for weight, num in zip(self.data, arr)]) + self.bias)
        
    def __repr__(self):
        return f"N({self.wcount}xW.)"
    
    def __iter__(self):
        return iter(self.data)
    
    def __len__(self):
        return len(self.data)

class LinearLayer:
    def __init__(self, neurons, label="Layer"):
        self.label = label
        self.neurons = neurons 
        self.ncount = len(self.neurons)
        self.wcount = len(neurons[0].data) if neurons else 0
        self.results = []
        
        
    def get_values(self):
        return {
            "label":self.label,
            "ncount":self.ncount,
            "wcount":self.wcount,
            "neurons":[neuron.get_values() for neuron in self.neurons]
        }
    
    def feed(self, arr, rcount=None, ccount=None):
        return [neuron.feed(arr) for neuron in self.neurons]
    
    def __repr__(self):
        return f"{self.label}({self.ncount}x{self.wcount})"
    
    def __len__(self):
        return self.ncount

BasicNet class: is a simple neural network implementation. It’s initialized with a list of layers (layers). The pre_feed_hook and post_feed_hook attributes to functions that preprocess and post-process input data, respectively.

The `feed` method passes input data through the network, applying any pre and post-processing hooks and passing the data through each layer.
The `get_parameters` method yields all the weights and biases in the network.
The `softmax` method computes the softmax probabilities of a list of tensor values.
The `predict` method runs input data through the network and returns a dictionary of class probabilities and the predicted class.
The `get_loss` method calculates the mean squared error loss and predicted class given an input image and its correct label.
The `save` method serializes the network’s layer and parameter information to a JSON file.
The `load` method reads layer and parameter information from a JSON file and reconstructs the network.

class BasicNet:
    def __init__(self, layers, *args, **kwargs):
        self.layers = layers
        self.args = args
        self.kwargs = kwargs
        self.pre_feed_hook = None
        self.post_feed_hook = None
        
                
    def __repr__(self):
        o = ['input(*)']
        o += [repr(i) for i in self.layers]
        return ' -> '.join(o)
        
    def feed(self, arr):
        if self.pre_feed_hook:
            arr = self.pre_feed_hook(arr)
        
        for layer in self.layers:
            arr = layer.feed(arr)
        
        if self.post_feed_hook:
            arr = self.post_feed_hook(arr)
        
        return arr
    
    def get_parameters(self):
        for layer in self.layers:
            for neuron in layer.neurons:
                for weight in neuron.data:
                    yield weight
                yield neuron.bias
    
    def softmax(self, tvals):
        exp_logits = [val.exp() for val in tvals]
        sum_exp_logits = sum(exp_logits)
        softmax_probs = [exp_logit / sum_exp_logits for exp_logit in exp_logits]
        return softmax_probs
    
    def predict(self, *args, **kwargs):
        arr = self.feed(*args, **kwargs)
        c = dict(zip(range(9), [i.num for i in arr]))
        return c, max(c, key=c.get)
    
    def get_loss(self, input_img, label):
        target = [0]*10
        # our expectations
        target[label]=1
        # prediction
        arr = self.feed(input_img)
        # calculating mse
        loss = sum([(x-y)**2 for x,y in zip(arr, target)])/len(target)
        c = dict(zip(range(9), [i.num for i in arr]))
        return loss, max(c, key=c.get)
    
    def save(self, filename):
        import json
        layers_data = [d.get_values() for d in self.layers]
        with open(filename, "w") as fp:
            json.dump(layers_data, fp)

    def load(self, filename):
        import json
        with open(filename, "r") as fp:
            dump = json.load(fp)
            self.layers = []
            for layer in dump:
                l = LinearLayer([])
                l.label = layer['label']
                l.ncount = layer['ncount']
                l.wcount = layer['wcount']
                l.neurons = []
                for ndata in layer['neurons']:
                    o = Neuron([])
                    o.data = [TensorValue(i) for i in ndata['data']]
                    o.bias = TensorValue(ndata['bias'])
                    o.wcount = ndata['wcount']
                    l.neurons.append(o)

                self.layers.append(l)

Time to set up the neural network and a dataset for further use:

An instance of the `BasicNet` class is created and assigned to the variable `bent`. This neural network is composed of two layers: a hidden layer and an output layer. Each layer is defined using `LinearLayer` objects, and within each `LinearLayer`, there are 10 neurons created using list comprehensions. The weights of these neurons are randomly initialized within the range [-0.1, 0.1].

pre_feed_hook: The lambda function divides each pixel value into the input data by 255, which normalizes the pixel values to the range [0, 1].

post_feed_hook: After the input data passes through the neural network, the `softmax` function will be applied to the output to obtain class probabilities.

# dataset handler object
dataset_obj = DataHolder()

# 728 * 10 * 10
bnet = BasicNet([
    LinearLayer([
        Neuron(random.uniform(-0.1, 0.1) for _ in range(728)) for _ in range(10)
    ], label="hidden"),
    
    LinearLayer([
        Neuron(random.uniform(-0.1, 0.1) for _ in range(10)) for _ in range(10)
    ], label="output")
    
])
# pre-process input data
bnet.pre_feed_hook = lambda arr: [v/255 for v in arr]
# post-process input data
bnet.post_feed_hook = bnet.softmax

Up to this point, if you’ve been diligently following all the steps, you’ll have a basic neural network prepared for training and predicting outcomes for 28x28 pixel images. Building everything from the ground up without relying on external libraries.

Now, you can use the neural network to make predictions by giving it some data. Of course, the network isn’t trained yet, so it will not give accurate predictions right now.

In the next step, we’ll train the network using the MNIST dataset to make it better at making predictions. I’ll provide you with a piece of code that you can run on your computer for 2–3 days. As time goes on, you’ll notice the network getting better at its predictions. After 2 days, it’s a good idea to make small adjustments to how the network learns to avoid making it change too much, which could be harmful. There are many methods and tips you can try to improve the network’s accuracy. I encourage you to explore and experiment with them. If you discover anything helpful, please share your findings in the comments section.

range_numbers = list(range(59990))
random.shuffle(range_numbers)

for iternum, datasetIndex in enumerate(range_numbers):
    learning_rate = random.randint(50, 110) * 0.01
    im, ll = dataset_obj.get_set(datasetIndex)
    (pre_loss, predict), actual = bnet.get_loss(im, ll), ll
    if iternum > 2000 and int(predict)==int(actual):
        print(f"{datetime.now()}; {iternum}; DataIndex:{datasetIndex}; PreLoss:{round(pre_loss.num, 8)}; Prediction:{predict}; Actual:{actual}; Pass;")
        continue
    # calculating backward gradient
    pre_loss.backward()
    for w in bnet.get_parameters():
        w.num += (w.grad * learning_rate * -1)
        w.grad = 0 # set zero
    (loss, predict), _ = bnet.get_loss(im, ll), ll
    t = datetime.now()
    print(f"{datetime.now()}; {iternum}; DataIndex:{datasetIndex}; PreLoss:{round(pre_loss.num, 8)}; NowLoss:{round(loss.num, 8)}; Prediction:{predict}; Actual:{actual}; Rate:{learning_rate};")
    bnet.save(f"{t.date()}_{t.time().hour}")
    bnet.save("t1.wt")
    gc.collect()

In my situation, I noticed that computing the backward pass from a single forward pass took approximately 3 minutes when using CPython. As an alternative, I attempted to run the same code in PyPy3, and it only took 25 seconds. Consequently, I conducted my training in PyPy3 and allowed the script to run continuously for 2 days. After this period, I observed that it could correctly predict digits in simple cases, but there were still inaccuracies in some instances.

Therefore, as of the time when I’m writing this article, I’m sharing the currently saved weights with you all. This way, you can start training from where I left off, saving you the time it took me to reach this point in training.

You can find Python codes and weights here :

https://github.com/surajsinghbisht054/ML-Learnings/tree/main/MNIST_from_scratch

I hope you liked this post.

You can use my code as you wish, but I’d appreciate it if you could mention my name, Suraj Singh Bisht (github.com/surajsinghbisht054), in the credits section if you do.

That’s all for today. Please don’t forget to provide your thoughts, and I wish you a great day! Also, stay tuned for the next part!

Building a Neural Network from Scratch for Digit Recognition on MNIST — Part 2

Mastering Gradient Descent: Math, Python, and the Magic Behind Machine Learning

Implementing a Python class to auto-compute gradient values like PyTorch tensor

Introduction

The article will follow this structured flow:

Neural Networks

What is a neural network?

Understanding neurons & layers

Layers

Introduction to tensors

Activation Functions

Sigmoid Function

Hyperbolic Tangent (tanh) Function

Rectified Linear Unit (ReLU)

Introduction to the softmax layer

Loss Function

Backpropagation & Training

Overfitting

Underfitting

To Avoid Overfitting and Underfitting

Data Sourcing & Approach to Model Training

Recommended approach to train a model

Understanding Performance Limitations

Practical Implementation in Python

Written by Suraj Singh Bisht

No responses yet