Neural networks: the minimalist version

🌱 draft 🌱

Andrej Karpathy’s minimum viable artificial neural network - micrograd - inspired me to write up an MVO (minimally viable overview!) of neural networks.

For some context - micrograd1 excludes anything not absolutely required for a bare bones implementation of an artificial neural network (ANN). Minimalist to the max, even matrices2 and attention layers3 are omitted.

In that spirit, I wanted to write up the information I would have appreciated having at my fingertips when I initially encountered micrograd. Pretty fresh on my journey into working with machine learning models at the time, I had lingering questions about the motivation for specific design choices and seemingly essential constraints. Most of that information, admittedly, wasn’t actually essential to be able to implement micrograd. Understandably, then, the explanations I sought weren’t to be found in the example code. Nonetheless, they seemed important to follow up on to hone a deeper intuition.

Having chased down those details, I thought I’d share this nominally minimalist overview that aims to unearth the path from modeling the brain as a network through to the core implementation details of a minimally functional ANN, with a few asides addressing production-grade models. I hope you find it useful.

Thinking machines

ANNs emulate brains, and brains are complicated - so even this minimalist overview isn’t exactly short. To set the stage for what’s covered: we’ll start with the neuronal circuitry on which artificial neural networks (ANNs) are modeled, identify and interrogate essential requirements and constraints, and end by walking through a minimalist build of an ANN in the form of micrograd.

As a warm up to a Rust implementation, I wrote my own Python version of micrograd, which we’ll use to explore the components in detail. Despite its micro sizing, micrograd has all the essentials: nodes, edges, weights, biases, { input, hidden, output } layers, an activation function, a loss metric, a learning function. The code base has core functionality organized into just 2 files of less than 100 lines each. Snippets are referenced below, and the full code can be accessed on GitHub or found toward the end (Code).

If you’re interested in following a hand-held guide through building micrograd in full, join Andrej on YT where he develops micrograd from scratch, start to finish.

Let’s start by establishing a foundation for using physiological neural nets as a basis for designing ANNs.

Why neural nets?

A machine that has the ability to answer any random question might seem like it’s thinking - even when the answer isn’t entirely correct. This probably rings especially true when the path to answering that question doesn’t have an algorithmic solution. Where does one begin, though, on the design of a machine intended to execute general tasks? Those where the machine would most seem like it needs to actually think?

The machinery for human thought is concentrated in the brain, so it’s not too much of a stretch to consider exploring designs based on the brain. A simplified model of the brain’s neuronal networks seems a logical place to start.

In the early 1940s, researchers working on mathematical models of cognition developed connectionist networks, an early form of artificial neural network, merging statistics with physiologically-inspired structure and function. Eventually, reward-based feedback was added, emulating learning: the machine’s abilities could improve iteratively.

A machine that could improve was a huge step forward. Humans accumulate a wide variety of knowledge and develop a plethora of skills. In contrast to that, traditional computing is specialized, without capacity to develop general abilities. But ANNs work on a completely different foundation: the code doesn’t define exactly how to process an input to accomplish a target task. Rather, an ANN is trained to develop a range of abilities by using a learning algorithm which prompts the ANN to do better by making iterative, task-agnostic updates to its internal state during training on broad datasets. As it happens, if the right conditions are met, the breadth of those learned capabilities is not intrinsically limited4.

Brain basics

To oversimplify drastically: Neurons in the brain receive and modulate an input signal, then decide whether to propagate a modulated response. A neuron's axon and synapses deliver the outgoing signal to one or more downstream neurons. Generally, the theory is that each neuron subnetwork has a reward system that influences collective subnetwork behavior, and the subnetworks coordinate within an overarching network. That network-level adaptive behavior is key to how neural networks enable thinking and learning.

Figure 1. Detailed picture of a neuron (BruceBlaus, CC BY 3.0 https://creativecommons.org/licenses/by/3.0, via Wikimedia Commons).

/images/Blausen_0657_MultipolarNeuron.png

Modeling a network

To implement brain function in computer code, certain abstractions will be useful, even if they’re leaky. Recall that ANNs use a simplified model of the brain, represented as a network. So, what defines a network?

Generally, a network has individual entities (nodes, also called vertices) connected by channels (edges, or links). According to network theory, a network is characterised by both its topology and the attributes of its nodes and edges.

Figure 2. Network structure. Circles are nodes and straight lines are edges.

/images/network-schematic.png

Modeling a neural network

In a network model of the brain, the individual entities are neurons and the connection channels are synapses.

To model physiological neural networks, ANNs are designed to replicate neurons’ receiving, processing, and discrimination capabilities, as well as inter-neuron communication. These abilities are encoded in a multilayer perceptron (MLP) network that:

  1. mimics neurons’ ability to perceive a stimulus and respond
  2. mimics multiplexing of neuronal activity across neuron layers
  3. mimics learning via feedback

The general network schematic above is intentionally non-MLP-like to emphasize that MLP topology is quite specific: nodes within a layer are not connected to each other, and every node in a layer connects to each node in the next layer, as in Figure 3.

Figure 3. Basic MLP schematic, from Scikit Learn.

/images/mlp-scikitlearn.png

Functionally, each MLP neuron independently perceives an input and decides whether to propagate a corresponding output to the subsequent layer in the network. Separately, the neuron also participates in coordinated collective behavior.

Modeling learning

The coordinated network behavior of ANNs encodes learning over numerous training runs. Here’s the essence of an ANN’s learning process:

  1. Network performance is measured by evaluating forward-pass output relative to a goal.
  2. A learning algorithm maps that metric back to individual neurons.
  3. Neuron parameters adjust to collectively improve the network performance metric.
  4. Repeat.

Notice that the network output is evaluated against the goal. That evaluation metric goes on to prompt updates to individual neurons. Learning, therefore, relies both on the delivery of feedback to each neuron, and on parameter updates that benefit the network as a whole.

We’ll keep in mind that neurons need to manage two signal streams:

  1. (feedforward) signal carrying (a component of) information from the initial input, which the neuron modulates and decides whether to propagate
  2. (feedback) network performance information, which the neuron uses to update itself

By now it's clear that the design of the learning algorithm matters a great deal, so we'll look into that next.

Gradient descent

A familiar example of a non-learning machine is the automation that controls a set of traffic lights. The machine (hardware + software) that implements that automation has no generalized abilities: it can’t perceive or respond to any non-traffic-light-related query. Further, its static internals mean there’s no mechanism for self-improvement. Its capabilities can’t evolve; it doesn’t learn.

In contrast, neural networks are driven by a learning algorithm which we’ve already established will alter the state of the network during training. The machine learns to adjust its output with each training pass (aka epoch), and the output improves until a performance target is met or a maximum number of epochs is reached.

We can model that process as a standard feedback circuit.

Figure 4. Feedback circuit schematic from Wikipedia (by Intgr, own work, public domain, https://commons.wikimedia.org/w/index.php?curid=2836622).

/images/Ideal_feedback_model.png

In response to feedback about performance, the machine’s internal state updates are designed to collectively improve the network’s transfer function: the machine gets smarter.

Efficient learning

The network transfer function may be complex, but neuronal transfer functions are - by design - relatively simple. This isn’t actually a theoretical requirement, but it’s strongly motivated by the pragmatic consideration of computational efficiency. A model that takes excessively long to train won’t be useful.

Neurons are constructed with two stages of processing for their forward-pass (not feedback) inputs. The first stage, we’ll refer to as preprocessing - that’s where multiple inputs are combined (multiplexed). The second, we’ll refer to as activation.

A specific design constraint plays an outsized role in delivering this efficiency: preprocessing involves only linear transformations. A neuron's scalar parameters - one weight $w_i$ per input $x_i$, plus a bias $b$ - are the linear coefficients for the neuron's preprocessing function, expressed as:

$preproc_{out} = \sum_i ( w_i * x_i ) + b$

After preprocessing, a nonlinear activation function is applied before broadcasting the neuron’s output to the next layer of the MLP. (This satisfies the nonlinear processing requirement that enables the ANN to have broad capabilities.)

$neuron\_out = act_{nonlin} ( preproc_{out} )$

Although it’s not relevant at this stage, the activation function is typically designed to both discriminate (mimic a brain neuron’s activation threshold) and compress (constrain the output magnitude).
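
To make that concrete, here's a minimal sketch of a single neuron's forward pass, using made-up numbers (three inputs, tanh activation):

import math

x = [0.5, -1.2, 2.0]   # neuron inputs (made-up values)
w = [0.3, 0.8, -0.5]   # one weight per input
b = 0.1                # bias

preproc_out = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear preprocessing
neuron_out = math.tanh(preproc_out)                     # nonlinear activation
print(neuron_out)      # a value in (-1, 1)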

To get the desired efficiency boost, the learning algorithm will update just the (linear) preprocessing parameters. The (nonlinear) activation function won’t change during training.

Recall that the same-layer neurons in MLP networks behave independently of each other (they’re not connected). Putting together the layer-level independence and learning update linearity, the training process can use matrices to parallelize a significant portion of the computational lift required on each training pass.
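
To sketch what that looks like (this is not part of micrograd - just an illustration using NumPy), an entire layer's forward pass collapses into one matrix product:

import numpy as np

x = np.array([0.5, -1.2, 2.0])         # inputs to the layer
W = np.random.uniform(-1, 1, (4, 3))   # 4 neurons x 3 inputs, one row per neuron
b = np.random.uniform(-1, 1, 4)        # one bias per neuron

layer_out = np.tanh(W @ x + b)         # all 4 neurons evaluated at once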

In the case of micrograd, we’re not parallelizing anything (no matrices), but we’ll still be able to take advantage of the linearity condition: parameter updates will be covered by implementing just addition and multiplication, plus coverage for the (static) nonlinear function.

Micrograd learns via backpropagation

We’ve yet to define exactly how to compute node parameter updates. Let’s look at the essentials of the feedback cycle: we want to evaluate a forward pass output, use that information to drive a feedback signal that prompts neurons to update their (preprocessing) parameters such that, collectively, those changes make the network better.

It seems like it might be useful to break that process down somewhat. To do all of that within the MLP network model, we’ll need:

  1. a feedback circuit that connects network output to individual nodes
  2. a function to evaluate network performance
  3. a function to update node parameters

We’ll mainly focus on micrograd’s implementation of these elements, though we’ll also get a general sense of navigating the options for more full-fledged ANNs.

Intuitively, the feedback circuit could implement a reverse traversal of the input-output paths. (In Figure 3, we can see that the MLP network has many input-to-output paths.) A topological map of the nodes traversed on each feedforward path also serves to define feedback paths.

Next, the loss function. The goal is to quantify the difference between actual and expected outcomes. Guided by our minimalist principles, we'll use the root of the summed squared errors (the Euclidean distance, or L2 norm, between actual and expected outputs) as our metric:

$error = \sqrt{\sum_i (actual_i - expected_i)^2}$

Other loss functions may be equally valid, or even preferable. But this one is sufficient for micrograd.
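
In code, the metric is a one-liner; here's a sketch with made-up outputs and targets (the same computation appears in train.py below):

import math

actual = [0.8, -0.2, 0.5]   # network outputs (made-up values)
expected = [1, 0, 1]        # training targets

error = math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, expected)))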

Finally, the update function needs to tie that metric to individual neurons in some way. It’s not actually obvious that there’s a single update function that, when applied to all neurons, will improve network performance. But having a single function is important for computational efficiency, particularly for large models, so let’s see what we can come up with.

Let’s expand the neuron transfer function from the previous section:

$neuron\_out = act_{nonlin} ( \sum_i ( w_i * x_i ) + b )$

where the $w_i$ and $b$ are the neuron's scalar parameters.

We're only part way there. After the above processing, each neuron's output will propagate to its subsequent layer. According to the MLP model, it then either becomes a neuron input or it contributes to network output. So - either way - we've now established that all processing (aside from activation functions5) is linear.

We want to connect node-local gradients (slopes) to each node's influence on the network output after an update. Promisingly, linear functions are differentiable by definition. The nonlinear activation function is the only potential outlier - so, what if we set a requirement that it be differentiable?

Applying that single constraint means that partial derivatives can be computed from output back to input - along each connected path in the topological map, for each node. Now, consider the loss metric at the output, and that our goal is to minimize loss. Extending that concept to nodes, we’d want to follow local gradients toward a minimum - so, downward. That’s gradient descent: an intuitive way to propagate loss to update nodes proportionally to their influence on network output.
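
In symbols, each parameter $p$ is nudged against its local gradient, scaled by a step size (this is exactly what update_parameters does in nn.py below):

$p \leftarrow p - step * \partial{loss} / \partial{p}$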

To illustrate the implementation, let’s choose a path in the topological map that involves two hidden layers:

(input) $\boxed{x} \rightarrow \boxed{v} \rightarrow \boxed{u} \rightarrow \boxed{y}$ (output)

Taking the derivative of the output with respect to the input reveals partial derivatives for each node according to the chain rule from calculus:

$\partial{y}/\partial{x} = \partial{y}/\partial{u} * \partial{u}/\partial{v} * \partial{v}/\partial{x}$
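
As a quick numeric sanity check of the chain rule (with made-up functions, not micrograd), the product of local derivatives matches a finite-difference estimate of the end-to-end derivative:

# y = 2 * u, u = v**2, v = 3 * x, so dy/dx = 2 * 2v * 3 = 36x
def v(x): return 3 * x
def u(v_): return v_ ** 2
def y(x): return 2 * u(v(x))

x0 = 1.5
analytic = 2 * (2 * v(x0)) * 3                 # chain rule: dy/du * du/dv * dv/dx
h = 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)    # finite-difference estimate
print(analytic, numeric)                       # both ~= 54.0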

Backpropagation using gradient descent has been remarkably successful. So successful that, despite its inherent simplicity and potential pitfalls6, learning facilitated by this method plus the basic MLP structure of micrograd are retained in (far more complex) state of the art models.

Aside: Learning efficiently

Efficiency in terms of processing speed and compute resources matters a fair bit in practice, as training high quality models requires a good deal of both. Loss is linked back to each of (typically) billions of parameters, and then an update is applied to each. Now, multiply the resources required for one pass by the number of epochs in a successful training run - and see why, as a machine learning engineer, you'll be chasing down every efficiency hack you can get your hands on.

Aside: Model interpretability

Training rounds update the model’s many parameters, and the model’s overall transfer function after training may not have an obvious interpretation in the sense of reflecting a logical ’thought’ process. The goal of research on model interpretability is to characterise the (hidden) tactics employed by trained models.

State of the art ANNs

Researchers working on machine learning didn’t land on MLPs with gradient descent as the secret sauce right off the bat. Here’s a detailed timeline of machine learning according to Wikipedia, but we’ll skip straight to the present.

The announcement for the 2024 Nobel Prize in Physics includes a very readable overview of the development of ANNs, with a focus on the progression from Hopfield networks to restricted Boltzmann machines, which helped lay the foundations for today's mainstream machine learning models.

Many varieties of ANNs can perform high quality general purpose computation. From a theoretical perspective, the fundamental requirement for such broad capabilities is having at least one hidden layer (see the universal approximation theorem4 for neural networks7). In practice, though, a very large number of nodes and many training iterations on large and varied datasets are required.

For decades, the scale of available training data and compute power limited both research and adoption of ANNs. Eventually, massive datasets became available for training, and realistic training times for models of sufficient size and complexity were made possible by advances in GPUs. Progress was unblocked and rapid development brought ANNs into the mainstream, powering large language models (LLMs) such as ChatGPT, Claude, Llama, and Gemini. I've been enjoying experimenting with Google's NotebookLM, which grounds its base model in data you've shared (maybe a research paper or a thesis, but it needn't be research-y at all). The fun part is that the outcome is a podcast with two 'people' discussing your input data - with often surprisingly decent results.

Believe it or not, the component elements of the ANNs driving these advanced models are essentially no different from those in micrograd.

Exploring micrograd

Zooming back in from the broader context of brains as ANNs, let’s take a look at the implementation details of micrograd.

Looking at the file structure, micrograd is split into three files:

nn.py

  • Define the neural network’s structural components and implement functionality for each.

autograd.py

  • Define the update function. Set out how to backpropagate the learning signal to nodes.

train.py

  • Configure a training run. Define network structure; choose activation and reward/loss functions; declare number of epochs, step size, and tolerance/loss objective.

Summarizing the constraints and requirements laid out in earlier sections, a neuron-node will:

  1. process inputs independently of other same-layer nodes
  2. preprocess inputs using only linear8 transformations
  3. apply a nonlinear activation function to the preprocessed signal
  4. propagate that output signal to nodes in the next layer

At the layer level, independent processing means a loop over same-layer neurons can be used to evaluate their outputs, or matrices can be used to support concurrent evaluation (not implemented for micrograd).

nn.py: Build nodes

Finally! Let’s start by implementing our model neuron.

Initializing the neuron (__init__) sets the number of neuron inputs and assigns each input a random weight. The neuron’s bias is also set randomly. (During training, the network will identify parameters that optimize network performance. That’s the machine learning.) The activation function passed to the initialization function is assigned to the neuron.

Processing the neuron (__call__) involves linear preprocessing to set act, using coefficients mapped to the neuron’s w and b parameters. The nonlinear activation function is then selected and applied to act to generate the neuron output out, which is returned. Here, $relu$ and $tanh$ are included as activation function options.

A function for accessing a neuron’s parameters is provisioned.

In terms of options for the nonlinear activation function, it’s standard for the chosen function to discriminate (mimic threshold activation) and also compress the outgoing signal. The key requirement is that it’s a differentiable function.

class Neuron:
    def __init__(self, nin, activation):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))
        self.activation = activation
        
    def __call__(self, x):
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        out = act.relu() if self.activation == 'relu' else act.tanh()
        return out

    def parameters(self):
        return self.w + [self.b]        
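
As a quick usage sketch (with made-up inputs), calling the neuron defined above might look like:

n = Neuron(3, 'tanh')
out = n([1.0, -2.0, 0.5])  # plain floats get wrapped into Values by the arithmetic
print(out.data)            # a number in (-1, 1), thanks to tanh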

You might be wondering about the Value type. Looking ahead, we’ll need to pass node parameters along with other metadata to implement gradient descent during training. Value will support that; we’ll revisit it later.

nn.py: Compose nodes to build a network layer

To build one layer, collect the neurons that are processed at the same stage in the ANN.

Initializing the layer (__init__) creates nout neurons, each with nin inputs. This represents each current-layer neuron broadcasting to each of the next-layer neurons. The layer’s activation function is also set.

Processing the layer (__call__) involves evaluating each neuron output in the layer independently (inline loop). A list of inputs for the next layer's neurons is returned as outs; if the list contains just one element, the list structure is omitted and the element is returned on its own (outs[0]).

A function for returning the layer’s neuron parameters is provisioned.

class Layer:
    def __init__(self, nin, nout, activation):
        self.neurons = [Neuron(nin, activation) for _ in range(nout)]
            
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

nn.py: Stack layers into a network

Next, the layers are composed into an MLP network; the layers are ordered.

Initializing the MLP network (__init__) defines each layer. For each layer, the number of inputs (the size of the previous layer) and the number of neurons are set, as is the layer's activation function.

Processing the MLP network (__call__) involves processing each layer and returning the final output of the network.

A function for returning all of the network's parameters is provisioned.

Since the loss function is evaluated at the network level, gradient descent is managed from this abstraction layer. First, network-wide gradient reset is provisioned with zero_grad. Second, neuron parameter updates are provisioned, per the neuron’s grad parameter scaled by step.

Why afford control over how large a step to take along the gradient? Picture the loss function as an uneven landscape composed of node points, each with its own gradient. Ideally, the landscape has a primary valley to which all gradients point. The step size determines how fast each training iteration will take you downhill. With so many locally-different slopes, you can see how many small steps will more precisely get you to the valley while large steps carry a heightened risk of overshooting the mark. However, iterations tend to be slow for large models, and you’ll want to size the steps to not miss the target while also not taking forever getting there.

class MLP:
    def __init__(self, nin, nouts, activation, step):
        sz_io = [nin] + nouts
        self.layers = [Layer(sz_io[i], sz_io[i+1], activation) for i in range(len(nouts))]
        self.step = step
    
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
    
    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.
    
    def update_parameters(self):
        for p in self.parameters():
            p.data += self.step * -p.grad
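
As a usage sketch (mirroring the configuration in train.py below), composing and running a network looks like:

from nn import MLP

net = MLP(3, [4, 4, 1], 'tanh', 0.05)  # 3 inputs, two hidden layers of 4, 1 output
out = net([1.0, -2.0, 3.0])            # forward pass through all layers
print(out.data)                        # a single Value: the last layer has one neuron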

autograd.py: Backpropagation

Take a look at the MLP schematic and note the many distinct forward pathways through the network. To propagate the loss back through the network, we’ll need to reverse each forward path.

The backward function builds the topological map recursively, then propagates gradients through the nodes in reverse topological order.

def backward(self): 
  topo = []
  visited = set()

  def build_topo(v):
    if v not in visited:
      visited.add(v)
      for child in v._prev:
        build_topo(child)
      topo.append(v)      
  build_topo(self)

  self.grad = 1.0
  for node in reversed(topo):
      node._backward()

autograd.py: Gradient descent

To implement gradient descent, nodes need a data structure to store not only their weight and bias parameters, but also local gradient, position in the topological map, and origin operation. For this, the Value type is defined. Functions defined for this data type cover everything needed for forward passes and for implementing gradient descent. The following snippet reflects state initialization. There’s a fair bit of functionality to implement all the arithmetic functions (from scratch!) required to implement gradient descent. The details are in the full code below or on GitHub.

class Value:
  
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0
    self._backward = lambda: None
    self._prev = set(_children)
    self._op = _op
    self.label = label
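
To see what Value buys us, here's a tiny expression graph (this uses the full Value implementation from the Code section):

from autograd import Value

a = Value(2.0)
b = Value(-3.0)
c = a * b + b   # c.data == -9.0
c.backward()    # backpropagate from c to the leaves
print(a.grad)   # dc/da = b = -3.0
print(b.grad)   # dc/db = a + 1 = 3.0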

train.py: Compose a network and train it

This is, essentially, a script to train the model. A few notes on training parameters:

  • Step size: the scale of the adjustment at each node. The goal is to balance too small (updates too slowly) with too large (overshooting the target).
  • Number of epochs: train for enough iterations that the loss converges. Too few epochs risks underfitting; too many epochs on limited data risks overfitting, which effectively caps the ANN's capabilities, making them narrow.
  • Number of evals: retrain to get within tolerance.
    • Tolerance: quantify ‘satisfactory’ network performance, and use it to stop training
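
Putting those pieces together, the essential training loop looks like this (condensed from the full train.py below, with made-up data):

import math
from nn import MLP
from autograd import Value

net = MLP(3, [4, 4, 1], 'tanh', 0.05)
inputs = [[1.0, -2.0, 3.0], [0.5, 0.0, -1.5]]  # made-up training data
targets = [1, -1]

for epoch in range(100):
    guesses = [net(x) for x in inputs]         # forward pass
    loss = sum(((g - t) ** 2 for g, t in zip(guesses, targets)), Value(0))
    if math.sqrt(loss.data) / len(inputs) <= 0.05:
        break                                  # within tolerance; stop early
    net.zero_grad()                            # clear stale gradients
    loss.backward()                            # backpropagate loss to parameters
    net.update_parameters()                    # gradient descent step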

What’s missing from micrograd?

A minimalist implementation of an ANN works surprisingly well - but we can do much better. I’ll just mention a few topics that merit further attention.

Efficient processing

In reality, the sort of training that LLMs require isn't even possible without pulling out every stop on efficiency. At the very least, implement matrices to leverage linearity and independence. Incorporating PyTorch or Keras, along with supporting tools such as einsum and JAX, is essentially necessary to train a non-miniature ANN.

Tokenization

Tokenization groups raw input data into meaningful chunks (tokens) as a preprocessing step. Supplying the ANN with tokens, rather than the raw input as-is, gives it a more robust initial basis set to learn from.
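
As a toy illustration only (production tokenizers learn subword vocabularies, e.g. byte pair encoding), tokenizing a sentence might look like:

text = "neural networks learn iteratively"
tokens = text.split()  # naive whitespace tokenization
print(tokens)          # ['neural', 'networks', 'learn', 'iteratively']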

Attention

To the best of our current knowledge, it’s all you need3. The transformer architecture is one of the great successes in the past decade of advances in machine learning. It’s beyond the scope of this write up, but I strongly encourage you to follow up on it, and the original paper is a great start.

Model evaluation

A trained model might have satisfied the loss function tolerance, but does it perform well on non-training data? Every model should be thoroughly tested for performance, not only in terms of accuracy/correctness, but also in terms of alignment. A lot of folks swear by their own custom eval processes, but you might want to start with the cute and approachable forest friends eval guide.

Code

@ msyvr/micrograd-python: Copyright 2024 Monica Spisar, (License)

nn.py

from autograd import Value
import random


class Neuron:
    def __init__(self, nin, activation):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))
        self.activation = activation  
        
    def __call__(self, x):
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        out = act.relu() if self.activation == 'relu' else act.tanh()
        return out
    
    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout, activation):
        self.neurons = [Neuron(nin, activation) for _ in range(nout)]
            
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs
    
    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]
    
class MLP:
    def __init__(self, nin, nouts, activation, step):
        sz_io = [nin] + nouts # len(sz_io) = len(nouts) + 1
        self.layers = [Layer(sz_io[i], sz_io[i+1], activation) for i in range(len(nouts))]
        self.step = step
    
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
    
    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.
    
    def update_parameters(self):
        for p in self.parameters():
            p.data += self.step * -p.grad

autograd.py

import math as m

class Value:
  
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self.grad = 0.0 # nb: base case for nn_output: nn_output.grad = 1.0
    self._backward = lambda: None
    self._prev = set(_children)
    self._op = _op
    self.label = label
  
  def __repr__(self):
    return f"Value(data={self.data})"
  
  def __add__(self, other):
    other = other if isinstance (other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward

    return out
  
  def __mul__(self, other):
    other = other if isinstance (other, Value) else Value(other)        
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
      self.grad += other.data * out.grad
      other.grad += self.data * out.grad
    out._backward = _backward

    return out

  def __pow__(self, other):
    assert isinstance(other, (int, float)), "other must be int or float"
    out = Value(self.data ** other, (self, ), f'**{other}')

    def _backward():
      self.grad += (other * self.data ** (other - 1)) * out.grad
    out._backward = _backward
    
    return out

  def __rmul__(self, other):
    return self * other # would not be valid for matrix multiplication

  def __truediv__(self, other):
    return self * other**(-1) # delegate to __mul__ and __pow__ so the graph is preserved

  def __neg__(self): # -self
    return self * -1
  
  def __sub__(self, other):
    return self + (-other) # uses __neg__

  def relu(self):
    out = Value(0 if self.data < 0 else self.data, (self, ), 'ReLU')
    
    def _backward():
      self.grad += (out.data > 0) * out.grad      
    out._backward = _backward
    
    return out

  def tanh(self):
    x = self.data
    t = (m.exp(2*x) - 1)/(m.exp(2*x) + 1)
    out = Value(t, (self, ), 'tanh')

    def _backward():
      self.grad += (1 - t**2) * out.grad
    out._backward = _backward

    return out

  def exp(self):
    x = self.data
    out = Value(m.exp(x), (self,), 'exp')

    def _backward():
      self.grad += out.data * out.grad # because d(e^x)/dx = e^x
    out._backward = _backward

    return out

  def backward(self): 
    topo = []
    visited = set()

    def build_topo(v):
      if v not in visited:
        visited.add(v)
        for child in v._prev:
          build_topo(child)
        topo.append(v)      
    build_topo(self)
    
    self.grad = 1.0
    for node in reversed(topo):
        node._backward()

train.py

import random
import math
import matplotlib.pyplot as plt
from nn import MLP
from autograd import Value

def train():
    '''
    This NN is set up as a binary classifier. 
    Each input list maps to a single output. 
    The targets list length is, thus, the number of 
    inputs used to train the model.
    '''
    num_inputs = 5
    len_input = 3

    # Define the neural network    
    layer_nodes = [4, 4, 1]
    activation_function = 'tanh' # alternative: 'relu'
    
    # Gradient descent parameters
    step_size = 0.05
    num_epochs = 100
    # Option: loss_function
    tolerance = 0.05
    
    # Eval loops
    eval_loops = 10

    # Set up the run!
    print(f'step={step_size} : max epochs={num_epochs} : {tolerance=} : activation function={activation_function} : repeat?={eval_loops} \n')    
    epochs = []
    losses = []
    
    # Eval loops.    
    for _ in range(eval_loops):

        # Generate inputs and targets.
        inputs = []    
        for _ in range(num_inputs):
            new_input = [round(random.uniform(-3., 3.)) for _ in range(len_input)]
            inputs.append(new_input)  
        targets = [round(random.uniform(-1., 1.)) for _ in range(num_inputs)]

        # Generate neural net.        
        net = MLP(len_input, layer_nodes, activation_function, step_size)

        # Gradient descent loops.
        for epoch in range(num_epochs):
            
            # Forward pass.
            nn_guesses = [net(input) for input in inputs]
            
            # Evaluate loss.
            square_errors = [(nn_guess - target)**2 for nn_guess, target in zip(nn_guesses, targets)]
            summed_square_errors = Value(0)
            for se in square_errors:
                summed_square_errors += se
                
            # Break at performance metric or max epochs.
            metric = math.sqrt(summed_square_errors.data) / num_inputs
            if metric <= tolerance or epoch == num_epochs - 1:
                epochs.append(epoch + 1)
                losses.append(summed_square_errors)
                break
               
            # Backprop to get local gradient wrt each parameter: d(output)/d(parameter).
            net.zero_grad()
            summed_square_errors.backward()        
            # Update model weights and biases.
            net.update_parameters()
    
    for epoch, summed_square_errors in zip(epochs, losses):
        print(f'{epoch=} : loss={summed_square_errors.data}')
 
if __name__ == "__main__":
    train()

  1. For anyone looking for an introduction to building your own machine learning models - artificial neural networks, specifically - from the ground up, Andrej Karpathy live-codes a series of explanatory videos that make for a terrific intro. Andrej is a superb teacher in part because he doesn’t gloss over implementation details or make assumptions, despite his extensive experience and knowledge. He sweats the small stuff and, in doing so, ensures both that the material is approachable and also that others can ‘make it their own’. With basic programming skills in Python plus basic (high school) linear algebra and calculus, you, too, can build your own ANN - from scratch. ↩︎

  2. Matrices are essential to efficient computation in the context of large scale neural networks. They’re used to parallelize computations that are independent. Since they’re independent, they can be isolated from the matrix environment for analysis, as done here in the context of micrograd. ↩︎

  3. Attention Is All You Need ↩︎ ↩︎

  4. Universal Approximation Theorem ↩︎ ↩︎

  5. A final activation function - applied at the output layer - typically normalizes the output distribution such that its components can be interpreted as probabilities which sum to 1. Softmax is a common choice. ↩︎

  6. For example, picture that loss landscape again. Now, think about the node-local gradients and that the learning algorithm always pulls nodes downhill. You no doubt have intuited that a local gradient doesn’t necessarily point down to the global minimum of the loss landscape. And, yet, gradient descent generally works regardless. ↩︎

  7. Neural network theory, Philipp Christian Petersen 2022 ↩︎

  8. Neuron intrinsic behavior is modeled as linear for computational efficiency; researchers have experimented with nonlinear input response, but current models stick with a linear neuronal response. To model generalized functions, nonlinearity is typically introduced via nonlinear activation functions. ↩︎