Tools for building neural nets
PyTorch
How does torch.compile speed up a transformer? | Adam Casson
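A minimal usage sketch (PyTorch ≥ 2.0): torch.compile traces the model and fuses operations into optimized kernels, so repeated forward passes skip most Python overhead. The toy encoder here is illustrative, not the article's benchmark setup.

```python
import torch
import torch.nn as nn

# torch.compile wraps a model; the first call triggers compilation,
# later calls reuse the compiled graph.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
compiled = torch.compile(model)

x = torch.randn(8, 16, 64)  # (batch, sequence, d_model)
out = compiled(x)           # first call compiles, then runs
```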
JAX
JAX: High performance array computing — JAX documentation
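A minimal sketch of JAX's core idiom: write a pure function, then compose jax.grad (autodiff) with jax.jit (XLA compilation). The loss and shapes here are illustrative.

```python
import jax
import jax.numpy as jnp

# A pure loss function: no side effects, arrays in, scalar out.
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

grad_loss = jax.jit(jax.grad(loss))  # compiled gradient w.r.t. w

w = jnp.zeros(3)
x = jnp.ones((5, 3))
y = jnp.ones(5)
print(grad_loss(w, x, y))
```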
Einops
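A minimal sketch of einops' named-axis notation, here splitting an embedding into attention heads; the shapes are illustrative.

```python
import torch
from einops import rearrange

# Reshapes written as named-axis patterns instead of view/permute chains.
x = torch.randn(8, 16, 64)                      # (batch, seq, dim)
heads = rearrange(x, 'b s (h d) -> b h s d', h=4)
print(heads.shape)                              # torch.Size([8, 4, 16, 16])
```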
Models
Google AI Models - Gemini, open, industry-specific
Datasets
Chatbots
Gemini - chat to supercharge your ideas
Training data
For supervised machine learning, the training data is labeled.
For a generative general-purpose language model, a typical training set might be on the order of 10 TB of text scraped from the internet; the goal is for the training text to approximately represent general knowledge.
A smaller set of domain-specific training data can then be used to fine-tune a general-purpose model previously trained on the larger, more general dataset.
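A minimal PyTorch sketch of that fine-tuning step. The "pretrained" model and the random domain dataset are stand-ins; in practice the weights come from the large general pre-training run, and here the pretrained body is frozen so only the final layer is retrained.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
# model.load_state_dict(torch.load('base_model.pt'))  # hypothetical checkpoint

for p in model[:-1].parameters():   # freeze everything but the last layer
    p.requires_grad = False

opt = torch.optim.AdamW(model[-1].parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# small domain-specific dataset (random stand-in data)
domain_loader = DataLoader(
    TensorDataset(torch.randn(100, 512), torch.randint(0, 10, (100,))),
    batch_size=16,
)

for x, y in domain_loader:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```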
AI models collapse when trained on recursively generated data | Nature
Can LLMs invent better ways to train LLMs?
Synthetic training data
Machine Learning for Synthetic Data Generation: A Review
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
Architectures
Convolutional neural nets
CNN explainer - visualization, Georgia Tech
An Interactive Node-Link Visualization of Convolutional Neural Networks - Adam Harley
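A minimal sketch of the conv-then-pool pattern those visualizers animate: a bank of learned 3x3 filters slides over the image, then pooling downsamples. Shapes are illustrative.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 16 learned 3x3 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # halve spatial resolution
)
img = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
print(net(img).shape)             # torch.Size([1, 16, 16, 16])
```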
GANs
Transformers
Transformer explainer - poloclub.github.io
The Illustrated Transformer – Jay Alammar: Visualizing machine learning one concept at a time.
Rethinking the transformer ‘circuit’ model: Adversarial Circuit Evaluation
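A minimal sketch of scaled dot-product attention, the operation at the heart of the transformer diagrams above; shapes are illustrative.

```python
import torch

def attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # query-key similarities
    weights = torch.softmax(scores, dim=-1)      # normalize per query
    return weights @ v                           # weighted sum of values

q = k = v = torch.randn(2, 16, 64)  # (batch, seq, dim)
out = attention(q, k, v)            # shape (2, 16, 64)
```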
Reinforcement learning
- main parameter set: the base model, trained on a large general-information dataset (e.g., a subset of the internet)
- fine-tune: retrain the model on a much smaller domain-specific dataset
- RLHF: additionally (and/or) have humans rank candidate outputs from the fine-tuned model, and use those preferences to train a reward model that guides further optimization (see the sketch after this list)
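A minimal sketch of the reward-model step of RLHF, using the standard pairwise (Bradley-Terry) loss. The random embeddings stand in for (human-preferred, rejected) response pairs, and the linear head stands in for a reward head on a language model.

```python
import torch
import torch.nn as nn

reward_model = nn.Linear(512, 1)   # stand-in for a scalar reward head
opt = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

chosen_emb = torch.randn(32, 512)    # responses humans preferred
rejected_emb = torch.randn(32, 512)  # responses humans rejected

opt.zero_grad()
r_chosen = reward_model(chosen_emb)
r_rejected = reward_model(rejected_emb)
# Push the reward of the chosen response above the rejected one.
loss = -torch.log(torch.sigmoid(r_chosen - r_rejected)).mean()
loss.backward()
opt.step()
```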
Fine-tuning
Domain-adaptive pre-training
Continual Pre-training of Language Models
Optimization techniques
Batch norm
How does batch normalization help optimization? (pdf)
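A minimal sketch of what a batch-norm layer computes in training mode: normalize each feature over the batch, then rescale with learned parameters gamma and beta.

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=0)                        # per-feature batch mean
    var = x.var(dim=0, unbiased=False)          # per-feature batch variance
    x_hat = (x - mean) / torch.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta                 # learned affine rescale

x = torch.randn(64, 10)
out = batch_norm(x, gamma=torch.ones(10), beta=torch.zeros(10))
```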
Random initialization
Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable
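A minimal sketch of one standard random-initialization scheme, Kaiming (He) initialization, which scales weight variance by fan-in so activations keep a stable scale through ReLU layers.

```python
import torch.nn as nn

layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')  # variance ~ 2/fan_in
nn.init.zeros_(layer.bias)
```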
Selecting model components
Loss functions
- A model’s loss function quantifies the difference between the model’s prediction and the known value; its gradient, computed by backpropagation, is used to update model parameters in subsequent training epochs.
- The loss function should be selected according to the purpose of the model; a standard choice is the mean squared error $ \mathrm{MSE} = \frac{1}{D}\sum_{i=1}^{D}(x_i-y_i)^2 $, where $ x_i $ is the prediction, $ y_i $ is the known value, and $ D $ is the number of values (see the sketch below).
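A minimal PyTorch sketch of the MSE above, showing the gradient it hands to backpropagation.

```python
import torch

x = torch.tensor([2.0, 0.5, 1.0], requires_grad=True)  # predictions
y = torch.tensor([1.5, 0.0, 1.0])                      # known values

mse = ((x - y) ** 2).mean()  # same as torch.nn.functional.mse_loss(x, y)
mse.backward()               # d(MSE)/dx_i = 2 (x_i - y_i) / D
print(x.grad)                # tensor([0.3333, 0.3333, 0.0000])
```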
Loss Functions and Metrics in Deep Learning
Choice of loss function matters: Inside the maths that drives AI
Model selection, evaluation, benchmarking
GPT in Data Science (model selection)
Evaluating Large Language Models Trained on Code
Benchmarking inference providers
LangWatch - Monitor, Evaluate and Optimize your LLM-apps
Interpretability
OthelloGPT learned a bag of heuristics — LessWrong
(2015) Visualizing and understanding DNNs - Matt Zeiler, Clarifai
Etc
LLM Prompting
[2402.10200] Chain-of-Thought Reasoning Without Prompting
A Control Theory of LLM Prompting
Non-gradient methods
In practice: code
Karpathy: Micrograd
The spelled-out intro to neural networks and backpropagation: building micrograd (YouTube)
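A minimal sketch in the spirit of micrograd (simplified; see the video and repo for the full version): a scalar Value records its inputs and backpropagates gradients via the chain rule.

```python
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = children

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():            # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule
        # from the output back to the leaves.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a          # c = 8
c.backward()
print(a.grad, b.grad)  # 4.0 2.0
```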
Karpathy: Neural networks
Neural networks: zero to hero, Implement GPT2