Neural networks

Training

Data

For supervised machine learning, the training data is labeled.

For a generative general-purpose language model, a typical training data set might be, say, 10TB of text scraped from the internet; the goal is for the training text to approximately represent general knowledge.

A smaller set of domain-specific training data can be used to fine-tune a general purpose model previously trained on the larger, more general data set.

AI models collapse when trained on recursively generated data | Nature

Can LLMs invent better ways to train LLMs?

  • Synthetic training data

Machine Learning for Synthetic Data Generation: A Review

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Distributed training

[2501.18512v1] Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Architectures

Convolutional neural nets

CNN explainer - visualization, Georgia Tech

An Interactive Node-Link Visualization of Convolutional Neural Networks - Adam Harley

GANs

Transformers

Transformer explainer - poloclub.github.io

The Illustrated Transformer – Jay Alammar: Visualizing machine learning one concept at a time.

Vision Transformer (ViT)

Rethinking the transformer ‘circuit’ model: Adversarial Circuit Evaluation

Reinforcement learning

  • main parameter set: the base model, trained on a large general dataset (e.g. a subset of the internet)
  • fine-tune: retrain the model on a much smaller domain-specific dataset (a minimal sketch follows this list)
    • RLHF: additionally (or instead), have humans choose among outputs from the (typically fine-tuned) model; those preferences guide further parameter updates via reinforcement learning
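
A minimal PyTorch sketch of the fine-tuning step above (RLHF not included), assuming a toy classifier; the base weights, dataset, and file name are placeholders, not a real pipeline:

```python
import torch
import torch.nn as nn

# Toy "base model"; in practice its weights would come from pre-training on the
# large general dataset, e.g. base.load_state_dict(torch.load("base.pt")).
base = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Fine-tuning: retrain the same parameters on a much smaller domain dataset,
# typically with a small learning rate so the general knowledge isn't overwritten.
optimizer = torch.optim.AdamW(base.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for the small domain-specific dataset (random tensors here).
domain_batches = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(8)]

for epoch in range(3):
    for x, y in domain_batches:
        optimizer.zero_grad()
        loss = loss_fn(base(x), y)
        loss.backward()
        optimizer.step()
```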

Fine tuning

Domain-adaptive pre-training

Continual Pre-training of Language Models

… vs RAG?

[2312.05934] Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs

Optimization techniques

Batch norm

How does batch normalization help optimization? (pdf)

Random initialization

Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable

Finessing model training: experimentation

During training, an LLM accepts inputs that are tokenized and forward-passed through the model. The model's outputs are evaluated against known targets, and a loss function quantifies the error for that pass; a full pass over the training data is an epoch. The loss is backpropagated to compute gradients for the model parameters, which are then updated by a learning rule (generally a variant of gradient descent). For a fixed architecture, the tokenizer, loss function, and learning rule can each be finessed to optimize model training.
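
A minimal sketch of that loop, assuming a toy next-token model and random "tokens"; the vocabulary size and model are illustrative, not a real LLM:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, d_model),   # token ids -> vectors
                      nn.Linear(d_model, vocab_size))      # vectors -> next-token logits
loss_fn = nn.CrossEntropyLoss()                            # quantifies prediction error
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)    # gradient-descent update rule

# Stand-in for tokenized training text: predict each token from the previous one.
tokens = torch.randint(0, vocab_size, (512,))
inputs, targets = tokens[:-1], tokens[1:]

for epoch in range(5):                  # one epoch = one pass over the training data
    optimizer.zero_grad()
    logits = model(inputs)              # forward pass
    loss = loss_fn(logits, targets)     # loss for this pass
    loss.backward()                     # backpropagation computes parameter gradients
    optimizer.step()                    # parameters updated via gradient descent
```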

Tokenization

Tokenization (nlp.stanford.edu)

Embeddings (pdf) - Vicki Boykis

[1301.3781] Efficient Estimation of Word Representations in Vector Space

What are Vector Embeddings | Pinecone
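
A minimal tokenization example, assuming the Hugging Face transformers library and its GPT-2 tokenizer (an assumption for illustration; the links above cover tokenization and embeddings more generally):

```python
# pip install transformers  (assumed)
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "Neural networks learn from tokenized text."
ids = tok.encode(text)                     # list of integer token ids
pieces = tok.convert_ids_to_tokens(ids)    # the subword strings those ids represent
print(ids)
print(pieces)
print(tok.decode(ids))                     # decodes back to the original string
```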

Performance metric (loss)

  • A model’s loss function quantifies the difference between the model’s prediction and the known value; backpropagation uses that result to compute the gradients that update model parameters over subsequent training epochs.

  • The loss function should be selected according to the purpose of the model; a standard choice is the mean squared error $ MSE = \frac{1}{D}\sum_{i=1}^{D}(x_i-y_i)^2 $, where $ x_i $ is the prediction, $ y_i $ is the known value, and $ D $ is the number of examples.
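
A minimal worked example of the MSE loss and its gradient, with illustrative values:

```python
import torch

x = torch.tensor([2.0, 0.5, 1.5], requires_grad=True)  # model predictions
y = torch.tensor([1.0, 0.0, 2.0])                       # known values

mse = ((x - y) ** 2).mean()   # MSE = (1/D) * sum_i (x_i - y_i)^2
mse.backward()                # backpropagation: d(MSE)/dx_i = 2 * (x_i - y_i) / D

print(mse.item())   # 0.5
print(x.grad)       # tensor([ 0.6667,  0.3333, -0.3333])
```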

Loss Functions and Metrics in Deep Learning

Choice of loss function matters: Inside the maths that drives AI

Learning

Gradient descent uses each parameter’s gradient to set the direction and relative size of its update. A universal scale factor, the learning rate, is applied to these per-parameter (SGD-derived) updates and governs how training traverses parameter space across epochs. Ideally, the learning rate lets training find good parameters efficiently, without stepping ‘over’ the minimum of the loss in parameter space.
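
A minimal sketch of gradient descent with a learning rate (the universal scale factor above), minimizing a toy one-parameter loss f(w) = (w - 3)^2; the values are illustrative:

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
lr = 0.1   # learning rate: scales every gradient-derived update

for step in range(50):
    loss = (w - 3.0) ** 2
    loss.backward()                 # gradient of the loss with respect to w
    with torch.no_grad():
        w -= lr * w.grad            # update scaled by the learning rate
    w.grad.zero_()

print(w.item())  # ~3.0; a much larger lr (e.g. 1.5) would overshoot and diverge
```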

Model selection, evaluation, benchmarking:

GPT in Data Science (model selection)

Evaluating Large Language Models Trained on Code

Benchmarking inference providers

LangWatch - Monitor, Evaluate and Optimize your LLM-apps

Show Your Work: Improved Reporting of Experimental Results - ACL Anthology

Introducing the WeirdML Benchmark — LessWrong

Interpretability

[2504.02732] Why do LLMs attend to the first token?

OthelloGPT learned a bag of heuristics — LessWrong

(2015) Visualizing and understanding DNNs - Matt Zeiler, Clarifai

Etc

LLM Prompting

[2402.10200] Chain-of-Thought Reasoning Without Prompting

A Control Theory of LLM Prompting

Non-gradient methods

Detailed answer to: Is it possible to train a neural network without backpropagation? - Stack Overflow


In practice: code

Karpathy: Micrograd

karpathy/micrograd: A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API (github)

The spelled-out intro to neural networks and backpropagation: building micrograd (YouTube)

Karpathy: Neural networks

Neural networks: zero to hero, Implement GPT2


Courses / course notes

MIT 6.867 Machine Learning (grad)

Theory

Neural_Network_Theory.pdf


Tools for building neural nets

Pytorch

PyTorch

How does torch.compile speed up a transformer? | Adam Casson

JAX

JAX: High performance array computing — JAX documentation

Einops

Einops

Models

Hugging Face

Anthropic Models

Llama (Meta)

Google AI Models - Gemini, open, industry-specific

OpenAI Models - API

Datasets

Hugging Face

Chat bots

GitHub Copilot

Claude

Gemini - chat to supercharge your ideas

ChatGPT | OpenAI