Training
Data
For supervised machine learning, the training data is labeled.
For a generative general-purpose language model, a typical training data set might be, say, 10 TB of text scraped from the internet; the goal is for the training text to approximately represent general knowledge.
A smaller set of domain-specific training data can be used to fine-tune a general purpose model previously trained on the larger, more general data set.
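A hedged sketch of that fine-tuning step, assuming the Hugging Face transformers/datasets libraries (a framework choice made here for illustration, not named in these notes); the GPT-2 base model and the two-sentence "domain corpus" are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Base model previously trained on a large, general corpus (placeholder choice).
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Tiny stand-in for the smaller, domain-specific dataset.
texts = ["Domain-specific sentence one.", "Domain-specific sentence two."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True), remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # next-token labels
)
trainer.train()  # continue training the general model on the domain data
```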
AI models collapse when trained on recursively generated data | Nature
Can LLMs invent better ways to train LLMs?
- Synthetic training data
Machine Learning for Synthetic Data Generation: A Review
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
Distributed training
[2501.18512v1] Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
Architectures
Convolutional neural nets
CNN explainer - visualization, Georgia Tech
An Interactive Node-Link Visualization of Convolutional Neural Networks - Adam Harley
GANs
Transformers
Transformer explainer - poloclub.github.io
The Illustrated Transformer – Jay Alammar: Visualizing machine learning one concept at a time.
Rethinking the transformer ‘circuit’ model: Adversarial Circuit Evaluation
Reinforcement learning
- main parameter set (base model trained on a general dataset, e.g. a subset of the internet)
- fine-tune: retrain the model on a much smaller, domain-specific dataset
- RLHF: additionally (and/or instead), have humans choose between outputs offered by the (fine-tuned) model and use those preferences to train it further; see the sketch below
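A minimal sketch of the human-preference step in RLHF, assuming a pairwise (Bradley–Terry style) reward model; the linear reward head and random embeddings are stand-ins for encoded (prompt, response) pairs chosen/rejected by raters:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward head: maps an encoded (prompt, response) pair to a scalar score.
reward_head = torch.nn.Linear(768, 1)

# Stand-ins for embeddings of the outputs humans preferred vs. rejected.
chosen = torch.randn(8, 768)
rejected = torch.randn(8, 768)

# Pairwise loss: push the chosen output's score above the rejected one's.
loss = -F.logsigmoid(reward_head(chosen) - reward_head(rejected)).mean()
loss.backward()  # in RLHF, the trained reward then guides policy updates (e.g. PPO)
```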
Fine tuning
Domain-adaptive pre-training
Continual Pre-training of Language Models
… vs RAG?
[2312.05934] Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
Optimization techniques
Batch norm
How does batch normalization help optimization? (pdf)
Random initialization
Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable
Finessing model training: experimentation
During training, an LLM accepts inputs that are tokenized and forward-passed through the model. Model outputs are evaluated, and a loss function quantifies performance for that pass; one full pass over the training data is an epoch. Gradients of the loss are fed back to the model parameters, which update according to a learning rule (generally some form of gradient descent). For a fixed architecture, the tokenizer, loss function, and learning rule may be finessed to optimize model training.
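A minimal, self-contained sketch of that loop in PyTorch; the tiny model, random token batches, and hyperparameters are illustrative stand-ins, not an actual LLM:

```python
import torch

vocab_size, embed_dim, seq_len, batch_size = 100, 32, 16, 8
model = torch.nn.Sequential(                      # stand-in for an LLM
    torch.nn.Embedding(vocab_size, embed_dim),
    torch.nn.Flatten(),
    torch.nn.Linear(embed_dim * seq_len, vocab_size),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(3):                            # epoch = one pass over the training data
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len))   # "tokenized" inputs
    targets = torch.randint(0, vocab_size, (batch_size,))          # known next tokens
    logits = model(tokens)                        # forward pass
    loss = loss_fn(logits, targets)               # loss quantifies performance
    optimizer.zero_grad()
    loss.backward()                               # backpropagate gradients
    optimizer.step()                              # gradient-descent parameter update
```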
Tokenization
Tokenization (nlp.stanford.edu)
Embeddings (pdf) - Vicki Boykis
[1301.3781] Efficient Estimation of Word Representations in Vector Space
What are Vector Embeddings | Pinecone
Performance metric (loss)
- A model’s loss function quantifies the difference between the model prediction and the known value; backpropagation uses that loss to compute the gradients that improve model parameters over subsequent training steps.
- The loss function should be selected according to the purpose of the model, but a standard choice is the mean squared error $ \mathrm{MSE} = \frac{1}{D}\sum_{i=1}^{D}(x_i-y_i)^2 $, where $ x_i $ is the prediction and $ y_i $ is the known value.
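For instance, the same MSE computed in PyTorch (predictions and known values as plain tensors):

```python
import torch

pred = torch.tensor([2.5, 0.0, 2.1])     # predictions x_i
known = torch.tensor([3.0, -0.5, 2.0])   # known values y_i

mse = torch.mean((pred - known) ** 2)                     # (1/D) * sum of squared errors
mse_builtin = torch.nn.functional.mse_loss(pred, known)   # built-in equivalent
```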
Loss Functions and Metrics in Deep Learning
Choice of loss function matters: Inside the maths that drives AI
Learning
Gradient descent supplies, for each model parameter, the direction and relative size of its update (the gradient). A universal scale factor (the learning rate) applied to these individual SGD-derived updates is useful for traversing parameter space methodically across training epochs. Ideally, the learning rate lets training identify optimal parameters efficiently, without stepping ‘over’ the global minimum in parameter space.
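A minimal sketch of that update rule on a toy least-squares problem, with a hand-rolled SGD step (the learning-rate value is illustrative):

```python
import torch

w = torch.randn(10, requires_grad=True)        # model parameters
x, y = torch.randn(32, 10), torch.randn(32)    # toy data

lr = 1e-2                                      # universal scale factor (learning rate)
for step in range(100):
    loss = ((x @ w - y) ** 2).mean()
    loss.backward()                            # per-parameter gradients
    with torch.no_grad():
        w -= lr * w.grad                       # step against each gradient, scaled by lr
    w.grad.zero_()
```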
Model selection, evaluation, benchmarking
GPT in Data Science (model selection)
Evaluating Large Language Models Trained on Code
Benchmarking inference providers
LangWatch - Monitor, Evaluate and Optimize your LLM-apps
Show Your Work: Improved Reporting of Experimental Results - ACL Anthology
Introducing the WeirdML Benchmark — LessWrong
Interpretability
[2504.02732] Why do LLMs attend to the first token?
OthelloGPT learned a bag of heuristics — LessWrong
(2015) Visualizing and understanding DNNs - Matt Zeiler, Clarifai
Etc
LLM Prompting
[2402.10200] Chain-of-Thought Reasoning Without Prompting
A Control Theory of LLM Prompting
Non-gradient methods
In practice: code
Karpathy: Micrograd
The spelled-out intro to neural networks and backpropagation: building micrograd (YouTube)
Karpathy: Neural networks
Neural networks: zero to hero, Implement GPT2
Courses / course notes
MIT 6.867 Machine Learning (grad)
Theory
Tools for building neural nets
PyTorch
How does torch.compile speed up a transformer? | Adam Casson
JAX
JAX: High performance array computing — JAX documentation
Einops
Models
Google AI Models - Gemini, open, industry-specific