Training
Data
For supervised machine learning, the training data is labeled.
For a generative general-purpose language model, a typical training data set might be, say, 10 TB of text scraped from the internet; the goal is for the training text to approximately represent general knowledge.
A smaller set of domain-specific training data can be used to fine-tune a general purpose model previously trained on the larger, more general data set.
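A hedged sketch of that fine-tuning step, assuming the Hugging Face transformers/datasets libraries (a framework choice made here for illustration, not named in these notes); the GPT-2 base model and the two-sentence "domain corpus" are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Base model previously trained on a large, general corpus (placeholder choice).
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Tiny stand-in for the smaller, domain-specific dataset.
texts = ["Domain-specific sentence one.", "Domain-specific sentence two."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True), remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # next-token labels
)
trainer.train()  # continue training the general model on the domain data
```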
AI models collapse when trained on recursively generated data | Nature
Can LLMs invent better ways to train LLMs?
- Synthetic training data
Machine Learning for Synthetic Data Generation: A Review
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
Distributed training
[2501.18512v1] Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
Architectures
Convolutional neural nets
CNN explainer - visualization, Georgia Tech
An Interactive Node-Link Visualization of Convolutional Neural Networks - Adam Harley
GANs
Transformers
Transformer explainer - poloclub.github.io
The Illustrated Transformer – Jay Alammar: Visualizing machine learning one concept at a time.
Rethinking the transformer ‘circuit’ model: Adversarial Circuit Evaluation
Reinforcement learning
- main parameter set (base model trained on a general dataset, e.g. a subset of the internet)
- fine-tune: retrain the model on a much smaller, domain-specific dataset
- RLHF: additionally (and/or instead), have humans choose between outputs offered by the (fine-tuned) model and use those preferences to train it further; see the sketch below
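A minimal sketch of the human-preference step in RLHF, assuming a pairwise (Bradley–Terry style) reward model; the linear reward head and random embeddings are stand-ins for encoded (prompt, response) pairs chosen/rejected by raters:

```python
import torch
import torch.nn.functional as F

# Hypothetical reward head: maps an encoded (prompt, response) pair to a scalar score.
reward_head = torch.nn.Linear(768, 1)

# Stand-ins for embeddings of the outputs humans preferred vs. rejected.
chosen = torch.randn(8, 768)
rejected = torch.randn(8, 768)

# Pairwise loss: push the chosen output's score above the rejected one's.
loss = -F.logsigmoid(reward_head(chosen) - reward_head(rejected)).mean()
loss.backward()  # in RLHF, the trained reward then guides policy updates (e.g. PPO)
```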
Fine tuning
Domain-adaptive pre-training
Continual Pre-training of Language Models
… vs RAG?
[2312.05934] Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
Optimization techniques
Batch norm
How does batch normalization help optimization? (pdf)
Random initialization
Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable
Finessing model training: experimentation
During training, an LLM accepts inputs that are tokenized and forward-passed through the model. Model outputs are evaluated, and a loss function quantifies performance for that pass; one full pass over the training data is an epoch. Gradients of the loss are fed back to the model parameters, which update according to a learning rule (generally some form of gradient descent). For a fixed architecture, the tokenizer, loss function, and learning rule may be finessed to optimize model training.
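A minimal, self-contained sketch of that loop in PyTorch; the tiny model, random token batches, and hyperparameters are illustrative stand-ins, not an actual LLM:

```python
import torch

vocab_size, embed_dim, seq_len, batch_size = 100, 32, 16, 8
model = torch.nn.Sequential(                      # stand-in for an LLM
    torch.nn.Embedding(vocab_size, embed_dim),
    torch.nn.Flatten(),
    torch.nn.Linear(embed_dim * seq_len, vocab_size),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(3):                            # epoch = one pass over the training data
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len))   # "tokenized" inputs
    targets = torch.randint(0, vocab_size, (batch_size,))          # known next tokens
    logits = model(tokens)                        # forward pass
    loss = loss_fn(logits, targets)               # loss quantifies performance
    optimizer.zero_grad()
    loss.backward()                               # backpropagate gradients
    optimizer.step()                              # gradient-descent parameter update
```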
Tokenization
Tokenization (nlp.stanford.edu)
Embeddings (pdf) - Vicki Boykis
[1301.3781] Efficient Estimation of Word Representations in Vector Space
What are Vector Embeddings | Pinecone
Performance metric (loss)
- A model’s loss function quantifies the difference between the model prediction and the known value; backpropagation uses that loss to compute the gradients that improve model parameters over subsequent training steps.
- The loss function should be selected according to the purpose of the model, but a standard choice is the mean squared error $ \mathrm{MSE} = \frac{1}{D}\sum_{i=1}^{D}(x_i-y_i)^2 $, where $ x_i $ is the prediction and $ y_i $ is the known value.
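For instance, the same MSE computed in PyTorch (predictions and known values as plain tensors):

```python
import torch

pred = torch.tensor([2.5, 0.0, 2.1])     # predictions x_i
known = torch.tensor([3.0, -0.5, 2.0])   # known values y_i

mse = torch.mean((pred - known) ** 2)                     # (1/D) * sum of squared errors
mse_builtin = torch.nn.functional.mse_loss(pred, known)   # built-in equivalent
```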
Loss Functions and Metrics in Deep Learning
Choice of loss function matters: Inside the maths that drives AI
Learning
Gradient descent supplies, for each model parameter, the direction and relative size of its update (the gradient). A universal scale factor (the learning rate) applied to these individual SGD-derived updates is useful for traversing parameter space methodically across training epochs. Ideally, the learning rate lets training identify optimal parameters efficiently, without stepping ‘over’ the global minimum in parameter space.
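A minimal sketch of that update rule on a toy least-squares problem, with a hand-rolled SGD step (the learning-rate value is illustrative):

```python
import torch

w = torch.randn(10, requires_grad=True)        # model parameters
x, y = torch.randn(32, 10), torch.randn(32)    # toy data

lr = 1e-2                                      # universal scale factor (learning rate)
for step in range(100):
    loss = ((x @ w - y) ** 2).mean()
    loss.backward()                            # per-parameter gradients
    with torch.no_grad():
        w -= lr * w.grad                       # step against each gradient, scaled by lr
    w.grad.zero_()
```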
Model selection, evaluation, benchmarking
GPT in Data Science (model selection)
Evaluating Large Language Models Trained on Code
Benchmarking inference providers
LangWatch - Monitor, Evaluate and Optimize your LLM-apps
Show Your Work: Improved Reporting of Experimental Results - ACL Anthology
Introducing the WeirdML Benchmark — LessWrong
Interpretability
[2504.02732] Why do LLMs attend to the first token?
OthelloGPT learned a bag of heuristics — LessWrong
(2015) Visualizing and understanding DNNs - Matt Zeiler, Clarifai
Etc
LLM Prompting
[2402.10200] Chain-of-Thought Reasoning Without Prompting
A Control Theory of LLM Prompting
Non-gradient methods
In practice: code
Karpathy: Micrograd
The spelled-out intro to neural networks and backpropagation: building micrograd (YouTube)
Karpathy: Neural networks
Neural networks: zero to hero, Implement GPT2
Courses / course notes
MIT 6.867 Machine Learning (grad)
Theory
Tools for building neural nets
PyTorch
How does torch.compile speed up a transformer? | Adam Casson
JAX
JAX: High performance array computing — JAX documentation
Einops
Models
Google AI Models - Gemini, open, industry-specific