Training data
For supervised machine learning, the training data is labeled.
For a generative general-purpose language model, a typical training data set might be, say, 10 TB of text scraped from the internet; the goal is for the training text to approximately represent general knowledge.
A smaller set of domain-specific training data can be used to fine-tune a general-purpose model previously trained on the larger, more general data set.
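The pretrain-then-fine-tune idea can be sketched with a toy model. This is an illustration only, not a real LLM pipeline: a one-parameter linear model is first trained on a large "general" dataset, then the same weights are further trained, at a lower learning rate, on a small "domain" dataset with a slightly different target.

```python
import random

def train(w, data, lr, steps):
    """Plain SGD on a 1-D linear model y = w * x with squared error."""
    for _ in range(steps):
        x, y = random.choice(data)
        pred = w * x
        grad = 2 * (pred - y) * x   # d/dw of (w*x - y)^2
        w -= lr * grad
    return w

random.seed(0)
# "General" pretraining data: y ≈ 3x plus noise.
general = [(x, 3 * x + random.gauss(0, 0.1)) for x in [i / 100 for i in range(1, 101)]]
# Small domain-specific set with a slightly different slope.
domain = [(x, 3.5 * x) for x in [0.2, 0.5, 0.8]]

w = train(0.0, general, lr=0.05, steps=2000)   # pretrain: base model
w_ft = train(w, domain, lr=0.01, steps=200)    # fine-tune: start from base weights
```

Fine-tuning starts from the pretrained weights rather than from scratch, so the small domain set only needs to nudge the model, not teach it everything.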
Synthetic training data
Machine Learning for Synthetic Data Generation: A Review
Cosmopedia: how to create large-scale synthetic data for pretraining Large Language Models
Architectures
Convolutional neural nets
CNN explainer – visualization, Georgia Tech
An Interactive Node-Link Visualization of Convolutional Neural Networks – Adam Harley
Transformers
The Illustrated Transformer – Jay Alammar: Visualizing machine learning one concept at a time.
Reinforcement learning
- main parameter set (base model trained on a general dataset, e.g. a subset of the internet)
- fine-tune: retrain the model on a much smaller domain-specific dataset
- RLHF: additionally (and/or alternatively), have humans choose between options offered by the (fine-tuned?) model
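One ingredient of RLHF is a reward model trained on those human choices. A common formulation (an assumption here, not something the notes above specify) is the Bradley–Terry pairwise loss: the reward model should score the human-preferred response higher than the rejected one.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley–Terry pairwise loss often used for RLHF reward models:
    -log sigmoid(r_chosen - r_rejected). Small when the reward model
    already scores the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Reward model agrees with the human label: small loss.
good = preference_loss(2.0, -1.0)
# Reward model disagrees: large loss, pushing the scores apart during training.
bad = preference_loss(-1.0, 2.0)
```

Minimizing this loss over many human comparisons yields a scalar reward signal that the policy model is then optimized against.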
Fine tuning
Domain-adaptive pretraining
Continual Pretraining of Language Models
Optimization techniques
Batch norm
How does batch normalization help optimization? (pdf)
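A minimal sketch of the batch norm forward pass (one feature, training mode, ignoring the running statistics a real implementation tracks): normalize the batch to zero mean and unit variance, then apply a learned scale and shift.

```python
def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations for a single feature to zero mean
    and unit variance, then apply the learned scale (gamma) and shift (beta).
    eps guards against division by zero for near-constant batches."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in xs]

out = batch_norm([1.0, 2.0, 3.0])
```

Because gamma and beta are learned, the network can undo the normalization where that helps, so expressiveness is not lost.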
Random initialization
Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable
Selecting model components
Loss functions

A model’s loss function quantifies the difference between the model prediction and the known value; the result is used by the backpropagation algorithm to improve model parameters in subsequent training epochs.

The function should be selected according to the purpose of the model, but a standard loss function is the mean squared error $ MSE = \frac{1}{D}\sum_{i=1}^{D}(x_i - y_i)^2 $, where $ x_i $ is the prediction and $ y_i $ is the known value.
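Both pieces of that description can be made concrete: the MSE value itself, and the quantity backpropagation actually consumes, the gradient of the loss with respect to each prediction. A small sketch (using the mean-normalized form of MSE):

```python
def mse(preds, targets):
    """Mean squared error: (1/D) * sum over i of (x_i - y_i)^2."""
    d = len(preds)
    return sum((x - y) ** 2 for x, y in zip(preds, targets)) / d

def mse_grad(preds, targets):
    """Gradient of MSE w.r.t. each prediction x_i: 2 * (x_i - y_i) / D.
    This is the error signal backpropagation feeds into earlier layers."""
    d = len(preds)
    return [2 * (x - y) / d for x, y in zip(preds, targets)]

loss = mse([1.0, 2.0], [1.0, 4.0])        # (0 + 4) / 2 = 2.0
grads = mse_grad([1.0, 2.0], [1.0, 4.0])  # [0.0, -2.0]
```

Note the sign: a prediction below its target gets a negative gradient, so a gradient-descent step increases it.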
Loss Functions and Metrics in Deep Learning
Choice of loss function matters: Inside the maths that drives AI
Model selection, evaluation, benchmarking:
GPT in Data Science (model selection)
Evaluating Large Language Models Trained on Code
Benchmarking inference providers
Interpretability
OthelloGPT learned a bag of heuristics — LessWrong
(2015) Visualizing and understanding DNNs – Matt Zeiler, Clarifai
Etc
LLM Prompting
A Control Theory of LLM Prompting
Non-gradient methods
In practice: code!
Karpathy: Micrograd
The spelledout intro to neural networks and backpropagation: building micrograd (YouTube)
Karpathy: Neural networks
Neural networks: zero to hero – Implement GPT-2
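In the spirit of micrograd, here is a stripped-down sketch of the core idea: a scalar `Value` that records how it was computed, so calling `backward()` applies the chain rule over the graph in reverse. (Micrograd itself supports more operations; this version has only add and multiply.)

```python
class Value:
    """Minimal micrograd-style scalar with reverse-mode autodiff."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = children

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a   # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
```

This is the same mechanism the loss-function section relies on: gradients flow from the loss back through every operation that produced it.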