Tools for building neural nets
PyTorch
How does torch.compile speed up a transformer? | Adam Casson
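A minimal usage sketch (PyTorch ≥ 2.0): torch.compile traces the model and fuses operations into optimized kernels, so repeated forward passes skip most Python overhead. The toy encoder here is illustrative, not the article's benchmark setup.

```python
import torch
import torch.nn as nn

# torch.compile wraps a model; the first call triggers compilation,
# later calls reuse the compiled graph.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
compiled = torch.compile(model)

x = torch.randn(8, 16, 64)  # (batch, sequence, d_model)
out = compiled(x)           # first call compiles, then runs
```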
JAX
JAX: High performance array computing — JAX documentation
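A minimal sketch of JAX's core idiom: write a pure function, then compose jax.grad (autodiff) with jax.jit (XLA compilation). The loss and shapes here are illustrative.

```python
import jax
import jax.numpy as jnp

# A pure loss function: no side effects, arrays in, scalar out.
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

grad_loss = jax.jit(jax.grad(loss))  # compiled gradient w.r.t. w

w = jnp.zeros(3)
x = jnp.ones((5, 3))
y = jnp.ones(5)
print(grad_loss(w, x, y))
```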
Einops
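A minimal sketch of einops' named-axis notation, here splitting an embedding into attention heads; the shapes are illustrative.

```python
import torch
from einops import rearrange

# Reshapes written as named-axis patterns instead of view/permute chains.
x = torch.randn(8, 16, 64)                      # (batch, seq, dim)
heads = rearrange(x, 'b s (h d) -> b h s d', h=4)
print(heads.shape)                              # torch.Size([8, 4, 16, 16])
```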
Models
Google AI Models - Gemini, open, industry-specific
Datasets
Chatbots
Gemini - chat to supercharge your ideas
Training data
For supervised machine learning, the training data is labeled.
For a generative general-purpose language model, a typical training set might be on the order of 10 TB of text scraped from the internet; the goal is for the training text to approximately represent general knowledge.
A smaller set of domain-specific training data can then be used to fine-tune a general-purpose model previously trained on the larger, more general dataset.
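A minimal PyTorch sketch of that fine-tuning step. The "pretrained" model and the random domain dataset are stand-ins; in practice the weights come from the large general pre-training run, and here the pretrained body is frozen so only the final layer is retrained.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
# model.load_state_dict(torch.load('base_model.pt'))  # hypothetical checkpoint

for p in model[:-1].parameters():   # freeze everything but the last layer
    p.requires_grad = False

opt = torch.optim.AdamW(model[-1].parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# small domain-specific dataset (random stand-in data)
domain_loader = DataLoader(
    TensorDataset(torch.randn(100, 512), torch.randint(0, 10, (100,))),
    batch_size=16,
)

for x, y in domain_loader:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```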
AI models collapse when trained on recursively generated data | Nature
Can LLMs invent better ways to train LLMs?
Synthetic training data
Machine Learning for Synthetic Data Generation: A Review
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
Architectures
Convolutional neural nets
CNN explainer - visualization, Georgia Tech
An Interactive Node-Link Visualization of Convolutional Neural Networks - Adam Harley
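A minimal sketch of the conv-then-pool pattern those visualizers animate: a bank of learned 3x3 filters slides over the image, then pooling downsamples. Shapes are illustrative.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 16 learned 3x3 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                             # halve spatial resolution
)
img = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
print(net(img).shape)             # torch.Size([1, 16, 16, 16])
```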
GANs
Transformers
Transformer explainer - poloclub.github.io
The Illustrated Transformer – Jay Alammar: Visualizing machine learning one concept at a time.
Rethinking the transformer ‘circuit’ model: Adversarial Circuit Evaluation
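A minimal sketch of scaled dot-product attention, the operation at the heart of the transformer diagrams above; shapes are illustrative.

```python
import torch

def attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # query-key similarities
    weights = torch.softmax(scores, dim=-1)      # normalize per query
    return weights @ v                           # weighted sum of values

q = k = v = torch.randn(2, 16, 64)  # (batch, seq, dim)
out = attention(q, k, v)            # shape (2, 16, 64)
```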
Reinforcement learning
- main parameter set: the base model, trained on a large general-information dataset (e.g., a subset of the internet)
- fine-tune: retrain the model on a much smaller domain-specific dataset
- RLHF: additionally (and/or) have humans rank candidate outputs from the fine-tuned model, and use those preferences to train a reward model that guides further optimization (see the sketch after this list)
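A minimal sketch of the reward-model step of RLHF, using the standard pairwise (Bradley-Terry) loss. The random embeddings stand in for (human-preferred, rejected) response pairs, and the linear head stands in for a reward head on a language model.

```python
import torch
import torch.nn as nn

reward_model = nn.Linear(512, 1)   # stand-in for a scalar reward head
opt = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

chosen_emb = torch.randn(32, 512)    # responses humans preferred
rejected_emb = torch.randn(32, 512)  # responses humans rejected

opt.zero_grad()
r_chosen = reward_model(chosen_emb)
r_rejected = reward_model(rejected_emb)
# Push the reward of the chosen response above the rejected one.
loss = -torch.log(torch.sigmoid(r_chosen - r_rejected)).mean()
loss.backward()
opt.step()
```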
Fine-tuning
Domain-adaptive pre-training
Continual Pre-training of Language Models
Optimization techniques
Batch norm
How does batch normalization help optimization? (pdf)
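A minimal sketch of what a batch-norm layer computes in training mode: normalize each feature over the batch, then rescale with learned parameters gamma and beta.

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=0)                        # per-feature batch mean
    var = x.var(dim=0, unbiased=False)          # per-feature batch variance
    x_hat = (x - mean) / torch.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta                 # learned affine rescale

x = torch.randn(64, 10)
out = batch_norm(x, gamma=torch.ones(10), beta=torch.zeros(10))
```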
Random initialization
Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable
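A minimal sketch of one standard random-initialization scheme, Kaiming (He) initialization, which scales weight variance by fan-in so activations keep a stable scale through ReLU layers.

```python
import torch.nn as nn

layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')  # variance ~ 2/fan_in
nn.init.zeros_(layer.bias)
```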
Selecting model components
Loss functions
- A model’s loss function quantifies the difference between the model’s prediction and the known value; its gradient, computed by backpropagation, is used to update model parameters in subsequent training epochs.
- The loss function should be selected according to the purpose of the model; a standard choice is the mean squared error $ \mathrm{MSE} = \frac{1}{D}\sum_{i=1}^{D}(x_i-y_i)^2 $, where $ x_i $ is the prediction, $ y_i $ is the known value, and $ D $ is the number of values (see the sketch below).
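A minimal PyTorch sketch of the MSE above, showing the gradient it hands to backpropagation.

```python
import torch

x = torch.tensor([2.0, 0.5, 1.0], requires_grad=True)  # predictions
y = torch.tensor([1.5, 0.0, 1.0])                      # known values

mse = ((x - y) ** 2).mean()  # same as torch.nn.functional.mse_loss(x, y)
mse.backward()               # d(MSE)/dx_i = 2 (x_i - y_i) / D
print(x.grad)                # tensor([0.3333, 0.3333, 0.0000])
```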
Loss Functions and Metrics in Deep Learning
Choice of loss function matters: Inside the maths that drives AI
Model selection, evaluation, benchmarking
GPT in Data Science (model selection)
Evaluating Large Language Models Trained on Code
Benchmarking inference providers
LangWatch - Monitor, Evaluate and Optimize your LLM-apps
Interpretability
OthelloGPT learned a bag of heuristics — LessWrong
(2015) Visualizing and understanding DNNs - Matt Zeiler, Clarifai
Etc
LLM Prompting
[2402.10200] Chain-of-Thought Reasoning Without Prompting
A Control Theory of LLM Prompting
Non-gradient methods
In practice: code
Karpathy: Micrograd
The spelled-out intro to neural networks and backpropagation: building micrograd (YouTube)
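A minimal sketch in the spirit of micrograd (simplified; see the video and repo for the full version): a scalar Value records its inputs and backpropagates gradients via the chain rule.

```python
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = children

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():            # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule
        # from the output back to the leaves.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a          # c = 8
c.backward()
print(a.grad, b.grad)  # 4.0 2.0
```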
Karpathy: Neural networks
Neural networks: zero to hero, Implement GPT2