Neural networks

Tools for building neural nets

PyTorch

PyTorch

How does torch.compile speed up a transformer? | Adam Casson

JAX

JAX: High performance array computing — JAX documentation

Einops

Einops

Models

huggingface

Anthropic Models

Llama (Meta)

Google AI Models - Gemini, open, industry-specific

OpenAI Models - API

Datasets

huggingface (Hugging Face)

Chat bots

GitHub Copilot

Claude

Gemini - chat to supercharge your ideas

ChatGPT | OpenAI

Training data

For supervised machine learning, the training data is labeled.
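
As a concrete illustration of what "labeled" means, each training example pairs an input with the answer the model should learn to produce. The feature values and label meanings below are invented purely for illustration.

```python
# Toy labeled dataset: [feature_1, feature_2] -> class id (values invented).
labeled_examples = [
    ([5.1, 3.5], 0),
    ([7.0, 3.2], 1),
    ([6.3, 3.3], 2),
]

features = [x for x, _ in labeled_examples]  # what the model sees as input
labels = [y for _, y in labeled_examples]    # the known answers it learns to predict
```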

For a generative general-purpose language model, a typical training dataset might be, say, 10 TB of text scraped from the internet; the goal is for the training text to approximately represent general knowledge.

A smaller set of domain-specific training data can be used to fine-tune a general-purpose model previously trained on the larger, more general dataset.
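
A minimal PyTorch sketch of that idea, assuming a stand-in `base_model` and randomly generated "domain-specific" data (neither comes from any source above): freeze the generally pre-trained parameters and train only a small task head.

```python
import torch
import torch.nn as nn

# Stand-in for a model already trained on a large, general dataset.
base_model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))

# Freeze the general-purpose parameters.
for p in base_model.parameters():
    p.requires_grad = False

# New head for the domain-specific task (e.g. 3-way classification).
head = nn.Linear(64, 3)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder "domain-specific" data: 32 examples of 128-d features with labels.
x = torch.randn(32, 128)
y = torch.randint(0, 3, (32,))

for epoch in range(5):
    features = base_model(x)   # frozen feature extractor
    logits = head(features)    # only the head is trained
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Full fine-tuning would instead leave all parameters trainable (often at a lower learning rate); freezing the base is just the cheapest variant to sketch.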

AI models collapse when trained on recursively generated data | Nature

Can LLMs invent better ways to train LLMs?

Synthetic training data

Machine Learning for Synthetic Data Generation: A Review

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Architectures

Convolutional neural nets

CNN explainer - visualization, Georgia Tech

An Interactive Node-Link Visualization of Convolutional Neural Networks - Adam Harley

GANs

Transformers

Transformer explainer - poloclub.github.io

The Illustrated Transformer – Jay Alammar: Visualizing machine learning one concept at a time.

Vision Transformer (ViT)

Rethinking the transformer ‘circuit’ model: Adversarial Circuit Evaluation

Reinforcement learning

  • main parameter set (base model pre-trained on a general dataset, e.g. a large subset of the internet)
  • fine-tune: retrain the model on a much smaller domain-specific dataset
    • RLHF: additionally (and/or alternatively) have humans rank candidate outputs from the (typically fine-tuned) model; the rankings train a reward model that then guides reinforcement-learning updates (see the sketch after this list)
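
A minimal sketch of one piece of the RLHF recipe: training a reward model from human preference pairs. The tiny network, the embedding inputs, and the data are placeholders; a real pipeline would score full model outputs and then use the learned reward to drive an RL step (e.g. PPO).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps a (placeholder) response embedding to a scalar score.
reward_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Placeholder preference data: for each prompt, an embedding of the response
# the human labeler preferred and one they rejected.
chosen = torch.randn(16, 64)
rejected = torch.randn(16, 64)

for step in range(100):
    r_chosen = reward_model(chosen)      # higher score should mean "preferred"
    r_rejected = reward_model(rejected)
    # Pairwise (Bradley-Terry style) loss: push chosen scores above rejected ones.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```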

Fine tuning

Domain-adaptive pre-training

Continual Pre-training of Language Models

Optimization techniques

Batch norm

How does batch normalization help optimization? (pdf)

Random initialization

Randomly Initialized One-Layer Neural Networks Make Data Linearly Separable

Selecting model components

Loss functions

  • A model’s loss function quantifies the difference between the model’s predictions and the known values; its gradient, computed by the backpropagation algorithm, is used to update the model parameters in subsequent training epochs.

  • The loss function should be selected according to the purpose of the model, but a standard choice is the mean squared error $ \mathrm{MSE} = \frac{1}{D}\sum_{i=1}^{D}(x_i - y_i)^2 $, where $ x_i $ is the prediction and $ y_i $ is the known value (see the sketch after this list).
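
A small PyTorch check of the formula above: compute the MSE by hand, compare it with the built-in `nn.MSELoss`, and let autograd produce the gradient that backpropagation feeds to the optimizer. The tensor values are arbitrary.

```python
import torch
import torch.nn as nn

# Predictions x_i (requires_grad so autograd can backpropagate through them)
# and known values y_i; the numbers are arbitrary.
x = torch.tensor([2.5, 0.0, 2.0, 8.0], requires_grad=True)
y = torch.tensor([3.0, -0.5, 2.0, 7.0])

# Mean squared error written out from the formula.
mse_manual = ((x - y) ** 2).mean()

# The same value via the built-in loss.
mse_builtin = nn.MSELoss()(x, y)
assert torch.allclose(mse_manual, mse_builtin)

# Backpropagation: d(MSE)/dx_i = 2 * (x_i - y_i) / D
mse_manual.backward()
print(mse_manual.item())  # 0.375
print(x.grad)             # tensor([-0.2500,  0.2500,  0.0000,  0.5000])
```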

Loss Functions and Metrics in Deep Learning

Choice of loss function matters: Inside the maths that drives AI

Model selection, evaluation, benchmarking

GPT in Data Science (model selection)

Evaluating Large Language Models Trained on Code

Benchmarking inference providers

LangWatch - Monitor, Evaluate and Optimize your LLM-apps

Interpretability

OthelloGPT learned a bag of heuristics — LessWrong

(2015) Visualizing and understanding DNNs - Matt Zeiler, Clarifai

Etc

LLM Prompting

[2402.10200] Chain-of-Thought Reasoning Without Prompting

A Control Theory of LLM Prompting

Non-gradient methods

Detailed answer to: Is it possible to train a neural network without backpropagation? - Stack Overflow


In practice: code

Karpathy: Micrograd

karpathy/micrograd: A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API (github)

The spelled-out intro to neural networks and backpropagation: building micrograd (YouTube)

Karpathy: Neural networks

Neural Networks: Zero to Hero, Implement GPT-2


Courses / course notes

MIT 6.867 Machine Learning (grad)