ML irl

More or less specific resources for machine learning, which I may or may not keep up to date. For general or theoretical machine learning resources, see:

Using machine learning libraries

Best known: PyTorch, Keras

PyTorch

… can be tricky to install, so here are some starting points and tips:

pip install light-the-torch
ltt install torch
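
light-the-torch (ltt) picks a torch wheel that matches your local CUDA setup (or falls back to CPU-only). A quick sanity check that the install worked, assuming a standard Python environment:

import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a GPU-enabled build was installed and a GPU is visible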

Model development: training, evaluation

Synthetic data

Synthetic data for privacy-preserving clinical risk prediction | Scientific Reports

Inference: using pretrained base models

Fine tuning IRL

Fine-tuning Guide with a 4090

e-p-armstrong/augmentoolkit: Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.

Inference: using deployed models (agents, bots, apps)

Prompt engineering

Prompt Engineering Guide | Prompt Engineering Guide

(Twisted) RAG

Roaming RAG – RAG without the Vector Database - Arcturus Labs

Synthetic data

[2407.01490] LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

[2408.14960] Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress

This is AI's brain on AI

Data: ETL applications of AI

The overlooked GenAI use case

‘Learnable’ tasks (aka can AI/ML help?)

[2210.17011] A picture of the space of typical learnable tasks

Etc

Training an LLM from scratch for personal use

Not really something an individual would normally be expected to do (even fine-tuning is hard!). If you have the compute it can be attempted, but it takes a lot of time and effort and the result probably won't be great. One approach is a light pretrain on domain-specific data followed by an instruction fine-tune, which may get you okay one-shot performance.
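
As a rough illustration of that two-stage idea (not a specific recipe from any of the links here), a minimal sketch of stage one, a light continued pretrain on domain text, using Hugging Face transformers; the base model name, data file, and hyperparameters are placeholders:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"  # placeholder small base model; swap in whatever fits your GPU
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token  # gpt2 has no pad token; reuse EOS
model = AutoModelForCausalLM.from_pretrained(base)

# Stage 1: light continued pretraining on domain-specific text (hypothetical file, one document per line).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw["train"].map(
    lambda batch: tok(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-pretrain",
                           num_train_epochs=1,
                           per_device_train_batch_size=4,
                           learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal-LM objective
)
trainer.train()
trainer.save_model("domain-pretrain")
# Stage 2 (not shown): repeat with an instruction-formatted dataset on top of the saved checkpoint.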

And here's a full screen recording of someone training a llama.cpp mini-ggml-model from scratch, along with the training script.

If you have ~$500k to spend on training, you can use the MosaicML platform to get a GPT-3-quality model.

Using LLMs for real-time information

How can an LLM (Large Language Model) be trained on live data to generate real-time answers? - Beginners - Hugging Face Forums

Model distillation

jack morris on X: "it's a baffling fact about deep learning that model distillation works. method 1: train small model M1 on dataset D. method 2 (distillation): train large model L on D, then train small model M2 to mimic the output of L; M2 will outperform M1. no theory explains this; it's magic"
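
As a concrete sketch of the distillation step described in that post, here is the usual soft-label loss in PyTorch; the temperature value and the teacher/student models are placeholders, not anything from the linked post:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student distributions
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# In a training loop (teacher frozen, student being optimized):
# with torch.no_grad():
#     teacher_logits = teacher(inputs)
# student_logits = student(inputs)
# loss = distillation_loss(student_logits, teacher_logits)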