More or less specific resources for machine learning which I may or may not keep up to date. For general or theoretical machine learning resources, see:
Using machine learning libraries
Best known: PyTorch, Keras
PyTorch
… can be tricky to install, so here are some potential starting points and tips:
pip install light-the-torch
ltt install torch
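After running the two commands above, a quick sanity check confirms the install worked and shows whether a GPU is visible (a minimal sketch; `torch_status` is an illustrative helper name, not part of any library):

```python
import importlib.util

def torch_status() -> str:
    """Report whether PyTorch is importable and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed yet - try: pip install light-the-torch && ltt install torch"
    import torch
    return f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"

print(torch_status())
```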
Model development: training, evaluation
Inference: using pretrained base models
Fine tuning IRL
Inference: using deployed models (agents, bots, apps)
Prompt engineering
Prompt Engineering Guide | Prompt Engineering Guide
(Twisted) RAG
Roaming RAG – RAG without the Vector Database - Arcturus Labs
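The core idea of the Roaming RAG post, as I understand it, is to let the model navigate a document's own headings instead of querying a vector index: show it a table of contents, then expand whichever section it picks. A toy sketch of that loop (the document, `outline`, and `expand` are illustrative names, not the article's code):

```python
import re

DOC = """\
# Install
Use light-the-torch to pick the right wheel.

# Training
Start with a small model and a domain corpus.
"""

def outline(doc: str) -> list[str]:
    """Table of contents shown to the model instead of embedding search results."""
    return re.findall(r"^# (.+)$", doc, flags=re.MULTILINE)

def expand(doc: str, heading: str) -> str:
    """'Retrieval' is just jumping to the chosen heading - no vector database."""
    sections = re.split(r"^# ", doc, flags=re.MULTILINE)[1:]
    for sec in sections:
        title, _, body = sec.partition("\n")
        if title.strip() == heading:
            return body.strip()
    return ""

print(outline(DOC))            # the model picks a heading from this list
print(expand(DOC, "Training"))  # then asks to expand it
```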
Synthetic data
Synthetic data for privacy-preserving clinical risk prediction | Scientific Reports
[2407.01490] LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives
[2408.14960] Multilingual Arbitrage: Optimizing Data Pools to Accelerate Multilingual Progress
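For baseline intuition about what "synthetic data" means here, a deliberately naive sketch: fit each column's marginal distribution from the real rows and sample columns independently. This throws away cross-column correlations, which is exactly why the papers above pursue much stronger generation objectives; all names and the toy records below are illustrative:

```python
import random
from collections import Counter

# Toy 'real' records; in practice these would be sensitive rows you cannot share.
real = [
    {"age_band": "60-69", "smoker": "yes"},
    {"age_band": "60-69", "smoker": "no"},
    {"age_band": "70-79", "smoker": "no"},
    {"age_band": "50-59", "smoker": "no"},
]

def fit_marginals(rows):
    """Per-column value frequencies - the only statistics retained from the data."""
    cols = {}
    for row in rows:
        for col, val in row.items():
            cols.setdefault(col, Counter())[val] += 1
    return cols

def sample(marginals, n, rng):
    """Sample each column independently; correlations are deliberately lost."""
    return [
        {col: rng.choices(list(c), weights=list(c.values()))[0]
         for col, c in marginals.items()}
        for _ in range(n)
    ]

rng = random.Random(0)
print(sample(fit_marginals(real), 3, rng))
```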
Data: ETL applications of AI
‘Learnable’ tasks (aka can AI/ML help?)
[2210.17011] A picture of the space of typical learnable tasks
Etc
Training an LLM from scratch for personal use
Not really something an individual would be expected to do (even plain fine tuning is hard). If you have the compute you can attempt it, but it takes a lot of time and effort, and the result probably won't be great. That said, a light pretrain on domain-specific data followed by a fine tune on instruction data can get you passable one-shot performance.
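The "light pretrain, then fine-tune on instructions" shape can be illustrated at toy scale with a bigram model. This is nothing like a real LLM, but the loop is the same: one training pass over domain text, then continued training on instruction-style text (all text and function names below are made up for the sketch):

```python
import random
from collections import Counter, defaultdict

def train(counts, text):
    """Accumulate bigram counts; calling again continues training (the 'fine-tune')."""
    words = text.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, n, rng):
    """Greedy-ish sampling: follow weighted bigram transitions from a start word."""
    out = [start]
    for _ in range(n):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        out.append(rng.choices(list(nxt), weights=list(nxt.values()))[0])
    return " ".join(out)

counts = defaultdict(Counter)
train(counts, "the model reads the corpus and the model learns")          # light 'pretrain'
train(counts, "question : what does the model do ? answer : it learns")   # instruction 'fine-tune'

print(generate(counts, "the", 5, random.Random(0)))
```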
And here’s a full screen recording of someone training a llama.cpp mini-ggml-model from scratch, along with the training script.