🌱 conversations (any format) + prompts for conversation/contemplation 🌱
Researchers
Geoffrey Hinton
2024.05.20 Geoffrey Hinton | On working with Ilya, choosing problems, and the power of intuition - YouTube
- multiple timescales: temporary, input-dependent changes to the weights (fast weights) conflict with parallelization (multiple inputs processed concurrently for efficient training), since each concurrent example would need its own copy of the fast weights (see sketch below)
- graph call
- sequential / online learning would be needed; may be solved when conductances are used for weights
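A minimal NumPy sketch of the fast-weights idea (in the spirit of Ba, Hinton et al. 2016) to make the parallelization conflict concrete; the sizes, decay rate, and update details here are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                           # hidden size (arbitrary)
W_slow = 0.1 * rng.normal(size=(d, d))          # slow weights: shared by every example
W_fast = np.zeros((d, d))                       # fast weights: temporary, input-dependent
decay, eta = 0.9, 0.5                           # fast-weight decay and learning rate

def step(x, h_prev):
    """One recurrent step whose effective weights include a fast, Hebbian component."""
    global W_fast
    h = np.tanh(W_slow @ h_prev + W_fast @ h_prev + x)
    W_fast = decay * W_fast + eta * np.outer(h, h)   # quickly overwritten by recent activity
    return h

h = np.zeros(d)
for t in range(5):
    h = step(rng.normal(size=d), h)

# The conflict with parallel training: W_fast above is specific to this one input
# stream. A minibatch of concurrent examples would each need its own W_fast,
# so the weights are no longer a single matrix shared across the whole batch.
```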
2023.06.05 Geoffrey Hinton - Two Paths to Intelligence - YouTube
2023.01.16 Geoff Hinton | Mortal Computers - YouTube
- upload a mind to a computer? Hinton intends to explain why this won’t be possible
- standard computing: permanent memory, precise computation
- ML: weights same for every copy of the model {because the underlying infrastructure is the same}
- aka immortal computing
- built on transistors which consume a lot of power
- expensive for two reasons: power consumption, fabrication precision
- manufacturing plant costs O($xB)
- {total power consumption = [power per transistor] x [N transistors / chip] x [N chips / zPU] x [N zPU / compute unit] x [N compute units]}
- low power analog hardware alternative {inputs/activity defined by voltages, model weights defined by conductance (G = 1/R), output is charge as the integral of current, I = GV} (sketch below)
- {is this low power simply because no transistors required?}
- {relaxed fabrication specs: imprecise fab okay, shift the responsibility for reliable compute to the learning algorithm for the substrate}
- {inherently memory limited because the system isn’t designed to be replicable and, by its nature, isn’t}
- {‘growing’ vs manufacturing: GH doesn’t get into this, but my take is that the key difference is the process’ spurious variations; maybe it’s not noise, but some other distribution}
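A minimal NumPy sketch of the analog multiply-accumulate described in the braces above: activities enter as voltages, weights are conductances G = 1/R, each output wire sums currents I = G·V, and the result is read out as accumulated charge. The integration time and the fabrication-noise model are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_matvec(G_nominal, v, dt=1e-6, device_spread=0.05):
    """Simulate y = (G @ v) * dt on an imprecisely fabricated analog crossbar.

    v             : input voltages, one per input wire (the activities)
    G_nominal     : intended conductances G = 1/R (the weights)
    dt            : integration window; the output is charge Q = I * dt
    device_spread : fractional fabrication error per device (illustrative)
    """
    G_actual = G_nominal * (1 + device_spread * rng.normal(size=G_nominal.shape))
    currents = G_actual @ v            # Ohm's law per device, summed along each output wire
    return currents * dt               # charge accumulated over the window

G = rng.uniform(0.0, 1.0, size=(4, 8))    # conductances play the role of weights
v = rng.uniform(0.0, 1.0, size=8)         # voltages play the role of activities
print(analog_matvec(G, v))

# Every weight-input product happens physically in parallel and at low power,
# but G_actual never equals G_nominal: the learning algorithm, not the fab,
# has to absorb that imprecision -- the "relaxed fabrication specs" point above.
```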
- parallelism at the level of the weights: compute doesn’t need to be fast
- {this implies that the same computational operation is applied to every compute element/neuron}
- {the complexity is shifted away from the model structure to ? (neuron processing operation/transfer function, or ???)}
- {standard DL leverages layer-level parallelism}
Mortal computing
- unlike artificial neural networks, mortal computers can’t use backpropagation: to apply the chain rule, the forward computation must be fully specified
- {this is an inverse problem situation: the processing system (model) doesn’t have a known analytical representation}
- slow, noisy, high variance: perturb the weights -> measure the effect on the objective -> update each weight in proportion to (its perturbation × the improvement) -> new weights (sketch below)
- also, not really parallelizable: applying weight perturbations to batches increases variance even more
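A minimal sketch of weight perturbation on a toy linear regression problem (the task, step size, and noise scale are my own): perturb every weight with small random noise, measure how much the objective improves, and move the weights in proportion to (perturbation × improvement). In expectation this follows the gradient, but the variance grows with the number of weights, which is the "slow, noisy" point above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (illustrative only)
X = rng.normal(size=(64, 10))
w_true = rng.normal(size=10)
y = X @ w_true

def loss(w):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(10)
sigma, lr = 0.01, 0.02
for step in range(50_000):
    delta = sigma * rng.normal(size=w.shape)      # perturb all the weights at once
    improvement = loss(w) - loss(w + delta)       # effect of the perturbation on the objective
    w += lr * improvement * delta / sigma**2      # reinforce perturbations that helped

print(loss(w))   # decreases, but far more slowly and noisily than gradient descent would
```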
- activity perturbation: perturb each neuron’s summed input (activity) with random noise and use the effect on the objective to estimate the gradient with respect to that input (sketch below)
- adding noise to the neurons rather than the weights, and there are thousands of times fewer neurons than weights
- better variance
- works okay for, eg, MNIST
- scaling to large nets is an issue
- {why?}
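A sketch of activity (node) perturbation on a small one-hidden-layer net, following the description above: noise is added to each neuron's summed input rather than to every weight, the change in the per-example loss estimates the gradient with respect to that input, and the weight update is a local outer product with the incoming activity. The toy task and hyperparameters are mine, and the output weights use their exact gradient purely to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonlinear task (illustrative only)
X = rng.normal(size=(128, 20))
y = (X[:, 0] * X[:, 1] > 0).astype(float)
W1 = 0.1 * rng.normal(size=(20, 32))     # hidden weights: trained by activity perturbation
w2 = 0.1 * rng.normal(size=32)           # output weights: exact gradient, for brevity

def forward(W1, w2, noise=0.0):
    pre = X @ W1 + noise                 # each hidden unit's summed input ("activity")
    h = np.tanh(pre)
    out = h @ w2
    return h, out, (out - y) ** 2        # per-example squared error

sigma, lr = 0.01, 0.05
for step in range(5000):
    h, out, base = forward(W1, w2)
    noise = sigma * rng.normal(size=(128, 32))            # one noise value per neuron per example
    _, _, perturbed = forward(W1, w2, noise=noise)
    # Estimated gradient of each example's loss w.r.t. each hidden unit's summed input:
    g_pre = (perturbed - base)[:, None] * noise / sigma**2
    W1 -= lr * X.T @ g_pre / len(y)                       # local outer-product update
    w2 -= lr * 2 * h.T @ (out - y) / len(y)               # exact gradient, for brevity only

# Noise sources scale with the number of neurons (32 here) rather than the number
# of weights (20*32), so the variance is better than weight perturbation -- but it
# still grows with network size, hence the scaling issue noted above.
```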
- one approach is to develop objective functions that work well with fewer parameters
- eg: one objective function for a smaller subgroup of neurons (local objective functions)
- {? isn’t this like having a bunch of narrower AIs side by side?}
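One concrete (and standard) reading of "local objective functions", chosen by me rather than taken from the talk: split the hidden units into small groups and give each group its own readout and loss, so no gradient ever has to cross group boundaries. Within each group the sketch uses exact gradients just to stay short; in the mortal-computing setting those local updates would themselves come from something like activity perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task (illustrative only)
X = rng.normal(size=(256, 20))
y = (X[:, :2].sum(axis=1) > 0).astype(float)

n_groups, group_size, lr = 4, 8, 0.1
W = [0.1 * rng.normal(size=(20, group_size)) for _ in range(n_groups)]   # one weight block per group
V = [0.1 * rng.normal(size=group_size) for _ in range(n_groups)]         # one local readout per group

for step in range(2000):
    for g in range(n_groups):
        h = np.tanh(X @ W[g])                    # this group's units only
        err = h @ V[g] - y                       # local objective: this group alone predicts y
        # All credit assignment stays inside the group: no chain rule across groups.
        V[g] -= lr * h.T @ err / len(y)
        W[g] -= lr * X.T @ (np.outer(err, V[g]) * (1 - h**2)) / len(y)

# Each group ends up as a small predictor in its own right, which is essentially
# the question in the note above: without objectives that force the groups to be
# complementary, this does look like several narrower models sitting side by side.
```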
- memory-limited + low power consumption
- knowledge transfer {(how?) or start from scratch?}
- {if the latter: 1. how long to catch up? 2. timescale for keeping up? (i.e., knowledge latency: pragmatic limit vs target spec?)}
- start by considering knowledge transfer within the system, between local patches
- in computer vision, share knowledge across patches via convolution {(convolve each patch with some shared knowledge)} or transformers {(? why mentioned in the context of CV, specifically? link to attention)}
- {works because all patches are built on the same substrate}
- for imperfect and dissimilar substrates, use knowledge distillation
- align local patches’ feature vectors (extract similarities, or transformations that make different feature vectors agree on a prediction)
- this is useful as local patch (module) receptive fields (i.e., their inputs) can have a different number of sensors {(distinct measurements)} with different spacing {(resolution)} (no regular grid required)
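A sketch of the "agree on a prediction" route for dissimilar patches (the weight-sharing/convolution route needs identical substrates, per the note above). Two patch modules with different numbers of sensors observe the same underlying scene; patch A has labels, patch B learns only by matching A's predicted distribution. The scene model, sensor counts, and training details are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# A shared underlying scene: a class-dependent latent vector plus noise.
n, n_classes = 200, 5
latent = rng.normal(size=(n_classes, 16))
labels = rng.integers(0, n_classes, size=n)
scene = latent[labels] + 0.3 * rng.normal(size=(n, 16))

# Each patch senses the scene through its own fixed sensors: different counts,
# different "spacing", no shared grid, no shared weights.
x_a = scene @ rng.normal(size=(16, 12))          # patch A: 12 sensors
x_b = scene @ rng.normal(size=(16, 7))           # patch B: 7 sensors
W_a = 0.1 * rng.normal(size=(12, n_classes))
W_b = 0.1 * rng.normal(size=(7, n_classes))
onehot = np.eye(n_classes)[labels]

lr = 0.2
for step in range(1000):
    p_a = softmax(x_a @ W_a)
    p_b = softmax(x_b @ W_b)
    W_a -= lr * x_a.T @ (p_a - onehot) / n       # patch A learns from the labels
    W_b -= lr * x_b.T @ (p_b - p_a) / n          # patch B learns only by agreeing with A

print((p_b.argmax(axis=1) == labels).mean())     # B picks up A's knowledge through agreement alone
```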
- knowledge transfer between mortal computers (distillation)
- classification task: source / teacher provides answer plus relative probabilities of wrong answers
- compute task: consensus of mortal computers on which additional outputs will {sufficiently represent the current knowledge-holder’s transfer function, transferring not just specific knowledge but how the origin (mortal) computer thinks}
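To make the "relative probabilities of wrong answers" point (two bullets up) concrete: in Hinton-style distillation the teacher's logits are softened with a temperature so the near-zero probabilities of wrong answers become visible, and the student is trained on that softened distribution (usually alongside an ordinary loss on the true labels). The logits below are made-up numbers for illustration:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

teacher_logits = np.array([[9.0, 4.0, 1.0, 0.5]])   # made-up logits for one example
hard = softmax(teacher_logits, T=1.0)               # ~[0.993, 0.007, ...]: wrong answers carry almost no signal
soft = softmax(teacher_logits, T=4.0)               # ~[0.65, 0.19, 0.09, 0.08]: their relative probabilities appear

student_logits = np.array([[2.0, 1.0, 0.5, 0.2]])
student = softmax(student_logits, T=4.0)

# Distillation loss: cross-entropy between the teacher's softened distribution
# and the student's, evaluated at the same temperature.
distill_loss = -(soft * np.log(student)).sum()
print(hard.round(3), soft.round(3), round(distill_loss, 3))
```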
- black boxes: he’s embracing transfer functions for mortal computers
- this is in contrast to explainable models {which are deconstructed into components and the model transfer function analysed in terms of components’ individual transfer functions}
- computers that are more brain-like: neuromorphic hardware
- don’t have this yet because a suitable general-purpose learning algorithm hasn’t yet been devised
- the brain must have such a procedure, but it hasn’t been discovered yet
2014.11.07 Hinton: 2014 Reddit AMA
Yann LeCun
Francois Chollet
2024.06.11 Francois Chollet - LLMs won’t lead to AGI - $1,000,000 Prize to find true solution - YouTube
Ilya Sutskever
2023.03.27 Ilya Sutskever - Building AGI, Alignment, Spies, Microsoft, & Enlightenment - YouTube
Andrej Karpathy
2024.09.05 No Priors Ep. 80 | With Andrej Karpathy from OpenAI and Tesla - YouTube
AI safety
AI: commercial development
2024.05.15 John Schulman - Reasoning, RLHF, & Plan for 2027 AGI - YouTube