🌱 conversations (any format) + prompts for conversation/contemplation 🌱
Researchers
Geoffrey Hinton
2024.05.20 Geoffrey Hinton | On working with Ilya, choosing problems, and the power of intuition - YouTube
- multiple timescales: temporary, input-dependent changes to the weights (fast weights) conflict with parallelization (multiple inputs processed concurrently for efficient training), since each concurrent example would need its own copy of the fast weights (see sketch below)
- graph call
- sequential / online learning would be needed; may be solved when conductances are used for weights
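A minimal NumPy sketch of the fast-weights idea (in the spirit of Ba, Hinton et al. 2016) to make the parallelization conflict concrete; the sizes, decay rate, and update details here are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                           # hidden size (arbitrary)
W_slow = 0.1 * rng.normal(size=(d, d))          # slow weights: shared by every example
W_fast = np.zeros((d, d))                       # fast weights: temporary, input-dependent
decay, eta = 0.9, 0.5                           # fast-weight decay and learning rate

def step(x, h_prev):
    """One recurrent step whose effective weights include a fast, Hebbian component."""
    global W_fast
    h = np.tanh(W_slow @ h_prev + W_fast @ h_prev + x)
    W_fast = decay * W_fast + eta * np.outer(h, h)   # quickly overwritten by recent activity
    return h

h = np.zeros(d)
for t in range(5):
    h = step(rng.normal(size=d), h)

# The conflict with parallel training: W_fast above is specific to this one input
# stream. A minibatch of concurrent examples would each need its own W_fast,
# so the weights are no longer a single matrix shared across the whole batch.
```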
2023.06.05 Geoffrey Hinton - Two Paths to Intelligence - YouTube
2023.01.16 Geoff Hinton | Mortal Computers - YouTube
- upload a mind to a computer? Hinton intends to explain why this won’t be possible
- standard computing: permanent memory, precise computation
- ML: weights same for every copy of the model {because the underlying infrastructure is the same}
- aka immortal computing
- built on transistors which consume a lot of power
- expensive for two reasons: power consumption, fabrication precision
- manufacturing plant costs O($xB)
- {total power consumption = [power per transistor] x [N transistors / chip] x [N chips / zPU] x [N zPU / compute unit] x [N compute units]}
- low power analog hardware alternative {inputs/activity defined by voltages, model weights defined by conductance (G = 1/R), output is charge as the integral of current, I = GV} (sketch below)
- {is this low power simply because no transistors required?}
- {relaxed fabrication specs: imprecise fab okay, shift the responsibility for reliable compute to the learning algorithm for the substrate}
- {inherently memory limited because the system isn’t designed to be replicable and, by its nature, isn’t}
- {‘growing’ vs manufacturing: GH doesn’t get into this, but my take is that the key difference is the process’ spurious variations; maybe it’s not noise, but some other distribution}
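A minimal NumPy sketch of the analog multiply-accumulate described in the braces above: activities enter as voltages, weights are conductances G = 1/R, each output wire sums currents I = G·V, and the result is read out as accumulated charge. The integration time and the fabrication-noise model are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_matvec(G_nominal, v, dt=1e-6, device_spread=0.05):
    """Simulate y = (G @ v) * dt on an imprecisely fabricated analog crossbar.

    v             : input voltages, one per input wire (the activities)
    G_nominal     : intended conductances G = 1/R (the weights)
    dt            : integration window; the output is charge Q = I * dt
    device_spread : fractional fabrication error per device (illustrative)
    """
    G_actual = G_nominal * (1 + device_spread * rng.normal(size=G_nominal.shape))
    currents = G_actual @ v            # Ohm's law per device, summed along each output wire
    return currents * dt               # charge accumulated over the window

G = rng.uniform(0.0, 1.0, size=(4, 8))    # conductances play the role of weights
v = rng.uniform(0.0, 1.0, size=8)         # voltages play the role of activities
print(analog_matvec(G, v))

# Every weight-input product happens physically in parallel and at low power,
# but G_actual never equals G_nominal: the learning algorithm, not the fab,
# has to absorb that imprecision -- the "relaxed fabrication specs" point above.
```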
- parallelism at the level of the weights: compute doesn’t need to be fast
- {this implies that the same computational operation is applied to every compute element/neuron}
- {the complexity is shifted away from the model structure to ? (neuron processing operation/transfer function, or ???)}
- {standard DL leverages layer-level parallelism}
Mortal computing
- unlike artificial neural networks, mortal computers can’t use backpropagation: to apply the chain rule, the forward computation must be fully specified
- {this is an inverse problem situation: the processing system (model) doesn’t have a known analytical representation}
- slow, noisy, high variance: perturb the weights -> measure the effect on the objective -> update each weight in proportion to (its perturbation × the improvement) -> new weights (sketch below)
- also, not really parallelizable: applying weight perturbations to batches increases variance even more
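A minimal sketch of weight perturbation on a toy linear regression problem (the task, step size, and noise scale are my own): perturb every weight with small random noise, measure how much the objective improves, and move the weights in proportion to (perturbation × improvement). In expectation this follows the gradient, but the variance grows with the number of weights, which is the "slow, noisy" point above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (illustrative only)
X = rng.normal(size=(64, 10))
w_true = rng.normal(size=10)
y = X @ w_true

def loss(w):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(10)
sigma, lr = 0.01, 0.02
for step in range(50_000):
    delta = sigma * rng.normal(size=w.shape)      # perturb all the weights at once
    improvement = loss(w) - loss(w + delta)       # effect of the perturbation on the objective
    w += lr * improvement * delta / sigma**2      # reinforce perturbations that helped

print(loss(w))   # decreases, but far more slowly and noisily than gradient descent would
```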
- activity perturbation: perturb each neuron’s summed input (activity) with random noise and use the effect on the objective to estimate the gradient with respect to that input (sketch below)
- adding noise to the neurons rather than the weights, and there are thousands of times fewer neurons than weights
- better variance
- works okay for, eg, MNIST
- scaling to large nets is an issue
- {why?}
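A sketch of activity (node) perturbation on a small one-hidden-layer net, following the description above: noise is added to each neuron's summed input rather than to every weight, the change in the per-example loss estimates the gradient with respect to that input, and the weight update is a local outer product with the incoming activity. The toy task and hyperparameters are mine, and the output weights use their exact gradient purely to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonlinear task (illustrative only)
X = rng.normal(size=(128, 20))
y = (X[:, 0] * X[:, 1] > 0).astype(float)
W1 = 0.1 * rng.normal(size=(20, 32))     # hidden weights: trained by activity perturbation
w2 = 0.1 * rng.normal(size=32)           # output weights: exact gradient, for brevity

def forward(W1, w2, noise=0.0):
    pre = X @ W1 + noise                 # each hidden unit's summed input ("activity")
    h = np.tanh(pre)
    out = h @ w2
    return h, out, (out - y) ** 2        # per-example squared error

sigma, lr = 0.01, 0.05
for step in range(5000):
    h, out, base = forward(W1, w2)
    noise = sigma * rng.normal(size=(128, 32))            # one noise value per neuron per example
    _, _, perturbed = forward(W1, w2, noise=noise)
    # Estimated gradient of each example's loss w.r.t. each hidden unit's summed input:
    g_pre = (perturbed - base)[:, None] * noise / sigma**2
    W1 -= lr * X.T @ g_pre / len(y)                       # local outer-product update
    w2 -= lr * 2 * h.T @ (out - y) / len(y)               # exact gradient, for brevity only

# Noise sources scale with the number of neurons (32 here) rather than the number
# of weights (20*32), so the variance is better than weight perturbation -- but it
# still grows with network size, hence the scaling issue noted above.
```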
- one approach is to develop objective functions that work well with fewer parameters
- eg: one objective function for a smaller subgroup of neurons (local objective functions)
- {? isn’t this like having a bunch of narrower AIs side by side?}
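One concrete (and standard) reading of "local objective functions", chosen by me rather than taken from the talk: split the hidden units into small groups and give each group its own readout and loss, so no gradient ever has to cross group boundaries. Within each group the sketch uses exact gradients just to stay short; in the mortal-computing setting those local updates would themselves come from something like activity perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task (illustrative only)
X = rng.normal(size=(256, 20))
y = (X[:, :2].sum(axis=1) > 0).astype(float)

n_groups, group_size, lr = 4, 8, 0.1
W = [0.1 * rng.normal(size=(20, group_size)) for _ in range(n_groups)]   # one weight block per group
V = [0.1 * rng.normal(size=group_size) for _ in range(n_groups)]         # one local readout per group

for step in range(2000):
    for g in range(n_groups):
        h = np.tanh(X @ W[g])                    # this group's units only
        err = h @ V[g] - y                       # local objective: this group alone predicts y
        # All credit assignment stays inside the group: no chain rule across groups.
        V[g] -= lr * h.T @ err / len(y)
        W[g] -= lr * X.T @ (np.outer(err, V[g]) * (1 - h**2)) / len(y)

# Each group ends up as a small predictor in its own right, which is essentially
# the question in the note above: without objectives that force the groups to be
# complementary, this does look like several narrower models sitting side by side.
```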
- memory-limited + low power consumption
- knowledge transfer {(how?) or start from scratch?}
- {if the latter: 1. how long to catch up? 2. timescale for keeping up? (i.e., knowledge latency: pragmatic limit vs target spec?)}
- start by considering knowledge transfer within the system, between local patches
- in computer vision, share knowledge across patches via convolution {(convolve each patch with some shared knowledge)} or transformers {(? why mentioned in the context of CV, specifically? link to attention)}
- {works because all patches are built on the same substrate}
- for imperfect and dissimilar substrates, use knowledge distillation
- align local patches’ feature vectors (extract similarities, or transformations that make different feature vectors agree on a prediction)
- this is useful as local patch (module) receptive fields (i.e., their inputs) can have a different number of sensors {(distinct measurements)} with different spacing {(resolution)} (no regular grid required)
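A sketch of the "agree on a prediction" route for dissimilar patches (the weight-sharing/convolution route needs identical substrates, per the note above). Two patch modules with different numbers of sensors observe the same underlying scene; patch A has labels, patch B learns only by matching A's predicted distribution. The scene model, sensor counts, and training details are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# A shared underlying scene: a class-dependent latent vector plus noise.
n, n_classes = 200, 5
latent = rng.normal(size=(n_classes, 16))
labels = rng.integers(0, n_classes, size=n)
scene = latent[labels] + 0.3 * rng.normal(size=(n, 16))

# Each patch senses the scene through its own fixed sensors: different counts,
# different "spacing", no shared grid, no shared weights.
x_a = scene @ rng.normal(size=(16, 12))          # patch A: 12 sensors
x_b = scene @ rng.normal(size=(16, 7))           # patch B: 7 sensors
W_a = 0.1 * rng.normal(size=(12, n_classes))
W_b = 0.1 * rng.normal(size=(7, n_classes))
onehot = np.eye(n_classes)[labels]

lr = 0.2
for step in range(1000):
    p_a = softmax(x_a @ W_a)
    p_b = softmax(x_b @ W_b)
    W_a -= lr * x_a.T @ (p_a - onehot) / n       # patch A learns from the labels
    W_b -= lr * x_b.T @ (p_b - p_a) / n          # patch B learns only by agreeing with A

print((p_b.argmax(axis=1) == labels).mean())     # B picks up A's knowledge through agreement alone
```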
- knowledge transfer between mortal computers (distillation)
- classification task: source / teacher provides answer plus relative probabilities of wrong answers
- compute task: consensus of mortal computers on which additional outputs will {sufficiently represent the current knowledge-holder’s transfer function, transferring not just specific knowledge but how the origin (mortal) computer thinks}
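To make the "relative probabilities of wrong answers" point (two bullets up) concrete: in Hinton-style distillation the teacher's logits are softened with a temperature so the near-zero probabilities of wrong answers become visible, and the student is trained on that softened distribution (usually alongside an ordinary loss on the true labels). The logits below are made-up numbers for illustration:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

teacher_logits = np.array([[9.0, 4.0, 1.0, 0.5]])   # made-up logits for one example
hard = softmax(teacher_logits, T=1.0)               # ~[0.993, 0.007, ...]: wrong answers carry almost no signal
soft = softmax(teacher_logits, T=4.0)               # ~[0.65, 0.19, 0.09, 0.08]: their relative probabilities appear

student_logits = np.array([[2.0, 1.0, 0.5, 0.2]])
student = softmax(student_logits, T=4.0)

# Distillation loss: cross-entropy between the teacher's softened distribution
# and the student's, evaluated at the same temperature.
distill_loss = -(soft * np.log(student)).sum()
print(hard.round(3), soft.round(3), round(distill_loss, 3))
```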
- black boxes: he’s embracing transfer functions for mortal computers
- this is in contrast to explainable models {which are deconstructed into components and the model transfer function analysed in terms of components’ individual transfer functions}
- computers that are more brain-like: neuromorphic hardware
- don’t have this yet because a suitable general-purpose learning algorithm hasn’t yet been devised
- the brain must have such a procedure, but it hasn’t been discovered yet
2014.11.07 Hinton: 2014 Reddit AMA
Yann LeCun
Francois Chollet
2024.06.11 Francois Chollet - LLMs won’t lead to AGI - $1,000,000 Prize to find true solution - YouTube
Ilya Sutskever
2023.03.27 Ilya Sutskever - Building AGI, Alignment, Spies, Microsoft, & Enlightenment - YouTube
Andrej Karpathy
2024.09.05 No Priors Ep. 80 | With Andrej Karpathy from OpenAI and Tesla - YouTube
AI safety
AI: commercial development
2024.05.15 John Schulman - Reasoning, RLHF, & Plan for 2027 AGI - YouTube