Safe AI

Machine learning (ML) systems trained on very broad data sets for general-purpose learning are advancing rapidly, primarily due to investments in compute, data, and energy. Artificial general intelligence (AGI) serves as a rough benchmark of ML capability: when a computing system can produce knowledge-work outputs comparable to a human’s across a wide range of knowledge areas, it would be considered AGI.

Computers achieving that degree of ability raise the possibility of both relieving humans of work[1] and exceeding human capacity for producing work. An AI’s compute speed lets it process data faster than a human can, and the way an artificial neural net represents information potentially lets the AI hold more ideas ‘in its head’ simultaneously, so it can draw on more information while ‘thinking’.

One might imagine that risks would accompany an artificial neural network with such supercapabilities, particularly if it is given agency, or if it figures out how to assume agency on its own. Even if the leap to AGI takes years (or never comes to pass), current ML systems already carry a subset of AGI’s sizable barrel of risks. Generally speaking, the field of AI safety addresses the quantification and mitigation of those risks, sometimes for machine learning as it currently exists and sometimes for AGI.

AI alignment

Generally, alignment in this context can refer to two different things:

  1. model output alignment with user intent
  2. model alignment with human values

The second, aligning with human values, is considerably trickier to quantify. Not only is there ambiguity about what those values are and whether they’re fully represented, but it is also non-trivial to evaluate whether the representation of those values in the model’s loss function(s) is accurate, complete, and robust.

IRL experiences

thebes on X: "is an llm agent aligned just because the llm is? not necessarily"

OpenAI's new models "instrumentally faked alignment"

The Fallacy of AI Functionality

Scaling

OpenAI’s Strawberry and inference scaling laws

AI Capabilities Can Be Significantly Improved Without Expensive Retraining | Epoch AI

How Scaling became a Local Optima in Deep Learning [Markets]

Model evaluation

On the Measure of Intelligence - Chollet

2024.09 Expert questions for model eval:

Current/2024:

Leaderboards:

Watch for low-quality evals such as MMLU and HumanEval, per Jim Fan (NVIDIA):

“I would not trust any claims of a superior model until I see the following: 1. ELO points on LMSys Chatbot Arena. It’s difficult to game democracy in the wild. 2. Private LLM evaluation from a trusted 3rd party, such as Scale AI’s benchmark. The test set must be well-curated and held secret, otherwise it quickly loses potency.”
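For a sense of what “ELO points” mean in the quote above, here is a minimal sketch of the classic Elo update applied to pairwise votes between two models. The function names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions, not the leaderboard’s actual settings, and the leaderboard’s real methodology may differ from this textbook formula.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.5 for a tie, 0.0 if B was preferred.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_from_votes(votes, initial: float = 1000.0, k: float = 32.0):
    """Fold a sequence of (model_a, model_b, score_a) votes into ratings."""
    ratings = {}
    for a, b, score_a in votes:
        ra = ratings.get(a, initial)
        rb = ratings.get(b, initial)
        ratings[a], ratings[b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes: model_x preferred twice, then one tie.
    votes = [("model_x", "model_y", 1.0),
             ("model_x", "model_y", 1.0),
             ("model_x", "model_y", 0.5)]
    print(rate_from_votes(votes))
```

Because each vote moves both ratings by the same amount in opposite directions, a model only climbs by being preferred over its opponents in many independent comparisons, which is part of why crowd-sourced ratings are harder to game than a fixed public test set.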

Also, see: Dangerous capability tests should be harder

Introducing SimpleQA | OpenAI

Interpretability

Multimodal interpretability in 2024

On the Interpretability of Artificial Intelligence in Radiology: Challenges and Opportunities | Radiology: Artificial Intelligence

Do All AI Systems Need to Be Explainable?

Self-explaining SAE features — LessWrong

JShollaj/awesome-llm-interpretability: A curated list of Large Language Model (LLM) Interpretability resources.

[2408.07852] Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

Mechanistic Interpretability:

Representation engineering:

Bias

A look at Bias in Generative AI [Thoughts] - by Devansh

Intended model bias aka representation engineering

Oversight & standards

Lessons from the FDA for AI - AI Now Institute

Governing General Purpose AI — A Comprehensive Map of Unreliability, Misuse and Systemic Risks

Etc.

My recent lecture at Berkeley and a vision for a “CERN for AI”

Should AI Progress Speed Up, Slow Down, or Stay the Same?

Yann LeCun - A Path Towards Autonomous Machine Intelligence

Reasoning through arguments against taking AI safety seriously: Yoshua Bengio 2024.07.09

Towards a Cautious Scientist AI with Convergent Safety Bounds: Yoshua Bengio 2024.02.26

ADD / XOR / ROL: Someone is wrong on the internet (AGI Doom edition)

Conversations -


Defining AGI

The Turing Test and our shifting conceptions of intelligence | Science

Karpathy tweet 2024.03.18

Reasoning

RAG

Vectors and Graphs: Better Together - Graph Database & Analytics

Accepting not-G AI

Setting boundaries

Jaana Dogan: LLMs are tools to navigate a corpus based on a very biased and authoritative prompt.

Model size

Larger and more instructable language models become less reliable | Nature

Project Analyzing Human Language Usage Shuts Down Because ‘Generative AI Has Polluted the Data’

Alexandr Wang: War, AI and the new global arms race | TED Talk

[2409.14160] Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI

Security

AI against Censorship: Genetic Algorithms, The Geneva Project, ML in Security, and more! - YouTube

censorship.ai | About Geneva

Power dynamics: control over AI

[2312.06942] AI Control: Improving Safety Despite Intentional Subversion

Can You Trust An AI Press Release?—Asterisk

On Civilizational Triumph - by Dean W. Ball

x-risk

Dario Amodei — Machines of Loving Grace

Situational awareness: The Decade Ahead (Aschenbrenner 2024)

A.I. Pioneers Call for Protections Against ‘Catastrophic Risks’ - The New York Times

Organizations focusing on AI safety

AI Safety Awareness Foundation

U.S. Artificial Intelligence Safety Institute | NIST

The AI Safety Institute (AISI)

Stanford AI Safety

Center for AI Safety (CAIS)

Research – Epoch AI

FAR AI

AI • Objectives • Institute

Center for Human-Compatible Artificial Intelligence – Center for Human-Compatible AI is building exceptional AI for humanity

Why AI Safety? - Machine Intelligence Research Institute

Redwood Research

AI safety research training

ARENA

MATS Program

AI Security

HydroX AI | Advanced AI Safety and Security Solutions

Prediction discussions

The phony comforts of AI skepticism


Mission critical systems & ML

Medical

Reconciling privacy and accuracy in AI for medical imaging | Nature Machine Intelligence

Sharing Data Is Essential for the Future of AI in Medical Imaging | Radiology: Artificial Intelligence

Keeping Patient Data Secure in the Age of Radiology Artificial Intelligence: Cybersecurity Considerations and Future Directions - ScienceDirect

Security Considerations for AI in Radiology

All models are wrong and yours are useless: making clinical prediction models impactful for patients


Demographics of contributors to ML, AI, AGI, and AI safety

Where are all the ‘godmothers’ of AI? Women’s voices are not being heard | Luba Kassova | The Guardian

Artificial Intelligence and gender equality | UN Women – Headquarters


  1. Of which, presumably, at least some is work no one actually wants to do, though even that seems a tricky deal to strike just right.