Safe AGI

Machine learning systems optimized for generalized learning by training on very broad data sets are advancing rapidly, driven primarily by investments in compute and data. Artificial general intelligence (AGI) captures the idea of a computing system capable of producing knowledge-work outputs comparable to those of a human, without being constrained to specific fields or skills.

The benefits of such a system are obvious - as are the risks. The field of AI safety addresses the quantification and mitigation of those risks.

AI alignment

In this context, alignment generally refers to one of two things:

  1. alignment of model outputs with user intent
  2. alignment of the model with human values

Clearly, the second is much harder to quantify. Not only is there ambiguity about what those values might be and whether the values represented in the model are comprehensive, but it is also non-trivial to evaluate whether their representation in the model’s loss function(s) is accurate, complete, and robust.
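
For concreteness, one common way human preference signals get folded into a loss is a pairwise (Bradley-Terry style) preference objective, as used in RLHF reward modeling; the minimal sketch below is illustrative only, with made-up scores and no particular model or lab's implementation implied:

    # Illustrative sketch: pairwise preference loss over two reward-model scores
    # for the same prompt (chosen vs. rejected response). Scores are invented.
    import math

    def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
        """Negative log-likelihood that the human-preferred response scores higher."""
        margin = reward_chosen - reward_rejected
        return -math.log(1.0 / (1.0 + math.exp(-margin)))

    # A well-separated pair yields a small loss; a mis-ordered pair a large one.
    print(preference_loss(2.0, -1.0))   # ~0.049
    print(preference_loss(-1.0, 2.0))   # ~3.049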

Model evaluation

2024.09 Expert questions for model eval:

Current/2024:

Leaderboards:

Watch out for low-quality evals such as MMLU and HumanEval; per Jim Fan (NVIDIA):

I would not trust any claims of a superior model until I see the following:

  1. ELO points on LMSys Chatbot Arena. It’s difficult to game democracy in the wild.
  2. Private LLM evaluation from a trusted 3rd party, such as Scale AI’s benchmark. The test set must be well-curated and held secret, otherwise it quickly loses potency.
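
Behind point 1: arena-style leaderboards rank models with Elo-style ratings updated from head-to-head user judgments. A minimal sketch of the standard Elo update follows; the K-factor, ratings, and outcome are illustrative values, not LMSys Chatbot Arena's actual parameters:

    # Sketch of a standard Elo update: two models are compared by a user, and
    # ratings shift in proportion to how surprising the outcome was.
    def expected_score(rating_a: float, rating_b: float) -> float:
        """Probability that A beats B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

    def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32):
        """score_a: 1.0 if A wins the comparison, 0.0 if it loses, 0.5 for a tie."""
        delta = k * (score_a - expected_score(rating_a, rating_b))
        return rating_a + delta, rating_b - delta

    # Example: an 1100-rated model upsetting a 1250-rated one gains more points
    # than it would for beating an equal-rated opponent.
    print(elo_update(1100, 1250, score_a=1.0))  # -> (~1122.5, ~1227.5)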

Also, see: Dangerous capability tests should be harder

Interpretability

On the Interpretability of Artificial Intelligence in Radiology: Challenges and Opportunities | Radiology: Artificial Intelligence

Mechanistic Interpretability for AI Safety - A Review

Do All AI Systems Need to Be Explainable?

Self-explaining SAE features — LessWrong

JShollaj/awesome-llm-interpretability: A curated list of Large Language Model (LLM) Interpretability resources.

[2408.07852] Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

Oversight & standards

Lessons from the FDA for AI - AI Now Institute

Governing General Purpose AI — A Comprehensive Map of Unreliability, Misuse and Systemic Risks

Etc.

Yann LeCun - A Path Towards Autonomous Machine Intelligence

Reasoning through arguments against taking AI safety seriously: Yoshua Bengio 2024.07.09

Towards a Cautious Scientist AI with Convergent Safety Bounds: Yoshua Bengio 2024.02.26

ADD / XOR / ROL: Someone is wrong on the internet (AGI Doom edition)


Defining AGI

The Turing Test and our shifting conceptions of intelligence | Science

Karpathy tweet 2024.03.18

Reasoning

RAG

Vectors and Graphs: Better Together - Graph Database & Analytics
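
As a concrete illustration of the vectors-plus-graphs idea, the sketch below runs a naive vector retrieval and then follows graph links from the top hits to assemble prompt context. The toy bag-of-words embedding, corpus, and link graph are invented for the example; a real system would use learned embeddings and a proper vector/graph store:

    # Illustrative RAG sketch: vector similarity retrieval plus one graph hop.
    from collections import Counter
    from math import sqrt

    DOCS = {
        "d1": "graph databases store relationships between entities",
        "d2": "vector search retrieves documents by embedding similarity",
        "d3": "retrieval augmented generation grounds llm answers in retrieved text",
    }
    LINKS = {"d1": ["d2"], "d2": ["d3"], "d3": []}  # toy citation/association graph

    def embed(text: str) -> Counter:
        # Toy bag-of-words "embedding"; stands in for a learned embedding model.
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def retrieve(query: str, k: int = 1) -> list:
        q = embed(query)
        ranked = sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)[:k]
        # Graph hop: also pull in documents linked from the top vector hits.
        return ranked + [n for d in ranked for n in LINKS.get(d, []) if n not in ranked]

    context = "\n".join(DOCS[d] for d in retrieve("embedding similarity search"))
    print(context)  # context that would be prepended to the LLM prompt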

Accepting not-G AI

Setting boundaries

Jaana Dogan: LLMs are tools to navigate a corpus based on a very biased and authoritative prompt.

Model size

Larger and more instructable language models become less reliable | Nature

Project Analyzing Human Language Usage Shuts Down Because ‘Generative AI Has Polluted the Data’

Alexandr Wang: War, AI and the new global arms race | TED Talk

[2409.14160] Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI

Security

AI against Censorship: Genetic Algorithms, The Geneva Project, ML in Security, and more! - YouTube

censorship.ai | About Geneva

Power dynamics: control over AI

Can You Trust An AI Press Release? - Asterisk

x-risk

Situational Awareness: The Decade Ahead (Aschenbrenner, 2024)

A.I. Pioneers Call for Protections Against ‘Catastrophic Risks’ - The New York Times

Organizations focusing on AI safety

U.S. Artificial Intelligence Safety Institute | NIST

The AI Safety Institute (AISI)

Stanford AI Safety

Center for AI Safety (CAIS)

Research – Epoch AI

FAR AI

AI • Objectives • Institute

Center for Human-Compatible Artificial Intelligence (CHAI)

Why AI Safety? - Machine Intelligence Research Institute

Redwood Research

AI safety research training

ARENA

MATS Program

AI Security

HydroX AI | Advanced AI Safety and Security Solutions