Scalable Oversight
Methods to monitor and correct AI systems that exceed human capabilities: how do you oversee something smarter than yourself?
Scalable Oversight = How do humans oversee AI that is smarter than they are? Approaches: AI-assisted evaluation, debate, recursive reward modeling. One of the most important open problems in AI safety.
Explanation
Approaches include: AI-assisted evaluation (weaker AIs help evaluate stronger ones), debate (two AIs argue opposing answers while a human judges the exchange), recursive reward modeling (agents trained with human-approved reward models help evaluate the next, more capable agent), and interpretability tools.
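The debate approach can be sketched in a few lines. This is a hedged, minimal illustration, not a real implementation: `debater` and `human_judge` are hypothetical stand-ins for model calls and human judgment, and the judging logic here is a placeholder.

```python
def debater(position: str, transcript: list[str], round_num: int) -> str:
    """Stand-in for a model generating an argument for its assigned answer."""
    return f"[round {round_num}] argument for '{position}'"

def human_judge(transcript: list[str]) -> int:
    """Stand-in for a human reading the debate; returns the winning side (0 or 1)."""
    # A real judge weighs the arguments; this placeholder always picks side 0.
    return 0

def run_debate(question: str, answers: tuple[str, str], rounds: int = 3) -> str:
    """Two AIs argue for opposing answers; a human judges the transcript.

    The key idea: judging which argument holds up is easier for the human
    than answering the question directly, so oversight can scale with
    AI capability.
    """
    transcript = [f"Question: {question}"]
    for r in range(1, rounds + 1):
        for side, answer in enumerate(answers):
            transcript.append(f"Debater {side}: {debater(answer, transcript, r)}")
    winner = human_judge(transcript)
    return answers[winner]

result = run_debate("Is the bridge design safe?", ("yes", "no"))
```

The human only ever performs the (easier) task of judging arguments; the hard question itself is delegated to the competing debaters.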
Marketing Relevance
As AI systems become more capable, direct human oversight becomes harder: a human cannot reliably check outputs they could not have produced themselves. Scalable oversight is one of the most important open problems in AI safety.
Common Pitfalls
No approach has been proven safe. AI-assisted evaluation can inherit the same blind spots as the models it evaluates. Debate is susceptible to manipulation: a persuasive but wrong argument can win over a human judge.
Origin & History
Amodei et al. (2016, OpenAI) defined the problem in "Concrete Problems in AI Safety". AI Safety via Debate (Irving et al., 2018) and recursive reward modeling (Leike et al., 2018) were early proposed approaches. Anthropic and OpenAI actively research this area.
Comparisons & Differences
Scalable Oversight vs. Human-in-the-Loop
HITL works when a human can understand and directly check the AI's outputs; scalable oversight is needed once the AI's capabilities exceed what a human can directly evaluate.
Scalable Oversight vs. RLAIF
RLAIF is one practical scalable oversight technique; scalable oversight is the broader research field that contains it.