Projects

Each project moves through four stages: de-risk, core validation, scaling up, and application.

Completed

A theoretical account showing that self-improvement in language models can be understood as coherence optimization over behavioral basins. It characterizes when bootstrap-based elicitation methods work and explains how emergent generalization arises.

Ongoing
Unsupervised Martingale Training for Removal of Belief Entrenchment

Developing RL-based training interventions that minimize Martingale deviation in LLM reasoning, removing confirmation bias without requiring ground-truth labels or controlled conditions.
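
As a concrete illustration of the quantity being minimized: for a rational Bayesian reasoner, the sequence of credences assigned to a proposition forms a martingale, so one-step belief updates should average to zero. The sketch below scores a trajectory by its mean drift; the function name and this particular statistic are illustrative assumptions, not the project's actual definition.

```python
import numpy as np

def martingale_deviation(credences: list[float]) -> float:
    """Score a belief trajectory by its systematic drift.

    `credences` holds the probabilities p_1, ..., p_T that a model
    assigns to a fixed proposition at successive reasoning steps.
    For an unbiased (martingale) updater the expected next credence
    equals the current one, so increments average to zero; a mean
    increment far from zero signals entrenchment-like drift.
    """
    increments = np.diff(np.asarray(credences, dtype=float))
    return float(increments.mean())

# An entrenched trajectory drifts steadily; a calibrated one does not.
print(martingale_deviation([0.55, 0.62, 0.70, 0.78, 0.85]))  # ~0.075
print(martingale_deviation([0.55, 0.50, 0.58, 0.53, 0.56]))  # ~0.0025
```

Note that the statistic compares the model's credences only with its own later credences, which is why no ground-truth labels or controlled conditions are needed.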

Core validation
Learning Agents That Seek Human Reflective Equilibrium

Training AI to guide people toward their reflective equilibrium: the set of beliefs they would hold after careful, iterative reflection on challenges to their current views. The method avoids entrenching instrumental preferences by targeting globally stable belief states through adversarial minimax training.
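
One way to picture the objective is as a minimax problem; the symbols below (the guided belief state b_theta, the challenge set C, the reflection operator R, and the divergence D) are illustrative notation, not the project's formalism.

```latex
% b_\theta : the belief state the agent guides the person toward
% C        : the set of admissible challenges to those beliefs
% R(b, c)  : the person's beliefs after reflecting on challenge c
% D        : a divergence between belief states
% A reflective equilibrium is a belief state no challenge can move:
\min_{\theta} \; \max_{c \in C} \; D\bigl( R(b_\theta, c),\, b_\theta \bigr)
```

Reading: the adversary searches for the challenge that would most change the person's beliefs on reflection, and training drives that worst-case movement toward zero, which is what "globally stable" means here.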

De-risk
Truth-Seeking Co-Arena

Human-AI benchmark infrastructure that evaluates LLMs on how well they assist people in genuine truth-seeking tasks: research, decision-making, learning, and value judgment.
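
A minimal sketch of what one benchmark record might look like; all field names and values here are hypothetical, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CoArenaTask:
    """Hypothetical record for one human-AI truth-seeking episode."""
    task_type: str        # "research" | "decision" | "learning" | "value_judgment"
    question: str         # what the human participant is trying to resolve
    assistant_model: str  # the LLM under evaluation
    outcome_metric: str   # how assistance quality is scored for this task

task = CoArenaTask(
    task_type="decision",
    question="Should our team adopt tool X or tool Y for the next project?",
    assistant_model="model-under-test",
    outcome_metric="quality of the participant's final decision rationale",
)
```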

Core validation
Large Socio-Technical Systems Evaluation

An initial study applying the Martingale score to social media platforms and recommender systems, measuring collective belief entrenchment at the ecosystem level without requiring ground-truth labels or experimental controls.
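
Continuing the illustrative statistic from the sketch above, one simple way to lift a per-user deviation to an ecosystem-level score is to average the magnitude of each user's drift; the aggregation rule is again an assumption.

```python
import numpy as np

def collective_martingale_score(trajectories: list[list[float]]) -> float:
    """Aggregate per-user belief drift into one ecosystem-level number.

    Each trajectory is one user's credence sequence on a shared topic,
    reconstructed from their platform activity. Because the score only
    compares users' beliefs with their own later beliefs, it needs no
    ground-truth labels and no experimental controls.
    """
    drifts = [np.diff(np.asarray(t, dtype=float)).mean() for t in trajectories]
    return float(np.mean(np.abs(drifts)))

population = [[0.5, 0.6, 0.7, 0.8],    # user drifting toward one pole
              [0.5, 0.4, 0.3, 0.2],    # user drifting toward the other
              [0.5, 0.55, 0.45, 0.5]]  # user with stable beliefs
print(collective_martingale_score(population))  # ~0.067; grows as beliefs harden
```

Taking the absolute value per user before averaging means symmetric polarization, where two camps drift in opposite directions, still registers rather than cancelling out.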

De-risk

A survey paper drawing together research from cognitive science, computational social science, machine learning, and human-AI interaction on how AI systems shape human epistemics. The project is also building a research coalition around the topic.

Planned
Martingale Training in Human-LLM Interaction

RL training of LLMs to reduce confirmation bias in human reasoning during AI-assisted tasks by minimizing users' Martingale deviation over the course of conversations.
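
A sketch of how such a reward could be wired up, reusing the illustrative drift statistic from above; eliciting the user's credence once per turn and the exact reward shape are assumptions.

```python
def conversation_reward(user_credences: list[float]) -> float:
    """Illustrative per-conversation RL reward for the assistant.

    `user_credences` holds the user's probability estimate for the
    question at hand, elicited once per conversation turn. The
    assistant is rewarded for keeping the user's updates drift-free
    (martingale-like), not for steering them toward any answer.
    """
    if len(user_credences) < 2:
        return 0.0
    increments = [b - a for a, b in zip(user_credences, user_credences[1:])]
    mean_drift = sum(increments) / len(increments)
    return -abs(mean_drift)

# A conversation that steadily inflates the user's initial hunch is
# penalized; one that leaves room for evidence in both directions is not.
print(conversation_reward([0.6, 0.7, 0.8, 0.9]))   # -0.1
print(conversation_reward([0.6, 0.5, 0.65, 0.6]))  # ~0
```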

Martingale Training on Social Media & Recommender Systems

Training discourse facilitators and recommendation agents with RL to minimize collective Martingale scores, reducing polarization and belief entrenchment at scale.

Simulating and Forecasting AI's Impact on Societal Epistemics

LLM-based network simulations to study how AI interventions affect polarization, fanaticism, and epistemic dynamics at the societal level, benchmarked by prediction accuracy on real user behaviors and events.
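
A toy version of the kind of simulation this involves; the network layout, the stand-in update rule, and the variance-based polarization proxy are all assumptions for illustration. A real run would replace `update_opinion` with an LLM call conditioned on each agent's persona and feed.

```python
import random

def update_opinion(own: float, neighbors: list[float]) -> float:
    """Stand-in for an LLM-driven opinion update: a noisy local average."""
    pooled = sum([own] + neighbors) / (len(neighbors) + 1)
    return min(1.0, max(0.0, pooled + random.gauss(0.0, 0.02)))

def simulate(n_agents: int = 100, n_steps: int = 50, degree: int = 5) -> float:
    """Evolve opinions on a random network; return the final variance
    across agents, a crude proxy for polarization."""
    opinions = [random.random() for _ in range(n_agents)]
    neighbors = {i: random.sample(range(n_agents), degree)
                 for i in range(n_agents)}
    for _ in range(n_steps):
        opinions = [update_opinion(opinions[i], [opinions[j] for j in neighbors[i]])
                    for i in range(n_agents)]
    mean = sum(opinions) / n_agents
    return sum((o - mean) ** 2 for o in opinions) / n_agents

random.seed(0)
print(simulate())  # compare across intervention settings, then against real data
```

The benchmarking step described above would then compare such simulated trajectories, with and without a given AI intervention, against observed user behaviors and events.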