Evaluated belief entrenchment across problem domains, models, and reasoning paradigms using the Martingale property, and validated the causal impact of entrenchment on forecasting performance.
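For readers unfamiliar with the metric, the idea is that a rational belief trajectory should behave like a Martingale: each update, conditional on everything seen so far, should average to zero. Below is a minimal sketch of a deviation score under that reading; the function name, the simple drift statistic, and the 0.5 leaning threshold are illustrative assumptions, not the exact metric used in the project.

```python
import numpy as np

def martingale_deviation(beliefs) -> float:
    """Illustrative entrenchment score for one reasoning trajectory.

    `beliefs` holds a model's (or person's) probability estimates for the
    same claim at successive reasoning steps, beliefs[0] being the prior.
    Under the Martingale property, the updates beliefs[t+1] - beliefs[t]
    should average to zero regardless of where the trajectory started;
    systematic drift in the direction of the initial leaning signals
    entrenchment.
    """
    beliefs = np.asarray(beliefs, dtype=float)
    updates = np.diff(beliefs)                       # step-by-step belief changes
    initial_leaning = np.sign(beliefs[0] - 0.5)      # +1 if the prior favors the claim, -1 otherwise
    return float(np.mean(updates) * initial_leaning) # > 0: drift toward the prior (entrenchment)

# Example: a trajectory that keeps reinforcing an initial 0.7 prior vs. one that wanders
print(martingale_deviation([0.7, 0.75, 0.8, 0.85]))   # positive => entrenched
print(martingale_deviation([0.7, 0.65, 0.72, 0.68]))  # near zero => Martingale-like
```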
Projects
Each project moves through four stages: de-risk, core validation, scaling up, and application.
A theoretical account showing that self-improvement in language models can be understood as coherence optimization over behavioral basins. Characterizes when bootstrap-based elicitation methods work and explains how emergent generalization arises.
Added a temporal dimension to RL-based alignment, enabling models to track and align with moral progress across historical time rather than fixing a snapshot of values.
Showed through simulation how a collective knowledge base can collapse under human-AI feedback loops, and tested the hypothesis using causal inference on real-world chatbot interaction data.
Argued that LLMs, acting as epistemic technologies, systematically amplify biases and errors, and can lead to knowledge collapse and value lock-in.
Developing RL-based training interventions that minimize Martingale deviation in LLM reasoning, removing confirmation bias without requiring ground-truth labels or controlled conditions.
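One plausible way to turn that deviation into a label-free training signal, sketched under the assumption that the model states an explicit probability at each reasoning step and that a drift statistic like `martingale_deviation` from the sketch above is available; the reward shaping below is illustrative, not the project's actual training setup.

```python
def entrenchment_penalty_reward(belief_trajectory, base_reward=0.0, weight=1.0):
    """Shape a per-episode RL reward by penalizing Martingale deviation.

    No ground-truth label is needed: the penalty only measures whether the
    model's stated probabilities drift systematically toward their starting
    point, not whether the final answer is correct.
    """
    deviation = martingale_deviation(belief_trajectory)  # drift statistic sketched earlier
    return base_reward - weight * max(deviation, 0.0)    # penalize drift toward the prior only
```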
Core validation: Training AI to guide people toward their reflective equilibrium, the set of beliefs they would hold after careful, iterative reflection on challenges to their current views. The method avoids entrenching instrumental preferences by targeting globally stable belief states through adversarial minimax training.
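One plausible reading of that objective in symbols, with all notation assumed rather than taken from the project: \(\pi_\theta\) is the guiding policy, \(\pi_\phi\) the adversarial challenger, \(b_T\) the belief state a user reaches under the guide, and \(b_T^{\phi}\) that state after the challenger's strongest push back.

\[
\min_{\theta}\;\max_{\phi}\;\mathbb{E}\!\left[\,d\big(b_T(\pi_\theta),\,b_T^{\phi}(\pi_\theta,\pi_\phi)\big)\right]
\]

A belief state counts as globally stable, and hence as a candidate reflective equilibrium, when even the best challenger cannot shift it much, so minimizing the inner maximum discourages the guide from steering users into positions that merely feel settled.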
De-risk: Developing a bootstrap method to discover high-coherence behavioral basins in LLMs and the generalization graph between them, with applications to emergent misalignment and elicitation.
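A rough sketch of what a bootstrap-style basin search could look like: sample many behaviors on the same probes, score pairwise coherence, and cluster. Both `sample_responses` and `coherence` are assumed interfaces (e.g., backed by an LLM judge), and the clustering choice is illustrative rather than the project's method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def discover_basins(sample_responses, coherence, probes, n_samples=50, threshold=0.7):
    """Cluster sampled behaviors into high-coherence basins.

    `sample_responses(probes, n)` and `coherence(a, b) -> [0, 1]` are assumed
    interfaces: the former draws n behavior samples from the model on a fixed
    probe set, the latter scores how mutually consistent two behaviors are.
    """
    samples = sample_responses(probes, n_samples)
    n = len(samples)
    dist = np.zeros((n, n))                       # distance = 1 - coherence
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - coherence(samples[i], samples[j])
    # Average-linkage clustering; behaviors whose mutual coherence stays
    # above `threshold` end up in the same basin.
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=1.0 - threshold, criterion="distance")
    basins = {}
    for lab, s in zip(labels, samples):
        basins.setdefault(lab, []).append(s)
    # Edges between basins (the generalization graph) would be estimated in a
    # separate step, e.g. by eliciting one basin and probing the others.
    return basins
```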
Scaling up: Human-AI benchmark infrastructure that evaluates LLMs on how well they assist people in genuine truth-seeking tasks: research, decision-making, learning, and value judgment.
Core validation: An initial study applying the Martingale score to measure collective belief entrenchment at the ecosystem level, on social media platforms and recommender systems, without requiring ground-truth labels or experimental controls.
De-risk: Survey paper drawing together research from cognitive science, computational social science, ML, and human-AI interaction on how AI systems shape human epistemics. Building a research coalition around the topic.
RL training on LLMs to reduce confirmation bias in human reasoning during AI-assisted tasks, by minimizing users' Martingale deviation over the course of conversations.
Training discourse facilitators and recommendation agents with RL to minimize collective Martingale scores, reducing polarization and belief entrenchment at scale.
LLM-based network simulations to study how AI interventions affect polarization, fanaticism, and epistemic dynamics at the societal level, benchmarked by prediction accuracy on real user behaviors and events.
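A toy version of such a simulation; the bounded-confidence-style update rule and the `llm_mediate` hook are assumptions standing in for whatever intervention is under test, and benchmarking against real user behavior would replace the synthetic graph used here.

```python
import numpy as np

def simulate_network(adjacency, init_opinions, llm_mediate, steps=1000, lr=0.1, rng=None):
    """Toy opinion-dynamics loop for testing an AI mediation intervention.

    `adjacency` is an n x n matrix of who listens to whom, `init_opinions`
    holds each agent's opinion in [-1, 1], and `llm_mediate(sender_opinion,
    receiver_opinion)` is an assumed hook returning the message actually
    delivered (e.g., reframed by a facilitator model). Polarization is
    tracked as the variance of opinions over time.
    """
    rng = rng or np.random.default_rng(0)
    opinions = np.array(init_opinions, dtype=float)
    polarization = []
    n = len(opinions)
    for _ in range(steps):
        i = rng.integers(n)                              # receiver
        neighbors = np.flatnonzero(adjacency[i])
        if neighbors.size == 0:
            continue
        j = rng.choice(neighbors)                        # sender
        message = llm_mediate(opinions[j], opinions[i])  # intervention hook
        opinions[i] += lr * (message - opinions[i])      # move toward the delivered message
        opinions = np.clip(opinions, -1.0, 1.0)
        polarization.append(float(np.var(opinions)))
    return opinions, polarization

# Example: identity mediation (no intervention) on a small random graph
# adj = (np.random.default_rng(1).random((20, 20)) < 0.2).astype(int)
# ops, pol = simulate_network(adj, np.random.default_rng(2).uniform(-1, 1, 20), lambda s, r: s)
```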