Evaluated belief entrenchment across problem domains, models, and reasoning paradigms using the Martingale property, and validated the causal impact of entrenchment on forecasting performance.
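For readers unfamiliar with the metric, the idea is that a rational belief trajectory should behave like a Martingale: each update, conditional on everything seen so far, should average to zero. Below is a minimal sketch of a deviation score under that reading; the function name, the simple drift statistic, and the 0.5 leaning threshold are illustrative assumptions, not the exact metric used in the project.

```python
import numpy as np

def martingale_deviation(beliefs) -> float:
    """Illustrative entrenchment score for one reasoning trajectory.

    `beliefs` holds a model's (or person's) probability estimates for the
    same claim at successive reasoning steps, beliefs[0] being the prior.
    Under the Martingale property, the updates beliefs[t+1] - beliefs[t]
    should average to zero regardless of where the trajectory started;
    systematic drift in the direction of the initial leaning signals
    entrenchment.
    """
    beliefs = np.asarray(beliefs, dtype=float)
    updates = np.diff(beliefs)                       # step-by-step belief changes
    initial_leaning = np.sign(beliefs[0] - 0.5)      # +1 if the prior favors the claim, -1 otherwise
    return float(np.mean(updates) * initial_leaning) # > 0: drift toward the prior (entrenchment)

# Example: a trajectory that keeps reinforcing an initial 0.7 prior vs. one that wanders
print(martingale_deviation([0.7, 0.75, 0.8, 0.85]))   # positive => entrenched
print(martingale_deviation([0.7, 0.65, 0.72, 0.68]))  # near zero => Martingale-like
```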
Projects
Each project moves through four stages: de-risk, core validation, scaling up, and application.
A theoretical account showing that self-improvement in language models can be understood as coherence optimization over behavioral basins. Characterizes when bootstrap-based elicitation methods work and explains how emergent generalization arises.
Added a temporal dimension to RL-based alignment, enabling models to track and align with moral progress across historical time rather than fixing a snapshot of values.
Showed through simulation how a collective knowledge base can collapse under human-AI feedback loops, and tested the hypothesis using causal inference on real-world chatbot interaction data.
Argued that LLMs, acting as epistemic technologies, systematically amplify biases and errors, and can lead to knowledge collapse and value lock-in.
Developing RL-based training interventions that minimize Martingale deviation in LLM reasoning, removing confirmation bias without requiring ground-truth labels or controlled conditions.
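One plausible way to turn that deviation into a label-free training signal, sketched under the assumption that the model states an explicit probability at each reasoning step and that a drift statistic like `martingale_deviation` from the sketch above is available; the reward shaping below is illustrative, not the project's actual training setup.

```python
def entrenchment_penalty_reward(belief_trajectory, base_reward=0.0, weight=1.0):
    """Shape a per-episode RL reward by penalizing Martingale deviation.

    No ground-truth label is needed: the penalty only measures whether the
    model's stated probabilities drift systematically toward their starting
    point, not whether the final answer is correct.
    """
    deviation = martingale_deviation(belief_trajectory)  # drift statistic sketched earlier
    return base_reward - weight * max(deviation, 0.0)    # penalize drift toward the prior only
```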
Core validation: Training AI to guide people toward their reflective equilibrium, the set of beliefs they would hold after careful, iterative reflection on challenges to their current views. The method avoids entrenching instrumental preferences by targeting globally stable belief states through adversarial minimax training.
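One plausible reading of that objective in symbols, with all notation assumed rather than taken from the project: \(\pi_\theta\) is the guiding policy, \(\pi_\phi\) the adversarial challenger, \(b_T\) the belief state a user reaches under the guide, and \(b_T^{\phi}\) that state after the challenger's strongest push back.

\[
\min_{\theta}\;\max_{\phi}\;\mathbb{E}\!\left[\,d\big(b_T(\pi_\theta),\,b_T^{\phi}(\pi_\theta,\pi_\phi)\big)\right]
\]

A belief state counts as globally stable, and hence as a candidate reflective equilibrium, when even the best challenger cannot shift it much, so minimizing the inner maximum discourages the guide from steering users into positions that merely feel settled.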
De-risk: Developing a bootstrap method to discover high-coherence behavioral basins in LLMs and the generalization graph between them, with applications to emergent misalignment and elicitation.
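A rough sketch of what a bootstrap-style basin search could look like: sample many behaviors on the same probes, score pairwise coherence, and cluster. Both `sample_responses` and `coherence` are assumed interfaces (e.g., backed by an LLM judge), and the clustering choice is illustrative rather than the project's method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def discover_basins(sample_responses, coherence, probes, n_samples=50, threshold=0.7):
    """Cluster sampled behaviors into high-coherence basins.

    `sample_responses(probes, n)` and `coherence(a, b) -> [0, 1]` are assumed
    interfaces: the former draws n behavior samples from the model on a fixed
    probe set, the latter scores how mutually consistent two behaviors are.
    """
    samples = sample_responses(probes, n_samples)
    n = len(samples)
    dist = np.zeros((n, n))                       # distance = 1 - coherence
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - coherence(samples[i], samples[j])
    # Average-linkage clustering; behaviors whose mutual coherence stays
    # above `threshold` end up in the same basin.
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=1.0 - threshold, criterion="distance")
    basins = {}
    for lab, s in zip(labels, samples):
        basins.setdefault(lab, []).append(s)
    # Edges between basins (the generalization graph) would be estimated in a
    # separate step, e.g. by eliciting one basin and probing the others.
    return basins
```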
Scaling up: Human-AI benchmark infrastructure that evaluates LLMs on how well they assist people in genuine truth-seeking tasks: research, decision-making, learning, and value judgment.
Core validation: An initial study applying the Martingale score to measure collective belief entrenchment at the ecosystem level, on social media platforms and recommender systems, without requiring ground-truth labels or experimental controls.
De-risk: Survey paper drawing together research from cognitive science, computational social science, ML, and human-AI interaction on how AI systems shape human epistemics. Building a research coalition around the topic.
RL training on LLMs to reduce confirmation bias in human reasoning during AI-assisted tasks, by minimizing users' Martingale deviation over the course of conversations.
Training discourse facilitators and recommendation agents with RL to minimize collective Martingale scores, reducing polarization and belief entrenchment at scale.
LLM-based network simulations to study how AI interventions affect polarization, fanaticism, and epistemic dynamics at the societal level, benchmarked by prediction accuracy on real user behaviors and events.
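A toy version of such a simulation; the bounded-confidence-style update rule and the `llm_mediate` hook are assumptions standing in for whatever intervention is under test, and benchmarking against real user behavior would replace the synthetic graph used here.

```python
import numpy as np

def simulate_network(adjacency, init_opinions, llm_mediate, steps=1000, lr=0.1, rng=None):
    """Toy opinion-dynamics loop for testing an AI mediation intervention.

    `adjacency` is an n x n matrix of who listens to whom, `init_opinions`
    holds each agent's opinion in [-1, 1], and `llm_mediate(sender_opinion,
    receiver_opinion)` is an assumed hook returning the message actually
    delivered (e.g., reframed by a facilitator model). Polarization is
    tracked as the variance of opinions over time.
    """
    rng = rng or np.random.default_rng(0)
    opinions = np.array(init_opinions, dtype=float)
    polarization = []
    n = len(opinions)
    for _ in range(steps):
        i = rng.integers(n)                              # receiver
        neighbors = np.flatnonzero(adjacency[i])
        if neighbors.size == 0:
            continue
        j = rng.choice(neighbors)                        # sender
        message = llm_mediate(opinions[j], opinions[i])  # intervention hook
        opinions[i] += lr * (message - opinions[i])      # move toward the delivered message
        opinions = np.clip(opinions, -1.0, 1.0)
        polarization.append(float(np.var(opinions)))
    return opinions, polarization

# Example: identity mediation (no intervention) on a small random graph
# adj = (np.random.default_rng(1).random((20, 20)) < 0.2).astype(int)
# ops, pol = simulate_network(adj, np.random.default_rng(2).uniform(-1, 1, 20), lambda s, r: s)
```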