Chenyang An

AI Research Scientist @ Miromind


New York City

email: cya.portfolio at gmail dot com

I am an LLM researcher and AI agent engineer focused on post-training, data curation, and reasoning in both natural and formal languages. I’m an AI Research Scientist at Miromind. Previously, I was an Applied Scientist at Amazon AWS, and I interned at Microsoft Research, Scale AI, and Amazon.

My PhD research centers on improving LLM reasoning through post-training. I have published at ACL, ICLR, and EMNLP, with work on curating higher-quality reasoning data via trial-and-error trajectories, studying LLM generalization through internal representations, and designing diversity-aware reward mechanisms for mathematical reasoning.

I build AI agents across low-latency, medium-latency, and long-horizon settings, focusing on reasoning, verification, and trading.

Long-horizon AI systems: I developed QED, an open-source agent for mathematical discovery that solved a nontrivial open research problem in PDEs by automatically searching for a valid solution and proving its correctness end-to-end. QED is currently being applied to other open math research questions, with more results to be announced. I also built supporting tools for mathematical workflows, including Proofread, an AI agent for proofreading technical papers and books, and pdf-to-Lean, which converts mathematical documents into formal proofs.

Medium-latency AI systems: With colleagues at Amazon, I built agentic systems that translate natural language into formal specifications and executable logic, using symbolic verification to ensure correctness, reduce hallucination, and improve logical consistency in LLM outputs.

Low-latency AI systems: I build open-source AI agents for real-time financial decision-making, including systems that process live news, generate signals, and execute trades within 1.5–2 seconds across stocks and options.

[Resume]

If you are interested in any of the topics above, feel free to drop me an email!

news

Apr 10, 2026 I open-sourced QED, a multi-agent pipeline that transforms mathematical problem statements into rigorous proofs. QED solved a research-level open problem in PDEs, with the proof verified by domain experts from three institutions and incorporated into their mathematical work.
Feb 25, 2026 I’m happy to release an agent pipeline based on Claude Code that helps proofread the LaTeX source of papers and books! Check https://github.com/chenyang-an/proofread for details!
Aug 30, 2025 I’m excited to share that I will join Amazon AWS Automated Reasoning Group as an Applied Scientist!
Apr 01, 2025 I’m excited to share that our paper, “The Price of Format: Diversity Collapse in LLMs”, has been accepted to EMNLP 2025! In this work, we find that structured templates in instruction-tuned LLMs cause diversity collapse, limiting open-ended generation even under high-temperature sampling, and we systematically evaluate this effect across tasks to show the trade-off between alignment, task performance, and output diversity.
Apr 01, 2025 I’m excited to share that I’ll be joining the Amazon AWS Neurosymbolic group as an Applied Scientist Intern, where I’ll be working on LLM reasoning in both natural and formal language!


selected publications

  1. arXiv
    The Price of Format: Diversity Collapse in LLMs
    May 2025
  2. arXiv
    Linear Correlation in LM’s Compositional Generalization and Hallucination
    Feb 2025
  3. arXiv
    Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation
    Oct 2024
  4. ICLR
    Correlation and Navigation in the Vocabulary Key Representation Space of Language Models
    Jan 2025
  5. ACL
    Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving
    May 2024