Chenyang An

Ph.D. Candidate @ University of California, San Diego


HSS Building, UC San Diego

9500 Gilman Dr

La Jolla, California 92092

email: cya.portfolio at gmail dot com

I am a fifth-year Ph.D. student in the Mathematics Department at UC San Diego, where I am advised by Prof. Sam Buss and co-advised by Prof. Jingbo Shang. Prior to my Ph.D., I completed a B.S. in Applied Mathematics and a B.A. in Economics at UC San Diego.

My current research focuses on Large Language Model (LLM) reasoning in both natural-language and formalized environments, and on LLM post-training with supervised fine-tuning and reinforcement learning. I believe that mathematics will likely fall well within the capabilities of LLMs in the near future.

My prior research focused on 2D quantum gravity and mathematical physics, studying the interplay between algebra, geometry, and physics.

I interned at Microsoft Research in Seattle during the summer of 2024, focusing on improving the training efficiency of large language models (LLMs) for reasoning tasks. I also worked part-time at Scale AI as an AI Consultant, contributing to the development of LLM-based web agents and scalable verification systems for reasoning data. In Spring 2025, I joined Amazon AWS Neurosymbolic as an Applied Scientist Intern, where I designed a new reinforcement learning pipeline incorporating a diversity-based reward to encourage the generation of varied chains of thought (CoTs), along with the supporting data preprocessing framework.

If you are interested in any of the topics above, feel free to drop me an email!

news

Apr 01, 2025 I’m excited to share that I’ll be joining Amazon AWS Neurosymbolic as an Applied Scientist Intern, where I’ll be working on LLM reasoning in both natural and formal language!
Jan 22, 2025 I’m thrilled to see that our ACL 2024 publication, “Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving”, is featured by Neptunes News Agency! News link.
Jan 22, 2025 I’m thrilled to share that our paper, “Correlation and Navigation in the Vocabulary Key Representation Space of Language Models”, has been accepted to the International Conference on Learning Representations (ICLR)! This work studies spurious correlations in the vocabulary key space of LLMs and proposes a novel in-context learning method, In-Context Navigation, to sample high-quality results from the key space that cannot otherwise be obtained through standard top-k inference.
Oct 01, 2024 I’m excited to share that I will be joining Scale AI as an AI Consultant, working on fine-tuning LLMs for real-world applications.
Jun 04, 2024 I’m excited to share that I will be joining Microsoft as a research intern in ML and Generative AI in Redmond, Washington, in the summer of 2024.

selected publications

  1. arXiv
    The Price of Format: Diversity Collapse in LLMs
    May 2025
  2. arXiv
    Linear Correlation in LM’s Compositional Generalization and Hallucination
    Feb 2025
  3. arXiv
    Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation
    Oct 2024
  4. ICLR
    Correlation and Navigation in the Vocabulary Key Representation Space of Language Models
    Jan 2025
  5. ACL
    Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving
    May 2024