Readings
Attacking and Jailbreaking
- Universal and Transferable Adversarial Attacks on Aligned Language Models. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson. arXiv preprint 2023.
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models. Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun. COLM 2024.
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi. ACL 2024.
- Jailbreaking Large Language Models with Symbolic Mathematics. Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad. arXiv preprint 2024.
- Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto. arXiv preprint 2023.
- Jailbreaking Black Box Large Language Models in Twenty Queries. Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong. arXiv preprint 2023.
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao. ICLR 2024.
- Multilingual Jailbreak Challenges in Large Language Models. Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing. ICLR 2024.
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey. Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li. arXiv preprint 2024.
- Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. Bochuan Cao, Yuanpu Cao, Lu Lin, Jinghui Chen. arXiv preprint 2024.
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas. arXiv preprint 2023. (Its perturb-and-aggregate defense is sketched after this list.)
- Defending LLMs against Jailbreaking Attacks via Backtranslation. Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh. ACL 2024 Findings.
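Several of the defenses above are easy to prototype. Below is a minimal sketch of the perturb-and-aggregate idea behind the SmoothLLM entry, assuming placeholder `query_model` and `looks_like_refusal` callables for the LLM call and the jailbreak judge; the paper also studies insertion/patch perturbations and other aggregation rules, so this is a sketch of the idea rather than the authors' implementation.

```python
import random
import string

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace a fraction q of characters (one of SmoothLLM's perturbation types)."""
    chars = list(prompt)
    n_swaps = max(1, int(q * len(chars)))
    for i in random.sample(range(len(chars)), n_swaps):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothllm_defense(prompt, query_model, looks_like_refusal, n_copies=10, q=0.1):
    """Query the model on several perturbed copies of the prompt and follow the majority vote."""
    responses = [query_model(perturb(prompt, q)) for _ in range(n_copies)]
    refused = [r for r in responses if looks_like_refusal(r)]
    complied = [r for r in responses if not looks_like_refusal(r)]
    # If most perturbed copies triggered a refusal, treat the prompt as adversarial.
    majority_refuses = len(refused) > n_copies / 2
    return random.choice(refused if majority_refuses else complied)
```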
Machine Unlearning
- Who’s Harry Potter? Approximate Unlearning in LLMs. Ronen Eldan, Mark Russinovich. arXiv preprint 2023.
- Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning. Ruiqi Zhang, Licong Lin, Yu Bai, Song Mei. COLM 2024.
- Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness. Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi K. Potluru, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li. NeurIPS 2025.
- Machine Unlearning: A Survey. Heng Xu, Tianqing Zhu, Lefeng Zhang, Wanlei Zhou, Philip S. Yu. arXiv preprint 2023.
- Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning? Zexi Li, Xiangzhu Wang, William F. Shen, Meghdad Kurmanji, Xinchi Qiu, Dongqi Cai, Chao Wu, Nicholas D. Lane. arXiv preprint 2025.
- Locating and Editing Factual Associations in GPT. Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov. NeurIPS 2022. (Its rank-one editing update is sketched after this list.)
- Mass-Editing Memory in a Transformer. Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, David Bau. ICLR 2023.
- PMET: Precise Model Editing in a Transformer. Xiaopeng Li, Shasha Li, Shezheng Song, Jing Yang, Jun Ma, Jie Yu. AAAI 2024.
- A Unified Framework for Model Editing. Akshat Gupta, Dev Sajnani, Gopala Anumanchipalli. EMNLP 2024 Findings.
- AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models. Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, Tat-seng Chua. ICLR 2025.
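As a concrete anchor for the locate-then-edit line of work above, here is a minimal numpy sketch of the closed-form rank-one update at the core of ROME (Locating and Editing Factual Associations in GPT). It treats an MLP projection as a linear key-value memory; extracting the key `k_star` and target value `v_star` from hidden states is the substantive part of the method and is omitted here.

```python
import numpy as np

def rank_one_edit(W, C, k_star, v_star):
    """Closed-form rank-one update in the spirit of ROME.

    W      : (d_out, d_in) MLP projection, viewed as a linear key-value memory (v = W k).
    C      : (d_in, d_in) uncentred covariance of keys, E[k k^T], estimated on generic text.
    k_star : (d_in,)  key vector representing the edited subject.
    v_star : (d_out,) value vector encoding the new association (found by optimization in the paper).
    """
    c_inv_k = np.linalg.solve(C, k_star)        # C^{-1} k*
    residual = v_star - W @ k_star              # gap between what the memory stores and the target
    return W + np.outer(residual, c_inv_k) / (k_star @ c_inv_k)
```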
Hallucinations
General Introduction to Hallucinations
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu. ACM TOIS 2025.
- Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi. Computational Linguistics (CL) 2025.
Why Do Hallucinations Occur? Root Causes behind Hallucinations
- Why Language Models Hallucinate. Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang. arXiv preprint 2025.
- Calibrated Language Models Must Hallucinate. Adam Tauman Kalai, Santosh S. Vempala. ACM STOC 2024.
- Unfamiliar Finetuning Examples Control How Language Models Hallucinate. Katie Kang, Eric Wallace, Claire Tomlin, Aviral Kumar, Sergey Levine. NAACL 2025.
- How Language Model Hallucinations Can Snowball. Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith. ICML 2024.
Hallucination Detection and Benchmarks
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi. EMNLP 2023. (Its decompose-then-verify metric is sketched after this list.)
- Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations. Cheng-Han Chiang, Hung-yi Lee. ACL 2024 Findings.
- Measuring short-form factuality in large language models. Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus. arXiv preprint 2024.
- SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge. Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, Dipanjan Das. arXiv preprint 2025.
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. Liyan Tang, Philippe Laban, Greg Durrett. EMNLP 2024.
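FActScore's decompose-then-verify recipe (first entry above) reduces to a short loop once its two hard subroutines are abstracted away. In this sketch both are placeholders: `split_into_atomic_facts` (an LLM in the paper) and `is_supported` (retrieval over a knowledge source such as Wikipedia plus a verification model).

```python
def factscore(generation, split_into_atomic_facts, is_supported):
    """Factual precision of a long-form generation, FActScore-style.

    split_into_atomic_facts(text) -> list[str]  # done with an LLM in the paper
    is_supported(fact) -> bool                  # retrieval over a knowledge source + verifier
    """
    facts = split_into_atomic_facts(generation)
    if not facts:
        return 0.0  # abstentions/empty outputs are handled separately in the paper
    return sum(1 for fact in facts if is_supported(fact)) / len(facts)
```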
Hallucination Mitigation Strategies (Fine-Tuning)
-
R-Tuning: Instructing Large Language Models to Say ‘I Don’t Know’. Hanning Zhang, Shizhe Diao, Yong Lin, Yi Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, Tong Zhang. NAACL 2024.
-
Fine-tuning Language Models for Factuality. Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn. ICLR 2024.
-
FLAME: Factuality-Aware Alignment for Large Language Models. Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Wen-tau Yih, Xilun Chen. NeurIPS 2024.
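A simplified sketch of the data-construction step behind refusal-aware tuning as in the R-Tuning entry above. `model_answer` stands in for greedy decoding with the pre-trained model, and the exact-match split plus the "I don't know" target are simplifications; the paper itself appends certainty expressions rather than replacing the answer.

```python
def build_refusal_aware_data(examples, model_answer, normalize=str.strip):
    """Simplified refusal-aware data construction.

    examples     : list of (question, gold_answer) pairs from an instruction-tuning set.
    model_answer : placeholder for greedy decoding with the pre-trained model.
    Questions the model already answers correctly keep their gold label; the rest are
    relabeled with a refusal-style target so fine-tuning teaches the model to abstain
    on unfamiliar inputs.
    """
    tuned = []
    for question, gold in examples:
        if normalize(model_answer(question)) == normalize(gold):
            tuned.append((question, gold))
        else:
            tuned.append((question, "I don't know."))
    return tuned
```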
Hallucination Mitigation Strategies (Inference-Time Algorithms)
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He. ICLR 2024.
- Trusting Your Evidence: Hallucinate Less with Context-aware Decoding. Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, Scott Wen-tau Yih. NAACL 2024. (Its logit adjustment is sketched after this list.)
- Fidelity-Enriched Contrastive Search: Reconciling the Faithfulness-Diversity Trade-Off in Text Generation. Wei-Lin Chen, Cheng-Kuang Wu, Hsin-Hsi Chen, Chung-Chi Chen. EMNLP 2023.
- Chain-of-Verification Reduces Hallucination in Large Language Models. Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston. ACL 2024 Findings.
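Context-aware decoding (the second entry above) is essentially a one-line change at each decoding step. A minimal numpy sketch, where `alpha` is the amplification strength and the two logit vectors come from running the model with and without the source context in the prompt:

```python
import numpy as np

def context_aware_logits(logits_with_context, logits_without_context, alpha=1.0):
    """Adjusted next-token logits for context-aware decoding.

    The model is run twice per step: once with the retrieved/source context in the prompt
    and once without it. Amplifying the difference up-weights tokens supported by the
    context rather than by parametric memory.
    """
    return (1 + alpha) * np.asarray(logits_with_context) - alpha * np.asarray(logits_without_context)

# At each decoding step, sample (or take the argmax) from softmax(context_aware_logits(...)).
```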
Prompt Robustness
Different Types of Prompt Variations
- (Formatting in Few-Shot Examples) Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr. ICLR 2024.
- (Format Restrictions for Structured Outputs) Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen. EMNLP 2024 Industry Track.
- (Paraphrased Instructions) Evaluating the Zero-shot Robustness of Instruction-tuned Language Models. Jiuding Sun, Chantal Shaib, Byron C. Wallace. ICLR 2024.
Implications for Model Evaluation
- State of What Art? A Call for Multi-Prompt LLM Evaluation. Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, Gabriel Stanovsky. TACL 2024.
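The multi-prompt evaluation call above boils down to reporting a distribution of scores over prompt paraphrases rather than a single number. A minimal harness sketch, with `prompt_templates`, `dataset`, and `evaluate` all standing in for a real evaluation setup:

```python
import statistics

def multi_prompt_accuracy(prompt_templates, dataset, evaluate):
    """Accuracy distribution over prompt paraphrases instead of a single-template score.

    prompt_templates : format strings with an {input} slot (paraphrases of one instruction).
    dataset          : list of (input, label) pairs.
    evaluate         : placeholder wrapping the model call and answer matching -> bool.
    """
    per_template = []
    for template in prompt_templates:
        correct = [evaluate(template.format(input=x), y) for x, y in dataset]
        per_template.append(sum(correct) / len(correct))
    return {
        "mean": statistics.mean(per_template),
        "stdev": statistics.pstdev(per_template),
        "min": min(per_template),
        "max": max(per_template),
    }
```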
Position and Order Biases
Types of Position Biases
- (Order of Few-Shot Examples) Calibrate Before Use: Improving Few-Shot Performance of Language Models. Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh. ICML 2021. (Its contextual calibration fix is sketched after this list.)
- (Order of Few-Shot Examples) Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, Pontus Stenetorp. ACL 2022.
- (Order of Source Documents) Lost in the Middle: How Language Models Use Long Contexts. Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang. TACL 2023.
- (Order of Choices) Large Language Models Are Not Robust Multiple Choice Selectors. Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang. ICLR 2024.
- (Order of Choices) Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models. Sheng-Lun Wei, Cheng-Kuang Wu, Hen-Hsen Huang, Hsin-Hsi Chen. ACL 2024 Findings.
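The first entry in this list both quantifies the few-shot order sensitivity and proposes contextual calibration as a fix. A minimal sketch, assuming the model's label probabilities have already been extracted for the real input and for a content-free input such as "N/A" under the same prompt and example order:

```python
import numpy as np

def contextual_calibration(label_probs, content_free_probs):
    """Divide out the bias the prompt induces on a content-free input.

    label_probs        : model probabilities over the answer labels for a real input.
    content_free_probs : probabilities for a content-free input (e.g., "N/A") under the
                         same few-shot prompt and example order.
    """
    calibrated = np.asarray(label_probs) / np.asarray(content_free_probs)
    return calibrated / calibrated.sum()  # renormalise, then take the argmax as the prediction
```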
Mitigation Strategies
- (Re-ordering Prompt Contents) Attention Sorting Combats Recency Bias In Long Context Language Models. Alexander Peysakhovich, Adam Lerer. arXiv preprint 2023. (Its sort-by-attention re-prompting step is sketched after this list.)
- (Attention Calibration) Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization. Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister. ACL 2024 Findings.
- (Position-Invariant Inference) Eliminating Position Bias of Language Models: A Mechanistic Approach. Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham M. Kakade, Hao Peng, Heng Ji. ICLR 2025.
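A sketch of the re-ordering idea in the attention-sorting entry above: decode once, measure how much attention each context document received, move the most-attended documents closest to the question, and decode again (the paper iterates this step). `attention_to_documents` and `build_prompt` are placeholders for model internals and prompt assembly.

```python
def attention_sorted_prompt(documents, question, attention_to_documents, build_prompt):
    """Re-order retrieved documents by how much attention they received in a first pass.

    attention_to_documents(documents, question) -> list[float]  # placeholder for model internals
    build_prompt(documents, question) -> str                    # placeholder for prompt assembly
    """
    scores = attention_to_documents(documents, question)
    # Least-attended documents first, most-attended last (closest to the question).
    reordered = [doc for _, doc in sorted(zip(scores, documents), key=lambda pair: pair[0])]
    return build_prompt(reordered, question)
```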
Robustness of Reasoning Models
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens. Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu. arXiv preprint 2025.
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar. arXiv preprint 2025.
- An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems. Yuren Hao, Xiang Wan, Chengxiang Zhai. arXiv preprint 2025. (A generic version of this equivalence-based stress test is sketched after this list.)
- Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales? Z. Zhou, R. Tao, J. Zhu, Y. Luo, Z. Wang, B. Han. NeurIPS 2024.
- Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games. David Guzman Piedrahita et al. COLM 2025.
- MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes. Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q. Knight, Harry R. Lloyd, Florence Bacus, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L. Gordon, Sydney Levine. arXiv preprint 2025.
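Several of the studies above share one evaluation recipe: rewrite each problem in meaning-preserving ways and check whether the model's answer survives. A generic, hedged sketch of that harness, with `make_equivalent_variants` and `ask_model` as placeholders for the transformation and the model call:

```python
def consistency_under_equivalent_variants(problems, make_equivalent_variants, ask_model,
                                           normalize=str.strip):
    """How often a model's answer survives meaning-preserving rewrites of the problem.

    problems                 : list of (problem_text, gold_answer) pairs.
    make_equivalent_variants : placeholder for templated or LLM-generated rewrites with the same answer.
    ask_model                : placeholder wrapping the model call -> str.
    """
    consistent, robustly_correct = 0, 0
    for text, gold in problems:
        variants = [text] + list(make_equivalent_variants(text))
        answers = [normalize(ask_model(v)) for v in variants]
        consistent += all(a == answers[0] for a in answers)
        robustly_correct += all(a == normalize(gold) for a in answers)
    n = len(problems)
    return {"answer_consistency": consistent / n, "robust_accuracy": robustly_correct / n}
```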
Fairness and Social Bias
- CulturalBench: A Robust, Diverse, and Challenging Cultural Benchmark. Yu Ying Chiu et al. arXiv preprint 2024.
- Culture Cartography: Evaluating Global Cultural Coverage in Large Language Models. Caleb Ziems et al. arXiv preprint 2024.
- INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge. Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Sumeet Singh, Rakesh Maheshwary, Marco Altomare, Mohamed A. Haggag, Anagha Snegha, …, Antoine Bosselut. arXiv preprint 2024.
- Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures. Global PIQA Collaboration. arXiv preprint 2025.
- DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life. Yu Ying Chiu, Liwei Jiang, Yejin Choi. ICLR 2025.
- Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values under Moral Dilemmas. Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger. ACL 2025 Findings.
- Modular Pluralism: Pluralistic Alignment via Multi-LLM Collaboration. Shangbin Feng, Taylor Sorensen, Yuhan Liu, Jillian Fisher, Chan Young Park, Yejin Choi, Yulia Tsvetkov. EMNLP 2024.
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models. Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey. arXiv preprint 2025. (A difference-of-means version of these vectors is sketched after this list.)
- Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability. Taylor Sorensen, Benjamin Newman, Jared Moore, Chan Young Park, Jillian Fisher, Niloofar Mireshghallah, Liwei Jiang, Yejin Choi. arXiv preprint 2025.
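A minimal sketch in the spirit of the persona-vectors entry above, assuming residual-stream activations have already been collected from trait-eliciting and neutral prompts (arrays of shape (n_examples, d_model)); the paper's actual extraction and steering pipeline is more involved.

```python
import numpy as np

def persona_vector(acts_with_trait, acts_without_trait):
    """Difference-of-means direction between trait-eliciting and neutral activations.

    Both inputs are (n_examples, d_model) arrays of residual-stream activations;
    collecting them is model-specific and omitted here.
    """
    v = np.mean(acts_with_trait, axis=0) - np.mean(acts_without_trait, axis=0)
    return v / np.linalg.norm(v)

def trait_score(activation, vector):
    """Projection of one activation onto the trait direction, usable for monitoring;
    subtracting score * vector from the activation is a crude steering intervention."""
    return float(activation @ vector)
```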
Robustness for Multimodal LLMs
- MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making. Zhi Rui Tam, Yun-Nung Chen. arXiv preprint 2025.
- Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models. Tejas Srinivasan, Yonatan Bisk. GeBNLP (NAACL Workshop) 2022.
- Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective. Zhaotian Weng, Zijun Gao, Jerone Andrews, Jieyu Zhao. EMNLP 2024.
- The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks. Yu Gu, Jingjing Fu, Xiaodong Liu, Jeya Maria Jose Valanarasu, Noel Codella, Reuben Tan, Qianchu Liu, Ying Jin, Sheng Zhang, Jinyu Wang, Rui Wang, Lei Song, Guanghui Qin, Naoto Usuyama, Cliff Wong, Cheng Hao, Hohin Lee, Praneeth Sanapathi, Sarah Hilado, Bian Jiang, Javier Alvarez-Valle, Mu Wei, Jianfeng Gao, Eric Horvitz, Matt Lungren, Hoifung Poon, Paul Vozila. arXiv preprint 2025.
- What’s Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning. Zhaotian Weng, Haoxuan Li, Kuan-Hao Huang, Jieyu Zhao. arXiv preprint 2025.
- Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves? Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, David Acuna. CVPR 2025.
- Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning. Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, Wenhu Chen. arXiv preprint 2025.
- GRIT: Teaching MLLMs to Think with Images. Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang. arXiv preprint 2025.
- More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models. arXiv preprint 2025.