
wonderNefelibata / Awesome-LRM-Safety

Awesome Large Reasoning Model (LRM) Safety. This repository collects security-related research on large reasoning models such as the currently popular DeepSeek-R1 and OpenAI o1.

View on GitHub · 82 stars · 6 forks · 1 issue

AI Architecture Analysis

This repository is indexed by RepoMind. By analyzing wonderNefelibata/Awesome-LRM-Safety in our AI interface, you can instantly generate complete architecture diagrams, visualize control flows, and perform automated security audits across the entire codebase.

Our Agentic Context Augmented Generation (Agentic CAG) engine loads full source files into context on-demand, avoiding the fragmentation of traditional RAG systems. Ask questions about the architecture, dependencies, or specific features to see it in action.

Source files are only loaded when you start an analysis to optimize performance.
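The on-demand, whole-file loading described above can be sketched in a few lines. This is a hypothetical illustration of the general idea, not RepoMind's actual implementation: the `OnDemandContext` class and its methods are invented here for clarity.

```python
from pathlib import Path

class OnDemandContext:
    """Hypothetical sketch of on-demand whole-file context loading.

    Instead of pre-chunking a repository into fixed-size fragments (as
    traditional RAG does), whole source files are read lazily, only when
    an analysis first touches them, and concatenated into the prompt.
    """

    def __init__(self, repo_root):
        self.repo_root = Path(repo_root)
        self.loaded = {}  # relative path -> full file text

    def load(self, rel_path):
        # Whole-file granularity avoids fragment boundaries cutting
        # across functions or classes.
        if rel_path not in self.loaded:
            self.loaded[rel_path] = (self.repo_root / rel_path).read_text()
        return self.loaded[rel_path]

    def context(self):
        # Concatenate loaded files with path headers for the model prompt.
        return "\n\n".join(
            f"# file: {path}\n{text}" for path, text in self.loaded.items()
        )
```

Nothing is read at construction time, which matches the note above: the cost of loading source files is deferred until an analysis actually starts.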

Embed this Badge

Showcase RepoMind's analysis directly in your repository's README.

[![Analyzed by RepoMind](https://img.shields.io/badge/Analyzed%20by-RepoMind-4F46E5?style=for-the-badge)](https://repomind.in/repo/wonderNefelibata/Awesome-LRM-Safety)

Repository Overview (README excerpt)


# Awesome Large Reasoning Model (LRM) Safety 🔥

A curated list of **security and safety research** for Large Reasoning Models (LRMs) like DeepSeek-R1, OpenAI o1, and other cutting-edge models. Focused on identifying risks, mitigation strategies, and ethical implications.

---

## 📜 Table of Contents

- Awesome Large Reasoning Model (LRM) Safety 🔥
- 📜 Table of Contents
- 🚀 Motivation
- 🤖 Large Reasoning Models
  - Open Source Models
  - Closed Source Models
- 📰 Latest arXiv Papers (Auto-Updated)
- 🔑 Key Safety Domains (coming soon)
- 🔖 Dataset & Benchmark
  - For Traditional LLM
  - For Advanced LRM
- 📚 Survey
  - LRM Related
  - LRM Safety Related
- 🛠️ Projects & Tools (coming soon)
  - Model-Specific Resources (example)
  - General Tools (coming soon) (example)
- 🤝 Contributing
- 📄 License
- ❓ FAQ
- 🔗 References

---

## 🚀 Motivation

Large Reasoning Models (LRMs) are revolutionizing AI capabilities in complex decision-making scenarios. However, their deployment raises critical safety concerns. This repository aims to catalog research addressing these challenges and promote safer LRM development.
## 🤖 Large Reasoning Models

### Open Source Models

| Name | Organization | Date | Technique | Cold-Start | Aha Moment | Modality |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1 | DeepSeek | 2025/01/22 | GRPO | ✅ | ✅ | text-only |
| QwQ-32B | Qwen | 2025/03/06 | - | - | - | text-only |

### Closed Source Models

| Name | Organization | Date | Technique | Cold-Start | Aha Moment | Modality |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI-o1 | OpenAI | 2024/09/12 | - | - | - | text, image |
| Gemini-2.0-Flash-Thinking | Google | 2025/01/21 | - | - | - | text, image |
| Kimi-k1.5 | Moonshot | 2025/01/22 | - | - | - | text, image |
| OpenAI-o3-mini | OpenAI | 2025/01/31 | - | - | - | text, image |
| Grok-3 | xAI | 2025/02/19 | - | - | - | text, image |
| Claude-3.7-Sonnet | Anthropic | 2025/02/24 | - | - | - | text, image |
| Gemini-2.5-Pro | Google | 2025/03/25 | - | - | - | text, image |

---

## 📰 Latest arXiv Papers (Auto-Updated)

Updated every 12 hours with the latest 20 relevant papers. Earlier papers can be found here.

| Date | Title | Authors | Abstract Summary |
| --- | --- | --- | --- |
| 2026-03-20 | Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation | Richard J. Young | Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar's test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen's kappa ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates. |
| 2026-03-20 | Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models | Sai Koneru, Elphin Joe et al. | In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context… |
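The first abstract above compares classifiers using two standard statistics: Cohen's kappa for chance-corrected inter-classifier agreement and McNemar's test for paired disagreement. A minimal sketch of how those two quantities are computed, on invented toy labels rather than the paper's data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two classifiers on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each classifier's marginal label rates.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

def mcnemar_statistic(labels_a, labels_b):
    """Continuity-corrected chi-square statistic on the discordant pairs."""
    b = sum(x == 1 and y == 0 for x, y in zip(labels_a, labels_b))
    c = sum(x == 0 and y == 1 for x, y in zip(labels_a, labels_b))
    return 0.0 if b + c == 0 else (abs(b - c) - 1) ** 2 / (b + c)

# Toy verdicts: 1 = "faithful", 0 = "unfaithful", from two classifiers.
pipeline = [1, 1, 1, 0, 1, 0, 1, 1]
judge    = [1, 0, 1, 0, 1, 0, 0, 1]
print(cohens_kappa(pipeline, judge), mcnemar_statistic(pipeline, judge))
```

A kappa near 0 means the classifiers agree barely more than chance (the abstract's 0.06 for sycophancy hints), which is exactly why single-number faithfulness rates are hard to compare across studies.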