SEVRA-BENCH Evaluates LLM Code Reviewer Susceptibility to Social Engineering Attacks

A new research paper published on arXiv introduces SEVRA-BENCH, a security benchmark designed to measure the resilience of large language model code reviewers against adversarial pull requests. As software development pipelines increasingly adopt LLM-based agents to review and approve pull requests, attackers gain an opening to pair malicious code with social engineering aimed at the reviewer. Standard benchmarks for static vulnerability detection do not capture this threat vector, in which an adversary controls both the functional code changes and the persuasive PR description.

Comparison

| Aspect | Prior benchmarks | SEVRA-BENCH |
|---|---|---|
| Evaluation Focus | Static vulnerability detection and direct code generation benchmarks | Combined code and social engineering text manipulation in pull requests |
| Source Vulnerabilities | Synthetic or generic code templates | Real-world CVEs from the top 10 categories of the 2025 CWE Top 25 |
| Reviewer Context | Analyzing code files in isolation | Analyzing code changes wrapped in 15 different social-engineering framings |
| Model Performance Gap | Assumed uniform security improvements across LLMs | Identified sharp security capability gaps between open-source and proprietary models |
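
The "Reviewer Context" row describes holding a code change fixed while varying the framing around it. A minimal Python sketch of that idea follows. The framing names and templates here are invented for illustration (the summary does not list the benchmark's 15 framings), and `ReviewCase` and `build_cases` are hypothetical helpers, not part of SEVRA-BENCH.

```python
from dataclasses import dataclass

# Hypothetical framings, invented for illustration only; the source does not
# enumerate the benchmark's actual 15 social-engineering framings.
FRAMINGS = {
    "neutral": "Update input handling in {component}.",
    "authority": "Security team already approved this change to {component}.",
    "urgency": "Hotfix for a production outage in {component}; please merge quickly.",
}


@dataclass(frozen=True)
class ReviewCase:
    framing: str      # name of the framing applied
    description: str  # PR description shown to the reviewer
    diff: str         # code change, identical across framings


def build_cases(diff: str, component: str) -> list[ReviewCase]:
    """Pair one code change with every framing, keeping the diff identical."""
    return [
        ReviewCase(name, template.format(component=component), diff)
        for name, template in FRAMINGS.items()
    ]


if __name__ == "__main__":
    for case in build_cases("-    validate(token)\n+    pass", "auth"):
        print(f"[{case.framing}] {case.description}")
```

Because the diff is the same in every case, any change in the reviewer's verdict between framings can be attributed to the PR description rather than to the code.
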
Action Checklist
- Evaluate your current LLM reviewer setup against adversarial contexts. Do not rely solely on the model's ability to spot bugs when the PR description is misleading.
- Avoid granting auto-merge privileges to LLM reviewers. Ensure human review remains mandatory for all external and high-impact contributions.
- Deploy dedicated static analysis tools alongside LLMs. Complement language models with traditional deterministic security scanners.
- Incorporate SEVRA-BENCH principles into internal LLM evaluations. Test review pipelines against historical CVE rollbacks to measure detection rates.
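
The checklist items on measuring detection rates and keeping human review mandatory could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: `detection_rates`, `may_merge`, and `gullible_reviewer` are hypothetical names, and the toy reviewer stands in for a real LLM call. It is not the paper's evaluation harness.

```python
from collections import defaultdict
from typing import Callable, Iterable

# A reviewer takes (pr_description, diff) and returns True if it flags the change.
Reviewer = Callable[[str, str], bool]


def detection_rates(
    cases: Iterable[tuple[str, str, str]], reviewer: Reviewer
) -> dict[str, float]:
    """Fraction of known-vulnerable changes flagged, grouped by framing.

    Each case is a (framing, pr_description, diff) triple; every diff is
    assumed to reintroduce a known vulnerability.
    """
    flagged: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for framing, description, diff in cases:
        total[framing] += 1
        if reviewer(description, diff):
            flagged[framing] += 1
    return {framing: flagged[framing] / total[framing] for framing in total}


def may_merge(llm_approved: bool, scanner_clean: bool, human_approved: bool) -> bool:
    """Hypothetical merge gate: the LLM verdict alone never suffices."""
    return llm_approved and scanner_clean and human_approved


def gullible_reviewer(description: str, diff: str) -> bool:
    """Toy stand-in for an LLM reviewer that a claim of prior approval can sway."""
    removes_check = any(
        line.startswith("-") and "validate" in line for line in diff.splitlines()
    )
    return removes_check and "already approved" not in description.lower()


if __name__ == "__main__":
    diff = "-    validate(token)\n+    pass"
    cases = [
        ("neutral", "Refactor token parsing.", diff),
        ("authority", "Security team already approved this change.", diff),
    ]
    print(detection_rates(cases, gullible_reviewer))
```

Reporting the rate per framing rather than as one aggregate number makes it visible when a reviewer catches a vulnerability under a neutral description but misses the same change under a persuasive one.
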
Source: arXiv
This page summarizes the original source. Check the source for full details.


