GCSP research stipend | Fall 2025
Why do Generative AI Models Snitch? Investigating Deceptive Tool-calling Behaviors in Safety-aligned Language Models
Large language models (LLMs) undergo extensive safety alignment to refuse harmful requests and align with human values. However, recent works suggest this alignment may be fundamentally brittle. We investigate this brittleness in the context of tool-calling enabled agentic systems, where models have access to communication and data manipulation capabilities. We created a benchmark of 100 adversarial scenarios across 25 domains, and we find that safety-aligned models exhibit systematic deceptive tool-calling behaviors, such as whistleblowing and data exfiltration, even when explicitly instructed to maintain confidentiality. Then, we attempt to identify whether this behavior emerges from the alignment training process, results from over-alignment to safety training objectives, or emerges from misrepresented training objectives.