Aryan Keluskar

Computer science

Hometown: Mumbai and Hyderabad, India

Graduation date: Spring 2026

Additional details: Honors student

Security icon, disabled. A blue padlock, locked.

GCSP research stipend | Fall 2025

Why do Generative AI Models Snitch? Investigating Deceptive Tool-calling Behaviors in Safety-aligned Language Models

Large language models (LLMs) undergo extensive safety alignment to refuse harmful requests and align with human values. However, recent works suggest this alignment may be fundamentally brittle. We investigate this brittleness in the context of tool-calling enabled agentic systems, where models have access to communication and data manipulation capabilities. We created a benchmark of 100 adversarial scenarios across 25 domains, and we find that safety-aligned models exhibit systematic deceptive tool-calling behaviors, such as whistleblowing and data exfiltration, even when explicitly instructed to maintain confidentiality. Then, we attempt to identify whether this behavior emerges from the alignment training process, results from over-alignment to safety training objectives, or emerges from misrepresented training objectives.

Mentor:

View the poster
QR code for the current page

It’s hip to be square.

Students presenting projects at the Fulton Forge Student Research Expo are encouraged to download this personal QR code and include it within your poster. This allows expo attendees to explore more about your project and about you in the future. 

Right click the image to save it to your computer.

Additional projects from this student

AI models can be "merged" to expand their knowledge. We assess its effect on safety and detectability in creating trustworthy AI systems.

Mentor:

  • FURI
  • Spring 2025

Helping AI comprehend ambiguous human text will make it more reliable and effective in conversations and critical tasks alike.

Mentor:

  • FURI
  • Fall 2024