Distillation Attack on LLMs (Model Extraction via Knowledge Distillation)

What is a Distillation Attack? In standard machine learning, knowledge distillation is a training method in which a large “teacher” model trains a smaller “student” model to reproduce its outputs, making the student faster and cheaper to run (Hinton et al., 2015). In a distillation attack, the attacker turns this technique into a theft tool: the black-box API plays the role of the teacher. The attacker sends queries to the API, records the outputs, and trains a student model on the resulting query-output pairs until it replicates the teacher’s behavior. This is also called model extraction or model stealing (Tramèr et al., arXiv:1609.02943). ...
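The query-then-distill loop described above can be sketched in a few lines. This is a minimal illustration, not the attack from the article: `teacher_api`, the hidden weights, and the logistic-regression student are all hypothetical stand-ins, with the black-box API simulated locally so the example is self-contained.

```python
# Hypothetical sketch of model extraction via distillation.
# The "teacher" stands in for a black-box API; the attacker
# only ever sees its outputs, never its hidden parameters.
import math
import random

random.seed(0)

def teacher_api(x):
    """Simulated black-box teacher: a hidden linear rule over 2-D inputs,
    returning a soft score in (0, 1). The attacker cannot inspect w_hidden."""
    w_hidden = (2.0, -3.0)
    z = w_hidden[0] * x[0] + w_hidden[1] * x[1]
    return 1.0 / (1.0 + math.exp(-z))

# Step 1: the attacker sends queries and records query-output pairs.
queries = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
pairs = [(x, teacher_api(x)) for x in queries]

# Step 2 (distillation): train a student -- here a logistic regression --
# to match the teacher's soft outputs via gradient descent.
w = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    for x, y in pairs:
        z = w[0] * x[0] + w[1] * x[1]
        p = 1.0 / (1.0 + math.exp(-z))
        err = p - y  # gradient of cross-entropy loss w.r.t. z for soft targets
        w[0] -= lr * err * x[0]
        w[1] -= lr * err * x[1]

# Step 3: measure agreement on fresh inputs the attacker never queried.
test = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
agree = sum(
    (teacher_api(x) > 0.5) == ((w[0] * x[0] + w[1] * x[1]) > 0)
    for x in test
)
agreement = agree / len(test)
print(f"student/teacher agreement: {agreement:.2f}")
```

Because the student trains on the teacher's soft scores rather than hard labels, it recovers the decision boundary from relatively few queries; the same loop scales conceptually to LLM extraction, where the "outputs" are token distributions or generated text.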

03/01/2026 · 9 min · Digenaldo Neto

LLM Security: A Scientific Taxonomy of Attack Vectors

Introduction Security in Large Language Models (LLMs) is no longer a niche topic within NLP (Natural Language Processing); it has become a field of computer security in its own right. Between 2021 and 2025, research shifted from adversarial examples against classifiers to broader risks: alignment failures, memorization, context contamination, and models that persistently behave in harmful ways. The problem today is not merely a bug in the code. It arises from how the system is built: the architecture, the training data, the alignment methods, and the way the model is connected to other systems. ...

02/13/2026 · 13 min · Digenaldo Neto