Distillation Attack on LLMs (Model Extraction via Knowledge Distillation)

What is a Distillation Attack?

In ordinary machine learning, knowledge distillation is a training technique: a large “teacher” model supervises a smaller “student” model, which learns to reproduce the teacher’s outputs. The result is a student that is faster and cheaper to run than its teacher (Hinton et al., 2015).

In a distillation attack, an adversary turns this technique into a tool for model theft. The attacker has only black-box API access to the victim model, which plays the role of the teacher. By sending queries to the API and recording its outputs, the attacker assembles a dataset of query-output pairs and uses it to train a student model that imitates the teacher’s behavior. This is also known as model extraction or model stealing (Tramèr et al., arXiv:1609.02943). ...
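The query-then-imitate loop above can be sketched in a few lines. This is a toy illustration, not a real LLM attack: the “teacher” here is a hidden logistic classifier standing in for a black-box prediction API, and the attacker fits a student of the same form against the teacher’s soft outputs. All names (`teacher_api`, `extract_student`) and hyperparameters are hypothetical choices for the sketch.

```python
import math
import random

# Hypothetical black-box "teacher". The attacker never sees these weights;
# in a real attack this would be a remote model behind a prediction API.
_SECRET_W, _SECRET_B = [2.0, -3.0], 0.5

def teacher_api(x):
    """Black-box query: returns the teacher's probability for class 1."""
    z = sum(w * xi for w, xi in zip(_SECRET_W, x)) + _SECRET_B
    return 1.0 / (1.0 + math.exp(-z))

def extract_student(n_queries=2000, lr=0.5, epochs=200, seed=0):
    """Train a student to imitate the teacher's soft outputs (distillation)."""
    rng = random.Random(seed)

    # Step 1: the attacker crafts inputs and records the API's responses,
    # building a dataset of (query, output) pairs.
    data = []
    for _ in range(n_queries):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, teacher_api(x)))

    # Step 2: fit a student by minimizing cross-entropy against the
    # teacher's soft labels, using plain batch gradient descent.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, p in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            q = 1.0 / (1.0 + math.exp(-z))
            for i in range(len(w)):
                gw[i] += (q - p) * x[i]  # gradient of cross-entropy w.r.t. w
            gb += q - p
        for i in range(len(w)):
            w[i] -= lr * gw[i] / len(data)
        b -= lr * gb / len(data)
    return w, b
```

Even if the student’s weights do not match the teacher’s exactly, its decision boundary typically agrees with the teacher’s on most inputs, which is what makes extraction valuable to the attacker.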

03/01/2026 · 9 min · Digenaldo Neto