EXCEED - Explaining Compiler Errors To Everyone
LLM-based framework that rewrites Python interpreter errors into skill-aware explanations.
Tech Stack: Python, FastAPI, TypeScript, React, Tailwind CSS, Docker, PostgreSQL, Ollama
Overview
Programming error messages often hinder progress, especially for learners. EXCEED (Explaining Compiler Errors To Everyone) is an LLM-based framework that rewrites Python's standard interpreter errors into concise, actionable, and skill-aware explanations. The system supports two styles:
- Pragmatic: short, action-oriented fixes.
- Contingent: scaffolded guidance that helps build understanding step by step (a prompt-level sketch of both styles follows).
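As a rough illustration of how the two styles could be selected at the prompt level, here is a minimal sketch; the template wording and the `build_prompt` helper are illustrative assumptions, not the actual prompts used in the thesis.

```python
# Hypothetical style-specific prompt templates; the real EXCEED prompts differ.
PROMPTS = {
    "pragmatic": (
        "Rewrite the following Python error as a short, action-oriented fix "
        "for an experienced programmer.\nError:\n{error}\nCode:\n{code}"
    ),
    "contingent": (
        "Rewrite the following Python error as step-by-step guidance for a "
        "novice: explain what the error means before hinting at the fix.\n"
        "Error:\n{error}\nCode:\n{code}"
    ),
}

def build_prompt(style: str, error: str, code: str) -> str:
    """Fill the chosen style template with the raw traceback and the offending code."""
    return PROMPTS[style].format(error=error, code=code)

# Example: the same NameError, framed two different ways.
print(build_prompt("pragmatic", "NameError: name 'totl' is not defined", "print(totl)"))
print(build_prompt("contingent", "NameError: name 'totl' is not defined", "print(totl)"))
```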
Introduction
This research project (my Master's thesis), titled Beyond the Traceback: Using LLMs for Adaptive Explanations of Programming Errors, explored how large language models can automatically rephrase Python's standard interpreter error messages in ways that are more readable, actionable, and user-centered — while reducing cognitive load. Sitting at the intersection of Human–Computer Interaction, programming languages, and Generative AI, the research reflects my interest in using technology to improve programming education and personalize the learning experience.
While the original thesis title differs, the work is presented under the EXCEED umbrella: Explaining Compiler Errors To Everyone.
Key Features
- Skill-Aware Error Rewriting: Leveraged LLMs to generate error messages tailored to the user's skill level, providing either pragmatic (concise, action-oriented) or contingent (scaffolded guidance) explanations.
- User Skill Assessment: Developed and piloted an 8-item skill assessment to classify participants as Python novices or experts.
- Curated Error Dataset: Conducted a formative study to curate a diverse set of code snippets and corresponding errors for evaluation.
- Crowdsourced Evaluation: Designed and executed a crowdsourced study on Prolific, with real participants, measuring fix rate, time-to-fix, number of attempts, and subjective perceptions of readability, cognitive load, and authoritativeness.
- Comprehensive Analysis: Analyzed both objective and subjective data to assess the effectiveness of LLM-rewritten messages compared to standard interpreter errors.
- Open-Source Implementation: Developed the EXCEED framework using Python, FastAPI, TypeScript, React, Tailwind CSS, Docker, PostgreSQL, and Ollama for LLM integration. The code, linked above, is publicly available on GitHub and can be reused or extended by other researchers and developers interested in this area (a minimal sketch of the rewrite pipeline follows this list).
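To make the pipeline concrete, below is a minimal sketch of a skill-aware rewrite endpoint in the spirit of the EXCEED stack (FastAPI on the backend, Ollama for the LLM). The endpoint path, request fields, model name, and prompt wording are assumptions for illustration, not the actual EXCEED API.

```python
# Minimal sketch of a skill-aware rewrite endpoint (FastAPI + Ollama).
# Endpoint path, field names, model name, and prompt wording are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
import ollama

app = FastAPI()

class RewriteRequest(BaseModel):
    code: str   # snippet that triggered the error
    error: str  # raw interpreter traceback
    skill: str  # "novice" or "expert", from the skill assessment
    style: str  # "pragmatic" or "contingent"

@app.post("/rewrite")
def rewrite_error(req: RewriteRequest) -> dict:
    prompt = (
        f"You are helping a {req.skill} Python programmer.\n"
        f"Rewrite the error below in a {req.style} style "
        f"(pragmatic = short actionable fix, contingent = step-by-step guidance).\n"
        f"Code:\n{req.code}\nError:\n{req.error}"
    )
    response = ollama.chat(
        model="llama3",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.0},  # deterministic output, as in the study
    )
    return {"explanation": response["message"]["content"]}
```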
Challenges & Considerations
- Skill Level Classification: Accurately classifying users as novices or experts is challenging: assessments are notoriously hard to design and get right, and skill is continuous rather than discrete. Still, the pilot study helped refine the assessment pragmatically, acknowledging that there is no such thing as a perfect measurement.
- Error Dataset Curation: Selecting representative code snippets and errors required careful consideration to ensure diversity and relevance. The formative study helped gather a well-rounded dataset for evaluation. That being said, the dataset is still limited in size and scope, and future work could expand it further.
- LLM Limitations: While LLMs are powerful, they can also produce inconsistent or incorrect outputs. In educational settings this is particularly problematic, as students may rely on the generated explanations, and incorrect information could lead to confusion or misconceptions. Setting the temperature to 0.0 in our experiments largely mitigated this issue; in a genuine "in the wild" setting, however, it remains a challenge.
- Malicious Crowdsourced Participants: Running the main evaluation on Prolific introduced the risk of participants not engaging sincerely with the task, potentially skewing results. To address this, we implemented quality checks and data validation procedures to identify and exclude low-effort responses from the analysis. Indicators such as unusually fast completion times, inconsistent answers, attempts to tamper with the interface, and failure to follow instructions were used to flag low-quality data (a minimal filtering sketch follows this list).
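As a rough illustration of the kind of quality filtering and outcome analysis described above, here is a minimal sketch; the column names, thresholds, and the per-condition fix-rate summary are hypothetical, not the exact procedure used in the thesis.

```python
# Hypothetical quality filter and fix-rate summary for the crowdsourced data.
# Column names and thresholds are illustrative assumptions.
import pandas as pd

def filter_low_quality(df: pd.DataFrame, min_seconds: float = 120.0) -> pd.DataFrame:
    """Drop submissions that look like low-effort or insincere participation."""
    keep = (
        (df["completion_seconds"] >= min_seconds)   # not unusually fast
        & df["followed_instructions"]               # e.g. passed attention checks
        & ~df["tampered_with_interface"]            # no attempts to game the UI
    )
    return df[keep].reset_index(drop=True)

def fix_rate_by_condition(df: pd.DataFrame) -> pd.Series:
    """Share of tasks fixed per message condition (standard vs. rewritten)."""
    return df.groupby("condition")["fixed"].mean()
```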
Outcome
The project was defended as my Master's thesis at TU Delft. Results and code are available via the paper and GitHub links above. A natural next step is building an in-IDE plugin to provide real-time, skill-aware explanations and gather richer usage data.