Paper
OLAF: Towards Robust LLM-Based Annotation Framework in Empirical Software Engineering
Mia Mohammad Imran, Tarannum Shaila Zaman
Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE), 2026
OLAF Framework
Overview
Large Language Models are increasingly used to annotate software engineering artifacts such as issues, commits, and qualitative data. Despite this growing adoption, studies often give insufficient attention to methodological rigor, reproducibility, and reliability.
OLAF introduces a structured operationalization framework that treats LLM-based annotation as a measurement process rather than an automation shortcut.
Motivation
Current LLM-based annotation practices often suffer from:
- Missing configuration and prompt details
- Lack of reliability and calibration reporting
- Sensitivity to prompt and model variation
- Limited reproducibility across studies
OLAF addresses these gaps by defining explicit, measurable constructs.
Core Dimensions
OLAF is organized around six dimensions:
1. Reliability
Measures agreement among annotators using chance-corrected statistics such as Cohen’s kappa and Krippendorff’s alpha.
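A minimal sketch of how this could be computed, assuming two annotators (e.g., a human and an LLM) over categorical labels; the labels below are hypothetical, and Cohen's kappa comes from scikit-learn.

# Chance-corrected agreement between two annotators over hypothetical labels.
from sklearn.metrics import cohen_kappa_score

human_labels = ["bug", "feature", "question", "bug", "bug", "question"]
llm_labels   = ["bug", "feature", "feature",  "bug", "bug", "question"]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.3f}")

# For more than two annotators or missing labels, Krippendorff's alpha
# (e.g., via the standalone krippendorff package) is a common alternative.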
2. Consensus
Captures group-level agreement and task ambiguity using correlation-based measures.
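One way this could be operationalized, sketched below under the assumption of ordinal ratings: mean pairwise Spearman correlation as a group-level consensus signal, and per-item label entropy as a rough ambiguity signal. The ratings matrix is hypothetical.

# Consensus as mean pairwise correlation; ambiguity as per-item entropy.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

ratings = np.array([
    [1, 3, 2, 5, 4],   # annotator A
    [2, 3, 2, 5, 5],   # annotator B
    [1, 2, 3, 4, 4],   # annotator C
])

pairwise = [spearmanr(a, b)[0] for a, b in combinations(ratings, 2)]
print(f"mean pairwise Spearman: {np.mean(pairwise):.3f}")

def item_entropy(column):
    _, counts = np.unique(column, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

ambiguity = [item_entropy(ratings[:, j]) for j in range(ratings.shape[1])]
print("per-item entropy:", [round(h, 2) for h in ambiguity])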
3. Aggregation
Combines labels from multiple annotators through majority voting or probabilistic models such as Dawid-Skene, GLAD, or MACE.
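A minimal majority-vote sketch over hypothetical per-item label lists; probabilistic models such as Dawid-Skene instead weight annotators by estimated reliability and are available in crowd-annotation toolkits.

# Majority-vote aggregation; ties are broken arbitrarily in this simple sketch.
from collections import Counter

annotations = {
    "issue-101": ["bug", "bug", "feature"],
    "issue-102": ["question", "question", "question"],
    "issue-103": ["bug", "feature", "feature"],
}

def majority_vote(labels):
    (label, count), = Counter(labels).most_common(1)
    return label, count / len(labels)   # winning label and its support

for item, labels in annotations.items():
    label, support = majority_vote(labels)
    print(f"{item}: {label} (support={support:.2f})")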
4. Transparency
Ensures reproducibility through full disclosure of model versions, prompts, parameters, and annotation settings.
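A sketch of the kind of provenance record such disclosure could take; all field names and values are illustrative rather than prescribed by OLAF.

# Illustrative annotation provenance record for a replication package.
import json

provenance = {
    "model": "gpt-4o-2024-08-06",            # exact model version/snapshot used
    "temperature": 0.0,
    "top_p": 1.0,
    "prompt_template": "prompts/bug_label_v3.txt",
    "prompt_hash": "sha256:...",              # hash of the exact prompt text
    "label_set": ["bug", "feature", "question"],
    "annotation_configuration": "LLM-as-an-Annotator",
    "runs_per_item": 3,
    "date_of_annotation": "2025-11-01",
}

print(json.dumps(provenance, indent=2))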
5. Calibration
Evaluates whether model confidence scores correspond to empirical accuracy using metrics such as Expected Calibration Error and Brier Score.
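A minimal sketch of both metrics, assuming self-reported confidence scores and binary correctness outcomes; ECE uses equal-width bins here, and all values are hypothetical.

# Brier score over the "prediction was correct" event, plus binned ECE.
import numpy as np

confidences = np.array([0.95, 0.80, 0.65, 0.90, 0.55, 0.70])  # stated confidence
correct     = np.array([1,    1,    0,    1,    0,    1   ])   # 1 if label was right

brier = np.mean((confidences - correct) ** 2)

def expected_calibration_error(conf, correct, n_bins=5):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap      # weight gap by bin's sample share
    return ece

print(f"Brier score: {brier:.3f}")
print(f"ECE: {expected_calibration_error(confidences, correct):.3f}")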
6. Drift
Quantifies annotation stability under prompt, configuration, or model changes using metrics such as Jensen-Shannon Divergence or agreement deltas.
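A sketch of distributional drift between two prompt variants using Jensen-Shannon divergence; the label proportions are hypothetical, and scipy's jensenshannon returns the JS distance, which is squared to obtain the divergence.

# Drift between label distributions produced by two prompt variants.
import numpy as np
from scipy.spatial.distance import jensenshannon

labels = ["bug", "feature", "question"]
dist_prompt_v1 = np.array([0.55, 0.30, 0.15])   # label proportions under prompt v1
dist_prompt_v2 = np.array([0.40, 0.42, 0.18])   # label proportions under prompt v2

jsd = jensenshannon(dist_prompt_v1, dist_prompt_v2, base=2) ** 2
print(f"Jensen-Shannon divergence: {jsd:.4f}")

# An agreement delta (e.g., kappa against the same gold labels under v1 minus
# under v2) gives a complementary, item-level view of the same drift.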
Annotation Configurations
OLAF supports multiple annotation workflows:
- Human-in-the-Loop
- Model-in-the-Loop
- Verifier-in-the-Loop
- LLM-as-a-Filter
- LLM-as-a-Judge
- LLM-as-an-Annotator
Each configuration implies different risks, benefits, and reporting requirements.
Guidelines
OLAF recommends that empirical studies:
- Explicitly declare the annotation configuration
- Fully document model and prompt details
- Report reliability and aggregation methods
- Track calibration and drift over time
- Treat LLM outputs as measurements requiring validation
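As a concrete illustration of these recommendations, a study-level reporting record could bundle the declared configuration and the measured statistics into a single artifact shipped with the replication package; every field name and value below is illustrative, not prescribed by OLAF.

# Illustrative study-level reporting record (all values hypothetical).
import json

report = {
    "annotation_configuration": "Verifier-in-the-Loop",
    "model_and_prompt": "see provenance record in replication package",
    "reliability": {"cohens_kappa": 0.74, "krippendorffs_alpha": 0.71},
    "aggregation": "majority vote (ties resolved by human verifier)",
    "calibration": {"ece": 0.06, "brier": 0.11},
    "drift": {"jsd_prompt_v1_vs_v2": 0.013, "kappa_delta": -0.02},
    "validation": "LLM labels spot-checked against 200 human-labeled items",
}

print(json.dumps(report, indent=2))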
Limitations
OLAF assumes constrained stability of LLMs and acknowledges the limitations posed by opaque proprietary models and stochastic decoding. Its metrics bound observable variability rather than guarantee deterministic behavior.
Reference
If you use OLAF, please cite the paper.
@inproceedings{imran2025olaf,
  title={OLAF: Towards Robust LLM-Based Annotation Framework in Empirical Software Engineering},
  author={Imran, Mia Mohammad and Zaman, Tarannum Shaila},
  booktitle={Proceedings of the 3rd IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE)},
  year={2026}
}