OLAF Examples
Task: Code Comment Annotation
Annotate source code comments to identify the software engineering intent expressed in each comment.
Each comment must receive exactly one label describing its primary purpose.
Use the following label set:
- Implementation: The comment explains how code works or why code was implemented in a specific way.
- Bug Fix: The comment documents a defect, workaround, or corrective change.
- Enhancement: The comment describes an improvement, refactoring, or optimization.
- Testing: The comment relates to test cases, test logic, or validation behavior.
- Documentation: The comment provides descriptive or explanatory information for readers.
- Indeterminate: The intent cannot be reliably inferred from the comment alone.
How to Do It
Step 0: Data Collection
- Collect 10,000 source code comments from software repositories.
- Each comment is stored with minimal context (file path, language, commit ID).
- No labels are assigned at this stage.
Step 1: Define the Unit of Annotation
- Each single code comment is one annotation unit.
- Inline and block comments are treated equally.
- Surrounding code is used only when necessary to interpret intent.
Step 2: Initial Human Annotation
- Randomly sample 400 code comments from the collected set.
- Assign two human annotators.
- Provide written definitions and examples for each label.
- Annotators independently label all 400 comments.
- Measure inter-annotator agreement using Cohen’s κ.
- Resolve disagreements and finalize the annotation guidelines.
This step establishes the reference interpretation of the task.
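A minimal sketch of the Step 2 agreement check, assuming the two annotators' labels are stored as parallel lists over the same 400 comments; the toy lists and the use of scikit-learn's cohen_kappa_score are illustrative choices, not requirements.

```python
# Step 2 agreement check (toy data for illustration).
from sklearn.metrics import cohen_kappa_score

LABELS = ["Implementation", "Bug Fix", "Enhancement",
          "Testing", "Documentation", "Indeterminate"]

# Placeholder labels; in practice these are the 400 labels per annotator.
annotator_a = ["Implementation", "Bug Fix", "Testing", "Documentation"]
annotator_b = ["Implementation", "Bug Fix", "Testing", "Indeterminate"]

kappa = cohen_kappa_score(annotator_a, annotator_b, labels=LABELS)
print(f"Human-human Cohen's kappa: {kappa:.3f}")
```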
Step 3: LLM Annotation and Human–Model Agreement
Use two independent LLMs as annotators:
- gpt-oss:20b
- mistral-small-3.2:24b
Procedure:
- Fix all inference parameters:
- Model version
- Temperature
- Prompt template
- Apply both LLMs to the same 400 comments annotated by humans.
- Require exactly one label per comment.
- Allow the label Indeterminate when intent is unclear.
- Compute:
- Agreement between each LLM and human annotators
- Agreement between the two LLMs
If agreement with humans is substantial, proceed to scaling.
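A sketch of how a single comment could be labeled under fixed inference parameters, assuming both models are served locally through Ollama's /api/generate endpoint; the endpoint, prompt wording, and fallback rule are illustrative assumptions rather than prescribed choices.

```python
# Step 3 sketch: label one comment with a fixed prompt and temperature,
# assuming both models are served locally by Ollama (illustrative setup).
import json
import urllib.request

MODELS = ["gpt-oss:20b", "mistral-small-3.2:24b"]
LABELS = ["Implementation", "Bug Fix", "Enhancement",
          "Testing", "Documentation", "Indeterminate"]

PROMPT_TEMPLATE = (
    "You are annotating source code comments.\n"
    "Assign exactly one label from: {labels}.\n"
    "Answer with the label only.\n\nComment:\n{comment}"
)

def annotate(model: str, comment: str) -> str:
    payload = {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(labels=", ".join(LABELS), comment=comment),
        "stream": False,
        "options": {"temperature": 0.0},  # fixed inference parameters
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        answer = json.loads(resp.read())["response"].strip()
    # Fall back to Indeterminate if the model does not return a valid label.
    return answer if answer in LABELS else "Indeterminate"

labels_per_model = {m: annotate(m, "// retry because the API times out sometimes")
                    for m in MODELS}
print(labels_per_model)
```

Human-LLM and LLM-LLM agreement on the 400 calibration comments can then be computed with the same Cohen's κ call used in Step 2.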
Step 4: Scaling Annotation
- Apply both LLMs to the remaining unlabeled comments from the original 10,000.
- Treat each LLM as an independent annotator.
- Aggregate labels using:
- Majority voting when LLMs agree, or
- Probabilistic aggregation when LLMs disagree
- Assign one final label per comment.
This enables annotation at scale while preserving measurement consistency.
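One possible shape of the Step 4 aggregation rule for two model annotators: agreement keeps the shared label, and disagreement backs off to the model with the higher human-agreement rate measured in Step 3. The weights below are placeholders, not measured values.

```python
# Step 4 aggregation sketch for two model annotators (placeholder weights).
def aggregate(label_a: str, label_b: str,
              weight_a: float = 0.82, weight_b: float = 0.78) -> str:
    if label_a == label_b:
        return label_a  # the two LLMs agree: keep the shared label
    # Disagreement: back off to the model that tracked humans more closely.
    return label_a if weight_a >= weight_b else label_b

print(aggregate("Bug Fix", "Bug Fix"))       # -> Bug Fix
print(aggregate("Bug Fix", "Enhancement"))   # -> Bug Fix (weight_a >= weight_b)
```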
Step 5: Transparency
Document:
- Label definitions
- Annotation instructions
- Prompt template
- Model identifiers and parameters
- Human-human agreement
- Human-LLM agreement
- Aggregation method
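For concreteness, the items above can be captured in a single machine-readable provenance record. The field names and values below are an illustrative example, not a required schema; the agreement numbers are placeholders.

```python
# Illustrative provenance record covering the Step 5 items.
import json

annotation_record = {
    "label_definitions": "labels.md",
    "annotation_instructions": "guidelines_v1.md",
    "prompt_template": "prompt_v1.txt",
    "models": [
        {"name": "gpt-oss:20b", "temperature": 0.0},
        {"name": "mistral-small-3.2:24b", "temperature": 0.0},
    ],
    "human_human_kappa": 0.74,  # placeholder value
    "human_llm_kappa": {"gpt-oss:20b": 0.70, "mistral-small-3.2:24b": 0.66},
    "aggregation_method": "majority vote; weighted fallback on disagreement",
}

with open("annotation_provenance.json", "w") as fh:
    json.dump(annotation_record, fh, indent=2)
```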
Step 6: Drift Monitoring
- Retain the original 400-comment calibration set.
- Re-run LLM annotation when:
- A model version changes or a new model is introduced
- Prompt wording changes
- Compare agreement against the calibration set.
- Recalibrate if agreement drops below the predefined threshold.
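A sketch of the drift check, assuming the human labels for the 400-comment calibration set are kept frozen; the recalibration threshold of 0.6 is an assumed value, not one prescribed here.

```python
# Step 6 sketch: re-annotate the calibration set after a model or prompt
# change and compare agreement with the frozen human labels.
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.6  # assumed recalibration threshold

def check_drift(human_labels, new_llm_labels) -> bool:
    kappa = cohen_kappa_score(human_labels, new_llm_labels)
    print(f"Calibration kappa after change: {kappa:.3f}")
    return kappa >= KAPPA_THRESHOLD  # False -> recalibrate before scaling
```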
Output
- Annotated code comments
- Annotation guidelines
- Human-human agreement
- Human-LLM agreement
- Aggregated labels
Task: Novice Programmers’ Emotion
Filter a large collection of online discussion posts to retain only those that are relevant for manual annotation, based on predefined criteria.
The goal is to reduce manual labour during annotation while preserving recall for relevant instances.
Filtering Objective
Retain posts that satisfy both conditions:
- Written by novice programmers
- Contain non-neutral emotional expressions related to learning
Posts that do not meet both conditions are excluded from further analysis.
How to Do It
Step 1: Initial Data Collection
- Collect a large, raw dataset of posts from online programming communities.
- No labels are assigned at this stage.
- Expect the majority of posts to be irrelevant for the target analysis.
Step 2: Define Filter Criteria
Formulate binary filter questions:
- Q1: Does the post exhibit any learning-related emotion (e.g., confusion, frustration, curiosity)?
- Q2: Is the post written by a novice or beginner programmer?
A post must satisfy both conditions to pass the filter.
Step 3: LLM-Based Filtering
- Select one or more LLMs to act as a filter, not as annotators.
- Use a fixed prompt with:
- Explicit definitions of emotions
- Clear indicators of novice status
- Yes/No answers only
- Apply the LLM to every post in the raw dataset.
- Retain only posts where the LLM answers Yes to both Q1 and Q2.
At this stage, the LLM is used only to reduce the dataset, not to generate labels.
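A sketch of such a filter, assuming a locally served model via Ollama and an illustrative prompt; only the Yes/No decisions are consumed, and the answer parsing is deliberately simplistic.

```python
# Step 3 filter sketch: keep a post only if the model answers Yes to both
# questions (model, endpoint, and prompt wording are illustrative).
import json
import urllib.request

FILTER_PROMPT = (
    "Answer each question with Yes or No only, one per line.\n"
    "Q1: Does the post express any learning-related emotion "
    "(e.g., confusion, frustration, curiosity)?\n"
    "Q2: Is the post written by a novice or beginner programmer?\n\n"
    "Post:\n{post}"
)

def passes_filter(post: str, model: str = "mistral-small-3.2:24b") -> bool:
    payload = {
        "model": model,
        "prompt": FILTER_PROMPT.format(post=post),
        "stream": False,
        "options": {"temperature": 0.0},
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        lines = json.loads(resp.read())["response"].strip().lower().splitlines()
    answers = [ln.strip() for ln in lines if ln.strip()]
    # Deliberately simplistic parsing: both answers must contain "yes".
    return len(answers) >= 2 and all("yes" in a and "no" not in a for a in answers[:2])

retained = [p for p in ["I'm so confused by recursion as a first-year student..."]
            if passes_filter(p)]
```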
Step 4: Human Verification
- Randomly sample posts that passed the filter.
- Have human reviewers verify:
- Emotional relevance
- Novice authenticity
- Remove false positives identified by humans.
This step bounds filtering error and mitigates LLM hallucination.
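A sketch of the verification sampling, where the sample size and fixed seed are assumptions for illustration; the false-positive estimate is what bounds the filtering error mentioned above.

```python
# Step 4 sketch: draw a random sample of filtered posts for human review
# and estimate the false-positive rate (sample size is an assumption).
import random

def sample_for_review(filtered_posts, n=200, seed=13):
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(filtered_posts, min(n, len(filtered_posts)))

def false_positive_rate(reviewed):
    # `reviewed` maps post -> True if humans confirm relevance, else False.
    rejected = sum(1 for ok in reviewed.values() if not ok)
    return rejected / len(reviewed) if reviewed else 0.0
```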
Step 5: Selection for Annotation
- From the verified filtered set, randomly sample posts for manual annotation.
- Exclude all posts filtered out in earlier steps.
- Proceed with full annotation only on the retained subset.
This ensures that manual effort is spent only on high-signal data.
Reliability and Transparency
Report:
- Filtering prompt
- Model version and parameters
- Acceptance criteria
- Human verification procedure
- Proportion of data retained and discarded
The filter is treated as a measurement instrument, not ground truth.
Output
- Filtered dataset
- Human-verified subset
- Filtering prompt and configuration
- Retention statistics