Tutorial: Custom eval ground truth labels
Goal
Define your own ground truth labels, run Refract's evaluation harness against them, and interpret precision, recall, and F1 scores per analyzer. Extend the eval beyond the built-in benchmark pages.
How the eval works
The eval package compares Refract's pipeline output against independently verified outcome labels — human-labeled ground truth about what actually happened on a page at a specific time. Labels are stored separately from pipeline output. The pipeline never modifies them.
Each label specifies:
- Which event types should appear on the page at a given time
- What section they should appear in (optional)
- The source of the ground truth (talk page consensus, RFC closure, ArbCom decision, or page protection)
Step 1: Create a ground truth labels file
Create a JSON file with your labels. Each label describes one known outcome:
[
  {
    "id": "covid-origin-consensus",
    "source": "talk_page_consensus",
    "pageTitle": "COVID-19",
    "description": "Talk page consensus (Jan 2023) removed unsourced origin claims from lead",
    "observedAt": "2023-01-15T00:00:00Z",
    "resolution": "keep",
    "referenceUrl": "https://en.wikipedia.org/wiki/Talk:COVID-19/Archive_50#Origin_claims",
    "expectedEventTypes": ["sentence_removed", "section_reorganized"],
    "expectedSection": "(lead)"
  },
  {
    "id": "bitcoin-citation-dispute",
    "source": "rfc_closure",
    "pageTitle": "Bitcoin",
    "description": "RFC closed June 2022: Nakamoto whitepaper citation must remain in lead",
    "observedAt": "2022-06-20T00:00:00Z",
    "resolution": "keep",
    "referenceUrl": "https://en.wikipedia.org/wiki/Talk:Bitcoin/RFC_citation",
    "expectedEventTypes": ["citation_added"],
    "expectedSection": "(lead)"
  },
  {
    "id": "trump-protection-event",
    "source": "page_protection",
    "pageTitle": "Donald_Trump",
    "description": "Page protected Aug 2024 after edit war. Revert events expected in surrounding revisions.",
    "observedAt": "2024-08-10T00:00:00Z",
    "resolution": "other",
    "referenceUrl": "https://en.wikipedia.org/wiki/Special:Log?page=Donald_Trump",
    "expectedEventTypes": ["protection_changed", "revert_detected"]
  }
]
Each label requires:
- id: unique identifier
- source: one of talk_page_consensus, rfc_closure, arbcom_decision, page_protection
- pageTitle: exact Wikipedia page title
- expectedEventTypes: which event types the pipeline should detect
- referenceUrl: public permalink to the ground truth source
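Before running the eval, it can help to check a labels file against the required fields listed above. The following is a minimal sketch, not the eval package's own validation: the GroundTruthLabel interface and the validateLabels helper are hypothetical names, and the shape is inferred only from the example labels in this tutorial.

```typescript
// Hypothetical shape inferred from the example labels in this tutorial;
// the real type in the eval package may differ.
const SOURCES = [
  "talk_page_consensus",
  "rfc_closure",
  "arbcom_decision",
  "page_protection",
] as const;
type LabelSource = (typeof SOURCES)[number];

interface GroundTruthLabel {
  id: string;
  source: LabelSource;
  pageTitle: string;
  expectedEventTypes: string[];
  referenceUrl: string;
  // Present in the examples but not listed as required.
  description?: string;
  observedAt?: string;
  resolution?: string;
  expectedSection?: string;
}

// Returns human-readable problems; an empty array means every label
// has the five required fields and ids are unique.
function validateLabels(input: unknown): string[] {
  if (!Array.isArray(input)) return ["top-level value must be an array"];
  const problems: string[] = [];
  const seenIds = new Set<string>();
  input.forEach((raw, i) => {
    const where = `label[${i}]`;
    if (typeof raw !== "object" || raw === null) {
      problems.push(`${where}: must be an object`);
      return;
    }
    const label = raw as Partial<GroundTruthLabel>;
    if (typeof label.id !== "string" || label.id === "") {
      problems.push(`${where}: missing id`);
    } else if (seenIds.has(label.id)) {
      problems.push(`${where}: duplicate id "${label.id}"`);
    } else {
      seenIds.add(label.id);
    }
    if (!(SOURCES as readonly string[]).includes(String(label.source))) {
      problems.push(`${where}: unsupported source`);
    }
    if (typeof label.pageTitle !== "string" || label.pageTitle === "") {
      problems.push(`${where}: missing pageTitle`);
    }
    if (!Array.isArray(label.expectedEventTypes) || label.expectedEventTypes.length === 0) {
      problems.push(`${where}: expectedEventTypes must be a non-empty array`);
    }
    if (typeof label.referenceUrl !== "string" || label.referenceUrl === "") {
      problems.push(`${where}: missing referenceUrl`);
    }
  });
  return problems;
}
```

Passing the parsed contents of my-labels.json (for example, the result of JSON.parse) to validateLabels should return an empty array for the three example labels above.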
Step 2: Run the eval with custom labels
refract eval --ground-truth my-labels.json
Refract loads the labels, fetches the relevant pages, runs the pipeline, and compares output against your ground truth:
Loaded 3 ground truth labels from my-labels.json
[1/3] COVID-19 — talk_page_consensus: Talk page consensus (Jan 2023)...
PASS — precision=1.00 recall=1.00 f1=1.00
2/2 expected events matched (sentence_removed, section_reorganized)
[2/3] Bitcoin — rfc_closure: RFC closed June 2022...
PASS — precision=1.00 recall=1.00 f1=1.00
1/1 expected events matched (citation_added)
[3/3] Donald_Trump — page_protection: Page protected Aug 2024...
FAIL — precision=1.00 recall=0.50 f1=0.67
1/2 expected events matched (revert_detected)
Missed: protection_changed
=== Eval Summary ===
Total outcomes: 3
Passed: 2 / Failed: 1
Overall precision: 100.0%
Overall recall: 83.3%
Overall F1: 88.9%
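The Overall F1 of 88.9% in the sample output is consistent with an unweighted mean of the three per-outcome F1 scores (1.00, 1.00, 0.67). That aggregation is an assumption drawn from the numbers alone, and the harness may combine scores differently. A minimal sketch with a hypothetical macroAverage helper:

```typescript
// Unweighted (macro) average: every outcome counts equally,
// regardless of how many events it expects.
function macroAverage(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Per-outcome F1 from the sample run; 0.67 is 2/3 before rounding.
const perOutcomeF1 = [1.0, 1.0, 2 / 3];
const overallF1 = macroAverage(perOutcomeF1);

console.log(`${(overallF1 * 100).toFixed(1)}%`); // prints "88.9%"
```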
Step 3: Interpret the results
| Outcome | What it means |
|---|---|
| Pass (P=1.0, R=1.0) | Pipeline detected all expected events. The analyzer is working as intended. |
| Pass (P<1.0, R=1.0) | Pipeline found the expected events plus extras (false positives). Precision < 1.0 may indicate the analyzer is too sensitive. |
| Fail (R=0.0) | Pipeline missed all expected events. The analyzer may not cover this scenario, or the revision range doesn't include the event. |
| Fail (P=1.0, R=0.5) | Pipeline found some expected events but missed others, so recall drops below 1.0. The analyzer is partially working; check whether the missed event type is in scope. |
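To make the scores concrete, the sketch below computes precision, recall, and F1 for one outcome using the standard definitions. It assumes matching is done on the set of event types only; the real harness may also consider sections or revision ranges, and scoreOutcome is a hypothetical name.

```typescript
interface Score {
  precision: number;
  recall: number;
  f1: number;
}

// Set-based scoring on event types (an assumption about how matching works).
function scoreOutcome(expected: string[], detected: string[]): Score {
  const expectedSet = new Set(expected);
  const detectedSet = new Set(detected);

  // True positives: detected types that were also expected.
  let truePositives = 0;
  detectedSet.forEach((t) => {
    if (expectedSet.has(t)) truePositives += 1;
  });

  // Precision: share of detected types that were expected.
  const precision = detectedSet.size === 0 ? 0 : truePositives / detectedSet.size;
  // Recall: share of expected types that were detected.
  const recall = expectedSet.size === 0 ? 0 : truePositives / expectedSet.size;
  // F1: harmonic mean of precision and recall.
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);

  return { precision, recall, f1 };
}

// A missed event lowers recall: precision 1, recall 0.5, F1 about 0.67.
const missedOne = scoreOutcome(["protection_changed", "revert_detected"], ["revert_detected"]);

// An extra event lowers precision: precision 0.5, recall 1, F1 about 0.67.
const extraOne = scoreOutcome(["citation_added"], ["citation_added", "sentence_removed"]);
```

Both cases produce the same F1, which is why the precision and recall columns matter when deciding whether an analyzer is too sensitive or not sensitive enough.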
Step 4: Diagnose misses
When an outcome fails, investigate the missed events:
- Run the same page manually:
  refract analyze "Page" --depth forensic
- Check the revision range around the observedAt timestamp
- Verify the expectedEventTypes match what the analyzer produces
- If the expectedSection is specified, check that the event fired in that section
The eval doesn't tell you why a miss happened — it tells you what was missed. Use the manual analysis to diagnose.
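The checks above can be partly automated once you have the events from a manual run. The sketch below is illustrative only: DetectedEvent and diagnose are hypothetical, and the assumption that each event carries an eventType and an optional section is not taken from Refract's schema.

```typescript
// Hypothetical event shape; check the schema reference for the real fields.
interface DetectedEvent {
  eventType: string;
  section?: string;
}

interface MissReport {
  missed: string[];       // expected types that never fired
  unexpected: string[];   // types that fired but were not expected
  wrongSection: string[]; // expected types that fired, but never in the expected section
}

function diagnose(
  expectedEventTypes: string[],
  detected: DetectedEvent[],
  expectedSection?: string,
): MissReport {
  const detectedTypes = new Set(detected.map((e) => e.eventType));
  const expectedSet = new Set(expectedEventTypes);

  const missed = expectedEventTypes.filter((t) => !detectedTypes.has(t));
  const unexpected = Array.from(detectedTypes).filter((t) => !expectedSet.has(t));

  // Only meaningful when the label specifies a section.
  const wrongSection =
    expectedSection === undefined
      ? []
      : expectedEventTypes.filter(
          (t) =>
            detectedTypes.has(t) &&
            !detected.some((e) => e.eventType === t && e.section === expectedSection),
        );

  return { missed, unexpected, wrongSection };
}
```

A non-empty wrongSection suggests the analyzer fired but the label's expectedSection does not match where the edit actually happened, which is a label problem as often as an analyzer problem.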
Step 5: Contribute labels back
If you've curated labels for notable Wikipedia events, consider contributing them to
the built-in GROUND_TRUTH_LABELS in packages/eval/src/ground-truth.ts. Each label
is a coded observation of a real editorial process outcome. The more labels the eval
has, the better it measures analyzer accuracy across diverse scenarios.
Format for contribution:
- id: descriptive kebab-case (covid-origin-consensus-2023)
- referenceUrl: must be a public, permanent Wikipedia URL
- source: must be one of the 4 supported types
- expectedEventTypes: must be members of the current EventType union
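Two of these rules are easy to check mechanically before opening a pull request. The sketch below is a hypothetical pre-submission lint, not part of Refract: lintContribution is an illustrative name, and it only verifies that the URL is an HTTPS link on a wikipedia.org host. Whether the link is truly permanent still needs a human check.

```typescript
// Lowercase alphanumeric words separated by single hyphens.
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

function lintContribution(label: { id: string; referenceUrl: string }): string[] {
  const problems: string[] = [];

  if (!KEBAB_CASE.test(label.id)) {
    problems.push(`id "${label.id}" is not kebab-case`);
  }

  try {
    const url = new URL(label.referenceUrl);
    // Assumption: any language subdomain of wikipedia.org is acceptable.
    if (url.protocol !== "https:" || !url.hostname.endsWith(".wikipedia.org")) {
      problems.push("referenceUrl must be an https Wikipedia URL");
    }
  } catch (_err) {
    problems.push("referenceUrl is not a valid URL");
  }

  return problems;
}
```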
Next steps
- Evaluation harness reference — full eval API and benchmark pages
- Schema reference — all event types
- Build a custom analyzer — add new event types and test them