Analytics with DuckDB
Refract's event output is newline-delimited JSON (NDJSON). DuckDB can query it directly with SQL — no loading step, no schema definition, no extensions needed.
Quick start: Refract ships a `refract-analytics.sql` file with pre-built views. Load it and query immediately:
duckdb -c ".read refract-analytics.sql" -c "SELECT * FROM contested_claims;"
Views included: contested_claims, citation_churn_by_month, section_activity,
event_type_distribution, talk_content_ratio.
Install DuckDB
brew install duckdb
Or download from duckdb.org.
Export events and query
# Export events
refract export "Bitcoin" --format ndjson > bitcoin-events.jsonl
# Quick count
duckdb -c "SELECT count(*) FROM 'bitcoin-events.jsonl'"
DuckDB's JSON scanner detects column names and types automatically from the first batch of records. No CREATE TABLE or INSERT step.
Event type distribution
SELECT "eventType", count(*) as cnt
FROM 'events.jsonl'
GROUP BY "eventType"
ORDER BY cnt DESC;
Expected output for a typical Wikipedia page at detailed depth:
| "eventType" | cnt |
|---|---|
| sentence_modified | 85 |
| citation_added | 34 |
| sentence_first_seen | 28 |
| revert_detected | 15 |
| template_added | 12 |
| ... | ... |
Citation churn over time
SELECT strftime(timestamp, '%Y-%m') as month,
count(*) FILTER (WHERE "eventType" = 'citation_added') as added,
count(*) FILTER (WHERE "eventType" = 'citation_removed') as removed,
count(*) FILTER (WHERE "eventType" = 'citation_replaced') as replaced
FROM 'bitcoin-events.jsonl'
WHERE "eventType" LIKE 'citation_%'
GROUP BY month
ORDER BY month;
Months with more removals than additions indicate net citation loss — the page is losing sources. Months with high replacement counts indicate active source updating, not necessarily instability.
Find the most contested sections
Contested sections have both reverts and edit clusters:
SELECT section,
count(*) FILTER (WHERE "eventType" = 'revert_detected') as reverts,
count(*) FILTER (WHERE "eventType" = 'edit_cluster_detected') as clusters,
count(*) FILTER (WHERE "eventType" LIKE 'sentence_%') as sentence_events,
count(*) as total_events
FROM 'bitcoin-events.jsonl'
GROUP BY section
HAVING reverts > 0 OR clusters > 0
ORDER BY (reverts + clusters) DESC;
Track a specific claim's lifecycle
SELECT "eventType", timestamp, section,
before, after
FROM 'bitcoin-events.jsonl'
WHERE after LIKE '%decentralized%'
OR before LIKE '%decentralized%'
ORDER BY timestamp;
Section-level timeline
SELECT section, timestamp, "eventType",
count(*) OVER (PARTITION BY section ORDER BY to_revision_id)
as cumulative_events
FROM 'bitcoin-events.jsonl'
ORDER BY section, timestamp;
Talk page correlation
Pages with active talk page discussions alongside content changes suggest editorial deliberation rather than edit-warring:
SELECT strftime(timestamp, '%Y-%m-%d') as day,
count(*) FILTER (WHERE "eventType" LIKE 'talk_%') as talk_events,
count(*) FILTER (WHERE "eventType" = 'revert_detected') as reverts
FROM 'bitcoin-events.jsonl'
GROUP BY day
HAVING talk_events > 0 OR reverts > 0
ORDER BY day;
Multiple pages at once
# Create a pages file
echo "Bitcoin" > pages.txt
echo "Ethereum" >> pages.txt
echo "Dogecoin" >> pages.txt
# Batch analyze
refract analyze --pages-file pages.txt --depth detailed -c
# Export all
refract export "Bitcoin" --format ndjson > events.jsonl
refract export "Ethereum" --format ndjson >> events.jsonl
refract export "Dogecoin" --format ndjson >> events.jsonl
Then compare across pages:
SELECT page_title,
count(*) as events,
count(*) FILTER (WHERE "eventType" = 'revert_detected') as reverts,
count(*) FILTER (WHERE "eventType" LIKE 'citation_%') as citation_churn
FROM 'events.jsonl'
GROUP BY page_title
ORDER BY events DESC;
Validation
# Count events — should match refract analyze output
duckdb -c "SELECT count(*) FROM 'events.jsonl'"
# Verify event types are valid
duckdb -c "SELECT DISTINCT \"eventType\" FROM 'events.jsonl' ORDER BY \"eventType\""
# Check for empty sections
duckdb -c "SELECT count(*) FROM 'events.jsonl' WHERE section = ''"
All queries run on DuckDB 1.0+ with no extensions. Refract NDJSON follows the standard format that DuckDB's JSON scanner reads out of the box.