Work with notebooks

Refract's event output is standard NDJSON (one JSON object per line), so you can load it into any notebook environment for interactive analysis. This page walks through a complete workflow in Python (Jupyter, Marimo) and R.

Setup

pip install pandas altair
refract export "Bitcoin" --format ndjson > events.jsonl

Load and explore

Python (Jupyter / Marimo)

import pandas as pd
import json

events = []
with open("events.jsonl") as f:
    for line in f:
        if line.strip():
            events.append(json.loads(line))

df = pd.json_normalize(events)
df["timestamp"] = pd.to_datetime(df["timestamp"])

print(f"Loaded {len(df)} events")
print(f"Event types: {df['eventType'].nunique()}")
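If an export is interrupted or hand-edited, a single malformed line makes the loop above raise. A minimal sketch of a more tolerant loader follows, assuming only that each record is a JSON object on its own line. The `load_ndjson` helper and the sample records (including the `citation_added` event type) are hypothetical and used for illustration only.

```python
import io
import json


def load_ndjson(stream):
    """Parse NDJSON from a text stream, collecting bad lines instead of raising."""
    events, errors = [], []
    for lineno, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines between records are skipped
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, exc.msg))
            continue
        if isinstance(record, dict):
            events.append(record)
        else:
            errors.append((lineno, "not a JSON object"))
    return events, errors


# Hypothetical sample: two valid events, one blank line, one truncated line.
sample = io.StringIO(
    '{"eventType": "revert_detected", "section": "History"}\n'
    "\n"
    '{"eventType": "citation_added", "section": "Mining"}\n'
    '{"eventType": "citation_add\n'
)
events, errors = load_ndjson(sample)
print(len(events), "events;", len(errors), "bad line(s) at", [n for n, _ in errors])
```

To use it on a real export, pass an open file handle (`with open("events.jsonl") as f: events, errors = load_ndjson(f)`) and hand `events` to `pd.json_normalize` as before.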

R

library(jsonlite)
library(dplyr)
library(ggplot2)

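events <- stream_in(file("events.jsonl")) %>%
  # The Python examples read this field as eventType; expose it as event_type for the R code on this page
  rename(any_of(c(event_type = "eventType"))) %>%
  mutate(timestamp = as.POSIXct(timestamp))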

Event type distribution

Python

import altair as alt

dist = df["eventType"].value_counts().reset_index()
dist.columns = ["event_type", "count"]

alt.Chart(dist).mark_bar().encode(
    x=alt.X("event_type:N", sort="-y", title="Event type"),
    y=alt.Y("count:Q", title="Count"),
    color=alt.Color("event_type:N", legend=None)
).properties(width=600, height=300)

R

events %>%
  count(event_type) %>%
  ggplot(aes(reorder(event_type, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Event type", y = "Count")

Citation churn over time

Python

# .copy() avoids pandas' SettingWithCopyWarning when the month column is added
citation_df = df[df["eventType"].str.startswith("citation")].copy()
citation_df["month"] = citation_df["timestamp"].dt.to_period("M").astype(str)

churn = citation_df.groupby(["month", "eventType"]).size().reset_index(name="count")

alt.Chart(churn).mark_line(point=True).encode(
    x=alt.X("month:T", title="Month"),
    y=alt.Y("count:Q", title="Events"),
    color=alt.Color("eventType:N", title="Citation event")
).properties(width=600, height=300)

R

events %>%
  filter(grepl("^citation", event_type)) %>%
  mutate(month = format(timestamp, "%Y-%m")) %>%
  count(month, event_type) %>%
  ggplot(aes(as.Date(paste0(month, "-01")), n, color = event_type)) +
  geom_line() +
  geom_point(size = 2) +
  scale_x_date(date_breaks = "3 months", date_labels = "%b %Y") +
  labs(x = "Month", y = "Events", color = "Citation event") +
  theme_minimal()

Claim stability scores

Python

Define contested vs. stable events:

contested_types = ["revert_detected", "edit_cluster_detected", "sentence_removed"]
df["stability"] = df["eventType"].apply(
    lambda t: "Contested" if t in contested_types else "Stable"
)

stability_counts = df["stability"].value_counts().reset_index()
stability_counts.columns = ["category", "count"]
stability_counts

R

events %>%
  mutate(stability = ifelse(
    event_type %in% c("revert_detected", "edit_cluster_detected", "sentence_removed"),
    "Contested", "Stable"
  )) %>%
  count(stability)

A high ratio of contested to stable events suggests an actively disputed page. A page with mostly stable events (citations added, sections reorganized, categories changed) is more likely undergoing editorial improvement than dispute.
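The ratio itself is not computed in the examples above, so here is a minimal sketch of one way to do it. It reuses the contested event types listed earlier. The `contested_ratio` helper and the sample event types (`citation_added`, `section_reorganized`) are hypothetical, and the page does not define a threshold for "high", so none is assumed here.

```python
from collections import Counter

# Same contested event types as the stability example on this page.
CONTESTED_TYPES = {"revert_detected", "edit_cluster_detected", "sentence_removed"}


def contested_ratio(event_types):
    """Return (contested, stable, ratio) for an iterable of event-type strings.

    ratio is contested / stable, or None when there are no stable events.
    """
    counts = Counter(
        "Contested" if t in CONTESTED_TYPES else "Stable" for t in event_types
    )
    contested, stable = counts["Contested"], counts["Stable"]
    ratio = contested / stable if stable else None
    return contested, stable, ratio


# Hypothetical event types, for illustration only.
sample = [
    "revert_detected", "citation_added", "citation_added",
    "sentence_removed", "section_reorganized", "citation_added",
]
contested, stable, ratio = contested_ratio(sample)
print(f"{contested} contested / {stable} stable = {ratio:.2f}")
```

The helper accepts any iterable of strings, so a pandas column works directly: `contested_ratio(df["eventType"])`.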

Section-level analysis

Which sections have the most activity?

Python

section_activity = df["section"].value_counts().head(10)
section_activity

R

events %>%
  count(section, sort = TRUE) %>%
  slice_head(n = 10)  # rows are already sorted by count; top_n() is superseded in dplyr

Sections with high event counts combined with high revert ratios are the most contested parts of the page. Drill into a specific section by filtering on its name and re-running the stability analysis.
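A minimal sketch of combining the two signals follows. It assumes "revert ratio" means the share of a section's events whose type is `revert_detected`; the page does not define the term, so treat that as an assumption. The `section_revert_ratios` helper, the sample sections, and the `citation_added` event type are hypothetical.

```python
from collections import defaultdict


def section_revert_ratios(pairs, min_events=1):
    """Map section -> (total events, revert share) from (section, event_type) pairs."""
    totals = defaultdict(int)
    reverts = defaultdict(int)
    for section, event_type in pairs:
        totals[section] += 1
        if event_type == "revert_detected":
            reverts[section] += 1
    # Drop sections with too few events to give a meaningful share.
    return {
        s: (n, reverts[s] / n)
        for s, n in totals.items()
        if n >= min_events
    }


# Hypothetical (section, event_type) pairs, for illustration only.
sample = [
    ("History", "revert_detected"),
    ("History", "revert_detected"),
    ("History", "citation_added"),
    ("Mining", "citation_added"),
    ("Mining", "citation_added"),
    ("Mining", "revert_detected"),
    ("Mining", "citation_added"),
]
ratios = section_revert_ratios(sample)
# Rank by revert share first, then by event volume.
ranked = sorted(ratios.items(), key=lambda kv: (kv[1][1], kv[1][0]), reverse=True)
for section, (n, share) in ranked:
    print(f"{section}: {n} events, {share:.0%} reverts")
```

With a DataFrame, pass `zip(df["section"], df["eventType"])` as `pairs`. Raising `min_events` keeps a section with one event and one revert from ranking first.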

Exporting results

Python

# Save stability analysis as CSV
stability_counts.to_csv("bitcoin-stability.csv", index=False)

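# Save filtered events to new JSONL
# date_format="iso" keeps timestamps as ISO 8601 strings; the default writes epoch milliseconds
contested = df[df["stability"] == "Contested"]
contested.to_json("contested-events.jsonl", orient="records", lines=True, date_format="iso")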

Next steps

Using the Python SDK (refract-py)

The refract-py package provides typed dataclasses and pandas integration for Python workflows:

pip install refract-py

Load events with typed dataclasses

from refract import RefractClient, EvidenceEvent

client = RefractClient()
events = client.analyze("Bitcoin", depth="detailed")
# → list[EvidenceEvent] with typed fields: event_type, timestamp, section, etc.

Load into pandas DataFrame

import pandas as pd

events_dict = [e.model_dump() for e in events]
df = pd.DataFrame(events_dict)
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Quick exploration
df["event_type"].value_counts()
df.groupby("section")["event_type"].count().sort_values(ascending=False)

Analyze claim stability

contested_types = ["revert_detected", "edit_cluster_detected", "sentence_removed"]
df["stability"] = df["event_type"].apply(
    lambda t: "Contested" if t in contested_types else "Stable"
)

print(df["stability"].value_counts())
# → Stable: 285, Contested: 45

Export to Parquet for archival

# to_parquet needs a Parquet engine installed: pip install pyarrow (or fastparquet)
df.to_parquet("bitcoin-events.parquet", index=False)

LangChain document loader

from refract_langchain import RefractDocumentLoader

loader = RefractDocumentLoader(page="Bitcoin")
docs = loader.load()
# → list[Document] with page_content = claim text, metadata = stability + provenance

The Python SDK wraps the @refract-org/cli npm package, so install both for full functionality:

pip install refract-py && npm install -g @refract-org/cli

Typed exceptions such as RefractConfigError, RefractFetchError, and RefractInterpretationError provide structured error handling.
