VILA-Lab/OpAI-Bench

Operation-Guided Progressive Human-to-AI Text Transformation Benchmark
for Multi-Granularity AI-Text Detection

Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li, Salman Khan, Zhiqiang Shen

Corresponding author


arXiv Paper   Hugging Face Dataset   Apache-2.0 License


OpAI-Bench is a benchmark for evaluating AI-text detection under progressive human–AI co-editing, where AI revisions are introduced gradually into selected parts of a human-written document and tracked with document-, sentence-, token-, and span-level provenance.


🎯 Overview

Real-world writing is increasingly produced through progressive human–AI co-editing: a human draft may be polished, paraphrased, compressed, expanded, or stylistically rewritten by an AI assistant over multiple revision rounds. However, most AI-text detection benchmarks focus on static endpoint texts, treating documents as either fully human-written or fully AI-generated.

OpAI-Bench addresses this limitation by constructing operation-guided human-to-AI revision trajectories. Starting from human-written source documents, each sample is expanded into a nine-version trajectory from v0 to v8, where AI-edited coverage increases progressively while different edit operations are applied.

Figure: OpAI-Bench construction pipeline.

Each document is first segmented into sentences. A deterministic shuffle order is then created using the document identifier as the seed. At each revision stage, OpAI-Bench selects the first k% of sentences from this fixed order, where k is that stage's target AI sentence coverage. Because every stage takes a prefix of the same order, the edited sentence set grows cumulatively:

S⁽⁰⁾ ⊆ S⁽¹⁾ ⊆ ... ⊆ S⁽⁸⁾

This design makes the benchmark reproducible, trajectory-aware, and suitable for analyzing how AI-authorship signals emerge, accumulate, or disappear across revision stages.
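
The selection procedure described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the function names, the use of Python's `random.Random` seeded with the document ID string, the round-half-up rule for turning a percentage into a sentence count, and the example coverage values are all assumptions.

```python
import random


def fixed_order(doc_id: str, num_sentences: int) -> list[int]:
    """Return a shuffle of sentence indices that depends only on doc_id."""
    order = list(range(num_sentences))
    # Assumption: the document ID string is used directly as the RNG seed.
    random.Random(doc_id).shuffle(order)
    return order


def selected_sentences(order: list[int], coverage_pct: int) -> set[int]:
    """Take the first coverage_pct% of the fixed order."""
    # Assumption: round half up when the percentage is not a whole sentence.
    k = (len(order) * coverage_pct + 50) // 100
    return set(order[:k])


order = fixed_order("doc-0001", 20)  # hypothetical document ID
# Illustrative coverage values; any non-decreasing list behaves the same way.
stages = [selected_sentences(order, pct) for pct in (0, 25, 50, 75, 100)]

# Every stage is a prefix of the same order, so the sets are nested.
nested = all(a <= b for a, b in zip(stages, stages[1:]))
print(nested)
```

Because each stage is a prefix of one fixed order, the subset chain holds by construction, and re-running with the same document ID reproduces the same selection.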


🔄 Cumulative Revision Trajectory

OpAI-Bench represents AI-assisted writing as a progressive human-to-AI revision trajectory, not a single endpoint classification.

Key design choices:

  • Nine ordered versions: each document is expanded from v0 to v8.
  • Human source: v0 is the original fully human-written document.
  • Controlled AI coverage: later versions introduce AI edits at predefined sentence-level ratios.
  • Deterministic selection: edited sentences are selected using a document-specific fixed shuffle order.
  • Cumulative editing: once a sentence is selected, it remains in the AI-edited set in later versions.
  • Multi-granularity provenance: AI involvement is tracked at document, sentence, token, and span levels.

| Version | Edit Operation | AI Sentence Coverage |
|---------|----------------|----------------------|
| v0      | None           | 0%                   |
| v1      | Polish         | 15%                  |
| v2      | Paraphrase     | 25%                  |
| v3      | Style rewrite  | 40%                  |
| v4      | Compress       | 50%                  |
| v5      | Expand         | 60%                  |
| v6      | Style rewrite  | 75%                  |
| v7      | Paraphrase     | 90%                  |
| v8      | Polish         | 100%                 |
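
As a worked example of the schedule in the table above, the sketch below computes how many sentences are AI-edited at each version of a 20-sentence document, and how many of those are newly added at that version. The `SCHEDULE` structure and `edited_counts` helper are hypothetical names, and the round-half-up rule is an assumption (it does not matter for 20 sentences, where every percentage yields a whole number).

```python
# (version, edit operation, AI sentence coverage in percent), from the table above.
SCHEDULE = [
    ("v0", None, 0),
    ("v1", "polish", 15),
    ("v2", "paraphrase", 25),
    ("v3", "style rewrite", 40),
    ("v4", "compress", 50),
    ("v5", "expand", 60),
    ("v6", "style rewrite", 75),
    ("v7", "paraphrase", 90),
    ("v8", "polish", 100),
]


def edited_counts(num_sentences: int) -> list[int]:
    """AI-edited sentence count per version (round half up: an assumption)."""
    return [(num_sentences * pct + 50) // 100 for _, _, pct in SCHEDULE]


counts = edited_counts(20)
# Sentences that enter the edited set for the first time at each version.
newly_added = [b - a for a, b in zip([0] + counts, counts)]

for (version, op, pct), total, new in zip(SCHEDULE, counts, newly_added):
    print(f"{version}: {op or 'none':13s} {pct:3d}% -> {total:2d} edited ({new} new)")
```

For a 20-sentence document this gives 0, 3, 5, 8, 10, 12, 15, 18, and 20 edited sentences from v0 to v8.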

📈 Benchmark Statistics

OpAI-Bench contains 15,722 human-written source documents, expanded into 31,089 generator-specific revision trajectories and 279,794 versioned samples across four writing domains.

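| Domain               | Source Docs (v0) | Revision Trajectories | Versioned Samples (v0–v8) | Avg. Sentences | Avg. Tokens |
|----------------------|------------------|-----------------------|---------------------------|----------------|-------------|
| Student essays       | 3,969            | 7,906                 | 71,154                    | 21.0           | 398.8       |
| News articles        | 3,998            | 7,892                 | 71,028                    | 24.0           | 491.3       |
| Government reports   | 3,993            | 8,000                 | 72,000                    | 20.6           | 563.7       |
| Scientific abstracts | 3,762            | 7,291                 | 65,612                    | 11.0           | 234.3       |
| Total                | 15,722           | 31,089                | 279,794                   | 19.3           | 426.1       |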

Notes. Source documents are distinct human-written v0 texts. Revision trajectories are generator-specific editing sequences initialized from v0. Versioned samples count the released texts along the full revision path from v0 to v8.

The main benchmark split uses GPT-5.4, GPT-5.4-nano, and Gemini 2.5 Flash. Qwen3-8B is reserved as a held-out generator for cross-generator evaluation.


🚀 Quick Start

Installation

git clone https://github.com/<ORG>/OpAI-Bench.git
cd OpAI-Bench

pip install -r requirements.txt

Load the Dataset

from datasets import load_dataset

ds = load_dataset("OpAI-Bench1/OpAI-Bench")
print(ds)

Export a Split to CSV

from datasets import load_dataset

test_ds = load_dataset("OpAI-Bench1/OpAI-Bench", split="test")
test_ds.to_csv("opaibench_test.csv")

Generate New Trajectories

python construction/build_opaibench.py \
    --input_csv data/input.csv \
    --output_csv outputs/opaibench.csv \
    --id_column id \
    --text_column text

The construction pipeline supports multiple LLM providers, including OpenAI, Gemini, DeepSeek, and Hugging Face Inference APIs.


🔍 Evaluation

OpAI-Bench supports evaluation across document, sentence, token, and span granularities. The benchmark evaluates detectors under three regimes:

  1. Zero-shot detectors evaluated without exposure to OpAI-Bench.
  2. LLM-as-detector baselines prompted to classify sentence-level authorship.
  3. OpAI-Bench-trained detectors fine-tuned on the training split and evaluated on in-distribution and held-out generators.

Evaluated Detector Families

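| Granularity        | Detectors                                                                                   |
|--------------------|---------------------------------------------------------------------------------------------|
| Document-level     | Desklib, DetectLLM, E5-Small, Fast-DetectGPT, OOD-LLM-Detect, RADAR, RoBERTa-OpenAI, GigaCheck |
| Sentence-level     | AdaLoc, GenAI-Sentence, GL-CLiC, SeqXGPT, GPT-5.4, Gemini 3 Flash, Claude Haiku 4.5          |
| Token / Span-level | DAMASHA, GigaCheck                                                                          |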
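
To make the relationship between these granularities concrete, here is a small sketch that derives token-level labels, spans, and a document-level coverage figure from sentence-level labels. It rests on a strong simplifying assumption: every token simply inherits the label of its sentence, and tokens are whitespace-separated. How the released dataset actually computes token and span provenance is not described here and may differ (for example, by diffing the edited text against the original). All names are hypothetical.

```python
def token_labels(sentences: list[str], ai_sentence_ids: set[int]) -> list[int]:
    """Label each whitespace token 1 if its sentence is AI-edited, else 0."""
    labels = []
    for i, sent in enumerate(sentences):
        labels.extend([int(i in ai_sentence_ids)] * len(sent.split()))
    return labels


def spans(labels: list[int]) -> list[tuple[int, int]]:
    """Collapse token labels into half-open [start, end) runs of AI tokens."""
    out, start = [], None
    for pos, lab in enumerate(labels + [0]):  # sentinel closes a trailing run
        if lab and start is None:
            start = pos
        elif not lab and start is not None:
            out.append((start, pos))
            start = None
    return out


# Toy document for illustration only.
sentences = ["The cat sat.", "It was warm outside.", "Then it slept.", "Nobody noticed."]
ai_ids = {1, 2}  # indices of AI-edited sentences

sent_labels = [int(i in ai_ids) for i in range(len(sentences))]
tok = token_labels(sentences, ai_ids)
coverage = sum(sent_labels) / len(sent_labels)

print(sent_labels)  # sentence level
print(tok)          # token level
print(spans(tok))   # span level
print(coverage)     # document-level AI sentence coverage
```

Adjacent AI-edited sentences merge into a single span, which is why span-level evaluation is not just a restatement of sentence-level labels.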

Metrics include accuracy and AI-class F1, reported across versions, domains, generators, and edit operations.
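
The two metrics can be written out directly. The sketch below computes accuracy and F1 for the AI class (label 1) and then repeats the computation per group, as one would when reporting by version. It is an illustration rather than the code in `eval.py`: the field names `version`, `label`, and `pred` are hypothetical, and the records are toy values, not benchmark results.

```python
from collections import defaultdict


def accuracy_and_ai_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Accuracy and F1 for the positive (AI = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    acc = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # no AI items and no AI predictions
    return acc, f1


def by_group(records: list[dict], key: str) -> dict:
    """Compute both metrics separately for each value of records[i][key]."""
    groups = defaultdict(lambda: ([], []))
    for r in records:
        truth, pred = groups[r[key]]
        truth.append(r["label"])
        pred.append(r["pred"])
    return {g: accuracy_and_ai_f1(t, p) for g, (t, p) in groups.items()}


# Toy records for illustration only; not benchmark results.
records = [
    {"version": "v0", "label": 0, "pred": 0},
    {"version": "v0", "label": 0, "pred": 1},
    {"version": "v4", "label": 1, "pred": 0},
    {"version": "v4", "label": 1, "pred": 1},
    {"version": "v8", "label": 1, "pred": 1},
    {"version": "v8", "label": 1, "pred": 1},
]

overall = accuracy_and_ai_f1([r["label"] for r in records], [r["pred"] for r in records])
per_version = by_group(records, "version")
print(overall)
print(per_version)
```

Grouping by `domain`, `generator`, or edit operation works the same way by changing the `key` argument, provided the records carry those fields.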

Quick Evaluation

uv sync

export HF_DATASET=OpAI-Bench1/OpAI-Bench

# Smoke test
uv run python eval.py detector=e5-small max_samples=20 \
    dataset.hf_repo=$HF_DATASET

# Full evaluation
uv run python eval.py detector=e5-small \
    dataset.hf_repo=$HF_DATASET

For detector-specific checkpoints and local model setup, see CHECKPOINTS.md.

Common overrides:

uv run python eval.py detector=fast-detectgpt dataset.split=dev
uv run python eval.py -m detector=e5-small,desklib,fast-detectgpt

Outputs are written to results/ and include predictions, aggregate metrics, and a reproducible run configuration snapshot.


🏆 Key Findings

OpAI-Bench reveals that AI-text detectability is not monotonic with AI coverage.

Mixed-authorship intermediate versions can be harder to detect than both fully human and heavily AI-edited endpoints.

In particular, the benchmark identifies a difficult mixed-authorship region around v4, where 50% AI coverage coincides with compression. This suggests that reliable AI-text detection requires moving beyond static human-vs-AI endpoint classification toward trajectory-aware and operation-aware evaluation.


Sentence-level detection accuracy across revision versions, domains, and generators. Mixed-authorship intermediate versions consistently emerge as the most challenging stage, highlighting the non-monotonic nature of AI-text detectability.


📁 Repository Structure

OpAI-Bench/
├── construction/           # Benchmark construction pipeline
├── conf/                   # Hydra configurations
├── src/                    # Core benchmark utilities
├── opai_bench_detectors/   # Unified detector wrappers
├── baseline/               # Upstream detector implementations
├── training/               # Fine-tuning scripts
├── scripts/                # Utility and helper scripts
├── assets/                 # Figures used in the paper
├── eval.py                 # Main evaluation entry point
├── CHECKPOINTS.md          # Detector checkpoint setup
├── THIRD_PARTY_NOTICES.md
├── pyproject.toml
├── LICENSE
└── README.md

📖 Citation

If you find OpAI-Bench useful in your research, please cite:

@article{bsharat2026opaibench,
  title = {Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection},
  author = {Bsharat, Sondos Mahmoud and Liu, Jiacheng and Zhao, Xiaohan and Yao, Tianjun and Shang, Xinyi and Tang, Yi and Cui, Jiacheng and Elhagry, Ahmed and Al Khatib, Salwa K. and Li, Hao and Khan, Salman and Shen, Zhiqiang},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year = {2026}
}
