This project investigates whether machine-generated text can be distinguished from human-written text using traditional linguistic features and classical machine learning models.
Dataset: human-ai-parallel-corpus-biber
Models: Random Forest, XGBoost, MLP
Techniques: PCA, t-SNE, supervised learning
Can LLM-generated text be distinguished from human writing using linguistic structure alone? This project evaluates whether features like clause type, syntactic density, and function word usage can effectively classify text source using classical ML models.
This project builds on and extends experiments from:
Reinhart, A., Brown, D. W., Markey, B., Laudenbach, M., Pantusen, K., Yurko, R., & Weinberg, G. (2024). Do LLMs Write Like Humans? Variation in Grammatical and Rhetorical Styles. arXiv:2410.16107.
The original paper examined variation in grammatical and rhetorical features between LLM- and human-written texts using linguistic analysis.
This project replicates that framework and extends it by:
- Validating the original findings on open-access data
- Evaluating additional classifiers (XGBoost, MLP)
- Visualizing feature boundaries with PCA and t-SNE
These results support the robustness of feature-based detection and demonstrate how interpretable models can serve as architecture-agnostic tools for identifying LLM-generated content.
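The PCA/t-SNE step can be sketched as below. `X` is a placeholder matrix standing in for the 66 extracted linguistic features, so the shapes and hyperparameters here are illustrative, not the notebook's exact settings.

```python
# Sketch: projecting a Biber-style feature matrix to 2-D for inspection.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 66))   # placeholder for the 66-feature matrix

# Standardize features before projection so no single feature dominates.
X_std = StandardScaler().fit_transform(X)

pca_2d = PCA(n_components=2).fit_transform(X_std)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std)

print(pca_2d.shape, tsne_2d.shape)   # (200, 2) (200, 2)
```

Each 2-D projection can then be scatter-plotted with points colored by text source to inspect how cleanly the sources separate.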
Problem
As LLMs become widespread, distinguishing machine- from human-written text is essential for academic integrity, authorship verification, and trust online. Most existing detectors rely on model-specific signals that generalize poorly across architectures.
Findings
- XGBoost achieved 70.65% test accuracy across 7 LLM sources
- Most predictive features: present participial clauses and "that"-clause frequency
- GPT-4 vs. human texts were the easiest to separate; LLaMA variants were the most challenging
- Linguistic-feature-based models offer a transparent, generalizable alternative to proprietary detectors
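As a rough sketch of the multi-class setup (one label per source: human plus seven LLMs), here is a Random Forest baseline on synthetic stand-in data; the actual notebook trains on the 66 Biber features, and `XGBClassifier` from `xgboost` is a drop-in replacement with the same `fit`/`predict` API.

```python
# Sketch of the multi-class source-classification setup (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 66))        # placeholder 66-feature matrix
y = rng.integers(0, 8, size=800)      # 8 classes: human + 7 LLM sources

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))

# Feature importances show which linguistic features drive the split.
top = np.argsort(clf.feature_importances_)[::-1][:5]
```

Because the models are tree ensembles over interpretable features, the importance ranking directly identifies which grammatical constructions (e.g., participial clauses) separate the sources.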
detect-llm-generated-text/
├── llm_text_classifier.ipynb # Full notebook: loading, feature extraction, modeling, visualization
├── requirements.txt # Python dependencies
└── README.md
- Dataset: Paired human/LLM text samples (multiple genres)
- Preprocessing: Extracted 66 linguistic features from each text using Biber’s taxonomy
- Dimensionality Reduction: Visualized clusters using PCA and t-SNE
- Modeling: Trained Random Forest, XGBoost, and MLP
- Evaluation: Quantified performance and visual separation
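The preprocessing step can be sketched as below. The column names (`model` and the `f_*` feature columns) are assumptions about the released corpus schema, shown here on a tiny toy DataFrame.

```python
# Sketch: splitting a corpus DataFrame into a feature matrix and labels.
# Column names are illustrative assumptions, not the verified schema.
import pandas as pd

df = pd.DataFrame({
    "model": ["human", "gpt-4o", "llama-8b"],      # text source label
    "f_01_past_tense": [3.1, 2.4, 2.8],            # example Biber features
    "f_21_that_verb_comp": [1.0, 1.9, 1.4],
})

feature_cols = [c for c in df.columns if c.startswith("f_")]
X = df[feature_cols].to_numpy()                     # (n_texts, n_features)
y = df["model"].astype("category").cat.codes.to_numpy()  # integer class labels
```

The same `X`/`y` pair then feeds the dimensionality-reduction and modeling steps above.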
datasets
pandas
numpy
matplotlib
seaborn
scikit-learn
xgboost
Install dependencies:
pip install -r requirements.txt
- Reinhart, A., Brown, D. W., Markey, B., Laudenbach, M., Pantusen, K., Yurko, R., & Weinberg, G. (2024). Do LLMs Write Like Humans? Variation in Grammatical and Rhetorical Styles. arXiv:2410.16107.
- Human-AI Parallel Corpus (Hugging Face)
- Corpus of Contemporary American English (COCA)
- Created by Jaeun Park