This project investigates whether machine-generated text can be distinguished from human-written text using traditional linguistic features and classical machine learning models.
Dataset: human-ai-parallel-corpus-biber
Models: Random Forest, XGBoost, MLP
Techniques: PCA, t-SNE, supervised learning
Can LLM-generated text be distinguished from human writing using linguistic structure alone? This project evaluates whether features like clause type, syntactic density, and function word usage can effectively classify text source using classical ML models.
This project builds on and extends experiments from:
Reinhart, A., Brown, D. W., Markey, B., Laudenbach, M., Pantusen, K., Yurko, R., & Weinberg, G. (2024). Do LLMs Write Like Humans? Variation in Grammatical and Rhetorical Styles. arXiv:2410.16107.
The original paper examined variation in grammatical and rhetorical features between LLM- and human-written texts using linguistic analysis.
This project replicates that framework and extends it by:
- Validating the original findings on open-access data
- Evaluating additional classifiers (XGBoost, MLP)
- Visualizing feature boundaries with PCA and t-SNE
These results support the robustness of feature-based detection and demonstrate how interpretable models can serve as architecture-agnostic tools for identifying LLM-generated content.
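The PCA/t-SNE step can be sketched as below. `X` is a placeholder matrix standing in for the 66 extracted linguistic features, so the shapes and hyperparameters here are illustrative, not the notebook's exact settings.

```python
# Sketch: projecting a Biber-style feature matrix to 2-D for inspection.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 66))   # placeholder for the 66-feature matrix

# Standardize features before projection so no single feature dominates.
X_std = StandardScaler().fit_transform(X)

pca_2d = PCA(n_components=2).fit_transform(X_std)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std)

print(pca_2d.shape, tsne_2d.shape)   # (200, 2) (200, 2)
```

Each 2-D projection can then be scatter-plotted with points colored by text source to inspect how cleanly the sources separate.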
Problem
As LLMs become widespread, distinguishing machine- from human-written text is essential for academic integrity, authorship verification, and trust online. Most existing detectors rely on model-specific signals that generalize poorly across architectures.
Findings
- XGBoost achieved 70.65% test accuracy across 7 LLM sources
- Most predictive features: present participial clauses and "that"-clause frequency
- GPT-4 vs. human texts were the easiest to separate; LLaMA variants were the most challenging
- Linguistic-feature-based models offer a transparent, generalizable alternative to proprietary detectors
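As a rough sketch of the multi-class setup (one label per source: human plus seven LLMs), here is a Random Forest baseline on synthetic stand-in data; the actual notebook trains on the 66 Biber features, and `XGBClassifier` from `xgboost` is a drop-in replacement with the same `fit`/`predict` API.

```python
# Sketch of the multi-class source-classification setup (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 66))        # placeholder 66-feature matrix
y = rng.integers(0, 8, size=800)      # 8 classes: human + 7 LLM sources

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))

# Feature importances show which linguistic features drive the split.
top = np.argsort(clf.feature_importances_)[::-1][:5]
```

Because the models are tree ensembles over interpretable features, the importance ranking directly identifies which grammatical constructions (e.g., participial clauses) separate the sources.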
detect-llm-generated-text/
├── llm_text_classifier.ipynb # Full notebook: loading, feature extraction, modeling, visualization
├── requirements.txt # Python dependencies
└── README.md
- Dataset: Paired human/LLM text samples (multiple genres)
- Preprocessing: Extracted 66 linguistic features from each text using Biber’s taxonomy
- Dimensionality Reduction: Visualized clusters using PCA and t-SNE
- Modeling: Trained Random Forest, XGBoost, and MLP
- Evaluation: Quantified performance and visual separation
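The preprocessing step can be sketched as below. The column names (`model` and the `f_*` feature columns) are assumptions about the released corpus schema, shown here on a tiny toy DataFrame.

```python
# Sketch: splitting a corpus DataFrame into a feature matrix and labels.
# Column names are illustrative assumptions, not the verified schema.
import pandas as pd

df = pd.DataFrame({
    "model": ["human", "gpt-4o", "llama-8b"],      # text source label
    "f_01_past_tense": [3.1, 2.4, 2.8],            # example Biber features
    "f_21_that_verb_comp": [1.0, 1.9, 1.4],
})

feature_cols = [c for c in df.columns if c.startswith("f_")]
X = df[feature_cols].to_numpy()                     # (n_texts, n_features)
y = df["model"].astype("category").cat.codes.to_numpy()  # integer class labels
```

The same `X`/`y` pair then feeds the dimensionality-reduction and modeling steps above.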
datasets
pandas
numpy
matplotlib
seaborn
scikit-learn
xgboost
Install dependencies:
pip install -r requirements.txt
- Reinhart, A., Brown, D. W., Markey, B., Laudenbach, M., Pantusen, K., Yurko, R., & Weinberg, G. (2024). Do LLMs Write Like Humans? Variation in Grammatical and Rhetorical Styles. arXiv:2410.16107.
- Human-AI Parallel Corpus (Hugging Face)
- Corpus of Contemporary American English (COCA)
- Created by Jaeun Park