Extract structured construct–measured_by–justification triples from TEI‑encoded research papers using LLMs.
Built to extract construct–measurement–justification triples from psychology papers using LLMs with strong schema validation(medium.com, medium.com).
-
psychKG-IE-HuggingFace.py
Uses a local Hugging Face model via the Instructor + Pydantic pipeline. -
psychKG-IE-OpenAI.py
Uses OpenAI GPT‑4 API with function‑calling and Pydantic validation. Outputs todata/IE_output/o3. -
psychKG-IE-ChatAI.py
Connects to the KISSKI ChatAI API (via the GWDG/KISSKI HPC service) for various open-weights models including from the Qwen model family (e.g., Qwen 2.5‑72B), deepseek and GPT models. Outputs todata/IE_output/qwen2_5.
Raw TEI‑XML papers located in:
data/papers_input_tei_xml/
Extracted data saved as JSON to:
data/IE_output/
├── o3/ ← OpenAI‑ and ChatAI-based scripts output
└── qwen2_5/ ← ChatAI script output
Each JSON file contains a list of entries:
{
"construct": "...",
"measured_by": "...",
"justification": "..."
}Packages used:
transformers,instructor,pydantic,beautifulsoup4,openai- Access to KISSKI ChatAI endpoint (AcademicCloud / GWDG HPC)
- GPU recommended for Hugging Face script
python src/psychKG-IE-HuggingFace.py \
--input_dir data/papers_input_tei_xml \
--output_dir data/IE_output/qwen2_5python src/psychKG-IE-OpenAI.py \
--input_dir data/papers_input_tei_xml \
--output_dir data/IE_output/o3Ensure you have API access to KISSKI ChatAI (see GWDG/KISSKI LLM‑Service) and appropriate credentials, then run:
python src/psychKG-IE-ChatAI.py- Qwen output also comes via the KISSKI ChatAI endpoint using Qwen 2.5‑72B weights hosted by the service.
- KISSKI ChatAI API is a secure, OpenAI-compatible endpoint (supports GPT‑4 and open models) and adheres to data privacy rules (kisski.gwdg.de, dfn.de).
If you use this repository in your work, please cite:
D'Souza, J., & Wulff, D. (2025). psychKG-pilot: A Minimal Knowledge Graph for Psychology via LLM-based Structured Extraction (Version 0.1.0) [Computer software]. TIB & MPIB. https://github.com/sciknoworg/psychKG-pilot
Or use the CITATION.cff file for automatic citation formats.
BibTeX:
@software{dsouza2025psychkg,
author = {D'Souza, Jennifer and Wulff, Dirk},
title = {psychKG-pilot: A Minimal Knowledge Graph for Psychology via LLM-based Structured Extraction},
year = 2025,
version = {0.1.0},
publisher = {TIB & MPIB},
url = {https://github.com/sciknoworg/psychKG-pilot}
}This project is licensed under the MIT License.
If you have questions, feedback, or ideas to improve the project, feel free to open an issue or get in touch with us — we'd love to hear from you!