Because dataset preparation shouldn't be the hardest part of fine-tuning
Look, I'll be honest with you. I spent way too much time manually creating training datasets for fine-tuning experiments. You have all this domain knowledge sitting in PDFs, but turning that into actual training data? That's a different beast entirely.
I was working on a medical compliance project and needed to fine-tune a model (using OpenAI fine-tuning) to understand very specific regulatory requirements. Not general medical knowledge, but the exact wording of particular SOPs and how they applied to real scenarios, so that I didn't have to build a complex application.
After manually writing datasets, I thought, "there has to be a better way, especially when AI exists." So I built this tool. It worked so well for my use case that I figured other people might be dealing with the same frustration. I was able to bake my SOPs into an OpenAI GPT model, ran a test, and it was pretty accurate.
This tool takes your domain knowledge (PDFs) and generates realistic training scenarios with expert-level responses. It's not magic - it uses the knowledge you already have and creates training pairs that actually make sense for your specific field.
Here's what happened when I used it:
- Fed it some domain-specific SOPs (those PDF documents nobody wants to read). Just make sure the PDFs are OCR-ready, or run them through a parsing tool first to get accurate text extraction; otherwise the dataset will be of no use.
- Got back 50+ realistic compliance scenarios with detailed, accurate responses
- Fine-tuned a model that actually understood the nuances of the regulations
- Saved probably 2-3 weeks of manual dataset creation and building a complex app
```bash
git clone this-repo
cd ai-dataset-generator
pip install -r requirements.txt
```

You'll need an OpenAI API key. Create a `.env` file:

```bash
echo "OPENAI_API_KEY=your_actual_key_here" > .env
```

Don't use the free tier if you're generating a lot of data - you'll hit rate limits fast. I spent about $30 on this whole activity.
This is where your domain expertise comes in. The tool can handle:
- PDFs (it'll extract the text for you)
- Markdown files
- Plain text documents
Just dump your knowledge files into the knowledge_base/ directory. I usually organize by domain:
```
knowledge_base/
├── medical/      # My original use case
├── legal/        # Friend's law firm compliance
├── financial/    # Another project
└── your_domain/  # Whatever you're working on
```
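For illustration, here's roughly how you could gather the supported file types from that directory yourself. The tool's own loader lives in `knowledge_base_utils.py` and may work differently; `collect_knowledge_files` is just a name I'm using for the sketch.

```python
# Illustrative only: walk knowledge_base/ and collect the file types the tool supports.
# The actual loading logic is in knowledge_base_utils.py and may differ.
from pathlib import Path

SUPPORTED = {".pdf", ".md", ".txt"}

def collect_knowledge_files(root: str = "knowledge_base/medical/") -> list[Path]:
    """Return all PDF, Markdown, and plain-text files under the given directory."""
    return [p for p in Path(root).rglob("*") if p.suffix.lower() in SUPPORTED]

print(collect_knowledge_files())
```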
The tool comes with some templates I've already set up, but you'll probably want to customize them. Here's what I did for medical compliance:
```python
from ai_dataset_generator import AIDatasetGenerator
from config import ConfigTemplates

# Start with a template
config = ConfigTemplates.medical_compliance()

# Point it to your knowledge
config.knowledge_sources = ["knowledge_base/medical/"]

# Decide how many training pairs you want
config.num_scenarios = 50  # Start small, see how it goes

# Generate the dataset
generator = AIDatasetGenerator(config)
dataset = generator.generate_dataset()
```

That's it. You'll get a `.jsonl` file ready for OpenAI's fine-tuning API.
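Before you train on the output, it's worth a quick spot-check. A sketch along these lines works; the filename is a placeholder, so use whatever your run actually wrote, and it assumes the user/assistant message ordering shown in the example output further down.

```python
# Quick spot-check of the generated dataset before fine-tuning.
# "training_dataset.jsonl" is a placeholder - use the file your run produced.
import json

with open("training_dataset.jsonl", "r", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(f"{len(rows)} training pairs")
for row in rows[:3]:  # eyeball the first few
    user_msg = row["messages"][0]["content"]       # assumes user message first
    assistant_msg = row["messages"][1]["content"]  # assistant response second
    print("Q:", user_msg[:120])
    print("A:", assistant_msg[:120])
    print("-" * 40)
```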
The quality depends a lot on how good your source material is. Garbage in, garbage out. I had to clean up some of my SOPs because they had inconsistent formatting.
Don't generate 500 scenarios on your first run. Start with 10-20, see if the quality is what you want, then scale up.
The default prompts work okay, but you'll probably want to tweak them for your specific domain. I spent time getting the "expert voice" right for medical compliance.
Some PDFs extract beautifully, others are a mess. If you have important documents that are image-based or have weird formatting, you might need to clean them up first using a good parsing tool.
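One quick way to tell whether a PDF is text-extractable or image-only is to see if PyMuPDF can pull any text out of it. A minimal sketch (the file path is just an example):

```python
# Minimal sketch: check whether a PDF contains extractable text,
# or is image-only and needs OCR / a better parsing tool first.
import fitz  # PyMuPDF

def has_extractable_text(pdf_path: str, min_chars: int = 200) -> bool:
    with fitz.open(pdf_path) as doc:
        total = sum(len(page.get_text()) for page in doc)
    return total >= min_chars

print(has_extractable_text("knowledge_base/medical/some_sop.pdf"))  # example path
```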
Medical and healthcare: works great for SOPs, clinical protocols, and regulatory requirements. The model learns to cite specific sections and requirements, which is crucial for compliance work.
Legal: contract analysis, regulatory compliance, policy interpretation. Lawyers love citing specific statutes, and the model picks up on that pattern.
Financial: banking regulations, risk assessment, securities rules. Lots of "if this, then that" logic that the model handles well.
The beauty is you can adapt it to whatever field you're in. Quality assurance, safety procedures, technical documentation - if you have structured knowledge, this can probably help.
Let me be straight with you - there are several Python files in this project, but you don't need all of them.
- `ai_dataset_generator.py` - The main tool. This is what actually generates your datasets.
- `config.py` - Configuration system. Lets you customize for different domains without editing the main code.
- `knowledge_base_utils.py` - PDF processing. If you have PDFs (and you probably do), this extracts text automatically. Without it, you'd need to copy/paste text manually.
- `get_started.py` - Interactive setup. Walks you through first-time setup, but you can just follow this README instead.
- `examples/quick_start_example.py` - Usage examples. Shows different ways to use the tool.
- `legacy_tissue_dataset_generator.py` - My original version for medical compliance. Just kept it for reference.

Bottom line: If you want the absolute minimum, just use `ai_dataset_generator.py` and `config.py`. If you have PDFs, grab `knowledge_base_utils.py` too. Everything else is just convenience.
Under the hood, this uses:
- Agno framework for the AI agent (handles the knowledge retrieval)
- OpenAI's API for the actual text generation
- PyMuPDF for PDF text extraction
- Python because, well, it's Python
The tool creates an AI agent that has access to your knowledge base and generates scenarios that require that knowledge to answer correctly. It's not just general knowledge - it's specifically based on your documents.
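To make that concrete, here's a heavily simplified sketch of the idea. This is not the tool's actual implementation (it skips the Agno agent entirely); it just shows the core move of grounding generation in an excerpt of your documents rather than general knowledge.

```python
# Conceptual sketch only - NOT the tool's actual implementation.
# Core idea: the prompt is grounded in your documents, not general knowledge.
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

def generate_pair(doc_excerpt: str, domain: str = "medical compliance") -> str:
    prompt = (
        f"You are a {domain} expert. Using ONLY the excerpt below, write one realistic "
        f"scenario a practitioner might face, then answer it citing the relevant section.\n\n"
        f"EXCERPT:\n{doc_excerpt}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```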
I've included some configs I've used:
```python
# Medical compliance (my original)
config = ConfigTemplates.medical_compliance()

# Legal compliance (friend's law firm)
config = ConfigTemplates.legal_compliance()

# Custom domain (example: quality assurance)
config = ConfigTemplates.custom(
    domain="Quality Assurance",
    role="QA Expert",
    task="quality assessment based on standards",
    knowledge_path="knowledge_base/qa/"
)
```

You can also build your own from scratch if the templates don't fit.
Using OpenAI's API isn't free. For my 50-scenario medical dataset, I spent around $30 in API costs. Not terrible, but something to keep in mind. The gpt-4o-mini model is cheaper and works fine for most use cases.
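If you want a rough number before a big run, back-of-the-envelope math is enough. The per-token prices and token counts below are placeholders I made up for the sketch; check OpenAI's current pricing page before trusting the output.

```python
# Back-of-the-envelope cost estimate. Prices and token counts are PLACEHOLDERS -
# check OpenAI's pricing page for current rates before relying on this.
INPUT_PRICE_PER_1M = 0.15    # assumed $/1M input tokens (placeholder)
OUTPUT_PRICE_PER_1M = 0.60   # assumed $/1M output tokens (placeholder)

num_scenarios = 50
tokens_in_per_scenario = 4000   # rough guess: prompt + retrieved document text
tokens_out_per_scenario = 800   # rough guess: scenario + expert answer

cost = num_scenarios * (
    tokens_in_per_scenario / 1e6 * INPUT_PRICE_PER_1M
    + tokens_out_per_scenario / 1e6 * OUTPUT_PRICE_PER_1M
)
print(f"Estimated cost: ${cost:.2f}")
```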
The tool generates:
- Training dataset in JSONL format (ready for OpenAI fine-tuning)
- Raw scenarios in JSON (for analysis and review)
- Configuration file (so you can reproduce your results)
The training pairs look like real expert conversations:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "A contractor wants to use a new chemical cleaning agent in our facility. The safety data sheet shows it's flammable with a flash point of 85°F. Our facility policy requires flash points above 100°F. What's the safety assessment?"
    },
    {
      "role": "assistant",
      "content": "DECISION: REJECT\n\nRATIONALE: Per Safety Policy Section 3.2.1, all cleaning chemicals must have flash points >100°F for indoor use. At 85°F, this chemical does not meet our minimum safety requirements and poses an unacceptable fire risk in our facility environment..."
    }
  ]
}
```

- Quality over quantity: 50 good examples beats 200 mediocre ones
- Review before training: Always spot-check the generated scenarios
- Start with base models: I used `gpt-4.1-mini-2025-04-14` for my first fine-tune
- Test extensively: Your fine-tuned model will be very specific to your domain
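Once you're happy with the dataset, kicking off the fine-tune is a couple of calls with the OpenAI Python SDK. A sketch (the filename and base model are just examples, substitute your own):

```python
# Sketch: upload the generated JSONL and start a fine-tuning job.
# Filename and base model are examples - substitute your own.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("training_dataset.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-mini-2025-04-14",  # the base model I used; pick whatever fits
)
print(job.id)
```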
This approach works well when:
- You have structured domain knowledge (SOPs, procedures, regulations)
- You need the model to cite specific sources or sections
- You want consistent, domain-specific responses
- You're dealing with compliance or regulatory stuff
It's probably not the right fit when:
- Your domain knowledge is mostly tacit/experiential (hard to document)
- You need the model to be creative rather than accurate
- Your source documents are really messy or inconsistent
- You're working with very visual or hands-on domains
I built this for my specific need, but I've tried to make it general enough for other people to use. If you find bugs or have ideas for improvements, feel free to contribute.
MIT License - use it however you want. If it saves you time, great. If you improve it, even better.
Dataset preparation used to be the most tedious part of fine-tuning for me. This tool doesn't eliminate the work entirely, but it makes it manageable. Instead of spending weeks writing training examples, I spend a few hours setting up the knowledge base and reviewing the output.
Your mileage may vary, but if you're sitting on a pile of domain-specific documents and thinking about fine-tuning, this might save you some headaches.
Built by someone who got tired of manual dataset creation. Shared because maybe you're tired of it too.