NLSQLRO is a Romanian NL-to-SQL dataset project built around three parts:
- dataset_generator/: synthetic data generation pipeline
- datasets_external/: external dataset preprocessing and translation workspace
- research_plan/: source materials and SQL dump preparation scripts
The repository supports local generation with vLLM, dataset preparation for LLaMA-Factory, and staged fine-tuning workflows.
- dataset_generator/: generation, validation, export, and training-data prep
- datasets/: generated datasets and normalized training artifacts
- datasets_external/: external pipeline scripts and raw merged artifacts
- research_plan/Faza_1/: scripts and input files for building SQLite SQL dumps
- training/llamafactory/: LLaMA-Factory training YAMLs
- scripts/: operational helper scripts
- tests/: endpoint checks and utility tests
- Python virtual environment in .venv
- CUDA-capable machine for vLLM generation
- SQL dump inputs in research_plan/Faza_1/
- vLLM and model weights installed separately in your active environment
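The prerequisites can be spot-checked with a short shell sketch (using nvidia-smi to probe GPU visibility is an assumption about the host, not a project requirement):

```bash
# Spot-check prerequisites before generating.
python3 -V
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L
else
  echo "warning: nvidia-smi not found; vLLM generation needs a CUDA-capable GPU"
fi
```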
Activate the project environment:
```bash
source scripts/activate.sh
python -V
```

The generator expects these dump files:

- research_plan/Faza_1/edu_reteaua_scolara.sql
- research_plan/Faza_1/rail_mers_tren.sql
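Their presence can be verified with a quick loop (paths taken from the list above):

```bash
# Report which of the expected SQL dumps are present.
for f in research_plan/Faza_1/edu_reteaua_scolara.sql \
         research_plan/Faza_1/rail_mers_tren.sql; do
  if [ -f "$f" ]; then echo "found:   $f"; else echo "missing: $f"; fi
done
```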
If they do not exist yet, build them with:
```bash
cd research_plan/Faza_1
python clean_educatie.py
python curatare_trenuri.py
cd ../..
```

Example single-endpoint launch:
```bash
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 1 \
  --dtype auto \
  --max-model-len 32768 \
  --generation-config vllm
```

Smoke-check the endpoint:
```bash
python tests/check_vllm_qwen35_endpoint.py --base-url http://127.0.0.1:8001/v1
```

Single-endpoint smoke run:
```bash
python -m dataset_generator.cli generate \
  --config dataset_generator/configs/vllm.smoke.8001.json \
  --progress-every 1
python -m dataset_generator.cli validate \
  --config dataset_generator/configs/vllm.smoke.8001.json
```

Main generation run:
```bash
python -m dataset_generator.cli generate \
  --config dataset_generator/configs/vllm.template.json \
  --progress-every 10
python -m dataset_generator.cli validate \
  --config dataset_generator/configs/vllm.template.json
```

Multi-GPU generation with one vLLM endpoint per GPU:
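This mode expects one vLLM server already listening per GPU (launched as in the single-endpoint example, varying CUDA_VISIBLE_DEVICES and --port). A minimal sketch for deriving the matching --base-urls value, assuming consecutive ports starting at 8001:

```bash
# Build the comma-separated --base-urls list for GPUs 0-3, one port per GPU.
GPUS="0 1 2 3"
BASE_URLS=""
port=8001
for g in $GPUS; do
  BASE_URLS="${BASE_URLS:+$BASE_URLS,}http://127.0.0.1:${port}/v1"
  port=$((port + 1))
done
echo "$BASE_URLS"
```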
```bash
python -m dataset_generator.cli generate-multi-gpu \
  --config dataset_generator/configs/vllm.template.json \
  --gpus 0,1,2,3 \
  --base-urls http://127.0.0.1:8001/v1,http://127.0.0.1:8002/v1,http://127.0.0.1:8003/v1,http://127.0.0.1:8004/v1 \
  --artifact alpaca \
  --work-dir datasets/multi_gpu_runs \
  --progress-every 1
```

Normalize local datasets into a LLaMA-Factory-ready layout:
```bash
python -m dataset_generator.cli prepare-llamafactory \
  --out-dir datasets/llamafactory
```

This produces normalized Alpaca JSONL files plus datasets/llamafactory/dataset_info.json.
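For orientation, a dataset_info.json entry in the shape LLaMA-Factory expects looks roughly like the fragment below; the dataset and file names here are hypothetical, and the field names follow LLaMA-Factory's data documentation, not this project's actual output:

```json
{
  "nlsqlro_alpaca": {
    "file_name": "nlsqlro_alpaca.jsonl",
    "formatting": "alpaca",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}
```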
LLaMA-Factory training configs are under training/llamafactory/.
The full runbook is in FINETUNING.md.
Helper scripts are under scripts/:
```bash
source scripts/activate.sh
bash scripts/clean_gpu_mem.sh --gpus "0 1 2 3"
bash scripts/train_all.sh --gpus "0,1,2,3"
```