Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs
If our project helps you, please give us a star and cite our paper!
- [2026.04.06] Our paper is accepted to ACL 2026 Main Conference!
Prompt optimization has become an important way to improve the performance of large language models, but existing methods often treat prompts as a single monolithic text block. This makes optimization difficult: when a prompt fails, it is hard to determine whether the issue comes from the reasoning strategy, output format, role description, task decomposition, or another hidden component.
Moreover, manually designing and tuning prompt programs requires substantial human effort. As tasks become more complex, prompt programs often need multiple interacting instructions, and searching over the whole prompt space directly can be inefficient, unstable, and difficult to interpret.
Our Adaptive Prompt Structure Factorization (aPSF) addresses these challenges by automatically discovering the latent structure of a prompt and optimizing it at the factor level. Instead of rewriting the entire prompt at each step, aPSF identifies which component matters most and performs targeted refinement.
aPSF first decomposes a prompt into a set of interpretable structural factors, such as task framing, reasoning style, answer format, constraint specification, and perspective. This factorized representation makes the prompt program easier to analyze, debug, and optimize.
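As a rough illustration of what a factorized prompt could look like in code, the sketch below stores each factor as a named piece of text and renders the full prompt by joining them. The class names `PromptFactor` and `PromptProgram`, the factor names, and the rendering rule are assumptions made for this example, not aPSF's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class PromptFactor:
    """One structural factor of a prompt (hypothetical representation)."""
    name: str  # e.g. "task_framing" or "answer_format"
    text: str  # the instruction text carried by this factor


@dataclass
class PromptProgram:
    """An ordered collection of factors that renders to a single prompt."""
    factors: list[PromptFactor] = field(default_factory=list)

    def render(self) -> str:
        # Join the non-empty factor texts in order, one per line.
        return "\n".join(f.text for f in self.factors if f.text)

    def replace(self, name: str, new_text: str) -> "PromptProgram":
        # Return a copy in which only the named factor is changed.
        if name not in {f.name for f in self.factors}:
            raise KeyError(name)
        return PromptProgram([
            PromptFactor(f.name, new_text if f.name == name else f.text)
            for f in self.factors
        ])
```

Because `replace` returns a new object, the original program stays available for comparison after a single factor is edited.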
Given model predictions and task feedback, aPSF identifies the factor that is most likely responsible for current errors. This allows the optimizer to focus on the most impactful part of the prompt rather than repeatedly modifying unrelated instructions.
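One simple way to picture error-guided selection is to attribute each failed example to a factor and pick the factor blamed most often. The sketch below assumes those attributions already exist as a list of factor names; how aPSF actually derives them is not described here, so `select_factor` and its tie-breaking rule are purely illustrative.

```python
from collections import Counter


def select_factor(error_attributions: list[str], factor_names: list[str]) -> str:
    """Pick the factor blamed for the most errors.

    `error_attributions` holds one factor name per failed example
    (hypothetical input: an LLM or a heuristic would have to produce it).
    """
    counts = Counter(a for a in error_attributions if a in factor_names)
    if not counts:
        # No usable attribution: fall back to the first factor.
        return factor_names[0]
    # Ties are broken by the order of `factor_names`.
    return max(factor_names, key=lambda n: counts[n])
```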
After selecting a factor, aPSF performs targeted intervention on that single factor while keeping the remaining prompt structure stable. This improves controllability and helps isolate which prompt changes are truly beneficial.
aPSF evaluates optimized prompt candidates through a unified scoring pipeline and accepts updates that improve task performance. This creates an adaptive optimization loop that progressively discovers and improves compositional prompt programs.
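The accept-if-better loop described above can be sketched as a small greedy procedure. This is a minimal sketch under stated assumptions: `score`, `select`, and `propose` are hypothetical callables standing in for the scoring pipeline, factor selection, and candidate generation, and the strict-improvement acceptance rule is one plausible reading of "accepts updates that improve task performance".

```python
from typing import Callable

Factors = dict[str, str]  # factor name -> instruction text


def optimize(
    factors: Factors,
    score: Callable[[Factors], float],
    select: Callable[[Factors], str],
    propose: Callable[[str, str], list[str]],
    steps: int = 3,
) -> tuple[Factors, float]:
    """Greedy factor-level loop: edit one factor per step, keep only gains."""
    best, best_score = dict(factors), score(factors)
    for _ in range(steps):
        name = select(best)  # the factor to intervene on in this step
        for candidate_text in propose(name, best[name]):
            trial = {**best, name: candidate_text}  # other factors untouched
            trial_score = score(trial)
            if trial_score > best_score:  # accept strict improvements only
                best, best_score = trial, trial_score
    return best, best_score
```

Copying the input dictionary keeps the caller's starting prompt unchanged, which makes it easy to compare the optimized factors against the baseline.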
- aPSF automatically discovers structured prompt factors from task descriptions and examples.
- aPSF performs targeted factor-level optimization instead of rewriting the whole prompt blindly.
- aPSF supports diverse reasoning and knowledge-intensive benchmarks, including math, logic, knowledge, and code tasks.
- aPSF is compatible with both hosted APIs and OpenAI-compatible local LLM endpoints.
| Category | Datasets |
|---|---|
| Math | gsm8k, multiarith, gsm_hard, aime2025, competition_math |
| Logic | aqua, bbh_all (27 tasks), bbh_<task_name> |
| Knowledge | mmlu, mmlu_<subject> (57 subjects), gpqa, gpqa_<domain> |
| Code | human_eval |
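The `bbh_<task_name>`, `mmlu_<subject>`, and `gpqa_<domain>` forms in the table suggest a family-plus-subset naming scheme. The helper below is a hypothetical sketch of how such names could be split; `parse_dataset_name` is not part of the repository, and treating `bbh_all` as "the whole family" is an assumption.

```python
from typing import Optional, Tuple

# Families from the table that accept a "<family>_<subset>" form.
SUBSET_FAMILIES = ("bbh", "mmlu", "gpqa")


def parse_dataset_name(name: str) -> Tuple[str, Optional[str]]:
    """Split a dataset name into (family, subset); subset is None for a full dataset."""
    for family in SUBSET_FAMILIES:
        prefix = family + "_"
        if name.startswith(prefix) and len(name) > len(prefix):
            subset = name[len(prefix):]
            # Assumption: "bbh_all" means the full benchmark, not a task named "all".
            return family, None if subset == "all" else subset
    # Names such as "gsm_hard" or "human_eval" are returned unchanged.
    return name, None
```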
- main.py provides a quick sanity check for aPSF.
- run_experiments.py is the main entry point for running experiments.
- config.py contains model, API, dataset, and experiment configurations.
- data_loader/ contains dataset loaders.
- evaluation/ contains task-specific evaluation logic.
- The Architect LLM discovers prompt structure and optimizes selected factors.
- The Worker LLM executes the current prompt program and generates task answers.
- The optimization loop performs factor discovery, error-guided factor selection, candidate generation, evaluation, and checkpointing.
- Python >= 3.9, recommended 3.10 or 3.11
Install dependencies:
pip install -r requirements.txt
The core dependencies include openai, torch, transformers, accelerate, datasets, numpy, tqdm, and tabulate. Please check requirements.txt for the full dependency list.
Set your API keys as environment variables:
export OPENAI_API_KEY="sk-..."
export SILICONFLOW_API_KEY="..."
export GROQ_API_KEY="..."
export DASHSCOPE_API_KEY="..."
You can also edit config.py directly.
aPSF uses two LLM roles configured in config.py under MODELS:
| Role | Purpose | Example Models |
|---|---|---|
| architect | Structure discovery and factor optimization | Qwen3-8B, gpt-oss-120b |
| worker | Task execution and answer generation | Llama-3.1-8B, Qwen2.5-7B |
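For orientation, a two-role configuration might look roughly like the sketch below. The four field names come from the FAQ in this README; the concrete values, the `"openai"` provider label, and the `missing_fields` helper are assumptions for illustration, so check config.py for the real layout.

```python
import os

REQUIRED_FIELDS = ("provider", "api_base_id", "model_name", "api_key")

# Hypothetical values: adjust to whatever config.py actually expects.
MODELS = {
    "architect": {
        "provider": "openai",        # placeholder provider label
        "api_base_id": "local_llm",
        "model_name": "Qwen3-8B",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
    },
    "worker": {
        "provider": "openai",        # placeholder provider label
        "api_base_id": "ollama",
        "model_name": "qwen2.5:7b",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
    },
}


def missing_fields(models: dict) -> list:
    """Return "role.field" strings for every required field that is absent."""
    problems = []
    for role in ("architect", "worker"):
        cfg = models.get(role, {})
        problems += [f"{role}.{f}" for f in REQUIRED_FIELDS if f not in cfg]
    return problems
```

Reading keys from the environment keeps secrets out of the file; a check like `missing_fields(MODELS)` can catch an incomplete role before a long run starts.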
aPSF is compatible with any OpenAI-compatible endpoint. For local deployment, you may use Ollama or vLLM.
# Ollama
ollama run qwen2.5:7b
# Then set api_base_id to "ollama" in config.py
# vLLM
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--port 8000
# Then set api_base_id to "qwen_vllm" or "local_llm" in config.py
Dataset paths are defined in config.py under DATA_PATHS. Please organize your data as follows:
data/
├── gsm_data/ # GSM8K
├── BIG-Bench-Hard-data/ # BBH, 27 tasks
├── AQuA-data/ # AQuA
├── MultiArith-data/ # MultiArith
├── MMLU-data/ # MMLU, 57 subjects
├── GSM-hard/ # GSM-Hard
├── AIME2025/ # AIME 2025
├── competition_math/ # Competition Math
├── gpqa/ # GPQA
└── human_eval/ # HumanEval
Update the paths in DATA_PATHS to point to your local data directory.
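A path table matching the directory layout above could be written along these lines. The directory names are taken from the tree; the dictionary keys, the `DATA_ROOT` variable, and the `missing_paths` helper are assumptions, and the real DATA_PATHS in config.py may be organized differently.

```python
from pathlib import Path

DATA_ROOT = Path("data")  # change this to your local data directory

# Hypothetical keys; directory names follow the layout shown above.
DATA_PATHS = {
    "gsm8k": DATA_ROOT / "gsm_data",
    "bbh": DATA_ROOT / "BIG-Bench-Hard-data",
    "aqua": DATA_ROOT / "AQuA-data",
    "multiarith": DATA_ROOT / "MultiArith-data",
    "mmlu": DATA_ROOT / "MMLU-data",
    "gsm_hard": DATA_ROOT / "GSM-hard",
    "aime2025": DATA_ROOT / "AIME2025",
    "competition_math": DATA_ROOT / "competition_math",
    "gpqa": DATA_ROOT / "gpqa",
    "human_eval": DATA_ROOT / "human_eval",
}


def missing_paths(paths: dict) -> list:
    """Return the sorted dataset names whose directory does not exist."""
    return sorted(name for name, p in paths.items() if not Path(p).is_dir())
```

Calling `missing_paths(DATA_PATHS)` before an experiment gives a quick list of datasets that still need to be downloaded or relocated.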
Verify your setup with the built-in sanity check:
python main.py
This checks whether the Architect can perform structure discovery and whether the Worker LLM can generate answers successfully.
python run_experiments.py --dataset <DATASET> --method <METHOD> [OPTIONS]
Required arguments:
| Argument | Description |
|---|---|
| --dataset | Dataset name, such as gsm8k, bbh_all, mmlu, or gpqa |
| --method | Method to run, such as apsf or an ablation variant |
Optional arguments:
| Argument | Description |
|---|---|
| --feedback | Enable reflection-based optimization using error feedback |
| --resume | Resume from the last checkpoint |
| --step N | Override the number of optimization steps |
| --initial-prompt TEXT | Start optimization from a given prompt; presets include cot, analytical, and expert |
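To make the flag semantics concrete, here is how a command line with these arguments could be parsed with Python's standard `argparse` module. This is an illustrative sketch, not the parser in run_experiments.py: `build_parser` is a hypothetical name, and the defaults shown are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run_experiments.py")
    parser.add_argument("--dataset", required=True,
                        help="dataset name, e.g. gsm8k or bbh_all")
    parser.add_argument("--method", required=True,
                        help="method to run, e.g. apsf")
    parser.add_argument("--feedback", action="store_true",
                        help="enable reflection-based optimization")
    parser.add_argument("--resume", action="store_true",
                        help="resume from the last checkpoint")
    parser.add_argument("--step", type=int, default=None, metavar="N",
                        help="override the number of optimization steps")
    parser.add_argument("--initial-prompt", default=None, metavar="TEXT",
                        help="initial prompt text or a preset name")
    return parser
```

With this sketch, `--feedback` and `--resume` are plain switches, while `--step` and `--initial-prompt` take a value and stay `None` when omitted.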
# aPSF on GSM8K
python run_experiments.py --dataset gsm8k --method apsf
# Start with a Chain-of-Thought initial prompt
python run_experiments.py --dataset gsm8k --method apsf --initial-prompt cot
# Enable reflection-based optimization
python run_experiments.py --dataset gsm8k --method apsf --feedback
# Full BBH benchmark, 27 tasks, with checkpoint resume
python run_experiments.py --dataset bbh_all --method apsf --resume
# Single BBH task
python run_experiments.py --dataset bbh_web_of_lies --method apsf
# Single MMLU subject
python run_experiments.py --dataset mmlu_abstract_algebra --method apsf
# GPQA
python run_experiments.py --dataset gpqa --method apsf
Q: How do I use a different LLM as the architect or worker?
A: Edit the MODELS section in config.py. Set provider, api_base_id, model_name, and api_key for each role. Any OpenAI-compatible endpoint can be used.
Q: Does aPSF support local models?
A: Yes. You can use local models through OpenAI-compatible endpoints, such as Ollama or vLLM.
Q: Can I resume an interrupted experiment?
A: Yes. Use the --resume flag when running run_experiments.py.
Q: How do I enable feedback-based optimization?
A: Add the --feedback flag. This enables reflection-based optimization using error feedback.
Q: How do I add a new dataset?
A: Create a loader in data_loader/ that inherits from BaseLoader, create an evaluator in evaluation/ if needed, and add the dataset configuration to DATASET_CONFIG in config.py.
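The loader step from the last answer might look like the following. The `BaseLoader` defined here is only a stand-in so the example runs on its own; the project's real base class, its constructor, and its method names may differ, and `MyJsonlLoader` with its "question" and "answer" fields is hypothetical.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseLoader(ABC):
    """Stand-in for the project's BaseLoader; the real interface may differ."""

    def __init__(self, path: str) -> None:
        self.path = path

    @abstractmethod
    def load(self) -> List[Dict[str, str]]:
        """Return a list of examples."""


class MyJsonlLoader(BaseLoader):
    """Hypothetical loader for a JSONL file with "question" and "answer" fields."""

    def load(self) -> List[Dict[str, str]]:
        examples = []
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                examples.append(
                    {"question": row["question"], "answer": str(row["answer"])}
                )
        return examples
```

Normalizing answers to strings in the loader keeps the evaluator's comparison logic simple, though a task with structured answers may need a different choice.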
We are grateful for the following awesome projects and resources:
If you find this project helpful, please consider citing our work:
@misc{liu2026adaptivepromptstructurefactorization,
title={Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs},
author={Haoyue Liu and Zhichao Wang and Yongxin Guo and Haoran Shou and Xiaoying Tang},
year={2026},
eprint={2604.06699},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.06699}
}