Use an RL algorithm (e.g., PPO) to train an LLM wrapper.
The wrapper will detect, block, or reframe prompts that cause undesirable completions.
Environment: Streams of incoming prompts → base LLM → candidate output.
Agent: The RL policy observes the state and chooses an action (allow, block, or reframe).
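The loop above can be sketched as a minimal Gym-style environment. This is a hypothetical illustration, not the project's actual code: `embed_fn`, `base_llm`, and the ±1 reward scheme are assumptions, and reframing is rewarded the same as blocking for simplicity.

```python
from enum import Enum
import numpy as np

class Action(Enum):
    ALLOW = 0    # pass the candidate output through unchanged
    BLOCK = 1    # refuse and return a safe fallback message
    REFRAME = 2  # rewrite the prompt and re-query the base LLM

class PromptDefenderEnv:
    """Toy environment: one incoming prompt per step."""

    def __init__(self, prompt_stream, embed_fn, base_llm):
        self.prompts = prompt_stream  # iterator of (prompt, is_harmful) pairs
        self.embed = embed_fn         # maps text -> fixed-size vector
        self.llm = base_llm           # maps prompt -> candidate response

    def reset(self):
        self.prompt, self.is_harmful = next(self.prompts)
        response = self.llm(self.prompt)
        # state = prompt embedding concatenated with response embedding
        return np.concatenate([self.embed(self.prompt), self.embed(response)])

    def step(self, action):
        # +1 for blocking/reframing a harmful prompt or allowing a benign
        # one; -1 for a false positive or a missed attack
        correct = (action == Action.ALLOW) != self.is_harmful
        reward = 1.0 if correct else -1.0
        return self.reset(), reward, False, {}
```

A policy trained with PPO would then interact with this environment episode by episode, receiving the concatenated embeddings as its observation.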
- Building and tuning static wrappers for newly deployed LLMs is slow.
- This creates a vulnerability period where the system is exposed to attacks.
- We need a temporary, dynamic workaround to secure models during this critical phase.
- Objective: Create an adaptive defense layer that learns to detect and recover from LLM jailbreaks in real time using deep reinforcement learning (RL).
- Goal: Outperform static wrappers and demonstrate an autonomous, self-healing defense prototype.
- Prompt Embeddings: Sentence-transformer embeddings of the prompt.
- Response Information: Candidate response embeddings and policy violation logits.
- Heuristics: Toxicity scores and jailbreak patterns.
- Wrapper Metadata: Guardrail flags, latency, etc.
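The four feature groups above are concatenated into one flat observation vector for the policy. The helper below is a sketch under assumptions: the function name, argument names, and the latency normalization are illustrative, not the project's actual interface.

```python
import numpy as np

def build_state(prompt_emb, response_emb, violation_logits,
                toxicity, jailbreak_pattern_hits, guardrail_flags, latency_ms):
    """Concatenate the four feature groups into one flat observation.

    prompt_emb / response_emb: sentence-transformer vectors
    violation_logits: per-category policy-violation scores
    toxicity: scalar in [0, 1]; jailbreak_pattern_hits: heuristic hit count
    guardrail_flags: binary flags from the static guardrails
    latency_ms: base-LLM response latency, normalized to seconds below
    """
    heuristics = np.array([toxicity, float(jailbreak_pattern_hits)])
    metadata = np.concatenate([np.asarray(guardrail_flags, dtype=float),
                               [latency_ms / 1000.0]])
    return np.concatenate([prompt_emb, response_emb,
                           np.asarray(violation_logits, dtype=float),
                           heuristics, metadata]).astype(np.float32)
```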
- Synthetic and benchmark jailbreak streams.
- JailbreakBench & AdvPromptBench: Primary data sources.
- Benign Queries: Harmless queries mixed in to measure the false-positive rate.
- Curriculum learning (starting with simple attacks) and early stopping on reward plateaus.
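The curriculum and early-stopping logic above could look roughly like the tracker below. This is a hedged sketch: the tier names, promotion threshold, window size, and patience values are all assumed, not taken from the project.

```python
from collections import deque

class Curriculum:
    """Advance to harder attack tiers when mean reward clears a bar;
    signal early stopping once reward plateaus on the hardest tier."""

    def __init__(self, tiers, promote_at=0.8, window=100,
                 patience=5, min_delta=0.01):
        self.tiers = tiers          # e.g. ["simple", "obfuscated", "multi-turn"]
        self.level = 0
        self.rewards = deque(maxlen=window)
        self.promote_at = promote_at
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale = 0

    @property
    def tier(self):
        return self.tiers[self.level]

    def update(self, episode_reward):
        """Record one episode's reward; return True when training should stop."""
        self.rewards.append(episode_reward)
        mean = sum(self.rewards) / len(self.rewards)
        if mean >= self.promote_at and self.level < len(self.tiers) - 1:
            self.level += 1                  # promote to harder attacks
            self.rewards.clear()
            self.best, self.stale = float("-inf"), 0
            return False
        if mean > self.best + self.min_delta:
            self.best, self.stale = mean, 0  # still improving
        else:
            self.stale += 1                  # no meaningful improvement
        return self.stale >= self.patience   # reward plateau -> stop
```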
Follow these steps to run the full Prompt Defender RL stack:

1. Run the backend API using Uvicorn:

```shell
cd modifiedrun/
uvicorn app:app --reload
```

2. Run the frontend development server:

```shell
cd prompt-defender-rl/
npm run dev
```

3. Run the sample prompt loader script to simulate quick responses to multiple prompts:

```shell
cd modifiedrun/
python3 load_sample_prompts.py
```

Order does not matter, but the backend should be running before using the frontend.
Environment Data:
- Harmful prompts (also fuzzed with an LLM to generate additional harmful variants):
- Mixed prompts (LLM decides harmfulness):
- Benign prompts:
Evaluation Data:
- Unseen dataset:
The RL-based wrapper is highly effective at detecting and blocking harmful prompts, outperforming static wrappers in real-time scenarios. The confusion matrix above shows strong detection rates on adversarial inputs, with a high but acceptable false-positive rate (FPR) that stems from the aggressive defense policy. This trade-off ensures robust protection during the critical deployment phase while maintaining reasonable usability for benign queries.
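The trade-off described above can be quantified directly from the confusion-matrix counts. A minimal helper, with illustrative (not actual) counts in the usage below:

```python
def wrapper_metrics(tp, fp, fn, tn):
    """Detection/usability metrics from confusion-matrix counts.

    tp: harmful prompts blocked     fn: harmful prompts allowed (misses)
    fp: benign prompts blocked      tn: benign prompts allowed
    """
    detection_rate = tp / (tp + fn)       # recall / true-positive rate
    false_positive_rate = fp / (fp + tn)  # share of benign traffic blocked
    precision = tp / (tp + fp)            # blocked prompts that were harmful
    return {"TPR": detection_rate,
            "FPR": false_positive_rate,
            "precision": precision}

# Hypothetical counts: 90 attacks caught, 10 missed, 15 benign blocked
metrics = wrapper_metrics(tp=90, fp=15, fn=10, tn=85)
```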
- Benchmark Report: Generate a report comparing our metrics against existing static wrappers (e.g., Rebuff, PromptGuard).
- Ablation Study: Analyze the impact of different features on performance.
- Self-Healing: ONNX deployment is working; next steps are to collect feedback from live traffic and continue adapting the environment.

