
Commit ef03cfa

Add agent steering guidelines (#2968)
1 parent 66ccb5a commit ef03cfa

File tree

7 files changed: +507 -1 lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
```diff
@@ -41,5 +41,6 @@ dist/
 *.egg-info/
 *.pt
 
-.kiro
+.kiro/*
+!.kiro/steering/
 venv
```

.kiro/steering/development.md

Lines changed: 122 additions & 0 deletions
# Development Guidelines

## Repository Access

DJL, DJL-Serving, and LMI are open-source projects under the deepjavalibrary GitHub organization.

### Getting Started

1. Complete Open Source Training
2. Link GitHub account with AWS/Amazon
3. Join the deepjavalibrary GitHub organization
4. Request access to djl-admin or djl-committer groups

### Key Repositories

- https://github.com/deepjavalibrary/djl
- https://github.com/deepjavalibrary/djl-serving
- https://github.com/deepjavalibrary/djl-demo

## Development Workflow

### Setup
```bash
# Fork repos, then clone and track upstream
git clone git@github.com:<username>/djl-serving.git
cd djl-serving
git remote add upstream https://github.com/deepjavalibrary/djl-serving

# Sync with upstream
git fetch upstream && git rebase upstream/master && git push
```

### Making Changes
```bash
git checkout -b my-feature-branch
# Make changes
git add . && git commit -m "Description"
git push -u origin my-feature-branch
# Create PR from fork to upstream/master via GitHub UI
```

## Building LMI Containers

### Container Types

**DLC and DockerHub:**
- LMI-vLLM
- LMI-TensorRT-LLM
- LMI-Neuron

**DockerHub Only:**
- CPU-Full (PyTorch/OnnxRuntime/MxNet/TensorFlow)
- CPU (no engines bundled)
- PyTorch-GPU
- Aarch64 (Graviton support)

### Build Process

```bash
# Prepare build
cd djl-serving
rm -rf serving/docker/distributions
./gradlew clean && ./gradlew --refresh-dependencies :serving:dockerDeb -Psnapshot

# Get versions
cd serving/docker
export DJL_VERSION=$(awk -F '=' '/djl / {gsub(/ ?"/, "", $2); print $2}' ../../gradle/libs.versions.toml)
export SERVING_VERSION=$(awk -F '=' '/serving / {gsub(/ ?"/, "", $2); print $2}' ../../gradle/libs.versions.toml)

# Build specific container
docker compose build --build-arg djl_version=${DJL_VERSION} --build-arg djl_serving_version=${SERVING_VERSION} lmi
docker compose build --build-arg djl_version=${DJL_VERSION} --build-arg djl_serving_version=${SERVING_VERSION} tensorrt-llm
docker compose build --build-arg djl_version=${DJL_VERSION} --build-arg djl_serving_version=${SERVING_VERSION} pytorch-inf2
```

See `serving/docker/docker-compose.yml` for all available targets.
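After a build finishes, a quick way to smoke-test the resulting image is to run it locally against a model directory. A minimal sketch, assuming a GPU host and the standard LMI conventions (HTTP on port 8080, model mounted at `/opt/ml/model`); the image tag below is a placeholder, use whichever tag the `docker compose build` step actually produced (check `docker images`):

```bash
# Smoke-test a freshly built LMI image locally (tag and model path are placeholders)
docker run -it --rm --gpus all -p 8080:8080 \
  -v /path/to/model:/opt/ml/model \
  deepjavalibrary/djl-serving:lmi-nightly
```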
## Testing

### Local Integration Tests

```bash
cd tests/integration
OVERRIDE_TEST_CONTAINER=<image_name> python -m pytest tests.py::<TestClass>::<test_name>

# Example
OVERRIDE_TEST_CONTAINER=deepjavalibrary/djl-serving:lmi python -m pytest tests.py::TestVllm1_g6::test_gemma_2b
```

Full test suite: `tests/integration/tests.py`
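To see which test classes and test names are available before picking one, pytest can collect them without running anything; a small sketch, assuming the same environment as the commands above:

```bash
# List integration tests without executing them
cd tests/integration
OVERRIDE_TEST_CONTAINER=deepjavalibrary/djl-serving:lmi python -m pytest tests.py --collect-only -q
```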
## Key Development Areas (Priority Order)

### DJL-Serving
1. **Python Engine** - `engines/python/setup/djl_python/` (vLLM, TensorRT-LLM, rolling batch, chat completions)
2. **Python Engine Java** - `engines/python/src/main/java/ai/djl/python/engine/`
3. **WLM** - `wlm/` (backend ML/DL engine integration)
4. **Serving** - `serving/` (frontend web server)

### DJL (Less Frequent)
PyTorch, HuggingFace Tokenizer, OnnxRuntime, Rust/Candle engines

## CI/CD Workflows

### DJL Repository
- `continuous.yml` - PR checks
- `native_jni_s3_pytorch.yml` - Publish native code to S3
- `nightly_publish.yml` - SNAPSHOT to Maven
- `serving-publish.yml` - DJL-Serving SNAPSHOT to S3

### DJL-Serving Repository
- `nightly.yml` - Build containers → Run tests → Publish to staging
- `docker-nightly-publish.yml` - Build/publish to dev repo (ad-hoc)
- `integration.yml` - Run all tests with custom image (ad-hoc)
- `docker_publish.yml` - Sync dev to staging
- `integration_execute.yml` - Single test on specific instance
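The ad-hoc workflows are normally started from the GitHub Actions UI, but the GitHub CLI can dispatch them as well. A hedged sketch; the `-f` input name below is hypothetical and should be checked against the workflow's `workflow_dispatch` inputs before use:

```bash
# Trigger the ad-hoc integration workflow with a custom image (input name is hypothetical)
gh workflow run integration.yml -R deepjavalibrary/djl-serving \
  -f <input_name>=<custom_image_uri>
```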
## Versioning
- **DJL** → Maven (stable + SNAPSHOT)
- **DJL-Serving** → S3 (stable + SNAPSHOT)
- **Source** → `gradle/libs.versions.toml`
- **Nightly** → SNAPSHOT, **Release** → Stable

.kiro/steering/partitioning.md

Lines changed: 133 additions & 0 deletions
# Model Partitioning and Optimization

The partition system (`serving/docker/partition/`) provides tools for model preparation, including tensor parallelism sharding, quantization, and multi-node setup.

## Core Scripts

### partition.py - Main Entry Point
Handles S3 download, requirements install, partitioning, quantization (AWQ/FP8), S3 upload.

**Features:** HF downloads, `OPTION_*` env vars, MPI mode, auto-cleanup

```bash
python partition.py \
  --model-id <hf_model_id_or_s3_uri> \
  --tensor-parallel-degree 4 \
  --quantization awq \
  --save-mp-checkpoint-path /tmp/output
```

### run_partition.py - Custom Handlers
Invokes user-provided partition handlers via `partition_handler` property.

### run_multi_node_setup.py - Cluster Coordination
Multi-node setup: queries leader for model info, downloads to workers, exchanges SSH keys, reports readiness.

**Env Vars:** `DJL_LEADER_ADDR`, `LWS_LEADER_ADDR`, `DJL_CACHE_DIR`

### trt_llm_partition.py - TensorRT-LLM Compilation
Builds TensorRT engines with BuildConfig (batch/seq limits), QuantConfig (AWQ/FP8/SmoothQuant), CalibConfig (calibration data).

### SageMaker Neo Integration

Partition scripts power **SageMaker Neo's CreateOptimizationJob API** - managed service for compilation, quantization, and sharding.

**API:** https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateOptimizationJob.html

**Optimization Types:**
1. Compilation (TensorRT-LLM engines)
2. Quantization (AWQ, FP8)
3. Sharding (Fast Model Loader TP)

**Neo Environment Variables:**
- `SM_NEO_INPUT_MODEL_DIR`, `SM_NEO_OUTPUT_MODEL_DIR`
- `SM_NEO_COMPILATION_PARAMS` (JSON config)
- `SERVING_FEATURES` (vllm, trtllm)

**Neo Scripts:**
- `sm_neo_dispatcher.py` - Routes jobs: vllm→Quantize/Shard, trtllm→Compile
- `sm_neo_trt_llm_partition.py` - TensorRT-LLM compilation
- `sm_neo_quantize.py` - Quantization workflows
- `sm_neo_utils.py` - Env var helpers

**Workflow:**
CreateOptimizationJob(source S3, config, output S3, container) → Neo launches container → Dispatcher routes → Handler optimizes → Artifacts to output S3 → Deploy to SageMaker
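For local debugging, the dispatcher can in principle be exercised by exporting the same environment variables a managed Neo job would set. A rough sketch only, assuming the dispatcher is driven purely by these variables; the JSON schema of `SM_NEO_COMPILATION_PARAMS` is not shown here and all values are placeholders:

```bash
# Sketch of driving the Neo dispatcher by hand (all values are placeholders)
export SM_NEO_INPUT_MODEL_DIR=/opt/ml/input/model
export SM_NEO_OUTPUT_MODEL_DIR=/opt/ml/output/model
export SM_NEO_COMPILATION_PARAMS='{...}'   # JSON optimization config; exact schema assumed
export SERVING_FEATURES=trtllm             # vllm -> quantize/shard, trtllm -> compile
python /opt/djl/partition/sm_neo_dispatcher.py
```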
## Quantization

### AWQ (4-bit, AutoAWQ library)
```properties
option.quantize=awq
option.awq_zero_point=true
option.awq_block_size=128
option.awq_weight_bit_width=4
option.awq_mm_version=GEMM
option.awq_ignore_layers=lm_head
```

### FP8 (llm-compressor, CNN/DailyMail calibration)
```properties
option.quantize=fp8
option.fp8_scheme=FP8_DYNAMIC
option.fp8_ignore=lm_head
option.calib_size=512
option.max_model_len=2048
```

## Multi-Node

### MPI Mode (engine=MPI or TP > 1)
```bash
mpirun -N <tp_degree> --allow-run-as-root \
  --mca btl_vader_single_copy_mechanism none \
  -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 \
  python run_partition.py --properties '{...}'
```

### Cluster Setup (LeaderWorkerSet/K8s)
1. Leader generates SSH keys
2. Workers query `/cluster/models` for model info
3. Workers download model, exchange SSH keys via `/cluster/sshpublickey`
4. Workers report to `/cluster/status?message=OK`
5. Leader loads model
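A hedged sketch of the worker-side HTTP calls implied by the steps above, assuming the leader's frontend is reachable at `DJL_LEADER_ADDR` on a port written here as `<leader_port>`; the HTTP methods, key path, and payloads are assumptions:

```bash
# Worker-side cluster setup calls (port, methods, and payloads are placeholders)
curl "http://${DJL_LEADER_ADDR}:<leader_port>/cluster/models"            # fetch model info from the leader
curl -X POST "http://${DJL_LEADER_ADDR}:<leader_port>/cluster/sshpublickey" \
  --data-binary @/root/.ssh/id_rsa.pub                                   # exchange SSH public key (path assumed)
curl "http://${DJL_LEADER_ADDR}:<leader_port>/cluster/status?message=OK" # report readiness
```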
## Configuration

### properties_manager.py
Loads `serving.properties`, merges `OPTION_*` env vars, validates, generates output.

**Key Properties:**
- `option.model_id` - HF model ID or S3 URI
- `option.tensor_parallel_degree`, `option.pipeline_parallel_degree`
- `option.save_mp_checkpoint_path` - Output dir
- `option.quantize` - awq, fp8, static_int8
- `engine` - Python, MPI
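Pulling those properties together, a minimal `serving.properties` that `properties_manager.py` would consume could look like the following; an illustrative sketch with example values, not a recommended configuration:

```bash
# Write an illustrative serving.properties (values are examples only)
cat > serving.properties <<'EOF'
engine=MPI
option.model_id=meta-llama/Llama-2-7b-hf
option.tensor_parallel_degree=4
option.quantize=awq
option.save_mp_checkpoint_path=/tmp/output
EOF
```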
### utils.py Helpers
`get_partition_cmd()`, `extract_python_jar()`, `load_properties()`, `update_kwargs_with_env_vars()`, `remove_option_from_properties()`, `load_hf_config_and_tokenizer()`

## Container Integration
Scripts at `/opt/djl/partition/` invoked via:
1. Neo compilation (`sm_neo_dispatcher.py`)
2. Container startup (on-the-fly partitioning)
3. Management API (dynamic registration)
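For the third path, a hedged example of registering a prepared model at runtime through DJL Serving's management API, assuming the default port 8080 and a model URL passed as a query parameter:

```bash
# Register a partitioned model dynamically (port and S3 URL are assumptions)
curl -X POST "http://localhost:8080/models?url=s3://my-bucket/partitioned-model/"
```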
## Common Workflows

```bash
# Tensor Parallelism
python partition.py --model-id meta-llama/Llama-2-70b-hf \
  --tensor-parallel-degree 8 --save-mp-checkpoint-path /tmp/output

# AWQ Quantization
python partition.py --model-id meta-llama/Llama-2-7b-hf \
  --quantization awq --save-mp-checkpoint-path /tmp/output

# TensorRT-LLM Engine
python trt_llm_partition.py --properties_dir /opt/ml/model \
  --trt_llm_model_repo /tmp/engine --model_path /tmp/model \
  --tensor_parallel_degree 4 --pipeline_parallel_degree 1
```

## Error Handling
Non-zero exit on failure, real-time stdout/stderr, cleanup on success, S3 upload only after success.
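Because the scripts signal failure through their exit status, callers can gate follow-up steps on it; a minimal sketch with placeholder arguments:

```bash
# Gate follow-up work on the partition script's exit status (arguments are placeholders)
if ! python partition.py --model-id <model_id> --save-mp-checkpoint-path /tmp/output; then
  echo "Partitioning failed; skipping upload" >&2
  exit 1
fi
```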

.kiro/steering/product.md

Lines changed: 33 additions & 0 deletions
# DJL Serving - Product Overview

High-performance universal model serving solution powered by Deep Java Library (DJL). Serves ML models through REST APIs with automatic scaling, dynamic batching, and multi-engine support.

## Architecture

**3-Layer Design:**
1. **Frontend** - Netty HTTP server (Inference + Management APIs)
2. **Workflows** - Multi-model execution pipelines
3. **WorkLoadManager (WLM)** - Worker thread pools with batching/routing

**Python Engine** - Runs Python-based models and custom handlers
**LMI** - Large Model Inference with vLLM, TensorRT-LLM, HuggingFace Accelerate
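As a hedged illustration of the two API surfaces on the frontend, assuming a locally running server on the default port 8080 and a registered model named `my_model` (both assumptions; the request payload depends on the model's handler):

```bash
# Management API: list registered models (defaults assumed)
curl "http://localhost:8080/models"

# Inference API: send a request to a registered model (payload format depends on the handler)
curl -X POST "http://localhost:8080/predictions/my_model" \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Hello"}'
```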
## Supported Models

PyTorch TorchScript, SKLearn models, ONNX, Python scripts, XGBoost, SentencePiece, HuggingFace models

## Primary Use Cases

1. **LLM Serving** - Optimized backends (vLLM, TensorRT-LLM) with LoRA adapters
2. **Multi-Model Endpoints** - Version management, workflows
3. **Custom Handlers** - Python preprocessing/postprocessing
4. **Embeddings & Multimodal** - Text embeddings, vision-language models
5. **AWS Integration** - SageMaker deployment, Neo optimization (compilation, quantization, sharding)

## Key Features

- Auto-scaling worker threads based on load
- Dynamic batching for throughput optimization
- Multi-engine support (serve different frameworks simultaneously)
- Plugin architecture for extensibility
- OpenAPI-compatible REST endpoints
