# Model Partitioning and Optimization

The partition system (`serving/docker/partition/`) provides tools for model preparation, including tensor-parallel sharding, quantization, and multi-node setup.

## Core Scripts

### partition.py - Main Entry Point
Handles S3 download, requirements installation, partitioning, quantization (AWQ/FP8), and S3 upload.

**Features:** Hugging Face downloads, `OPTION_*` environment variables, MPI mode, automatic cleanup

```bash
python partition.py \
  --model-id <hf_model_id_or_s3_uri> \
  --tensor-parallel-degree 4 \
  --quantization awq \
  --save-mp-checkpoint-path /tmp/output
```

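The `OPTION_*` environment-variable support can be sketched as below. This is a minimal illustration, assuming a simple uppercase-to-lowercase key mapping; the actual behavior lives in `update_kwargs_with_env_vars()` in `utils.py` and its precedence rules may differ.

```python
import os

def merge_option_env_vars(properties: dict) -> dict:
    """Merge OPTION_* environment variables into option.* properties.

    e.g. OPTION_TENSOR_PARALLEL_DEGREE=4 -> option.tensor_parallel_degree=4.
    Keys already present in `properties` win here (an assumption for
    illustration, not a statement of the real precedence).
    """
    merged = dict(properties)
    for key, value in os.environ.items():
        if key.startswith("OPTION_"):
            prop_key = "option." + key[len("OPTION_"):].lower()
            merged.setdefault(prop_key, value)
    return merged
```

This lets the same container be configured either through `serving.properties` or purely through environment variables.
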
### run_partition.py - Custom Handlers
Invokes user-provided partition handlers via the `partition_handler` property.

### run_multi_node_setup.py - Cluster Coordination
Multi-node setup: queries the leader for model info, downloads the model to workers, exchanges SSH keys, and reports readiness.

**Env Vars:** `DJL_LEADER_ADDR`, `LWS_LEADER_ADDR`, `DJL_CACHE_DIR`

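Resolving the leader address from those variables can be sketched as follows; the fallback order (`DJL_LEADER_ADDR` first, then the LeaderWorkerSet-injected `LWS_LEADER_ADDR`) is an assumption for illustration.

```python
import os

def resolve_leader_addr() -> str:
    """Return the cluster leader address from the environment.

    Prefers DJL_LEADER_ADDR, falls back to LWS_LEADER_ADDR, and fails
    loudly if neither is set so a misconfigured worker stops early.
    """
    addr = os.environ.get("DJL_LEADER_ADDR") or os.environ.get("LWS_LEADER_ADDR")
    if not addr:
        raise RuntimeError("No leader address: set DJL_LEADER_ADDR or LWS_LEADER_ADDR")
    return addr
```
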
### trt_llm_partition.py - TensorRT-LLM Compilation
Builds TensorRT-LLM engines with a BuildConfig (batch/sequence limits), QuantConfig (AWQ/FP8/SmoothQuant), and CalibConfig (calibration data).

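The relationship between those three configs can be sketched with plain dictionaries. The real script passes values through TensorRT-LLM's own `BuildConfig`/`QuantConfig`/`CalibConfig` objects; every field name and limit below is illustrative, not the library's actual signature.

```python
from typing import Optional

def make_trtllm_build_params(tp_degree: int, pp_degree: int,
                             quant_algo: Optional[str] = None,
                             calib_dataset: Optional[str] = None) -> dict:
    """Collect engine-build parameters analogous to TensorRT-LLM's
    BuildConfig / QuantConfig / CalibConfig (field names hypothetical)."""
    params = {
        "build": {                      # BuildConfig: batch/sequence limits
            "max_batch_size": 256,
            "max_input_len": 4096,
            "max_seq_len": 8192,
            "tensor_parallel": tp_degree,
            "pipeline_parallel": pp_degree,
        },
    }
    if quant_algo:                      # QuantConfig: AWQ / FP8 / SmoothQuant
        params["quant"] = {"quant_algo": quant_algo}
    if calib_dataset:                   # CalibConfig: calibration data source
        params["calib"] = {"calib_dataset": calib_dataset}
    return params
```
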
### SageMaker Neo Integration

The partition scripts power **SageMaker Neo's CreateOptimizationJob API**, a managed service for compilation, quantization, and sharding.

**API:** https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateOptimizationJob.html

**Optimization Types:**
1. Compilation (TensorRT-LLM engines)
2. Quantization (AWQ, FP8)
3. Sharding (Fast Model Loader TP)

**Neo Environment Variables:**
- `SM_NEO_INPUT_MODEL_DIR`, `SM_NEO_OUTPUT_MODEL_DIR`
- `SM_NEO_COMPILATION_PARAMS` (JSON config)
- `SERVING_FEATURES` (vllm, trtllm)

**Neo Scripts:**
- `sm_neo_dispatcher.py` - Routes jobs: vllm → Quantize/Shard, trtllm → Compile
- `sm_neo_trt_llm_partition.py` - TensorRT-LLM compilation
- `sm_neo_quantize.py` - Quantization workflows
- `sm_neo_utils.py` - Environment variable helpers

**Workflow:**
CreateOptimizationJob(source S3, config, output S3, container) → Neo launches the container → dispatcher routes the job → handler optimizes → artifacts are written to output S3 → deploy to SageMaker

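The dispatcher's routing step can be sketched as below. The decision rule (trtllm containers compile, vllm containers quantize or shard) follows the script list above; the `"sharding"` key in the JSON params is a hypothetical discriminator, not the documented schema.

```python
import json
import os

def dispatch_neo_job() -> str:
    """Pick an optimization path in the spirit of sm_neo_dispatcher.py:
    trtllm-capable containers compile engines, vllm-capable containers
    quantize or shard. Routing keys are illustrative."""
    features = os.environ.get("SERVING_FEATURES", "")
    params = json.loads(os.environ.get("SM_NEO_COMPILATION_PARAMS", "{}"))
    if "trtllm" in features:
        return "compile"
    if "vllm" in features:
        return "shard" if params.get("sharding") else "quantize"
    raise RuntimeError(f"Unsupported SERVING_FEATURES: {features!r}")
```
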
## Quantization

### AWQ (4-bit, AutoAWQ library)
```properties
option.quantize=awq
option.awq_zero_point=true
option.awq_block_size=128
option.awq_weight_bit_width=4
option.awq_mm_version=GEMM
option.awq_ignore_layers=lm_head
```

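These properties map naturally onto an AutoAWQ-style `quant_config` dict (`zero_point`, `q_group_size`, `w_bit`, `version`). The translation below is a sketch of what the quantization step plausibly does with them; the defaults mirror the values in the example above.

```python
def awq_quant_config(props: dict) -> dict:
    """Translate option.awq_* properties into an AutoAWQ-style quant_config.

    Assumes string-valued properties as loaded from serving.properties;
    defaults match the documented example.
    """
    return {
        "zero_point": props.get("option.awq_zero_point", "true") == "true",
        "q_group_size": int(props.get("option.awq_block_size", 128)),
        "w_bit": int(props.get("option.awq_weight_bit_width", 4)),
        "version": props.get("option.awq_mm_version", "GEMM"),
    }
```
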
### FP8 (llm-compressor, CNN/DailyMail calibration)
```properties
option.quantize=fp8
option.fp8_scheme=FP8_DYNAMIC
option.fp8_ignore=lm_head
option.calib_size=512
option.max_model_len=2048
```

## Multi-Node

### MPI Mode (engine=MPI or TP > 1)
```bash
mpirun -N <tp_degree> --allow-run-as-root \
  --mca btl_vader_single_copy_mechanism none \
  -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1 \
  python run_partition.py --properties '{...}'
```

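Assembling that invocation programmatically, in the spirit of `utils.get_partition_cmd()`, can be sketched as an argv list (flags copied from the example above; the real helper may add or reorder arguments):

```python
def build_mpi_partition_cmd(tp_degree: int, properties_json: str) -> list:
    """Build the mpirun argv for MPI-mode partitioning.

    One rank per tensor-parallel shard; EFA/fork-safety env vars are
    forwarded to every rank with -x.
    """
    return [
        "mpirun", "-N", str(tp_degree), "--allow-run-as-root",
        "--mca", "btl_vader_single_copy_mechanism", "none",
        "-x", "FI_PROVIDER=efa", "-x", "RDMAV_FORK_SAFE=1",
        "python", "run_partition.py", "--properties", properties_json,
    ]
```

Keeping the command as a list (rather than a shell string) avoids quoting bugs when the properties JSON contains spaces.
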
### Cluster Setup (LeaderWorkerSet/K8s)
1. Leader generates SSH keys
2. Workers query `/cluster/models` for model info
3. Workers download the model and exchange SSH keys via `/cluster/sshpublickey`
4. Workers report to `/cluster/status?message=OK`
5. Leader loads the model

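The worker side of the handshake above can be sketched as follows. The HTTP function is injected so the flow can be exercised without a live leader; endpoint paths come from the steps above, everything else (function names, return shapes) is illustrative.

```python
from typing import Callable

def worker_handshake(leader_addr: str, http_get: Callable[[str], str]) -> str:
    """Run a worker through the cluster handshake: fetch model info,
    exchange SSH keys, then report readiness. Returns the model info."""
    model_info = http_get(f"http://{leader_addr}/cluster/models")      # step 2
    http_get(f"http://{leader_addr}/cluster/sshpublickey")             # step 3
    http_get(f"http://{leader_addr}/cluster/status?message=OK")        # step 4
    return model_info
```
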
## Configuration

### properties_manager.py
Loads `serving.properties`, merges `OPTION_*` environment variables, validates the result, and generates the output properties.

**Key Properties:**
- `option.model_id` - HF model ID or S3 URI
- `option.tensor_parallel_degree`, `option.pipeline_parallel_degree`
- `option.save_mp_checkpoint_path` - Output directory
- `option.quantize` - awq, fp8, static_int8
- `engine` - Python, MPI

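The load-and-validate steps can be sketched as below, assuming the standard `key=value` properties format with `#` comments; the validation rule shown (checking `option.quantize` against the documented values) is one plausible check, not the manager's full logic.

```python
VALID_QUANT = {"awq", "fp8", "static_int8"}

def load_and_validate(text: str) -> dict:
    """Parse serving.properties content into a dict and sanity-check it."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    quant = props.get("option.quantize")
    if quant is not None and quant not in VALID_QUANT:
        raise ValueError(f"Unsupported option.quantize: {quant}")
    return props
```
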
### utils.py Helpers
`get_partition_cmd()`, `extract_python_jar()`, `load_properties()`, `update_kwargs_with_env_vars()`, `remove_option_from_properties()`, `load_hf_config_and_tokenizer()`

## Container Integration
Scripts live at `/opt/djl/partition/` and are invoked via:
1. Neo compilation (`sm_neo_dispatcher.py`)
2. Container startup (on-the-fly partitioning)
3. Management API (dynamic registration)

## Common Workflows

```bash
# Tensor Parallelism
python partition.py --model-id meta-llama/Llama-2-70b-hf \
  --tensor-parallel-degree 8 --save-mp-checkpoint-path /tmp/output

# AWQ Quantization
python partition.py --model-id meta-llama/Llama-2-7b-hf \
  --quantization awq --save-mp-checkpoint-path /tmp/output

# TensorRT-LLM Engine
python trt_llm_partition.py --properties_dir /opt/ml/model \
  --trt_llm_model_repo /tmp/engine --model_path /tmp/model \
  --tensor_parallel_degree 4 --pipeline_parallel_degree 1
```

## Error Handling
Partition jobs exit non-zero on failure, stream stdout/stderr in real time, clean up temporary files on success, and upload to S3 only after the job succeeds.