alauda · EdisonSu768 · Jan 14, 2026
diff --git a/.cspell/compound.txt b/.cspell/compound.txt
@@ -4,3 +4,4 @@ knative
 kserve
 xinference
 servicemeshv1
+ipynb
diff --git a/.gitignore b/.gitignore
@@ -9,4 +9,7 @@
 **/public/_remotes
 .idea
 
-.DS_Store
+.DS_Store
+
+.claude
+CLAUDE.md
diff --git a/docs/en/llm-compressor/how_to/compressor_by_workbench.mdx b/docs/en/llm-compressor/how_to/compressor_by_workbench.mdx
@@ -0,0 +1,193 @@
+---
+weight: 30
+---
+
+# LLM Compressor with Alauda AI
+
+This document describes how to use the LLM Compressor integration with the Alauda AI platform to perform model compression workflows. The Alauda AI integration of LLM Compressor provides two example workflows:
+
+- A workbench image and the [data-free-compressor.ipynb](/data-free-compressor.ipynb) that demonstrate how to compress a model.
+- A workbench image and the [calibration-compressor.ipynb](/calibration-compressor.ipynb) that demonstrate how to compress a model using a calibration dataset.
+
+<a href="/data-free-compressor.ipynb" download="data-free-compressor.ipynb"  rel="noopener noreferrer">notebook</a>
+
+## Supported Model Compression Workflows
+
+On the Alauda AI platform, you can use the Workbench feature to run LLM Compressor on models stored in your model repository. The following workflow outlines the typical steps for compressing a model.
+
+### Create a Workbench
+
+Follow the instructions in [Create Workbench](../../workbench/how_to/create_workbench.mdx) to create a new Workbench instance. Note that model compression is currently supported only within **JupyterLab**.
+
+### Create a Model Repository and Upload Models
+
+Refer to [Upload Models Using Notebook](../../model_inference/model_management/how_to/upload_models_using_notebook.mdx) for detailed steps on creating a model repository and uploading your model files. The example notebooks in this guide use the [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) model.
+
+```python title=data-free-compressor.ipynb
+from llmcompressor.modifiers.quantization import QuantizationModifier
+
+model_id = "./TinyLlama-1.1B-Chat-v1.0" #[!code callout]
+recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]) #[!code callout]
+```
+
+<Callouts>
+  1. Model to compress. **You can modify this line if you want to use your own model**.
+  2. This recipe will quantize all Linear layers except those in the `lm_head`,
+     which is often sensitive to quantization. The `W4A16` scheme compresses
+     weights to `4-bit` integers while retaining `16-bit` activations.
+</Callouts>
+
+
+### (Optional) Prepare and Upload a Dataset
+
+:::note
+If you plan to use the **data-free compressor notebook**, you can skip this step.
+:::
+
+To use the **calibration compressor notebook**, you must prepare and upload a calibration dataset. Prepare your dataset using the same process described in *Upload Models Using Notebook*. The example calibration notebook uses the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset.
+
+```python title=calibration-compressor.ipynb
+from datasets import load_dataset
+
+dataset_id = "./ultrachat_200k" #[!code callout]
+
+num_calibration_samples = 512 if use_gpu else 4 #[!code callout]
+max_sequence_length = 2048 if use_gpu else 16
+
+ds = load_dataset(dataset_id, split="train_sft") #[!code callout]
+ds = ds.shuffle(seed=42).select(range(num_calibration_samples)) #[!code callout]
+
+def preprocess(example): #[!code callout]
+    text = tokenizer.apply_chat_template(
+        example["messages"],
+        tokenize=False,
+    )
+    return tokenizer(
+        text,
+        padding=False,
+        max_length=max_sequence_length,
+        truncation=True,
+        add_special_tokens=False,
+    )
+
+ds = ds.map(preprocess, remove_columns=ds.column_names)
+```
+
+<Callouts>
+  1. Path of the calibration dataset, which is loaded with the Hugging Face `datasets` API. **You can modify this line if you want to use your own dataset**.
+  2. Select the number of calibration samples. 512 samples is a good starting point; increasing the number of samples can improve accuracy.
+  3. Load the dataset.
+  4. Shuffle the dataset and keep only the number of samples needed.
+  5. Preprocess and tokenize the samples into the format the model uses.
+</Callouts>
+
+### (Optional) Upload Dataset into S3 Storage
+
+To upload a dataset to S3, run the following code in `JupyterLab`.
+
+```python
+import os
+
+import boto3
+from boto3.s3.transfer import TransferConfig
+
+local_folder = "./ultrachat_200k" #[!code callout]
+bucket_name = "datasets"
+
+config = TransferConfig(
+    multipart_threshold=100*1024*1024,
+    max_concurrency=10,
+    multipart_chunksize=100*1024*1024,
+    use_threads=True
+) #[!code callout]
+
+# Assumption: credentials come from the environment or the default boto3 configuration.
+# For MinIO or another S3-compatible store, also pass endpoint_url=... to boto3.client.
+s3 = boto3.client("s3")
+
+# Walk the local folder and upload every file, preserving the relative layout.
+for root, dirs, files in os.walk(local_folder):
+    for filename in files:
+        local_path = os.path.join(root, filename)
+        relative_path = os.path.relpath(local_path, local_folder)
+        s3_key = f"ultrachat_200k/{relative_path.replace(os.sep, '/')}"
+        s3.upload_file(local_path, bucket_name, s3_key, Config=config)
+        print(f"Uploaded {local_path} -> {s3_key}")
+```
+
+<Callouts>
+  1. **You can modify this line if you want to use your own dataset**.
+  2. Configure multipart upload with 100 MB chunks and a maximum of 10 concurrent threads.
+</Callouts>
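Before uploading a large dataset, it can help to preview which object keys the loop above would create. The following is a minimal sketch of that key mapping only; `plan_s3_keys` is a hypothetical helper name, and the temporary directory merely imitates a dataset layout. It performs no network calls.

```python
import os
import tempfile


def plan_s3_keys(local_folder, key_prefix):
    """Return (local_path, s3_key) pairs for every file under local_folder, sorted by key."""
    pairs = []
    for root, _dirs, files in os.walk(local_folder):
        for filename in files:
            local_path = os.path.join(root, filename)
            relative_path = os.path.relpath(local_path, local_folder)
            # S3 keys always use forward slashes, regardless of the local OS.
            s3_key = f"{key_prefix}/{relative_path.replace(os.sep, '/')}"
            pairs.append((local_path, s3_key))
    return sorted(pairs, key=lambda pair: pair[1])


# Demo on a throwaway directory that imitates a dataset layout.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "data"))
    for name in ("README.md", os.path.join("data", "train_sft-00000.parquet")):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("placeholder")
    planned_keys = [key for _path, key in plan_s3_keys(tmp, "ultrachat_200k")]

print(planned_keys)
```

Keeping the mapping in a pure function makes it easy to check the resulting bucket layout before any data is transferred.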
+
+### (Optional) Use Dataset in S3 Storage
+
+If you wish to use datasets from S3, you can first install the `s3fs` tool and then modify the dataset loading section in the example by following the code below.
+
+```bash
+pip install s3fs -i https://pypi.tuna.tsinghua.edu.cn/simple
+```
+
+```python title=calibration-compressor.ipynb
+import os
+from datasets import load_dataset
+
+os.environ["AWS_ACCESS_KEY_ID"] = "07Apples@" #[!code callout]
+os.environ["AWS_SECRET_ACCESS_KEY"] = "07Apples@"
+
+storage_options = {
+  "key": "07Apples@",
+  "secret": "07Apples@",
+  "client_kwargs": {
+    "endpoint_url": "http://minio.minio-system.svc.cluster.local:80" #[!code callout]
+  }
+}
+
+ds = load_dataset(
+      'parquet',
+      data_files='s3://datasets/ultrachat_200k/data/train_sft-*.parquet', #[!code callout]
+      storage_options=storage_options, #[!code callout]
+      split="train"
+)
+```
+
+<Callouts>
+  1. Set environment variables as a fallback; some underlying components read them.
+  2. Define the storage configuration. You must explicitly specify `endpoint_url` to connect to MinIO.
+  3. Load the `train_sft-*.parquet` files as a single `train` split. This is equivalent to `split="train_sft"` in the original example.
+  4. Pass the storage configuration to `load_dataset`.
+</Callouts>
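The example above hardcodes credentials for brevity. One possible alternative, sketched below, builds the same `storage_options` dictionary from environment variables so that secrets stay out of the notebook. `build_storage_options` is a hypothetical helper, and `S3_ENDPOINT_URL` is a variable name assumed for this sketch only; the `key`, `secret`, and `client_kwargs` entries mirror the example above.

```python
import os


def build_storage_options(env=os.environ):
    """Build an s3fs-style storage_options dict from environment variables."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    options = {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
    }
    # S3_ENDPOINT_URL is a hypothetical variable name used only in this sketch.
    endpoint_url = env.get("S3_ENDPOINT_URL")
    if endpoint_url:
        options["client_kwargs"] = {"endpoint_url": endpoint_url}
    return options


# Demo with an explicit mapping instead of the real process environment.
example_env = {
    "AWS_ACCESS_KEY_ID": "example-access-key",
    "AWS_SECRET_ACCESS_KEY": "example-secret-key",
    "S3_ENDPOINT_URL": "http://minio.minio-system.svc.cluster.local:80",
}
storage_options = build_storage_options(example_env)
print(sorted(storage_options))
```

The resulting dictionary can be passed as `storage_options=storage_options` in the `load_dataset` call shown above.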
+
+### Clone Models and Datasets in JupyterLab
+
+In the JupyterLab terminal, use `git clone` to download the model repository (and dataset, if applicable) to your workspace. The data-free compressor notebook does not require a dataset.
+
+If you are in a restricted network environment, you can use the Hugging Face mirror for accelerated access:
+```bash
+export HF_ENDPOINT=https://hf-mirror.com
+```
+You can use the `hf` command-line tool to download models and datasets directly. For example, to download the TinyLlama model:
+```bash
+hf download TinyLlama/TinyLlama-1.1B-Chat-v1.0 --local-dir TinyLlama-1.1B-Chat-v1.0
+```
+For calibration datasets, download them similarly:
+```bash
+hf download --repo-type dataset HuggingFaceH4/ultrachat_200k --local-dir ultrachat_200k
+```
+
+### Create and Run Compression Notebooks
+
+Download the appropriate example notebook for your use case: the **calibration compressor notebook** if you are using a dataset, or the **data-free compressor notebook** otherwise. Click the upward arrow button on the JupyterLab page to upload the downloaded notebook file.
+
+### Upload the Compressed Model to the Repository
+
+Once compression is complete, upload the compressed model back to the model repository. See [Upload Models Using Notebook](../../model_inference/model_management/how_to/upload_models_using_notebook.mdx) for detailed steps on uploading model files to the model repository.
+
+```python
+model_dir = "./" + model_id.split("/")[-1] + "-W4A16" #[!code callout]
+model.save_pretrained(model_dir)
+tokenizer.save_pretrained(model_dir);
+```
+
+<Callouts>
+  1. Output directory for the compressed model and tokenizer. **You can modify this line if you want to change the output name**.
+</Callouts>
+
+### Deploy and Use the Compressed Model for Inference
+
+Quantized and sparse models that you create with LLM Compressor are saved using the `compressed-tensors` library (an extension of [Safetensors](https://huggingface.co/docs/safetensors/en/index)).
+The compression format matches the model's quantization or sparsity type. vLLM supports these formats natively, so the Alauda AI Inference Server can serve the compressed model with optimized inference kernels.
+Follow the instructions in [create inference service](../../model_inference/inference_service/functions/inference_service.mdx#create-inference-service) to complete this step.
diff --git a/docs/en/llm-compressor/how_to/index.mdx b/docs/en/llm-compressor/how_to/index.mdx
@@ -0,0 +1,7 @@
+---
+weight: 60
+---
+
+# How To
+
+<Overview />
diff --git a/docs/en/llm-compressor/index.mdx b/docs/en/llm-compressor/index.mdx
@@ -0,0 +1,7 @@
+---
+weight: 82
+---
+
+# LLM Compressor
+
+<Overview />
diff --git a/docs/en/llm-compressor/intro.mdx b/docs/en/llm-compressor/intro.mdx
@@ -0,0 +1,35 @@
+---
+weight: 10
+---
+
+# Introduction
+
+## Preface
+
+[LLM Compressor](https://github.com/vllm-project/llm-compressor), part of [the vLLM project](https://docs.vllm.ai/en/latest/) for efficient serving of LLMs, integrates the latest model compression research into a single open-source library enabling the generation of efficient, compressed models with minimal effort.
+
+The framework allows users to apply some of the most recent research on model compression techniques to improve generative AI (gen AI) models' efficiency, scalability and performance while maintaining accuracy. With native support for Hugging Face and vLLM, the compressed models can be integrated into deployment pipelines, delivering faster and more cost-effective inference at scale.
+
+LLM Compressor lets you apply model optimization techniques such as quantization, sparsity, and compression to reduce memory use and model size and to speed up inference, with minimal impact on the accuracy of model responses. LLM Compressor supports the following compression methodologies:
+
+- **Quantization**: Converts model weights and activations to lower-bit formats such as int8, reducing memory usage.
+- **Sparsity**: Sets a portion of model weights to zero, often in fixed patterns, allowing for more efficient computation.
+- **Compression**: Shrinks the saved model file size, ideally with minimal impact on performance.
+
+Use these methods together to deploy models more efficiently on resource-limited hardware.
+
+## LLM Compressor supports a wide variety of compression techniques:
+
+- Weight-only quantization (W4A16) compresses model weights to 4-bit precision, valuable for AI applications with limited hardware resources or high sensitivity to latency.
+- Weight and activation quantization (W8A8) compresses both weights and activations to 8-bit precision, targeting general server scenarios for integer and floating-point formats.
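To make the difference between these schemes concrete, the sketch below estimates weight storage for a model with a nominal 1.1 billion parameters. It is back-of-the-envelope arithmetic under stated assumptions, not a measurement: the group size of 128 and the 16-bit scales are assumed values, and the estimate ignores layers left unquantized (such as `lm_head`), activations, and the KV cache.

```python
def weight_memory_bytes(num_params, weight_bits, group_size=None, scale_bits=16):
    """Estimate weight storage: packed weights plus one scale per group (if grouped)."""
    total_bits = num_params * weight_bits
    if group_size is not None:
        # One scale is stored for every `group_size` weights (ceiling division).
        num_groups = -(-num_params // group_size)
        total_bits += num_groups * scale_bits
    return total_bits // 8


NUM_PARAMS = 1_100_000_000  # nominal parameter count, an assumption for illustration

fp16_bytes = weight_memory_bytes(NUM_PARAMS, 16)
w8_bytes = weight_memory_bytes(NUM_PARAMS, 8)
w4_bytes = weight_memory_bytes(NUM_PARAMS, 4, group_size=128)

for label, size in (("FP16", fp16_bytes), ("8-bit weights", w8_bytes), ("4-bit weights, group 128", w4_bytes)):
    print(f"{label}: {size / 1e9:.2f} GB")
```

Under these assumptions the weights shrink from about 2.2 GB in FP16 to about 1.1 GB at 8 bits and about 0.57 GB at 4 bits, with the per-group scales adding only a few percent on top of the packed 4-bit weights.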
+
+## LLM Compressor supports several compression algorithms:
+
+- AWQ: Weight-only `INT4` quantization
+- GPTQ: Weight-only `INT4` quantization
+- FP8: Dynamic per-token quantization
+- SparseGPT: Post-training sparsity
+- SmoothQuant: Activation quantization
+
+For more information about compression algorithms and formats, please refer to the [documentation](https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/compression_schemes/) and examples in the [llmcompressor](https://github.com/vllm-project/llm-compressor) repository.
+Each of these compression methods computes optimal scales and zero-points for weights and activations. Optimized scales can be per tensor, channel, group, or token. The final result is a compressed model saved with all its applied quantization parameters.
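As an illustration of what scales and zero-points are, the sketch below applies textbook affine (asymmetric) quantization to a tiny weight matrix and compares per-tensor with per-channel granularity. This is not LLM Compressor's implementation; algorithms such as GPTQ and AWQ choose their parameters more carefully. All function names here are hypothetical.

```python
def quantize_asymmetric(values, num_bits=8):
    """Affine-quantize floats to unsigned integers; return (ints, scale, zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must contain 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Two output channels with very different value ranges.
weights = [[-1.0, 0.0, 0.5, 1.0], [0.0, 10.0, 20.0, 30.0]]

# Per-tensor: one (scale, zero_point) pair shared by every value.
flat = [v for row in weights for v in row]
q_flat, scale_t, zp_t = quantize_asymmetric(flat)
restored_flat = dequantize(q_flat, scale_t, zp_t)
per_tensor_error = max_error(weights[0], restored_flat[:4])

# Per-channel: each row gets its own (scale, zero_point) pair.
q_row, scale_c, zp_c = quantize_asymmetric(weights[0])
per_channel_error = max_error(weights[0], dequantize(q_row, scale_c, zp_c))

print(f"per-tensor error on channel 0:  {per_tensor_error:.5f}")
print(f"per-channel error on channel 0: {per_channel_error:.5f}")
```

Because the second channel has a much wider range, a shared per-tensor scale wastes resolution on the first channel. Giving each channel its own scale reduces the reconstruction error there, which is the motivation for finer granularities such as per-channel or per-group scales.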
diff --git a/docs/en/workbench/how_to/create_workbench.mdx b/docs/en/workbench/how_to/create_workbench.mdx
@@ -35,10 +35,75 @@ After creating a workbench instance, click `Workbench` in the left navigation ba
 
 :::info
 We have built-in WorkspaceKind resources that are ready to use out of the box; you can see the two options we provide in the dropdown menu.
-- [jupyter-lab](https://hub.docker.com/r/alaudadockerhub/base-notebook)
-- [codeserver](https://hub.docker.com/r/alaudadockerhub/base-codeserver)
+- [jupyter-lab](https://hub.docker.com/r/alaudadockerhub/odh-workbench-jupyter-datascience-cpu-py312-ubi9)
+- [codeserver](https://hub.docker.com/r/alaudadockerhub/odh-workbench-codeserver-datascience-cpu-py312-ubi9)
 :::
 
+The following additional workbench images are available but are **not built into the platform by default**:
+
+- alaudadockerhub/odh-workbench-jupyter-tensorflow-cuda-py312-ubi9
+- alaudadockerhub/odh-workbench-jupyter-pytorch-llmcompressor-cuda-py312-ubi9
+- alaudadockerhub/odh-workbench-jupyter-pytorch-cuda-py312-ubi9
+
+If you want to use these images, you must first **manually synchronize them to your own image registry** (for example, by using a tool such as `skopeo`). After the image is available in your registry, you also need to **add the corresponding configuration to the `imageConfig` field of the WorkspaceKind resource** that you plan to use.
+
+Below is an example JSON patch that adds a new image configuration to an existing WorkspaceKind:
+
+```json title=add-llmcompressor-image-patch.json
+[
+  {
+    "op": "add",
+    "path": "/spec/podTemplate/options/imageConfig/values/-",
+    "value": {
+      "id": "jupyter-pytorch-llmcompressor-cuda-py312",
+      "spawner": {
+        "displayName": "Jupyter | PyTorch LLM Compressor | CUDA | Python 3.12",
+        "description": "JupyterLab with PyTorch and LLM Compressor for CUDA",
+        "labels": [
+          {
+            "key": "python_version",
+            "value": "3.12"
+          },
+          {
+            "key": "framework",
+            "value": "pytorch"
+          },
+          {
+            "key": "accelerator",
+            "value": "cuda"
+          }
+        ]
+      },
+      "spec": {
+        "image": "mlops/workbench-images/odh-workbench-jupyter-pytorch-llmcompressor-cuda-py312-ubi9:3.4_ea1-v1.41",
+        "imagePullPolicy": "IfNotPresent",
+        "ports": [
+          {
+            "id": "jupyterlab",
+            "displayName": "JupyterLab",
+            "port": 8888,
+            "protocol": "HTTP"
+          }
+        ]
+      }
+    }
+  }
+]
+```
+
+You can apply the patch to the WorkspaceKind you are using with a command similar to the following:
+
+```bash
+kubectl patch workspacekind jupyterlab-internal-3-4-ea1-v1-41 \
+  --type=json \
+  --patch-file add-llmcompressor-image-patch.json \
+  -o yaml
+```
+
+This command applies the JSON patch file to the specified WorkspaceKind and updates its `imageConfig` so the new workbench image becomes available in the workbench creation UI.
+
+In practice, you can adapt the `id`, `displayName`, `description`, and `image` fields according to the image you synchronized and the naming conventions used in your cluster.
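If you need patches for several images, generating them from a small script can reduce copy-and-paste mistakes. The sketch below rebuilds the same JSON Patch structure shown above; `build_image_patch` is a hypothetical helper, and the registry host and `TAG` in the image reference are placeholders that you would replace with the values from your own registry.

```python
import json


def build_image_patch(image_id, display_name, description, image, labels):
    """Return a JSON Patch list that appends one entry to imageConfig.values."""
    return [
        {
            "op": "add",
            "path": "/spec/podTemplate/options/imageConfig/values/-",
            "value": {
                "id": image_id,
                "spawner": {
                    "displayName": display_name,
                    "description": description,
                    "labels": [{"key": k, "value": v} for k, v in labels.items()],
                },
                "spec": {
                    "image": image,
                    "imagePullPolicy": "IfNotPresent",
                    "ports": [
                        {"id": "jupyterlab", "displayName": "JupyterLab", "port": 8888, "protocol": "HTTP"}
                    ],
                },
            },
        }
    ]


patch = build_image_patch(
    image_id="jupyter-pytorch-cuda-py312",
    display_name="Jupyter | PyTorch | CUDA | Python 3.12",
    description="JupyterLab with PyTorch for CUDA",
    # Placeholder registry host and tag: replace with the image you synchronized.
    image="registry.example.com/mlops/workbench-images/odh-workbench-jupyter-pytorch-cuda-py312-ubi9:TAG",
    labels={"python_version": "3.12", "framework": "pytorch", "accelerator": "cuda"},
)
patch_json = json.dumps(patch, indent=2)
print(patch_json)
```

Saving the printed output to a file gives a patch that can be applied with the same `kubectl patch --type=json --patch-file` command shown above.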
+
 :::info
 We have also built in some resource options, which you can see in the dropdown menu.
 :::