46 changes: 46 additions & 0 deletions content/en/docs/plugins.md
@@ -173,4 +173,50 @@ The Numa-Aware Plugin aims to address these limitations.

Common scenarios for the NUMA-Aware plugin are computation-intensive jobs that are sensitive to CPU parameters and scheduling delays, such as scientific computing, video decoding, animation rendering, and big data offline processing.

### Usage

#### Overview
The Usage-based scheduling plugin evaluates real-time resource utilization (e.g., CPU and memory) collected from monitoring systems such as Prometheus, rather than relying only on requested resources. It prevents new pods from being scheduled onto overloaded nodes and actively balances the cluster workload.

#### Scenario
Useful in clusters experiencing unbalanced node resource consumption where some nodes are overburdened while others remain idle despite having similar requested resources.

### Rescheduling

#### Overview
The Rescheduling plugin periodically rebalances the cluster by evaluating real resource utilization. It actively evicts pods from heavily utilized nodes and shuffles them to under-utilized nodes based on configured target thresholds and strategies like LowNodeUtilization or OfflineOnly.

#### Scenario
Perfect for long-running clusters where dynamic workload lifecycles lead to fragmentation and resource imbalances over time.

### ResourceQuota

#### Overview
The ResourceQuota plugin interfaces with Kubernetes' native `ResourceQuota` objects to ensure that a PodGroup is only enqueued if there is sufficient resource capacity in its namespace.
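
The enqueue gate described above can be pictured as a per-resource comparison against the namespace quota. The following is a minimal Python sketch of that idea, not the plugin's actual code: the function name `fits_namespace_quota` and the plain-dict representation are assumptions made for illustration, loosely modeled on the `hard` and `used` fields that a Kubernetes `ResourceQuota` reports in its status.

```python
def fits_namespace_quota(hard, used, requested):
    """Return True if `requested` fits within the remaining namespace quota.

    hard, used, requested: dicts mapping a resource name (e.g. "cpu") to a
    numeric quantity in one consistent unit. Resources that the quota does
    not constrain (absent from `hard`) are ignored.
    """
    for resource, amount in requested.items():
        if resource not in hard:
            continue  # the quota places no limit on this resource
        if used.get(resource, 0) + amount > hard[resource]:
            return False
    return True


# Hypothetical namespace quota: 10 CPUs and 64 GiB, with 8 CPUs already used.
quota_hard = {"cpu": 10, "memory": 64}
quota_used = {"cpu": 8, "memory": 16}

print(fits_namespace_quota(quota_hard, quota_used, {"cpu": 2, "memory": 8}))  # True
print(fits_namespace_quota(quota_hard, quota_used, {"cpu": 4}))               # False
```

A PodGroup whose minimum request fails this kind of check would stay out of the scheduling pipeline until quota is freed.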

#### Scenario
Highly beneficial in multi-tenant environments to prevent jobs from entering the scheduling pipeline and clogging the queue when they have no chance of running due to namespace quota restrictions.

### Pod Disruption Budget (PDB)

#### Overview
The PDB Plugin ensures that Volcano respects user-defined Kubernetes PodDisruptionBudget (PDB) constraints during any eviction-based scheduling actions, such as `reclaim`, `preempt`, and `shuffle`.

#### Scenario
Crucial for highly available workloads where simultaneous eviction of multiple replicas could result in service disruption.

### Overcommit

#### Overview
The Overcommit Plugin allows the scheduler to artificially inflate the apparent "idle resources" of the cluster by a configurable factor (e.g., 1.2), permitting more jobs to enqueue in the scheduling pipeline than the physical capacity would otherwise allow.

#### Scenario
Useful when administrators want the scheduler to tolerate a larger backlog of `pending` pods waiting for resources without rejecting them outright during peak loads.

### DeviceShare

#### Overview
The DeviceShare Plugin provides a unified framework for sharing specialized hardware devices such as GPUs, NPUs, and FPGAs across multiple pods.

#### Scenario
Ideal for advanced AI/ML environments needing granular hardware sharing, like vGPU, vNPU, and GPU exclusive deployments.
53 changes: 53 additions & 0 deletions content/en/docs/user_guide_how_to_use_deviceshare_plugin.md
@@ -0,0 +1,53 @@
+++
title = "DeviceShare Plugin"

date = 2026-05-11
lastmod = 2026-05-11

draft = false
toc = true
type = "docs"

linktitle = "DeviceShare"
[menu.docs]
parent = "user-guide"
weight = 4
+++

## Introduction

The **DeviceShare Plugin** is an advanced resource scheduling plugin in Volcano that provides a common framework for sharing specialized hardware devices (such as GPUs, NPUs, and FPGAs) across multiple pods.

Rather than implementing fragmented logic for each new hardware accelerator, Volcano exposes a unified `Devices` interface. The `deviceshare` plugin leverages this interface to perform robust allocation, node filtering, and resource tracking for shared devices.

## Mechanism

The DeviceShare plugin works in conjunction with device-specific implementations. It exposes standard scheduling operations such as `Predicate` (filtering nodes based on available device capacity) and `Allocate`/`Release` (assigning portions of a device to specific pods).

Currently, the `deviceshare` plugin serves as the underlying engine powering features like:
- **GPU Sharing**: Allowing multiple pods to request fractions of a single physical GPU's memory.
- **vGPU and vNPU**: Virtualizing accelerator slices.
- **GPU Exclusive**: Restricting a pod to exclusively own a GPU to avoid contention.
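
The `Predicate` and `Allocate`/`Release` operations described above can be illustrated with a small model of GPU memory sharing. This is a hedged Python sketch only: the class `SharedGPUNode`, its method names, and the first-fit placement rule are hypothetical and do not reproduce Volcano's `Devices` interface or any device-specific implementation.

```python
class SharedGPUNode:
    """Tracks free memory per physical GPU on one node (hypothetical model)."""

    def __init__(self, name, gpu_memory_mib):
        self.name = name
        self.free = list(gpu_memory_mib)  # free MiB per physical GPU
        self.assignments = {}             # pod name -> (gpu index, MiB)

    def predicate(self, request_mib):
        """Filter step: can any single GPU on this node satisfy the request?"""
        return any(free >= request_mib for free in self.free)

    def allocate(self, pod, request_mib):
        """Give the pod a slice of the first GPU with enough free memory."""
        for index, free in enumerate(self.free):
            if free >= request_mib:
                self.free[index] -= request_mib
                self.assignments[pod] = (index, request_mib)
                return index
        raise RuntimeError(f"no GPU on {self.name} can fit {request_mib} MiB")

    def release(self, pod):
        """Return the pod's slice to the GPU it was taken from."""
        index, request_mib = self.assignments.pop(pod)
        self.free[index] += request_mib


node = SharedGPUNode("node-a", [16384, 16384])
node.allocate("pod-1", 12000)  # lands on GPU 0
node.allocate("pod-2", 8000)   # GPU 0 has 4384 MiB left, so GPU 1 is used
print(node.predicate(10000))   # False: 4384 and 8384 MiB are free
node.release("pod-1")
print(node.predicate(10000))   # True: GPU 0 is fully free again
```

The point of the sketch is the division of labor: the filter only answers whether a node could host the request, while allocation and release keep the per-device bookkeeping consistent.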

## Configuration and Usage

The `deviceshare` plugin is typically enabled implicitly when you enable device sharing predicates in the Volcano scheduler config map. However, if you are developing custom device sharing logic or need to explicitly declare it, it can be configured in your `volcano-scheduler-configmap`:

```yaml
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
  - name: deviceshare # Enable the device share framework plugin
- plugins:
  - name: overcommit
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
```

> **Note:** For specific guides on how to configure your workloads to request shared GPUs or NPUs, please refer to the dedicated guides for [GPU Sharing](../user_guide_how_to_use_gpu_sharing) and [vNPU](../user_guide_how_to_use_vnpu).
71 changes: 71 additions & 0 deletions content/en/docs/user_guide_how_to_use_hcclrank_plugin.md
@@ -0,0 +1,71 @@
+++
title = "HCCLRank Plugin"

date = 2026-05-11
lastmod = 2026-05-11

draft = false
toc = true
type = "docs"

linktitle = "HCCLRank"
[menu.docs]
parent = "user-guide"
weight = 4
+++

## Introduction

In distributed AI training, particularly with Ascend NPUs (Neural Processing Units) or the MindSpore framework, each compute node needs a deterministic rank (index) to communicate over HCCL (Huawei Collective Communication Library).

The **HCCLRank Plugin** is a Volcano Job plugin that automatically injects a `hccl/rankIndex` annotation into the Pods of a Volcano Job. It calculates a unique rank for each pod based on its task type (`master` or `worker`) and its replica index.

## Mechanism

During the Pod creation phase (`OnPodCreate`), the HCCLRank Plugin intercepts the pod and adds the `hccl/rankIndex` annotation to it.

The calculation is as follows:
- **Master Role**: Rank = Pod Index
- **Worker Role**: Rank = (Total Master Replicas) + Pod Index

If the Pod already has a `RANK` environment variable defined in its container specifications, the plugin will use that value instead and simply map it to the `hccl/rankIndex` annotation.
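
The rank rules above can be written out as a short function. This is an illustrative Python sketch, not the plugin's source: the function name `hccl_rank_index` and its parameters are hypothetical, and returning `None` for tasks that are neither master nor worker is an assumption, since the text above does not cover that case.

```python
def hccl_rank_index(task_name, pod_index, master_replicas,
                    master_task="master", worker_task="worker", rank_env=None):
    """Compute the value to store in the hccl/rankIndex annotation.

    rank_env is the container's RANK environment variable, if one is set.
    """
    if rank_env is not None:
        return str(rank_env)  # an explicit RANK value takes precedence
    if task_name == master_task:
        return str(pod_index)
    if task_name == worker_task:
        return str(master_replicas + pod_index)
    return None  # assumption: other task names are left without a rank


# One master replica and two worker replicas yield ranks 0, 1 and 2.
print(hccl_rank_index("master", 0, master_replicas=1))  # 0
print(hccl_rank_index("worker", 0, master_replicas=1))  # 1
print(hccl_rank_index("worker", 1, master_replicas=1))  # 2
```

Because worker ranks are offset by the number of master replicas, the ranks across both roles are unique and contiguous.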

## Configuration

To enable the HCCLRank plugin, configure it within the Volcano job controller's configuration or add it to the `plugins` field of your `VolcanoJob` spec.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ascend-distributed-training
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    hcclrank:
    - --master=master
    - --worker=worker
  tasks:
  - replicas: 1
    name: master
    template:
      spec:
        containers:
        - name: master
          image: my-ascend-image
  - replicas: 2
    name: worker
    template:
      spec:
        containers:
        - name: worker
          image: my-ascend-image
```

### Arguments

The HCCLRank plugin supports overriding the default task names used to identify master and worker roles:

- **`--master`**: The name of the master role task in your Job spec. Default is `master`.
- **`--worker`**: The name of the worker role task in your Job spec. Default is `worker`.
56 changes: 56 additions & 0 deletions content/en/docs/user_guide_how_to_use_overcommit_plugin.md
@@ -0,0 +1,56 @@
+++
title = "Overcommit Plugin"

date = 2026-05-11
lastmod = 2026-05-11

draft = false
toc = true
type = "docs"

linktitle = "Overcommit"
[menu.docs]
parent = "user-guide"
weight = 4
+++

## Introduction

In typical cluster environments, the scheduler calculates idle resources strictly as physical node capacity minus allocated resources. When the cluster is nearly fully utilized, many PodGroups are therefore rejected at the enqueue stage and never enter the scheduling pipeline. This is undesirable when you want the scheduler to tolerate a larger backlog of `pending` pods.

The **Overcommit Plugin** allows the scheduler to artificially inflate the apparent "idle resources" of the cluster by applying an `overcommit-factor`. This permits more jobs to be enqueued and wait in the scheduling pipeline than the physical resources might typically allow.

## Mechanism

The Overcommit plugin evaluates whether a job can be enqueued based on the requested `MinResources` of the PodGroup and the expanded idle resources.

Expanded idle resource is calculated as:
`Idle Resource = (Total Resource * overcommit-factor) - Used Resource`

If the job's minimal requested resources can fit into this expanded idle resource pool, the job is permitted to be enqueued.
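
As a worked illustration of the formula above, the following Python sketch applies it per resource. The function name `can_enqueue` and the dict-based inputs are assumptions for the example; the real plugin operates on Volcano's internal resource types.

```python
def can_enqueue(total, used, min_resources, overcommit_factor=1.2):
    """Check a PodGroup's MinResources against the expanded idle pool.

    total, used, min_resources: dicts mapping resource name -> quantity.
    """
    for resource, requested in min_resources.items():
        idle = total.get(resource, 0) * overcommit_factor - used.get(resource, 0)
        if requested > idle:
            return False
    return True


cluster_total = {"cpu": 100}
cluster_used = {"cpu": 95}
job_min = {"cpu": 20}

# Without overcommit only 5 CPUs are idle, so the job is rejected.
print(can_enqueue(cluster_total, cluster_used, job_min, overcommit_factor=1.0))  # False
# With a factor of 1.2 the apparent idle pool is 120 - 95 = 25 CPUs.
print(can_enqueue(cluster_total, cluster_used, job_min, overcommit_factor=1.2))  # True
```

Note that passing this check only lets the job wait in the pipeline; it still needs real capacity before its pods can be bound to nodes.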

## Configuration

To use the Overcommit Plugin, add it to a plugin tier in your `volcano-scheduler-configmap`, make sure the `enqueue` action is listed in `actions`, and optionally provide an `overcommit-factor`.

```yaml
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: overcommit # Enable the overcommit plugin
    arguments:
      overcommit-factor: 1.2 # The overcommit factor. Default is 1.2
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
```

### Arguments

- **`overcommit-factor`**: A float value greater than or equal to `1.0`. For example, `1.2` means the scheduler treats the cluster as having 20% more total resources when deciding whether to enqueue jobs into the pipeline. If a value less than `1.0` is provided, the plugin automatically falls back to the default value of `1.2`.
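
The fallback rule for `overcommit-factor` can be summarized in a few lines. This Python sketch is illustrative only; the helper name `effective_overcommit_factor` is hypothetical, and treating a missing value the same as an invalid one is an assumption based on the documented default.

```python
DEFAULT_OVERCOMMIT_FACTOR = 1.2


def effective_overcommit_factor(configured=None):
    """Return the factor the scheduler would apply for a configured value."""
    if configured is None or configured < 1.0:
        return DEFAULT_OVERCOMMIT_FACTOR  # unset or invalid: use the default
    return float(configured)


print(effective_overcommit_factor(1.5))  # 1.5
print(effective_overcommit_factor(0.8))  # 1.2
```

A factor of exactly `1.0` is valid and disables the inflation, so idle resources equal physical capacity minus used resources.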
55 changes: 55 additions & 0 deletions content/en/docs/user_guide_how_to_use_pdb_plugin.md
@@ -0,0 +1,55 @@
+++
title = "Pod Disruption Budget (PDB) Plugin"

date = 2026-05-11
lastmod = 2026-05-11

draft = false
toc = true
type = "docs"

linktitle = "Pod Disruption Budget"
[menu.docs]
parent = "user-guide"
weight = 4
+++

## Introduction

When users deploy highly available jobs or applications on Volcano, they often need to limit the number of pod replicas that can be evicted or destroyed simultaneously to avoid downtime. This constraint is managed via Kubernetes **PodDisruptionBudget (PDB)** resources.

The **PDB Plugin** ensures that Volcano respects user-defined PDB constraints during the scheduling process, specifically during eviction actions like `reclaim`, `preempt`, and `shuffle`.

## Prerequisites

- Your Kubernetes version must be 1.21 or later.
- You must have created valid `PodDisruptionBudget` resources for your workloads.

## Mechanism

The PDB Plugin registers several functions (`ReclaimableFn`, `PreemptableFn`, and `VictimTasksFn`) under the `reclaim`, `preempt`, and `shuffle` actions. It maintains a cache of PDBs using `v1.PodDisruptionBudgetLister`.

During eviction scenarios, the plugin filters out tasks whose eviction would violate the configured PDB constraints. It calculates the `DisruptedPods` (pods whose eviction was processed but not yet observed by the PDB controller) and ensures the remaining available replicas satisfy the budget.
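
The filtering step above can be sketched as budget bookkeeping over the candidate victims. This Python example is a simplified illustration rather than the plugin's implementation: the function `filter_pdb_safe_victims` and the dict fields are hypothetical, with `disruptions_allowed` and `disrupted_pods` named after the corresponding fields in a `PodDisruptionBudget` status, and label matching reduced to exact key/value equality.

```python
def filter_pdb_safe_victims(candidates, pdbs):
    """Return the candidates that can be evicted without violating any PDB.

    candidates: list of dicts with "name" and "labels".
    pdbs: list of dicts with "match_labels", "disruptions_allowed" (int) and
          "disrupted_pods" (set of pod names whose eviction is in flight).
    """
    remaining = [pdb["disruptions_allowed"] for pdb in pdbs]
    safe = []
    for pod in candidates:
        matching = []
        for i, pdb in enumerate(pdbs):
            selected = all(pod["labels"].get(key) == value
                           for key, value in pdb["match_labels"].items())
            # A pod already listed as disrupted was counted against the budget.
            if selected and pod["name"] not in pdb["disrupted_pods"]:
                matching.append(i)
        if all(remaining[i] > 0 for i in matching):
            for i in matching:
                remaining[i] -= 1  # consume one allowed disruption per PDB
            safe.append(pod)
    return safe


web_pdb = {"match_labels": {"app": "web"}, "disruptions_allowed": 1,
           "disrupted_pods": set()}
victims = [
    {"name": "web-0", "labels": {"app": "web"}},
    {"name": "web-1", "labels": {"app": "web"}},
    {"name": "batch-0", "labels": {"app": "batch"}},
]
print([pod["name"] for pod in filter_pdb_safe_victims(victims, [web_pdb])])
# ['web-0', 'batch-0']
```

Here `web-1` is dropped from the victim list because evicting it together with `web-0` would exceed the single disruption the budget allows.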

## Configuration

To enable the PDB Plugin, update the `volcano-scheduler-configmap` to include the `pdb` plugin in your configuration tiers.

```yaml
actions: "reclaim, preempt, shuffle"
tiers:
- plugins:
  - name: pdb # Enable the PDB plugin
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: overcommit
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
```

*Note: The PDB plugin will be actively invoked when actions like `reclaim`, `preempt`, or `shuffle` are executed in the scheduler workflow.*
84 changes: 84 additions & 0 deletions content/en/docs/user_guide_how_to_use_rescheduling_plugin.md
@@ -0,0 +1,84 @@
+++
title = "Rescheduling Plugin"

date = 2026-05-11
lastmod = 2026-05-11

draft = false
toc = true
type = "docs"

linktitle = "Rescheduling"
[menu.docs]
parent = "user-guide"
weight = 4
+++

## Introduction

Unbalanced resource utilization across a Kubernetes cluster often results from suboptimal scheduling strategies, dynamic changes in job lifecycles, and node status changes (such as added or removed nodes, or taint and affinity modifications).

The **Rescheduling** plugin addresses these issues by actively rebalancing resource utilization across the cluster's nodes. It evaluates real resource utilization (via Prometheus metrics) rather than only the requested resource amounts, and it periodically evicts pods according to the configured rescheduling strategies.

## Rescheduling Workflow

1. **Resource Filter**: Filters the workloads that are eligible for eviction, based on queues or labels.
2. **Strategy Evaluation**: Evaluates the filtered workloads against the configured rescheduling strategies to determine which ones should be evicted.
3. **Eviction**: Evicts the pods that belong to the identified workloads.
4. **Periodic Execution**: Repeats the above process at a configured interval.
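
The first step of the workflow above, the resource filter, can be sketched as follows. This is a minimal Python illustration with hypothetical names (`select_candidates`, and workloads represented as dicts); it assumes that an empty selector matches everything and that a workload must carry all of the selected labels, which the text above does not spell out.

```python
def select_candidates(workloads, queue_selector=None, label_selector=None):
    """Return workloads eligible for eviction under the given selectors.

    workloads: list of dicts with "name", "queue" and "labels".
    queue_selector: list of queue names, or None/empty for all queues.
    label_selector: dict of required labels, or None/empty for no restriction.
    """
    selected = []
    for workload in workloads:
        if queue_selector and workload["queue"] not in queue_selector:
            continue
        if label_selector and any(workload["labels"].get(key) != value
                                  for key, value in label_selector.items()):
            continue  # assumption: every selected label must match
        selected.append(workload)
    return selected


jobs = [
    {"name": "etl", "queue": "default", "labels": {"business": "offline"}},
    {"name": "api", "queue": "default", "labels": {"business": "online"}},
    {"name": "sim", "queue": "research", "labels": {"business": "offline"}},
]
eligible = select_candidates(jobs, queue_selector=["default"],
                             label_selector={"business": "offline"})
print([job["name"] for job in eligible])  # ['etl']
```

Only the workloads that survive this filter are passed on to strategy evaluation and, potentially, eviction.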

## Rescheduling Strategies

Volcano's rescheduling plugin supports multiple strategies to select potential evictees:

- **LowNodeUtilization**: Targets unbalanced nodes by evicting pods from highly utilized nodes so that they can be shuffled to under-utilized nodes, based on the configured thresholds and target thresholds.
- **OfflineOnly (OLO)**: Only selects offline workloads (annotated with `preemptable: true`) for rescheduling.
- **LowPriorityFirst (LPF)**: Sorts workloads by priority and evicts lower-priority pods first.
- **ShortLifeTimeFirst (SLTF)**: Sorts workloads by running time. Pods with the shortest lifetime are rescheduled first so that long-running workloads are not interrupted.
- **BigObjectFirst (BOF)**: Selects the workloads that request the most of the dominant resource and reschedules them first, to improve system throughput and avoid starving small workloads.
- **MoreReplicasFirst (MRF)**: Sorts workloads by replica count. Workloads with the most replicas are rescheduled first, which is friendly to `gang` scheduling because it takes `minAvailable` into account.
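
Several of the strategies above are, at their core, orderings over the candidate workloads. The Python sketch below shows three of them as plain sort functions. The function names and the dict fields (`priority`, `running_seconds`, `replicas`) are hypothetical and chosen for illustration; the plugin's real data structures differ.

```python
def low_priority_first(workloads):
    """LowPriorityFirst: lowest priority value is evicted first."""
    return sorted(workloads, key=lambda w: w["priority"])


def short_lifetime_first(workloads):
    """ShortLifeTimeFirst: the most recently started workload goes first."""
    return sorted(workloads, key=lambda w: w["running_seconds"])


def more_replicas_first(workloads):
    """MoreReplicasFirst: the workload with the most replicas goes first."""
    return sorted(workloads, key=lambda w: w["replicas"], reverse=True)


candidates = [
    {"name": "train", "priority": 100, "running_seconds": 7200, "replicas": 8},
    {"name": "etl", "priority": 10, "running_seconds": 300, "replicas": 2},
    {"name": "report", "priority": 50, "running_seconds": 60, "replicas": 1},
]
print([w["name"] for w in low_priority_first(candidates)])    # ['etl', 'report', 'train']
print([w["name"] for w in short_lifetime_first(candidates)])  # ['report', 'etl', 'train']
print([w["name"] for w in more_replicas_first(candidates)])   # ['train', 'etl', 'report']
```

Expressing each strategy as an independent ordering makes it easy to see why the order of entries in `strategies` matters when several are combined.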

## Configuration

To enable the Rescheduling plugin, you must configure the `volcano-scheduler-configmap` by adding the `shuffle` action and configuring the `rescheduling` plugin within the tiers.

```yaml
actions: "enqueue, allocate, backfill, shuffle"  ## Add 'shuffle' action
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
  - name: rescheduling       ## Rescheduling plugin
    arguments:
      interval: 5m           ## Optional. Frequency at which the strategies are called. Default is 5m.
      metricsPeriod: 5m      ## Optional. The duration of metrics to consider. Default is 5m.
      strategies:            ## Required. Strategies to execute in order.
        - name: offlineOnly
        - name: lowPriorityFirst
        - name: lowNodeUtilization
          params:
            thresholds:
              "cpu": 20      ## Threshold below which a node is considered under-utilized
              "memory": 20
              "pods": 20
            targetThresholds:
              "cpu": 50      ## Target utilization to reach for balance
              "memory": 50
              "pods": 50
      queueSelector:         ## Optional. Select workloads in specified queues as potential evictees. All queues by default.
        - default
        - test-queue
      labelSelector:         ## Optional. Select workloads with specified labels as potential evictees. All labels by default.
        business: offline
        team: test
- plugins:
  - name: overcommit
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
```
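
To make the role of `thresholds` and `targetThresholds` in the configuration above more concrete, here is a hedged Python sketch of how nodes might be classified. It assumes the convention used by the Kubernetes descheduler's strategy of the same name (a node is under-utilized only if every listed resource is below its threshold, and over-utilized if any resource exceeds its target threshold); whether Volcano applies exactly this rule is an assumption, and `classify_nodes` is a hypothetical name.

```python
def classify_nodes(usage, thresholds, target_thresholds):
    """Split nodes into under-utilized and over-utilized lists.

    usage: dict mapping node name -> {resource: utilization percent}.
    thresholds / target_thresholds: dicts mapping resource -> percent.
    """
    under, over = [], []
    for node, percent in usage.items():
        if all(percent[res] < limit for res, limit in thresholds.items()):
            under.append(node)  # every resource is below its threshold
        elif any(percent[res] > limit for res, limit in target_thresholds.items()):
            over.append(node)   # at least one resource exceeds its target
    return under, over


thresholds = {"cpu": 20, "memory": 20, "pods": 20}
target_thresholds = {"cpu": 50, "memory": 50, "pods": 50}
usage = {
    "node-a": {"cpu": 10, "memory": 15, "pods": 5},
    "node-b": {"cpu": 80, "memory": 40, "pods": 30},
    "node-c": {"cpu": 35, "memory": 30, "pods": 25},
}
print(classify_nodes(usage, thresholds, target_thresholds))
# (['node-a'], ['node-b'])
```

Under this reading, pods would be evicted from `node-b` in the expectation that they land on `node-a`, while `node-c` sits between the two bounds and is left alone.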

> **Note:** Rescheduling decisions are based on metrics collected from Prometheus, because the plugin evaluates real node resource utilization rather than requested resource amounts. Make sure your metrics configuration is set up correctly before enabling it.