30 changes: 18 additions & 12 deletions docs/en/installation/kubeflow.mdx
@@ -3,7 +3,7 @@
Deploy Kubeflow plugins in Alauda AI >= 2.0. Including:

- kfbase: Kubeflow Base components, including authentication and authorization, central dashboard, notebook, pvc-viewer, tensorboards, volumes, model registry ui, kserve endpoints ui, model catalog API service, etc.
- chart-kubeflow-model-registry: Kubeflow Model Registry instance (Helm Chart)
- model-registry-operator: Kubeflow Model Registry Operator
- kfp: Kubeflow Pipeline
- kftraining: Kubeflow Training Operator (deprecated)
- kubeflow-trainer: Kubeflow Training job management plugin, aka. Kubeflow Trainer v2 (replaces kftraining)
@@ -79,7 +79,7 @@ violet push --platform-address="https://192.168.171.123" \
```

- kfbase: Kubeflow Base functionality
- chart-kubeflow-model-registry: Kubeflow Model Registry
- model-registry-operator: Kubeflow Model Registry Operator
- kfp: Kubeflow Pipeline functionality
- kftraining: Kubeflow Training Operator (deprecated)
- kubeflow-trainer: Kubeflow Training job management plugin (replaces kftraining)
@@ -195,22 +195,26 @@ As above, in **Cluster Plugins**, find kfp (Kubeflow Pipeline) and kftrainer (Ku
**Note: After Kubeflow Pipeline deployment, Pipeline related functions can be used in the Kubeflow interface.**
**Note: Kubeflow Training Operator is a background task scheduler and will not appear in the UI menu and functions.**

### 5. Deploy chart-kubeflow-model-registry (Kubeflow Model Registry)
### 5. Deploy Kubeflow Model Registry

In **Catalog** or **Administrator** - **Marketplace** - **Chart Repositories**, find chart-kubeflow-model-registry, click the "Create" button, fill in the deployment name, project, namespace (example deployment location), Chart Version, then copy the `values.yaml` configuration information from the right to the left, modify the following content according to the cluster information:
In **Administrator** - **Marketplace** - **OperatorHub**, find Model Registry Operator and click the "Install" button to deploy the operator.

> **Note: Must install in a namespace that has already been bound to a Kubeflow user Profile, otherwise the Model Registry UI will not be displayed**
After the operator is installed, create a `ModelRegistry` instance in the user's namespace: switch to the **All Instances** tab and click the "Create" button.

- global.registry.address: The image registry address used by the current platform
- mysqlStorageClass: The mysql storage class used by Model Registry. Needs to be a storage class supported by the target deployment cluster.
- mysqlStorageSize: The mysql storage size used by Model Registry.
- mysqlDataBase: Database name (will be created automatically).
- modelRegistryDisplayName: The name of the Model Registry instance to be deployed
- modelRegistryDescription: Brief description of the Model Registry instance to be deployed
> **Note: Must create the instance in a namespace that has already been bound to a Kubeflow user Profile, otherwise the Model Registry UI will not be displayed**

When creating the instance, configure the following parameters as needed:

- **Name**: Configure the name of the Model Registry instance.
- **Namespace**: Configure the namespace where the Model Registry instance is located. It must be a namespace that has been bound to a Kubeflow user Profile.
- **MySQL Storage Class**: Configure the MySQL storage class used to store the Model Registry metadata. Choose one of the storage classes available in your cluster, for example, `standard`.
- **MySQL Storage Size**: Configure the size of the MySQL storage. The default value is `10Gi`; adjust it according to your needs.
- **DisplayName**: The name of the Model Registry instance to be displayed.
- **Description**: Brief description of the Model Registry instance.

**Note: After the Model Registry instance starts, refresh the Model Registry menu in the left navigation of the Kubeflow page to see the instance deployed in the above steps. Before the first instance is deployed, the Kubeflow Model Registry page will appear empty.**

**Note: The Model Registry instance will restrict network requests from non-current namespaces. If you need to allow more namespaces to access, you need to manually modify `kubectl -n <your-namespace> edit authorizationpolicy model-registry-service` and according to the istio documentation, add the namespaces that are allowed to access.**
**Note: The Model Registry instance restricts network requests from other namespaces. To allow additional namespaces to access it, manually edit the policy with `kubectl -n <your-namespace> edit authorizationpolicy <model-registry-name>` and, following the Istio documentation, add the namespaces that should be allowed access.**
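For illustration, an updated policy might look like the sketch below. The resource and namespace names (`demo-registry`, `demo-namespace`, `team-b`) are placeholders, and the exact structure of the generated policy may differ, so inspect the real one with `kubectl get` before editing:

```yaml
# List the policies first to find the actual resource name:
#   kubectl -n <your-namespace> get authorizationpolicy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: demo-registry        # placeholder: use the name returned by "get"
  namespace: demo-namespace  # placeholder: the Model Registry namespace
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            # Add every namespace that should be allowed to reach the registry.
            namespaces: ["demo-namespace", "team-b"]
```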
⚠️ Potential issue | 🟡 Minor

Clarify the AuthorizationPolicy identification step to prevent command failures.

The documentation assumes `<model-registry-name>` is the AuthorizationPolicy resource name, but Kubeflow's Profile controller generates AuthorizationPolicy objects with different naming conventions (e.g., `ns-owner-access-istio`). Users following the instruction verbatim may encounter a "resource not found" error. Add a step to identify the correct AuthorizationPolicy before editing:

`kubectl -n <your-namespace> get authorizationpolicy`

Then use the actual resource name in the edit command.

**Note: You can install multiple Model Registry instances in different namespaces, each instance is independent of each other.**

@@ -223,6 +227,8 @@ before deploying kubeflow-trainer, if you have already deployed kftraining.
> Note: make sure to install LWS (Alauda Build of LeaderWorkerSet) plugin before deploying
kubeflow-trainer, as LWS is a dependency of kubeflow-trainer.

> Note: Kubeflow Trainer v2 requires minimum Kubernetes version `1.32.3`, older Kubernetes versions may cause unexpected issues.
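To confirm the cluster meets this requirement before installing, you can compare versions with `sort -V`. In this sketch the `current` value is a stand-in for the server version your cluster actually reports (e.g. via `kubectl version`):

```shell
required="1.32.3"
# Replace with the cluster's reported server version, e.g. from:
#   kubectl version -o json | jq -r '.serverVersion.gitVersion'
current="v1.33.1"

# sort -V sorts versions numerically; if the smallest entry is $required,
# the cluster version is at least the minimum.
if [ "$(printf '%s\n' "$required" "${current#v}" | sort -V | head -n1)" = "$required" ]; then
  echo "OK: Kubernetes $current meets the minimum $required"
else
  echo "WARNING: Kubernetes $current is older than the required $required"
fi
```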

In **Cluster Plugins**, find kubeflow-trainer (Kubeflow Trainer v2),
click the "Install" button, select the options of whether to enable `JobSet`
and click the "Install" button to complete the deployment.
121 changes: 121 additions & 0 deletions docs/en/kubeflow/how_to/model-registry.mdx
@@ -0,0 +1,121 @@
---
weight: 40
---

# Use Kubeflow Model Registry

The Kubeflow Model Registry is a central repository for managing machine learning models, their versions, and associated metadata. It allows data scientists to publish models, track their lineage, and collaborate on model development.

## Access the Model Registry

1. **Open Dashboard**: Log in to the Kubeflow central dashboard.
2. **Model Registry**: Click on **Model Registry** in the sidebar. This will take you to the list of registered models in your namespace.
> **Note**: If you do not see the Model Registry, ensure a Model Registry instance has been deployed in your namespace by the platform administrator.

## Register a Model

You can register models either through the user interface or programmatically using the Python client.

### Option 1: Using the UI

1. **Create Registered Model**:
- Click **Register Model**.
- **Model Name**: Enter a unique name (e.g., `fraud-detection`).
- **Description**: Add a description.
- **Version details**: Optionally add version information, tags, and metadata.
- **Model Location**: Provide the S3/URI to the model artifact (e.g., `s3://my-bucket/models/fraud-detection/v1/`).
- Click **Create**.

2. **Create Version**:
- Click the drop-down menu next to the **Registered Model** and select **Register New Version**.
- Enter version name, description, metadata and artifact URI.
- Click **Register new version**.

### Option 2: Using Python Client

You can register models directly from your Jupyter Notebook using the `model-registry` Python client.

**Prerequisites**:
- Install the client: `python -m pip install model-registry=="0.3.5" kserve=="0.13"`
- Ensure you have access to the Model Registry service. If running inside a Kubeflow Notebook, you can use the internal service DNS (e.g. `http://model-registry-service.<namespace>.svc:8080`).

**Sample Code**:

The following example demonstrates how to register a model stored in S3.

```python
from model_registry import ModelRegistry

# 1. Connect to the Model Registry
# Replace with your actual Model Registry service host/port
# Inside cluster, typically: "http://model-registry-service.<namespace>.svc.cluster.local:8080"
registry = ModelRegistry(
    server_address="http://model-registry-service.kubeflow.svc.cluster.local",
    port=8080,
    author="your name",
    is_secure=False,
)

# 2. Register a new Model
rm = registry.register_model(
    "iris",
    "s3://kfserving-examples/models/sklearn/1.0/model",
    model_format_name="sklearn",
    model_format_version="1",
    version="v1",
    description="Iris scikit-learn model",
    metadata={
        "accuracy": 3.14,
        "license": "BSD 3-Clause License",
    },
)

# 3. Retrieve model information
model = registry.get_registered_model("iris")
print("Registered Model:", model, "with ID", model.id)

version = registry.get_model_version("iris", "v1")
print("Model Version:", version, "with ID", version.id)

art = registry.get_model_artifact("iris", "v1")
print("Model Artifact:", art, "with ID", art.id)

```

## Deploy a Registered Model

Once a model is registered, you can deploy it as an **InferenceService** using KServe.

To deploy, you typically need the URI of the model artifact. You can retrieve this from the Registry UI or via the Python API:

```python
from kubernetes import client
import kserve

isvc = kserve.V1beta1InferenceService(
    api_version=kserve.constants.KSERVE_GROUP + "/v1beta1",
    kind=kserve.constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(
        name="iris-model",
        namespace=kserve.utils.get_default_target_namespace(),
        labels={
            "modelregistry/registered-model-id": model.id,
            "modelregistry/model-version-id": version.id,
        },
    ),
    spec=kserve.V1beta1InferenceServiceSpec(
        predictor=kserve.V1beta1PredictorSpec(
            model=kserve.V1beta1ModelSpec(
                storage_uri=art.uri,
                model_format=kserve.V1beta1ModelFormat(
                    name=art.model_format_name, version=art.model_format_version
                ),
            )
        )
    ),
)
ks_client = kserve.KServeClient()
ks_client.create(isvc)
```

Once deployed, the KServe controller will pull the model from the specified S3 URI and start the inference server.
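Once the InferenceService reports Ready, it can be queried over the KServe V1 REST protocol. The snippet below sketches building a prediction payload for the iris example; the URL is a placeholder — take the real host from the InferenceService status (e.g. `kubectl get inferenceservice iris-model`):

```python
import json

# Four iris feature values per instance (sepal/petal length and width).
instances = [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6],
]

# The KServe V1 protocol wraps inputs in an "instances" list.
payload = {"instances": instances}
body = json.dumps(payload)

# Placeholder URL; the actual host comes from the InferenceService status.
url = "http://iris-model.<namespace>.example.com/v1/models/iris-model:predict"

print(body)
# A request could then be sent with, e.g.:
#   requests.post(url, data=body).json()
```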
159 changes: 159 additions & 0 deletions docs/en/kubeflow/how_to/notebooks.mdx
@@ -0,0 +1,159 @@
---
weight: 10
---

# Use Kubeflow Notebooks

Kubeflow Notebooks provide a Kubernetes-native Jupyter environment for data scientists to develop, train, and deploy machine learning models. Each notebook server runs as a separate Pod in your namespace, ensuring isolation and dedicated resources.

> **NOTE**: We recommend using Alauda AI Workbench for a more integrated experience with additional features like resource types, configurations, and better integration with other components. However, you can also use the native Kubeflow Notebooks if you prefer a more lightweight setup or need specific features from the upstream project.

## Concepts

- **Notebook Server**: A JupyterLab instance running in a container.
- **Custom Image**: You can use standard pre-built images (e.g., containing TensorFlow, PyTorch) or provide your own custom Docker image with specific libraries.
- **Persistent Storage**: By default, notebook servers are attached to Persistent Volume Claims (PVCs) to store your workspace directory (usually `/home/jovyan`). This ensures your notebooks and data are saved even if the server is restarted or updated.

## Create a Notebook Server

1. **Access the Dashboard**:
   Navigate to the **Notebooks** section in the Kubeflow dashboard.

2. **New Notebook**:
   Click **New Notebook**. Make sure to select the correct namespace at the top of the dashboard where you want to create the notebook server.

3. **Configure the Server**:
   - **Name**: Enter a unique name for your notebook server.
   - **Image**:
     - **Select Type**: Choose the interface type: JupyterLab, Visual Studio Code, or RStudio.
     - **Select Image**: Choose from a list of pre-built images, or specify a custom image by providing its Docker image URL.
   - **CPU / RAM**: Allocate CPU and memory resources based on your workload. Start small (e.g., 1 CPU, 2GB RAM) and increase if needed.
   - **GPUs**: Request GPUs (e.g., NVIDIA) if you plan to run deep learning training or inference tasks that require acceleration.
   - **Workspace Volume**: This volume mounts to your home directory (`/home/jovyan`). Create a new volume (default) or attach an existing one to access previous work.
   - **Data Volumes**: (Optional) Attach additional existing PVCs to access large datasets without copying them to your workspace.
   - **Configurations**: (Optional) Select PodDefaults (if available) to inject common configuration such as S3 credentials, Git config, or environment variables.

4. **Launch**:
   Click **Launch**. The server will be provisioned. Wait for the status to turn **Running** (green).

## Connect to the Notebook

Once the server status is **Running**:
1. Click **Connect**.
2. This opens the **JupyterLab/VS Code/RStudio** interface in a new browser tab.
3. You can now create Python 3 notebooks, open a terminal, or manage files.

## Environment Management

### Installing Python Packages

While you can install packages in your home directory to persist them, it is best practice to use a custom image for reproducibility.

Create a `venv` virtual environment in your home directory and install packages there:

```bash
python -m venv ~/venv
source ~/venv/bin/activate
python -m pip install transformers datasets
```

When you start a new terminal session, remember to activate the virtual environment to access the installed packages.

To use the virtual environment in Jupyter notebooks, you can install `ipykernel` and create a new kernel:

```bash
source ~/venv/bin/activate
python -m pip install ipykernel
python -m ipykernel install --user --name=venv --display-name "Python (venv)"
```

Then, in your Jupyter notebook, you can select the "Python (venv)" kernel to use the packages installed in your virtual environment.

Virtual environments are persisted in your home directory, so they will remain available even if you stop and restart the notebook server. However, if you need to share the environment across multiple notebook servers or want better reproducibility, consider building a custom Docker image with the required packages pre-installed.

### Using Custom Images

For production environments or complex dependencies (e.g., system libraries), build a Docker image containing all required libraries and use it as your **Custom Image** when creating the notebook. This ensures exact reproducibility.

## Manage Configurations (PodDefaults)

Kubeflow uses `PodDefault` resources (often labeled as **Configurations** in the UI) to inject common configurations—such as environment variables, volumes, and volume mounts—into Notebooks. This is the standard way to securely provide credentials for Object Storage (S3, MinIO) without hardcoding them in your notebooks.

### Create a PodDefault

You can create a PodDefault by applying a YAML manifest.

Define a `PodDefault` that selects pods with a specific label.


```yaml
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: add-gcp-secret
  namespace: MY_PROFILE_NAMESPACE
spec:
  selector:
    matchLabels:
      add-gcp-secret: "true"
  desc: "add gcp credential"
  volumeMounts:
    - name: secret-volume
      mountPath: /secret/gcp
  volumes:
    - name: secret-volume
      secret:
        secretName: gcp-secret
```

### Apply Configuration

When creating a new Notebook Server:
1. Scroll to the **Configurations** section.
2. You will see a list of available PodDefaults (e.g., `s3-access`).
3. Check the box to apply it.

This will automatically inject the specified environment variables or volumes into your Notebook container.
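As a quick sanity check inside the notebook, you can verify that the PodDefault actually mounted the secret. This is a sketch assuming the secret contains a key named `key.json` (mounted under the `/secret/gcp` path from the example above); adjust the filename to match the keys in your `gcp-secret`:

```python
import os

def injected_credential(mount_path="/secret/gcp", filename="key.json"):
    """Return the credential file path if the PodDefault mounted the secret, else None.

    The mount path matches the PodDefault example; the filename is an assumption --
    it depends on the key names stored inside the gcp-secret Secret.
    """
    path = os.path.join(mount_path, filename)
    return path if os.path.isfile(path) else None

# Inside a notebook with the configuration applied this prints the mounted path;
# elsewhere it prints None.
print(injected_credential())
```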

## Accessing Data

### Using Mounted Volumes
If you attached a data volume (PVC) during creation, it will be available at the specified mount point.

```python
import pandas as pd

# Assuming you mounted a data volume at /home/jovyan/data
df = pd.read_csv('/home/jovyan/data/dataset.csv')
print(df.head())
```

### Using Object Storage (S3 / MinIO)
To access data in S3-compatible storage, use libraries like `boto3` or `s3fs`. If your administrator has configured PodDefaults for credentials, environment variables (like `AWS_ACCESS_KEY_ID`) will be pre-populated.

```python
import os
import s3fs
import pandas as pd

# Check if credentials are injected
print(os.getenv("AWS_S3_ENDPOINT"))

# Read directly from S3
fs = s3fs.S3FileSystem(
    client_kwargs={'endpoint_url': os.getenv('AWS_S3_ENDPOINT')},
    key=os.getenv('AWS_ACCESS_KEY_ID'),
    secret=os.getenv('AWS_SECRET_ACCESS_KEY')
)

with fs.open('s3://my-bucket/data/train.csv') as f:
    df = pd.read_csv(f)
```

## Best Practices

- **Stop Unused Servers**: Notebook servers consume cluster resources (especially GPUs) even when idle. Stop them when you are not actively working.
- **Git Integration**: Use the Git extension in JupyterLab (or the terminal) to version control your notebooks. Avoid storing large datasets in Git.
- **Resource Monitoring**: Monitor your resource usage. If your kernel crashes frequently (OOM), stop the server and restart it with a higher memory limit.
- **Clean Up**: Periodically delete old notebook servers and their associated PVCs if the data is no longer needed.
