diff --git a/docs/en/installation/kubeflow.mdx b/docs/en/installation/kubeflow.mdx index e44bc43..3130179 100644 --- a/docs/en/installation/kubeflow.mdx +++ b/docs/en/installation/kubeflow.mdx @@ -3,7 +3,7 @@ Deploy Kubeflow plugins in Alauda AI >= 2.0. Including: - kfbase: Kubeflow Base components, including authentication and authorization, central dashboard, notebook, pvc-viewer, tensorboards, volumes, model registry ui, kserve endpoints ui, model catalog API service, etc. -- chart-kubeflow-model-registry: Kubeflow Model Registry instance (Helm Chart) +- model-registry-operator: Kubeflow Model Registry Operator - kfp: Kubeflow Pipeline - kftraining: Kubeflow Training Operator (deprecated) - kubeflow-trainer: Kubeflow Training job management plugin, aka. Kubeflow Trainer v2 (replaces kftraining) @@ -79,7 +79,7 @@ violet push --platform-address="https://192.168.171.123" \ ``` - kfbase: Kubeflow Base functionality -- chart-kubeflow-model-registry: Kubeflow Model Registry +- model-registry-operator: Kubeflow Model Registry Operator - kfp: Kubeflow Pipeline functionality - kftraining: Kubeflow Training Operator (deprecated) - kubeflow-trainer: Kubeflow Training job management plugin (replaces kftraining) @@ -195,22 +195,26 @@ As above, in **Cluster Plugins**, find kfp (Kubeflow Pipeline) and kftrainer (Ku **Note: After Kubeflow Pipeline deployment, Pipeline related functions can be used in the Kubeflow interface.** **Note: Kubeflow Training Operator is a background task scheduler and will not appear in the UI menu and functions.** -### 5. Deploy chart-kubeflow-model-registry (Kubeflow Model Registry) +### 5. 
Deploy Kubeflow Model Registry
-In **Catalog** or **Administrator** - **Marketplace** - **Chart Repositories**, find chart-kubeflow-model-registry, click the "Create" button, fill in the deployment name, project, namespace (example deployment location), Chart Version, then copy the `values.yaml` configuration information from the right to the left, modify the following content according to the cluster information:
+In **Administrator** - **Marketplace** - **OperatorHub**, find Model Registry Operator and click the "Install" button to install the operator.
-> **Note: Must install in a namespace that has already been bound to a Kubeflow user Profile, otherwise the Model Registry UI will not be displayed**
+After the operator is installed, create a `ModelRegistry` instance in the user's namespace: switch to the **All Instances** tab and click the "Create" button.
-- global.registry.address: The image registry address used by the current platform
-- mysqlStorageClass: The mysql storage class used by Model Registry. Needs to be a storage class supported by the target deployment cluster.
-- mysqlStorageSize: The mysql storage size used by Model Registry.
-- mysqlDataBase: Database name (will be created automatically).
-- modelRegistryDisplayName: The name of the Model Registry instance to be deployed
-- modelRegistryDescription: Brief description of the Model Registry instance to be deployed
+> **Note: The instance must be created in a namespace that has already been bound to a Kubeflow user Profile; otherwise the Model Registry UI will not be displayed**
+
+When creating the instance, configure the following parameters as needed:
+
+- **Name**: Configure the name of the Model Registry instance.
+- **Namespace**: Configure the namespace where the Model Registry instance is located; it must be a namespace that has been bound to a Kubeflow user Profile.
+- **MySQL Storage Class**: Configure the MySQL storage class used to store the Model Registry metadata. Set it to a storage class available in your cluster, for example `standard`.
+- **MySQL Storage Size**: Configure the MySQL storage size used to store the Model Registry metadata. The default value is `10Gi`; adjust it according to your needs.
+- **DisplayName**: The display name of the Model Registry instance.
+- **Description**: A brief description of the Model Registry instance.

**Note: After the Model Registry instance starts, refresh the Model Registry menu in the left navigation of the Kubeflow page to see the instance deployed in the above steps. Before deploying the first instance, the Kubeflow Model Registry interface will display empty.**

-**Note: The Model Registry instance will restrict network requests from non-current namespaces. If you need to allow more namespaces to access, you need to manually modify `kubectl -n <namespace> edit authorizationpolicy model-registry-service` and according to the istio documentation, add the namespaces that are allowed to access.**
+**Note: The Model Registry instance will restrict network requests from non-current namespaces. If you need to allow more namespaces to access it, manually edit the policy with `kubectl -n <namespace> edit authorizationpolicy <model-registry-name>` and, following the Istio documentation, add the namespaces that are allowed access.**

**Note: You can install multiple Model Registry instances in different namespaces, each instance is independent of each other.**

@@ -223,6 +227,8 @@ before deploying kubeflow-trainer, if you have already deployed kftraining.

> Note: make sure to install LWS (Alauda Build of LeaderWorkerSet) plugin before deploying kubeflow-trainer, as LWS is a dependency of kubeflow-trainer.

+> Note: Kubeflow Trainer v2 requires Kubernetes version `1.32.3` or later; older Kubernetes versions may cause unexpected issues.
+
 In **Cluster Plugins**, find kubeflow-trainer (Kubeflow Trainer v2), click the "Install" button, select the options of whether to enable `JobSet` and click the "Install" button to complete the deployment.
diff --git a/docs/en/kubeflow/how_to/model-registry.mdx b/docs/en/kubeflow/how_to/model-registry.mdx
new file mode 100644
index 0000000..14f96f6
--- /dev/null
+++ b/docs/en/kubeflow/how_to/model-registry.mdx
@@ -0,0 +1,121 @@
+---
+weight: 40
+---
+
+# Use Kubeflow Model Registry
+
+The Kubeflow Model Registry is a central repository for managing machine learning models, their versions, and associated metadata. It allows data scientists to publish models, track their lineage, and collaborate on model development.
+
+## Access the Model Registry
+
+1. **Open Dashboard**: Log in to the Kubeflow central dashboard.
+2. **Model Registry**: Click on **Model Registry** in the sidebar. This will take you to the list of registered models in your namespace.
+   > **Note**: If you do not see the Model Registry, ensure a Model Registry instance has been deployed in your namespace by the platform administrator.
+
+## Register a Model
+
+You can register models either through the user interface or programmatically using the Python client.
+
+### Option 1: Using the UI
+
+1. **Create Registered Model**:
+   - Click **Register Model**.
+   - **Model Name**: Enter a unique name (e.g., `fraud-detection`).
+   - **Description**: Add a description.
+   - **Version details**: Optionally add version information, tags, and metadata.
+   - **Model Location**: Provide the S3 URI of the model artifact (e.g., `s3://my-bucket/models/fraud-detection/v1/`).
+   - Click **Create**.
+
+2. **Create Version**:
+   - Click the drop-down menu next to the **Registered Model** and select **Register New Version**.
+   - Enter the version name, description, metadata, and artifact URI.
+   - Click **Register new version**.
+
+### Option 2: Using Python Client
+
+You can register models directly from your Jupyter Notebook using the `model-registry` Python client.
+
+**Prerequisites**:
+- Install the client: `python -m pip install model-registry=="0.3.5" kserve=="0.13"`
+- Ensure you have access to the Model Registry service. If running inside a Kubeflow Notebook, you can use the internal service DNS (e.g. `http://model-registry-service.<namespace>.svc:8080`).
+
+**Sample Code**:
+
+The following example demonstrates how to register a model stored in S3.
+
+```python
+from model_registry import ModelRegistry
+
+# 1. Connect to the Model Registry
+# Replace with your actual Model Registry service host/port
+# Inside the cluster, typically: "http://model-registry-service.<namespace>.svc.cluster.local:8080"
+registry = ModelRegistry(
+    server_address="http://model-registry-service.kubeflow.svc.cluster.local",
+    port=8080,
+    author="your name",
+    is_secure=False
+)
+
+# 2. Register a new Model
+rm = registry.register_model(
+    "iris",
+    "s3://kfserving-examples/models/sklearn/1.0/model",
+    model_format_name="sklearn",
+    model_format_version="1",
+    version="v1",
+    description="Iris scikit-learn model",
+    metadata={
+        "accuracy": 3.14,
+        "license": "BSD 3-Clause License",
+    }
+)
+
+# 3. Retrieve model information
+model = registry.get_registered_model("iris")
+print("Registered Model:", model, "with ID", model.id)
+
+version = registry.get_model_version("iris", "v1")
+print("Model Version:", version, "with ID", version.id)
+
+art = registry.get_model_artifact("iris", "v1")
+print("Model Artifact:", art, "with ID", art.id)
+```
+
+## Deploy a Registered Model
+
+Once a model is registered, you can deploy it as an **InferenceService** using KServe.
+
+To deploy, you typically need the URI of the model artifact.
You can retrieve this from the Registry UI or via the Python API: + +```python +from kubernetes import client +import kserve + +isvc = kserve.V1beta1InferenceService( + api_version=kserve.constants.KSERVE_GROUP + "/v1beta1", + kind=kserve.constants.KSERVE_KIND, + metadata=client.V1ObjectMeta( + name="iris-model", + namespace=kserve.utils.get_default_target_namespace(), + labels={ + "modelregistry/registered-model-id": model.id, + "modelregistry/model-version-id": version.id, + }, + ), + spec=kserve.V1beta1InferenceServiceSpec( + predictor=kserve.V1beta1PredictorSpec( + model=kserve.V1beta1ModelSpec( + storage_uri=art.uri, + model_format=kserve.V1beta1ModelFormat( + name=art.model_format_name, version=art.model_format_version + ), + ) + ) + ), +) +ks_client = kserve.KServeClient() +ks_client.create(isvc) +``` + +Once deployed, the KServe controller will pull the model from the specified S3 URI and start the inference server. diff --git a/docs/en/kubeflow/how_to/notebooks.mdx b/docs/en/kubeflow/how_to/notebooks.mdx new file mode 100644 index 0000000..5c69fd0 --- /dev/null +++ b/docs/en/kubeflow/how_to/notebooks.mdx @@ -0,0 +1,159 @@ +--- +weight: 10 +--- + +# Use Kubeflow Notebooks + +Kubeflow Notebooks provide a Kubernetes-native Jupyter environment for data scientists to develop, train, and deploy machine learning models. Each notebook server runs as a separate Pod in your namespace, ensuring isolation and dedicated resources. + +> **NOTE**: We recommend using Alauda AI Workbench for a more integrated experience with additional features like resource types, configurations, and better integration with other components. However, you can also use the native Kubeflow Notebooks if you prefer a more lightweight setup or need specific features from the upstream project. + +## Concepts + +- **Notebook Server**: A JupyterLab instance running in a container. 
+- **Custom Image**: You can use standard pre-built images (e.g., containing TensorFlow, PyTorch) or provide your own custom Docker image with specific libraries.
+- **Persistent Storage**: By default, notebook servers are attached to Persistent Volume Claims (PVCs) to store your workspace directory (usually `/home/jovyan`). This ensures your notebooks and data are saved even if the server is restarted or updated.
+
+## Create a Notebook Server
+
+1. **Access the Dashboard**:
+   Navigate to the **Notebooks** section in the Kubeflow dashboard.
+
+2. **New Notebook**:
+   Click **New Notebook**. Make sure to select the correct namespace at the top of the dashboard where you want to create the notebook server.
+
+3. **Configure the Server**:
+   - **Name**: Enter a unique name for your notebook server.
+   - **Image**:
+     - **Select Type**: Choose the image type: JupyterLab, Visual Studio Code, or RStudio.
+     - **Select Image**: Choose from a list of pre-built images or specify a custom image by providing the Docker image URL.
+   - **CPU / RAM**: Allocate CPU and Memory resources based on your workload. Start small (e.g., 1 CPU, 2GB RAM) and increase if needed.
+   - **GPUs**: Request GPUs (e.g., NVIDIA) if you plan to run deep learning training or inference tasks that require acceleration.
+   - **Workspace Volume**: This volume mounts to your home directory (`/home/jovyan`). Create a new volume (default) or attach an existing one to access previous work.
+   - **Data Volumes**: (Optional) Attach additional existing PVCs to access large datasets without copying them to your workspace.
+   - **Configurations**: (Optional) Select PodDefaults (if available) to inject generic configurations like S3 credentials, Git config, or environment variables.
+
+4. **Launch**:
+   Click **Launch**. The server will be provisioned. Wait for the status to turn **Running** (green).
+
+## Connect to the Notebook
+
+Once the server status is **Running**:
+1. Click **Connect**.
+2. This opens the **JupyterLab/VS Code/RStudio** interface in a new browser tab.
+3. You can now create Python 3 notebooks, open a terminal, or manage files.
+
+## Environment Management
+
+### Installing Python Packages
+
+While you can install packages in your home directory to persist them, it is best practice to use a custom image for reproducibility.
+
+Create a "venv" directory in your home directory and install packages there:
+
+```bash
+python -m venv ~/venv
+source ~/venv/bin/activate
+python -m pip install transformers datasets
+```
+
+When you start a new terminal session, remember to activate the virtual environment to access the installed packages.
+
+To use the virtual environment in Jupyter notebooks, you can install `ipykernel` and create a new kernel:
+
+```bash
+source ~/venv/bin/activate
+python -m pip install ipykernel
+python -m ipykernel install --user --name=venv --display-name "Python (venv)"
+```
+
+Then, in your Jupyter notebook, you can select the "Python (venv)" kernel to use the packages installed in your virtual environment.
+
+Virtual environments are persisted in your home directory, so they will remain available even if you stop and restart the notebook server. However, if you need to share the environment across multiple notebook servers or want better reproducibility, consider building a custom Docker image with the required packages pre-installed.
+
+### Using Custom Images
+
+For production environments or complex dependencies (e.g., system libraries), build a Docker image containing all required libraries and use it as your **Custom Image** when creating the notebook. This ensures exact reproducibility.
+
+## Manage Configurations (PodDefaults)
+
+Kubeflow uses `PodDefault` resources (often labeled as **Configurations** in the UI) to inject common configurations, such as environment variables, volumes, and volume mounts, into Notebooks.
This is the standard way to securely provide credentials for Object Storage (S3, MinIO) without hardcoding them in your notebooks. + +### Create a PodDefault + +You can create a PodDefault by applying a YAML manifest. + +Define a `PodDefault` that selects pods with a specific label. + + +```yaml +apiVersion: kubeflow.org/v1alpha1 +kind: PodDefault +metadata: + name: add-gcp-secret + namespace: MY_PROFILE_NAMESPACE +spec: + selector: + matchLabels: + add-gcp-secret: "true" + desc: "add gcp credential" + volumeMounts: + - name: secret-volume + mountPath: /secret/gcp + volumes: + - name: secret-volume + secret: + secretName: gcp-secret +``` + +### Apply Configuration + +When creating a new Notebook Server: +1. Scroll to the **Configurations** section. +2. You will see a list of available PodDefaults (e.g., `s3-access`). +3. Check the box to apply it. + +This will automatically inject the specified environment variables or volumes into your Notebook container. + +## Accessing Data + +### Using Mounted Volumes +If you attached a data volume (PVC) during creation, it will be available at the specified mount point. + +```python +import pandas as pd + +# Assuming you mounted a data volume at /home/jovyan/data +df = pd.read_csv('/home/jovyan/data/dataset.csv') +print(df.head()) +``` + +### Using Object Storage (S3 / MinIO) +To access data in S3-compatible storage, use libraries like `boto3` or `s3fs`. If your administrator has configured PodDefaults for credentials, environment variables (like `AWS_ACCESS_KEY_ID`) will be pre-populated. 
+
+```python
+import os
+import s3fs
+import pandas as pd
+
+# Check if credentials are injected
+print(os.getenv("AWS_S3_ENDPOINT"))
+
+# Read directly from S3
+fs = s3fs.S3FileSystem(
+    client_kwargs={'endpoint_url': os.getenv('AWS_S3_ENDPOINT')},
+    key=os.getenv('AWS_ACCESS_KEY_ID'),
+    secret=os.getenv('AWS_SECRET_ACCESS_KEY')
+)
+
+with fs.open('s3://my-bucket/data/train.csv') as f:
+    df = pd.read_csv(f)
+```
+
+## Best Practices
+
+- **Stop Unused Servers**: Notebook servers consume cluster resources (especially GPUs) even when idle. Stop them when you are not actively working.
+- **Git Integration**: Use the Git extension in JupyterLab (or the terminal) to version control your notebooks. Avoid storing large datasets in Git.
+- **Resource Monitoring**: Monitor your resource usage. If your kernel crashes frequently (OOM), you may need to stop the server and restart it with a higher memory limit.
+- **Clean Up**: Periodically delete old notebook servers and their associated PVCs if the data is no longer needed.
+
diff --git a/docs/en/kubeflow/how_to/pipelines.mdx b/docs/en/kubeflow/how_to/pipelines.mdx
new file mode 100644
index 0000000..a8fa395
--- /dev/null
+++ b/docs/en/kubeflow/how_to/pipelines.mdx
@@ -0,0 +1,145 @@
+---
+weight: 50
+---
+
+# Use Kubeflow Pipelines
+
+Kubeflow Pipelines (KFP) is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. The KFP SDK allows you to define and manipulate pipelines and components using Python.
+
+## Prerequisites
+
+### Install KFP SDK
+
+Start a Jupyter Notebook (or Workbench) in your namespace and install the KFP SDK:
+
+```bash
+python -m pip install kfp
+```
+
+### Configure KFP to Run with your Object Storage
+
+If you installed Kubeflow with an external S3/MinIO storage service, you need to add a "KFP Launcher" ConfigMap to set up the storage used by the current namespace or user.
See the Kubeflow documentation at https://www.kubeflow.org/docs/components/pipelines/operator-guides/configure-object-store/#s3-and-s3-compatible-provider for more details. If no configuration is set, pipeline runs may still access the default service address (e.g. "minio-service.kubeflow:9000"), which may not be correct.
+
+Below is a simple sample to start from:
+
+```yaml
+apiVersion: v1
+data:
+  defaultPipelineRoot: s3://mlpipeline
+  providers: |-
+    s3:
+      default:
+        endpoint: minio.minio-system.svc:80
+        disableSSL: true
+        region: us-east-2
+        forcePathStyle: true
+        credentials:
+          fromEnv: false
+          secretRef:
+            secretName: mlpipeline-minio-artifact
+            accessKeyKey: accesskey
+            secretKeyKey: secretkey
+kind: ConfigMap
+metadata:
+  name: kfp-launcher
+  namespace: wy-testns
+```
+
+Set the following values in this ConfigMap to point to your own S3/MinIO storage:
+
+- `defaultPipelineRoot`: where to store the pipeline's intermediate data
+- `endpoint`: the S3/MinIO service endpoint. Note: it should NOT start with "http" or "https"
+- `disableSSL`: whether to disable "https" access to the endpoint
+- `region`: the S3 region. If using MinIO, any value will be fine
+- `credentials`: the access key/secret key stored in the referenced Secret
+
+After adding this ConfigMap, newly started Kubeflow Pipeline runs will automatically read this configuration and store the artifacts used by Kubeflow Pipelines in the configured storage.
+
+## Quick Start Example
+
+A pipeline is a description of an ML workflow, including all of the components in the workflow and how they combine in the form of a graph.
+
+Below is a simple example of defining a pipeline that prints "Hello, World!" using the KFP SDK.
+
+```python
+from kfp import dsl
+from kfp import compiler
+from kfp.client import Client
+
+@dsl.component
+def say_hello(name: str) -> str:
+    hello_text = f'Hello, {name}!'
+    print(hello_text)
+    return hello_text
+
+@dsl.pipeline
+def hello_pipeline(recipient: str) -> str:
+    hello_task = say_hello(name=recipient)
+    return hello_task.output
+
+
+# Compile the pipeline to a YAML file
+compiler.Compiler().compile(hello_pipeline, 'pipeline.yaml')
+
+# Create a KFP client and submit the pipeline run
+client = Client(host='')
+run = client.create_run_from_pipeline_package(
+    'pipeline.yaml',
+    arguments={
+        'recipient': 'World',
+    },
+)
+```
+
+For more details about how to define and run pipelines, please refer to the official KFP documentation: https://www.kubeflow.org/docs/components/pipelines/user-guides/
+
+
+## Manage Pipelines in the UI
+
+You can also manage pipelines, experiments, and runs directly from the Kubeflow Dashboard.
+
+### Access the Pipelines Dashboard
+
+1. Log in to the Kubeflow central dashboard.
+2. Click **Pipelines** in the sidebar menu.
+
+### Upload a Pipeline
+
+If you have compiled your pipeline to a YAML file (e.g., `pipeline.yaml` from the example above), you can upload it:
+
+1. Click **Pipelines** -> **Upload Pipeline**.
+2. **Upload a file**: Select your `pipeline.yaml`.
+3. **Pipeline Name**: Give it a name (e.g., `Hello World Pipeline`).
+4. Click **Create**.
+
+### Create a Run
+
+To execute the pipeline you just uploaded:
+
+1. Click on the pipeline name to open its details.
+2. Click **Create Run**.
+3. **Run Name**: Enter a descriptive name.
+4. **Experiment**: Select an existing experiment or create a new one. Experiments help group related runs.
+5. **Run Parameters**: Enter values for any pipeline arguments (e.g., `recipient`: `World`).
+6. Click **Start**.
+
+### Inspect Run Details
+
+Once the run starts, you will be redirected to the **Run Details** page.
+
+- **Graph**: Visualize the steps (components) of your pipeline and their status (Running, Succeeded, Failed).
+- **Logs**: Click on a specific step in the graph to view its container logs in the side panel. This is crucial for debugging.
+- **Inputs/Outputs**: View the artifacts passed between steps or produced as final outputs. +- **Visualizations**: If your pipeline generates metrics or plots, they will appear in the **Run Output** or **Visualizations** tab. + +### Recurring Runs + +You can schedule pipelines to run automatically at specific intervals: + +1. In the **Pipelines** list, identify your pipeline. +2. Click **Create Run** but choose **Recurring Run** as the run type (or navigate to **Experiments (KFP)** -> **Create Recurring Run**). +3. **Trigger**: Set the schedule (e.g., Periodic, Cron). +4. **Parameters**: Configure the inputs that will be used for every scheduled execution. +5. Click **Start**. + diff --git a/docs/en/kubeflow/how_to/tensorboards.mdx b/docs/en/kubeflow/how_to/tensorboards.mdx new file mode 100644 index 0000000..76d56a2 --- /dev/null +++ b/docs/en/kubeflow/how_to/tensorboards.mdx @@ -0,0 +1,92 @@ +--- +weight: 20 +--- + +# Use Kubeflow Tensorboards + +TensorFlow's visualization toolkit, TensorBoard, is a powerful dashboard for visualizing machine learning experiments. It allows you to track metrics like loss and accuracy, visualize the model graph, view histograms of weights and biases, and much more. + +Kubeflow provides a native way to spawn TensorBoard instances directly within your Kubernetes cluster, pointing them to existing logs stored on Persistent Volume Claims (PVCs) or Object Storage (S3, MinIO). + +## Prerequisites + +Before creating a TensorBoard instance, ensure that your training jobs are writing logs to a location accessible by the cluster. + +- **PVC**: If your training job writes logs to a Persistent Volume, note the PVC name and the path within it. +- **Object Storage**: If your training job writes logs to S3/MinIO, ensure you have the necessary credentials (often configured via PodDefaults) and the bucket URI (e.g., `s3://my-bucket/logs/experiment-1`). 
+ +## Generating Logs with PyTorch + +To visualize your training metrics, your PyTorch code must write events to a log directory. The `SummaryWriter` class is the main entry point for logging data for consumption by TensorBoard. + +```python +import torch +import torchvision +from torch.utils.tensorboard import SummaryWriter +from torchvision import datasets, transforms + +# Writer will output to ./runs/ directory by default +writer = SummaryWriter() + +transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))]) +trainset = datasets.MNIST('mnist_train', train=True, download=True, transform=transform) +trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True) +model = torchvision.models.resnet50(False) +# Have ResNet model take in grayscale rather than RGB +model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False) +images, labels = next(iter(trainloader)) + +grid = torchvision.utils.make_grid(images) +writer.add_image('images', grid, 0) +writer.add_graph(model, images) +writer.close() +``` + +## Create a TensorBoard Instance + +1. **Access the Kubeflow Dashboard**: + Navigate to the **TensorBoards** section in the Kubeflow central dashboard. + +2. **New TensorBoard**: + Click the **New TensorBoard** button. + +3. **Configure the Instance**: + - **Name**: Enter a unique name for your TensorBoard instance (e.g., `experiment-1-viz`). + - **PVC Source**: + - Check this box if your logs are on a PVC. + - **PVC Name**: Select the PVC from the dropdown. + - **Mount Path**: Specify the path inside the PVC where logs are stored (e.g., `/logs/run1`). + - **Object Storage Source**: + - Check this box if your logs are in cloud storage. + - **Object Store Link**: Provide the full URI to the log directory (e.g., `s3://my-bucket/my-model/logs/`). + - **Configuration**: Select a configuration (PodDefault) if your bucket requires credentials. + +4. **Create**: + Click **Create**. 
The TensorBoard instance will be provisioned as a Pod in your namespace. + +## Accessing the Dashboard + +Once the status of your TensorBoard instance changes to **Running**: + +1. Click **Connect** next to the instance name. +2. The TensorBoard UI will open in a new tab. +3. You can now explore the scalars, graphs, distributions, and other visualizations generated by your training run. + +## Usage Scenarios + +### Visualizing Training Metrics +Use the **Scalars** tab to view plots of accuracy, loss, and learning rate over time. This helps diagnose if your model is overfitting or if the learning rate needs adjustment. + +### Comparing Runs +If you point TensorBoard to a parent directory containing subdirectories for multiple runs (e.g., `run1`, `run2`), TensorBoard will automatically overlay the metrics from these runs, allowing you to compare performance across different hyperparameters. + +### Debugging Model Architecture +Use the **Graphs** tab to visualize the computational graph of your model. This ensures that the model is built as expected and helps identify structural issues. + +## Cleanup + +TensorBoard instances consume cluster resources (CPU/Memory). When you are finished analyzing your experiments: + +1. Go back to the **TensorBoards** list. +2. Click the **Delete** (trash icon) button next to your instance. +3. Confirm deletions. This removes the visualization server but **does not** delete your training logs or models stored on the PVC or Object Storage. diff --git a/docs/en/kubeflow/how_to/volumes-kserve.mdx b/docs/en/kubeflow/how_to/volumes-kserve.mdx new file mode 100644 index 0000000..0989e3b --- /dev/null +++ b/docs/en/kubeflow/how_to/volumes-kserve.mdx @@ -0,0 +1,79 @@ +--- +weight: 30 +--- + +# Use Kubeflow Volumes + +Volumes in Kubeflow are managed as Kubernetes Persistent Volume Claims (PVCs). They provide persistent storage for your data, workspaces, and models, independent of the lifecycle of your Notebook servers or other workloads. 
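
Under the hood, each volume created from this page is a standard Kubernetes PersistentVolumeClaim, so the UI form fields map directly onto a PVC manifest. A minimal equivalent sketch (name, namespace, storage class, and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-workspace
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteOnce        # single-node mount; use ReadWriteMany for shared file storage
  storageClassName: nfs    # must be a StorageClass available in your cluster
  resources:
    requests:
      storage: 10Gi
```
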
+ +## Create a Volume + +1. **Access the Dashboard**: + Click **Volumes** in the Kubeflow central dashboard sidebar. +2. **New Volume**: + Click **New Volume**. +3. **Configure**: + - **Name**: Enter a unique name for the volume. + - **Storage Class**: Select the Storage Class (e.g., topolvm, nfs) if multiple are available. + - **Size**: Specify the size of the volume in `Gi` (e.g., `10`). + - **Access Mode**: + - **ReadWriteOnce (RWO)**: Mounted by a single node (Common for block storage). + - **ReadWriteMany (RWX)**: Mounted by many nodes (Common for NFS/File storage). +4. **Create**: + Click **Create**. The volume status will change to **Bound** once provisioned. + +## Manage Volumes + +- **Open PVC Viewer**: Click the "Folder" icon next to a volume to create a temporary Pod that mounts the volume and opens a file browser. This allows you to view/upload/download files directly to the volume. Click "Close" to delete the temporary Pod when done. +- **Delete**: Click the delete icon (trash can) next to a volume to remove it. **Note**: This permanently deletes the data. +- **Filter**: Filter volumes by name, status, or storage class using the search bar. + +## Use a Volume in Notebooks + +To use a volume in a Notebook Server: +1. When creating a **New Notebook**, create a standard **Workspace Volume** (mounted at `/home/jovyan`) or... +2. Scroll to **Data Volumes** to attach additional existing volumes. +3. Click **Attach Existing Volume** and select your volume. +4. Specify the **Mount Path** (e.g., `/home/jovyan/data`). + +# Use Kubeflow KServe Endpoints + +The KServe Endpoints UI allows you to deploy, manage, and monitor inference services for your machine learning models directly from the Kubeflow dashboard. + +## Access the Endpoints UI + +1. Click **KServe Endpoints** in the central dashboard sidebar. +2. Select your namespace at the top of the page. +3. You will see a list of deployed InferenceServices with their status and URLs. 
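
Once an endpoint is **Ready**, you can call it over REST. As a sketch, assuming a predictor that speaks the KServe v1 protocol (host, model name, and input values are all illustrative), a prediction request can be composed like this:

```python
import json
from urllib import request

# Illustrative values: replace with your endpoint's URL and model name.
base_url = "http://model-name.my-namespace.svc.cluster.local"
model_name = "model-name"

# The v1 protocol expects {"instances": [...]} as the request body.
payload = {"instances": [[6.8, 2.8, 4.8, 1.4]]}
body = json.dumps(payload).encode()

req = request.Request(
    f"{base_url}/v1/models/{model_name}:predict",
    data=body,
    headers={"Content-Type": "application/json"},
)
print(req.full_url)

# Inside the cluster you would now send it:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```

The same request can be made with `curl` by POSTing the JSON body to the printed URL.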
+
+## Deploy a New Model
+
+1. **New Endpoint**:
+   Click **New Endpoint**.
+2. **InferenceService YAML**:
+   - Provide the YAML definition for your InferenceService. You can use the sample YAML below as a template.
+3. **Deploy**:
+   Click **Create**.
+
+   ```yaml
+   apiVersion: serving.kserve.io/v1beta1
+   kind: InferenceService
+   metadata:
+     name: my-model
+     namespace: my-namespace
+   spec:
+     predictor:
+       model:
+         modelFormat:
+           name: "transformers"
+         runtime: aml-vllm-0.9.2-cuda-12.6
+         storageUri: "hf://model-repo/model-name"
+   ```
+
+## Monitor and Test
+
+After deployment, wait for the status to become **Ready**.
+- **Inspect**: Click on the model name to see the YAML details and logs.
+- **Get URL**: Copy the provided endpoint URL (e.g., `http://model-name.namespace.svc.cluster.local/v1/models/model-name:predict` or the external URL).
+- **Test**: Use `curl` or a Python client to send a prediction request.
+
diff --git a/docs/en/kubeflow/index.mdx b/docs/en/kubeflow/index.mdx
new file mode 100644
index 0000000..dae986a
--- /dev/null
+++ b/docs/en/kubeflow/index.mdx
@@ -0,0 +1,6 @@
+---
+weight: 61
+---
+# Alauda support for Kubeflow
+
+
diff --git a/docs/en/kubeflow/intro.mdx b/docs/en/kubeflow/intro.mdx
new file mode 100644
index 0000000..57c6c98
--- /dev/null
+++ b/docs/en/kubeflow/intro.mdx
@@ -0,0 +1,10 @@
+---
+weight: 10
+---
+# Introduction
+
+Alauda support for Kubeflow provides a Kubernetes-native machine learning platform that enables users to build, deploy, and manage machine learning models at scale. It integrates with various components of the Kubeflow ecosystem, such as Kubeflow Pipelines for workflow orchestration, Kubeflow Training for training job management, and Kubeflow Model Registry for model versioning and management.
+
+See [Kubeflow Docs](https://www.kubeflow.org/docs/) for more details about Kubeflow components and features.
+
+> **NOTE:** You need to set the namespace PSA to privileged in order to use Kubeflow components.
Please contact your cluster administrator to set the namespace PSA to privileged if you encounter permission issues when using Kubeflow components.
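
Pod Security Admission is controlled by labels on the namespace. Assuming the standard PSA labels, a privileged namespace manifest looks like this (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-kubeflow-profile
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
```

Equivalently, an administrator can label an existing namespace: `kubectl label ns my-kubeflow-profile pod-security.kubernetes.io/enforce=privileged`.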