Managing Workflows and Advanced Topics

This guide covers day-to-day workflows, pod management, file operations, and important best practices when working with the EPFL RCP cluster.

Using VS Code
Managing Pods
Important Notes and Workflow
The HaaS Machine
File Management

Using VS Code

To easily attach a VS Code window to a running pod:

Install required extensions
- Kubernetes
- Dev Containers
Connect to your pod
- In VS Code, navigate to: Kubernetes → rcp-cluster → Workloads → Pods
- Right-click on your pod and select Attach Visual Studio Code
- This will start a VS Code session attached to your pod
Navigate to your workspace
- Symlinks ensure that VS Code settings and extensions are stored in /mloscratch/homes/<gaspar_username> and shared across pods
- Note: by default, VS Code opens the pod's home folder, not scratch
- Navigate to /mloscratch/homes/<gaspar_username> to access your working directory

Pictorial guide: See the official RCP documentation for screenshots and additional details.

Managing Pods

After starting pods with csub.py, you can manage them using run:ai commands:

Basic Commands

# Open an interactive shell on the pod
runai exec <pod_name> -it -- zsh

# Kill the job and remove it from the list
runai delete job <pod_name>

# Show information on the status/execution of the job
runai describe job <pod_name>

# List all jobs and their status
runai list jobs

# Show the output/logs for the job
runai logs <pod_name>

Useful One-Liners

# Clean up succeeded jobs from run:ai (requires GNU parallel)
runai list jobs | grep " Succeeded " | awk '{print $1}' | parallel runai delete job {}

# Overview of active jobs that fits on your screen
runai list jobs | sed '1d' | awk '{printf "%-42s %-20s\n", $1, $2}'

# Auto-updating listing of jobs and their states (refreshes every 10 seconds)
watch -n 10 "runai list jobs | sed 1d | awk '{printf \"%-40s %-20s\n\", \$1, \$2}'"
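
The one-liners above rely on the job name being in the first column and the status in the second column of the job listing. Under that same assumption, a small shell function can select jobs by status, for example to find failed jobs that must be deleted before a restart. The function name jobs_with_status is hypothetical and not part of run:ai; treat this as a sketch.

```shell
# jobs_with_status STATUS
# Reads a job listing on stdin and prints the names of jobs whose status
# column equals STATUS. Assumes one header line, with the job name in
# column 1 and the status in column 2, as in the one-liners above.
jobs_with_status() {
  awk -v status="$1" 'NR > 1 && $2 == status { print $1 }'
}

# Example (requires the runai CLI):
#   runai list jobs | jobs_with_status Failed
```

Matching the status column exactly, rather than grepping the whole line, avoids selecting a job whose name happens to contain the status word.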

Important Notes and Workflow

Job Types

Interactive jobs (default)
- Used for development and debugging
- Each user can have 1 interactive GPU
- Higher priority than training jobs
- Can live up to 12 hours
- Created with: python csub.py -n sandbox
Training jobs (use --train flag)
- Used for actual experiments
- Allows more than 1 GPU (up to 8 on one node)
- Pod is automatically killed when your code finishes (saves money!)
- Created with: python csub.py -n experiment --train --command "..."

GPU Selection

When choosing GPU types on the RCP cluster, consider both cost and memory/compute requirements:

GPU Type	Memory	Best For	Cost
V100	32GB	Older hardware, not memory-intensive	Lower cost, faster to schedule
A100-40GB	40GB	Most use cases (default)	Moderate
A100-80GB	80GB	Standard memory-intensive workloads	Moderate
H100	80GB	High compute requirements	Higher cost
H200	140GB	Very memory-intensive workloads	Highest cost

Usage:

With csub.py: Use --node_type flag (e.g., --node_type default for A100)
With CLI directly: Use --node-pools flag

Options: v100, h100, h200, default (A100), a100-40g
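
As a sketch of how the options above combine with the training-job flags, the following helper checks a node type against the list and prints the matching csub.py command without running it. The helper names (is_valid_node_type, csub_train_cmd) are hypothetical. It uses only the flags mentioned in this guide and assumes they can be combined in one call.

```shell
# Succeed only if the argument is one of the node types listed above.
is_valid_node_type() {
  case "$1" in
    v100|h100|h200|default|a100-40g) return 0 ;;
    *) return 1 ;;
  esac
}

# csub_train_cmd NAME NODE_TYPE COMMAND
# Prints (does not run) a csub.py training command. COMMAND must not
# contain double quotes, since it is wrapped in them as-is.
csub_train_cmd() {
  local name=$1
  local node_type=$2
  local cmd=$3
  if ! is_valid_node_type "$node_type"; then
    echo "unknown node type: $node_type" >&2
    return 1
  fi
  printf 'python csub.py -n %s --train --node_type %s --command "%s"\n' \
    "$name" "$node_type" "$cmd"
}
```

Printing the command first makes it easy to review the GPU type, and therefore the cost, before submitting.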

Tip

If you plan to run a series of jobs, especially with high-end GPUs like H100/H200, inform your supervisor in advance.

Best Practices

Important

Work within /mloscratch

This is the shared storage mounted to your pod
Create a directory: /mloscratch/homes/<your_username> (automatically created by csub.py)
All your files should be kept inside your personal folder
Use a GitHub repo to store your code and clone it inside your folder

Important

Remember: your job can be killed at any time

run:ai may preempt your job to make space for other users
Always implement checkpointing and recovery in your scripts
Check job status regularly with runai list jobs
Failed jobs must be deleted before restarting: runai delete job <name>
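
The checkpoint-and-resume idea can be sketched in a few lines of shell. This is a minimal illustration, not the cluster's own mechanism: it records the last finished step in a state file and skips completed steps on restart. The function name run_with_resume and its arguments are hypothetical. A real training script would also save model and optimizer state, and the state file should live under /mloscratch so that it survives the pod.

```shell
# run_with_resume STATE_FILE TOTAL STEP_CMD
# Runs STEP_CMD once for each step 1..TOTAL, recording the last finished
# step in STATE_FILE so that a restarted job skips work already done.
run_with_resume() {
  local state_file=$1
  local total=$2
  local step_cmd=$3
  local last=0
  if [ -f "$state_file" ]; then
    last=$(cat "$state_file")
  fi
  local step=$((last + 1))
  while [ "$step" -le "$total" ]; do
    "$step_cmd" "$step" || return 1
    # Write to a temporary file and rename it, so that a kill in the
    # middle of the write does not leave a truncated checkpoint.
    printf '%s\n' "$step" > "$state_file.tmp"
    mv "$state_file.tmp" "$state_file"
    step=$((step + 1))
  done
}
```

If the job is preempted after step 2 of 4, rerunning the same call continues at step 3.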

Tip

Recommended workflow

CPU-only pod for development: Create a cheap (~3 CHF/month) CPU-only pod for code development and debugging through VS Code
GPU pods for experiments: When your code is ready, create GPU pods to run experiments
Use training jobs: Always use --train flag for experiments to automatically kill pods when finished

Caution

Using the cluster creates costs

Do not forget to stop your jobs when not in use!
Use training jobs to automatically kill pods when experiments finish
Monitor your active jobs regularly

The HaaS Machine

The HaaS machine is provided by IT and allows you to:

Move files between storage systems
Create folders
Copy files between mlodata1, mloraw1, and mloscratch
Access storage without creating a pod

Accessing the HaaS Machine

ssh <gaspar_username>@haas001.rcp.epfl.ch

Use your Gaspar password to log in.

Storage Locations

The volumes are mounted at:

/mnt/mlo/mlodata1 - Long-term replicated storage
/mnt/mlo/mloraw1 - Reserved for future use
/mnt/mlo/scratch - High-performance working storage
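
Scratch appears under different prefixes in a pod and on the HaaS machine. Going by the paths used in this guide (/mloscratch/homes/... in pods, /mnt/mlo/scratch/homes/... on HaaS), a pod path can be translated as sketched below. The function name pod_to_haas_path is hypothetical, and the mapping is inferred from those examples rather than from an official specification.

```shell
# pod_to_haas_path PATH
# Translates a scratch path as seen inside a pod into the corresponding
# path on the HaaS machine. Fails for paths outside /mloscratch.
pod_to_haas_path() {
  case "$1" in
    /mloscratch)
      printf '/mnt/mlo/scratch\n' ;;
    /mloscratch/*)
      printf '/mnt/mlo/scratch/%s\n' "${1#/mloscratch/}" ;;
    *)
      echo "not a scratch path: $1" >&2
      return 1 ;;
  esac
}
```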

File Management

Understanding Storage

Important

Cluster pods are ephemeral

Any file created inside a pod (outside mounted storage) will be deleted when the pod is killed
Always store your work on mounted network disks
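
One way to guard against writing results to the pod's ephemeral filesystem is to check the output directory before a job starts. The sketch below is a plain string check on an absolute path. The function name is_on_scratch is hypothetical, and the check does not resolve symlinks or relative paths.

```shell
# is_on_scratch DIR
# Succeeds only if DIR is /mloscratch itself or lies below it.
is_on_scratch() {
  case "$1" in
    /mloscratch|/mloscratch/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: refuse to start if the output directory would be lost with the pod.
#   is_on_scratch "$OUTPUT_DIR" || { echo "refusing to write to $OUTPUT_DIR" >&2; exit 1; }
```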

Storage Types

`mloscratch` (Primary Working Storage)

Purpose: All code and experimentation
Location: /mloscratch/homes/<your_username>
Characteristics:
- High-performance storage
- Mounted to all pods
- Not replicated across multiple hard drives (but generally reliable)
- All your daily work should be here

`mlodata1` (Long-term Archive)

Purpose: Long-term storage with replication
Characteristics:
- Backed up carefully with replication
- Stored on multiple hard drives
- For artifacts you want to keep indefinitely (e.g., paper results, final checkpoints)
- Cannot be mounted on pods (use HaaS machine to access)

`mloraw1` (Reserved)

Status: Not in active use (as of December 2023)
Cannot be mounted on pods (use HaaS machine to access)

Moving Data Between Storage

Since mloscratch is not replicated, move important artifacts to mlodata1 for permanent storage.

To move files between mlodata1 and scratch:

SSH into the HaaS machine:

ssh <gaspar_username>@haas001.rcp.epfl.ch

Copy files using cp or rsync:

# Example: Copy from scratch to mlodata1
rsync -avP /mnt/mlo/scratch/homes/<username>/results /mnt/mlo/mlodata1/<username>/

# Example: Copy from mlodata1 to scratch
rsync -avP /mnt/mlo/mlodata1/<username>/dataset /mnt/mlo/scratch/homes/<username>/
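
Before deleting anything from scratch after an archive copy, it can be worth confirming that the two directory trees match. The following sketch uses diff -r for this. The function name verify_copy is hypothetical, and the comparison reads every file, so it can be slow on large directories.

```shell
# verify_copy SRC DST
# Succeeds only if the directory trees SRC and DST contain the same files
# with identical content.
verify_copy() {
  local src=$1
  local dst=$2
  diff -r "$src" "$dst" > /dev/null
}

# Example, after the first rsync command above:
#   verify_copy /mnt/mlo/scratch/homes/<username>/results /mnt/mlo/mlodata1/<username>/results
```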

Note

TODO: This section will be updated with permanent machine information for MLO once available.

Additional Resources

FAQ: See faq.md for common questions and troubleshooting
csub.py reference: See main README.md for detailed argument documentation
Architecture details: See README.md for deep dive into how the setup works
Multi-node training: See multinode.md for distributed training documentation
Run:ai CLI: See runai_cli.md for alternative workflows
