Skip to content

Conversation

@Kovbo
Copy link
Collaborator

@Kovbo Kovbo commented Nov 26, 2025

Add functionality that allows downloading a checkpoint to a local path.
Add a method that deletes a model from the serverless backend.

@Kovbo Kovbo force-pushed the download-checkpoint-branch branch from a25634a to e4dfd1a Compare November 26, 2025 23:01
@arcticfly
Copy link
Contributor

This looks pretty good so far. Could you also add support for deploying to W&B Inference? That might inform changes we need to make to the API design.

@Kovbo
Copy link
Collaborator Author

Kovbo commented Dec 2, 2025

Added a function to upload to W&B inference.

if os.path.exists(original_checkpoint_dir):
if local_path is not None:
# Copy from original location to custom location
target_checkpoint_dir = os.path.join(local_path, f"{resolved_step:04d}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer just copying the checkpoint to whatever local_path is, so that the developer can just add that path to the artifact directly.

)

# Check if checkpoint exists in original location
if os.path.exists(original_checkpoint_dir):
Copy link
Contributor

@arcticfly arcticfly Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now a lot of logic goes into a single if statement with duplicated code:

1. Check if checkpoint exists locally
  1. If it does, check if local path is provided
    1. If it is, copy from original checkpoint dir to local path and return local path
    2. If not, just return original checkpoint dir
  2. If it does not, check if local path is provided
    1. If it is, download from s3 and save to local path and return local path
    2. If not, download from s3 and return original checkpoint dir

I think there would be fewer lines of code and they would be easier to read if we broke this into two if statements:

1. Check if checkpoint exists locally
  1. If it does, continue
  2. If not, download checkpoint from s3 to original checkpoint dir (default location)
2. Check if local path is provided
  1. If it is, copy checkpoint from original checkpoint dir to local path
  2. If not, return original checkpoint dir

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it looks better! Updated.

pull_s3: bool = True,
wait_for_completion: bool = True,
) -> LoRADeploymentJob:
pull_checkpoint: bool = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's actually not let the user pull the checkpoint as part of the deployment function. They should just use _experimental_pull_model_checkpoint for that.

Copy link
Contributor

@arcticfly arcticfly Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, could you deprecate _experimental_deploy on all the backends? We want users to just call deploy_model directly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we should remove the _experimental_deploy function entirely as its only purpose was to combine the _experimental_pull_model_checkpoint and deploy_model utils.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose removing altogether is probably fine, that's why we marked these functions as _experimental

Copy link
Contributor

@arcticfly arcticfly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for taking care of this!

@Kovbo Kovbo merged commit 7f9f6a0 into main Dec 3, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants