@corwinjoy
Contributor

Description

Add support for creating and managing shallow clones (a Delta Lake feature since 2.3) in delta-rs, with Python bindings.

Related Issue(s)

Closes issue #2456

Documentation

Delta Lake Clone
https://delta.io/blog/delta-lake-clone/

Use Case
Shallow clones are valuable for testing new features in ephemeral environments against production data, without large storage or memory overhead and without disrupting production systems. A one-liner that creates an isolated test environment is especially useful when users have read-only access to the table: they can cheaply create their own writable branch of the data for testing.

@github-actions bot added the binding/python (Issues for the Python package) and binding/rust (Issues for the Rust crate) labels Nov 17, 2025
@github-actions

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@corwinjoy changed the title from "Support Shallow Clones for Filesystems" to "feat: support Shallow Clones for Filesystems" Nov 17, 2025
@corwinjoy
Contributor Author

This is a basic implementation of the shallow clone feature for delta-rs. While coding this, I ran into a couple of limitations that I could use feedback on.

  1. It seems that delta-rs does not fully support absolute file paths yet. When I first tried this, I had the code add files with absolute paths. But, the tables wanted to prepend the table directory to the file paths, so this did not work. As a result, for now, I create symbolic links to obtain a usable feature. The goal would be to replace this eventually.

  2. I also tried this with deletion vectors. I have a test case for this using a table in the test directory with simple deletion vectors. However, it fails with `Error: Transaction { source: UnsupportedReaderFeatures([DeletionVectors]) }`, so perhaps this feature is not yet supported? Or do I need to add something for this case?
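
For context on the second point, a Delta commit's `protocol` action can list `readerFeatures` (such as `deletionVectors`), and a client is expected to refuse tables that require features it does not implement. The following is a minimal sketch of that kind of check, not delta-rs code: the function name and the `SUPPORTED_READER_FEATURES` set are hypothetical and chosen only for illustration.

```python
import json

# Assumption: the reader features this hypothetical client can handle.
SUPPORTED_READER_FEATURES = {"columnMapping", "timestampNtz"}


def unsupported_reader_features(commit_lines, supported=SUPPORTED_READER_FEATURES):
    """Return the reader features required by a commit that are not supported."""
    missing = set()
    for line in commit_lines:
        if not line.strip():
            continue
        action = json.loads(line)
        protocol = action.get("protocol")
        if protocol is None:
            continue
        # The readerFeatures list may be absent on older protocol versions.
        for feature in protocol.get("readerFeatures", []):
            if feature not in supported:
                missing.add(feature)
    return sorted(missing)


# A commit whose protocol action requires deletion vectors.
commit = [
    json.dumps({"protocol": {"minReaderVersion": 3, "minWriterVersion": 7,
                             "readerFeatures": ["deletionVectors"],
                             "writerFeatures": ["deletionVectors"]}}),
    json.dumps({"metaData": {"id": "example"}}),
]
print(unsupported_reader_features(commit))
```

If the source table's protocol requires a feature the operation does not handle, a check of this shape would reject the clone before any files are added.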

@corwinjoy
Contributor Author

Summary via copilot

Pull Request Overview

This PR adds a shallow_clone method to create Delta table clones that reference the same data files as the source table without copying actual data. The implementation uses symlinks to reference data files from the cloned table to the source table.

Key changes:

  • Adds shallow_clone method to the Python DeltaTable API accepting a target URI
  • Implements CloneBuilder in Rust core operations with symlink-based file sharing
  • Adds test coverage in both Python and Rust for the cloning functionality

Changed Files

| File | Description |
| --- | --- |
| python/tests/test_shallow_clone.py | Adds Python integration test for shallow cloning functionality |
| python/src/lib.rs | Adds Python binding for shallow_clone method on RawDeltaTable |
| python/deltalake/table.py | Adds public Python API method for shallow cloning |
| python/deltalake/_internal.pyi | Adds type stub for shallow_clone method |
| crates/core/src/operations/mod.rs | Integrates CloneBuilder into DeltaOps API |
| crates/core/src/operations/clone.rs | Core implementation of shallow clone operation with symlinks |

@corwinjoy
Contributor Author

@rtyler @adamreeve

Ok(())
}

#[cfg(all(test, feature = "datafusion"))]
Contributor Author

I ended up requiring datafusion for the test here so that I could verify the data in the clone matched the data in the original at the same version. Not sure if this is a problem.


log_store
.write_commit_entry(commit_version, commit_bytes.clone(), operation_id)
.await?;
Contributor Author

I do the file adds as a second commit. I think this is in line with how tables are usually created in delta-rs: version 0 holds the metadata, and version 1 adds the files.
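
A minimal sketch of that two-commit layout, assuming a local filesystem and using only the standard library. The helper names and the abbreviated action contents are hypothetical; the part taken from the Delta log format is that each commit is a file of newline-delimited JSON actions under `_delta_log/`, named with a 20-digit zero-padded version.

```python
import json
import tempfile
from pathlib import Path


def commit_path(table_root: Path, version: int) -> Path:
    # Commit files are named with a 20-digit zero-padded version number.
    return table_root / "_delta_log" / f"{version:020d}.json"


def write_commit(table_root: Path, version: int, actions: list) -> Path:
    path = commit_path(table_root, version)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" fails if the version already exists, so a commit is never overwritten.
    with open(path, "x", encoding="utf-8") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return path


def write_clone_log(table_root: Path, protocol: dict, metadata: dict, adds: list) -> list:
    # Version 0: protocol and metadata only. Version 1: one add action per file.
    v0 = write_commit(table_root, 0, [{"protocol": protocol}, {"metaData": metadata}])
    v1 = write_commit(table_root, 1, [{"add": add} for add in adds])
    return [v0, v1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_clone_log(
            Path(tmp) / "clone",
            protocol={"minReaderVersion": 1, "minWriterVersion": 2},
            metadata={"id": "clone-example"},  # abbreviated placeholder metadata
            adds=[{"path": "part-0.parquet", "size": 123, "dataChange": True}],
        )
        print([p.name for p in paths])
```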

@hntd187
Collaborator

hntd187 commented Nov 18, 2025

> 1. It seems that delta-rs does not fully support absolute file paths yet. When I first tried this, I had the code add files with absolute paths. But, the tables wanted to prepend the table directory to the file paths, so this did not work. As a result, for now, I create symbolic links to obtain a usable feature. The goal would be to replace this eventually.

This is mostly due to URL handling in datafusion. It's been something to resolve for a long time.
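
To illustrate the behavior being discussed: the Delta protocol allows an add action's path to be either relative to the table root or absolute, so a reader has to tell the two apart before joining. The sketch below is a simplified illustration and not how delta-rs or datafusion resolve paths; `resolve_data_file` is a hypothetical helper, and it skips URI decoding and Windows drive letters.

```python
from urllib.parse import urlparse


def resolve_data_file(table_uri: str, add_path: str) -> str:
    """Resolve an add action's path against the table root."""
    parsed = urlparse(add_path)
    if parsed.scheme or add_path.startswith("/"):
        # Already absolute (a URI with a scheme, or a rooted filesystem path):
        # it must not be prefixed with the table root.
        return add_path
    # Relative path: join it onto the table root.
    return table_uri.rstrip("/") + "/" + add_path


print(resolve_data_file("s3://bucket/clone", "part-0.parquet"))
print(resolve_data_file("s3://bucket/clone", "s3://bucket/source/part-0.parquet"))
```

Unconditionally prepending the table root, as described in the quoted comment, corresponds to always taking the second branch.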

@rtyler rtyler marked this pull request as draft November 18, 2025 18:21
@corwinjoy corwinjoy marked this pull request as ready for review December 2, 2025 22:24
@corwinjoy
Contributor Author

Not quite sure why this got marked as draft since I believe it is ready to review?

Also, a couple of notes:

  1. I believe the full file paths are quite doable in delta-rs. I have a draft PR for this which supports full file paths that I plan to post soon.
  2. Even with full file paths, I think a shallow clone with symbolic links may be useful. The reason is security. Fundamentally, enabling full file paths creates a potential security risk, since delta-rs would then access files outside of the delta table directory. So, one may want a clone with a more restricted scope, where symbolic links are created with more limited file access.
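
One way to picture the restricted-scope concern is an allow-list check on resolved paths. This is a minimal sketch under the assumption of a local filesystem; `is_within_allowed_roots` is a hypothetical helper, not part of delta-rs.

```python
import tempfile
from pathlib import Path


def is_within_allowed_roots(path: str, allowed_roots: list) -> bool:
    """Return True if the fully resolved path lies under one of the allowed roots."""
    # resolve() follows symlinks and collapses "..", so traversal tricks are normalized.
    resolved = Path(path).resolve()
    for root in allowed_roots:
        try:
            resolved.relative_to(Path(root).resolve())
            return True
        except ValueError:
            continue
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source"
        other = Path(tmp) / "other"
        source.mkdir()
        other.mkdir()
        print(is_within_allowed_roots(str(source / "part-0.parquet"), [str(source)]))
        print(is_within_allowed_roots(str(source / ".." / "other" / "x.parquet"), [str(source)]))
```

Because `resolve()` follows symlinks, a link inside a clone directory is judged by where it points, so the set of allowed roots still decides what the clone can reach.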

Anyway, looking forward to feedback on this!

@hntd187
Collaborator

hntd187 commented Dec 2, 2025

It's unlikely we are going to merge something that works only for local file system paths. This needs more work to support the various object store implementations, thus the draft designation.

That said, how else would shallow clones work without the full path to the object? Their entire premise is to clone only the metadata.

@corwinjoy
Contributor Author

> It's unlikely we are going to merge something that works only for local file system paths. This needs more work to work for all various object store impls, thus the draft designation.
>
> That having been said how else would shallow clones work without the full path to the object? Their entire premise is to clone only the metadata.

What this logic does is:

  1. Clones the metadata
  2. For each parquet file it creates a symbolic link that points to the original file. (So these look like sub-files but point to the original files).
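
A minimal sketch of step 2 on a local filesystem, using only the standard library. It is not the Rust code from this PR: `link_data_files` is a hypothetical helper, and it assumes the platform permits creating symlinks (on Windows this may require extra privileges).

```python
import os
import tempfile
from pathlib import Path


def link_data_files(source_root: Path, clone_root: Path, relative_paths: list) -> list:
    """Create one symlink per data file, mirroring the source table's layout."""
    links = []
    for rel in relative_paths:
        target = (source_root / rel).resolve()  # absolute path to the original file
        link = clone_root / rel
        # Partition directories such as "year=2024/" must exist before linking.
        link.parent.mkdir(parents=True, exist_ok=True)
        os.symlink(target, link)  # raises FileExistsError if the link already exists
        links.append(link)
    return links


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source, clone = Path(tmp) / "source", Path(tmp) / "clone"
        (source / "year=2024").mkdir(parents=True)
        (source / "year=2024" / "part-0.parquet").write_bytes(b"placeholder bytes")
        for link in link_data_files(source, clone, ["year=2024/part-0.parquet"]):
            print(link.is_symlink(), link.read_bytes())
```

The clone's add actions can then keep the same relative paths as the source, since each relative path resolves to a link inside the clone directory.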

This essentially follows the original Scala logic shown below. The Scala (JVM) implementation, however, supports absolute paths, so in step 2 it is able to replace relative file paths with absolute file paths.

Here is the logic from Scala:

  1. Create a new table with metadata as of the given version.
  2. Obtain a list of active files as of that version, and for each file, perform an ADD operation in the clone. For this ADD operation use absolute file paths pointing to the ORIGINAL directory rather than relative file paths.
    See core clone code here:
    https://github.com/delta-io/delta/blob/3f262005700b89d3c81345c0f3a47d05f045e843/spark/src/main/scala/org/apache/spark/sql/delta/commands/CloneTableBase.scala#L156

So, my argument here is that the code is the same as it would be in the final version with the exception of using absolute paths (which are not yet supported). So, this lets one test and develop the logic using symbolic links. Then, once absolute paths are supported in delta-rs, the ADD logic can be changed to set absolute paths in the clone. In the meantime, this gives a useful shallow clone operation for those of us using file system back ends.
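
A minimal sketch of what the absolute-path variant of the ADD step could look like, written in Python for illustration only. `to_absolute_adds` is a hypothetical helper, the add actions are plain dictionaries rather than delta-rs types, and URI encoding of path segments is ignored.

```python
def to_absolute_adds(source_table_uri: str, adds: list) -> list:
    """Rewrite relative add paths so they point at files in the source table."""
    root = source_table_uri.rstrip("/")
    rewritten = []
    for add in adds:
        new_add = dict(add)  # copy so the source table's actions are not mutated
        path = add["path"]
        # Leave paths alone if they are already absolute (a URI or a rooted path).
        if "://" not in path and not path.startswith("/"):
            new_add["path"] = f"{root}/{path}"
        rewritten.append(new_add)
    return rewritten


source_adds = [
    {"path": "part-0.parquet", "size": 10},
    {"path": "s3://other-bucket/table/part-1.parquet", "size": 20},
]
for add in to_absolute_adds("s3://bucket/source/", source_adds):
    print(add["path"])
```

Compared with the symlink approach, only the value written to the add action's path changes; the rest of the clone flow stays the same.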

for view in file_views {
let mut add = view.add_action();
add.data_change = true;
// Absolute paths are not supported for now, create symlinks instead.
Contributor Author

corwinjoy commented Dec 2, 2025

@hntd187 Here is where the final logic would change (slightly). Instead of a call to a function to create a symlink, we set add.path = absolute_path_to_original_file. This absolute path would also need to handle object stores but the rest of the code should stay the same.

Contributor Author

Essentially, the symbolic links are acting as proxies for full file paths until delta-rs is able to support full file paths (or fully qualified URIs).

@corwinjoy
Contributor Author

@hntd187 @rtyler
I have a related draft PR to support full file paths, although @rtyler mentioned that full file paths may already be in progress as part of the kernel work (#2456 (comment)).
Anyway, here is the draft PR for file paths:
https://github.com/delta-io/delta-rs/pull/3963/files
