feat: support Shallow Clones for Filesystems #3938
base: main
Signed-off-by: Corwin Joy <[email protected]>
…ent name. Signed-off-by: Corwin Joy <[email protected]>
ACTION NEEDED
delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
This is a basic implementation of the shallow clone feature for delta-rs. While coding this, I ran into a couple of limitations that I could use feedback on.
    Ok(())
}

#[cfg(all(test, feature = "datafusion"))]
I ended up requiring datafusion for the test here so that I could verify the data in the clone matched the data in the original at the same version. Not sure if this is a problem.
log_store
    .write_commit_entry(commit_version, commit_bytes.clone(), operation_id)
    .await?;
I write the file adds as a second commit. I think this is in line with how tables are usually created in delta-rs: version 0 holds the metadata, and version 1 adds the files.
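The two-commit layout described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `Action`, `commit_path`, and `plan_clone_commits` are hypothetical names, and the actions carry no real payloads. The one detail taken from the Delta protocol is that commit files are named by version, zero-padded to 20 digits, under `_delta_log/`.

```rust
/// Path of a commit file relative to the table root. The Delta protocol names
/// commit files by version, zero-padded to 20 digits.
fn commit_path(version: u64) -> String {
    format!("_delta_log/{:020}.json", version)
}

/// Simplified stand-ins for Delta log actions (hypothetical, payloads omitted).
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Protocol,
    Metadata,
    Add { path: String },
}

/// Hypothetical planner: version 0 carries the protocol and metadata,
/// version 1 carries one add action per data file in the clone.
fn plan_clone_commits(add_paths: &[&str]) -> Vec<(String, Vec<Action>)> {
    let adds: Vec<Action> = add_paths
        .iter()
        .map(|p| Action::Add { path: p.to_string() })
        .collect();
    vec![
        (commit_path(0), vec![Action::Protocol, Action::Metadata]),
        (commit_path(1), adds),
    ]
}
```

Calling `plan_clone_commits(&["part-0.parquet", "part-1.parquet"])` yields two entries: the first for the version 0 commit file, the second for version 1 with two add actions.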
This is mostly due to URL handling in datafusion. It's been something to resolve for a long time.
Signed-off-by: Corwin Joy <[email protected]>
I'm not sure why this was marked as draft, since I believe it is ready for review. Also, a couple of notes:
Anyway, looking forward to feedback on this!

It's unlikely we are going to merge something that works only for local file system paths. This needs more work to support the various object store implementations, hence the draft designation. That said, how else would shallow clones work without the full path to the object? Their entire premise is to clone only the metadata.
So what this logic does is it:
This essentially follows the original Scala logic shown below. However, the Java version supports absolute paths. Here is the logic from Scala:
So, my argument here is that the code is the same as it would be in the final version, except for the use of absolute paths (which are not yet supported). This lets one develop and test the logic using symbolic links. Then, once absolute paths are supported in delta-rs, the ADD logic can be changed to set absolute paths in the clone. In the meantime, this gives a useful shallow clone operation for those of us using file system backends.
for view in file_views {
    let mut add = view.add_action();
    add.data_change = true;
    // Absolute paths are not supported for now, create symlinks instead.
@hntd187 Here is where the final logic would change (slightly). Instead of a call to a function to create a symlink, we set add.path = absolute_path_to_original_file. This absolute path would also need to handle object stores but the rest of the code should stay the same.
Essentially, the symbolic links are acting as proxies for full file paths until delta-rs is able to support full file paths (or fully qualified URIs).
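To make the symlink-as-proxy idea concrete, here is a minimal sketch using only the Rust standard library. It is not the implementation in this PR: `link_data_files` is a hypothetical helper, it assumes a Unix file system, and it assumes the clone keeps the same relative path for each data file so that an unchanged relative `add.path` still resolves inside the clone.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink; // Unix-only; other platforms need a different call
use std::path::{Path, PathBuf};

/// Hypothetical helper: for each relative data file path, create a symlink in
/// `clone_root` that points at the corresponding file under `source_root`.
/// Returns the paths of the links that were created.
fn link_data_files(
    source_root: &Path,
    clone_root: &Path,
    relative_paths: &[&str],
) -> io::Result<Vec<PathBuf>> {
    let mut links = Vec::new();
    for rel in relative_paths {
        // Resolve the original file to an absolute path so the link stays
        // valid regardless of the working directory.
        let target = fs::canonicalize(source_root.join(rel))?;
        let link = clone_root.join(rel);
        // Partitioned tables nest files in directories; mirror them in the clone.
        if let Some(parent) = link.parent() {
            fs::create_dir_all(parent)?;
        }
        symlink(&target, &link)?;
        links.push(link);
    }
    Ok(links)
}
```

Once absolute paths are supported, the `symlink` call is the only part that would be replaced, by writing the absolute location into the add action instead of creating a link.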
@hntd187 @rtyler
Description
Add support for creation/management of shallow clones (feature since 2.3) via delta-rs with python bindings.
Related Issue(s)
Closes issue #2456
Documentation
Delta Lake Clone
https://delta.io/blog/delta-lake-clone/
Use Case
Shallow clones are very valuable for testing new features in ephemeral environments against production data, without huge memory usage or disruption to production systems. A one-liner that creates an isolated test environment is especially valuable where users have read-only access to the table: they can use this feature to cheaply create their own writable branch of the data for testing new features.