Skip to content

feat: parse Ref from version and version numbers#5584

Open
Xuanwo wants to merge 4 commits intomainfrom
xuanwo/parse-dataset-ref
Open

feat: parse Ref from version and version numbers#5584
Xuanwo wants to merge 4 commits intomainfrom
xuanwo/parse-dataset-ref

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Dec 28, 2025

This PR allows users to parse a str into Ref.

For example:

("0", Ref::VersionNumber(0)),
("42", Ref::VersionNumber(42)),
("main", Ref::Version(None, None)),
("main:", Ref::Version(None, None)),
("main:latest", Ref::Version(None, None)),
("main:10", Ref::Version(None, Some(10))),
("feature/a", Ref::Version(Some("feature/a".to_string()), None)),
("feature/a:", Ref::Version(Some("feature/a".to_string()), None)),
("feature/a:latest", Ref::Version(Some("feature/a".to_string()), None)),
("feature/a:10", Ref::Version(Some("feature/a".to_string()), Some(10))),
("tag1", Ref::Tag("tag1".to_string()))

Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

@github-actions github-actions Bot added the enhancement New feature or request label Dec 28, 2025
@github-actions
Copy link
Copy Markdown
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@github-actions
Copy link
Copy Markdown
Contributor

Code Review

P0 - Ambiguity between tags and branches without "/":

The FromStr implementation uses the presence of / to distinguish between branches and tags when there is no : separator. This creates an ambiguity: how would a user specify a branch named "feature" (no slash) vs a tag named "feature"?

Currently "tag1".parse::<Ref>() returns Tag("tag1"), but there is no way to parse a branch name without a slash (e.g., "dev" branch) unless using the colon syntax ("dev:").

Consider either:

  1. Documenting this limitation clearly
  2. Requiring explicit syntax for disambiguation (e.g., tag:name vs branch:name)
  3. Adding methods like Ref::parse_as_branch() / Ref::parse_as_tag() for explicit parsing

P1 - Inconsistency with existing From<&str> impl:

The existing From<&str> impl (line 46-49) unconditionally converts to Tag:

impl From<&str> for Ref {
    fn from(reference: &str) -> Self {
        Tag(reference.to_string())
    }
}

But FromStr has different behavior - "42".parse::<Ref>() returns VersionNumber(42) while Ref::from("42") returns Tag("42"). This inconsistency could cause subtle bugs. Consider deprecating or removing the From<&str> impl, or documenting the behavioral difference prominently.

@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 28, 2025

Codecov Report

❌ Patch coverage is 78.72340% with 10 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/refs.rs 78.26% 7 Missing and 3 partials ⚠️

📢 Thoughts on this report? Let us know!

@Xuanwo Xuanwo changed the title feat: Add FromStr for Ref feat: Allow version and version number been parsed into Ref Dec 28, 2025
@Xuanwo Xuanwo changed the title feat: Allow version and version number been parsed into Ref feat: parse Ref from version and version numbers Dec 28, 2025
@wjones127
Copy link
Copy Markdown
Contributor

I'd +1 the automated code review concern here: how do we differentiate from branches? We can maybe add some additional rules for what is a valid tag. It seems like this might be worth designing the API in an issue first before going into implementation.

There any background on this feature? Why did you decide to implement it?

@wjones127 wjones127 self-assigned this Dec 30, 2025
@Xuanwo
Copy link
Copy Markdown
Collaborator Author

Xuanwo commented Jan 4, 2026

I'd +1 the automated code review concern here: how do we differentiate from branches? We can maybe add some additional rules for what is a valid tag.

My current idea is to use logic similar to Git: we do not allow the same name for refs, including branches and tags. Therefore, for any given string, we only need to check whether it is a tag or a branch.

It seems like this might be worth designing the API in an issue first before going into implementation.

I'm willing to start one if needed

There any background on this feature? Why did you decide to implement it?

I'm working on duckdb's versioning feature for lance: lance-format/lance-duckdb#88, for example:

-- latest version on branch
SELECT * FROM my_lance.default.t AT (VERSION => 'feat')
-- branch with version
SELECT * FROM my_lance.default.t AT (VERSION => 'feat:45')
-- tag
SELECT * FROM my_lance.default.t AT (VERSION => 'v1.0')

SELECT * FROM my_lance.default.t AT (TIMESTAMP => TIMESTAMP '2025-09-22 12:32:43.217')

Note that DuckDB does not allow extending the AT clause to support BRANCH or TAG. Therefore, I plan to accept a Ref here instead. Users can pass any valid Ref to read as a version.

@prrao87 prrao87 requested a review from jackye1995 February 12, 2026 17:50
@hamersaw
Copy link
Copy Markdown
Contributor

hamersaw commented Feb 12, 2026

Linking a similar Spark PR as related discussions.

IMO I think requiring the implementation to discern whether a string passed as version is a branch or a tag presents some ambiguity and an unnecessary restriction that you can not have a branch and a tag with the same name -- something we don't enforce today IIUC.

The proposal on Spark-side is to extend the table identifier to include a branch. So from your examples, something like:

-- branch 'feat' with version '45'
SELECT * FROM my_lance.default.t.b_feat AT (VERSION => '45')

Looking into duckdb statments I would make a similar argument here that our solution to support branches should span further than SELECT. Users will want to perform all operations on specific branches (ex. INSERT, CREATE INDEX, UPDATED, etc), so we would be building different branch support for those operations later anyways.

#[rstest]
fn test_parse_ref_ok(
#[values(
("0", Ref::VersionNumber(0)),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a ref/ prefix before the branch or tag, so that we can differentiate a version 100 vs a branch/tag "100"?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought I proposed in Spark is that ref/ could indicate the delimiter is /. That means if we do ref/branch/tag then we know to use / to split branch and tag.

If a branch has a / (which is common), we an use a different delimiter, e.g. ref:feature/abc:100.

Basically the first 3 characters equaling ref indicates its not a version number, and the forth character is used as the delimiter.

What do we think? @Xuanwo @wjones127 @hamersaw

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great to me! Just a few clarifying questions:

  • Do we need to prepend all tags with branches when using ref then? ex. with ref/foo is "foo" a branch or a tag (ie. tag "foo" on branch "main")?
  • How do we handle ambiguity if a branch / version / tag is specified in both the table ID and in the VERSION AS OF clause? Just error out?

I've been trying to convince myself that it would be more simple (parsing + API) to only support branches in table IDs and versions / tags in VERSION AS OF clauses (within SELECT and CREATE TAG statements), but that's probably a bad idea if we parse things into our existing Ref type (which contains branch, version, and tag already).

@jackye1995
Copy link
Copy Markdown
Contributor

jackye1995 commented Feb 20, 2026

The proposal on Spark-side is to extend the table identifier to include a branch.

I think we definitely don't want to do another level of nesting like my_lance.default.t.b_feat, anything that is not 3 level is just causing additional incompatibility.

If we use the ref for tags that I proposed, one way is to do my_lance.default.t__ref/main/10, so we can match the keyword __ref to know if a table is referencing a specific branch or tag.

With that, I think it would be convenient to also reserve main to always mean the main branch. I remember we have thought about that but have not fully decided if we want to go with it or not. @majin1102 curious what you think.

@majin1102
Copy link
Copy Markdown
Contributor

majin1102 commented Feb 21, 2026

With that, I think it would be convenient to also reserve main to always mean the main branch. I remember we have thought about that but have not fully decided if we want to go with it or not. @majin1102 curious what you think.

The "main" has been already reversed and it always points to the actual main branch. This behavior is guaranteed by check_valid_branch() and standardize_branch().

If we use the ref for tags that I proposed, one way is to do my_lance.default.t__ref/main/10, so we can match the keyword __ref to know if a table is referencing a specific branch or tag.

Totally agree with this approach about the syntax handling.

Back to this PR. I'm wondering if introducing complex string conversion rules at this low level is necessary. I've considered this before and have a few concerns:

  • Flexibility: Different engines should be allowed to have their own syntax rules. Enforcing uniformity at the low level might sacrifice flexibility, even if we have a unified top-level design.
  • API Complexity: This adds cognitive load (or understanding cost) to the ref API

There are alternative approaches, such as adding dedicated APIs or providing utility functions. But the conversion logic in this PR now looks fine to me.

return Tag(reference.to_string());
}

let branch = if branch == MAIN_BRANCH {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: standardize_branch was designed as the entry point for this scenario

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants