Add file_extension field to BlobType#7009
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7009 +/- ##
==========================================
- Coverage 56.96% 56.92% -0.04%
==========================================
Files 931 931
Lines 58188 58210 +22
==========================================
- Hits 33146 33138 -8
- Misses 22001 22030 +29
- Partials 3041 3042 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
592e227 to
22ff989
Compare
| // 1. "format" is used for type validation in flytekit, "file_extension" is not. | ||
| // 2. "file_extension" controls the file extension of the blob when materializing | ||
| // to local disk during copilot download, unlike "format". | ||
| string file_extension = 3; |
There was a problem hiding this comment.
I'm open to merging this concept with the existing format field, but I think we'd still need a new bool field like enable_legacy_filename for backwards compatibility.
The initial motivation was that in some case allowing format != file_extension would be more flexible; also, this way the Blob is very explicit about whether a file extension will be added to it (you can visually parse the BlobType values and immediately know what file extension, if any, will be added. otherwise, if the concepts were merged, you could not immediately assume that "format=csv" implies the file is written with the extension.)
There was a problem hiding this comment.
We would like to prevent updating the proto as possible. Could we:
- Add the file extension based on the
format. In flytekit we can add this format as file extension. Is there any case that you think usingformathere cannot fit in? - Rather than adding flag, could we directly fallback to find file without extension if the one with extension does not exists?
There was a problem hiding this comment.
Add the file extension based on the format
Totally open to this! Are you thinking that e.g. "format=csv" implies flytecopilot will always write the file with the extension? My initial suggestion was mostly from an explicitness/compatibility perspective.
Also, I wanted to understand more about avoiding updates to the proto. Was it a performance concern? I just did a quick check to 100% ensure the happy path proto messages are unaffected by this change.
from flyteidl.core.types_pb2 import BlobType
blob = BlobType(
format="csv",
dimensionality=BlobType.SINGLE,
)
# flyte@master (latest flyte head)
blob.SerializeToString().hex() # value is 0a03637376
# flyte@rliu.DOM-75010.file-ext-copilot (my branch)
blob.SerializeToString().hex() # value is 0a03637376
So the optional new fields are indeed being excluded from happy path existing proto messages, which is probably an important requirement.
There was a problem hiding this comment.
Also, I wanted to understand more about avoiding updates to the proto. Was it a performance concern?
It's not a performance concern. My main concern is that changing the proto means changing a public API, which also requires corresponding client-side updates.
I'm not opposed to modifying the proto here, especially since these are optional fields. But we still need to be careful about the cases where the fields are missing. In particular, we should make sure both of these version-skew scenarios behave correctly:
- new flytekit with old backend
- old flytekit with new backend
Therefor, I think the bigger concern is API compatibility and missing-field semantics. If we can avoid modifying the proto and still achieve the same behavior, I would prefer that.
There was a problem hiding this comment.
Interesting, I'm thinking of ways to automate a version-skew test that gives reproducible results. I've created a proof of concept here: https://github.com/flyteorg/flyte/pull/7147/changes#r3030688248
I think the results are positive for the changes in this PR, meaning that version-skew testing is satisfactory for the new optional fields. Here are the results I achieved: https://github.com/flyteorg/flyte/pull/7147/changes#r3030772543
Backend (flyte) and Client (flytekit)
- Old: master@558cab64ffb60044653f19f1ea952133e233208e
- New: rliu.DOM-75010.file-ext-copilot@9327eb02af6acac3d31c844413583cd9d13a74d0
Test 1: New client (flytekit) and old backend (flyte)
message = BlobType(
format="test",
dimensionality=BlobType.SINGLE,
file_extension="test",
enable_legacy_filename=True,
)New client (flytekit) serializes the message to: 0a04746573741a04746573742001
Old backend (flyte) deserializes the message to: BlobType("test", BlobType.SINGLE)
// [Author's note] the optional params are dropped, which is the desired behavior.Test 2: Old client (flytekit) and new backend (flyte)
message = BlobType(
format="test",
dimensionality=BlobType.SINGLE,
)Old client (flytekit) serializes the message to: 0a0474657374
New backend (flyte) deserializes the message to: BlobType("test", BlobType.SINGLE, "", False)
// [Author's note] the optional params are populated with the default values "", False, which is the desired behavior.
If this kind of testing is directionally correct, that's awesome and I can extend it more in any way, if needed!
Prior to this PR, I considered two other approaches that do not involve modifying the proto: #7009 (comment)
1. New flyte-copilot CLI flags
--file_extension_config ENUM=(disabled, enabled, legacy)file_extension_config = disabled (by default) - Same behavior as today (
data: FlyteFile[csv]is written tooutputs/data)
file_extension_config = enabled - New behavior (data: FlyteFile[csv]is written tooutputs/data.csv)
file_extension_config = legacy - New behavior but backwards compatible, (data: FlyteFile[csv]is written tooutputs/dataandoutputs/data.csv)Cons:
- The download behavior is granular at the file level, but the CLI flag is granular at the task level (more or less). This makes the configuration less flexible, and begins to somewhat overload the configurability of copilot at the CLI level.
2. No changes, user code should be modified to read from the existing path e.g.
dataand add the extension itselfCons:
- Migrating large projects to Flyte becomes burdensome, when needing to add another extra step like this here - user code needs to be modified to handle adjusting the file extensions on inputs.
After my analysis, I concluded that extending the proto would have the most desirable result, and I'm hopeful the version-skew testing I included can help give an API compatibility/missing field handling guarantee. But I'm also happy to investigate any of the alternative approaches more, and give more confidence in the final decision.
There was a problem hiding this comment.
Thank you for your detailed reply! I agree that we would need to modify the proto here.
After re-thinking on using format as file_extension, this may not be a good way as users may already use format=XXX but does not expect having file extension. Changing the semantics of this field can cause breaking change. Therefor I agree that add file_extension here is better.
The question I got is whether we really need enable_legacy_filename. I imagine that if user want copilot to download with the extension, they will set file_extension=XXX, and copilot will download file as inputs /data.XXX. If they want to keep the original behavior, they can just define FlyteFile without file_extension, then copilot will download file as inputs/data, keeping the old behavior.
Could we only add file_extension field in proto here?
There was a problem hiding this comment.
Yes, that makes sense. I changed it to only add file_extension to the proto! 32627a1
Re-ran locally to test the changes
# Container task python code
def t1(datasetA: Annotated[FlyteFile[TypeVar('csv')], FileDownloadConfig(file_extension="csv")], datasetB: FlyteFile[TypeVar('csv')]): …
# flyte-copilot logs
2026-04-08T23:48:42.480Z: {"json":{},"level":"info","msg":"Successfully copied [196608] bytes remote data from [s3://ls-engc-flyte-data//858fd656-11ec-41ec-b235-b3b2984f230e/ex] to local [/execution-vol/flows/workflow/inputs/datasetB]","ts":"2026-04-08T23:48:42Z"}
2026-04-08T23:48:42.564Z: {"json":{},"level":"info","msg":"Successfully copied [1703936] bytes remote data from [s3://ls-engc-flyte-data//25d4f8d8-7eb4-492b-97c4-ab3ce8bdcf74/ae] to local [/execution-vol/flows/workflow/inputs/datasetA.csv]","ts":"2026-04-08T23:48:42Z"}
Re-ran version skew test as well https://github.com/flyteorg/flyte/pull/7147/changes#r3054453586
Add a new `file_extension` string field to the BlobType protobuf message, allowing FlyteFile to optionally specify a file extension (e.g. "csv") that flytecopilot appends when writing blobs to local disk during the download phase (e.g. "data.csv"). When empty (the default), behavior is unchanged. Add a new `enable_legacy_filename` bool field to the BlobType protobuf message, allowing FlyteFile to optionally specify whether to preserve backward compatibility for tasks that read from the extensionless path. Regenerated all protobuf bindings (Go, Python, JS, Rust, ES, Swagger). Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
26ff8da to
9327eb0
Compare
| } | ||
| for _, filename := range filenames { | ||
| varPath := path.Join(dir, filename) | ||
| // TODO: Refactor handleLiteral to accept a list of file paths and return a list of downloaded results |
ab3ca8f to
a41cb83
Compare
Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
a41cb83 to
32627a1
Compare
|
Thank you! The proto looks good to me. I will look into the implementation. |
Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: ddl-rliu <140021987+ddl-rliu@users.noreply.github.com>
Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
45b5e8e to
5a0bb83
Compare
|
Done! flyteorg/flytekit#3406 |
Tracking issue
Closes #7024 [BUG] [copilot] File extensions are missing when copilot downloads Blob/FlyteFile inputs
Why are the changes needed?
We'll assume the above behavior is not a bug (it is long-standing, and a "bugfix" will likely break existing workflows). Instead, this PR proposes enhancements to FlyteFile/Blob to support writing workflow inputs with the file extension. This PR includes several changes across flyteidl and flytecopilot.
The enhancements allow workflows to be flexible when writing blobs. Specifically, the existing behavior where flytecopilot writes blobs during the download phase without the file extension (e.g. "inputs/data") can now be enhanced so that the file extension is included (e.g. "inputs/data.csv").
edit: Just to add to the above - personally, I can use the
/inputs/datapath, and using user code the file can be renamed to/inputs/data.csvor something similar, if needed.But the key issue(s)/use case(s) is this:
.csvbut flytecopilot writes the file without.csv)data.csv, etc. - files with extensions). Then, all of their user code expects paths more-or-less like/inputs/data.csv.pandas.read_csv(path)is easily interchangeable withpath=/inputs/data vs. path=/inputs/data.csv, many of our users write "SAS" code. SAS is a weird language where SAS datasets must have a particular name, like/inputs/data.sas7bdat, and it fails to process a file like/inputs/data. So unfortunately, a lot of the SAS scripts out-of-the-box will fail, but with a copilot file extensions fix, it does work!What changes were proposed in this pull request?
[flyteidl]
Add a new
file_extensionstring field to the BlobType protobuf message, allowing FlyteFile to optionally specify a file extension (e.g. "csv") that flytecopilot appends when writing blobs during the download phase (e.g. "data.csv"). When empty (the default), behavior is unchanged.Add a newenable_legacy_filenamebool field to the BlobType protobuf message, allowing FlyteFile to optionally specify whether to preserve backward compatibility for tasks that read from the extensionless path.Regenerated all protobuf bindings (Go, Python, JS, Rust, ES, Swagger).
[flytecopilot]
The copilot download phase infers the desired download behavior(s) from the input interface.
[flytekit] PR: flyteorg/flytekit#3406
Alternatives considered
The PR's approach configures the file download behaviors at the BlobType level (e.g. per FlyteFile). This has several pros (granularity, explicitness).
But, one con is that unlike
BlobType.format(which can be inferred from the output filename e.g. "data.csv" ->format: "csv"), the new fieldsBlobType.file_extension, BlobType.enable_legacy_filenamecould not be inferred from the output filename (does "data.csv" match tofile_extension: "csv"?). Ultimately, this seems like it is introducing a minor inconsistency, but acceptable given the benefits. (It also seems like this is all hypothetical - copilot upload does not actually infer the format from the filename, so this concern may be somewhat moot?)Here are the other approaches I considered:
1. New flyte-copilot CLI flags
--file_extension_config ENUM=(disabled, enabled, legacy)file_extension_config = disabled (by default) - Same behavior as today (
data: FlyteFile[csv]is written tooutputs/data)file_extension_config = enabled - New behavior (
data: FlyteFile[csv]is written tooutputs/data.csv)file_extension_config = legacy - New behavior but backwards compatible, (
data: FlyteFile[csv]is written tooutputs/dataandoutputs/data.csv)Cons:
2. No changes, user code should be modified to read from the existing path e.g.
dataand add the extension itselfCons:
How was this patch tested?
Labels
Please add one or more of the following labels to categorize your PR:
This is important to improve the readability of release notes.
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Docs link