Add file_extension field to BlobType by ddl-rliu · Pull Request #7009 · flyteorg/flyte

ddl-rliu · 2026-03-10T23:42:27Z

Tracking issue

Closes #7024 [BUG] [copilot] File extensions are missing when copilot downloads Blob/FlyteFile inputs

Why are the changes needed?

We'll assume the above behavior is not a bug (it is long-standing, and a "bugfix" will likely break existing workflows). Instead, this PR proposes enhancements to FlyteFile/Blob to support writing workflow inputs with the file extension. This PR includes several changes across flyteidl and flytecopilot.

The enhancements allow workflows to be flexible when writing blobs. Specifically, the existing behavior where flytecopilot writes blobs during the download phase without the file extension (e.g. "inputs/data") can now be enhanced so that the file extension is included (e.g. "inputs/data.csv").

edit: Just to add to the above - personally, I can use the /inputs/data path, and using user code the file can be renamed to /inputs/data.csv or something similar, if needed.

But the key issue(s)/use case(s) is this:

We have a lot of users using Flyte, and bringing with them a great variety of user code. Highly technical users are comfortable changing the user code to rename file names for example, but the rest of the users get stuck and they find that their code fails when it runs in Flyte (namely, the user code expects .csv but flytecopilot writes the file without .csv)
So Flyte uses blob storage. But in parallel, all the data that users work with is living in an NFS file system. Namely, all their files in the NFS are typical files (data.csv, etc. - files with extensions). Then, all of their user code expects paths more-or-less like /inputs/data.csv.
Even though in Python, something like pandas.read_csv(path) is easily interchangeable with path=/inputs/data vs. path=/inputs/data.csv, many of our users write "SAS" code. SAS is a weird language where SAS datasets must have a particular name, like /inputs/data.sas7bdat, and it fails to process a file like /inputs/data. So unfortunately, a lot of the SAS scripts out-of-the-box will fail, but with a copilot file extensions fix, it does work!

What changes were proposed in this pull request?

[flyteidl]

Add a new file_extension string field to the BlobType protobuf message, allowing FlyteFile to optionally specify a file extension (e.g. "csv") that flytecopilot appends when writing blobs during the download phase (e.g. "data.csv"). When empty (the default), behavior is unchanged.

Add a new enable_legacy_filename bool field to the BlobType protobuf message, allowing FlyteFile to optionally specify whether to preserve backward compatibility for tasks that read from the extensionless path.

Regenerated all protobuf bindings (Go, Python, JS, Rust, ES, Swagger).

[flytecopilot]

The copilot download phase infers the desired download behavior(s) from the input interface.

[flytekit] PR: flyteorg/flytekit#3406

Alternatives considered

The PR's approach configures the file download behaviors at the BlobType level (e.g. per FlyteFile). This has several pros (granularity, explicitness).

But, one con is that unlike BlobType.format (which can be inferred from the output filename e.g. "data.csv" -> format: "csv"), the new fields BlobType.file_extension, BlobType.enable_legacy_filename could not be inferred from the output filename (does "data.csv" match to file_extension: "csv"?). Ultimately, this seems like it is introducing a minor inconsistency, but acceptable given the benefits. (It also seems like this is all hypothetical - copilot upload does not actually infer the format from the filename, so this concern may be somewhat moot?)

Here are the other approaches I considered:

1. New flyte-copilot CLI flags --file_extension_config ENUM=(disabled, enabled, legacy)

file_extension_config = disabled (by default) - Same behavior as today (data: FlyteFile[csv] is written to outputs/data)
file_extension_config = enabled - New behavior (data: FlyteFile[csv] is written to outputs/data.csv)
file_extension_config = legacy - New behavior but backwards compatible, (data: FlyteFile[csv] is written to outputs/data and outputs/data.csv)

Cons:

The download behavior is granular at the file level, but the CLI flag is granular at the task level (more or less). This makes the configuration less flexible, and begins to somewhat overload the configurability of copilot at the CLI level.

2. No changes, user code should be modified to read from the existing path e.g. data and add the extension itself

Cons:

Migrating large projects to Flyte becomes burdensome, when needing to add another extra step like this here - user code needs to be modified to handle adjusting the file extensions on inputs.

How was this patch tested?

# Container task python code
def t1(datasetA: FlyteFile[TypeVar('csv')], datasetB=Annotated[FlyteFile[TypeVar('csv')], FileDownloadConfig(file_extension="csv", enable_legacy_filename=True)]): …

# flyte-copilot logs
{"json":{},"level":"debug","msg":"inputInterface: [variables:{key:\"datasetA\"  value:{type:{blob:{format:\"csv\"}}  description:\"datasetA\"}}  variables:{key:\"datasetB\"  value:{type:{blob:{format:\"csv\"  file_extension:\"csv\"  enable_legacy_filename:true}}  description:\"datasetB\"}}]","ts":"2026-03-10T23:20:55Z"}
{"json":{},"level":"debug","msg":"varFilenames: map[datasetA:[datasetA[] datasetB:[datasetB.csv datasetB]]","ts":"2026-03-10T23:20:55Z"}
...
{"json":{},"level":"info","msg":"Successfully copied [4119] bytes remote data from [s3://rliucp108882-flyte-data//a0c46ac7-eea3-427d-91c1-a3c6b4a27e83/datasetB] to local [/execution-vol/flows/workflow/inputs/datasetB]","ts":"2026-03-10T23:20:55Z"}
{"json":{},"level":"info","msg":"Successfully copied [4119] bytes remote data from [s3://rliucp108882-flyte-data//a0c46ac7-eea3-427d-91c1-a3c6b4a27e83/datasetB] to local [/execution-vol/flows/workflow/inputs/datasetB.csv]","ts":"2026-03-10T23:20:55Z"}
{"json":{},"level":"info","msg":"Successfully copied [3904] bytes remote data from [s3://rliucp108882-flyte-data//af859c71-811d-4c26-9845-b3fd8830ab6b/datasetA] to local [/execution-vol/flows/workflow/inputs/datasetA]","ts":"2026-03-10T23:20:55Z"}

Labels

Please add one or more of the following labels to categorize your PR:

added: For new features.
changed: For changes in existing functionality.
deprecated: For soon-to-be-removed features.
removed: For features being removed.
fixed: For any bug fixed.
security: In case of vulnerabilities

This is important to improve the readability of release notes.

Setup process

Screenshots

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

Docs link

codecov · 2026-03-11T02:09:23Z

Codecov Report

❌ Patch coverage is 25.80645% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.92%. Comparing base (ed8aedb) to head (93ff903).

Files with missing lines	Patch %	Lines
flytecopilot/data/download.go	15.38%	21 Missing and 1 partial ⚠️
flytecopilot/data/upload.go	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7009      +/-   ##
==========================================
- Coverage   56.96%   56.92%   -0.04%     
==========================================
  Files         931      931              
  Lines       58188    58210      +22     
==========================================
- Hits        33146    33138       -8     
- Misses      22001    22030      +29     
- Partials     3041     3042       +1

Flag	Coverage Δ
unittests-datacatalog	`53.51% <ø> (ø)`
unittests-flyteadmin	`53.10% <ø> (-0.04%)`	⬇️
unittests-flytecopilot	`41.68% <17.85%> (-1.38%)`	⬇️
unittests-flytectl	`64.02% <ø> (-0.06%)`	⬇️
unittests-flyteidl	`75.72% <100.00%> (+0.01%)`	⬆️
unittests-flyteplugins	`60.17% <ø> (ø)`
unittests-flytepropeller	`53.65% <ø> (ø)`
unittests-flytestdlib	`63.00% <ø> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ddl-rliu · 2026-03-23T19:04:34Z

+    //   1. "format" is used for type validation in flytekit, "file_extension" is not.
+    //   2. "file_extension" controls the file extension of the blob when materializing
+    //    to local disk during copilot download, unlike "format".
+    string file_extension = 3;


I'm open to merging this concept with the existing format field, but I think we'd still need a new bool field like enable_legacy_filename for backwards compatibility.

The initial motivation was that in some case allowing format != file_extension would be more flexible; also, this way the Blob is very explicit about whether a file extension will be added to it (you can visually parse the BlobType values and immediately know what file extension, if any, will be added. otherwise, if the concepts were merged, you could not immediately assume that "format=csv" implies the file is written with the extension.)

We would like to prevent updating the proto as possible. Could we:

Add the file extension based on the format. In flytekit we can add this format as file extension. Is there any case that you think using format here cannot fit in?

Rather than adding flag, could we directly fallback to find file without extension if the one with extension does not exists?

Add the file extension based on the format

Totally open to this! Are you thinking that e.g. "format=csv" implies flytecopilot will always write the file with the extension? My initial suggestion was mostly from an explicitness/compatibility perspective.

Also, I wanted to understand more about avoiding updates to the proto. Was it a performance concern? I just did a quick check to 100% ensure the happy path proto messages are unaffected by this change.

from flyteidl.core.types_pb2 import BlobType blob = BlobType( format="csv", dimensionality=BlobType.SINGLE, ) # flyte@master (latest flyte head) blob.SerializeToString().hex() # value is 0a03637376 # flyte@rliu.DOM-75010.file-ext-copilot (my branch) blob.SerializeToString().hex() # value is 0a03637376

So the optional new fields are indeed being excluded from happy path existing proto messages, which is probably an important requirement.

Also, I wanted to understand more about avoiding updates to the proto. Was it a performance concern?

It's not a performance concern. My main concern is that changing the proto means changing a public API, which also requires corresponding client-side updates.

I'm not opposed to modifying the proto here, especially since these are optional fields. But we still need to be careful about the cases where the fields are missing. In particular, we should make sure both of these version-skew scenarios behave correctly:

new flytekit with old backend

old flytekit with new backend

Therefor, I think the bigger concern is API compatibility and missing-field semantics. If we can avoid modifying the proto and still achieve the same behavior, I would prefer that.

Interesting, I'm thinking of ways to automate a version-skew test that gives reproducible results. I've created a proof of concept here: https://github.com/flyteorg/flyte/pull/7147/changes#r3030688248

I think the results are positive for the changes in this PR, meaning that version-skew testing is satisfactory for the new optional fields. Here are the results I achieved: https://github.com/flyteorg/flyte/pull/7147/changes#r3030772543

Backend (flyte) and Client (flytekit)

Old: master@558cab64ffb60044653f19f1ea952133e233208e

New: rliu.DOM-75010.file-ext-copilot@9327eb02af6acac3d31c844413583cd9d13a74d0

Test 1: New client (flytekit) and old backend (flyte)

message = BlobType(
format="test",
dimensionality=BlobType.SINGLE,
file_extension="test",
enable_legacy_filename=True,
)

New client (flytekit) serializes the message to: 0a04746573741a04746573742001
Old backend (flyte) deserializes the message to: BlobType("test", BlobType.SINGLE)
// [Author's note] the optional params are dropped, which is the desired behavior.

Test 2: Old client (flytekit) and new backend (flyte)

message = BlobType(
format="test",
dimensionality=BlobType.SINGLE,
)

Old client (flytekit) serializes the message to: 0a0474657374
New backend (flyte) deserializes the message to: BlobType("test", BlobType.SINGLE, "", False)
// [Author's note] the optional params are populated with the default values "", False, which is the desired behavior.

If this kind of testing is directionally correct, that's awesome and I can extend it more in any way, if needed!

Prior to this PR, I considered two other approaches that do not involve modifying the proto: #7009 (comment)

1. New flyte-copilot CLI flags --file_extension_config ENUM=(disabled, enabled, legacy)

file_extension_config = disabled (by default) - Same behavior as today (data: FlyteFile[csv] is written to outputs/data)
file_extension_config = enabled - New behavior (data: FlyteFile[csv] is written to outputs/data.csv)
file_extension_config = legacy - New behavior but backwards compatible, (data: FlyteFile[csv] is written to outputs/data and outputs/data.csv)

Cons:

The download behavior is granular at the file level, but the CLI flag is granular at the task level (more or less). This makes the configuration less flexible, and begins to somewhat overload the configurability of copilot at the CLI level.

2. No changes, user code should be modified to read from the existing path e.g. data and add the extension itself

Cons:

Migrating large projects to Flyte becomes burdensome, when needing to add another extra step like this here - user code needs to be modified to handle adjusting the file extensions on inputs.

After my analysis, I concluded that extending the proto would have the most desirable result, and I'm hopeful the version-skew testing I included can help give an API compatibility/missing field handling guarantee. But I'm also happy to investigate any of the alternative approaches more, and give more confidence in the final decision.

Thank you for your detailed reply! I agree that we would need to modify the proto here.

After re-thinking on using format as file_extension, this may not be a good way as users may already use format=XXX but does not expect having file extension. Changing the semantics of this field can cause breaking change. Therefor I agree that add file_extension here is better.

The question I got is whether we really need enable_legacy_filename. I imagine that if user want copilot to download with the extension, they will set file_extension=XXX, and copilot will download file as inputs /data.XXX. If they want to keep the original behavior, they can just define FlyteFile without file_extension, then copilot will download file as inputs/data, keeping the old behavior.

Could we only add file_extension field in proto here?

Yes, that makes sense. I changed it to only add file_extension to the proto! 32627a1

Re-ran locally to test the changes

# Container task python code def t1(datasetA: Annotated[FlyteFile[TypeVar('csv')], FileDownloadConfig(file_extension="csv")], datasetB: FlyteFile[TypeVar('csv')]): … # flyte-copilot logs 2026-04-08T23:48:42.480Z: {"json":{},"level":"info","msg":"Successfully copied [196608] bytes remote data from [s3://ls-engc-flyte-data//858fd656-11ec-41ec-b235-b3b2984f230e/ex] to local [/execution-vol/flows/workflow/inputs/datasetB]","ts":"2026-04-08T23:48:42Z"} 2026-04-08T23:48:42.564Z: {"json":{},"level":"info","msg":"Successfully copied [1703936] bytes remote data from [s3://ls-engc-flyte-data//25d4f8d8-7eb4-492b-97c4-ab3ce8bdcf74/ae] to local [/execution-vol/flows/workflow/inputs/datasetA.csv]","ts":"2026-04-08T23:48:42Z"}

Re-ran version skew test as well https://github.com/flyteorg/flyte/pull/7147/changes#r3054453586

Add a new `file_extension` string field to the BlobType protobuf message, allowing FlyteFile to optionally specify a file extension (e.g. "csv") that flytecopilot appends when writing blobs to local disk during the download phase (e.g. "data.csv"). When empty (the default), behavior is unchanged. Add a new `enable_legacy_filename` bool field to the BlobType protobuf message, allowing FlyteFile to optionally specify whether to preserve backward compatibility for tasks that read from the extensionless path. Regenerated all protobuf bindings (Go, Python, JS, Rust, ES, Swagger). Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>

ddl-ebrown · 2026-04-05T22:53:42Z

+			}
+			for _, filename := range filenames {
+				varPath := path.Join(dir, filename)
+				// TODO: Refactor handleLiteral to accept a list of file paths and return a list of downloaded results


fixed! 32627a1

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>

machichima · 2026-04-09T12:15:08Z

Thank you! The proto looks good to me. I will look into the implementation.
Could you please also update the flytekit PR so that we can test it locally?

Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: ddl-rliu <140021987+ddl-rliu@users.noreply.github.com>

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>

ddl-rliu · 2026-04-10T00:26:37Z

Done! flyteorg/flytekit#3406

ddl-rliu commented Mar 12, 2026

View reviewed changes

Comment thread flytecopilot/data/download.go Outdated

This was referenced Mar 12, 2026

Add file_extension fields to BlobType flyteorg/flytekit#3406

Draft

[BUG] [copilot] File extensions are missing when copilot downloads Blob/FlyteFile inputs #7024

Open

ddl-rliu force-pushed the rliu.DOM-75010.file-ext-copilot branch from 592e227 to 22ff989 Compare March 12, 2026 23:23

ddl-rliu marked this pull request as ready for review March 12, 2026 23:27

ddl-rliu commented Mar 23, 2026

View reviewed changes

ddl-rliu added 4 commits March 30, 2026 10:16

Regenerate protobuf

cd7111a

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>

Refactor resolveVarFilenames

c203375

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>

Rename inputInterface

9327eb0

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>

ddl-rliu force-pushed the rliu.DOM-75010.file-ext-copilot branch from 26ff8da to 9327eb0 Compare March 30, 2026 17:16

ddl-ebrown reviewed Apr 5, 2026

View reviewed changes

ddl-rliu force-pushed the rliu.DOM-75010.file-ext-copilot branch 3 times, most recently from ab3ca8f to a41cb83 Compare April 8, 2026 23:43

Remove enableLegacyFilename

32627a1

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>

ddl-rliu force-pushed the rliu.DOM-75010.file-ext-copilot branch from a41cb83 to 32627a1 Compare April 9, 2026 00:00

ddl-rliu changed the title ~~Add file_extension, enable_legacy_filename fields to BlobType~~ Add file_extension fields to BlobType Apr 9, 2026

ddl-rliu changed the title ~~Add file_extension fields to BlobType~~ Add file_extension field to BlobType Apr 9, 2026

machichima reviewed Apr 9, 2026

View reviewed changes

Comment thread flyteidl/protos/flyteidl/core/types.proto Outdated

ddl-rliu and others added 2 commits April 9, 2026 13:43

Update flyteidl/protos/flyteidl/core/types.proto

1ba7c15

Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: ddl-rliu <140021987+ddl-rliu@users.noreply.github.com>

Update proto, add file extension check

5a0bb83

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>

ddl-rliu force-pushed the rliu.DOM-75010.file-ext-copilot branch from 45b5e8e to 5a0bb83 Compare April 9, 2026 22:37

Merge branch 'master' into rliu.DOM-75010.file-ext-copilot

93ff903

Conversation

ddl-rliu commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Tracking issue

Why are the changes needed?

What changes were proposed in this pull request?

Alternatives considered

How was this patch tested?

Labels

Setup process

Screenshots

Check all the applicable boxes

Related PRs

Docs link

Uh oh!

codecov bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

ddl-rliu Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

machichima Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ddl-rliu Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

machichima Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

ddl-rliu Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

machichima Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

ddl-rliu Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ddl-ebrown Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

ddl-rliu Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

machichima commented Apr 9, 2026

Uh oh!

ddl-rliu commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ddl-rliu commented Mar 10, 2026 •

edited

Loading

codecov bot commented Mar 11, 2026 •

edited

Loading

machichima Apr 1, 2026 •

edited

Loading

ddl-rliu Apr 1, 2026 •

edited

Loading

ddl-rliu Apr 2, 2026 •

edited

Loading

ddl-rliu Apr 8, 2026 •

edited

Loading