Support optimized fetching and caching #137

@kriswuollett

I just started trying out protofetch on a recommendation. The first thing I noticed is how slow it can be depending on the source, which is currently always git. The example I ran into was setting up the following dependency:

[grpc_health_v1]
url = "github.com/grpc/grpc"
revision = "b8a04acbbf18fd1c805e5d53d62ed9fa4721a4d1" # v1.64.0
protocol = "https"
allow_policies = ["src/proto/grpc/health/v1/*"]

The grpc/health/v1/health.proto file is just 2416 B. It looks like protofetch mirrored the entire repo into ~/.cache/protofetch/github.com/grpc/grpc, which took up 416 MB and about a minute to be ready. Performance depends on the machine and network, of course; I'm using an M2 Mac. For comparison, and to show my network performance, this is the output of a shallow git clone I ran myself:

% git clone --depth=1 https://github.com/grpc/grpc
Cloning into 'grpc'...
remote: Enumerating objects: 13476, done.
remote: Counting objects: 100% (13476/13476), done.
remote: Compressing objects: 100% (8198/8198), done.
remote: Total 13476 (delta 4629), reused 10048 (delta 3865), pack-reused 0
Receiving objects: 100% (13476/13476), 19.37 MiB | 10.66 MiB/s, done.
Resolving deltas: 100% (4629/4629), done.
Updating files: 100% (12308/12308), done.

The shallow clone takes up less space, just 178 MB.
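
The clone above grabs the tip of the default branch, not the pinned revision. A minimal sketch of a shallow fetch of one specific commit follows. The helper names `shallow_fetch_commands` and `run_shallow_fetch` are made up for illustration and are not part of protofetch, and the approach assumes the server allows fetching a commit by its hash.

```python
import subprocess


def shallow_fetch_commands(url, revision, dest):
    """Build git commands that fetch a single commit without its history.

    Hypothetical helper: assumes the remote permits fetching by commit hash.
    """
    return [
        ["git", "init", "--quiet", dest],
        ["git", "-C", dest, "remote", "add", "origin", url],
        # --depth=1 asks for the commit itself and none of its ancestors
        ["git", "-C", dest, "fetch", "--depth=1", "origin", revision],
        ["git", "-C", dest, "checkout", "--quiet", "FETCH_HEAD"],
    ]


def run_shallow_fetch(url, revision, dest):
    """Run the commands in order, stopping at the first failure."""
    for command in shallow_fetch_commands(url, revision, dest):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so this needs no network.
    for command in shallow_fetch_commands(
        "https://github.com/grpc/grpc",
        "b8a04acbbf18fd1c805e5d53d62ed9fa4721a4d1",
        "grpc",
    ):
        print(" ".join(command))
```

I have not measured how much this saves compared to the mirror; it is only meant to show that a pinned hash does not require the full history.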

So my thought was: even if a repo mirror can serve multiple versions of different deps from the same source, would it really beat, in practice, the efficiency of a "shallow" fetch that strips out everything but the proto files? The result could even be wrapped in a .tar.gz that is streamed and decoded in memory when needed. I'd think actual git mirrors or clones would only be necessary if fetching git submodules were supported.
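
To make the .tar.gz idea concrete, here is a rough sketch that reads only the matching files out of a gzip-compressed tarball held in memory. The function names `read_matching` and `make_archive` are hypothetical, and the glob matching uses Python's `fnmatch`, which may not behave the same way as protofetch's policy patterns.

```python
import fnmatch
import io
import tarfile


def read_matching(archive_bytes, pattern):
    """Return {path: content} for regular files whose path matches pattern."""
    files = {}
    # "r:gz" decodes a gzip-compressed tar straight from the in-memory buffer
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if member.isfile() and fnmatch.fnmatchcase(member.name, pattern):
                files[member.name] = tar.extractfile(member).read()
    return files


def make_archive(files):
    """Pack {path: content} into .tar.gz bytes (stands in for a cached bundle)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder contents, not the real files from the grpc repo.
    archive = make_archive({
        "src/proto/grpc/health/v1/health.proto": b'syntax = "proto3";\n',
        "README.md": b"not a proto file\n",
    })
    protos = read_matching(archive, "src/proto/grpc/health/v1/*")
    print(sorted(protos))
```

Nothing is written to disk here, which is the point: a cache entry could be a single small archive per dependency and revision.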

It shouldn't be the user's fault that the proto repo is too large.

I also noticed that revision can be a tag or a hash. IMO it should be possible to provide both, with the hash used to confirm the tag. Git tags are not constants, and being able to specify both would serve as functional documentation, rather than the manual code comment I'm writing now, as in the example above.
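
A sketch of what that check could look like, working from the text output of `git ls-remote --tags`. For an annotated tag, git lists the tag object under `refs/tags/<tag>` and the commit it points at under `refs/tags/<tag>^{}`, so the peeled entry is preferred when present. The function names are hypothetical, and the sample listing below is made up for illustration (the tag object hash is a placeholder, and I have not checked whether v1.64.0 is an annotated tag).

```python
def resolve_tag(ls_remote_output, tag):
    """Return the commit hash a tag points at, or None if the tag is absent."""
    direct = None
    peeled = None
    for line in ls_remote_output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split("\t", 1)
        if ref == f"refs/tags/{tag}":
            direct = sha
        elif ref == f"refs/tags/{tag}^{{}}":
            # Peeled entry: the commit behind an annotated tag
            peeled = sha
    return peeled or direct


def tag_matches(ls_remote_output, tag, expected_commit):
    """True only if the tag exists and resolves to the expected commit."""
    return resolve_tag(ls_remote_output, tag) == expected_commit


if __name__ == "__main__":
    # Illustrative listing only, not captured from the real remote.
    sample = (
        "1111111111111111111111111111111111111111\trefs/tags/v1.64.0\n"
        "b8a04acbbf18fd1c805e5d53d62ed9fa4721a4d1\trefs/tags/v1.64.0^{}\n"
    )
    print(tag_matches(sample, "v1.64.0", "b8a04acbbf18fd1c805e5d53d62ed9fa4721a4d1"))
```

If the tag were ever moved, the comparison would fail and the fetch could stop with an error instead of silently pulling different protos.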

In any case, if there is any desire to make a potentially breaking config change in the future, I think it would be great to support other fetch types as well, such as plain HTTP (tarball) with optional SHA-256 checks, even though the hash of some sources, such as git source archives, may not be guaranteed to stay stable long term on some platforms.
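
The optional checksum part is small. A minimal sketch, assuming the downloaded tarball is already in memory as bytes; `verify_sha256` is a hypothetical name, not an existing protofetch function.

```python
import hashlib


def verify_sha256(data, expected_hex=None):
    """Check data against an optional SHA-256 digest.

    Returns True when no digest is configured, since the check is optional.
    """
    if expected_hex is None:
        return True
    return hashlib.sha256(data).hexdigest() == expected_hex.strip().lower()


if __name__ == "__main__":
    payload = b"example tarball bytes"
    digest = hashlib.sha256(payload).hexdigest()
    print(verify_sha256(payload, digest))         # digest matches
    print(verify_sha256(payload + b"!", digest))  # payload was altered
    print(verify_sha256(payload))                 # no digest configured
```

Whether a missing digest should pass silently or produce a warning is a design choice I'd leave to the maintainers.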
