
Conversation

@jschneider-bensch (Contributor) commented Jun 4, 2025

This PR was an attempt to address #82, but after many tries it turned out that manually reusing allocations usually made stack usage worse.
Two things can be salvaged from these trials:

  • For shake128 we were always initializing four KeccakState structs, even when we only need one, as is the case for on-the-fly sampling (and we have no parallelism to exploit on the IoT target anyway).
  • We were using PRFxN for sampling ring elements, meaning we would sample a lot of randomness first and then generate ring elements from that. Splitting that up and sampling randomness for individual ring elements reduces stack usage in key generation by about 20%.
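To illustrate the second point, here is a minimal sketch of the before/after shape of the sampling loop. All names (`prf`, `sample_ring_element`, `PRF_OUTPUT_LEN`) are hypothetical placeholders, not the real libcrux API, and a toy byte-filler stands in for the SHAKE-based PRF; the point is only that sampling per ring element keeps one randomness buffer live instead of K of them:

```rust
// Hypothetical sketch -- `prf` and `sample_ring_element` are placeholders,
// and the toy PRF below is NOT SHAKE; it just fills bytes deterministically.

const PRF_OUTPUT_LEN: usize = 128; // placeholder size, not the real value
const K: usize = 3; // ML-KEM 768 uses K = 3

// Toy stand-in for the PRF: deterministic byte filler.
fn prf(seed: u8, out: &mut [u8]) {
    for (i, b) in out.iter_mut().enumerate() {
        *b = seed.wrapping_add(i as u8);
    }
}

// Placeholder for rejection sampling; here it just folds the bytes.
fn sample_ring_element(randomness: &[u8]) -> u32 {
    randomness.iter().map(|&b| b as u32).sum()
}

// Before: PRFxN-style -- all K randomness buffers live on the stack at once.
fn sample_all_at_once(seed: u8) -> [u32; K] {
    let mut randomness = [[0u8; PRF_OUTPUT_LEN]; K]; // K * PRF_OUTPUT_LEN bytes
    for i in 0..K {
        prf(seed.wrapping_add(i as u8), &mut randomness[i]);
    }
    core::array::from_fn(|i| sample_ring_element(&randomness[i]))
}

// After: one buffer, reused per ring element -- 1/K of the peak buffer size.
fn sample_one_by_one(seed: u8) -> [u32; K] {
    let mut randomness = [0u8; PRF_OUTPUT_LEN];
    core::array::from_fn(|i| {
        prf(seed.wrapping_add(i as u8), &mut randomness);
        sample_ring_element(&randomness)
    })
}

fn main() {
    // Both orderings produce the same ring elements; only peak stack differs.
    assert_eq!(sample_all_at_once(0), sample_one_by_one(0));
    println!("outputs match");
}
```

Since each ring element only needs its own PRF output, the per-element variant trades nothing for the smaller peak buffer when there is no parallel Keccak to exploit.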

In the course of these changes I removed the Hash abstraction, so we should now use PortableHash everywhere.

These are the measurements for stack usage:

Repository at commit: 2ec929a11c8d2f41d2c8f14b78cc94ca912086f0
Measurement: ML-KEM 768 Key Generation
  Stack begins at: 0x200a0000
  Found lowest stack address in use at: 0x2009eb14
  Max. stack usage: 5356 Bytes

Measurement: ML-KEM 768 Encapsulation
  Stack begins at: 0x200a0000
  Found lowest stack address in use at: 0x2009a184
  Max. stack usage: 24188 Bytes

Measurement: ML-KEM 768 Decapsulation
  Stack begins at: 0x200a0000
  Found lowest stack address in use at: 0x20099d44
  Max. stack usage: 25276 Bytes

and cycles:

l,0,16000000,ML-KEM Benchmark
b,r,bench_keygen,0
b,d,bench_keygen,0,1756626
b,r,bench_keygen,1
b,d,bench_keygen,1,1756624
b,r,bench_keygen,2
b,d,bench_keygen,2,1756626
b,r,bench_keygen,3
b,d,bench_keygen,3,1756628
b,r,bench_keygen,4
b,d,bench_keygen,4,1756625
b,r,bench_encaps,0
b,d,bench_encaps,0,1826480
b,r,bench_encaps,1
b,d,bench_encaps,1,1826480
b,r,bench_encaps,2
b,d,bench_encaps,2,1826481
b,r,bench_encaps,3
b,d,bench_encaps,3,1826483
b,r,bench_encaps,4
b,d,bench_encaps,4,1826483
b,r,bench_decaps,0
b,d,bench_decaps,0,2009735
b,r,bench_decaps,1
b,d,bench_decaps,1,2009732
b,r,bench_decaps,2
b,d,bench_decaps,2,2009733
b,r,bench_decaps,3
b,d,bench_decaps,3,2009734
b,r,bench_decaps,4
b,d,bench_decaps,4,2009737

Base automatically changed from jonas/on-the-fly-matrix to main June 5, 2025 15:14
@franziskuskiefer (Member)

@jschneider-bensch does this need review? Or what's the state of this?

@jschneider-bensch (Contributor, Author)

Yes, let me rebase this to the latest version and check if we gain anything from it still. If so, then it will be ready for review.

@keks (Member) left a comment

I think this mostly looks good. The main thing I don't understand is why we replace array references with slices, especially since the functions are still generic over K (which seems to mostly be the length of the sequences).

@franziskuskiefer (Member)

@jschneider-bensch what's the state here? This looks a little stale.


3 participants