- How did I come up with the idea of doing this?
- A quick note about sumhash512
- The implementation phases
- Code porting
- A more focused and idiomatic implementation
- A primer on the digest crate
- Implementing Sumhash512Core
- Phew… OK, what’s the lesson here?
- Honorable mentions
- Performance 🔥
- Future enhancements
- Wrapping thoughts
- I wrote a Rust implementation of the `sumhash512` cryptographic hash function designed by Algorand.
- The implementation is concise and 2x faster than the Go reference implementation.
- I hope this might be useful for the Algorand community in the future integration of state proofs in the Rust ecosystem.
I’ve been programming full-time in Go for a couple of years, and I love the language. I usually say that Go has made programming fun and productive again. This post isn’t about starting a flamewar about Rust vs. Go; both are excellent languages and the best choice for different use cases.
I’ve been learning Rust for some time. I got interested in the language for a couple of reasons:
- Go is great for highly concurrent programs, but it requires a lot of attention during PR reviews to catch data races. Depending on how the author designs the code, this is more or less of a problem. There are ways to make data races harder to introduce if you organize your code correctly, but unfortunately, that depends on the skill level of the coder.
- In some very particular use cases, when optimizing Go code you can hit a wall fighting the compiler over inlining, or trying to convince it that some variable doesn’t escape in order to avoid allocations.
- Sometimes I want a language without a runtime that can add extra overhead to what I’m coding; no deferred work or GC pauses either. The Go GC is impressive, with tiny stop-the-world pauses, but it’s still there.
- The way Rust expresses (im)mutability can allow the compiler to do further optimizations.
- I was always attracted to formal proofs in code. I love the idea of a compiler shouting at me because it considers some code isn’t safe or strictly correct.
- I’m okay with slow compile times or slower coding iterations in the cases where all this is mandatory (which I consider a very small set).
As mentioned before, I’ve been a Rust enthusiast mostly at the theoretical level, learning many corners of the language without much practical experience. Implementing this library is the first time I have dedicated more than an hour to coding something in Rust, so you’ve been warned! Doing this library in Rust was a pleasant experience, and I thought it would be nice to share it. This blog post ended up being lengthy, but I think that’s okay. I’d suggest skipping any sections you don’t find interesting.
How did I come up with the idea of doing this?
If you’ve read my other blog posts or know me personally, you know I’ve been interested in the Algorand blockchain since before it even launched its testnet. I’ve written some articles before which you might find interesting too. I won’t go into details about why I like the project; I mention it just for context.
A while ago, out of curiosity, I was lurking around the go-sumhash repository. It is the reference implementation of a new cryptographic hash function that the project designed for its soon-to-come state proofs. The code is relatively straightforward and typical for a hash function implementation.
Whenever I’m learning a new language, I’m the typical person who wants to find something useful to do with it. So I thought, why not implement this in Rust? It could be very interesting for comparing performance and a valuable asset for the Algorand community in the future.
I want to be clear about something. I think doing the reference implementation in Go is a reasonable choice. The Algorand node is implemented in Go, so it's nice to have all the dependencies in Go to avoid CGO. Also, I'm not fully aware if extreme performance is needed for state proofs. If that's the case and the CGO overhead is small enough compared to a potential faster Rust implementation, that can be switched later without much problem.
A quick note about sumhash512
The subset-sum hash function spec was written by Yossi Gilad, David Lazar, and Chris Peikert. This new cryptographic hash function is a part of the machinery behind state proofs, which will be released soon in Algorand.
State proofs are a solution for blockchain interoperability without any trusted parties apart from Algorand and the target blockchain. This means no trusted external committees, oracles, relays, or similar, which is very interesting. The gist is leveraging the validator set generated by the cryptographic sortition at the consensus layer to validate the existing state in the blockchain. You can then compress the proofs using zkSNARKs to be validated in a different chain. See this talk for more information.
You might wonder why invent a new cryptographic hash function, or how safe it is compared to mature ones. Some days ago, a cryptanalysis was published with details about this; quoting:
The ultimate goal and conclusion were that Algorand's sumhash512 [GLP21] has at least 128 bits of security against known quantum collision-finding attacks [Sch21], in an attacker-favorable model and time*memory (TM) metric
sumhash512 is a particular instance of the subset-sum cryptographic hash function. To understand it better, here are some high-level descriptions of how everything works:
- A random matrix `M` is created, filled with data generated using `Shake256` with a specific seed. See here and here. The subset-sum hash spec is agnostic of the actual values, but Algorand’s particular instance uses the "Algorand" string as part of the seed.
- The `Compress` module receives a fixed-size message and compresses it, combining each byte of the message with the values of `M` through an arithmetic operation.
- The `Compress` trait has two flavors. The `Matrix` implementation is simply what was described before. The `LookupTable` implementation precomputes how every possible byte value would interact with each matrix entry. Said differently, given some `u8` that interacts with a particular entry in `M`, you can compute all 256 combinations (8 bits) ahead of time and then simply look up the values when compressing a message. This shifts the computation from the CPU to a memory lookup. It is the default mode both in Go and in this Rust implementation.
- `Sumhash512Core` contains the code that performs proper fixed-size block buffering and delegates the compression of each block to a `Compressor`, mutating the internal state that determines the ultimate hash output. A whole section below is dedicated to this.
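To make the lookup-table idea concrete, here is a standalone toy sketch (the sizes, values, and function names here are mine for illustration, not Algorand's real parameters): for a group of 8 matrix entries (one per bit of a byte), we precompute the wrapping sum selected by every possible byte value.

```rust
// Toy sketch of the lookup-table trick (made-up sizes and values, not
// the real sumhash parameters). For one group of 8 matrix entries (one
// per bit of a byte), precompute the wrapping sum for all 256 bytes.
fn build_lookup(entries: &[u64; 8]) -> Vec<u64> {
    (0..256)
        .map(|b| {
            let mut acc: u64 = 0;
            for bit in 0..8usize {
                // A set bit in the byte selects the corresponding entry.
                if (b >> bit) & 1 == 1 {
                    acc = acc.wrapping_add(entries[bit]);
                }
            }
            acc
        })
        .collect()
}

// Compressing a message byte becomes a single table lookup plus a
// wrapping add, instead of up to eight additions.
fn compress_byte(acc: u64, table: &[u64], byte: u8) -> u64 {
    acc.wrapping_add(table[byte as usize])
}
```

This is the essence of trading CPU work for memory lookups described above.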
In the above bullets, I’m getting ahead of myself by providing links to parts of the Rust implementation, in the version published with this article. Don’t worry if it doesn’t make sense to you yet; you can come back here later!
The implementation phases
There's a well-known idea out there that I like when coding anything:
- Make it correct.
- Make it clear.
- Make it fast.
And do it in that order! As you advance through the bullets, more tension appears around sacrificing one for another, which is why remembering the ordering is essential. In particular, you can sacrifice clarity and performance, but never correctness. This can become a slippery slope when trying to squeeze out the last bit of performance with some unsafe magic.
To start, I did what could be understood as a port of the reference implementation. The idea was to have a fully working implementation in Rust that passes all existing tests so we can be confident that the library is doing what we expect.
If you're interested, here is the commit hash of that version. That link isn’t the final version of the library, but it was a good milestone while working on it: make it correct.
While doing the port, I had that familiar feeling you might get when porting something into another language:
- The ported Rust code had the same API style as the Go implementation, i.e., the `hash.Hash` interface. This interface is used a lot in the Go ecosystem and is the right choice there. But is it the obvious best choice for Rust?
- The reference implementation has its own block-buffering logic, inspired by how the `sha256` package does it. This is common in hash functions that process data in fixed-size blocks, and my ported code did the same. Still, it felt like this solved a problem that isn’t specific to the new cryptographic function; it’s a generic building block that probably already has a good implementation in the Rust ecosystem.
- The Go code already had some loop unrolling and other hand-made optimizations, which looked like a good idea. But I was curious whether they were strictly needed in Rust to achieve the same performance.
In the next section, I’ll focus on the first two bullets and leave performance measurements for a separate section.
A more focused and idiomatic implementation
The Go implementation uses `Shake256`, an eXtendable Output Function (XOF), as the random source to create the matrix for the `sumhash512` compression step. `Shake256` is also used in tests to generate random messages to check the correctness of the outputs. This latter point was great since I could reproduce the same randomness in Rust and use the same test vectors to ensure correctness!
The `Shake256` implementation I used came from the `sha3` crate. While using it, I noticed something interesting: the variable type involved was `&mut XofReaderCoreWrapper<Shake256ReaderCore>`. That sparked an intuition about useful existing abstractions:
- `XofReaderCoreWrapper` had the magic word wrapper, which felt interesting.
- `Shake256ReaderCore` had the magic word core, which also felt interesting.
Digging into where these abstractions came from, it became apparent that the `digest` crate was the way to make this new library fit into an existing ecosystem. This strategy would remove the manual block-buffering implementation and provide an idiomatic API that matches the rest of the ecosystem.
A primer on the digest crate
The core idea of this API design is that four variants wrap a core implementation of a hashing function. The core contains the gist of the hashing logic, mainly compressing fixed-size data blocks into some internal state (i.e., the part most cryptographic hash functions do differently). When you focus on implementing a new hash function, you implement the core traits that the crate expects. We’ll see more about this in a moment.
The first three flavors provide options for defining the output size of the hash function:
- `CoreWrapper` has a static definition of the output size at the type level.
- `CtVariableCoreWrapper` allows defining the output size with generics.
- `RtVariableCoreWrapper` allows defining the output size at runtime.
`XofReaderCoreWrapper` is the one we saw for `Shake256`, an XOF-style function.
For our `sumhash512` implementation, the 512 already defines the output size, so we can use `CoreWrapper` and focus only on that.
The `CoreWrapper<T>` type requires its core type T (our `Sumhash512Core`) to implement `BufferKindUser`. This, in turn, means T must also implement `BlockSizeUser` and define a `BufferKind`, which has two possible values.
I wasn’t entirely sure what this meant, so I had to look at the source code. It defines how the fixed-size block buffer behaves when it fills up: it either immediately pushes the block for compression as soon as it is full (`BufferKind::Eager`), or waits for one extra byte (i.e., the first byte of the next block) before pushing it (`BufferKind::Lazy`). In our case, `BufferKind::Eager` sounds like the reasonable choice; `BufferKind::Lazy` is presumably helpful for cryptographic hash functions that need extra information before compressing blocks. The eager kind also comes with a `digest_pad(...)` function that, as I’ll mention soon, is very convenient.
Finally, we get to the bottom of the trait hierarchy: `BlockSizeUser` defines the block size used by the hashing algorithm.
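The wrapper/core split can feel abstract, so here is a toy version of the pattern in plain Rust (all names here are made up by me, not the `digest` crate's API): the wrapper owns the block buffering, while the core only ever sees complete, fixed-size blocks.

```rust
// Toy illustration of the wrapper/core split (names are made up, not
// the digest crate's API).
trait BlockCore {
    const BLOCK_SIZE: usize;
    fn update_block(&mut self, block: &[u8]);
}

struct Wrapper<C: BlockCore> {
    core: C,
    buffer: Vec<u8>,
}

impl<C: BlockCore> Wrapper<C> {
    fn new(core: C) -> Self {
        Wrapper { core, buffer: Vec::new() }
    }

    // "Eager" behavior: hand a block to the core as soon as it fills.
    fn update(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
        while self.buffer.len() >= C::BLOCK_SIZE {
            let block: Vec<u8> = self.buffer.drain(..C::BLOCK_SIZE).collect();
            self.core.update_block(&block);
        }
    }
}

// A core that just counts blocks, to show the division of labor:
// it never deals with partial input.
struct CountingCore {
    blocks: usize,
}

impl BlockCore for CountingCore {
    const BLOCK_SIZE: usize = 64;
    fn update_block(&mut self, _block: &[u8]) {
        self.blocks += 1;
    }
}
```

The real crate does this generically with type-level sizes and blanket implementations, but the division of responsibilities is the same.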
Implementing Sumhash512Core
Let’s see how all this applies to building our `Sumhash512Core`.
To start, let’s take a look here:
Here we define some interesting properties of our `Sumhash512Core`:
- We use `U64` to configure our block size; the block size is 64 bytes.
- The buffer kind is `Eager`.
- The output size of the hash function is 64 bytes.
This will configure internal implementations of the block buffering and output size types.
But wait, what is this `OutputSizeUser` trait implementation that we haven’t mentioned before? This is where a powerful Rust feature shines: a type gains more methods when its generic parameter satisfies additional traits.
In our case, we are also interested in the `FixedOutput` trait implementation, since our output is 64 bytes (fixed). This trait allows us to call `finalize_fixed(self)`, which is analogous to `hash.Sum(...)` in Go; that is, finalizing the compression and processing any pending data in the buffer.
As we can see, this also requires T to be `FixedOutputCore`. And here is where our previous `OutputSizeUser` trait appears: it tells the finalizing method the correct output size of the hash function. It also indirectly asks for `UpdateCore`, since finalizing the compression with pending data means the partially filled block must be completed, mutating the hash function’s internal state once more.
`UpdateCore` is analogous to the `io.Writer` mechanics that `hash.Hash` has in Go.
Phew… OK, what’s the lesson here?
All this can be pretty confusing, but I want to stress what I think is the most helpful lesson. While I was trying to make `Sumhash512Core` fit `CoreWrapper`, I didn’t know anything about the `digest` crate traits. As soon as I tried to use some methods, I got compile-time errors. These errors were guidance; they were saying: hey, to implement `FixedOutputCore` you also need to implement X and Y. After doing that, I got other errors saying: oh, now that you want to implement Y, you also need Z.
Before you realize it, you end up with the compiler guiding the implementation, which:
- Indirectly split a monolithic (ported) implementation into clearly defined responsibilities.
- Gave blanket implementations enough information to do the heavy lifting of block buffering.
- Ended up with a nice idiomatic API for the library.
I had some conflicting feelings about this experience. From one angle, it felt great how helpful the compiler can be in doing the right thing. But since methods only exist when constraints on generic parameters are satisfied, it isn’t entirely obvious what is possible. You must read the crate docs carefully to ensure you’re not missing a helpful method that only applies under a specific constraint.
Compare the original port code dealing with block buffering with the final implementation!
I went a bit further with other details, such as implementing the `Default` trait and other nits, so you can get a fully configured `AlgorandSumhash512` hash function with `let h = CoreWrapper::<AlgorandSumhash512Core>::default();`. Very easy!
All this pleasant experience is a clear credit to the `digest` authors and how they designed the traits.
Honorable mentions
I want to make two particular honorable mentions of methods that made our implementation easier:
- The `digest_pad` method simplified all the (originally ported) complications of finalizing a partially written block: appending a byte, zeroing the remainder, and compressing it. When I discovered this method, it felt like magic: exactly what I needed.
- The `update_block` method, which `CoreWrapper` calls whenever a whole block is ready to be compressed. This sounds obvious, but it means `Sumhash512Core` only has to think about entire blocks and can completely forget about buffering.
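To make the first mention concrete, here is a simplified standalone sketch of what "pad and finalize" involves; the real `digest_pad` also handles a suffix and the case where it spills into an extra block, and the function name and signature below are mine, not the crate's.

```rust
// Simplified sketch of eager padding at finalization time: copy the
// pending bytes into a zeroed block, append a padding marker byte, and
// hand the completed block to the compression function.
fn pad_and_compress<F: FnMut(&[u8])>(
    pending: &[u8],
    block_size: usize,
    pad_byte: u8,
    mut compress: F,
) {
    assert!(pending.len() < block_size, "pending data must be a partial block");
    let mut block = vec![0u8; block_size];
    block[..pending.len()].copy_from_slice(pending);
    block[pending.len()] = pad_byte;
    compress(&block);
}
```

Getting this logic for free from the crate is exactly what removed the hand-rolled buffering from the port.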
Thanks, `digest` crate authors!
Performance 🔥
Out of curiosity, I wanted to compare the performance of this Rust implementation of the hash function with the Go one.
The Go implementation has a useful benchmark here. Running it locally gives the Go baseline: roughly ~11μs per hash.
After digging around for a while, I decided to give criterion a shot for Rust benchmarking. I ported this benchmark too, and the first results were surprising: at least 10x slower than Go. That looked very weird.
Time for some flame graphs!
Doh! The problem was that we were re-generating the compression matrix from randomness with a fixed seed every time a benchmark sample ran (note this link is from an old commit). The hasher can’t be reused since `finalize_fixed(...)` has move semantics; it consumes the instance.
What I did was lazily initialize a static variable in the crate, so the random matrix is generated once and reused by all instances (a similar strategy to the Go implementation). This is possible since the compression matrix is read-only. I’m not entirely sure I like it since, despite being lazily calculated, its lifetime is `'static`. That’s a bit of a tradeoff to let people use the `Default` trait. There’s also the option of using `CoreWrapper::from_core(...)` and providing a compressor instance that you can `Drop` whenever you want, to get a limited lifetime.
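The lazy-static idea can be sketched with the standard library's `OnceLock` (the crate itself may use a different mechanism, and the generation closure below is a stand-in, not the real Shake256 seeding):

```rust
use std::sync::OnceLock;

// The compression matrix is generated once on first use; every later
// call gets the same read-only reference with a 'static lifetime.
static MATRIX: OnceLock<Vec<u64>> = OnceLock::new();

fn matrix() -> &'static [u64] {
    MATRIX.get_or_init(|| {
        // Stand-in for the expensive Shake256-seeded generation.
        (0..1024u64)
            .map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15))
            .collect()
    })
}
```

Since the matrix is read-only, sharing a single `'static` copy across all hasher instances is safe; the tradeoff is exactly the one mentioned above.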
Rerunning the benchmark: our implementation now takes ~6μs compared to ~11μs for the Go implementation; that’s about a 2x speedup!
Let’s see the flame graph again:
Most of the time, we’re iterating over the message, doing a lookup in the lookup table, and a “wrapping add” to do the compression. Both branches of the flame graph (the update path and `finalize_fixed(...)`) spend their time doing that calculation, which makes sense.
The tight loop is basically:
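In a simplified, standalone form (toy dimensions; the function and variable names are mine, not the crate's exact code):

```rust
// Sketch of the hot loop: for each output word, walk the message and
// accumulate precomputed lookup-table values with wrapping adds.
fn compress(lookup: &[Vec<[u64; 256]>], msg: &[u8]) -> Vec<u64> {
    lookup
        .iter()
        .map(|row| {
            let mut acc: u64 = 0;
            for (j, &byte) in msg.iter().enumerate() {
                // One table lookup and one wrapping add per message byte.
                acc = acc.wrapping_add(row[j][byte as usize]);
            }
            acc
        })
        .collect()
}
```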
Some further notes about this:
- The lookup table is a `Vec<Vec<[u64; 256]>>`, which could perhaps be improved by flattening it. I haven’t tried this.
- I tried using `byteorder::LittleEndian::write_u64` since it can receive a mutable slice to store the serialization, guessing that might be better than a copy, but it didn’t have a noticeable impact.
- I looked at the assembly of this loop, and `msg[j]` produces a bounds check, which is unfortunate. It’s correct for the compiler to check this, but we know from our invariants that it shouldn’t be needed. I did a quick test with `get_unchecked`, which is an unsafe operation; while I saw the bounds check removed, it had other effects on the loop that didn’t lead to a noticeable speedup.
- The compressor struct is generic since it can work with different input and output sizes. If it were switched to constant-sized internal arrays, the loop bounds would be compile-time constants that could be unrolled, avoiding further memory lookups or bounds checks.
- This implementation uses a lookup table (as the Go implementation does), assuming that memory lookups are faster than burning CPU on `sum_bit(...)` calls. This assumption should be validated since, combined with the points above, it may not be such an obvious win.
So, there are some fun things to explore in the future to try to push past the 2x speedup.
Future enhancements
For now, I focused on implementing `Sumhash512Core`, which, as mentioned, has a fixed output size of 512 bits.
The reality is that the compressor implementation allows a subset-sum hash function with a variable output size. I could try implementing `CtVariableCoreWrapper`, but it would need an extra twist: the block buffer size depends on the output size, and the core must implement `BlockSizeUser`, which defines the block buffer size at compile time. I’m not entirely sure yet, but I could probably craft a generic `SumhashCore` type with generic parameters defining both the output and block sizes. That might be a story for another day!
Wrapping thoughts
Doing this implementation was a practical way to experience the power of Rust traits, the `digest` crate design, and some performance tooling.
The implementation is relatively short (and could probably be shorter) and achieves a 2x speedup compared to the Go implementation.
Finally, the most important thing: I hope it will be helpful for other projects in the Rust ecosystem in the future, probably related to Algorand’s state proofs.