- How did I come up with the idea of doing this?
- A quick note about sumhash512
- The implementation phases
- Code porting
- A more focused and idiomatic implementation
- A primer on the digest crate
- Implementing Sumhash512Core
- Phew… OK, what’s the lesson here?
- Honorable mentions
- Performance 🔥
- Future enhancements
- Wrapping thoughts
- I wrote a Rust implementation of the `sumhash512` cryptographic hash function designed by Algorand.
- The implementation is concise and 2x faster than the Go reference implementation.
- I hope this might be useful for the Algorand community in the future integration of state proofs in the Rust ecosystem.
I’ve been programming full-time in Go for a couple of years, and I love the language. I usually say that Go has made programming fun and productive again. This post isn’t about starting a flamewar about Rust vs. Go; both are excellent languages and the best choice for different use cases.
I’ve been learning Rust for some time. I got interested in the language for a couple of reasons:
- Go is great for highly concurrent programs, but it requires a lot of attention during PR reviews to catch data races. Depending on how the author designs the code, this is more or less of a problem. There are ways to make data races harder to introduce if you organize your code correctly, but unfortunately, that depends on the skill level of the coder.
- In some very particular use cases, when optimizing Go code you can hit a wall fighting the compiler over inlining, or trying to convince it that some variable doesn’t escape in order to avoid allocations.
- Sometimes I want a language without a runtime that can add extra overhead to what I’m coding; no deferred work or GC pauses either. The Go GC is impressive, with tiny stop-the-world pauses, but it’s still there.
- The way Rust expresses (im)mutability can allow the compiler to do further optimizations.
- I was always attracted to formal proofs in code. I love the idea of a compiler shouting at me because it considers some code isn’t safe or strictly correct.
- I’m okay with slow compile times or slower coding iterations in the cases where all this is mandatory (which I consider a very small set).
As mentioned before, I’ve been a Rust enthusiast mostly at the theoretical level, learning many corners of the language without much practical experience. Implementing this library is the first time I have dedicated more than an hour to coding something in Rust, so you’ve been warned! Doing this library in Rust was a pleasant experience, and I thought it would be nice to share it. This blog post ended up being lengthy, but I think that’s okay. I’d suggest skipping any sections you don’t find interesting.
How did I come up with the idea of doing this?
If you’ve read my other blog posts or know me personally, you know I’ve been interested in the Algorand blockchain since before it even launched its testnet. I’ve written some articles before which you might find interesting too. I won’t go into details about why I like the project; I mention it just for context.
A while ago, out of curiosity, I was lurking around the go-sumhash repository. It is the reference implementation of a new cryptographic hash function that the project designed for its soon-to-come state proofs. The code is relatively straightforward and typical for a hash function implementation.
Whenever I’m learning a new language, I’m the typical person who wants to find something useful to do with it. So I thought, why not implement this in Rust? It could be very interesting for comparing performance and a valuable asset for the Algorand community in the future.
I want to be clear about something. I think doing the reference implementation in Go is a reasonable choice. The Algorand node is implemented in Go, so it's nice to have all the dependencies in Go to avoid CGO. Also, I'm not fully aware if extreme performance is needed for state proofs. If that's the case and the CGO overhead is small enough compared to a potential faster Rust implementation, that can be switched later without much problem.
A quick note about sumhash512
The subset-sum hash function spec was written by Yossi Gilad, David Lazar, and Chris Peikert. This new cryptographic hash function is a part of the machinery behind state proofs, which will be released soon in Algorand.
State proofs are a solution for blockchain interoperability without any trusted parties apart from Algorand and the target blockchain. This means no trusted external committees, oracles, relays, or similar, which is very interesting. The gist is leveraging the validator set generated by the cryptographic sortition at the consensus layer to validate the existing state in the blockchain. You can then compress the proofs using zkSNARKs to be validated in a different chain. See this talk for more information.
You might wonder why invent a new cryptographic hash function, or how safe it is compared to mature ones. Some days ago, a cryptanalysis was published with details about this; quoting:
The ultimate goal and conclusion were that Algorand's sumhash512 [GLP21] has at least 128 bits of security against known quantum collision-finding attacks [Sch21], in an attacker-favorable model and time*memory (TM) metric
sumhash512 is a particular instance of the subset-sum cryptographic hash function. To understand it better, here are some high-level descriptions of how everything works:
- A random matrix `M` is created, filled with data generated using `Shake256` with a specific seed. See here and here. The subset-sum hash spec is agnostic of the actual values, but Algorand’s particular instance uses the "Algorand" string as part of the seed.
- The `Compress` module receives a fixed-size message and compresses it, combining each byte of the message with the values of `M` through an arithmetic operation.
- The `Compress` trait has two flavors. The `Matrix` implementation is simply what was described before. The `LookupTable` implementation precomputes how every possible byte value would interact with each matrix entry. Said differently, given some `u8` that interacts with a particular entry in `M`, you can compute all 256 combinations (8 bits) ahead of time and then simply look up the values when compressing a message. This shifts the computation from the CPU to a memory lookup. It is the default mode both in Go and in this Rust implementation.
- `Sumhash512Core` contains the code that performs proper fixed-size block buffering and delegates the compression of each block to a `Compressor`, mutating the internal state that determines the ultimate hash output. A whole section below is dedicated to this.
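To make the lookup-table idea concrete, here is a standalone toy sketch (the sizes, values, and function names here are mine for illustration, not Algorand's real parameters): for a group of 8 matrix entries (one per bit of a byte), we precompute the wrapping sum selected by every possible byte value.

```rust
// Toy sketch of the lookup-table trick (made-up sizes and values, not
// the real sumhash parameters). For one group of 8 matrix entries (one
// per bit of a byte), precompute the wrapping sum for all 256 bytes.
fn build_lookup(entries: &[u64; 8]) -> Vec<u64> {
    (0..256)
        .map(|b| {
            let mut acc: u64 = 0;
            for bit in 0..8usize {
                // A set bit in the byte selects the corresponding entry.
                if (b >> bit) & 1 == 1 {
                    acc = acc.wrapping_add(entries[bit]);
                }
            }
            acc
        })
        .collect()
}

// Compressing a message byte becomes a single table lookup plus a
// wrapping add, instead of up to eight additions.
fn compress_byte(acc: u64, table: &[u64], byte: u8) -> u64 {
    acc.wrapping_add(table[byte as usize])
}
```

This is the essence of trading CPU work for memory lookups described above.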
In the above bullets, I’m getting ahead of myself by providing links to parts of the Rust implementation, in the version published with this article. Don’t worry if it doesn’t make sense to you yet; you can come back here later!
The implementation phases
There's a well-known idea out there that I like when coding anything:
- Make it correct.
- Make it clear.
- Make it fast.
And do it in that order! As you advance through the bullets, more tension appears around sacrificing one for another, which is why remembering the ordering is essential. In particular, you can sacrifice clarity and performance, but never correctness. This can become a slippery slope when trying to squeeze out the last bit of performance with some unsafe magic.
To start, I did what could be understood as a port of the reference implementation. The idea was to have a fully working implementation in Rust that passes all existing tests so we can be confident that the library is doing what we expect.
If you're interested, here is the commit hash of that version. That link isn’t the final version of the library, but it was a good milestone while working on it: make it correct.
While doing the port, I had that familiar feeling you might get when porting something into another language:
- The ported Rust code had the same API style as the Go implementation, i.e., the `hash.Hash` interface. This interface is used a lot in the Go ecosystem and is the right choice there. But is it the obvious best choice for Rust?
- The reference implementation has its own block-buffering logic, inspired by how the `sha256` package does it. This is common in hash functions that process data in fixed-size blocks, and my ported code did the same. Still, it felt like this solved a problem that isn’t specific to the new cryptographic function; it’s a generic building block that probably already has a good implementation in the Rust ecosystem.
- The Go code already had some loop unrolling and other hand-made optimizations, which looked like a good idea. But I was curious whether they were strictly needed in Rust to achieve the same performance.
In the next section, I’ll focus on the first two bullets and leave performance measurements for a separate section.
A more focused and idiomatic implementation
The Go implementation uses `Shake256`, an eXtendable Output Function (XOF), as the random source to create the matrix for the `sumhash512` compression step. `Shake256` is also used in tests to generate random messages to check the correctness of the outputs. This latter point was great since I could reproduce the same randomness in Rust and use the same test vectors to ensure correctness!
The `Shake256` implementation I used came from the `sha3` crate. While using it, I noticed something interesting: the variable type involved was `&mut XofReaderCoreWrapper<Shake256ReaderCore>`. That sparked an intuition about useful existing abstractions:
- `XofReaderCoreWrapper` had the magic word wrapper, which felt interesting.
- `Shake256ReaderCore` had the magic word core, which also felt interesting.
Digging into where these abstractions came from, it became apparent that the `digest` crate was the way to make this new library fit into an existing ecosystem. This strategy would remove the manual block-buffering implementation and provide an idiomatic API that matches the rest of the ecosystem.
A primer on the digest crate
The core idea of this API design is that four variants wrap a core implementation of a hashing function. The core contains the gist of the hashing logic, mainly compressing fixed-size data blocks into some internal state (i.e., the part most cryptographic hash functions do differently). When you focus on implementing a new hash function, you implement the core traits that the crate expects. We’ll see more about this in a moment.
The first three flavors provide options for defining the output size of the hash function:
- `CoreWrapper` has a static definition of the output size at the type level.
- `CtVariableCoreWrapper` allows defining the output size with generics.
- `RtVariableCoreWrapper` allows defining the output size at runtime.
`XofReaderCoreWrapper` is the one we saw for `Shake256`, an XOF-style function.
For our `sumhash512` implementation, the 512 already defines the output size, so we can use `CoreWrapper` and focus only on that.
The `CoreWrapper<T>` type requires its core type T (our `Sumhash512Core`) to implement `BufferKindUser`. This, in turn, means T must also implement `BlockSizeUser` and define a `BufferKind`, which has two possible values.
I wasn’t entirely sure what this meant, so I had to look at the source code. It defines how the fixed-size block buffer behaves when it fills up: it either immediately pushes the block for compression as soon as it is full (`BufferKind::Eager`), or waits for one extra byte (i.e., the first byte of the next block) before pushing it (`BufferKind::Lazy`). In our case, `BufferKind::Eager` sounds like the reasonable choice; `BufferKind::Lazy` is presumably helpful for cryptographic hash functions that need extra information before compressing blocks. The eager kind also comes with a `digest_pad(...)` function that, as I’ll mention soon, is very convenient.
Finally, we get to the bottom of the trait hierarchy: `BlockSizeUser` defines the block size used by the hashing algorithm.
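The wrapper/core split can feel abstract, so here is a toy version of the pattern in plain Rust (all names here are made up by me, not the `digest` crate's API): the wrapper owns the block buffering, while the core only ever sees complete, fixed-size blocks.

```rust
// Toy illustration of the wrapper/core split (names are made up, not
// the digest crate's API).
trait BlockCore {
    const BLOCK_SIZE: usize;
    fn update_block(&mut self, block: &[u8]);
}

struct Wrapper<C: BlockCore> {
    core: C,
    buffer: Vec<u8>,
}

impl<C: BlockCore> Wrapper<C> {
    fn new(core: C) -> Self {
        Wrapper { core, buffer: Vec::new() }
    }

    // "Eager" behavior: hand a block to the core as soon as it fills.
    fn update(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
        while self.buffer.len() >= C::BLOCK_SIZE {
            let block: Vec<u8> = self.buffer.drain(..C::BLOCK_SIZE).collect();
            self.core.update_block(&block);
        }
    }
}

// A core that just counts blocks, to show the division of labor:
// it never deals with partial input.
struct CountingCore {
    blocks: usize,
}

impl BlockCore for CountingCore {
    const BLOCK_SIZE: usize = 64;
    fn update_block(&mut self, _block: &[u8]) {
        self.blocks += 1;
    }
}
```

The real crate does this generically with type-level sizes and blanket implementations, but the division of responsibilities is the same.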
Implementing Sumhash512Core
Let’s see how all this applies to building our `Sumhash512Core`.
To start, let’s take a look here:
Here we define some interesting properties of our `Sumhash512Core`:
- We use `U64` to configure our block size; the block size is 64 bytes.
- The buffer kind is `Eager`.
- The output size of the hash function is 64 bytes.
This will configure internal implementations of the block buffering and output size types.
But wait, what is this `OutputSizeUser` trait implementation that we haven’t mentioned before? This is where a powerful Rust feature shines: a type gains more methods when its generic parameter satisfies additional traits.
In our case, we are also interested in the `FixedOutput` trait implementation, since our output is 64 bytes (fixed). This trait allows us to call `finalize_fixed(self)`, which is analogous to `hash.Sum(...)` in Go; that is, finalizing the compression and processing any pending data in the buffer.
As we can see, this also requires T to be `FixedOutputCore`. And here is where our previous `OutputSizeUser` trait appears: it tells the finalizing method the correct output size of the hash function. It also indirectly asks for `UpdateCore`, since finalizing the compression with pending data means the partially filled block must be completed, mutating the hash function’s internal state once more.
`UpdateCore` is analogous to the `io.Writer` mechanics that `hash.Hash` has in Go.
Phew… OK, what’s the lesson here?
All this can be pretty confusing, but I want to stress what I think is the most helpful lesson. While I was trying to make `Sumhash512Core` fit `CoreWrapper`, I didn’t know anything about the `digest` crate traits. As soon as I tried to use some methods, I got compile-time errors. These errors were guidance; they were saying: hey, to implement `FixedOutputCore` you also need to implement X and Y. After doing that, I got other errors saying: oh, now that you want to implement Y, you also need Z.
Before you realize it, you end up with the compiler guiding the implementation, which:
- Indirectly split a monolithic (ported) implementation into clearly defined responsibilities.
- Gave blanket implementations enough information to do the heavy lifting of block buffering.
- Ended up with a nice idiomatic API for the library.
I had some conflicting feelings about this experience. From one angle, it felt great how helpful the compiler can be in doing the right thing. But since methods only exist when constraints on generic parameters are satisfied, it isn’t entirely obvious what is possible. You must read the crate docs carefully to ensure you’re not missing a helpful method that only applies under a specific constraint.
Compare the original port code dealing with block buffering with the final implementation!
I went a bit further with other details, such as implementing the `Default` trait and other nits, so you can get a fully configured `AlgorandSumhash512` hash function with `let h = CoreWrapper::<AlgorandSumhash512Core>::default();`. Very easy!
All this pleasant experience is a clear credit to the `digest` authors and how they designed the traits.
Honorable mentions
I want to make two particular honorable mentions of methods that made our implementation easier:
- The `digest_pad` method simplified all the (originally ported) complications of finalizing a partially written block: appending a byte, zeroing the remainder, and compressing it. When I discovered this method, it felt like magic: exactly what I needed.
- The `update_block` method, which `CoreWrapper` calls whenever a whole block is ready to be compressed. This sounds obvious, but it means `Sumhash512Core` only has to think about entire blocks and can completely forget about buffering.
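To make the first mention concrete, here is a simplified standalone sketch of what "pad and finalize" involves; the real `digest_pad` also handles a suffix and the case where it spills into an extra block, and the function name and signature below are mine, not the crate's.

```rust
// Simplified sketch of eager padding at finalization time: copy the
// pending bytes into a zeroed block, append a padding marker byte, and
// hand the completed block to the compression function.
fn pad_and_compress<F: FnMut(&[u8])>(
    pending: &[u8],
    block_size: usize,
    pad_byte: u8,
    mut compress: F,
) {
    assert!(pending.len() < block_size, "pending data must be a partial block");
    let mut block = vec![0u8; block_size];
    block[..pending.len()].copy_from_slice(pending);
    block[pending.len()] = pad_byte;
    compress(&block);
}
```

Getting this logic for free from the crate is exactly what removed the hand-rolled buffering from the port.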
Thanks, `digest` crate authors!
Performance 🔥
Out of curiosity, I wanted to compare the performance of this Rust implementation of the hash function with the Go one.
The Go implementation has a useful benchmark here. Running it locally gives the Go baseline: roughly ~11μs per hash.
After digging around for a while, I decided to give criterion a shot for Rust benchmarking. I ported this benchmark too, and the first results were surprising: at least 10x slower than Go. That looked very weird.
Time for some flame graphs!
Doh! The problem was that we were re-generating the compression matrix from randomness with a fixed seed every time a benchmark sample ran (note this link is from an old commit). The hasher can’t be reused since `finalize_fixed(...)` has move semantics; it consumes the instance.
What I did was lazily initialize a static variable in the crate, so the random matrix is generated once and reused by all instances (a similar strategy to the Go implementation). This is possible since the compression matrix is read-only. I’m not entirely sure I like it since, despite being lazily calculated, its lifetime is `'static`. That’s a bit of a tradeoff to let people use the `Default` trait. There’s also the option of using `CoreWrapper::from_core(...)` and providing a compressor instance that you can `Drop` whenever you want, to get a limited lifetime.
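The lazy-static idea can be sketched with the standard library's `OnceLock` (the crate itself may use a different mechanism, and the generation closure below is a stand-in, not the real Shake256 seeding):

```rust
use std::sync::OnceLock;

// The compression matrix is generated once on first use; every later
// call gets the same read-only reference with a 'static lifetime.
static MATRIX: OnceLock<Vec<u64>> = OnceLock::new();

fn matrix() -> &'static [u64] {
    MATRIX.get_or_init(|| {
        // Stand-in for the expensive Shake256-seeded generation.
        (0..1024u64)
            .map(|i| i.wrapping_mul(0x9E37_79B9_7F4A_7C15))
            .collect()
    })
}
```

Since the matrix is read-only, sharing a single `'static` copy across all hasher instances is safe; the tradeoff is exactly the one mentioned above.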
Rerunning the benchmark: our implementation now takes ~6μs compared to ~11μs for the Go implementation; that’s about a 2x speedup!
Let’s see the flame graph again:
Most of the time, we’re iterating over the message, doing a lookup in the lookup table, and a “wrapping add” to do the compression. Both branches of the flame graph (the update path and `finalize_fixed(...)`) spend their time doing that calculation, which makes sense.
The tight loop is basically:
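In a simplified, standalone form (toy dimensions; the function and variable names are mine, not the crate's exact code):

```rust
// Sketch of the hot loop: for each output word, walk the message and
// accumulate precomputed lookup-table values with wrapping adds.
fn compress(lookup: &[Vec<[u64; 256]>], msg: &[u8]) -> Vec<u64> {
    lookup
        .iter()
        .map(|row| {
            let mut acc: u64 = 0;
            for (j, &byte) in msg.iter().enumerate() {
                // One table lookup and one wrapping add per message byte.
                acc = acc.wrapping_add(row[j][byte as usize]);
            }
            acc
        })
        .collect()
}
```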
Some further notes about this:
- The lookup table is a `Vec<Vec<[u64; 256]>>`, which could perhaps be improved by flattening it. I haven’t tried this.
- I tried using `byteorder::LittleEndian::write_u64` since it can receive a mutable slice to store the serialization, guessing that might be better than a copy, but it didn’t have a noticeable impact.
- I looked at the assembly of this loop, and `msg[j]` produces a bounds check, which is unfortunate. It’s correct for the compiler to check this, but we know from our invariants that it shouldn’t be needed. I did a quick test with `get_unchecked`, which is an unsafe operation; while I saw the bounds check removed, it had other effects on the loop that didn’t lead to a noticeable speedup.
- The compressor struct is generic since it can work with different input and output sizes. If it were switched to constant-sized internal arrays, the loop bounds would be compile-time constants that could be unrolled, avoiding further memory lookups or bounds checks.
- This implementation uses a lookup table (as the Go implementation does), assuming that memory lookups are faster than burning CPU on `sum_bit(...)` calls. This assumption should be validated since, combined with the points above, it may not be such an obvious win.
So, there are some fun things to explore in the future to try to push past the 2x speedup.
Future enhancements
For now, I focused on implementing `Sumhash512Core`, which, as mentioned, has a fixed output size of 512 bits.
The reality is that the compressor implementation allows a subset-sum hash function with a variable output size. I could try implementing `CtVariableCoreWrapper`, but it would need an extra twist: the block buffer size depends on the output size, and the core must implement `BlockSizeUser`, which defines the block buffer size at compile time. I’m not entirely sure yet, but I could probably craft a generic `SumhashCore` type with generic parameters defining both the output and block sizes. That might be a story for another day!
Wrapping thoughts
Doing this implementation was a practical way to experience the power of Rust traits, the `digest` crate design, and some performance tooling.
The implementation is relatively short (and could probably be shorter) and achieves a 2x speedup compared to the Go implementation.
Finally, the most important thing: I hope it will be helpful for other projects in the Rust ecosystem in the future, probably related to Algorand’s state proofs.