Update (2023-03-23): I’ve created a document with more details about performance results.
Some weeks ago, I was on the monthly Verkle Trie sync call, which tracks progress around the EIP.
While chatting about performance bottlenecks, I touched on a reasonable amount of CPU burn while loading compressed elliptic curve points since that involved calculating the coordinate from , which requires some heavy CPU operations. These heavy operations are square root calculations in a finite field using, presumably, Tonelli-Shanks algorithm. At that point, we discussed some experiments around saving these points in the uncompressed form to load both coordinates in memory without any extra computation. I won’t dive if that’s a good or bad idea, but the chat went in that direction.
Curiously, around the same time and without knowing, this other Twitter discussion was happening. In a nutshell, another person notices the same challenge from another angle/use case. In that thread, you can see that Antonio mentions an optimization for Tonelli-Shanks mentioned in the Wikipedia article. That thread gives some rough ideas of how this would work, but while digging into the details, I realized it isn’t that trivial unless you’ve seen some ideas before.
This optimization opportunity for Tonelli-Shanks had some discussion internally since it was precisely the same discussion we had the day before in the Verkle Trie sync call. This optimization is only implemented (AFAIK) in the Bandersnatch repository coded by Gottfried Herold.
I took some time to figure out how this idea works by looking at the implementation from Gottfried Herold, which is based on some ideas mentioned in Faster square roots in annoying finite fields from Berenstein. This article tries to explain this in more detail. It was probably an overkill to deduct all this by reverse engineering cryptography-related code, but the paper only sketched an idea. In any case, it was quite instructive since reverse engineering is good for learning stuff, and Gottfried added helpful comments in the code! To repeat, I’m not describing a new idea or discovering anything new; just unpacking a nice idea for others to know and understand without making the same effort.
Apparently, this idea isn’t currently used in libraries such as gnark or arkworks, which apply plain Tonelli-Shanks. Depending on the two-addicity of the base field of the curve, this precomputing should produce a decent speedup with not that much memory overhead. For “complicated” curves with high two-addicity (worst case scenario) probably is good to apply what’s described here. If the two-addicity isn’t big, you could have a simpler table and avoid some complexity.
For go-verkle we use gnark, so it would be nice to include this optimization there. In our benchmarks, the amount of CPU burned doing sqrts is high. Soon, I plan to port Gottfried implementation to use gnark base operations and check how much the speedup is for our case. That will avoid a lot of CPU burn, and with more time, we can upstream the optimization to gnark directly if possible.
Let’s refresh what we are trying to do. We have a field element and want to calculate using Tonelli-Shanks. We can code the algorithm and be happy, but if you have a fixed , maybe you can do something faster.
I won’t explain how Tonelli-Shanks works because the Wikipedia page does a reasonable job. It isn’t a trivial algorithm if you don’t have some background in group theory, roots of unity, or similar things. I didn’t know all the needed background, but it was a great excuse to learn more.
Roughly, the crux of Tonelli-Shanks starts with some known and the following equation:
Then, through some iterations, discover some such that . If we multiply both sides by , it means our square root of is One of the main bottlenecks is having to do many squaring (i.e: ) to find .
High-level idea & strategy
Coming from Tonelli-Shanks, we want to solve this situation for some known :
If we could find such that we’re done since:
Tonelli-Shanks come up with with a series of iterations doing squarings. I think the Core Ideas section of Wikipedia provides a good reference. Let’s think for a moment if we could find this “in a single shot” instead of doing so many iterations in Tonelli-Shanks.
First, let’s look at what Antonio Sanso referenced pointing at the Generalizations section:
If many square-roots must be done in the same cyclic group and S is not too large, a table of square-roots of the elements of 2-power order can be prepared in advance and the algorithm simplified and sped up as follows.
So that looks promising since we’re in a situation where we want to compute many square roots in the same cyclic group. The problem appears in what I highlighted in bold.
In the current Verkle Trie EIP, we’re using the Bandersnatch elliptic curve. For this curve, means . As we’ll see soon, this means creating a table of elements which is too big. If was smaller, you could directly do what’s mentioned in the above quote and move on.
OK, precomputing a table with elements is discarded; what is this other idea? If we look at our original problem again, we want to find such that:
First, let’s note that is a -th root of unity. This is true since because of Euler’s criterion. This was a necessary condition checked by Tonelli-Shanks to know that our original has a square root, so:
so is a -th root of unity. In the explanation, we’ll fix since we’re in the context of the Bandersnatch curve, but the same idea can be applied to other values.
Let’s define as a primitive -th root of unity in the finite field, which means it is a generator of the cyclic group of the -th root of unity subgroup. Since is a -th root of unity then there’s an even such that:
The idea now is to solve the discrete logarithm problem to find .
Actually, instead of finding , what we ultimately need is solving for , since we want to know so we can calculate :
In summary, if we know , we can find .
Solve the dlog with precomputed tables
Since the order of the group is we can rewrite as:
where are 8-bit integers. We could use 16-bit integers or any other size, having a different tradeoff between the pre-computed table sizes and compute speed. Thus, we can do the following rewrite:
We’ve now transformed finding to finding . Now is a good moment to remember that we want the negative values of these numbers to calculate . Let’s try to keep this in mind.
Let’s exponentiate both sides by :
Recall that is a -th root of unity, so the first three terms in the exponent are , thus:
The right-hand side is a number that we can calculate since we know and . Now we’re in a better position than where we started. We must solve the discrete logarithm problem for an 8-bit number instead of a 32-bit number.
The fact that we chose to be an 8-bit number is quite helpful since we could precompute a table with all possible values and their result. Then, to solve this last equation, we can do a reverse lookup and find in the table (yes, the negative dlog since it’s what we need)
To do this, let's define where , which is the same equation we have before. Saying it differently, having this pre-computed table, we can do the following lookup to solve for :
Note that is a primitive -th of unity, so the table size (i.e: all values) is the order of the group (256), which is tractable.
Now we know , the least-significant 8-bits of . But we still need to figure out , , and .
Let’s remember this previous equation, where now we know :
Let’s apply the same idea as we did before, but instead of exponentiate by let’s do it by :
Applying the same logic regarding being a -th root of unity:
Note that we know so now we’re solving for :
Hopefully, this looks familiar since it’s a similar equation we solved when calculating .
If we continue with this process, we’ll end up with the following:
- (exponentiate by )
- (no exponentiation, just clear out )
We’d have to solve these equations in that order since each one depends on results from the previous one.
We’ve already explained how the first bullet is solved using the table. Note that for the rest of the bullets, the right-hand side is a number that we can always compute since it only depends on previous results.
Let’s analyze the last bullet since it might seem the most complex, and the other cases would be resolved similarly. So we have the equation:
We have to do this in two steps:
- Solve the right-hand side; let’s call it .
- Now that we have , solve for .
We know how to solve Step 2.; use the table.
Step 1. involves calculating some things that might seem like a lot of work, but we won’t do any calculations but use another set of tables that we should precompute:
- where for all 8-bit .
- where for all 8-bit
- where for all 8-bit
Each table has 256 entries, so it’s reasonable to compute and store in memory.
So calculating means looking each term in its corresponding table and doing three multiplications. Note that in this process, we also need for . You should compute this with some chained squaring instead of calculating each number independently which is wasteful. So, in reverse order compared to how I explained things.
I hope this article shines some light on a cool idea to use precomputed tables to use Tonelli-Shanks in a fixed prime when is big.
- Don’t worry if you don’t know what two-adicity means. I didn’t know either when I started digging into this. This article from Cryptologie is very helpful.
- If you find it confusing or hard to understand, it might be probably a lack of other knowledge that is given for granted. If you’re struggling a bit, send me an email, and maybe I can help since I was in that same starting position!
- A primitive n-th root of unity is an element such that and isn’t true for .
- If this isn’t the case, doesn’t exist.