Update (2023-03-23): Iâve created a document with more details about performance results.
Intro
Some weeks ago, I was on the monthly Verkle Trie sync call, which tracks progress around the EIP.
While chatting about performance bottlenecks, I touched on a reasonable amount of CPU burn while loading compressed elliptic curve points since that involved calculating the coordinate from , which requires some heavy CPU operations. These heavy operations are square root calculations in a finite field using, presumably, Tonelli-Shanks algorithm. At that point, we discussed some experiments around saving these points in the uncompressed form to load both coordinates in memory without any extra computation. I wonât dive if thatâs a good or bad idea, but the chat went in that direction.
Curiously, around the same time and without knowing, this other Twitter discussion was happening. In a nutshell, another person notices the same challenge from another angle/use case. In that thread, you can see that Antonio mentions an optimization for Tonelli-Shanks mentioned in the Wikipedia article. That thread gives some rough ideas of how this would work, but while digging into the details, I realized it isnât that trivial unless youâve seen some ideas before.
This optimization opportunity for Tonelli-Shanks had some discussion internally since it was precisely the same discussion we had the day before in the Verkle Trie sync call. This optimization is only implemented (AFAIK) in the Bandersnatch repository coded by Gottfried Herold.
I took some time to figure out how this idea works by looking at the implementation from Gottfried Herold, which is based on some ideas mentioned in Faster square roots in annoying finite fields from Berenstein. This article tries to explain this in more detail. It was probably an overkill to deduct all this by reverse engineering cryptography-related code, but the paper only sketched an idea. In any case, it was quite instructive since reverse engineering is good for learning stuff, and Gottfried added helpful comments in the code! To repeat, Iâm not describing a new idea or discovering anything new; just unpacking a nice idea for others to know and understand without making the same effort.
Apparently, this idea isnât currently used in libraries such as gnark or arkworks, which apply plain Tonelli-Shanks. Depending on the two-addicity of the base field of the curve, this precomputing should produce a decent speedup with not that much memory overhead. For âcomplicatedâ curves with high two-addicity (worst case scenario) probably is good to apply whatâs described here. If the two-addicity isnât big, you could have a simpler table and avoid some complexity.
For go-verkle we use gnark, so it would be nice to include this optimization there. In our benchmarks, the amount of CPU burned doing sqrts is high. Soon, I plan to port Gottfried implementation to use gnark base operations and check how much the speedup is for our case. That will avoid a lot of CPU burn, and with more time, we can upstream the optimization to gnark directly if possible.
The problem
Letâs refresh what we are trying to do. We have a field element and want to calculate using Tonelli-Shanks. We can code the algorithm and be happy, but if you have a fixed , maybe you can do something faster.
I wonât explain how Tonelli-Shanks works because the Wikipedia page does a reasonable job. It isnât a trivial algorithm if you donât have some background in group theory, roots of unity, or similar things. I didnât know all the needed background, but it was a great excuse to learn more.
Roughly, the crux of Tonelli-Shanks starts with some known and the following equation:
Then, through some iterations, discover some such that . If we multiply both sides by , it means our square root of is One of the main bottlenecks is having to do many squaring (i.e: ) to find .
High-level idea & strategy
Coming from Tonelli-Shanks, we want to solve this situation for some known :
If we could find such that weâre done since:
Tonelli-Shanks come up with with a series of iterations doing squarings. I think the Core Ideas section of Wikipedia provides a good reference. Letâs think for a moment if we could find this âin a single shotâ instead of doing so many iterations in Tonelli-Shanks.
First, letâs look at what Antonio Sanso referenced pointing at the Generalizations section:
If many square-roots must be done in the same cyclic group and S is not too large, a table of square-roots of the elements of 2-power order can be prepared in advance and the algorithm simplified and sped up as follows.
So that looks promising since weâre in a situation where we want to compute many square roots in the same cyclic group. The problem appears in what I highlighted in bold.
In the current Verkle Trie EIP, weâre using the Bandersnatch elliptic curve. For this curve, means . As weâll see soon, this means creating a table of elements which is too big. If was smaller, you could directly do whatâs mentioned in the above quote and move on.
OK, precomputing a table with elements is discarded; what is this other idea? If we look at our original problem again, we want to find such that:
First, note that is a -th root of unity. This is true since because of Eulerâs criterion. This was a necessary condition checked by Tonelli-Shanks to know that our original has a square root, so:
so is a -th root of unity. In the explanation, weâll fix since weâre in the context of the Bandersnatch curve, but the same idea can be applied to other values.
Letâs define as a primitive -th root of unity in the finite field, which means it is a generator of the cyclic group of the -th root of unity subgroup. Since is a -th root of unity then thereâs an even such that:
The idea now is to solve the discrete logarithm problem to find .
Actually, instead of finding , what we ultimately need is solving for , since we want to know so we can calculate :
In summary, if we know , we can find .
Solve the dlog with precomputed tables
Since the order of the group is we can rewrite as:
where are 8-bit integers. We could use 16-bit integers or any other size, having a different tradeoff between the pre-computed table sizes and compute speed. Thus, we can do the following rewrite:
Weâve now transformed finding to finding . Now is a good moment to remember that we want the negative values of these numbers to calculate . Letâs try to keep this in mind.
Letâs exponentiate both sides by :
Recall that is a -th root of unity, so the first three terms in the exponent are , thus:
The right-hand side is a number that we can calculate since we know and . Now, weâre in a better position than where we started. We must solve the discrete logarithm problem for an 8-bit number instead of a 32-bit number.
The fact that we chose to be an 8-bit number is quite helpful since we could precompute a table with all possible values and their result. Then, to solve this last equation, we can do a reverse lookup and find in the table (yes, the negative dlog since itâs what we need)
To do this, let's define where , which is the same equation we have before. Saying it differently, having this pre-computed table, we can do the following lookup to solve for :
Note that is a primitive -th of unity, so the table size (i.e: all values) is the order of the group (256), which is tractable.
Now we know , the least-significant 8-bits of . But we still need to figure out , , and .
Letâs remember this previous equation, where now we know :
Letâs apply the same idea as we did before, but instead of exponentiate by letâs do it by :
Applying the same logic regarding being a -th root of unity:
Note that we know so now weâre solving for :
Hopefully, this looks familiar since itâs a similar equation we solved when calculating .
If we continue with this process, weâll end up with the following:
- (explained)
- (explained)
- (exponentiate by )
- (no exponentiation, just clear out )
Weâd have to solve these equations in that order since each one depends on results from the previous one.
Weâve already explained how the first bullet is solved using the table. Note that for the rest of the bullets, the right-hand side is a number that we can always compute since it only depends on previous results.
Letâs analyze the last bullet since it might seem the most complex, and the other cases would be resolved similarly. So we have the equation:
We have to do this in two steps:
- Solve the right-hand side; letâs call it .
- Now that we have , solve for .
We know how to solve Step 2.; use the table.
Step 1. involves calculating some things that might seem like a lot of work, but we wonât do any calculations but use another set of tables that we should precompute:
- where for all 8-bit .
- where for all 8-bit
- where for all 8-bit
Each table has 256 entries, so itâs reasonable to compute and store in memory.
So calculating means looking each term in its corresponding table and doing three multiplications. Note that in this process, we also need for . You should compute this with some chained squaring instead of calculating each number independently, which is wasteful. So, in reverse order compared to how I explained things.
Conclusion
I hope this article shines some light on a cool idea to use precomputed tables to use Tonelli-Shanks in a fixed prime when is big.
- Donât worry if you donât know what two-adicity means. I didnât know either when I started digging into this. This article from Cryptologie is very helpful.
- If you find it confusing or hard to understand, it might be probably a lack of other knowledge that is given for granted.
- A primitive n-th root of unity is an element such that and isnât true for .
- If this isnât the case, doesnât exist.