Update (2023-03-23): Iāve created a document with more details about performance results.
Intro
Some weeks ago, I was on the monthly Verkle Trie sync call, which tracks progress around the EIP.
While chatting about performance bottlenecks, I touched on a reasonable amount of CPU burn while loading compressed elliptic curve points since that involved calculating the coordinate from , which requires some heavy CPU operations. These heavy operations are square root calculations in a finite field using, presumably, Tonelli-Shanks algorithm. At that point, we discussed some experiments around saving these points in the uncompressed form to load both coordinates in memory without any extra computation. I wonāt dive if thatās a good or bad idea, but the chat went in that direction.
Curiously, around the same time and without knowing, this other Twitter discussion was happening. In a nutshell, another person notices the same challenge from another angle/use case. In that thread, you can see that Antonio mentions an optimization for Tonelli-Shanks mentioned in the Wikipedia article. That thread gives some rough ideas of how this would work, but while digging into the details, I realized it isnāt that trivial unless youāve seen some ideas before.
This optimization opportunity for Tonelli-Shanks had some discussion internally since it was precisely the same discussion we had the day before in the Verkle Trie sync call. This optimization is only implemented (AFAIK) in the Bandersnatch repository coded by Gottfried Herold.
I took some time to figure out how this idea works by looking at the implementation from Gottfried Herold, which is based on some ideas mentioned in Faster square roots in annoying finite fields from Berenstein. This article tries to explain this in more detail. It was probably an overkill to deduct all this by reverse engineering cryptography-related code, but the paper only sketched an idea. In any case, it was quite instructive since reverse engineering is good for learning stuff, and Gottfried added helpful comments in the code! To repeat, Iām not describing a new idea or discovering anything new; just unpacking a nice idea for others to know and understand without making the same effort.
Apparently, this idea isnāt currently used in libraries such as gnark or arkworks, which apply plain Tonelli-Shanks. Depending on the two-addicity of the base field of the curve, this precomputing should produce a decent speedup with not that much memory overhead. For ācomplicatedā curves with high two-addicity (worst case scenario) probably is good to apply whatās described here. If the two-addicity isnāt big, you could have a simpler table and avoid some complexity.
For go-verkle we use gnark, so it would be nice to include this optimization there. In our benchmarks, the amount of CPU burned doing sqrts is high. Soon, I plan to port Gottfried implementation to use gnark base operations and check how much the speedup is for our case. That will avoid a lot of CPU burn, and with more time, we can upstream the optimization to gnark directly if possible.
The problem
Letās refresh what we are trying to do. We have a field element and want to calculate using Tonelli-Shanks. We can code the algorithm and be happy, but if you have a fixed , maybe you can do something faster.
I wonāt explain how Tonelli-Shanks works because the Wikipedia page does a reasonable job. It isnāt a trivial algorithm if you donāt have some background in group theory, roots of unity, or similar things. I didnāt know all the needed background, but it was a great excuse to learn more.
Roughly, the crux of Tonelli-Shanks starts with some known and the following equation:
Then, through some iterations, discover some such that . If we multiply both sides by , it means our square root of is One of the main bottlenecks is having to do many squaring (i.e: ) to find .
High-level idea & strategy
Coming from Tonelli-Shanks, we want to solve this situation for some known :
If we could find such that weāre done since:
Tonelli-Shanks come up with with a series of iterations doing squarings. I think the Core Ideas section of Wikipedia provides a good reference. Letās think for a moment if we could find this āin a single shotā instead of doing so many iterations in Tonelli-Shanks.
First, letās look at what Antonio Sanso referenced pointing at the Generalizations section:
If many square-roots must be done in the same cyclic group and S is not too large, a table of square-roots of the elements of 2-power order can be prepared in advance and the algorithm simplified and sped up as follows.
So that looks promising since weāre in a situation where we want to compute many square roots in the same cyclic group. The problem appears in what I highlighted in bold.
In the current Verkle Trie EIP, weāre using the Bandersnatch elliptic curve. For this curve, means . As weāll see soon, this means creating a table of elements which is too big. If was smaller, you could directly do whatās mentioned in the above quote and move on.
OK, precomputing a table with elements is discarded; what is this other idea? If we look at our original problem again, we want to find such that:
First, note that is a -th root of unity. This is true since because of Eulerās criterion. This was a necessary condition checked by Tonelli-Shanks to know that our original has a square root, so:
so is a -th root of unity. In the explanation, weāll fix since weāre in the context of the Bandersnatch curve, but the same idea can be applied to other values.
Letās define as a primitive -th root of unity in the finite field, which means it is a generator of the cyclic group of the -th root of unity subgroup. Since is a -th root of unity then thereās an even such that:
The idea now is to solve the discrete logarithm problem to find .
Actually, instead of finding , what we ultimately need is solving for , since we want to know so we can calculate :
In summary, if we know , we can find .
Solve the dlog with precomputed tables
Since the order of the group is we can rewrite as:
where are 8-bit integers. We could use 16-bit integers or any other size, having a different tradeoff between the pre-computed table sizes and compute speed. Thus, we can do the following rewrite:
Weāve now transformed finding to finding . Now is a good moment to remember that we want the negative values of these numbers to calculate . Letās try to keep this in mind.
Letās exponentiate both sides by :
Recall that is a -th root of unity, so the first three terms in the exponent are , thus:
The right-hand side is a number that we can calculate since we know and . Now, weāre in a better position than where we started. We must solve the discrete logarithm problem for an 8-bit number instead of a 32-bit number.
The fact that we chose to be an 8-bit number is quite helpful since we could precompute a table with all possible values and their result. Then, to solve this last equation, we can do a reverse lookup and find in the table (yes, the negative dlog since itās what we need)
To do this, let's define where , which is the same equation we have before. Saying it differently, having this pre-computed table, we can do the following lookup to solve for :
Note that is a primitive -th of unity, so the table size (i.e: all values) is the order of the group (256), which is tractable.
Now we know , the least-significant 8-bits of . But we still need to figure out , , and .
Letās remember this previous equation, where now we know :
Letās apply the same idea as we did before, but instead of exponentiate by letās do it by :
Applying the same logic regarding being a -th root of unity:
Note that we know so now weāre solving for :
Hopefully, this looks familiar since itās a similar equation we solved when calculating .
If we continue with this process, weāll end up with the following:
- (explained)
- (explained)
- (exponentiate by )
- (no exponentiation, just clear out )
Weād have to solve these equations in that order since each one depends on results from the previous one.
Weāve already explained how the first bullet is solved using the table. Note that for the rest of the bullets, the right-hand side is a number that we can always compute since it only depends on previous results.
Letās analyze the last bullet since it might seem the most complex, and the other cases would be resolved similarly. So we have the equation:
We have to do this in two steps:
- Solve the right-hand side; letās call it .
- Now that we have , solve for .
We know how to solve Step 2.; use the table.
Step 1. involves calculating some things that might seem like a lot of work, but we wonāt do any calculations but use another set of tables that we should precompute:
- where for all 8-bit .
- where for all 8-bit
- where for all 8-bit
Each table has 256 entries, so itās reasonable to compute and store in memory.
So calculating means looking each term in its corresponding table and doing three multiplications. Note that in this process, we also need for . You should compute this with some chained squaring instead of calculating each number independently, which is wasteful. So, in reverse order compared to how I explained things.
Conclusion
I hope this article shines some light on a cool idea to use precomputed tables to use Tonelli-Shanks in a fixed prime when is big.
- Donāt worry if you donāt know what two-adicity means. I didnāt know either when I started digging into this. This article from Cryptologie is very helpful.
- If you find it confusing or hard to understand, it might be probably a lack of other knowledge that is given for granted.
- A primitive n-th root of unity is an element such that and isnāt true for .
- If this isnāt the case, doesnāt exist.