Update (2023-03-23): I’ve created a document with more details about performance results.

# Intro

Some weeks ago, I was on the monthly Verkle Trie sync call, which tracks progress around the EIP.

While chatting about performance bottlenecks, I touched on a reasonable amount of CPU burn while loading compressed elliptic curve points since that involved calculating the $Y$ coordinate from $X$, which requires some heavy CPU operations. These heavy operations are square root calculations in a finite field using, presumably, Tonelli-Shanks algorithm. At that point, we discussed some experiments around saving these points in the uncompressed form to load both coordinates in memory without any extra computation. I won’t dive if that’s a good or bad idea, but the chat went in that direction.

Curiously, around the same time and without knowing, this other Twitter discussion was happening. In a nutshell, another person notices the same challenge from another angle/use case. In that thread, you can see that Antonio mentions an optimization for Tonelli-Shanks mentioned in the Wikipedia article. That thread gives some rough ideas of how this would work, but while digging into the details, I realized it isn’t that trivial unless you’ve seen some ideas before.

This optimization opportunity for Tonelli-Shanks had some discussion internally since it was precisely the same discussion we had the day before in the Verkle Trie sync call. This optimization is only implemented (AFAIK) in the Bandersnatch repository coded by Gottfried Herold.

I took some time to figure out how this idea works by looking at the implementation from Gottfried Herold, which is based on some ideas mentioned in Faster square roots in annoying finite fields from Berenstein. This article tries to explain this in more detail. It was probably an overkill to deduct all this by reverse engineering cryptography-related code, but the paper only sketched an idea. In any case, it was quite instructive since reverse engineering is good for learning stuff, and Gottfried added helpful comments in the code! To repeat, I’m not describing a new idea or discovering anything new; just unpacking a nice idea for others to know and understand without making the same effort.

Apparently, this idea isn’t currently used in libraries such as gnark or arkworks, which apply plain Tonelli-Shanks. Depending on the *two-addicity*$^0$ of the base field of the curve, this precomputing should produce a decent speedup with not that much memory overhead. For “complicated” curves with high *two-addicity *(worst case scenario) probably is good to apply what’s described here. If the *two-addicity* isn’t big, you could have a simpler table and avoid some complexity.

For go-verkle we use *gnark*, so it would be nice to include this optimization there. In our benchmarks, the amount of CPU burned doing sqrts is high. Soon, I plan to port Gottfried implementation to use *gnark* base operations and check how much the speedup is for our case. That will avoid a lot of CPU burn, and with more time, we can upstream the optimization to *gnark* directly if possible.

# The problem

Let’s refresh what we are trying to do. We have a field element $n$ and want to calculate $\sqrt{n}\mod {p}$ using Tonelli-Shanks. We can code the algorithm and be happy, but if you have a fixed $p$ , maybe you can do something faster.

I won’t explain how Tonelli-Shanks works because the Wikipedia page does a reasonable job. It isn’t a trivial algorithm if you don’t have some background in group theory, roots of unity, or similar things. I didn’t know all the needed background, but it was a great excuse to learn more.

Roughly, the crux of Tonelli-Shanks starts with some known $R$ and the following equation:

Then, through some iterations, discover some $\frac{1}{k^2}$ such that $n^Q\cdot \frac{1}{k^2} \equiv 1 \mod{p}$. If we multiply both sides by $\frac{1}{k^2}$, it means our square root of $n$ is $\frac{R}{k} \mod {p}.$ One of the main bottlenecks is having to do many squaring (i.e: $t^2$) to find $\frac{1}{k^2}$.

# High-level idea & strategy

Coming from Tonelli-Shanks, we want to solve this situation for some known $R$:

If we could find $b$ such that $b^2 \equiv n^Q \mod{p}$ we’re done since:

Tonelli-Shanks come up with $b$ with a series of iterations doing squarings. I think the Core Ideas section of Wikipedia provides a good reference$^1$. Let’s think for a moment if we could find this $b^2$ “in a single shot” instead of doing so many iterations in Tonelli-Shanks.

First, let’s look at what Antonio Sanso referenced pointing at the Generalizations section:

If many square-roots must be done in the same cyclic groupand S is not too large, a table of square-roots of the elements of 2-power order can be prepared in advance and the algorithm simplified and sped up as follows.

So that looks promising since we’re in a situation where we want to compute many square roots in the same cyclic group. The problem appears in what I highlighted in bold.

In the current Verkle Trie EIP, we’re using the Bandersnatch elliptic curve. For this curve, $p-1 = Q.2^S$ means $S = 32$. As we’ll see soon, this means creating a table of $2^{31}$ elements which is too big. If $S$ was smaller, you could directly do what’s mentioned in the above quote and move on.

OK, precomputing a table with $2^{31}$ elements is discarded; what is this other idea? If we look at our original problem again, we want to find $b$ such that:

First, note that $n^Q$ is a $2^{S-1}$-th root of unity. This is true since $n^\frac{Q\cdot2^{S}}{2} \equiv n^{\frac{p-1}{2}} \equiv 1 \mod{p}$ because of Euler’s criterion. This was a necessary condition checked by Tonelli-Shanks to know that our original $n$ has a square root, so:

so $n^Q$ is a $2^{S-1}$-th root of unity. In the explanation, we’ll fix $S=32$ since we’re in the context of the Bandersnatch curve, but the same idea can be applied to other $S$ values.

Let’s define $g$ as a __primitive__ $2^{32}$-th root of unity$^2$ in the finite field, which means it is a generator of the cyclic group of the $2^{32}$-th root of unity subgroup. Since $n^Q$ is a $2^{S-1}$-th root of unity then there’s an even$^3$ $x$ such that:

The idea now is to solve the discrete logarithm problem to find $x$.

Actually, instead of finding $x$ , what we ultimately need is solving for $-x$, since we want to know $\frac{1}{b}$ so we can calculate $\left(\frac{R}{b}\right)^2$:

In summary, if we know $-x$, we can find $\frac{1}{b}$ .

# Solve the dlog with precomputed tables

Since the order of the group is $2^{32}$ we can rewrite $x$ as:

where $x_0, x_1, x_2, x_3$ are 8-bit integers. We could use 16-bit integers or any other size, having a different tradeoff between the pre-computed table sizes and compute speed. Thus, we can do the following rewrite:

We’ve now transformed finding $x$ to finding $x_0, x_1, x_2, x_3$. Now is a good moment to remember that we want the negative values of these numbers to calculate $\frac{1}{b^2}$. Let’s try to keep this in mind.

Let’s exponentiate both sides by $2^{24}$:

Recall that $g$ is a $2^{32}$-th root of unity, so the first three terms in the exponent are $\equiv 1 \mod{p}$, thus:

The right-hand side is a number that we can calculate since we know $n$ and $Q$. Now, we’re in a better position than where we started. We must solve the discrete logarithm problem for an 8-bit number $x_3$ instead of a 32-bit number.

The fact that we chose $x_3$ to be an 8-bit number is quite helpful since we could precompute a table with all possible $x_3$ values and their result. Then, to solve this last equation, we can do a *reverse lookup* and find $-x_3$ in the table (yes, the negative dlog since it’s what we need)

To do this, let's define $negdlog24[y] = -k$ where $g^{2^{24} k} = y$, which is the same equation we have before$^4$. Saying it differently, having this pre-computed table, we can do the following lookup to solve for $-x_3$:

Note that $g^{2^{24}}$ is a primitive $2^8$-th of unity, so the table size (i.e: all $y$ values) is the order of the group (256), which is tractable.

Now we know $-x_3$ , the least-significant 8-bits of $-x$. But we still need to figure out $x_0$, $x_1$ , and $x_2$.

Let’s remember this previous equation, where now we **know** $-x_3$:

Let’s apply the same idea as we did before, but instead of exponentiate by $2^{24}$ let’s do it by $2^{16}$:

Applying the same logic regarding $g$ being a $2^{32}$-th root of unity:

Note that we know $-x_3$ so now we’re solving for $x_2$:

Hopefully, this looks familiar since it’s a similar equation we solved when calculating $x_3$.

If we continue with this process, we’ll end up with the following:

- $g^{2^{24}x_3} \equiv n^{Q\cdot 2^{24}} \mod{p}$ (explained)
- $g^{2^{24}x_2} \equiv n^{Q\cdot 2^{16}}\cdot g^{2^{16}-x_3} \mod{p}$ (explained)
- $g^{2^{24}x_1} \equiv n^{Q\cdot 2^{8}}\cdot g^{2^{16}-x_2} \cdot g^{2^{8}-x_3} \mod{p}$ (exponentiate by $2^8$)
- $g^{2^{24}x_0} \equiv n^{Q}\cdot g^{2^{16}-x_1} \cdot g^{2^{8}-x_2} \cdot g^{-x_3} \mod{p}$ (no exponentiation, just clear out $x_0$)

We’d have to solve these equations in that order since each one depends on results from the previous one.

We’ve already explained how the first bullet is solved using the $negdlog24$ table. Note that for the rest of the bullets, the right-hand side is a number that we can always compute since it only depends on previous results.

Let’s analyze the last bullet since it might seem the most complex, and the other cases would be resolved similarly. So we have the equation:

We have to do this in two steps:

- Solve the right-hand side; let’s call it $r$.
- Now that we have $g^{2^{24}x_0} \equiv r \mod{p}$, solve for $-x_0$.

We know how to solve Step 2.; use the $negdlog24$ table.

Step 1. involves calculating some things that might seem like a lot of work, but we won’t do any calculations but use another set of tables that we should precompute:

- $precomp16[x] = y$ where $y \equiv g^{2^{16}k}$ for all 8-bit $k$.
- $precomp08[x] = y$ where $y \equiv g^{2^8k}$ for all 8-bit $k$
- $precomp00[x] = y$ where $y \equiv g^{k}$ for all 8-bit $k$

Each table has 256 entries, so it’s reasonable to compute and store in memory.

So calculating $g^{2^{16}-x_1} \cdot g^{2^{8}-x_2} \cdot g^{-x_3} \mod{p}$ means looking each term in its corresponding table and doing three multiplications. Note that in this process, we also need $n^{Q\cdot 2^k}$ for $k = [24, 16, ..]$. You should compute this with some chained squaring instead of calculating each number independently, which is wasteful. So, in *reverse* order compared to how I explained things.

# Conclusion

I hope this article shines some light on a cool idea to use precomputed tables to use Tonelli-Shanks in a fixed prime $p-1 = n^Q \cdot 2^{S}$ when $S$ is big.

$^0$ - Don’t worry if you don’t know what *two-adicity* means. I didn’t know either when I started digging into this. This article from Cryptologie is very helpful.

$^1$ - If you find it confusing or hard to understand, it might be probably a lack of other knowledge that is given for granted.

$^2$ - A primitive n-th root of unity is an element $g$ such that $g^n \equiv 1 \mod{p}$ and $g^k \equiv 1 \mod{p}$ isn’t true for $k < n$.

$^3$ - If this isn’t the case, $b$ doesn’t exist.