Ignacio Hagopian (@jsign) blog
    home
    📃

    Tonelli-Shanks with precomputed dlog tables

    • Intro
    • The problem
    • High-level idea & strategy
    • Solve the dlog with precomputed tables
    • Conclusion

    Update (2023-03-23): I’ve created a document with more details about performance results.

    Intro

    Some weeks ago, I was on the monthly Verkle Trie sync call, which tracks progress around the EIP.

    While chatting about performance bottlenecks, I touched on a reasonable amount of CPU burn while loading compressed elliptic curve points since that involved calculating the YYY coordinate from XXX, which requires some heavy CPU operations. These heavy operations are square root calculations in a finite field using, presumably, Tonelli-Shanks algorithm. At that point, we discussed some experiments around saving these points in the uncompressed form to load both coordinates in memory without any extra computation. I won’t dive if that’s a good or bad idea, but the chat went in that direction.

    Curiously, around the same time and without knowing, this other Twitter discussion was happening. In a nutshell, another person notices the same challenge from another angle/use case. In that thread, you can see that Antonio mentions an optimization for Tonelli-Shanks mentioned in the Wikipedia article. That thread gives some rough ideas of how this would work, but while digging into the details, I realized it isn’t that trivial unless you’ve seen some ideas before.

    This optimization opportunity for Tonelli-Shanks had some discussion internally since it was precisely the same discussion we had the day before in the Verkle Trie sync call. This optimization is only implemented (AFAIK) in the Bandersnatch repository coded by Gottfried Herold.

    I took some time to figure out how this idea works by looking at the implementation from Gottfried Herold, which is based on some ideas mentioned in Faster square roots in annoying finite fields from Berenstein. This article tries to explain this in more detail. It was probably an overkill to deduct all this by reverse engineering cryptography-related code, but the paper only sketched an idea. In any case, it was quite instructive since reverse engineering is good for learning stuff, and Gottfried added helpful comments in the code! To repeat, I’m not describing a new idea or discovering anything new; just unpacking a nice idea for others to know and understand without making the same effort.

    Apparently, this idea isn’t currently used in libraries such as gnark or arkworks, which apply plain Tonelli-Shanks. Depending on the two-addicity0^00 of the base field of the curve, this precomputing should produce a decent speedup with not that much memory overhead. For “complicated” curves with high two-addicity (worst case scenario) probably is good to apply what’s described here. If the two-addicity isn’t big, you could have a simpler table and avoid some complexity.

    For go-verkle we use gnark, so it would be nice to include this optimization there. In our benchmarks, the amount of CPU burned doing sqrts is high. Soon, I plan to port Gottfried implementation to use gnark base operations and check how much the speedup is for our case. That will avoid a lot of CPU burn, and with more time, we can upstream the optimization to gnark directly if possible.

    The problem

    Let’s refresh what we are trying to do. We have a field element nnn and want to calculate nmod  p\sqrt{n}\mod {p}n​modp using Tonelli-Shanks. We can code the algorithm and be happy, but if you have a fixed ppp , maybe you can do something faster.

    I won’t explain how Tonelli-Shanks works because the Wikipedia page does a reasonable job. It isn’t a trivial algorithm if you don’t have some background in group theory, roots of unity, or similar things. I didn’t know all the needed background, but it was a great excuse to learn more.

    Roughly, the crux of Tonelli-Shanks starts with some known RRR and the following equation:

    R2≡n⋅nQmod  pR^2 \equiv n \cdot n^Q\mod{p}R2≡n⋅nQmodp

    Then, through some iterations, discover some 1k2\frac{1}{k^2}k21​ such that nQ⋅1k2≡1mod  pn^Q\cdot \frac{1}{k^2} \equiv 1 \mod{p}nQ⋅k21​≡1modp. If we multiply both sides by 1k2\frac{1}{k^2}k21​, it means our square root of nnn is Rkmod  p.\frac{R}{k} \mod {p}.kR​modp. One of the main bottlenecks is having to do many squaring (i.e: t2t^2t2) to find 1k2\frac{1}{k^2}k21​.

    High-level idea & strategy

    Coming from Tonelli-Shanks, we want to solve this situation for some known RRR:

    R2≡n⋅nQmod  pR^2 \equiv n \cdot n^Q\mod{p}R2≡n⋅nQmodp

    If we could find bbb such that b2≡nQmod  pb^2 \equiv n^Q \mod{p}b2≡nQmodp we’re done since:

    R2≡n⋅b2mod  p(Rb)2≡nmod  pR^2 \equiv n \cdot b^2 \mod{p} \newline \left(\frac{R}{b}\right)^2 \equiv n \mod{p}R2≡n⋅b2modp(bR​)2≡nmodp

    Tonelli-Shanks come up with bbb with a series of iterations doing squarings. I think the Core Ideas section of Wikipedia provides a good reference1^11. Let’s think for a moment if we could find this b2b^2b2 “in a single shot” instead of doing so many iterations in Tonelli-Shanks.

    First, let’s look at what Antonio Sanso referenced pointing at the Generalizations section:

    If many square-roots must be done in the same cyclic group and S is not too large, a table of square-roots of the elements of 2-power order can be prepared in advance and the algorithm simplified and sped up as follows.

    So that looks promising since we’re in a situation where we want to compute many square roots in the same cyclic group. The problem appears in what I highlighted in bold.

    In the current Verkle Trie EIP, we’re using the Bandersnatch elliptic curve. For this curve, p−1=Q.2Sp-1 = Q.2^Sp−1=Q.2S means S=32S = 32S=32. As we’ll see soon, this means creating a table of 2312^{31}231 elements which is too big. If SSS was smaller, you could directly do what’s mentioned in the above quote and move on.

    OK, precomputing a table with 2312^{31}231 elements is discarded; what is this other idea? If we look at our original problem again, we want to find bbb such that:

    b2≡nQmod  pb^2 \equiv n^Q \mod{p}b2≡nQmodp

    First, note that nQn^QnQ is a 2S−12^{S-1}2S−1-th root of unity. This is true since nQ⋅2S2≡np−12≡1mod  pn^\frac{Q\cdot2^{S}}{2} \equiv n^{\frac{p-1}{2}} \equiv 1 \mod{p}n2Q⋅2S​≡n2p−1​≡1modp because of Euler’s criterion. This was a necessary condition checked by Tonelli-Shanks to know that our original nnn has a square root, so:

    nQ⋅2S2≡(nQ)2S−1≡1mod  pn^\frac{Q\cdot2^{S}}{2} \equiv \left(n^Q\right)^{2^{S-1}} \equiv 1 \mod{p} \newlinen2Q⋅2S​≡(nQ)2S−1≡1modp

    so nQn^QnQ is a 2S−12^{S-1}2S−1-th root of unity. In the explanation, we’ll fix S=32S=32S=32 since we’re in the context of the Bandersnatch curve, but the same idea can be applied to other SSS values.

    Let’s define ggg as a primitive 2322^{32}232-th root of unity2^22 in the finite field, which means it is a generator of the cyclic group of the 2322^{32}232-th root of unity subgroup. Since nQn^QnQ is a 2S−12^{S-1}2S−1-th root of unity then there’s an even3^33 xxx such that:

    b2≡gx≡nQmod  pb^2 \equiv g^x \equiv n^Q \mod{p}b2≡gx≡nQmodp

    The idea now is to solve the discrete logarithm problem to find xxx.

    Actually, instead of finding xxx , what we ultimately need is solving for −x-x−x, since we want to know 1b\frac{1}{b}b1​ so we can calculate (Rb)2\left(\frac{R}{b}\right)^2(bR​)2:

    1b≡g−x2mod  p\frac{1}{b} \equiv g^{\frac{-x}{2}} \mod{p}b1​≡g2−x​modp

    In summary, if we know −x-x−x, we can find 1b\frac{1}{b}b1​ .

    Solve the dlog with precomputed tables

    Since the order of the group is 2322^{32}232 we can rewrite xxx as:

    224x0+216x1+28x2+x32^{24}x_0+2^{16}x_1+2^{8}x_2+x_3224x0​+216x1​+28x2​+x3​

    where x0,x1,x2,x3x_0, x_1, x_2, x_3x0​,x1​,x2​,x3​ are 8-bit integers. We could use 16-bit integers or any other size, having a different tradeoff between the pre-computed table sizes and compute speed. Thus, we can do the following rewrite:

    b2≡g224x0+216x1+28x2+x3≡nQmod  pb^2 \equiv g^{2^{24}x_0+2^{16}x_1+2^{8}x_2+x_3} \equiv n^Q \mod{p}b2≡g224x0​+216x1​+28x2​+x3​≡nQmodp

    We’ve now transformed finding xxx to finding x0,x1,x2,x3x_0, x_1, x_2, x_3x0​,x1​,x2​,x3​. Now is a good moment to remember that we want the negative values of these numbers to calculate 1b2\frac{1}{b^2}b21​. Let’s try to keep this in mind.

    Let’s exponentiate both sides by 2242^{24}224:

    g248x0+240x1+232x2+224x3≡nQ⋅224mod  pg^{2^{48}x_0+2^{40}x_1+2^{32}x_2+2^{24}x_3} \equiv n^{Q\cdot 2^{24}} \mod{p}g248x0​+240x1​+232x2​+224x3​≡nQ⋅224modp

    Recall that ggg is a 2322^{32}232-th root of unity, so the first three terms in the exponent are ≡1mod  p\equiv 1 \mod{p}≡1modp, thus:

    g224x3≡nQ⋅224mod  pg^{2^{24}x_3} \equiv n^{Q\cdot 2^{24}} \mod{p}g224x3​≡nQ⋅224modp

    The right-hand side is a number that we can calculate since we know nnn and QQQ. Now, we’re in a better position than where we started. We must solve the discrete logarithm problem for an 8-bit number x3x_3x3​ instead of a 32-bit number.

    The fact that we chose x3x_3x3​ to be an 8-bit number is quite helpful since we could precompute a table with all possible x3x_3x3​ values and their result. Then, to solve this last equation, we can do a reverse lookup and find −x3-x_3−x3​ in the table (yes, the negative dlog since it’s what we need)

    To do this, let's define negdlog24[y]=−knegdlog24[y] = -knegdlog24[y]=−k where g224k=yg^{2^{24} k} = yg224k=y, which is the same equation we have before4^44. Saying it differently, having this pre-computed table, we can do the following lookup to solve for −x3-x_3−x3​:

    −x3=negdlog24[nQ⋅224]-x_3 = negdlog24[n^{Q \cdot 2^{24}}]−x3​=negdlog24[nQ⋅224]

    Note that g224g^{2^{24}}g224 is a primitive 282^828-th of unity, so the table size (i.e: all yyy values) is the order of the group (256), which is tractable.

    Now we know −x3-x_3−x3​ , the least-significant 8-bits of −x-x−x. But we still need to figure out x0x_0x0​, x1x_1x1​ , and x2x_2x2​.

    Let’s remember this previous equation, where now we know −x3-x_3−x3​:

    g224x0+216x1+28x2+x3≡nQmod  pg^{2^{24}x_0+2^{16}x_1+2^{8}x_2+x_3} \equiv n^Q \mod{p}g224x0​+216x1​+28x2​+x3​≡nQmodp

    Let’s apply the same idea as we did before, but instead of exponentiate by 2242^{24}224 let’s do it by 2162^{16}216:

    g240x0+232x1+224x2+216x3≡nQ⋅216mod  pg^{2^{40}x_0+2^{32}x_1+2^{24}x_2+2^{16}x_3} \equiv n^{Q\cdot 2^{16}} \mod{p}g240x0​+232x1​+224x2​+216x3​≡nQ⋅216modp

    Applying the same logic regarding ggg being a 2322^{32}232-th root of unity:

    g224x2+216x3≡nQ⋅216mod  pg^{2^{24}x_2+2^{16}x_3} \equiv n^{Q\cdot 2^{16}} \mod{p}g224x2​+216x3​≡nQ⋅216modp

    Note that we know −x3-x_3−x3​ so now we’re solving for x2x_2x2​:

    g224x2≡nQ⋅216⋅g216−x3mod  pg^{2^{24}x_2} \equiv n^{Q\cdot 2^{16}}\cdot g^{2^{16}-x_3} \mod{p}g224x2​≡nQ⋅216⋅g216−x3​modp

    Hopefully, this looks familiar since it’s a similar equation we solved when calculating x3x_3x3​.

    If we continue with this process, we’ll end up with the following:

    • g224x3≡nQ⋅224mod  pg^{2^{24}x_3} \equiv n^{Q\cdot 2^{24}} \mod{p}g224x3​≡nQ⋅224modp (explained)
    • g224x2≡nQ⋅216⋅g216(−x3)mod  pg^{2^{24}x_2} \equiv n^{Q\cdot 2^{16}}\cdot g^{2^{16}(-x_3)} \mod{p}g224x2​≡nQ⋅216⋅g216(−x3​)modp (explained)
    • g224x1≡nQ⋅28⋅g216(−x2)⋅g28(−x3)mod  pg^{2^{24}x_1} \equiv n^{Q\cdot 2^{8}}\cdot g^{2^{16}(-x_2)} \cdot g^{2^{8}(-x_3)} \mod{p}g224x1​≡nQ⋅28⋅g216(−x2​)⋅g28(−x3​)modp (exponentiate by 282^828)
    • g224x0≡nQ⋅g216(−x1)⋅g28(−x2)⋅g−x3mod  pg^{2^{24}x_0} \equiv n^{Q}\cdot g^{2^{16}(-x_1)} \cdot g^{2^{8}(-x_2)} \cdot g^{-x_3} \mod{p}g224x0​≡nQ⋅g216(−x1​)⋅g28(−x2​)⋅g−x3​modp (no exponentiation, just clear out x0x_0x0​)

    We’d have to solve these equations in that order since each one depends on results from the previous one.

    We’ve already explained how the first bullet is solved using the negdlog24negdlog24negdlog24 table. Note that for the rest of the bullets, the right-hand side is a number that we can always compute since it only depends on previous results.

    Let’s analyze the last bullet since it might seem the most complex, and the other cases would be resolved similarly. So we have the equation:

    g224x0≡nQ⋅216⋅g216(−x1)⋅g28(−x2)⋅g(−x3)mod  pg^{2^{24}x_0} \equiv n^{Q\cdot 2^{16}}\cdot g^{2^{16}(-x_1)} \cdot g^{2^{8}(-x_2)} \cdot g^{(-x_3)} \mod{p}g224x0​≡nQ⋅216⋅g216(−x1​)⋅g28(−x2​)⋅g(−x3​)modp

    We have to do this in two steps:

    1. Solve the right-hand side; let’s call it rrr.
    2. Now that we have g224x0≡rmod  pg^{2^{24}x_0} \equiv r \mod{p}g224x0​≡rmodp, solve for −x0-x_0−x0​.

    We know how to solve Step 2.; use the negdlog24negdlog24negdlog24 table.

    Step 1. involves calculating some things that might seem like a lot of work, but we won’t do any calculations but use another set of tables that we should precompute:

    • precomp16[x]=yprecomp16[x] = yprecomp16[x]=y where y≡g216ky \equiv g^{2^{16}k}y≡g216k for all 8-bit kkk.
    • precomp08[x]=yprecomp08[x] = yprecomp08[x]=y where y≡g28ky \equiv g^{2^8k}y≡g28k for all 8-bit kkk
    • precomp00[x]=yprecomp00[x] = yprecomp00[x]=y where y≡gky \equiv g^{k}y≡gk for all 8-bit kkk

    Each table has 256 entries, so it’s reasonable to compute and store in memory.

    So calculating g216(−x1)⋅g28(−x2)⋅g(−x3)mod  pg^{2^{16}(-x_1)} \cdot g^{2^{8}(-x_2)} \cdot g^{(-x_3)} \mod{p}g216(−x1​)⋅g28(−x2​)⋅g(−x3​)modp means looking each term in its corresponding table and doing three multiplications. Note that in this process, we also need nQ⋅2kn^{Q\cdot 2^k}nQ⋅2k for k=[24,16,..]k = [24, 16, ..]k=[24,16,..]. You should compute this with some chained squaring instead of calculating each number independently, which is wasteful. So, in reverse order compared to how I explained things.

    Conclusion

    I hope this article shines some light on a cool idea to use precomputed tables to use Tonelli-Shanks in a fixed prime p−1=nQ⋅2Sp-1 = n^Q \cdot 2^{S}p−1=nQ⋅2S when SSS is big.

    0^00 - Don’t worry if you don’t know what two-adicity means. I didn’t know either when I started digging into this. This article from Cryptologie is very helpful.

    1^11 - If you find it confusing or hard to understand, it might be probably a lack of other knowledge that is given for granted.

    2^22 - A primitive n-th root of unity is an element ggg such that gn≡1mod  pg^n \equiv 1 \mod{p}gn≡1modp and gk≡1mod  pg^k \equiv 1 \mod{p}gk≡1modp isn’t true for k<nk < nk<n.

    3^33 - If this isn’t the case, bbb doesn’t exist.

    GitHubX