Let me propose an design for field_element_mul that I think may yield better costs.
Step 1: full_multiply_128
full_multiply is designed to be used recursively to produce multiplication such as
full_multiply_128(a: (u64, u64), b: (u64, u64), c: (u64, u64), d: (u64, u64)) -> ((u64, u64), (u64, u64))
that returns a 256-bit value (note the nesting of the product types).
For reference the specification of full_multiply(a,b,c,d) is that it returns a × b + c + d.
The implementation of full_multiply_128 is roughly
let (hi_a, lo_a) = a;
let (hi_b, lo_b) = b;
let (hi_c, lo_c) = c;
let (hi_d, lo_d) = d;
let (hi_1, lo_1) = full_multiply_64(lo_a, lo_b, lo_c, lo_d);
let (hi_2, lo_2) = full_multiply_64(lo_a, hi_b, hi_c, hi_d);
let (hi_3, lo_3) = full_multiply_64(hi_a, lo_b, hi_1, lo_2);
let hi_res = full_multiply_64(hi_a, hi_b, hi_2, hi_3);
return (hi_res, (lo_3, lo_1)).
Step 2: full_multiply_256
Using the same technique in Step 1, implement full_multiply_256, returning a 512-bit value, by using four calls to full_multiply_128.
Alternatively you can just implement multiply_256 since you won't actually be needing the full multiply.
Step 3: div_mod_256_128
Like full_multiply, the div_mod_128_64 jet is intended to be recursively use to create quotient-remainder computations on wider and wider integer sizes. There are some pre-conditions on this jet, e.g. div_mod_128_64((ahi, alo), b) requires that the highest bit of b is set and that b < ahi, but these won't really cause too many problems.
div_mod_256_128 divides a 256-bit number by a 128-bit number returning a 128-bit quotient and a 128-bit remainder, subject to the same preconditions that div_mod_128_64 has. div_mod_256_128 is actually defined in terms of div_mod_192_128 which returns a 64-bit quotient and a 128-bit remainder.
The algorithm for div_mod_256_128 in terms of div_mod_192_128 is to do two steps of long division. If we want to divide AAAA by BB, where A and B represent 2⁶⁴ bit "digits", We perform one call to div_mod_192_128 to get the high "digit" of the quotient and an intermediate remainder.
* 1. Q
* ______
* BB) AAAA
* XXX
* ---
* RR
Then we "bring down" the last "digit" of A next to our remainder and call div_mod_192_128 to get the low "digit" of the quotient and the final remainder.
* 2. QQ
* ______
* BB) AAA|
* XXX|
* ---|
* RRA
* XXX
* ---
* RR
*
Now we have to implement div_mod_192_128 in terms of the existing div_mod_128_64 jet. Using the language from above we want to divide AAA by BB. Again we have preconditions that require the high bit of BB is set and AAA is less than BB0.
The idea here is to "guess" what the 64-bit quotient is by dividing the high two "digits" in A by the high "digit" in B using div_mod_128_64. For math reasons, this guess will be correct within 2 of the true quotient. We use our guessed quotient to compute a remainder, and if the remainder would be negative, we adjust our guessed quotient and repeat until our remainder comes out positive.
This algorithm is based on the paper "Fast Recursive Division" by Christoph Burnikel and Joachim Ziegler, MPI-I-98-1-022, Oct. 1998. You can also find some details in the Simplicity code comments at
https://github.com/BlockstreamResearch/simplicity/blob/d27b7e6f6b4e41f5f1725b102f8f80edc04a13ba/C/jets.c#L1046-L1203
I also have a implementation of these wide div_mod functions in raw Simplicity at
https://github.com/BlockstreamResearch/simplicity/blob/master/Haskell/Core/Simplicity/Programs/Arith.hs#L240-L282
however, that code may not be decipherable.
Step 4: div_mod_512_256
This step is exactly like in Step 3, except we again double the width of our arguments.
Again you may be able to simplify things a bit by only implementing mod_512_256, since you don't actually need the quotient.
Step 5: field_element_mul
Now we have everything in place to implement field_element_mul. We use full_multiply_256 (or multiply_256) to multiply two 256-bit field elements to produce a 512-bit result. Then we can use div_mod_512_256 to take our 512-bit result and reduce it modulo the required 256-modulus.
Our div_mod_512_256 above has a preconditions that requires that the high bits of a to be less than b; however, the product of already (partially) normalized field elements ought to satisfy this requirement.
There is another more sticky issue is that the BN-254 modulus M doesn't actually have the high bit set as required by div_mod_512_256. There are probably a couple of solutions to this problem. My suggestion would be to actually work with only partially reduced values as much as you can get away with. That is work modulo 3M instead. You can write a specialized normalization function to reduce to M by subtraction whenever it is needed, but hopefully that won't be required very much.
Epilogue
I will say that I would like SimplicityHL language to eventually provide "synthetic" wide integer types along with these sorts of arithmetic operations, but it will take some time before we get there.
Let me propose an design for
field_element_multhat I think may yield better costs.Step 1:
full_multiply_128full_multiplyis designed to be used recursively to produce multiplication such asthat returns a 256-bit value (note the nesting of the product types).
For reference the specification of
full_multiply(a,b,c,d)is that it returnsa×b+c+d.The implementation of full_multiply_128 is roughly
Step 2:
full_multiply_256Using the same technique in Step 1, implement
full_multiply_256, returning a 512-bit value, by using four calls tofull_multiply_128.Alternatively you can just implement
multiply_256since you won't actually be needing the full multiply.Step 3:
div_mod_256_128Like
full_multiply, thediv_mod_128_64jet is intended to be recursively use to create quotient-remainder computations on wider and wider integer sizes. There are some pre-conditions on this jet, e.g.div_mod_128_64((ahi, alo), b)requires that the highest bit ofbis set and thatb < ahi, but these won't really cause too many problems.div_mod_256_128divides a 256-bit number by a 128-bit number returning a 128-bit quotient and a 128-bit remainder, subject to the same preconditions thatdiv_mod_128_64has.div_mod_256_128is actually defined in terms ofdiv_mod_192_128which returns a 64-bit quotient and a 128-bit remainder.The algorithm for
div_mod_256_128in terms ofdiv_mod_192_128is to do two steps of long division. If we want to divideAAAAbyBB, whereAandBrepresent 2⁶⁴ bit "digits", We perform one call todiv_mod_192_128to get the high "digit" of the quotient and an intermediate remainder.Then we "bring down" the last "digit" of A next to our remainder and call
div_mod_192_128to get the low "digit" of the quotient and the final remainder.Now we have to implement
div_mod_192_128in terms of the existingdiv_mod_128_64jet. Using the language from above we want to divideAAAbyBB. Again we have preconditions that require the high bit ofBBis set andAAAis less thanBB0.The idea here is to "guess" what the 64-bit quotient is by dividing the high two "digits" in A by the high "digit" in B using
div_mod_128_64. For math reasons, this guess will be correct within 2 of the true quotient. We use our guessed quotient to compute a remainder, and if the remainder would be negative, we adjust our guessed quotient and repeat until our remainder comes out positive.This algorithm is based on the paper "Fast Recursive Division" by Christoph Burnikel and Joachim Ziegler, MPI-I-98-1-022, Oct. 1998. You can also find some details in the Simplicity code comments at
https://github.com/BlockstreamResearch/simplicity/blob/d27b7e6f6b4e41f5f1725b102f8f80edc04a13ba/C/jets.c#L1046-L1203
I also have a implementation of these wide
div_modfunctions in raw Simplicity athttps://github.com/BlockstreamResearch/simplicity/blob/master/Haskell/Core/Simplicity/Programs/Arith.hs#L240-L282
however, that code may not be decipherable.
Step 4:
div_mod_512_256This step is exactly like in Step 3, except we again double the width of our arguments.
Again you may be able to simplify things a bit by only implementing
mod_512_256, since you don't actually need the quotient.Step 5:
field_element_mulNow we have everything in place to implement
field_element_mul. We usefull_multiply_256(ormultiply_256) to multiply two 256-bit field elements to produce a 512-bit result. Then we can usediv_mod_512_256to take our 512-bit result and reduce it modulo the required 256-modulus.Our
div_mod_512_256above has a preconditions that requires that the high bits ofato be less thanb; however, the product of already (partially) normalized field elements ought to satisfy this requirement.There is another more sticky issue is that the BN-254 modulus
Mdoesn't actually have the high bit set as required bydiv_mod_512_256. There are probably a couple of solutions to this problem. My suggestion would be to actually work with only partially reduced values as much as you can get away with. That is work modulo3Minstead. You can write a specialized normalization function to reduce toMby subtraction whenever it is needed, but hopefully that won't be required very much.Epilogue
I will say that I would like SimplicityHL language to eventually provide "synthetic" wide integer types along with these sorts of arithmetic operations, but it will take some time before we get there.