This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

The Ultra I/II pipeline executes up to two simple integer arithmetic
operations per cycle.  The 64-bit integer multiply instruction mulx takes
from 5 cycles to 35 cycles, depending on the position of the most
significant bit of the 1st source operand.  It cannot overlap with other
instructions.  For our use of mulx, it will take from 5 to 20 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (Or
something similarly bizarre.)

Integer branches can issue with two integer arithmetic instructions.
Likewise for integer loads.  Four instructions may issue (arith, arith,
ld/st, branch), but only if the branch is last.

(The V9 architecture manual recommends that the 2nd operand of a multiply
instruction be the smaller one.  For UltraSPARC, they got things backwards
and optimize for the wrong operand!  Really helpful, given that multiply is
incredibly slow on these CPUs!)

STATUS

There is new code in ~/prec/gmp-remote/sparc64.  Not tested or completed,
but the pipelines are worked out.  Here are the timings:

* lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.

* add_n, sub_n: add3.s currently runs at 6 cycles/limb.  We use a bizarre
  scheme of compares and branches (with some nops and fnops to align
  things) and carefully stay away from the instructions intended for this
  application (i.e., movcs and movcc).

  Using movcc/movcs, even with deep unrolling, seems only to get down to 7
  cycles/limb.

  The most promising approach is to split the operands into 32-bit pieces
  using srlx, then use two addccc, and finally combine the results with
  sllx+or (see the first C sketch at the end of this file).  The result
  could run at 5 cycles/limb, I think.  It might be possible to do without
  unrolling, or with minimal unrolling.

* addmul_1/submul_1: Should optimize for the case where the scalar operand
  is < 2^32 (see the second C sketch at the end of this file).

* addmul_1/submul_1: Since mulx is horrendously slow on UltraSPARC I/II,
  Karatsuba's method should save up to 16 cycles (i.e., > 20%); the third
  C sketch at the end of this file shows the arithmetic.

* mul_1 (and possibly the other multiply functions): Handle carry in the
  same tricky way as add_n and sub_n.
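
Below are three illustrative C sketches of the ideas above.  They model
the arithmetic only; the function names, types, and coding are invented
for this README and are not the planned assembly, which would keep the
carries in the condition codes.

First, the 32-bit split scheme for add_n: split each limb with srlx, add
the halves (two addccc in assembly, modeled here with plain 64-bit
additions), and recombine with sllx+or.  mul_1 could propagate its carry
limb through the same kind of half-width additions.

    typedef unsigned long long limb_t;  /* assuming 64-bit limbs */

    /* C model of the srlx/addccc/sllx+or scheme for add_n.  Each 64-bit
       limb is split into 32-bit halves so the carry can ripple through
       ordinary additions.  Returns the carry out.  */
    limb_t
    add_n_split (limb_t *rp, const limb_t *up, const limb_t *vp, long n)
    {
      limb_t cy = 0;                            /* carry, always 0 or 1 */
      long i;
      for (i = 0; i < n; i++)
        {
          limb_t ul = up[i] & 0xffffffff, uh = up[i] >> 32;  /* srlx */
          limb_t vl = vp[i] & 0xffffffff, vh = vp[i] >> 32;
          limb_t lo = ul + vl + cy;             /* 1st addccc */
          limb_t hi = uh + vh + (lo >> 32);     /* 2nd addccc */
          rp[i] = (hi << 32) | (lo & 0xffffffff);  /* sllx + or */
          cy = hi >> 32;                        /* carry out */
        }
      return cy;
    }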
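
Second, the small-scalar case for addmul_1: when the scalar v fits in 32
bits, the two partial products v*ul and v*uh each fit in one 64-bit
register, so two mulx per limb suffice, and the small value can be placed
as the latency-determining 1st mulx operand.  The name addmul_1_small and
the carry coding are invented for this sketch.

    typedef unsigned long long limb_t;  /* assuming 64-bit limbs */

    /* rp[] += up[] * v, returning the carry-out limb.  Requires
       v < 2^32, so each partial product fits in 64 bits.  */
    limb_t
    addmul_1_small (limb_t *rp, const limb_t *up, long n, limb_t v)
    {
      limb_t cy = 0;
      long i;
      for (i = 0; i < n; i++)
        {
          limb_t ul = up[i] & 0xffffffff;
          limb_t uh = up[i] >> 32;
          limb_t p0 = v * ul;           /* exact: both factors < 2^32 */
          limb_t p1 = v * uh;
          /* full 128-bit product is p1 * 2^32 + p0 */
          limb_t lo = p0 + (p1 << 32);
          limb_t hi = (p1 >> 32) + (lo < p0);
          limb_t r = rp[i] + lo;
          hi += r < lo;                 /* carry from rp[i] + lo */
          r += cy;
          hi += r < cy;                 /* carry from adding cy in */
          rp[i] = r;
          cy = hi;
        }
      return cy;
    }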
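
Third, the Karatsuba idea for a full 64x64 -> 128-bit product: three
32x32 multiplies instead of four, using the subtractive variant so that
every partial product fits in 64 bits.  This helper is a hypothetical C
model of the arithmetic; the cycle savings claimed above come from mulx's
operand-dependent latency and are of course not visible in C.

    typedef unsigned long long u64;
    typedef unsigned int u32;

    /* hi:lo = x * y from three 32x32 multiplies, Karatsuba style.  */
    void
    mul64_karatsuba (u64 x, u64 y, u64 *hi, u64 *lo)
    {
      u32 xl = (u32) x, xh = (u32) (x >> 32);
      u32 yl = (u32) y, yh = (u32) (y >> 32);

      u64 pll = (u64) xl * yl;          /* 1st mulx */
      u64 phh = (u64) xh * yh;          /* 2nd mulx */

      /* |xh-xl| and |yl-yh| fit in 32 bits; track the product's sign.  */
      u32 dx = xh >= xl ? xh - xl : xl - xh;
      u32 dy = yl >= yh ? yl - yh : yh - yl;
      int neg = (xh >= xl) != (yl >= yh);
      u64 pdd = (u64) dx * dy;          /* 3rd mulx */

      /* middle term m = xh*yl + xl*yh = phh + pll (+/-) pdd; it can
         exceed 64 bits, so keep an explicit carry bit mc.  */
      u64 m = phh + pll;
      u64 mc = m < phh;
      if (neg)
        {
          mc -= m < pdd;                /* borrow */
          m -= pdd;
        }
      else
        {
          m += pdd;
          mc += m < pdd;
        }

      /* assemble: result = phh*2^64 + (mc*2^64 + m)*2^32 + pll */
      u64 l = pll + (m << 32);
      *lo = l;
      *hi = phh + (m >> 32) + (mc << 32) + (l < pll);
    }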