This directory contains mpn functions for 64-bit V9 SPARC.

RELEVANT OPTIMIZATION ISSUES

The Ultra I/II pipeline executes up to two simple integer arithmetic operations
per cycle.  The 64-bit integer multiply instruction mulx takes from 5 cycles to
35 cycles, depending on the position of the most significant bit of the first
source operand.  It cannot overlap with other instructions.  For our use of
mulx, it will take from 5 to 20 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (Or
something similarly bizarre.)

Integer branches can issue with two integer arithmetic instructions.  Likewise
for integer loads.  Four instructions may issue (arith, arith, ld/st, branch)
but only if the branch is last.

(The V9 architecture manual recommends that the second operand of a multiply
instruction be the smaller one.  For UltraSPARC, they got things backwards and
optimize for the wrong operand!  Really helpful, given that multiply is
incredibly slow on these CPUs!)

STATUS

There is new code in ~/prec/gmp-remote/sparc64.  Not tested or completed, but
the pipelines are worked out.  Here are the timings:

* lshift, rshift: The code is well-optimized and runs at 2.0 cycles/limb.

* add_n, sub_n: add3.s currently runs at 6 cycles/limb.  We use a bizarre
  scheme of compares and branches (with some nops and fnops to align things)
  and carefully stay away from the instructions intended for this application
  (i.e., movcs and movcc).

  Using movcc/movcs, even with deep unrolling, seems to get down only to 7
  cycles/limb.

  The most promising approach is to split operands in 32-bit pieces using
  srlx, then use two addccc, and finally combine the results with sllx+or.
  The result could run at 5 cycles/limb, I think.  It might be possible to
  do without unrolling, or with minimal unrolling.

* addmul_1/submul_1: Should optimize for when scalar operand < 2^32.
* addmul_1/submul_1: Since mulx is horrendously slow on UltraSPARC I/II,
  Karatsuba's method should save up to 16 cycles (i.e. > 20%).
* mul_1 (and possibly the other multiply functions): Handle carry in the
  same tricky way as add_n, sub_n.
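
The split-operand idea noted for add_n and sub_n above might look roughly like
this in C.  This is only a sketch of the arithmetic, not the planned assembly:
the name add_n_split32 is made up for illustration, 64-bit limbs are assumed,
the shifts stand in for srlx/sllx, and each 32-bit add stands in for one
addccc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP's mpn_add_n: add two n-limb numbers by
   splitting each 64-bit limb into 32-bit halves, so that every carry
   shows up as bit 32 of an intermediate sum.  Returns the carry out. */
static uint64_t add_n_split32(uint64_t *rp, const uint64_t *up,
                              const uint64_t *vp, size_t n)
{
    uint64_t carry = 0;                       /* always 0 or 1 */
    for (size_t i = 0; i < n; i++) {
        uint64_t ulo = up[i] & 0xffffffffu;   /* low 32 bits */
        uint64_t uhi = up[i] >> 32;           /* high 32 bits (srlx) */
        uint64_t vlo = vp[i] & 0xffffffffu;
        uint64_t vhi = vp[i] >> 32;

        uint64_t lo = ulo + vlo + carry;      /* at most 33 bits */
        uint64_t hi = uhi + vhi + (lo >> 32); /* at most 33 bits */

        rp[i] = (hi << 32) | (lo & 0xffffffffu);  /* sllx + or */
        carry = hi >> 32;                     /* carry into next limb */
    }
    return carry;
}
```

Whether this actually reaches the hoped-for 5 cycles/limb depends on how the
real instructions pair up in the pipeline, which the C version says nothing
about.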
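
One possible reading of the Karatsuba note for addmul_1/submul_1 above: form
the 64x64 -> 128-bit limb product from 32-bit halves using three multiplies
instead of four.  A minimal C sketch under that assumption follows;
mul_64x64_karatsuba is a hypothetical name, and nothing here is taken from the
actual sparc64 code.

```c
#include <stdint.h>

/* Hypothetical helper, not part of GMP: 64x64 -> 128-bit product built
   from three 32x32 -> 64-bit multiplies (Karatsuba) instead of four.
   Returns the low 64 bits and stores the high 64 bits in *hi. */
static uint64_t mul_64x64_karatsuba(uint64_t *hi, uint64_t u, uint64_t v)
{
    uint64_t u0 = u & 0xffffffffu, u1 = u >> 32;
    uint64_t v0 = v & 0xffffffffu, v1 = v >> 32;

    uint64_t ll = u0 * v0;                    /* low  x low  */
    uint64_t hh = u1 * v1;                    /* high x high */

    /* |u1 - u0| * |v1 - v0|, with the signs tracked separately. */
    int neg_u = u1 < u0;
    int neg_v = v1 < v0;
    uint64_t du = neg_u ? u0 - u1 : u1 - u0;
    uint64_t dv = neg_v ? v0 - v1 : v1 - v0;
    uint64_t d = du * dv;                     /* fits: du, dv < 2^32 */

    /* mid = u1*v0 + u0*v1 = hh + ll - (u1 - u0)*(v1 - v0).
       It can need 65 bits, so keep the top bit in mid_c. */
    uint64_t mid_lo = hh + ll;
    uint64_t mid_c = mid_lo < hh;
    if (neg_u == neg_v) {                     /* product of differences >= 0 */
        mid_c -= mid_lo < d;                  /* borrow */
        mid_lo -= d;
    } else {                                  /* product of differences < 0 */
        mid_lo += d;
        mid_c += mid_lo < d;                  /* carry */
    }

    /* result = hh * 2^64 + mid * 2^32 + ll */
    uint64_t lo = ll + (mid_lo << 32);
    uint64_t carry = lo < ll;
    *hi = hh + (mid_lo >> 32) + (mid_c << 32) + carry;
    return lo;
}
```

The extra adds, subtracts and compares are cheap only if a multiply really
costs far more than they do, which is the premise of the note above.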