rts/gmp/mpn/pa64/README

This directory contains mpn functions for 64-bit PA-RISC 2.0.

RELEVANT OPTIMIZATION ISSUES

The PA8000 has a multi-issue pipeline with large buffers for instructions
awaiting pending results.  Therefore, no latency scheduling is necessary
(and might actually be harmful).

Two 64-bit loads can be completed per cycle.  One 64-bit store can be
completed per cycle.  A store cannot complete in the same cycle as a load.

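The load and store limits above imply a simple lower bound on cycles per
limb for a loop that is limited by cache traffic.  The C sketch below is only
a model under that assumption; the names cache_bound_cycles, shift_bound and
addsub_bound are hypothetical and do not appear in the GMP sources, and the
per-limb access counts are the obvious ones for each operation rather than
figures measured from the assembly.

```c
#include <assert.h>

/* Hypothetical helper: lower bound on cycles per limb for a memory-bound
   loop, assuming two loads retire per cycle, one store retires per cycle,
   and a store never shares a cycle with a load.  */
static double cache_bound_cycles(unsigned loads, unsigned stores)
{
    return loads / 2.0 + stores;
}

/* A shift reads one source limb and writes one result limb.  */
static double shift_bound(void)
{
    return cache_bound_cycles(1, 1);
}

/* An add or subtract reads two source limbs and writes one result limb.  */
static double addsub_bound(void)
{
    return cache_bound_cycles(2, 1);
}
```

With these counts the model gives 1.5 cycles/limb for shifting and 2.0
cycles/limb for add/subtract.
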
STATUS

* mpn_lshift, mpn_rshift, mpn_add_n, mpn_sub_n are all well-tuned and run at
  the peak cache bandwidth: 1.5 cycles/limb for shifting and 2.0 cycles/limb
  for add/subtract.

* The multiplication functions run at 11 cycles/limb.  The cache bandwidth
  allows 7.5 cycles/limb.  Perhaps it would be possible, using unrolling or
  better scheduling, to get closer to the cache bandwidth limit.

* xaddmul_1.S contains a quicker method for forming the 128-bit product.  It
  uses somewhat fewer operations and keeps the carry flag live across the
  loop boundary.  But it seems hard to make it run more than 1/4 cycle
  faster than the old code.  Perhaps we really ought to unroll this loop by
  2x?  2x should suffice since register latency scheduling is never needed,
  but the unrolling would hide the store-load latency.  Here is a sketch:

        1. A multiply and store 64-bit products
        2. B sum 64-bit products into 128-bit product
        3. B load 64-bit products to integer registers
        4. B multiply and store 64-bit products
        5. A sum 64-bit products into 128-bit product
        6. A load 64-bit products to integer registers
        7. goto 1

  In practice, adjacent groups (1 and 2, 2 and 3, etc.) will be interleaved
  for better instruction mix.
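
The A/B sketch above can be illustrated in portable C.  This is a stand-in
for the dependence structure only, not the method used in xaddmul_1.S: it
assumes the "64-bit products" are the four 32x32-bit partial products of one
64x64-bit multiply, and every name in it (partial_products, multiply_parts,
sum_parts, accumulate, addmul_1_sketch) is hypothetical.  A C compiler is
free to reorder the phases, so the code shows which work can overlap, not
the actual instruction schedule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Four 32x32->64 partial products of u*v ("multiply and store" step).  */
typedef struct { uint64_t ll, lh, hl, hh; } partial_products;

static partial_products multiply_parts(uint64_t u, uint64_t v)
{
    uint64_t ul = u & 0xffffffffu, uh = u >> 32;
    uint64_t vl = v & 0xffffffffu, vh = v >> 32;
    partial_products p = { ul * vl, ul * vh, uh * vl, uh * vh };
    return p;
}

/* Sum the partial products into a 128-bit result ("sum" step).  */
static void sum_parts(partial_products p, uint64_t *hi, uint64_t *lo)
{
    uint64_t mid = p.lh + p.hl;                       /* may wrap past 2^64 */
    uint64_t mid_carry = (uint64_t)(mid < p.lh) << 32; /* wrap is worth 2^96 */
    uint64_t low = p.ll + (mid << 32);
    uint64_t low_carry = low < p.ll;
    *lo = low;
    *hi = p.hh + (mid >> 32) + mid_carry + low_carry;
}

/* Add one 128-bit product plus the incoming carry limb into *r and
   return the outgoing carry limb.  */
static uint64_t accumulate(uint64_t *r, partial_products p, uint64_t carry_in)
{
    uint64_t hi, lo;
    sum_parts(p, &hi, &lo);
    lo += carry_in;
    hi += lo < carry_in;
    lo += *r;
    hi += lo < *r;
    *r = lo;
    return hi;
}

/* {rp,n} += {up,n} * v, unrolled 2x so that the multiply for one limb
   (A or B) is issued before the sum for the other limb.  */
static uint64_t addmul_1_sketch(uint64_t *rp, const uint64_t *up,
                                size_t n, uint64_t v)
{
    partial_products a, b;
    uint64_t carry = 0;
    size_t i = 0;

    if (n == 0)
        return 0;
    a = multiply_parts(up[0], v);               /* prologue: A multiply */
    while (i + 2 < n) {
        b = multiply_parts(up[i + 1], v);       /* B multiply */
        carry = accumulate(&rp[i], a, carry);   /* A sum */
        a = multiply_parts(up[i + 2], v);       /* A multiply */
        carry = accumulate(&rp[i + 1], b, carry); /* B sum */
        i += 2;
    }
    if (i + 1 < n) {                            /* epilogue: two limbs left */
        b = multiply_parts(up[i + 1], v);
        carry = accumulate(&rp[i], a, carry);
        carry = accumulate(&rp[i + 1], b, carry);
    } else {                                    /* epilogue: one limb left */
        carry = accumulate(&rp[i], a, carry);
    }
    return carry;
}
```

Calling addmul_1_sketch(rp, up, n, v) adds {up,n} times v into {rp,n} and
returns the carry-out limb, in the style of mpn_addmul_1.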