This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of the instructions that need
special handling by unwinding.  It also says whether $27 is really
needed for computing the gp.  The `.mask M' pseudo op says which
registers are saved on the stack, and at what offset in the frame.

Cray code is very very different...

RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
write-through, and a cache line is transferred from the store buffer to
the off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
these cycles are pipelined.  Thus, multiply instructions can be issued at
a rate of one each 21st cycle.

EV5

1. The memory bandwidth of this chip seems excellent, both for loads and
stores.  Even when the working set is larger than the on-chip L1 and L2
caches, the performance remains almost unaffected.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
umulh has a measured latency of 14 cycles and an issue rate of 1 each
10th cycle.  But the exact timing is somewhat confusing.

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
are memory operations.  This will take at least
	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
cache cycles, which should be completely hidden in the 19 issue cycles.
The computation is inherently serial: three operations (addq, cmpult,
or) are needed between accepting carry-in and generating carry-out,
making 12 cycles the absolute minimum for the 4 limbs.  We could replace
the `or' with a cmoveq/cmovne, which could issue one cycle earlier than
the `or', but that might waste a cycle on EV4.  The total depth remains
unaffected, since cmov has a latency of 2 cycles.
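
In C terms, one limb of this carry chain might look like the following
sketch (function and variable names are illustrative; the real code is
Alpha assembly):

```c
#include <stdint.h>

/* One limb step of mpn_add_n.  The critical path from carry-in to
   carry-out is addq (s), cmpult (cy2), or -- the three operations
   mentioned above; cy1 is computed off the critical path.  */
static uint64_t
add_limb (uint64_t a, uint64_t b, uint64_t cyin, uint64_t *sum)
{
  uint64_t t = a + b;		/* addq */
  uint64_t cy1 = t < a;		/* cmpult: carry out of a+b */
  uint64_t s = t + cyin;	/* addq */
  uint64_t cy2 = s < cyin;	/* cmpult: carry out of t+cyin */
  *sum = s;
  return cy1 | cy2;		/* or: at most one of the two is set */
}
```

Note that cy1 and cy2 can never both be set: if a+b wrapped, t is at
most 2^64-2, so adding a 1-bit cyin cannot wrap again.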

Montgomery has a slightly different way of computing carry that requires one
less instruction, but has depth 4 (instead of the current 3).  Since the
code is currently instruction issue bound, Montgomery's idea should save us
1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
cycles/limb.  Unfortunately, this method will not be good for the EV6.

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  One integer multiply instruction can issue each
cycle.  To get optimal speed, we need to pretend we are vectorizing the
code, i.e., minimize the iterative dependencies.

There are two dependencies to watch out for: 1) address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling the loop, so
that addresses don't depend heavily on an index variable.  Avoiding
serializing because of carry propagation is trickier; the ultimate
performance of the code will be determined by the number of latency cycles
it takes from accepting carry-in to a vector point until we can generate
carry-out.

Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2,
but they execute efficiently.  However, CMOV splits the mapping process
(see pg 2-26 in cmpwrgd.pdf), suggesting that CMOV should always be placed
as the last instruction of an aligned 4-instruction block (?).

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency
for consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.

The shift loops need (per limb):
	1 ld (Lx pipes)
	1 st (Lx pipes)
	2 shift (Ux pipes)
	1 iaddlog (Lx pipes, Ux pipes)
Obviously, since the pipes are very equally loaded, we should get 4
insn/cycle, or 1.25 cycles/limb.
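
In C terms, the per-limb work is two shifts and an `or' on two input
limbs, plus the load and store.  A sketch with GMP-style names rp/up
(an illustration, not GMP's actual code; assumes 64-bit limbs and
0 < cnt < 64):

```c
#include <stdint.h>

/* Left-shift an n-limb number at up by cnt bits into rp, working from
   the high limb down (as a real lshift must when rp and up overlap).
   Returns the bits shifted out of the top limb.  */
static uint64_t
lshift_n (uint64_t *rp, const uint64_t *up, long n, unsigned cnt)
{
  uint64_t retval = up[n - 1] >> (64 - cnt);
  for (long i = n - 1; i > 0; i--)
    /* 1 ld, 2 shift, 1 iaddlog (the or), 1 st per limb */
    rp[i] = (up[i] << cnt) | (up[i - 1] >> (64 - cnt));
  rp[0] = up[0] << cnt;
  return retval;
}
```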

For mpn_add_n, we currently have
	2 ld (Lx pipes)
	1 st (Lx pipes)
	5 iaddlog (Lx pipes, Ux pipes)

Again, we have a perfect balance and will be limited by carry propagation
delays, currently three cycles.  The superoptimizer indicates that there
might be sequences that--using a final cmov--have a carry propagation delay
of just two.  Montgomery's subtraction sequence could perhaps be used, by
complementing some operands.  All in all, we should get down to 2 cycles
without much trouble.

For mpn_mul_1, we could do, just like for mpn_add_n:
	addq cylimb,newlo,newlo || cmpult cylimb,notnewlo,cyout
	addq cyout,newhi,cylimb
and get 2-cycle carry propagation.  The instructions needed will be
	1 ld (Lx pipes)
	1 st (Lx pipes)
	2 mul (U1 pipe)
	4 iaddlog (Lx pipes, Ux pipes)
	issue1: addq not mul ld
	issue2: cmpult addq mul st
Conclusion: no cluster delays and 2-cycle carry delays will give us 2
cycles/limb!
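
The cmpult against notnewlo works because cylimb + newlo wraps exactly
when cylimb > ~newlo, which lets the carry-out be computed in parallel
with the add itself.  A C sketch of this identity (names taken from the
sequence above):

```c
#include <stdint.h>

/* Carry-out of cylimb + newlo without looking at the sum:
   cylimb + newlo >= 2^64  <=>  cylimb > (2^64 - 1) - newlo = ~newlo.
   The `not' producing notnewlo is off the carry's critical path.  */
static uint64_t
carry_out (uint64_t cylimb, uint64_t newlo)
{
  uint64_t notnewlo = ~newlo;	/* not */
  return cylimb > notnewlo;	/* cmpult notnewlo,cylimb */
}
```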

Last, we have mpn_addmul_1.  Almost certainly, we will get down to 3
cycles/limb, which would be absolutely awesome.
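
A portable C sketch of one mpn_addmul_1 limb step, with mulq/umulh
modeled by a 128-bit multiply (illustrative names; assumes a compiler
providing unsigned __int128, and makes no claim about the scheduled
Alpha code):

```c
#include <stdint.h>

/* rp[0] += up * vl, propagating cylimb; returns the next cylimb.
   mulq/umulh give the low and high product halves; two addq/cmpult
   pairs fold in the carry and the old result limb.  */
static uint64_t
addmul_1_step (uint64_t *rp, uint64_t up, uint64_t vl, uint64_t cylimb)
{
  unsigned __int128 prod = (unsigned __int128) up * vl;
  uint64_t newlo = (uint64_t) prod;		/* mulq  */
  uint64_t newhi = (uint64_t) (prod >> 64);	/* umulh */
  uint64_t lo = newlo + cylimb;			/* addq  */
  uint64_t cy = lo < newlo;			/* cmpult */
  uint64_t r = *rp + lo;			/* addq  */
  cy += r < lo;					/* cmpult */
  *rp = r;					/* stq   */
  return newhi + cy;	/* cannot wrap: newhi <= 2^64-2, cy <= 2 */
}
```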

Old, perhaps obsolete addmul_1 dependency diagram (needs a 175 columns
wide screen; only its outline is reproduced here): each limb point loads
its operand with ldq and multiplies with mulq/umulh; the product is then
folded into the result limb and the carry with a chain of addq/cmpult
pairs ending in a stq, while the ldq/mulq/umulh streams of the following
points proceed in parallel.

As many as 6 consecutive points will be under execution simultaneously, or
if we schedule loads even further away, maybe 7 or 8.  But the number of
live quantities is reasonable, and can easily be satisfied.