                        AMD K6 MPN SUBROUTINES



This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx and k62mmx subdirectories have routines using MMX instructions.  All
K6s have MMX; the separate directories are just so that ./configure can omit
them if the assembler doesn't support MMX.




STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb

        mpn_add_n/sub_n            3.25 normal, 2.75 in-place

        mpn_mul_1                  6.25
        mpn_add/submul_1           7.65-8.4  (varying with data values)

        mpn_mul_basecase           9.25 cycles/crossproduct (approx)
        mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                   or 9.2 cycles/triangleproduct (approx)

        mpn_divrem_1              20.0
        mpn_mod_1                 20.0
        mpn_divexact_by3          11.0

        mpn_l/rshift               3.0

        mpn_copyi/copyd            1.0

        mpn_com_n                  1.5-1.85  \
        mpn_and/andn/ior/xor_n     1.5-1.75  | varying with
        mpn_iorn/xnor_n            2.0-2.25  | data alignment
        mpn_nand/nior_n            2.0-2.25  /

        mpn_popcount              12.5
        mpn_hamdist               13.0
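
As a rough illustration of how the per-crossproduct figures above turn into
whole-operation estimates, here is a minimal C sketch.  The function and
macro names are hypothetical (not part of GMP), and the estimates ignore
call and setup overhead, so they only indicate the loop cost.

```c
#include <stddef.h>

/* Loop speeds quoted in the table above, in cycles per crossproduct. */
#define K6_MUL_BASECASE_CYCLES 9.25
#define K6_SQR_BASECASE_CYCLES 4.7

/* Estimated cycles for an xn by yn limb schoolbook multiply: one
   crossproduct per pair of limbs, call and setup overhead ignored. */
double k6_est_mul_basecase(size_t xn, size_t yn)
{
    return K6_MUL_BASECASE_CYCLES * (double) xn * (double) yn;
}

/* Estimated cycles for squaring n limbs, counted as n*n crossproducts. */
double k6_est_sqr_basecase(size_t n)
{
    return K6_SQR_BASECASE_CYCLES * (double) n * (double) n;
}
```

On these figures a square costs roughly half of a same-size multiply, which
is what the 4.7 versus 9.25 cycles/crossproduct entries express.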


K6-2 and K6-3 have dual-issue MMX and get the following improvements.

        mpn_l/rshift               1.75

        mpn_copyi/copyd            0.56 or 1.0  \
                                                |
        mpn_com_n                  1.0-1.2      | varying with
        mpn_and/andn/ior/xor_n     1.2-1.5      | data alignment
        mpn_iorn/xnor_n            1.5-2.0      |
        mpn_nand/nior_n            1.75-2.0     /

        mpn_popcount               9.0
        mpn_hamdist               11.5


Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster.  Results so far are inconsistent.  The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.




NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and in both at once).

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.

Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling.  An attractive feature of this is that times
smoothly increase with operand size, but an indirect jump is about 6 cycles
and the setups about another 6, so whether a computed jump ought to be used
depends on how much faster the unrolled code is than a simple loop.
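
The idea of jumping into the middle of an unrolled loop can be sketched in C
with a switch (the construct usually called Duff's device).  This is only an
analogy for the assembler technique, not code from this directory; limb_t
and copy_unrolled4 are hypothetical names, and the unrolling factor of 4 is
chosen just to keep the example short.

```c
#include <stddef.h>

typedef unsigned long limb_t;   /* hypothetical stand-in for mp_limb_t */

/* Copy n limbs from src to dst with a 4-way unrolled loop.  The switch
   jumps into the middle of the first pass to handle the n mod 4 leftover
   limbs, much as a computed jump would in assembler. */
void copy_unrolled4(limb_t *dst, const limb_t *src, size_t n)
{
    size_t passes;

    if (n == 0)
        return;

    passes = (n + 3) / 4;       /* trips round the loop, first one partial */
    switch (n % 4) {
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

The switch plays the part of the indirect jump, so on the figures above it
would only pay off when the unrolled body saves more than the roughly 12
cycles of jump and setup compared with a simple loop.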

Position independent code is implemented using a call to get eip for
computed jumps.  A ret is always done to return from that call, rather than
an addl $4,%esp or a popl, so the CPU return address branch prediction stack
stays synchronised with the actual stack in memory.  Such a call however
still costs 4 to 7 cycles.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.



MMX

Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations.  This isn't done because in GMP programs it's expected that x87
floating point won't be much used and that chances are an mpn routine won't
have been preceded by any x87 code.



CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.

K6 does some out-of-order execution so the pairings aren't essential; they
just show what slots might be available.  When decoding is the limiting
factor things can be scheduled that might not execute until later.



NOTES

Code alignment

- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
  short decode is inhibited.  The cross.pl script detects this.

- loops and branch targets should be aligned to 16 bytes, or ensure at least
  2 instructions before a 32 byte boundary.  This makes use of the 16 byte
  cache in the BTB.
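
The cache line crossing condition in the first item can be written as a
small address test.  The following C sketch is illustrative only and is not
how cross.pl is implemented; crosses_cache_line is a hypothetical name, and
the 32 byte line size is an assumption taken from the 32 byte boundary
mentioned above.

```c
#include <stddef.h>

#define K6_CACHE_LINE 32   /* assumed cache line size in bytes */

/* Return 1 if the len bytes starting at address addr straddle a cache line
   boundary, 0 if they all lie within one line.  For the rule above, addr
   is where the opcode (or 0Fh) byte sits and len is 2 for opcode/modrm or
   3 for 0Fh/opcode/modrm. */
int crosses_cache_line(unsigned long addr, size_t len)
{
    if (len == 0)
        return 0;
    return addr / K6_CACHE_LINE != (addr + len - 1) / K6_CACHE_LINE;
}
```

For instance a 0Fh/opcode/modrm sequence starting at offset 30 of a line
crosses into the next line, while an opcode/modrm pair at the same offset
does not.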

Addressing modes

- (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
  problem, and can be used as an equivalent, or easier is just to use a
  different register, like %ebx.

- K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
  have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

  If more than 3 bytes are needed to determine instruction length then
  decoding degrades from direct to long, or from long to vector.  This
  happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
  with mod=00 the sib determines whether there's a displacement.

  This affects all MMX and 3DNow instructions, and others with an 0F prefix
  like movzbl.  The modes affected are anything with an index and no
  displacement, or an index but no base, and this includes (%esp) which is
  really (,%esp,1).

  The cross.pl script detects problem cases.  The workaround is to always
  use a displacement, and to do this with Zdisp if it's zero so the
  assembler doesn't discard it.

  See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
  13-14 and 36-37.
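
The two bit patterns mentioned in the item above can be tested as follows.
This is a minimal C sketch with hypothetical function names.  The signature
test covers only the 0x588 to 0x58F range quoted above, and it assumes the
family, model and stepping occupy the low 12 bits of the value cpuid
function 1 returns in eax.

```c
#include <stdint.h>

/* Return 1 if a modrm byte has the form 00-xxx-100, meaning mod=00 with a
   sib byte following, so the sib must be examined to find out whether
   there's a displacement. */
int modrm_is_mod00_sib(uint8_t modrm)
{
    return (modrm & 0xC7) == 0x04;
}

/* Return 1 if a cpuid function 1 signature falls in the 0x588 to 0x58F
   range quoted above.  Other chips with the fix are not recognised by this
   test. */
int k6_sig_in_fixed_range(uint32_t eax)
{
    uint32_t sig = eax & 0xFFF;   /* family, model, stepping */
    return sig >= 0x588 && sig <= 0x58F;
}
```

A modrm of 0x04 or 0x3C matches the problem form, whereas 0x44 (mod=01, an
8-bit displacement always present) does not, which is why forcing a
displacement works as the workaround.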

Calls

- indirect jumps and calls are not branch predicted, they measure about 6
  cycles.

Various

- adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf       12-27 cycles
- emms      5 cycles
- femms     3 cycles
- jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl      20 cycles back-to-back
- imull     2 decode, 2 execute
- mull      2 decode, 3 execute (optimization manual decoding sample)
- prefetch  2 cycles
- rcll/rcrl implicit by one bit: 2 cycles
            immediate or %cl count: 11 + 2 per bit for dword
                                    13 + 4 per bit for byte
- setCC     2 cycles
- xchgl %eax,reg  1.5 cycles, back-to-back (strange)
        reg,reg   2 cycles, back-to-back
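
The rcll/rcrl figures above amount to simple linear formulas, sketched here
in C to make the comparison concrete.  The function names are hypothetical,
and the repeated-by-one estimate assumes the 2 cycle cost simply adds up
with no overlap between instructions.

```c
/* Cycles for a dword rotate through carry by an immediate or %cl count of
   the given number of bits: 11 + 2 per bit. */
unsigned rcl_dword_count_cycles(unsigned bits)
{
    return 11 + 2 * bits;
}

/* Cycles for a byte rotate through carry by an immediate or %cl count of
   the given number of bits: 13 + 4 per bit. */
unsigned rcl_byte_count_cycles(unsigned bits)
{
    return 13 + 4 * bits;
}

/* Cycles for the same rotate done as separate implicit-by-one
   instructions, at 2 cycles each (assumed to add up linearly). */
unsigned rcl_by_one_repeated_cycles(unsigned bits)
{
    return 2 * bits;
}
```

On these numbers a counted rotate through carry never beats the equivalent
run of by-one rotates, since the counted form carries a fixed 11 or 13
cycle cost on top of the per-bit cost.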




REFERENCES

"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
K6-3.  Available on-line,

        http://www.amd.com/K6/k6docs/pdf/21924.pdf

"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997.  This is an
older edition of the above document, describing plain K6.  Available
on-line,

        http://www.amd.com/K6/k6docs/pdf/21828.pdf

"3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used.  Available on-line,

        http://www.amd.com/K6/k6docs/pdf/21928.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general K6 optimizations as well as
3DNow.  Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf



----------------
Local variables:
mode: text
fill-column: 76
End: