rts/gmp/mpn/x86/k7/README


                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX; the separate directory exists only so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)

        mpn_popcount              5.0
        mpn_hamdist               6.0
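
As a rough guide to reading these figures, the loop time of an operation can
be estimated by multiplying its size by the per-limb (or per-crossproduct)
time.  The following C sketch is illustrative only and is not part of GMP:
the names est_linear_cycles and est_mul_basecase_cycles are made up for this
example, it assumes one crossproduct per pair of limbs, and it ignores call
and setup overheads, which the table does not cover.

```c
#include <stddef.h>

/* Loop costs in hundredths of a cycle, copied from the table above. */
enum {
    K7_ADD_N_X100        = 160,  /* mpn_add_n / mpn_sub_n, per limb */
    K7_MUL_1_X100        = 340,  /* mpn_mul_1, per limb */
    K7_ADDMUL_1_X100     = 390,  /* mpn_addmul_1 / mpn_submul_1, per limb */
    K7_MUL_BASECASE_X100 = 442   /* mpn_mul_basecase, per crossproduct (approx) */
};

/* Estimated loop cycles for n units of work at cost_x100/100 cycles each,
   rounded to the nearest cycle.  Overheads outside the loop are ignored. */
static unsigned long est_linear_cycles(size_t n, unsigned cost_x100)
{
    return (unsigned long)((n * (unsigned long)cost_x100 + 50) / 100);
}

/* Estimated loop cycles for an xsize-by-ysize basecase multiply, assuming
   one crossproduct for every pair of limbs. */
static unsigned long est_mul_basecase_cycles(size_t xsize, size_t ysize)
{
    return est_linear_cycles(xsize * ysize, K7_MUL_BASECASE_X100);
}
```

For example a 10x10 limb mpn_mul_basecase comes out at about 442 cycles of
loop time by this estimate, and a 100 limb mpn_add_n at about 160.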

Prefetching of sources hasn't yet been tried.



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 cycles/limb
is a lower bound for the multiplication routines.  The documentation shows
mul executing in IEU0 (or maybe in IEU0 and IEU1 together), so it might be
that, to get near 3 cycles, code has to be arranged so that nothing else is
issued to IEU0.  A busy IEU0 could explain why some code takes 4 cycles and
other apparently equivalent code takes 5.
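
The 3 cycle figure can be seen from a plain C rendering of what mpn_mul_1
computes: each limb needs exactly one 32x32->64 multiply, so the loop cannot
run faster than one mul issue slot per limb.  This is only a reference
sketch assuming 32-bit limbs; limb_t and ref_mul_1 are names invented here,
and the real routine is hand-written assembler.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; a 32-bit limb as on x86 */

/* Set dst[0..n-1] to src[0..n-1] * multiplier and return the carry-out
   (most significant) limb of the product. */
static limb_t ref_mul_1(limb_t *dst, const limb_t *src, size_t n,
                        limb_t multiplier)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One widening multiply per limb: this corresponds to the "mull".
           The sum cannot overflow 64 bits. */
        uint64_t prod = (uint64_t)src[i] * multiplier + carry;
        dst[i] = (limb_t)prod;           /* low word is stored */
        carry  = (limb_t)(prod >> 32);   /* high word feeds the next limb */
    }
    return carry;
}
```

The add of the previous high word into the next low word is the dependency
chain that the assembler has to schedule around the multiplies.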



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache, so quite large unrolling is affordable.

Computed jumps into the unrolling are used to handle sizes that are not a
multiple of the unrolling.  An attractive feature of this is that times
increase smoothly with operand size, but it may be that some routines should
just have simple loops to finish up, especially when PIC adds between 2 and
16 cycles to get %eip.
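
The computed jump idea can be sketched in C with a switch into an unrolled
loop body (Duff's device).  This is an illustration only, not the code in
this directory: the assembler computes a jump address rather than using a
switch, and limb_t, UNROLL and copy_unrolled are hypothetical names, with an
unrolling of 4 chosen just to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name; a 32-bit limb as on x86 */

#define UNROLL 4           /* illustrative; must match the case labels below */

/* Copy n limbs from src to dst, lowest index first. */
static void copy_unrolled(limb_t *dst, const limb_t *src, size_t n)
{
    if (n == 0)
        return;

    /* Number of trips through the unrolled body, counting a partial one. */
    size_t passes = (n + UNROLL - 1) / UNROLL;

    /* Enter the unrolled body part way through, so the first pass handles
       the n % UNROLL leftover limbs and every later pass does UNROLL. */
    switch (n % UNROLL) {
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

With this arrangement a size of 5 costs one limb more than a size of 4,
rather than a whole extra unrolled pass plus a separate cleanup loop, which
is the smooth growth in time described above.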

Position independent code is implemented using a call to get %eip for the
computed jumps.  The call is always matched with a ret, rather than an
addl $4,%esp or a popl, so that the CPU's return address branch prediction
stack stays synchronised with the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.  K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

Where there are vector-path instructions, an effort is made to get triplets
of direct-path instructions in between them, even if there are dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision E, November 1999.  Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf

"3DNow Technology Manual", AMD publication number 21928F/0-August 1999.
This describes the femms and prefetch instructions.  Available on-line,

        http://www.amd.com/K6/k6docs/pdf/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision B, August 1999.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

        http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: