ghc/rts/gmp/mpn/x86/pentium/README

                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has code for Pentium with MMX (P55).


STATUS

                                cycles/limb

        mpn_add_n/sub_n            2.375

        mpn_copyi/copyd            1.0

        mpn_divrem_1              44.0
        mpn_mod_1                 44.0
        mpn_divexact_by3          15.0

        mpn_l/rshift               5.375 normal (6.0 on P54)
                                   1.875 special shift by 1 bit

        mpn_mul_1                 13.0
        mpn_add/submul_1          14.0

        mpn_mul_basecase          14.2 cycles/crossproduct (approx)

        mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

Pentium MMX gets the following improvements

        mpn_l/rshift               1.75


1. mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.

2. mpn_add_n and mpn_sub_n asymptotically run at 2 cycles/limb.  Due to loop
overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb.

3. mpn_mul_1, mpn_addmul_1 and mpn_submul_1 all run 1 cycle faster than the
documentation predicts.  Intel documentation says a mul instruction takes 10
cycles, but it measures as 9, and the routines using it run as if it takes 9.



RELEVANT OPTIMIZATION ISSUES

1. Pentium doesn't allocate cache lines on writes, unlike most other modern
processors.  Since the functions in the mpn class do array writes, we have to
allocate the destination cache lines ourselves, by reading a word from each
of them in the loops, to achieve the best performance.

2. Pairing of memory operations requires that the two issued operations refer
to different cache banks.  The simplest way to ensure this is to read/write
two words from the same object.  If we make operations on different objects,
they might or might not be to the same cache bank.



REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5; the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: