X-Git-Url: http://git.megacz.com/?a=blobdiff_plain;f=ghc%2Frts%2Fgmp%2Fmpn%2Fx86%2Fpentium%2FREADME;fp=ghc%2Frts%2Fgmp%2Fmpn%2Fx86%2Fpentium%2FREADME;h=0000000000000000000000000000000000000000;hb=0065d5ab628975892cea1ec7303f968c3338cbe1;hp=3b9ec8ac6f4cf35554806dc1af02d54501c6a99d;hpb=28a464a75e14cece5db40f2765a29348273ff2d2;p=ghc-hetmet.git

diff --git a/ghc/rts/gmp/mpn/x86/pentium/README b/ghc/rts/gmp/mpn/x86/pentium/README
deleted file mode 100644
index 3b9ec8a..0000000
--- a/ghc/rts/gmp/mpn/x86/pentium/README
+++ /dev/null
@@ -1,77 +0,0 @@
-
-                   INTEL PENTIUM P5 MPN SUBROUTINES
-
-
-This directory contains mpn functions optimized for Intel Pentium (P5,P54)
-processors.  The mmx subdirectory has code for Pentium with MMX (P55).
-
-
-STATUS
-
-                                cycles/limb
-
-        mpn_add_n/sub_n            2.375
-
-        mpn_copyi/copyd            1.0
-
-        mpn_divrem_1              44.0
-        mpn_mod_1                 44.0
-        mpn_divexact_by3          15.0
-
-        mpn_l/rshift               5.375 normal (6.0 on P54)
-                                   1.875 special shift by 1 bit
-
-        mpn_mul_1                 13.0
-        mpn_add/submul_1          14.0
-
-        mpn_mul_basecase          14.2 cycles/crossproduct (approx)
-
-        mpn_sqr_basecase           8 cycles/crossproduct (approx)
-                                   or 15.5 cycles/triangleproduct (approx)
-
-Pentium MMX gets the following improvements
-
-        mpn_l/rshift               1.75
-
-
-1. mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
-documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
-or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.
-
-2. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
-overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb.
-
-3. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
-should.  Intel documentation says a mul instruction is 10 cycles, but it
-measures 9 and the routines using it run with it as 9.
-
-
-
-RELEVANT OPTIMIZATION ISSUES
-
-1. Pentium doesn't allocate cache lines on writes, unlike most other modern
-processors.  Since the functions in the mpn class do array writes, we have to
-handle allocating the destination cache lines by reading a word from it in the
-loops, to achieve the best performance.
-
-2. Pairing of memory operations requires that the two issued operations refer
-to different cache banks.  The simplest way to insure this is to read/write
-two words from the same object.  If we make operations on different objects,
-they might or might not be to the same cache bank.
-
-
-
-REFERENCES
-
-"Intel Architecture Optimization Manual", 1997, order number 242816.  This
-is mostly about P5, the parts about P6 aren't relevant.  Available on-line:
-
-        http://download.intel.com/design/PentiumII/manuals/242816.htm
-
-
-
-----------------
-Local variables:
-mode: text
-fill-column: 76
-End:
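
Note on optimization issue 1 of the removed README: the "read a word from
the destination" idea can be sketched in C roughly as follows.  This is an
illustration only, not the deleted assembly.  The helper name
copy_touch_dst, the limb_t typedef, the 32-bit limb size and the 32-byte
cache line are all assumptions made for the example, and the sketch further
assumes that dst starts on a cache line boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for this sketch: 32-bit limbs, 32-byte data cache lines. */
typedef uint32_t limb_t;
#define CACHE_LINE_BYTES 32u
#define LIMBS_PER_LINE   (CACHE_LINE_BYTES / sizeof(limb_t))

/* Hypothetical helper, not a GMP function.  Copies n limbs from src to
   dst, loading one word from each destination line before storing to it,
   so that a cache without write-allocate has the line resident when the
   stores arrive.  If dst is not line-aligned, a chunk straddles two lines
   and only the first of them is touched. */
static void copy_touch_dst(limb_t *dst, const limb_t *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < n; i += LIMBS_PER_LINE) {
        /* Dummy load; volatile keeps the compiler from discarding it. */
        (void) *(volatile limb_t *) &dst[i];

        /* Store up to one line's worth of limbs. */
        size_t end = (n - i < LIMBS_PER_LINE) ? n : i + LIMBS_PER_LINE;
        for (size_t j = i; j < end; j++)
            dst[j] = src[j];
    }
}
```

The dummy load only pays off on a processor that does not allocate cache
lines on writes, which is the case the README describes for the P5; the
result of the copy is the same with or without it.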