   1
   2                     X86 CPU FAMILY MPN SUBROUTINES
   3
   4
   5 This file has some notes on things common to all the x86 family code.
   6
   7
   8
   9 ASM FILES
  10
  11 The x86 .asm files are BSD style x86 assembler code, first put through m4
  12 for macro processing.  The generic mpn/asm-defs.m4 is used, together with
  13 mpn/x86/x86-defs.m4.  Detailed notes are in those files.
  14
  15 The code is meant for use with GNU "gas" or a system "as".  There's no
  16 support for assemblers that demand Intel style, and with gas freely
  17 available and easy to use that shouldn't be a problem.
  18
  19
  20
  21 STACK FRAME
  22
  23 m4 macros are used to define the parameters passed on the stack, and these
  24 act like comments on what the stack frame looks like too.  For example,
  25 mpn_mul_1() has the following.
  26
  27         defframe(PARAM_MULTIPLIER, 16)
  28         defframe(PARAM_SIZE,       12)
  29         defframe(PARAM_SRC,         8)
  30         defframe(PARAM_DST,         4)
  31
  32 Here PARAM_MULTIPLIER gets defined as `FRAME+16(%esp)', and the others
  33 similarly.  The return address is at offset 0, but there's not normally any
  34 need to access that.
  35
  36 FRAME is redefined as necessary through the code so it's the number of bytes
  37 pushed on the stack, and hence the offsets in the parameter macros stay
  38 correct.  At the start of a routine FRAME should be zero.
  39
  40         deflit(`FRAME',0)
  41         ...
  42         deflit(`FRAME',4)
  43         ...
  44         deflit(`FRAME',8)
  45         ...
  46
  47 Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
  48 FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
  49 and can be used instead of explicit definitions if preferred.
  50 defframe_pushl() is a combination of FRAME_pushl() and defframe().
  51
  52 There's generally some slackness in redefining FRAME.  If new values aren't
  53 going to get used, then the redefinitions are omitted to keep from
  54 cluttering up the code.  This happens for instance at the end of a routine,
  55 where there might be just four register pops and then a ret, so FRAME isn't
  56 getting used.
  57
  58 Local variables and saved registers can be similarly defined, with negative
  59 offsets representing stack space below the initial stack pointer.  For
  60 example,
  61
  62         defframe(SAVE_ESI,   -4)
  63         defframe(SAVE_EDI,   -8)
  64         defframe(VAR_COUNTER,-12)
  65
  66         deflit(STACK_SPACE, 12)
  67
  68 Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
  69 space, and that instruction must be followed by a redefinition of FRAME
  70 (setting it equal to STACK_SPACE) to reflect the change in %esp.
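
As a rough illustration of the offset arithmetic described above (this is
not GMP code), the following C sketch models FRAME as a plain integer.  The
names frame_model, model_pushl, model_subl_esp and esp_offset are invented
for the example; only the defframe() offsets and STACK_SPACE come from the
text.

```c
/* Hypothetical model: `frame` stands in for the m4 FRAME value, the number
   of bytes %esp has moved down since function entry. */
struct frame_model {
    int frame;
};

/* Effect of a pushl on the bookkeeping (what FRAME_pushl() records). */
void model_pushl(struct frame_model *f) { f->frame += 4; }

/* Effect of "subl $bytes, %esp" (what FRAME_subl_esp() records). */
void model_subl_esp(struct frame_model *f, int bytes) { f->frame += bytes; }

/* %esp-relative offset of a slot declared as defframe(NAME, offset),
   i.e. the value of FRAME+offset in `FRAME+offset(%esp)'. */
int esp_offset(const struct frame_model *f, int offset)
{
    return f->frame + offset;
}
```

With FRAME at 0, PARAM_MULTIPLIER is 16(%esp).  After "subl $12, %esp" FRAME
is 12, so SAVE_ESI (-4) is at 8(%esp), VAR_COUNTER (-12) is at 0(%esp), and
PARAM_MULTIPLIER has moved to 28(%esp).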
  71
  72 Definitions for pushed registers are only put in when they're going to be
  73 used.  If registers are just saved and restored with pushes and pops then
  74 definitions aren't made.
  75
  76
  77
  78 ASSEMBLER EXPRESSIONS
  79
  80 Only addition and subtraction seem to be universally available; certainly
  81 that's all the Solaris 8 "as" seems to accept.  If other operators are
  82 wanted then m4 eval() should be used to compute the value beforehand.
  83
  84 In particular note that a "/" anywhere in a line starts a comment in Solaris
  85 "as", and in some configurations of gas too.
  86
  87         addl    $32/2, %eax           <-- wrong
  88
  89         addl    $eval(32/2), %eax     <-- right
  90
  91 Binutils gas/config/tc-i386.c has a choice between "/" being a comment
  92 anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
  93 the latter, and as of 2.9.5 it's the default for GNU/Linux too.
  94
  95
  96
  97 ASSEMBLER COMMENTS
  98
  99 Solaris "as" doesn't support "#" commenting, using /* */ instead,
 100 unfortunately.  For that reason "C" commenting is used (see asm-defs.m4),
 101 which m4 strips, so the intermediate ".s" files have no comments.
 102
 103
 104
 105 ZERO DISPLACEMENTS
 106
 107 In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
 108 displacement are wanted, rather than (%ebx) with no displacement.  These are
 109 either for computed jumps or to get desirable code alignment.  Explicit
 110 .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
 111 (%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.
 112
 113 Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
 114 1.92.3 changes it.  In general changing would be the sort of "optimization"
 115 an assembler might perform, hence explicit ".byte"s are used where
 116 necessary.
 117
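To make the size difference concrete, here is a small C sketch of the two
encodings of a load from %ebx into %eax.  It is only an illustration of the
standard IA-32 ModRM layout, not a description of how Zdisp() is
implemented; the function name and its parameters are invented for the
example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode "movl disp(%ebx), %eax", either with no displacement or with a
   forced byte-sized displacement.  Returns the number of bytes written.
   (Hypothetical helper, for illustration only.) */
size_t encode_movl_ebx_to_eax(uint8_t *out, int force_disp8, int8_t disp)
{
    const uint8_t reg_eax = 0, rm_ebx = 3;
    size_t n = 0;

    out[n++] = 0x8B;                      /* opcode: MOV r32, r/m32 */
    if (force_disp8) {
        /* mod=01: a one-byte displacement follows the ModRM byte */
        out[n++] = (uint8_t)(0x40 | (reg_eax << 3) | rm_ebx);
        out[n++] = (uint8_t)disp;
    } else {
        /* mod=00: no displacement */
        out[n++] = (uint8_t)((reg_eax << 3) | rm_ebx);
    }
    return n;
}
```

The plain form comes out as the two bytes 8B 03, while the form with an
explicit zero byte displacement is the three bytes 8B 43 00.  That extra
byte is what matters for computed jumps and alignment.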
 118
 119
 120 SHLD/SHRD INSTRUCTIONS
 121
 122 The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
 123 must be written "shldl %eax,%ebx" for some assemblers.  gas takes either;
 124 Solaris "as" doesn't allow %cl; gcc generates %cl for gas and NeXT (which is
 125 gas) and omits %cl elsewhere.
 126
 127 For GMP an autoconf test is used to determine whether %cl should be used and
 128 the macros shldl, shrdl, shldw and shrdw in mpn/x86/x86-defs.m4 then pass
 129 through or omit %cl as necessary.  See comments with those macros for usage.
 130
 131
 132
 133 DIRECTION FLAG
 134
 135 The x86 calling conventions say that the direction flag should be clear at
 136 function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
 137
 138 Although this has been so since the year dot, it's not absolutely clear
 139 whether it's universally respected.  Since it's better to be safe than
 140 sorry, gmp follows glibc and does a "cld" if it depends on the direction
 141 flag being clear.  This happens only in a few places.
 142
 143
 144
 145 POSITION INDEPENDENT CODE
 146
 147 Defining the symbol PIC in m4 processing selects position independent code.
 148 This mainly affects computed jumps, and these are implemented in a
 149 self-contained fashion (without using the global offset table).  The few
 150 calls from assembly code to global functions use the normal procedure
 151 linkage table.
 152
 153 PIC is necessary for ELF shared libraries because they can be mapped into
 154 different processes at different virtual addresses.  Text relocations in
 155 shared libraries are allowed, but that presumably means a page with such a
 156 relocation isn't shared.  The use of the PLT for PIC adds a fixed cost to
 157 every function call, which is small but might be noticeable when working with
 158 small operands.
 159
 160 Calls from one library function to another don't need to go through the PLT,
 161 since of course the call instruction uses a displacement, not an absolute
 162 address, and the relative locations of object files are known when libgmp.so
 163 is created.  "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls
 164 this way, so that there's no jump through the PLT, though this still leaves
 165 setups of the GOT address in %ebx that may be unnecessary.
 166
 167 The %ebx setup could be avoided in assembly if a separate option controlled
 168 PIC for calls as opposed to computed jumps etc.  But there's only ever
 169 likely to be a handful of calls out of assembler, and getting the same
 170 optimization for C intra-library calls would be more important.  There seems
 171 no easy way to tell gcc that certain functions can be called non-PIC, and
 172 unfortunately many gmp functions use the global memory allocation variables,
 173 so they need the GOT anyway.  Object files with no global data references
 174 and only intra-library calls could go into the library as non-PIC under
 175 -Bsymbolic.  Integrating this into libtool and automake is left as an
 176 exercise for the reader.
 177
 178
 179
 180 SIMPLE LOOPS
 181
 182 The overheads in setting up for an unrolled loop can mean that at small
 183 sizes a simple loop is faster.  Making small sizes go fast is important,
 184 even if it adds a cycle or two to bigger sizes.  To this end various
 185 routines choose between a simple loop and an unrolled loop according to
 186 operand size.  The path to the simple loop, or to special case code for
 187 small sizes, is always as fast as possible.
 188
 189 Adding a simple loop requires a conditional jump to choose between the
 190 simple and unrolled code.  The size of a branch misprediction penalty
 191 affects whether a simple loop is worthwhile.
 192
 193 The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
 194 point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
 195 UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
 196 a couple of cycles to an unrolled loop setup, the threshold will vary with
 197 PIC or non-PIC.  Something like the following is typical.
 198
 199         ifdef(`PIC',`
 200         deflit(UNROLL_THRESHOLD, 10)
 201         ',`
 202         deflit(UNROLL_THRESHOLD, 8)
 203         ')
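
The same dispatch can be sketched in C.  This is only a model of the idea,
assuming a limb copy as the operation and a 4-way unroll; copy_simple,
copy_unrolled and copy_limbs are hypothetical names, and the threshold value
is illustrative rather than measured.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;

#define UNROLL_THRESHOLD 8   /* illustrative value, not a tuned one */

/* One limb per iteration: minimal setup, best for small sizes. */
static void copy_simple(limb_t *dst, const limb_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Four limbs per iteration, then a tail loop for the remainder. */
static void copy_unrolled(limb_t *dst, const limb_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; i++)
        dst[i] = src[i];
}

/* Sizes below the threshold take the simple loop, others the unrolled. */
void copy_limbs(limb_t *dst, const limb_t *src, size_t n)
{
    if (n < UNROLL_THRESHOLD)
        copy_simple(dst, src, n);
    else
        copy_unrolled(dst, src, n);
}
```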
 204
 205 There's no automated way to determine the threshold.  Setting it to a small
 206 value and then to a big value makes it possible to measure the simple and
 207 unrolled loops each over a range of sizes, from which the crossover point
 208 can be determined.  Alternatively, just adjust the threshold up or down
 209 until there are no further speedups.
 210
 211
 212
 213 UNROLLED LOOP CODING
 214
 215 The x86 addressing modes allow a byte displacement of -128 to +127, making
 216 it possible to access 256 bytes, which is 64 limbs, without adjusting
 217 pointer registers within the loop.  Dword sized displacements can be used
 218 too, but they increase code size, and unrolling to 64 ought to be enough.
 219
 220 When unrolling to the full 64 limbs/loop, the limb at the top of the loop
 221 will have a displacement of -128, so pointers have to have a corresponding
 222 +128 added before entering the loop.  When unrolling to 32 limbs/loop
 223 displacements 0 to 127 can be used with 0 at the top of the loop and no
 224 adjustment needed to the pointers.
 225
 226 Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
 227 limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
 228 16 is small, so support for 64 limbs/loop is generally only for comparison.
 229
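The displacement arithmetic can be checked with a short C sketch.  The
function names are invented for the example, and the -128 bias is applied
only for the 64 limbs/loop case, as described above.

```c
enum { BYTES_PER_LIMB = 4 };

/* Displacement of the k-th limb access (k = 0 at the top of the loop body)
   for a given unroll count.  Only the full 64-limb case needs the -128
   bias, paired with +128 added to the pointers before the loop. */
int limb_displacement(int unroll_count, int k)
{
    int bias = (unroll_count == 64) ? -128 : 0;
    return bias + BYTES_PER_LIMB * k;
}

/* Nonzero if disp can be encoded as a signed byte displacement. */
int fits_disp8(int disp)
{
    return disp >= -128 && disp <= 127;
}
```

At 64 limbs/loop the displacements run from -128 to +124, and at 32
limbs/loop from 0 to +124, so both fit in a byte.  Without the bias the last
of 64 limbs would need a displacement of 252, which does not.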
 230
 231
 232 COMPUTED JUMPS
 233
 234 When working from least significant limb to most significant limb (most
 235 routines) the computed jump and pointer calculations in preparation for an
 236 unrolled loop are as follows.
 237
 238         S = operand size in limbs
 239         N = number of limbs per loop (UNROLL_COUNT)
 240         L = log2 of unrolling (UNROLL_LOG2)
 241         M = mask for unrolling (UNROLL_MASK)
 242         C = code bytes per limb in the loop
 243         B = bytes per limb (4 for x86)
 244
 245         computed jump            (-S & M) * C + entrypoint
 246         subtract from pointers   (-S & M) * B
 247         initial loop counter     (S-1) >> L
 248         displacements            0 to B*(N-1)
 249
 250 The loop counter is decremented at the end of each loop, and the looping
 251 stops when the decrement takes the counter to -1.  The displacements are
 252 those used to address each limb, e.g. a load with "movl disp(%ebx), %eax".
 253
 254 Usually the multiply by "C" can be handled without an imul, using instead an
 255 leal, or a shift and subtract.
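
The calculations can be tried out in C by letting a switch into a do/while
loop stand in for the computed jump (the switch selects the entry point, so
the multiply by C is implicit).  This is a sketch under assumptions, not the
way the assembly is written: the unrolling is fixed at 4, the operation is
"add 1 to each limb", add_one_ascending is a made-up name, and size must be
at least 1.

```c
#include <stdint.h>

typedef uint32_t limb_t;

enum {
    UNROLL_LOG2  = 2,                     /* L */
    UNROLL_COUNT = 1 << UNROLL_LOG2,      /* N */
    UNROLL_MASK  = UNROLL_COUNT - 1       /* M */
};

/* dst[i] = src[i] + 1 for 0 <= i < size, working from the least
   significant limb upwards.  Requires size >= 1. */
void add_one_ascending(limb_t *dst, const limb_t *src, long size)
{
    /* (-S & M): how many limb slots of the first pass to skip */
    long skip = (long)((0UL - (unsigned long)size) & UNROLL_MASK);
    /* (S-1) >> L: initial loop counter */
    long counter = (size - 1) >> UNROLL_LOG2;
    /* "subtract from pointers", kept as an index to stay within C rules */
    long base = -skip;

    switch (skip) {                       /* the computed jump */
        do {
    case 0: dst[base + 0] = src[base + 0] + 1;  /* fall through */
    case 1: dst[base + 1] = src[base + 1] + 1;  /* fall through */
    case 2: dst[base + 2] = src[base + 2] + 1;  /* fall through */
    case 3: dst[base + 3] = src[base + 3] + 1;
            base += UNROLL_COUNT;
        } while (--counter >= 0);         /* stops when counter hits -1 */
    }
}
```

For S = 5 this gives a skip of 3 and a counter of 1: the first pass handles
one limb, the second pass four, and the decrement then takes the counter to
-1.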
 256
 257 When working from most significant to least significant limb (e.g.
 258 mpn_lshift and mpn_copyd), the calculations change as follows.
 259
 260         add to pointers          (-S & M) * B
 261         displacements            0 to -B*(N-1)
 262
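The descending variant can be modelled the same way, again with a switch
standing in for the computed jump.  As before this is only a sketch:
unrolling is fixed at 4, copy_descending is a hypothetical name loosely in
the spirit of mpn_copyd, and size must be at least 1.

```c
#include <stdint.h>

typedef uint32_t limb_t;

enum {
    UNROLL_LOG2  = 2,                     /* L */
    UNROLL_COUNT = 1 << UNROLL_LOG2,      /* N */
    UNROLL_MASK  = UNROLL_COUNT - 1       /* M */
};

/* Copy size limbs from src to dst, most significant limb first.
   Requires size >= 1. */
void copy_descending(limb_t *dst, const limb_t *src, long size)
{
    long skip = (long)((0UL - (unsigned long)size) & UNROLL_MASK);
    long counter = (size - 1) >> UNROLL_LOG2;
    /* "add to pointers": start at the top limb, moved up by the skip,
       so that slot number `skip` lands on limb size-1 */
    long top = (size - 1) + skip;

    switch (skip) {                       /* the computed jump */
        do {
    case 0: dst[top - 0] = src[top - 0];  /* fall through */
    case 1: dst[top - 1] = src[top - 1];  /* fall through */
    case 2: dst[top - 2] = src[top - 2];  /* fall through */
    case 3: dst[top - 3] = src[top - 3];
            top -= UNROLL_COUNT;
        } while (--counter >= 0);
    }
}
```

Because the high limbs are written first, this order is also safe when dst
overlaps src at a higher address.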
 263
 264
 265 OLD GAS 1.92.3
 266
 267 This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
 268 affect gmp code.
 269
 270 Firstly, an expression involving two forward references to labels comes out
 271 as zero.  For example,
 272
 273                 addl    $bar-foo, %eax
 274         foo:
 275                 nop
 276         bar:
 277
 278 This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
 279 When only one forward reference is involved, it works correctly, as for
 280 example,
 281
 282         foo:
 283                 addl    $bar-foo, %eax
 284                 nop
 285         bar:
 286
 287 Secondly, an expression involving two labels can't be used as the
 288 displacement for an leal.  For example,
 289
 290         foo:
 291                 nop
 292         bar:
 293                 leal    bar-foo(%eax,%ebx,8), %ecx
 294
 295 A slightly cryptic error is given, "Unimplemented segment type 0 in
 296 parse_operand".  When only one label is used it's ok, and the label can be a
 297 forward reference too, as for example,
 298
 299                 leal    foo(%eax,%ebx,8), %ecx
 300                 nop
 301         foo:
 302
 303 These problems only affect PIC computed jump calculations.  The workarounds
 304 are just to do an leal without a displacement and then an addl, and to make
 305 sure the code is placed so that there's at most one forward reference in the
 306 addl.
 307
 308
 309
 310 REFERENCES
 311
 312 "Intel Architecture Software Developer's Manual", volumes 1 to 3, 1999,
 313 order numbers 243190, 243191 and 243192.  Available on-line,
 314
 315         ftp://download.intel.com/design/PentiumII/manuals/243190.htm
 316         ftp://download.intel.com/design/PentiumII/manuals/243191.htm
 317         ftp://download.intel.com/design/PentiumII/manuals/243192.htm
 318
 319 "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
 320 published by McGraw-Hill, 1991, ISBN 0-07-031219-2.
 321
 322 "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
 323 published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
 324 Supplement", AT&T, 1991, ISBN 0-13-877689-X.  (These have details of ELF
 325 shared library PIC coding.)
 326
 327
 328
 329 ----------------
 330 Local variables:
 331 mode: text
 332 fill-column: 76
 333 End: