X86 CPU FAMILY MPN SUBROUTINES This file has some notes on things common to all the x86 family code. ASM FILES The x86 .asm files are BSD style x86 assembler code, first put through m4 for macro processing. The generic mpn/asm-defs.m4 is used, together with mpn/x86/x86-defs.m4. Detailed notes are in those files. The code is meant for use with GNU "gas" or a system "as". There's no support for assemblers that demand Intel style, and with gas freely available and easy to use that shouldn't be a problem. STACK FRAME m4 macros are used to define the parameters passed on the stack, and these act like comments on what the stack frame looks like too. For example, mpn_mul_1() has the following. defframe(PARAM_MULTIPLIER, 16) defframe(PARAM_SIZE, 12) defframe(PARAM_SRC, 8) defframe(PARAM_DST, 4) Here PARAM_MULTIPLIER gets defined as `FRAME+16(%esp)', and the others similarly. The return address is at offset 0, but there's not normally any need to access that. FRAME is redefined as necessary through the code so it's the number of bytes pushed on the stack, and hence the offsets in the parameter macros stay correct. At the start of a routine FRAME should be zero. deflit(`FRAME',0) ... deflit(`FRAME',4) ... deflit(`FRAME',8) ... Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions, and can be used instead of explicit definitions if preferred. defframe_pushl() is a combination FRAME_pushl() and defframe(). There's generally some slackness in redefining FRAME. If new values aren't going to get used, then the redefinitions are omitted to keep from cluttering up the code. This happens for instance at the end of a routine, where there might be just four register pops and then a ret, so FRAME isn't getting used. Local variables and saved registers can be similarly defined, with negative offsets representing stack space below the initial stack pointer. For example, defframe(SAVE_ESI, -4) defframe(SAVE_EDI, -8) defframe(VAR_COUNTER,-12) deflit(STACK_SPACE, 12) Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the space, and that instruction must be followed by a redefinition of FRAME (setting it equal to STACK_SPACE) to reflect the change in %esp. Definitions for pushed registers are only put in when they're going to be used. If registers are just saved and restored with pushes and pops then definitions aren't made. ASSEMBLER EXPRESSIONS Only addition and subtraction seem to be universally available, certainly that's all the Solaris 8 "as" seems to accept. If expressions are wanted then m4 eval() should be used. In particular note that a "/" anywhere in a line starts a comment in Solaris "as", and in some configurations of gas too. addl $32/2, %eax <-- wrong addl $eval(32/2), %eax <-- right Binutils gas/config/tc-i386.c has a choice between "/" being a comment anywhere in a line, or only at the start. FreeBSD patches 2.9.1 to select the latter, and as of 2.9.5 it's the default for GNU/Linux too. ASSEMBLER COMMENTS Solaris "as" doesn't support "#" commenting, using /* */ instead, unfortunately. For that reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s" files have no comments. ZERO DISPLACEMENTS In a couple of places addressing modes like 0(%ebx) with a byte-sized zero displacement are wanted, rather than (%ebx) with no displacement. These are either for computed jumps or to get desirable code alignment. Explicit .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into (%ebx). The Zdisp() macro in x86-defs.m4 is used for this. Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas 1.92.3 changes it. In general changing would be the sort of "optimization" an assembler might perform, hence explicit ".byte"s are used where necessary. SHLD/SHRD INSTRUCTIONS The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx" must be written "shldl %eax,%ebx" for some assemblers. gas takes either, Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is gas), and omits %cl elsewhere. For GMP an autoconf test is used to determine whether %cl should be used and the macros shldl, shrdl, shldw and shrdw in mpn/x86/x86-defs.m4 then pass through or omit %cl as necessary. See comments with those macros for usage. DIRECTION FLAG The x86 calling conventions say that the direction flag should be clear at function entry and exit. (See iBCS2 and SVR4 ABI books, references below.) Although this has been so since the year dot, it's not absolutely clear whether it's universally respected. Since it's better to be safe than sorry, gmp follows glibc and does a "cld" if it depends on the direction flag being clear. This happens only in a few places. POSITION INDEPENDENT CODE Defining the symbol PIC in m4 processing selects position independent code. This mainly affects computed jumps, and these are implemented in a self-contained fashion (without using the global offset table). The few calls from assembly code to global functions use the normal procedure linkage table. PIC is necessary for ELF shared libraries because they can be mapped into different processes at different virtual addresses. Text relocations in shared libraries are allowed, but that presumably means a page with such a relocation isn't shared. The use of the PLT for PIC adds a fixed cost to every function call, which is small but might be noticeable when working with small operands. Calls from one library function to another don't need to go through the PLT, since of course the call instruction uses a displacement, not an absolute address, and the relative locations of object files are known when libgmp.so is created. "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls this way, so that there's no jump through the PLT, but of course leaving setups of the GOT address in %ebx that may be unnecessary. The %ebx setup could be avoided in assembly if a separate option controlled PIC for calls as opposed to computed jumps etc. But there's only ever likely to be a handful of calls out of assembler, and getting the same optimization for C intra-library calls would be more important. There seems no easy way to tell gcc that certain functions can be called non-PIC, and unfortunately many gmp functions use the global memory allocation variables, so they need the GOT anyway. Object files with no global data references and only intra-library calls could go into the library as non-PIC under -Bsymbolic. Integrating this into libtool and automake is left as an exercise for the reader. SIMPLE LOOPS The overheads in setting up for an unrolled loop can mean that at small sizes a simple loop is faster. Making small sizes go fast is important, even if it adds a cycle or two to bigger sizes. To this end various routines choose between a simple loop and an unrolled loop according to operand size. The path to the simple loop, or to special case code for small sizes, is always as fast as possible. Adding a simple loop requires a conditional jump to choose between the simple and unrolled code. The size of a branch misprediction penalty affects whether a simple loop is worthwhile. The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >= UNROLL_THRESHOLD using the unrolled loop. If position independent code adds a couple of cycles to an unrolled loop setup, the threshold will vary with PIC or non-PIC. Something like the following is typical. ifdef(`PIC',` deflit(UNROLL_THRESHOLD, 10) ',` deflit(UNROLL_THRESHOLD, 8) ') There's no automated way to determine the threshold. Setting it to a small value and then to a big value makes it possible to measure the simple and unrolled loops each over a range of sizes, from which the crossover point can be determined. Alternately, just adjust the threshold up or down until there's no more speedups. UNROLLED LOOP CODING The x86 addressing modes allow a byte displacement of -128 to +127, making it possible to access 256 bytes, which is 64 limbs, without adjusting pointer registers within the loop. Dword sized displacements can be used too, but they increase code size, and unrolling to 64 ought to be enough. When unrolling to the full 64 limbs/loop, the limb at the top of the loop will have a displacement of -128, so pointers have to have a corresponding +128 added before entering the loop. When unrolling to 32 limbs/loop displacements 0 to 127 can be used with 0 at the top of the loop and no adjustment needed to the pointers. Where 64 limbs/loop is supported, the +128 adjustment is done only when 64 limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or 16 is small, so support for 64 limbs/loop is generally only for comparison. COMPUTED JUMPS When working from least significant limb to most significant limb (most routines) the computed jump and pointer calculations in preparation for an unrolled loop are as follows. S = operand size in limbs N = number of limbs per loop (UNROLL_COUNT) L = log2 of unrolling (UNROLL_LOG2) M = mask for unrolling (UNROLL_MASK) C = code bytes per limb in the loop B = bytes per limb (4 for x86) computed jump (-S & M) * C + entrypoint subtract from pointers (-S & M) * B initial loop counter (S-1) >> L displacements 0 to B*(N-1) The loop counter is decremented at the end of each loop, and the looping stops when the decrement takes the counter to -1. The displacements are for the addressing accessing each limb, eg. a load with "movl disp(%ebx), %eax". Usually the multiply by "C" can be handled without an imul, using instead an leal, or a shift and subtract. When working from most significant to least significant limb (eg. mpn_lshift and mpn_copyd), the calculations change as follows. add to pointers (-S & M) * B displacements 0 to -B*(N-1) OLD GAS 1.92.3 This version comes with FreeBSD 2.2.8 and has a couple of gremlins that affect gmp code. Firstly, an expression involving two forward references to labels comes out as zero. For example, addl $bar-foo, %eax foo: nop bar: This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax". When only one forward reference is involved, it works correctly, as for example, foo: addl $bar-foo, %eax nop bar: Secondly, an expression involving two labels can't be used as the displacement for an leal. For example, foo: nop bar: leal bar-foo(%eax,%ebx,8), %ecx A slightly cryptic error is given, "Unimplemented segment type 0 in parse_operand". When only one label is used it's ok, and the label can be a forward reference too, as for example, leal foo(%eax,%ebx,8), %ecx nop foo: These problems only affect PIC computed jump calculations. The workarounds are just to do an leal without a displacement and then an addl, and to make sure the code is placed so that there's at most one forward reference in the addl. REFERENCES "Intel Architecture Software Developer's Manual", volumes 1 to 3, 1999, order numbers 243190, 243191 and 243192. Available on-line, ftp://download.intel.com/design/PentiumII/manuals/243190.htm ftp://download.intel.com/design/PentiumII/manuals/243191.htm ftp://download.intel.com/design/PentiumII/manuals/243192.htm "Intel386 Family Binary Compatibility Specification 2", Intel Corporation, published by McGraw-Hill, 1991, ISBN 0-07-031219-2. "System V Application Binary Interface", Unix System Laboratories Inc, 1992, published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor Supplement", AT&T, 1991, ISBN 0-13-877689-X. (These have details of ELF shared library PIC coding.) ---------------- Local variables: mode: text fill-column: 76 End: