+++ /dev/null
-This directory contains mpn functions optimized for DEC Alpha processors.
-
-ALPHA ASSEMBLY RULES AND REGULATIONS
-
-The `.prologue N' pseudo op marks the end of the instructions that need
-special handling during stack unwinding.  It also says whether $27 is
-really needed for computing the gp.  The `.mask M' pseudo op says which
-registers are saved on the stack, and at what offset in the frame.
-
-Cray code is very very different...
-
-
-RELEVANT OPTIMIZATION ISSUES
-
-EV4
-
-1. This chip has very limited store bandwidth.  The on-chip L1 cache is
-   write-through, and a cache line is transferred from the store buffer to
-   the off-chip L2 in as many as 15 cycles on most systems.  This delay
-   hurts mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
-
-2. Pairing is possible between memory instructions and integer arithmetic
- instructions.
-
-3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
-   these cycles are pipelined.  Thus, multiply instructions can be issued
-   at a rate of one every 21 cycles.
-
-EV5
-
-1. The memory bandwidth of this chip seems excellent, both for loads and
-   stores.  Even when the working set is larger than the on-chip L1 and L2
-   caches, the performance remains almost unaffected.
-
-2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
- umulh has a measured latency of 14 cycles and an issue rate of 1 each
- 10th cycle. But the exact timing is somewhat confusing.
-
-3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12
- are memory operations. This will take at least
-	ceil((37+1)/2) [dual issue, +1 for the taken branch] = 19 cycles
- We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
- cache cycles, which should be completely hidden in the 19 issue cycles.
- The computation is inherently serial, with these dependencies:
-
- ldq ldq
- \ /\
- (or) addq |
- |\ / \ |
- | addq cmpult
- \ | |
- cmpult |
- \ /
- or
-
-   I.e., 3 operations are needed between carry-in and carry-out, making 12
-   cycles the absolute minimum for the 4 limbs.  We could replace the `or'
-   with a cmoveq/cmovne, which could issue one cycle earlier than the `or',
-   but that might waste a cycle on EV4.  The total depth remains unaffected,
-   since cmov has a latency of 2 cycles.
-
- addq
- / \
- addq cmpult
- | \
- cmpult -> cmovne
-
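-The carry chain above can be sketched in C (a hedged model of the diagram;
-the variable names are ours, not from the actual mpn code):

```c
#include <stdint.h>

/* One limb of the add-with-carry chain from the diagram above.
   A hedged C model; names are illustrative, not from the mpn sources.  */
static uint64_t add_limb (uint64_t a, uint64_t b, uint64_t cy, uint64_t *sum)
{
  uint64_t t, c1, c2;
  t = a + b;			/* addq */
  c1 = t < a;			/* cmpult: carry out of a+b */
  *sum = t + cy;		/* addq: fold in carry-in (0 or 1) */
  c2 = *sum < cy;		/* cmpult */
  return c1 | c2;		/* or; c1 and c2 cannot both be 1, so a
				   cmovne computing (c2 ? 1 : c1) gives
				   the same result */
}
```

-Since c1 and c2 are mutually exclusive, the final `or' is the only point
-where the two compares join, which is what makes the depth 3.
-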
-Montgomery has a slightly different way of computing carry that requires one
-less instruction, but has depth 4 (instead of the current 3). Since the
-code is currently instruction issue bound, Montgomery's idea should save us
-1/2 cycle per limb, or bring us down to a total of 17 cycles or 4.25
-cycles/limb. Unfortunately, this method will not be good for the EV6.
-
-EV6
-
-Here we have a really parallel pipeline, capable of issuing up to 4 integer
-instructions per cycle. One integer multiply instruction can issue each
-cycle. To get optimal speed, we need to pretend we are vectorizing the code,
-i.e., minimize the iterative dependencies.
-
-There are two kinds of dependency to watch out for: 1) address arithmetic
-dependencies, and 2) carry propagation dependencies.
-
-We can avoid serializing due to address arithmetic by unrolling the loop, so
-that addresses don't depend heavily on an index variable. Avoiding
-serializing because of carry propagation is trickier; the ultimate performance
-of the code will be determined by the number of latency cycles it takes from
-accepting carry-in at a vector point until we can generate carry-out.
-
-Most integer instructions can execute in either the L0, U0, L1, or U1
-pipelines. Shifts only execute in U0 and U1, and multiply only in U1.
-
-CMOV instructions split into two internal instructions, CMOV1 and CMOV2, but
-they execute efficiently.  However, CMOV splits the mapping process (see pg
-2-26 in cmpwrgd.pdf), suggesting that CMOV should always be placed as the
-last instruction of an aligned 4-instruction block (?).
-
-Perhaps the most important issue is the latency between the L0/U0 and L1/U1
-clusters; a result obtained on either cluster has an extra cycle of latency
-for consumers in the opposite cluster. Because of the dynamic nature of the
-implementation, it is hard to predict where an instruction will execute.
-
-The shift loops need (per limb):
- 1 load (Lx pipes)
- 1 store (Lx pipes)
- 2 shift (Ux pipes)
- 1 iaddlog (Lx pipes, Ux pipes)
-Obviously, since the pipes are very evenly loaded, we should get 4
-insn/cycle, or 1.25 cycles/limb.
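-
-As a hedged C model (not the actual assembly), the per-limb work of a left
-shift maps onto those instruction counts like this:

```c
#include <stdint.h>

/* Sketch of an mpn-style left shift, 0 < cnt < 64: per limb, one load,
   two shifts, one "or", and one store, matching the counts listed above.
   A hedged C model; the real loop is written in assembly.  */
static uint64_t lshift_sketch (uint64_t *rp, const uint64_t *sp,
			       long n, unsigned cnt)
{
  unsigned tnc = 64 - cnt;
  uint64_t high = sp[n - 1];	/* ldq */
  uint64_t ret = high >> tnc;	/* bits shifted out at the top */
  for (long i = n - 1; i > 0; i--)
    {
      uint64_t low = sp[i - 1];			/* ldq */
      rp[i] = (high << cnt) | (low >> tnc);	/* sll, srl, or; stq */
      high = low;
    }
  rp[0] = high << cnt;
  return ret;
}
```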
-
-For mpn_add_n, we currently have
- 2 load (Lx pipes)
- 1 store (Lx pipes)
- 5 iaddlog (Lx pipes, Ux pipes)
-
-Again, we have a perfect balance and will be limited by carry propagation
-delays, currently three cycles.  The superoptimizer indicates that there
-might be sequences that--using a final cmov--have a carry propagation delay
-of just two.  Montgomery's subtraction sequence could perhaps be used, by
-complementing some operands.  All in all, we should get down to 2 cycles
-without much trouble.
-
-For mpn_mul_1, we could do, just like for mpn_add_n:
- not newlo,notnewlo
- addq cylimb,newlo,newlo || cmpult cylimb,notnewlo,cyout
- addq cyout,newhi,cylimb
-and get 2-cycle carry propagation. The instructions needed will be
- 1 ld (Lx pipes)
- 1 st (Lx pipes)
- 2 mul (U1 pipe)
- 4 iaddlog (Lx pipes, Ux pipes)
-issue1: addq not mul ld
-issue2: cmpult addq mul st
-Conclusion: no cluster delays and 2-cycle carry delays will give us 2 cycles/limb!
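-
-The same fold, as a hedged C sketch: the key point is that the carry out of
-cylimb+newlo occurs exactly when cylimb exceeds ~newlo, so the compare can
-issue in parallel with the add rather than waiting for its result.

```c
#include <stdint.h>

/* One step of the 2-cycle carry fold sketched above, in C.
   A hedged model; names follow the pseudo-assembly, not the mpn sources.  */
static uint64_t mul1_fold (uint64_t newlo, uint64_t newhi, uint64_t cylimb,
			   uint64_t *res)
{
  uint64_t notnewlo = ~newlo;		/* not */
  uint64_t cyout = cylimb > notnewlo;	/* compare, in parallel with... */
  *res = cylimb + newlo;		/* ...addq */
  return cyout + newhi;			/* addq: next cylimb */
}
```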
-
-Last, we have mpn_addmul_1. Almost certainly, we will get down to 3
-cycles/limb, which would be absolutely awesome.
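-
-For reference, a hedged C sketch of the whole per-limb recurrence, with
-unsigned __int128 (a GCC/Clang extension) standing in for the mulq/umulh
-pair:

```c
#include <stdint.h>

/* Hedged C sketch of mpn_addmul_1: rp[] += up[] * v, returning the carry
   limb.  The 128-bit product models the mulq/umulh instruction pair.  */
static uint64_t addmul_1_sketch (uint64_t *rp, const uint64_t *up,
				 long n, uint64_t v)
{
  uint64_t cy = 0;
  for (long i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v;  /* mulq+umulh */
      uint64_t plo = (uint64_t) p, phi = (uint64_t) (p >> 64);
      uint64_t t = plo + cy;	/* addq */
      phi += t < plo;		/* cmpult folded into the high limb */
      uint64_t r = rp[i] + t;	/* addq */
      phi += r < t;		/* cmpult */
      rp[i] = r;		/* stq */
      cy = phi;
    }
  return cy;
}
```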
-
-Old, perhaps obsolete addmul_1 dependency diagram (needs a 175-column-wide
-screen):
-
- i
- s
- s i
- u n
- e s
- d t
- r
- i u
-l n c
-i s t
-v t i
-e r o
- u n
-v c
-a t t
-l i y
-u o p
-e n e
-s s s
- issue
- in
- cycle
- -1 ldq
- / \
- 0 | \
- | \
- 1 | |
- | |
- 2 | | ldq
- | | / \
- 3 | mulq | \
- | \ | \
- 4 umulh \ | |
- | | | |
- 5 | | | | ldq
- | | | | / \
- 4calm 6 | | ldq | mulq | \
- | | / | \ | \
- 4casm 7 | | / umulh \ | |
-6 | || | | | |
- 3aal 8 | || | | | | ldq
-7 | || | | | | / \
- 4calm 9 | || | | ldq | mulq | \
-9 | || | | / | \ | \
- 4casm 10 | || | | / umulh \ | |
-9 | || | || | | | |
- 3aal 11 | addq | || | | | | ldq
-9 | // \ | || | | | | / \
- 4calm 12 \ cmpult addq<-cy | || | | ldq | mulq | \
-13 \ / // \ | || | | / | \ | \
- 4casm 13 addq cmpult stq | || | | / umulh \ | |
-11 \ / | || | || | | | |
- 3aal 14 addq | addq | || | | | | ldq
-10 \ | // \ | || | | | | / \
- 4calm 15 cy ----> \ cmpult addq<-cy | || | | ldq | mulq | \
-13 \ / // \ | || | | / | \ | \
- 4casm 16 addq cmpult stq | || | | / umulh \ | |
-11 \ / | || | || | | | |
- 3aal 17 addq | addq | || | | | |
-10 \ | // \ | || | | | |
- 4calm 18 cy ----> \ cmpult addq<-cy | || | | ldq | mulq
-13 \ / // \ | || | | / | \
- 4casm 19 addq cmpult stq | || | | / umulh \
-11 \ / | || | || | |
- 3aal 20 addq | addq | || | |
-10 \ | // \ | || | |
- 4calm 21 cy ----> \ cmpult addq<-cy | || | | ldq
- \ / // \ | || | | /
- 22 addq cmpult stq | || | | /
- \ / | || | ||
- 23 addq | addq | ||
- \ | // \ | ||
- 24 cy ----> \ cmpult addq<-cy | ||
- \ / // \ | ||
- 25 addq cmpult stq | ||
- \ / | ||
- 26 addq | addq
- \ | // \
- 27 cy ----> \ cmpult addq<-cy
- \ / // \
- 28 addq cmpult stq
- \ /
-              addq
-                 \
-                  cy ---->
-
-As many as 6 consecutive points will be under execution simultaneously, or
-if we schedule loads even further away, maybe 7 or 8.  But the number of
-live quantities is reasonable, and can easily be satisfied.