Node:Assembler Cache Handling, Next:Assembler Floating Point, Previous:Assembler Carry Propagation, Up:Assembler Coding
GMP aims to perform well both on operands that fit entirely in L1 cache and those which don't.
Basic routines like mpn_add_n or mpn_lshift are often used on
large operands, so L2 and main memory performance is important for them. 
mpn_mul_1 and mpn_addmul_1 are mostly used for multiply and
square basecases, so L1 performance matters most for them, unless assembler
versions of mpn_mul_basecase and mpn_sqr_basecase exist, in
which case the remaining uses are mostly for larger operands.
For L2 or main memory operands, memory access times will almost certainly be more than the calculation time. The aim therefore is to maximize memory throughput, by starting a load of the next cache line while processing the contents of the previous one. Clearly this is only possible if the chip has a lock-up free cache or some sort of prefetch instruction. Most current chips have both these features.
Prefetching sources combines well with loop unrolling, since a prefetch can be initiated once per pass through the unrolled loop (or more than once if one pass covers more than one cache line).
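The following is a minimal C sketch of that idea, not GMP's actual code. The name sketch_add_n, the 64-byte cache line and the distance of 4 lines are all assumptions made for illustration, and __builtin_prefetch is a GCC and Clang builtin standing in for the CPU's prefetch instruction. The fixed-count inner loop stands in for a hand-unrolled body covering exactly one cache line per source.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, not taken from GMP; they would be tuned per CPU. */
#define LIMBS_PER_LINE  8                       /* 64-byte line, 64-bit limbs */
#define PREFETCH_BYTES  (4 * LIMBS_PER_LINE * 8) /* 4 cache lines ahead */

#if defined(__GNUC__)
#define PREFETCH(p)  __builtin_prefetch(p)
#else
#define PREFETCH(p)  ((void) 0)                 /* no prefetch available */
#endif

/* Hypothetical helper: form the prefetch address with integer arithmetic,
   so no out-of-range pointer arithmetic is done near the operand end. */
static const void *ahead_of(const uint64_t *p)
{
    return (const void *) ((uintptr_t) p + PREFETCH_BYTES);
}

/* Add {up,n} and {vp,n}, store at {rp,n}, return the carry out (0 or 1). */
uint64_t sketch_add_n(uint64_t *rp, const uint64_t *up,
                      const uint64_t *vp, size_t n)
{
    uint64_t cy = 0;
    size_t i = 0, j;

    for (; i + LIMBS_PER_LINE <= n; i += LIMBS_PER_LINE) {
        /* One prefetch per source per pass, since each pass consumes
           exactly one cache line of each source. */
        PREFETCH(ahead_of(up + i));
        PREFETCH(ahead_of(vp + i));

        for (j = 0; j < LIMBS_PER_LINE; j++) {
            uint64_t u = up[i + j];
            uint64_t s = u + vp[i + j];
            uint64_t r = s + cy;
            rp[i + j] = r;
            cy = (s < u) | (r < s);             /* at most one can be set */
        }
    }

    for (; i < n; i++) {                        /* leftover limbs */
        uint64_t u = up[i];
        uint64_t s = u + vp[i];
        uint64_t r = s + cy;
        rp[i] = r;
        cy = (s < u) | (r < s);
    }
    return cy;
}
```

A prefetch of this kind is only a hint and does not fault, so the sketch makes no attempt to stop prefetching near the end of the operands.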
On CPUs without write-allocate caches, prefetching destinations will ensure
individual stores hit in the cache rather than going further down the cache
hierarchy, which would limit bandwidth.  Of course for calculations which are
slow anyway, like mpn_divrem_1, write-throughs might be fine.
The distance ahead to prefetch will be determined by memory latency versus throughput. The aim of course is to have data arriving continuously, at peak throughput. Some CPUs have limits on the number of fetches or prefetches in progress.
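As a rough guide, the amount of data that must be in flight to sustain peak throughput is latency times throughput, so the distance in cache lines is that product divided by the line size, limited by whatever the CPU allows to be outstanding. The following C sketch does that arithmetic. The function name and its parameters are hypothetical, and real values would come from measurement on the target chip.

```c
/* Hypothetical helper: number of cache lines to prefetch ahead.
   latency_ns     memory latency in nanoseconds
   bytes_per_ns   peak memory throughput (1 byte/ns is 1 GB/s)
   line_bytes     cache line size in bytes, must be nonzero
   max_in_flight  CPU limit on outstanding fetches, at least 1 */
unsigned prefetch_lines_ahead(unsigned latency_ns, unsigned bytes_per_ns,
                              unsigned line_bytes, unsigned max_in_flight)
{
    /* Bytes that must be in flight to keep data arriving continuously. */
    unsigned long long in_flight =
        (unsigned long long) latency_ns * bytes_per_ns;

    /* Round up to whole cache lines. */
    unsigned long long lines = (in_flight + line_bytes - 1) / line_bytes;

    if (lines < 1)
        lines = 1;
    if (lines > max_in_flight)
        lines = max_in_flight;                  /* respect the CPU limit */
    return (unsigned) lines;
}
```

For instance, with an assumed 100 ns latency, 10 bytes per ns and 64-byte lines, 1000 bytes must be in flight, which rounds up to 16 lines. A CPU allowing only 10 outstanding fetches would be limited to 10 and could not reach peak throughput this way.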
If a special prefetch instruction doesn't exist then a plain load can be used, but in that case care must be taken not to attempt to read past the end of an operand, since that might produce a segmentation violation.
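A minimal C sketch of the plain load approach follows, assuming 64-byte cache lines and a distance of two lines. The names sketch_com_n and touch_limb are hypothetical. The volatile read forces the compiler to emit the load even though the value is discarded, and the bounds check keeps the load inside the operand.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; they would be tuned per CPU. */
#define LIMBS_PER_LINE  8                       /* 64-byte line, 64-bit limbs */
#define AHEAD_LIMBS     (2 * LIMBS_PER_LINE)    /* 2 cache lines ahead */

/* Plain load used as a prefetch: read one limb and discard it. */
static void touch_limb(const uint64_t *p)
{
    (void) *(const volatile uint64_t *) p;
}

/* Store the ones' complement of {up,n} at {rp,n}. */
void sketch_com_n(uint64_t *rp, const uint64_t *up, size_t n)
{
    size_t i = 0, j;

    for (; i + LIMBS_PER_LINE <= n; i += LIMBS_PER_LINE) {
        /* Unlike a prefetch instruction, a real load can fault, so only
           touch a limb that is known to lie inside the operand. */
        if (n - i > AHEAD_LIMBS)
            touch_limb(up + i + AHEAD_LIMBS);

        for (j = 0; j < LIMBS_PER_LINE; j++)
            rp[i + j] = ~up[i + j];
    }

    for (; i < n; i++)                          /* leftover limbs */
        rp[i] = ~up[i];
}
```

The test n - i > AHEAD_LIMBS is equivalent to i + AHEAD_LIMBS < n, but cannot overflow since i never exceeds n.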
Some CPUs or systems have hardware that detects sequential memory accesses and initiates suitable cache movements automatically, making life easy.