May 5, 2000
GCC now supports reordering of basic blocks to attempt to reduce the number of mis-predicted branches in the final executable.
GCC will use dynamic branch predictions from profiling or static branch predictions to drive the block reordering algorithm.
An excellent example of how block reordering can improve code can be found in the following example (inner loop from the infamous eqntott.c benchmark):
int cmppt (a, b)
PTERM *a[], *b[];
{
register int i, aa, bb;
for (i = 0; i < ninputs; i++) {
aa = a[0]->ptand[i];
bb = b[0]->ptand[i];
if (aa == 2)
aa = 0;
if (bb == 2)
bb = 0;
if (aa != bb) {
if (aa < bb) {
return (-1);
}
else {
return (1);
}
}
}
return (0);
}
Note the code inside the loop which returns (1) or (-1) in some circumstances. Without block reordering those statements will be inside the loop in the assembly output:
.L6:
movswl (%edi,%ebx,2),%ecx
movl -16(%ebp), %eax
movswl (%eax,%ebx,2),%edx
xorl %eax, %eax
cmpl $2, %edx
sete %al
decl %eax
andl %eax, %edx
xorl %eax, %eax
cmpl $2, %ecx
sete %al
decl %eax
andl %eax, %ecx
cmpl %ecx, %edx
je .L5
movl $0, %eax
setge %al
leal -1(%eax,%eax), %eax
jmp .L2
.p2align 4,,7
.L5:
incl %ebx
cmpl %esi, %ebx
jl .L6
There are two significant problems with the generated code. First many processors will predict the "je .L5" jump as not taken, when it fact it is almost always a taken branch. This can significantly harm performance on such processors since we have a mis-predicted branch each iteration of the loop. Second, the code after "je .L5" up to and including the "jmp .L2" statement pollutes the instruction cache with code that is usually not executed.
With block reordering enabled, we get the following code for the loop:
.L6:
movswl (%edi,%ebx,2),%ecx
movl -16(%ebp), %eax
movswl (%eax,%ebx,2),%edx
xorl %eax, %eax
cmpl $2, %edx
sete %al
decl %eax
andl %eax, %edx
xorl %eax, %eax
cmpl $2, %ecx
sete %al
decl %eax
andl %eax, %ecx
cmpl %ecx, %edx
jne .L14
incl %ebx
cmpl %esi, %ebx
jl .L6
As you can see, the jump to the exit code has been removed from the loop. Most processors will properly predict the "jne .L14" as not taken resulting in better performance. As a secondary benefit the loop itself is smaller which may improve instruction cache behavior.
Please send FSF & GNU inquiries & questions to gnu@gnu.org. There are also other ways to contact the FSF.
These pages are maintained by the GCC team.
For questions related to the use of GCC, please consult these web pages and the GCC manuals. If that fails, the gcc-help@gcc.gnu.org mailing list might help.Copyright (C) Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110, USA.
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.
| Last modified 2006-06-21 |
|