Performance tuning is an important part of program development once the program runs correctly. Ideally, tools would report exactly how much time is spent on each instruction or in each procedure of a program. The time attributed to a procedure may or may not include the time spent in the procedures it calls.
The first problem is obtaining the timing information. Measuring the exact time spent on each instruction is not possible. Instead, statistical sampling is frequently used: at regular intervals, the program is interrupted and the stack is examined to determine which procedures are active. The number of times a procedure appears on top of the stack when the program is interrupted is a good indication of the CPU time spent in that procedure. Often, the number of times each procedure is called is also recorded; for this purpose, the compiler inserts instructions that count the calls.
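The sampling idea can be illustrated with a minimal Python sketch. It is not how any particular profiler is implemented: instead of a timer interrupt, it assumes a helper loop that periodically inspects a worker thread's stack through sys._current_frames(), a CPython-specific call. The names sample_stacks and top_of_stack_counts are hypothetical, chosen for this illustration only.

```python
import collections
import sys
import threading
import time


def sample_stacks(target, interval=0.005):
    """Run target() in a worker thread and sample its call stack.

    Assumption: periodic polling from the calling thread stands in for
    the timer interrupt a real profiler would use.
    Returns a list of stacks, each a list of procedure names ordered
    from the outermost caller to the top of the stack.
    """
    samples = []
    worker = threading.Thread(target=target)
    worker.start()
    while worker.is_alive():
        # CPython-specific: maps thread ids to their current frames.
        frame = sys._current_frames().get(worker.ident)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            stack.reverse()  # outermost caller first
            samples.append(stack)
        time.sleep(interval)
    worker.join()
    return samples


def top_of_stack_counts(samples):
    """Count how often each procedure was on top of the stack."""
    return collections.Counter(stack[-1] for stack in samples)
```

A procedure that dominates the resulting counter is, statistically, where most of the CPU time went. The counts are only estimates, and a shorter interval gives finer resolution at the cost of more sampling overhead.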
The samples form a list of call stacks that can be merged into a call tree. Each node in the tree is annotated with the number of times it appeared on top of the stack. This tree can be huge, since each procedure (child) may be called from several others (parents) and may even be called recursively. From the tree, the following information may be extracted for each procedure and presented to the user:
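As a rough sketch of the merging step, the Python code below builds a call tree from sampled stacks and derives two per-procedure figures. The choice of figures is an assumption made for illustration: samples with the procedure itself on top of the stack (self), and samples with the procedure anywhere on the stack (inclusive of nested procedures). Node, build_call_tree and summarize are hypothetical names.

```python
import collections
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    self_samples: int = 0  # times this node was on top of the stack
    children: dict = field(default_factory=dict)

    def total_samples(self):
        """Samples in this node plus all nodes beneath it."""
        return self.self_samples + sum(
            child.total_samples() for child in self.children.values()
        )


def build_call_tree(stacks):
    """Merge stacks (outermost caller first) into a single call tree."""
    root = Node("<root>")
    for stack in stacks:
        node = root
        for name in stack:
            node = node.children.setdefault(name, Node(name))
        node.self_samples += 1  # the last name was on top of the stack
    return root


def summarize(root):
    """Return (self_counts, inclusive_counts) keyed by procedure name."""
    self_counts = collections.Counter()
    inclusive = collections.Counter()

    def visit(node, active):
        self_counts[node.name] += node.self_samples
        # Skip recursive re-entries so a sample is not counted twice
        # for a procedure that already appears higher in the same path.
        if node.name not in active:
            inclusive[node.name] += node.total_samples()
        for child in node.children.values():
            visit(child, active | {node.name})

    for child in root.children.values():
        visit(child, frozenset())
    return self_counts, inclusive
```

The same procedure can appear at several places in the tree, once per distinct calling path, which is why the summary aggregates by name rather than by node.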
This information should help determine which sections of a program need to be modified to improve performance. It may not be sufficient, however, for a number of reasons:
This is particularly important because, as faster processors appear, the bottleneck is rapidly moving from the processor to the memory hierarchy and disks. Fortunately, other tools can help gather statistics about cache memory, virtual memory, and disk accesses.