Draft 1:

How to use the oops output on an Alpha running linux in kernel debugging.

One of the major challanges in kernel debugging is working out what you are being told in an oops trace. The oops trace is the last gasp of the system to try and tell you something meaningful about it's internal execution state before a serious execution error occured, most commonly a NULL pointer dereferance, normally caused by functions using incompletely filled structures, or sometimes invalid kernel paging requests.
 

What do oops look like?

ibmcam.c: IBM USB camera found (model 3, rev. 0x0301)
ibmcam.c: IBM camera: using 320x240
usbvideo.c: ibmcam on /dev/video0: canvas=320x240 videosize=320x240
Unable to handle kernel paging request at virtual address 1ffffc0000d4771c
CPU 0 vidcat(871): Oops 0
pc = [<fffffc0000990554>]  ra = [<fffffffc0030eab8>]  ps = 0007
v0 = 0000000000000000  t0 = 1ffffc0000d4771c  t1 = ffffffff7fffffff
t2 = fffffc0000d47720  t3 = ffffffffffffffff  t4 = 0000000000000000
t5 = 0000000000000000  t6 = fffffc0000d47720  t7 = fffffc001cf60000
s0 = fffffc0000abd950  s1 = fffffc000064a740  s2 = fffffc000064a750
s3 = 0000000000000007  s4 = 0000000000000000  s5 = fffffc001cf63d58
s6 = 000000000000001e
a0 = 0000000000000007  a1 = 0000000000000007  a2 = fffffc001cf63d58
a3 = 0000000000000001  a4 = 0000000000000001  a5 = 0000000000000080
t8 = fffffc0000d47700  t9 = fffffc0000d47700  t10= 0000000000000000
t11= 0000000000000080  pv = fffffc0000990320  at = 0000000000000000
gp = fffffc0000cd6928  sp = fffffc001cf63cb8
Code: 205f0001  lda t1,1(zero)
 48410722  sll t1,t0,t1
 20630020  lda t2,32(t2)
 4880b681  srl t3,5,t0
 47e20502  ornot zero,t1,t1
 40230441  s4addq t0,t2,t0
*a8a10000  ldl_l t4,0(t0)
 44a20005  and t4,t1,t4

Trace:fffffc00008627ec fffffc0000873508 fffffc0000860c14 fffffc0000860af0 fffffc0000861040 fffffc0000810c20

At first glance you'd be a bit overwhelmed by all the registers and dissassembly and stack trace, however, what you're looking at is an internal format of the state of play when the exception in the kernel occured. Whats NOT so obvious from this is that it's actually telling you what line number in the kernel code that all hell broke loose and unfortunately that takes a bit more effort to dig out.

Fortunately a lot of the work can be facilitated by ksymoops.
ie,
    dmesg | ksymoops
yields
Unable to handle kernel paging request at virtual address 1ffffc0000d4771c
CPU 0 vidcat(871): Oops 0
pc = [<fffffc0000990554>]  ra = [<fffffffc0030eab8>]  ps = 0007
Using defaults from ksymoops -t elf64-alpha -a alpha
v0 = 0000000000000000  t0 = 1ffffc0000d4771c  t1 = ffffffff7fffffff
t2 = fffffc0000d47720  t3 = ffffffffffffffff  t4 = 0000000000000000
t5 = 0000000000000000  t6 = fffffc0000d47720  t7 = fffffc001cf60000
s0 = fffffc0000abd950  s1 = fffffc000064a740  s2 = fffffc000064a750
s3 = 0000000000000007  s4 = 0000000000000000  s5 = fffffc001cf63d58
s6 = 000000000000001e
a0 = 0000000000000007  a1 = 0000000000000007  a2 = fffffc001cf63d58
a3 = 0000000000000001  a4 = 0000000000000001  a5 = 0000000000000080
t8 = fffffc0000d47700  t9 = fffffc0000d47700  t10= 0000000000000000
t11= 0000000000000080  pv = fffffc0000990320  at = 0000000000000000
gp = fffffc0000cd6928  sp = fffffc001cf63cb8
Code: 205f0001  lda t1,1(zero)
 48410722  sll t1,t0,t1
 20630020  lda t2,32(t2)
 4880b681  srl t3,5,t0
 47e20502  ornot zero,t1,t1
 40230441  s4addq t0,t2,t0
 44a20005  and t4,t1,t4
Trace:fffffc00008627ec fffffc0000873508 fffffc0000860c14 fffffc0000860af0 fffffc0000861040 fffffc0000810c20
Warning (Oops_read): Code line not seen, dumping what data is available

>>PC;  fffffc0000990554 <pci_pool_alloc+234/2d0>   <=====
Trace; fffffc00008627ec <chrdev_open+9c/f0>
Trace; fffffc0000873508 <get_write_access+18/80>
Trace; fffffc0000860c14 <dentry_open+104/210>
Trace; fffffc0000860af0 <filp_open+70/90>
Trace; fffffc0000861040 <sys_open+70/1a0>
Trace; fffffc0000810c20 <entSys+a8/c0>
 

The important differance here being that the "Trace:" has been matched up to the symbol table for the kernel to provide a map of the fuctions that were called/in play leading up to the kernel exception. The two numbers seen appended after the function are the offset of how far into the function the address is and the second number gives an indication of how large the fuction is (all in bytes). Unfortunately these numbers by themselves are not particulary useful.

In the original alpha oops you will notice that there was some disassembly

Code: 205f0001  lda t1,1(zero)
 48410722  sll t1,t0,t1
 20630020  lda t2,32(t2)
 4880b681  srl t3,5,t0
 47e20502  ornot zero,t1,t1
 40230441  s4addq t0,t2,t0
*a8a10000  ldl_l t4,0(t0)
 44a20005  and t4,t1,t4
 
Ksymoops unfortunately strips out the one line with the "*" in it which is the point that the exception occurs. Now then, believe it or not this assembly dump will provide you the actual code line that fell apart in the kernel. To do that, you'll have to have the kernel src built.

Oh, a bit of prepreperation, it's a good idea to have a current TAGS file around.
    find . -type f -name "*[ch]" | etags -

The tags file can be used with emacs or vi to allow you to quickly navigate around the kernel src tree. In this case we use the tags file to find out that pci_pool_alloc is to be found in the file drivers/pci/pci.c

Now a bit of magic has to occur. We have to rebuild drivers/pci/pci.c but in such a way that it leaves the assembly file so that we can look at it.

In this case that gcc command would be
gcc -D__KERNEL__ -I/usr/src/linux-2.4.3/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -fno-common -Wno-unused -pipe -mno-fp-regs -ffixed-8 -mcpu=ev5 -Wa,-mev6    -DEXPORT_SYMTAB -gstabs+ -S pci.c

(you can get a feel for what would be required by changing directory to drivers/pci , rm -f pci.o and then type in make -C ../..  SUBDIRS=`pwd` and when the gcc compile line pops up, press crtl-c, and cut and paste the compile line, replaceing the -c option with -gstabs+ -S. This rebuilds the assembly file with stabs (basically a means for us to work out what line in the C file is being executed/compiled without making differing assembly from when the kernel was built)

In this example you will be left with pci.s which is the assembly file used to create the object code. This file is intresting because it also has burried in it the line numbers of  the C statements being coded

Now, returning to the originally assembly,

Code: 205f0001  lda t1,1(zero)
 48410722  sll t1,t0,t1
 20630020  lda t2,32(t2)
 4880b681  srl t3,5,t0
 47e20502  ornot zero,t1,t1
 40230441  s4addq t0,t2,t0
*a8a10000  ldl_l t4,0(t0)
 44a20005  and t4,t1,t4
There is a slight problem here in that the code is being displayed in Tru64 assembly format as opposed to GNU assembly format. Don't worry, it's quite simple to translate.

If you look at /usr/include/alpha/regdef.h you will notice that the registers (t0-7,s0-6 etc) are all represented as $N. These are the only bits that you need to translate. so our code fragment becomes

Code: 205f0001  lda $2,1(zero)
 48410722  sll $2,$1,$2
 20630020  lda $3,32($3)
 4880b681  srl $4,5,$1
 47e20502  ornot $31,$2,$2
 40230441  s4addq $1,$3,$1
*a8a10000  ldl_l $5,0($1)
 44a20005  and $5,$2,$5
Given that the exception was in ldl_l $5,0($1) we can look for this code fragment in pci.s
.stabs "/usr/src/linux-2.4.3/include/asm/bitops.h",132,0,0,$Ltext134
$Ltext134:
$LM1157:
        .stabn 68,0,88,$LM1157
        addq $0,$22,$3
$LM1158:
        .stabn 68,0,110,$LM1158
        and $4,31,$1
        lda $2,1
        sll $2,$1,$2
$LM1159:
        .stabn 68,0,88,$LM1159
        lda $3,32($3)
$LM1160:
        .stabn 68,0,93,$LM1160
        srl $4,5,$1
$LM1161:
        .stabn 68,0,110,$LM1161
        ornot $31,$2,$2
$LM1162:
        .stabn 68,0,93,$LM1162
        s4addq $1,$3,$1
$LM1163:
        .stabn 68,0,110,$LM1163
        1:      ldl_l $5,0($1)
        and $5,$2,$5
        stl_c $5,0($1)
        beq $5,2f
NOTE: It is IMPORTANT to get the preceeding .stabs line as that refers to the source code file being assembled.

The important line now becomes .stabn 68,0,110,$LM1163 as this has the all important line number encoded in it (the third number,.. 110). Don't get thrown by the way that the assembly seems to be compiling code out of order,  (line 88, 93,110, 93,100..) this is simply the gcc optimizer at work.

If I now fire up emacs on /usr/src/linux-2.4.3/include/asm/bitops.h and look at line 110 I can find that the above problem is in this routine
 

extern __inline__ void clear_bit(unsigned long nr, volatile void * addr)
{
#ifndef BITOPS_NO_BRANCH
 unsigned long oldbit;
#endif
 unsigned long temp;
 unsigned int * m = ((unsigned int *) addr) + (nr >> 5); /*** This is line 93 ***/

#ifndef BITOPS_NO_BRANCH
 __asm__ __volatile__(
 "1: ldl_l %0,%4\n"
 " and %0,%3,%2\n"
 " beq %2,2f\n"
 " xor %0,%3,%0\n"
 " stl_c %0,%1\n"
 " beq %0,3f\n"
 "2:\n"
 ".subsection 2\n"
 "3: br 1b\n"
 ".previous"
 :"=&r" (temp), "=m" (*m), "=&r" (oldbit)
 :"Ir" (1UL << (nr & 31)), "m" (*m));
#else
 __asm__ __volatile__(  /*** This is line 110 ***/
 "1: ldl_l %0,%3\n"
 " and %0,%2,%0\n"
 " stl_c %0,%1\n"
 " beq %0,2f\n"
 ".subsection 2\n"
 "2: br 1b\n"
 ".previous"
 :"=&r" (temp), "=m" (*m)
 :"Ir" (~(1UL << (nr & 31))), "m" (*m));
#endif
}
 

So, in this particular example, we've chased this oops down to a possibly faulty piece of hand assembler in the bitops.h header file. But regardless, this is the point that it all goes horribly wrong. It may not be that the routine is wrong, it may simply be that the routine is being passed bad variables.

Importantly we can now tell what lines of pci_pool_alloc() may have failed because there are only two instances of the call in the routine... and there you have it.