Alan Hargreaves' Blog

The ramblings of an Australian SaND TSC* Principal Field Technologist

When looking at crashdumps, never trust the source!

When I did my first advanced crashdump analysis course a number of years ago, the instructor (George Hines) kept harping on one thing.

“Never trust the source.”

What he meant is that the code you see in the crashdump is not necessarily the code you are going to see in the source. There could be many reasons for this. I had this hammered home to me (again) just now.

We had a customer who had an XVR-100 graphics card in a Sun Fire 280R running Solaris 9 with the latest kernel update and XVR-100 patch. This machine was consistently panicking whenever the screen power management kicked in. The crashdumps showed that an address that should have contained a pointer to a stack variable (from further up the stack) contained rubbish. When that pointer was dereferenced, the machine panicked. Reproducible 100% at the customer site.

After a few days of attempting to replicate this in our local lab and trying to follow code paths to see how this could have happened, I decided to revisit the original crashdump. I went looking through pm_ioctl() to find any other reference to &devl, just to check that I was looking in the right place. For this I went to the source and found that the following code fragment appears twice in pm_ioctl().

mutex_enter(...);
pm_enqueue_notify(...);
pm_enqueue_notify_others(&devl, ...);
mutex_exit(...);

OK, so let’s go looking for any calls to pm_enqueue_notify_others(). As the first argument, the value I’m interested in will get put into %o0. Hmmm, I can find the calls to pm_enqueue_notify(), but not pm_enqueue_notify_others(). The assembly looks like:

pm:pm_ioctl+0x7dc:         call       genunix:pm_enqueue_notify
pm:pm_ioctl+0x7e0:         or %g0,  0x0, %o5  ( mov   0x0, %o5 )
pm:pm_ioctl+0x7e4:         call       unix:mutex_exit
pm:pm_ioctl+0x7e8:         or %g0, %l0, %o0   ( mov   %l0, %o0 )

That set the alarm bells ringing. Checking the revision of the module in the crashdump, we see:

id flags        modctl      textaddr     size cnt name
125 LI    0x30003cf0e28    0x780f4000   0x5466   1 pm (power management driver v1.101)

But the source I was reading was v1.104. Now, v1.104 should have been installed with patch 112233-12, which is what the crashdump reported the kernel as running. v1.101 was in 112233-11. It looks like 112233-12 may not have completed installing. To verify this I put the v1.101 binary onto my lab box and ran “xset dpms force off”.

Bingo!

I got a panic identical to what the customer was seeing.
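The version check that gave the game away is easy to script. Here is a rough Python sketch, not anything we actually ran: it pulls the module name and version out of one line of the module listing and compares it with what the patch level should have delivered. The patch-to-version table contains only the two revisions mentioned above, and the names `module_version` and `patch_matches_module` are made up for illustration.

```python
import re

# Patch revision -> pm driver version, taken from the two revisions
# mentioned in this post. Anything else would have to be looked up.
EXPECTED_PM_VERSION = {
    "112233-11": "1.101",
    "112233-12": "1.104",
}

# Matches the tail of a module listing line, for example:
#   "... 1 pm (power management driver v1.101)"
MODULE_LINE = re.compile(
    r"\s(?P<name>\S+)\s+\(.*\bv(?P<version>\d+(?:\.\d+)+)\)\s*$"
)


def module_version(line):
    """Return (name, version) parsed from one listing line, or None."""
    match = MODULE_LINE.search(line)
    if match is None:
        return None
    return match.group("name"), match.group("version")


def patch_matches_module(patch, line):
    """True if the loaded module is the version the patch should deliver."""
    parsed = module_version(line)
    if parsed is None or patch not in EXPECTED_PM_VERSION:
        raise ValueError("cannot determine versions to compare")
    return parsed[1] == EXPECTED_PM_VERSION[patch]


line = ("125 LI    0x30003cf0e28    0x780f4000   0x5466   1 "
        "pm (power management driver v1.101)")

print(module_version(line))                     # ('pm', '1.101')
print(patch_matches_module("112233-12", line))  # False: patch and module disagree
```

A `False` here is exactly the mismatch described above: the system claims patch 112233-12, but the loaded pm module is the one from 112233-11.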

The fix should be to simply back out 112233-12 and re-apply it.

The moral? The assembly from the crashdump is what the system ran. The source code is a guide. Trust the crashdump first. Thanks, George.
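The same idea can be turned into a quick sanity check: list the functions the source says should be called, and see which of them never show up as call targets in the disassembly. The Python below is only a sketch under assumptions: the regular expression is written for the disassembly format shown earlier in this post, and `called_functions` and `missing_calls` are hypothetical helper names. A missing call is a hint rather than proof, since a compiler can inline a function or turn a call into a tail call.

```python
import re

# Matches "call  module:function" (module prefix optional), as in
#   "pm:pm_ioctl+0x7dc:         call       genunix:pm_enqueue_notify"
CALL = re.compile(r"\bcall\s+(?:\w+:)?(?P<target>\w+)")


def called_functions(disassembly):
    """Return the set of function names that appear as call targets."""
    targets = set()
    for line in disassembly.splitlines():
        match = CALL.search(line)
        if match:
            targets.add(match.group("target"))
    return targets


def missing_calls(expected, disassembly):
    """Return the expected calls that are absent from the disassembly."""
    present = called_functions(disassembly)
    return [name for name in expected if name not in present]


# The excerpt from the crashdump shown earlier in the post.
DISASSEMBLY = """\
pm:pm_ioctl+0x7dc:         call       genunix:pm_enqueue_notify
pm:pm_ioctl+0x7e0:         or %g0,  0x0, %o5  ( mov   0x0, %o5 )
pm:pm_ioctl+0x7e4:         call       unix:mutex_exit
pm:pm_ioctl+0x7e8:         or %g0, %l0, %o0   ( mov   %l0, %o0 )
"""

# Calls the source fragment makes that should fall inside this excerpt.
EXPECTED = ["pm_enqueue_notify", "pm_enqueue_notify_others", "mutex_exit"]

print(missing_calls(EXPECTED, DISASSEMBLY))  # ['pm_enqueue_notify_others']
```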

Written by Alan

June 29, 2004 at 11:02 pm

Posted in Solaris
