Alan Hargreaves' Blog

The ramblings of a former Australian SaND TSC* Principal Field Technologist

Dtrace should be used just like any other tool

I believe that we will have succeeded with dtrace when all kinds of folks, from sysadmins to developers start usign dtrace as a normal part of their troubleshooting toolkit.

As a simple example, today I noticed my workstation running with a load average of just above 1. Not unusual, except I really wan’t having the machine do anything at the time. Unfortunately, I haven’t the output from the commands as I addressed the problem this morning and it’s way out of my scroll buffer now. I’m actually pulling the commands back out of my shell history.

Maybe we’ve got something fork/execing a lot (that won’t show in prstat). OK, let’s pull out dtrace and have a look at what’s going on …

# dtrace -n 'proc::exec_common:exec-success {trace(arg0);}'

No, that was pretty quiet. OK, let’s look at what is getting scheduled.

# dtrace -n 'sched::resume:on-cpu {printf("%s: %d\n", execname, pid);}'

OK, this is scrolling past relatively quickly, so it appears that we’ve got a bit of contect switching going on and pretty much always got something ready to run. This probe will print a line for every time we put something onto proc. How about using an aggregation to tidy things up a bit?

# dtrace -n 'sched::resume:on-cpu {@[execname,pid] = count();}'

Hmmmm, much nicer, I can see that sched and one particular instance of mozilla (I have to run a few in different profiles) is finding itself put onto proc much more often than anything else. The mozilla in question is running with pid 20051. Running pargs 20051 tells me a bit more about which instance it is. It’s the one that I use generally and do my email with. At this point I could probably have killed off and restarted the mozilla and the problem would have gone away, but I was interested in what it was doing.

OK, so what are the system calls?

# dtrace -n 'syscall:::entry /pid == 20051/ { @[probefunc] = count();}

This showed me that we were spread pretty equally around an ioctl on fd3, a read on fd5 and a write on fd6. Running pfiles 20051 tells us that these file descriptors are pipes, and very little else.

If we were interested we could also aggregate on the user level stacks to each of the calls, like this.

# dtrace -n 'syscall::ioctl:entry /pid == 20051/ { @[ustack()] = count();}'
# dtrace -n 'syscall::read:entry /pid == 20051/ { @[ustack()] = count();}'
# dtrace -n 'syscall::write:entry /pid == 20051/ { @[ustack()] = count();}'

One interesting there here was that for each of the above probes, there was only one user stack. I suppose that shouldn’t be surprising given that they are always calling with the same file descriptor.

It was also interesting that the other instance I had running at the time was not doing anything even vaguely similar. There are two main differences between them.

  • My general instance also does my email
  • My non-general instance only talks to our escalation tracking system, so it’s only really one website

At this point I killed and restarted the mozilla, ran some of the commands before and after starting my mail. The load on the box was much lighter, but mail didn’t make a difference. I can only conclude that a site I must have visited left something playing up in mozilla. Oh well, won’t be the last time.

The point of me blogging this is to show just how easy it is to start getting to the bottom of problems that would be otherwise very difficult to observe. That’s the whole idea of dtrace. It’s a tool to let you observe the system and it can be used just as easily as you would use the others that I interspersed here (eg pfile, prstat and pargs).

As I have been saying to Jacob before every soccer match, “Get in and have a go”. It’s the only way to really start to appreciate the usefulness.


Written by Alan

August 26, 2004 at 4:55 am

Posted in Solaris Express

3 Responses

Subscribe to comments with RSS.

  1. Nice example — exactly how I use dtrace day-to-day.

    As a comment, <tt>sched:::on-cpu</tt> would be better than <tt>sched:::resume:on-cpu</tt> — it’s generally a bad idea to include function names for the stable providers (like proc and sched) — it’s less typing, and the function name is Private and can change pretty much arbitrarily.

    Jonathan Adams

    August 26, 2004 at 5:47 am

  2. Alan: Since software postings on b.s.c. seem to be unfortunately bimodal (Java XOR Solaris 10), I’d love to see an example of using dtrace to solve a problem with a Java application. Interested?

    Geoff Arnold

    August 26, 2004 at 12:22 pm

  3. Java support is still somewhat limited due to the nature of the language and trying to analyse bytecode, but I belive there is support for ustack() .
    On the other hand what kind of example are you thinking off?
    – Fintan

    Fintan Ryan

    August 27, 2004 at 6:46 am

Comments are closed.

%d bloggers like this: