Alan Hargreaves' Blog

The ramblings of an Australian SaND TSC* Principal Field Technologist

Kprobes in Linux vs DTrace

An article on OSNews points to an IBM article about Kprobes on Linux.

While this is probably a step in the right direction, I still have some concerns. I would encourage the author to look at adding some more protection, i.e. always practice safe probing.

  1. I don’t see any checking for NULL pointer dereferences in the printk calls. If there is none, then a poorly written kprobe can still take out a production box. In fact, any bad piece of code could take it out.
  2. It is still rather clunky to get simple probes inserted. The article shows that a lot of work is required to get the probes in. The equivalent probe in dtrace would be:
    #!/usr/sbin/dtrace -s
    syscall::fork1:entry, syscall::forkall:entry, syscall::vfork:entry {
        printf("\n\tpid=%d kthread=0x%llx\n", pid, (long long)curthread);
        printf("\tt_state=0x%x cpu=%d\n",
            curthread->t_state, curthread->t_cpu->cpu_id);
        printf("\n\tCaller program is \"%s\"\n\n", execname);
        printf("\tUser Space stack\n");
        ustack();
        printf("\n\tKernel Space Stack\n");
        stack(10);
    }
    

    This gives us the following results:

    # ./fork.d
    dtrace: script './fork.d' matched 3 probes
    CPU     ID                    FUNCTION:NAME
    2    207                      vfork:entry
    pid=1443 kthread=0x300056a3c60
    t_state=0x4 cpu=2
    Caller program is "csh"
    User Space stack
    libc.so.1`vfork+0x20
    csh`execute+0xcbc
    csh`process+0x360
    csh`main+0xe94
    csh`_start+0x108
    Kernel Space Stack
    unix`syscall_trap32+0xcc
    

    Alternatively, knowing that in Solaris each of these three system calls calls cfork() (which you could also determine with dtrace), we could simply do:

    #!/usr/sbin/dtrace -s
    fbt::cfork:entry {
        printf("\n\tpid=%d kthread=0x%llx\n", pid, (long long)curthread);
        printf("\tt_state=0x%x cpu=%d\n",
            curthread->t_state, curthread->t_cpu->cpu_id);
        printf("\n\tCaller program is \"%s\"\n\n", execname);
        printf("\tUser Space stack\n");
        ustack();
        printf("\n\tKernel Space Stack\n");
        stack(10);
    }
    

    This would give us essentially the same output, since the calls to cfork() are made as tail calls. On an x86 box it would look something like:

    # ./fork.d
    dtrace: script './fork.d' matched 1 probe
    CPU     ID                    FUNCTION:NAME
    0   3882                      cfork:entry
    pid=669 kthread=0xffffffffd5ae6000
    t_state=0x4 cpu=0
    Caller program is "csh"
    User Space stack
    libc.so.1`vfork+0x45
    csh`execute+0x12f
    csh`process+0x24b
    csh`main+0xa25
    80580ea
    Kernel Space Stack
    unix`sys_call+0xda
    

Now there are also a couple of other nice things to consider here.

  1. No need to register and unregister the probe. If I’m not running the dtrace script, then the probe does not exist.
  2. If I want to change the query, I just edit the script.

  3. This one is actually a pretty basic probe. I can get much more complex information with very little effort. As I have already stated, it’s just a matter of modifying the script, and the probe does not exist unless I am running it.
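
As a minimal sketch of what "more complex information with very little effort" might look like, the script below replaces the printf() calls with an aggregation that counts fork-family system calls per program. It reuses the syscall probes from the first script; the aggregation name `forks` is just an illustrative choice, and the probe names assume a Solaris release that still has fork1, forkall and vfork as separate system calls.

```dtrace
#!/usr/sbin/dtrace -s
/* Sketch only: count fork-family system calls, keyed by program and syscall. */
syscall::fork1:entry, syscall::forkall:entry, syscall::vfork:entry
{
    /* "forks" is a hypothetical aggregation name chosen for this example. */
    @forks[execname, probefunc] = count();
}
```

Run it, let the system do some work, then press Ctrl-C; dtrace prints the aggregation when the script exits.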

But the most important thing to remember is that we have protection against the probe taking out the system. That means that we have no hesitation in running dtrace probes on production boxes, where outage time is measured in thousands of dollars per second (yes we have such customers).
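
To make the protection concrete, here is a small sketch modelled on the ERROR probe example in the Solaris Dynamic Tracing Guide. It deliberately dereferences a NULL pointer inside a probe action. The expectation, assuming a stock DTrace implementation, is that dtrace reports an invalid-address error for the BEGIN probe and fires the ERROR probe, and the box keeps running.

```dtrace
#!/usr/sbin/dtrace -s
/* Sketch only: a deliberately bad dereference inside a probe action. */
BEGIN
{
    *(char *)NULL;
}

/* The ERROR probe fires when a probe action hits a run-time fault. */
ERROR
{
    printf("Hit an error!");
}
```

Press Ctrl-C to end the script once the error has been reported.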

Update

Dan Price made a suggestion which tidies the script up even more, meaning that even if we change the way that we do fork(), the script will keep working. This gives us stability across kernel releases as well. To see the new script, look at the comments for this entry.


Written by Alan

August 22, 2004 at 7:19 pm

Posted in Solaris Express

3 Responses


  1. Alan, a minor modification to your idea above shows off one of the more esoteric but cool features of dtrace: documented stability. If you use ‘proc:::create’, you employ the ‘proc’ provider instead of relying on specific syscall or function semantics. That’s super-cool because it means your probe can continue to work in future releases, even if we rename ‘cfork()’ to something else or add a new fork variant. The ability to build higher-level providers is a key advantage, because instrumenting the OS becomes an API in and of itself.

    Dan Price

    August 22, 2004 at 7:55 pm

  2. Good point Dan, this would make the script:

    #!/usr/sbin/dtrace -s
    proc:::create {
        printf("\n\tpid=%d kthread=0x%llx\n", pid, (long long)curthread);
        printf("\tt_state=0x%x cpu=%d\n",
            curthread->t_state, curthread->t_cpu->cpu_id);
        printf("\n\tCaller program is \"%s\"\n\n", execname);
        printf("\tUser Space stack\n");
        ustack();
        printf("\n\tKernel Space Stack\n");
        stack(10);
    }
    

    And the output would remain similar …

    # ./fork.d
    dtrace: script './fork.d' matched 1 probe
    CPU     ID                    FUNCTION:NAME
    2  14916                     cfork:create
    pid=1443 kthread=0x300056a3c60
    t_state=0x4 cpu=2
    Caller program is "csh"
    User Space stack
    libc.so.1`vfork+0x20
    csh`execute+0xcbc
    csh`process+0x360
    csh`main+0xe94
    csh`_start+0x108
    Kernel Space Stack
    genunix`cfork+0x78c
    unix`syscall_trap32+0xcc
    

    Alan Hargreaves

    August 22, 2004 at 8:13 pm

  3. "But the most important thing to remember is that we have protection against the probe taking out the system. That means that we have no hesitation in running dtrace probes on production boxes, where outage time is measured in thousands of dollars per second (yes we have such customers)."
    This is critically important for being considered a safe, enterprise-ready product. It is no accident that DTrace has this property, as it is intended to be useful and safe on production systems.

    Richard Elling

    August 23, 2004 at 9:03 am


