Alan Hargreaves' Blog

The ramblings of an Australian SaND TSC* Principal Field Technologist

I have a performance problem

So start 95% of the performance calls that I receive. They usually continue something like:

I have gathered some *stat data for you (eg the guds tool from Document 1285485.1), can you please root cause our problem?

So, do you think you could?

Neither can I. Based on this, my answer inevitably has to be “No”.

Given this kind of problem statement, I have no idea about the expectations, the boundary conditions, or even the application. The answer may as well be “Performance problems? Consult your local Doctor for Viagra”. It’s really not a lot to go on.

So, what kind of problem description is going to allow me to start work on the issue that is being seen? I don’t doubt that there really is an issue; it just needs to be pinned down somewhat.

What behavior exactly are you expecting to see?

Be specific and use business metrics, for example “run-time”, “response-time” and “throughput”.

This helps us define exit criteria.

Now, let’s look at the system that is having problems.

How is what you are seeing different? Use the same type of metrics.

The answers to these two questions take us a long way towards being able to work a call.
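As a rough illustration of what “the same type of metrics” can look like once written down, here is a minimal Python sketch. The class name, field names and numbers are all hypothetical, made up for this example; the point is only that each metric carries an expected value and an observed value in the same unit, which is what gives us an exit criterion.

```python
from dataclasses import dataclass


@dataclass
class MetricStatement:
    """One business metric, stated as expected versus observed (hypothetical structure)."""
    name: str                     # e.g. "run-time", "response-time", "throughput"
    unit: str                     # unit shared by both values
    expected: float               # what you are expecting to see
    observed: float               # what you are actually seeing
    lower_is_better: bool = True  # set to False for throughput-style metrics

    def meets_expectation(self) -> bool:
        # The exit criterion: observed is at least as good as expected.
        if self.lower_is_better:
            return self.observed <= self.expected
        return self.observed >= self.expected

    def describe(self) -> str:
        status = "met" if self.meets_expectation() else "NOT met"
        return (f"{self.name}: expected {self.expected} {self.unit}, "
                f"observed {self.observed} {self.unit} ({status})")


# Hypothetical numbers, for illustration only.
batch = MetricStatement("run-time", "min", expected=45.0, observed=120.0)
orders = MetricStatement("throughput", "orders/s", expected=200.0,
                         observed=210.0, lower_is_better=False)

for metric in (batch, orders):
    print(metric.describe())
```

A statement like “the nightly batch should finish in 45 minutes and is taking 120” tells me far more than a directory full of *stat output does.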

Even more helpful are answers to questions like:

Has this system ever worked to expectation?

If so, when did it start exhibiting this behavior?

Is the problem always present, or does it sometimes work to expectation?

If it sometimes works to expectation, when are you seeing the problem? Is there any discernible pattern?

Is the impact of the problem getting better, worse, or remaining constant?

What kind of differences are there between when the system was performing to expectation and when it is not?

Are there other machines where we could expect to see the same issue (e.g. similar usage and load), but do not? Again, what are the differences?

Once we start to gather information like this, we start to build up a much clearer picture of exactly what we need to investigate, and what we need to achieve so that both you and I agree that the problem has been solved.

Please help get that figure of poorly defined problem statements down from its current 95% value.

Written by Alan

June 27, 2011 at 6:59 pm

Posted in Solaris, Work

5 Responses


  1. I concur. How do we know it’s Viagra and not Cialis that will help? Long live SGR!

    Sarah Boyd Turner

    June 28, 2011 at 6:45 am

  2. I know this is a bit off topic, but do you know why Oracle documents all have dot one on the end?
    eg 1285485.1
    Seems like the .1 is redundant?

    Paul

    June 28, 2011 at 8:11 pm

    • Sorry I don’t know. I would have thought perhaps versioning, but I’ve seen no evidence of that.

      Alan

      June 29, 2011 at 2:25 pm

  3. Good post! Another question I like to ask when handed a metric:

    How do you think this metric is affecting the workload?

    Many metrics can look odd, and not matter so much (asynchronous). Some can look related to workload pain, but are symptoms and not the cause. I try to convert everything into a synchronous component of workload latency to figure out how much they really matter.

    Brendan Gregg

    July 13, 2011 at 1:48 pm

