Alan Hargreaves' Blog

The ramblings of an Australian SaND TSC* Principal Field Technologist

A “Performance” issue on a T2000

For the last week or so, I’ve been troubleshooting a performance
issue on a T2000. Not quite there yet.

Background

The box that went out to the customer was a beta version of the T2000
that only had four cores (i.e. 16 virtual cpus). It was also running
an earlier release of the kernel (Solaris 10 hw 2 build 3) than what
currently ships on the T2000s (Solaris 10 hw 2 build 5), and that earlier
build had some nice little gotchas in it (the ipge hang-on-write problem
and another that would stop DTrace from working).

The customer here was running an MQ Series (v5.3) test and was interested
in the number of packets/second that could be processed. The baseline
was that they could do 4000/second on a v440 and expected to be able to
do 16000 on a T2000. The problem was that they were only seeing about
2000.

So, what happened?

First off I tried to address the DTrace issue by bringing the beta box
up to KU-20. While installing the patch, I noticed that a number of packages
were missing, and thus not patched. When we tried to reboot this box, it
complained about missing files and refused to boot.

OK, I thought, I had asked these guys to make a flash archive of the full
box before I started playing with it, so we’d just re-install.

It appears that the people working with this box (which is in another
country from me) did not have installation media. OK, I pointed them to
where they could get CD images of the version currently shipping on
T2000s, which they could use to bootstrap the flash image that they
had taken.

Guess what? There is a known issue with booting this version on those
beta boxes, which is addressed by adding a few lines to /etc/system.
Unfortunately, last time I looked, most installation media is read only.

It also turns out that the “flash archive” that was taken was actually a
pair of ufsdump images of root and /var.

By this time I had gained access to a released T2000 in the
Sydney lab. Fortunately it had a second disk in it that I could drop the
ufsdump images onto, and after a bit of fiddling (mainly getting the IP
address right and fixing vfstab to point at the correct disk) I was able
to get it up and running locally. *phew*
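
The restore itself was nothing exotic; a rough sketch only (the disk, slice and dump file names below are made up, and the var dump gets the same treatment):

# newfs /dev/rdsk/c1t1d0s0
# mount /dev/dsk/c1t1d0s0 /a
# cd /a; ufsrestore rf /dumps/root.ufsdump
# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c1t1d0s0

After that it is just a matter of editing /a/etc/vfstab so that / and /var point at the right slices before booting off the new disk.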

I applied a kludge to enable DTrace to work (commenting out some lines
in sched.d) and ran up the server and the client.

mpstat shows a pretty idle box (95-98% idle). iostat shows very
little disk activity. Time to crank out DTrace.

First off, who is doing read(2) system calls?

# dtrace -q -n 'syscall::read:entry { @[execname] = count();}
tick-10s { printa(@); clear(@); }'
nscd                                                              1
fmd                                                               3
java                                                          12284
amqrmppa                                                      24566
fmd                                                               0
nscd                                                              2
nfsmapid                                                          2
java                                                          11938
amqrmppa                                                      23869
nscd                                                              0
nfsmapid                                                          0
fmd                                                               0
ttymon                                                            2
sac                                                               2
java                                                          12306
amqrmppa                                                      24611

OK, we have java and amqrmppa. The client is the java process, so we’ll
leave that alone as we’re interested in the server. Let’s have a look at
the number of reads per second that each thread of this process is
doing.

# dtrace -q -n 'syscall::read:entry
/execname == "amqrmppa"/ { @[tid] = count();}
tick-10s { normalize(@,10); printa(@); clear(@); }'
5             2563
5             2462
5             2557

There are two things of interest here.

  1. We are seeing around 2500 reads per second, which gels with the customer
    seeing about 2000 messages/second. This is probably a pretty good gauge
    of messages/second.
  2. Only one thread is doing any of the reading. The server is running
    single threaded!

Running single threaded might be good on a box that has a small number
of very fast cpus, but is about the worst possible thing that you could
do on a T2000.
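
Another quick way to see this, without DTrace, is prstat’s per-thread microstate view (the pid below is made up; substitute the real pid of amqrmppa):

# prstat -mL -p 12345 5

With -m and -L, prstat prints one line per LWP, and it tells the same story as the DTrace output above: only one LWP in the process is doing any appreciable work.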

Out of interest, let’s see what stacks are doing the reads, just to make
sure we are in the right place.

# dtrace -q -n 'syscall::read:entry
/execname == "amqrmppa"/ { @[ustack(20)] = count();}
tick-10s { normalize(@,10); printa(@); clear(@); }'
libc.so.1`_read+0x8
amqcctca`cciTcpReceive+0xc24
libmqmr.so`ccxReceive+0x1d0
libmqmr.so`rriMQIServer+0x2f4
libmqmr.so`rrxResponder+0x52c
libmqmr.so`ccxResponder+0x14c
libmqmr.so`cciResponderThread+0xac
libmqmcs.so`ThreadMain+0x890
libc.so.1`_lwp_start
2534

Which kind of looks like we have a server receiving packets.

One other thing that I noticed is that each time I killed and
restarted the client, it looks like it attaches to a new single
thread in the server.
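
One way to watch this happening (a sketch, assuming the proc provider’s lwp-create probe behaves here as it does on stock Solaris 10) is to leave a tracer running while bouncing the client:

# dtrace -q -n 'proc:::lwp-create
/execname == "amqrmppa"/
{ printf("%Y  pid %d created lwp %d\n", walltimestamp, pid, args[0]->pr_lwpid); }'

Each client restart should show a new LWP appearing in the server, which then becomes the one doing all of the reads.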

I spent quite some time going through the mqm trees and Google,
but so far I have been unable to come up with a way to increase
the number of threads in the server.

For all intents and purposes we are running single threaded. If
we can increase the number of server threads to match the
platform, then I would expect to see an incredible increase in
the number of packets/second processed.

If any of my readers have any suggestions on how to increase the number
of server threads, I’d love to hear from you. MQ Series is not something
that I’ve spent a lot of time with.

An Aside

I should mention one other thing that is incredibly useful if you happen
to have a machine running a relatively current Nevada or OpenSolaris build.

As Bryan mentioned when he addressed
SOSUG in Sydney, the output of the dtrace command, when given both the
-l and -v options, has been enhanced to also show the types of the probe
arguments. I used this a bit while looking at other things to get
a feel for the system; it saved me having to dig up the reference manual.
For example:

$ dtrace -v -l -n io:::wait-start
   ID   PROVIDER            MODULE                          FUNCTION NAME
  514         io           genunix                           biowait wait-start

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: bufinfo_t *
                args[1]: devinfo_t *
                args[2]: fileinfo_t *

Update

update 1

I suspect that what we are seeing here is a client that does a connect and then spawns all of its threads. It looks like the way the server works is that we get one thread per connection.

I’m currently looking at a way to verify this suspicion.
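
One approach (a sketch only; it assumes the inbound accept(2) is visible to the syscall provider and that it is amqrmppa itself, rather than the MQ listener, doing the accepting) is to line accepts up against which tids subsequently do the reads:

# dtrace -q -n '
syscall::accept:return
{ printf("%Y  %s (pid %d) accepted a connection on tid %d\n",
        walltimestamp, execname, pid, tid); }
syscall::read:entry
/execname == "amqrmppa"/
{ @reads[tid] = count(); }
tick-10s
{ printa("  tid %d: %@d reads\n", @reads); trunc(@reads); }'

If the suspicion is right, each new client connection should map onto exactly one server tid doing all of the subsequent reads.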

update 2

Just for kicks, I modified MQSender.properties back to 1 client thread and
then proceeded to start up 8 instances of the client.

This looks much better: we are tending around just under
16000/second on the server side. What is peculiar is that
about every 38 seconds we see the count drop to 0 for about 4-6 seconds. At this
time we see idle jump to 100% and, more interestingly, iostat
shows a lot of IO to /var, with active service times blowing out
to half a second.
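
To try to pin down who is generating the IO to /var during these stalls, something like the following should narrow it down (a sketch; note that for deferred writes the execname seen at io:::start can be sched or fsflush rather than the real culprit):

# dtrace -q -n '
io:::start
/args[0]->b_flags & B_WRITE/
{ @bytes[execname, args[2]->fi_pathname] = sum(args[0]->b_bcount); }
tick-10s
{ printa("  %-12s %-40s %@d bytes\n", @bytes); trunc(@bytes); }'

Since aggregations print sorted by value, it should be fairly obvious which process and file line up with the stalls.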

Playing with the SCSI write cache and forcing the filesystem
to forcedirectio does not appear to help us.
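
For reference, the forcedirectio experiment was just a remount of /var along these lines (assuming /var has its own vfstab entry, which it does here given the separate ufsdump; a later remount with noforcedirectio backs it out):

# mount -F ufs -o remount,forcedirectio /var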

Looking through the Java source to the client, it looks like
the connection is shared between all of the threads created. It
would probably be more useful (and more likely to reflect
real life) if each thread had its own connection.

update 3

I’ve got to say that I’m also a bit suspicious of the relevance (to reality) of “benchmarking” an application server platform with both the client and the server living on the one machine. Do people actually do this in real life, where the server is likely to be pushed to its limit? I would have thought it much more reasonable (and likely) to run the application server separately from its clients.

It might be interesting to try splitting the client and server onto two different boxes. Unfortunately I’ve only got one T2000 to play with.


Written by Alan

December 13, 2005 at 5:07 pm

Posted in Solaris

2 Responses


  1. Are you sure the server really is running single threaded?
    I’m lousy at debugging Java, but I remember that older versions of Java didn’t have a parallel garbage collector.
    Because Niagara runs so much faster, the amount of garbage the collector has to free also increases, and you may have to tune the full GC to run less often.
    Hope this helps

    Jaime Cardoso

    December 19, 2005 at 12:34 pm

  2. Oops, after a more careful read, I understood that’s not the case. Sorry

    Jaime Cardoso

    December 19, 2005 at 1:45 pm


