Discussion:
Runaway Java Processes
Ken Schweigert
2006-07-27 15:33:39 UTC
We are having a problem with one of our WO applications. I'm hoping
someone will have some guidance as to what to do. I have a lot of
symptoms, but no real idea where to go next.

The machine is running MacOSX-10.3.5 with WO-5.2. We run 5
instances of the application, and about every 3 or 4 weeks it becomes
unresponsive. We also host a few other applications that never seem
to have this problem. When I 'ssh' in to the machine and run 'top' I
see:

--------------------------
Processes:  58 total, 4 running, 54 sleeping... 1355 threads   11:45:42
Load Avg:  0.46, 0.61, 0.43   CPU usage:  11.2% user, 10.8% sys, 78.1% idle
SharedLibs: num = 89, resident = 7.17M code, 860K data, 2.69M LinkEdit
MemRegions: num = 14137, resident = 1.24G + 3.60M private, 24.1M shared
PhysMem:  148M wired, 420M active, 900M inactive, 1.43G used, 67.1M free
VM: 9.15G + 57.3M   416495(0) pageins, 942650(0) pageouts

  PID COMMAND  %CPU      TIME  #TH #PRTS #MREGS   RPRVT  RSHRD   RSIZE  VSIZE
27600 java     0.0%  67:18.84   29   558    285   80.8M  15.0M   81.4M   307M
13570 top     19.2%   0:08.34    1    17     26    592K   352K    968K  27.1M
13569 bash     0.0%   0:00.03    1    12     15    180K   796K    792K  18.2M
13568 sshd     0.0%   0:00.01    1     9     41    112K  1.30M    448K  30.0M
13566 sshd     0.0%   0:00.21    1    15     41    116K  1.30M   1.45M  30.0M
13441 java     0.0%  28:00.67   30   510    248   89.7M  11.0M   98.7M   317M
13439 java    18.4%  20:07.25   30   535    257  90.7M+  11.0M  99.8M+   317M
13363 java     0.8%  14:48.68   30   531    248   93.0M  11.0M    102M   317M
13339 java     0.0%  32:54.35   30   522    251   91.0M  11.0M    100M   317M
13335 java     0.0%  30:38.76   30   546    262    101M  11.0M    110M   325M
13004 httpd    0.0%   0:00.78    1    10    103    232K  4.41M   2.20M  36.6M
 8771 httpd    0.0%   0:00.76    1    10    106    356K  4.41M   2.27M  36.7M
 6043 httpd    0.0%   0:00.63    1    10    106    352K  4.41M   2.28M  36.7M
 2941 java     0.8%  99:44.73   >>   >>>    983    199M  10.3M    110M   942M
 2906 java     0.0%  53:56.68   >>   >>>   1080   13.5M  10.3M   6.41M   576M
 2871 java     0.0%  49:29.13   >>   >>>   1060   26.1M  10.3M   15.9M   543M
 2836 java     0.0%   2:10:53   >>   >>>   1213    238M  10.3M    116M  1.12G
 2781 java     0.0%  47:16.36   78   836    482   64.2M  10.3M   58.8M   414M
  477 java     0.0%  82:49.47   28   524    280   50.7M  15.8M   48.8M   307M
--------------------------

From the 'top' documentation I see that a '>>' in the '#TH' column
means the process has more than 99 threads, and a '>>>' in the '#PRTS'
column means it has more than 999 ports. That seems excessive to me,
seeing that none of the other applications use this many resources.

I can also verify that each of the java processes with a '>>>' is a
process of the same application. I used 'lsof -i -n -P | grep java |
less' and can see that the associated TCP port matches the one set in
the application. Also, scattered throughout the runaway process's
output I see quite a few of these lines:

COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
<snip>
java    19994 root  181u  IPv6             0t0  TCP  can't read in6pcb at 0x00000000
java    19994 root  182u  IPv6             0t0  TCP  can't read in6pcb at 0x00000000

Another thing that concerns me is the 'VSIZE' column, which shows one
instance at 1.12G and another at 942M.

So, as you can see, there are a lot of symptoms and I'm not sure
whether they all point to the same problem. I should also mention
that our WO developer has since left, so I'm the one who needs to
get this under control.

Thank you for any help you can offer.
--
Ken Schweigert, Network Administrator
Byte Productions, LLC
Sacha Michel Mallais
2006-07-27 15:52:09 UTC
Post by Ken Schweigert
We are having a problem with one of our WO applications. I'm
hoping someone will have some guidance as to what to do. I have a
lot of symptoms, but no real idea where to go next.
The machine is running MacOSX-10.3.5 with WO-5.2. We run 5
instances of the application, and about every 3 or 4 weeks it becomes
unresponsive. We also host a few other applications that never
seem to have this problem. When I 'ssh' in to the machine and run
I was just talking about this in another thread: it is normal to
schedule your apps to restart on a regular basis. You might start
with that.
Post by Ken Schweigert
While reading the 'top' documentation I see that when the '#TH'
column has a '>>' that it has more than 99 threads and when '#PRTS'
has a '>>>' that there are more than 999 ports. I find this a
little excessive seeing that none of the other applications use
this many resources.
It is excessive. You should only see that under two conditions:
1) your app is being hit that many times at the same time
2) worker threads are getting locked up so that WO has to spawn more
of them

Assuming it's #2, one way to determine where the deadlock is occurring
is to send the process a QUIT signal, as in "kill -QUIT <pid>". This
will tell java to spew out a stack trace for each of its threads.
Unfortunately, this output is sent to /dev/null by default, so you'll
have to edit the startup script SpawnOfWotaskd as described under
"Where's my stderr" on this page:
http://en.wikibooks.org/wiki/Programming:WebObjects/Web_Applications/Deployment/Common_Pitfalls_and_Troubleshooting
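
For anyone who has not seen one of these dumps, here is a rough sketch of the same kind of report produced from inside the JVM. It is only an illustration: the class name ThreadDumpSketch is made up, nothing in it is a WebObjects API, and it assumes Java 5 or later (Thread.getAllStackTraces does not exist on older JVMs, which may well include the one on the machine described above, so "kill -QUIT" remains the route there).

```java
import java.util.Map;

// Hypothetical helper, not part of WebObjects. Requires Java 5 or later.
final class ThreadDumpSketch {

    // Builds a text report with one stack trace per live thread in this JVM.
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Write to stderr so the report lands wherever error output is redirected.
        System.err.print(dumpAllThreads());
    }
}
```

Unlike the QUIT signal, this only sees the JVM it runs in, so it would have to be called from within the app itself. In either kind of dump, many worker threads stuck at the same frame, all waiting on the same lock, are the thing to look for.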
Post by Ken Schweigert
I can also verify that each of the java processes with a '>>>' is a
process of the same application. I used 'lsof -i -n -P | grep java
| less' and can see that the associated TCP port matches the one set
in the application. Also scattered throughout the runaway process's
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
<snip>
java    19994 root  181u  IPv6             0t0  TCP  can't read in6pcb at 0x00000000
java    19994 root  182u  IPv6             0t0  TCP  can't read in6pcb at 0x00000000
Another thing that concerns me is the 'VSIZE' column, which shows
one instance at 1.12G and another at 942M.
Probably related to the number of worker threads.


sacha
--
Sacha Michel Mallais Senior Developer / President
Global Village Consulting Inc. http://www.global-village.net/
PGP Key ID: 7D757B65 AIM: smallais
"I resist change even as I call for it." -- Mason Cooley
Ken Schweigert
2006-07-27 17:27:13 UTC
Post by Sacha Michel Mallais
Post by Ken Schweigert
We are having a problem with one of our WO applications. I'm
hoping someone will have some guidance as to what to do. I have a
lot of symptoms, but no real idea where to go next.
The machine is running MacOSX-10.3.5 with WO-5.2. We run 5
instances of the application, and about every 3 or 4 weeks it becomes
unresponsive. We also host a few other applications that never
seem to have this problem. When I 'ssh' in to the machine and run
I was just talking about this in another thread: it is normal to
schedule your apps to restart on a regular basis. You might start
with that.
I've tried scheduling the instances to restart, but I usually end up
with something in an instance "hanging": the instance stays in the
"refuse new instances" state, eventually all of them are like that,
and the application becomes unreachable. I scheduled the instances a
couple of hours apart, hoping that by the time the next instance
restarted, the previous one would have shut down and restarted.
Post by Sacha Michel Mallais
Post by Ken Schweigert
While reading the 'top' documentation I see that when the '#TH'
column has a '>>' that it has more than 99 threads and when
'#PRTS' has a '>>>' that there are more than 999 ports. I find
this a little excessive seeing that none of the other applications
use this many resources.
1) your app is being hit that many times at the same time
2) worker threads are getting locked up so that WO has to spawn
more of them
Assuming it's #2, one way to determine where the deadlock is
occurring is to send the process a QUIT signal, as in "kill -QUIT
<pid>". This will tell java to spew out a stack trace.
Unfortunately, this stack trace is sent to /dev/null by default, so
you'll have to edit the startup script SpawnOfWotaskd as described
under "Where's my stderr" on this page:
http://en.wikibooks.org/wiki/Programming:WebObjects/Web_Applications/Deployment/Common_Pitfalls_and_Troubleshooting
Great tip! I just modified that startup script and am in the process
of restarting the application's instances. I'm sure I'll be able to
'kill' an instance soon.
Post by Sacha Michel Mallais
Post by Ken Schweigert
I can also verify that each of the java processes with a '>>>' is
a process of the same application. I used 'lsof -i -n -P | grep
java | less' and can see that the associated TCP port matches the one
set in the application. Also scattered throughout the runaway
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
<snip>
java    19994 root  181u  IPv6             0t0  TCP  can't read in6pcb at 0x00000000
java    19994 root  182u  IPv6             0t0  TCP  can't read in6pcb at 0x00000000
Another thing that concerns me is the 'VSIZE' column, which shows
one instance at 1.12G and another at 942M.
Probably related to the number of worker threads.
Just killed an instance and noticed that the free memory shot way up.


--
Ken Schweigert, Network Administrator
Byte Productions, LLC
Sacha Michel Mallais
2006-07-27 18:03:01 UTC
Post by Ken Schweigert
Post by Sacha Michel Mallais
Post by Ken Schweigert
We are having a problem with one of our WO applications. I'm
hoping someone will have some guidance as to what to do. I have
a lot of symptoms, but no real idea where to go next.
The machine is running MacOSX-10.3.5 with WO-5.2. We run 5
instances of the application, and about every 3 or 4 weeks it becomes
unresponsive. We also host a few other applications that never
seem to have this problem. When I 'ssh' in to the machine and
I was just talking about this in another thread: it is normal to
schedule your apps to restart on a regular basis. You might start
with that.
I've tried scheduling the instances to restart, but I usually end up
with something in an instance "hanging": the instance stays in the
"refuse new instances" state, eventually all of them are like that,
and the application becomes unreachable. I scheduled the instances a
couple of hours apart, hoping that by the time the next instance
restarted, the previous one would have shut down and restarted.
Sounds familiar: I've run into the same behaviour myself, and it is
another indicator that your app is deadlocking. Are you using the
MultiECLockManager (written by, if I'm not mistaken, Jon Rochkind)?
This little helper really simplifies the Editing Context locking that
you have to do.

Another common problem that causes deadlocks is exceptions being
raised during session check-in or check-out, usually in Session.sleep
or Session.awake. If you throw any exception from these methods,
your app is on a one-way train bound for deadlock city.
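
To make that concrete, here is a minimal sketch of the failure mode. It is an illustration only: FakeSession and handleRequest are made-up stand-ins, not WebObjects classes, and the assumption that the release step is simply skipped when sleep() throws is mine, not a description of how the framework is actually written. The point it shows is the app-side habit of catching everything inside sleep() so that nothing escapes.

```java
import java.util.concurrent.locks.ReentrantLock;

// Self-contained illustration; none of these classes are WebObjects classes.
final class SessionSleepSketch {

    // Stand-in for a session whose cleanup work can fail.
    static class FakeSession {
        // Stand-in for whatever lock is held between check-out and check-in.
        final ReentrantLock lock = new ReentrantLock();
        final boolean guarded;

        FakeSession(boolean guarded) {
            this.guarded = guarded;
        }

        void sleep() {
            if (!guarded) {
                cleanUp(); // the exception escapes to the caller
                return;
            }
            try {
                cleanUp();
            } catch (RuntimeException e) {
                // Log and swallow so that check-in can still finish.
                System.err.println("sleep() failed: " + e);
            }
        }

        private void cleanUp() {
            throw new IllegalStateException("simulated failure during cleanup");
        }
    }

    // Assumed, simplified request cycle: the release is skipped if sleep() throws.
    static void handleRequest(FakeSession session) {
        session.lock.lock();   // check-out
        session.sleep();
        session.lock.unlock(); // check-in, never reached if sleep() throws
    }

    public static void main(String[] args) {
        FakeSession guarded = new FakeSession(true);
        handleRequest(guarded);
        System.out.println("guarded session still locked: " + guarded.lock.isLocked());

        FakeSession unguarded = new FakeSession(false);
        try {
            handleRequest(unguarded);
        } catch (IllegalStateException e) {
            System.out.println("unguarded session still locked: " + unguarded.lock.isLocked());
        }
    }
}
```

In this sketch the guarded session ends up unlocked and the unguarded one stays locked, so any later request for that session would wait forever. A real app would do its actual cleanup where cleanUp() is and log the exception properly.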

HTH,


sacha
--
Sacha Michel Mallais Software Poet
Global Village Consulting Inc. http://www.global-village.net/
PGP Key ID: 7D757B65 AIM: smallais
There are three types of people in this world:
those that can count, and those that can't