Ken Schweigert
2006-07-27 15:33:39 UTC
We are having a problem with one of our WO applications. I'm hoping
someone will have some guidance as to what to do. I have a lot of
symptoms, but no real idea where to go next.
The machine is running MacOSX-10.3.5 with WO-5.2. We run 5
instances of the application, and after about 3 or 4 weeks it becomes
unresponsive. We also host a few other applications that never seem
to have this problem. When I 'ssh' in to the machine and run 'top' I
see:
--------------------------
Processes: 58 total, 4 running, 54 sleeping... 1355 threads   11:45:42
Load Avg: 0.46, 0.61, 0.43   CPU usage: 11.2% user, 10.8% sys, 78.1% idle
SharedLibs: num = 89, resident = 7.17M code, 860K data, 2.69M LinkEdit
MemRegions: num = 14137, resident = 1.24G + 3.60M private, 24.1M shared
PhysMem: 148M wired, 420M active, 900M inactive, 1.43G used, 67.1M free
VM: 9.15G + 57.3M   416495(0) pageins, 942650(0) pageouts
PID   COMMAND  %CPU     TIME #TH #PRTS #MREGS RPRVT  RSHRD RSIZE  VSIZE
27600 java     0.0% 67:18.84  29   558    285 80.8M  15.0M 81.4M   307M
13570 top     19.2%  0:08.34   1    17     26  592K   352K  968K  27.1M
13569 bash     0.0%  0:00.03   1    12     15  180K   796K  792K  18.2M
13568 sshd     0.0%  0:00.01   1     9     41  112K  1.30M  448K  30.0M
13566 sshd     0.0%  0:00.21   1    15     41  116K  1.30M 1.45M  30.0M
13441 java     0.0% 28:00.67  30   510    248 89.7M  11.0M 98.7M   317M
13439 java    18.4% 20:07.25  30   535    257 90.7M+ 11.0M 99.8M+  317M
13363 java     0.8% 14:48.68  30   531    248 93.0M  11.0M  102M   317M
13339 java     0.0% 32:54.35  30   522    251 91.0M  11.0M  100M   317M
13335 java     0.0% 30:38.76  30   546    262  101M  11.0M  110M   325M
13004 httpd    0.0%  0:00.78   1    10    103  232K  4.41M 2.20M  36.6M
8771  httpd    0.0%  0:00.76   1    10    106  356K  4.41M 2.27M  36.7M
6043  httpd    0.0%  0:00.63   1    10    106  352K  4.41M 2.28M  36.7M
2941  java     0.8% 99:44.73  >>   >>>    983  199M  10.3M  110M   942M
2906  java     0.0% 53:56.68  >>   >>>   1080 13.5M  10.3M 6.41M   576M
2871  java     0.0% 49:29.13  >>   >>>   1060 26.1M  10.3M 15.9M   543M
2836  java     0.0%  2:10:53  >>   >>>   1213  238M  10.3M  116M  1.12G
2781  java     0.0% 47:16.36  78   836    482 64.2M  10.3M 58.8M   414M
477   java     0.0% 82:49.47  28   524    280 50.7M  15.8M 48.8M   307M
--------------------------
According to the 'top' documentation, a '>>' in the '#TH' column
means the process has more than 99 threads, and a '>>>' in the
'#PRTS' column means it has more than 999 ports. This seems
excessive to me, since none of the other applications use this many
resources.
I can also verify that each of the java processes with a '>>>' is an
instance of the same application. I used 'lsof -i -n -P | grep java |
less' and can see that the associated TCP port matches the one set in
the application. Also, scattered throughout the runaway process's
output I see quite a few of these lines:
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
<snip>
java    19994 root  181u  IPv6             0t0  TCP can't read in6pcb at 0x00000000
java    19994 root  182u  IPv6             0t0  TCP can't read in6pcb at 0x00000000
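Rather than paging through the 'lsof' output by eye, a per-process count might make the runaway instances easier to spot. The following is only a rough sketch: the function name 'count_by_pid' is made up for illustration, and it assumes COMMAND is the first column and PID the second, as in the header above.

```shell
# Hypothetical helper: count lsof lines per PID for java processes.
# Assumes column 1 is COMMAND and column 2 is PID, as in the lsof header.
count_by_pid() {
    awk '$1 == "java" { n[$2]++ } END { for (p in n) print p, n[p] }' |
        sort -k2,2nr    # largest count first
}

# Usage (requires lsof; run as root to see every process):
#   lsof -i -n -P | count_by_pid
```

Each output line is a PID followed by the number of network file entries 'lsof' reported for it, so an instance holding far more sockets than its siblings should sort to the top.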
Another thing that concerns me is the 'VSIZE' column, which shows one
instance at 1.12GB and another at 942MB.
So, as you can see, there are a lot of symptoms and I'm not sure if
they are all pointing to the same problem. I should also mention
that our WO developer has since left, so I'm the one who needs to
get this under control.
Thank you for any help you can offer.
--
Ken Schweigert, Network Administrator
Byte Productions, LLC