Discussion:
CLOSE_WAIT issues
John Pollard
2006-07-25 17:03:32 UTC
List,

We have had similar problems in the past, but this seems to be
happening more often now:

Web site performance becomes erratic (some instances not responding)
Viewing JavaMonitor one or two instances show 0 in the Transactions
column, even though you know that is not the case (there will have
been hundreds or thousands)

Our WO app instances are all within the port range of 2000 - 2050:

Running:
/usr/sbin/lsof -i tcp:2001-2050 -P
gives some ok LISTEN lines + some problem lines like this:

java 553 root 20u IPv6 0x04bbcd68 0t0 TCP plug:2025->plug:64477 (CLOSE_WAIT)
java 553 root 21u IPv6 0x0646b2c8 0t0 TCP plug:2025->plug:64483 (CLOSE_WAIT)
When I send a kill -9 to the problem process (553), these are of course
cleared.

Casting my lsof net a bit wider:
/usr/sbin/lsof -i tcp:2001-2050 -P
gives about 10 lines like this:

java 281 root 9u IPv6 0x04bba470 0t0 TCP [::127.0.0.1]:49228->[::127.0.0.1]:3306 (CLOSE_WAIT)
java 569 root 9u IPv6 0x0616dd70 0t0 TCP [::127.0.0.1]:49247->[::127.0.0.1]:3306 (CLOSE_WAIT)

These all seem to be wotaskd processes. Are there meant to be 10
wotaskd processes running?

I tried sending a kill -QUIT signal to one of my hung WO instances to
force a stack trace, but nothing came out in the log file. I have now
read the posting about editing SpawnOfWotaskd.sh to get the log
output, so will do this for next time.

As an aside, we try to use lsof every night in a script to detect
these problems and reboot, but lsof sometimes returns absolutely
nothing or "unable to read process table" or something close to that.
I have read this is a bug with 10.4.6. Not sure if 10.4.7 helps.
Perhaps related to also having > 2G of RAM.
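
A minimal sketch of how such a nightly check could be written, assuming the lsof output format shown above (PID in the second column, state in parentheses at the end of the line). The function name count_close_wait is hypothetical. It treats completely empty lsof output as a failure rather than as "no problems", since lsof sometimes returns nothing at all.

```shell
#!/bin/sh
# Summarise CLOSE_WAIT sockets per process from `lsof -i` output on stdin.
# Prints one "PID COUNT" line per affected process.
# Exits with status 2 when the input is empty, so that a failed lsof run
# is not mistaken for a healthy machine.
count_close_wait() {
    awk '
        NF > 0 { seen = 1 }                 # at least one non-blank line arrived
        /\(CLOSE_WAIT\)/ { n[$2]++ }        # $2 is the PID column in lsof output
        END {
            if (!seen) exit 2
            for (pid in n) print pid, n[pid]
        }
    '
}

# Hypothetical usage in a nightly script (port range taken from the post):
#   /usr/sbin/lsof -i tcp:2001-2050 -P | count_close_wait
```

A script could then restart only the listed PIDs, and fall back to a reboot when the exit status is 2.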

These CLOSE_WAIT problems seem to appear with a release of WO (I
forget which) and never went away, though I am still using the WO
version one back from the latest (where WO is not bundled in XCode).

Is the root of this problem likely to be a deadlock in our instances?
I will report back if I can get some stack trace info next time.

Thanks
John
Sacha Michel Mallais
2006-07-25 17:20:42 UTC
Post by John Pollard
We have had similar problems in the past, but this seems to be
happening more often now:
Web site performance becomes erratic (some instances not responding)
Viewing JavaMonitor one or two instances show 0 in the Transactions
column, even though you know that is not the case (there will have
been hundreds or thousands)
/usr/sbin/lsof -i tcp:2001-2050 -P
java 553 root 20u IPv6 0x04bbcd68 0t0 TCP plug:2025->plug:64477 (CLOSE_WAIT)
java 553 root 21u IPv6 0x0646b2c8 0t0 TCP plug:2025->plug:64483 (CLOSE_WAIT)
When I send a kill -9 to the problem process (553), these are of
course cleared.
/usr/sbin/lsof -i tcp:2001-2050 -P
java 281 root 9u IPv6 0x04bba470 0t0 TCP [::127.0.0.1]:49228->[::127.0.0.1]:3306 (CLOSE_WAIT)
java 569 root 9u IPv6 0x0616dd70 0t0 TCP [::127.0.0.1]:49247->[::127.0.0.1]:3306 (CLOSE_WAIT)
These all seem to be wotaskd processes. Are there meant to be 10
wotaskd processes running?
You definitely should only have 1 wotaskd running. However, it's
unlikely that these are all wotaskd: on startup, wotaskd attaches to
port 1085 and if there's a previous wotaskd attached, it fails and
quits.
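
One way to check this, sketched under the assumption that lsof output has the PID in the second column and marks listening sockets with (LISTEN). The helper name listening_pids is hypothetical. Feeding it the lsof output for port 1085 should print a single PID if only one wotaskd is attached.

```shell
#!/bin/sh
# Print the distinct PIDs that hold a LISTEN socket,
# given `lsof -i` output on stdin.
listening_pids() {
    awk '/\(LISTEN\)/ { print $2 }' | sort -u
}

# Hypothetical usage (port 1085 as mentioned above):
#   /usr/sbin/lsof -i tcp:1085 -P | listening_pids
```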
Post by John Pollard
I tried sending a kill -QUIT signal to one of my hung WO instances
to force a stack trace, but nothing came out in the log file. I
have now read the posting about editing SpawnOfWotaskd.sh to get
the log output, so will do this for next time.
This is helpful if your apps are deadlocked, but the infamous
CLOSE_WAIT problem can occur without a deadlock.
Post by John Pollard
As an aside, we try to use lsof every night in a script to detect
these problems and reboot, but lsof sometimes returns absolutely
nothing or "unable to read process table" or something close to
that. I have read this is a bug with 10.4.6. Not sure if 10.4.7
helps. Perhaps related to also having > 2G of RAM.
I've had similar problems in the distant past (you can probably find
my posts on the related MacOSX Admin list), but the only solution I
found was to re-install.
Post by John Pollard
These CLOSE_WAIT problems seem to appear with a release of WO (I
forget which) and never went away, though I am still using the WO
version one back from the latest (where WO is not bundled in XCode).
Is the root of this problem likely to be a deadlock in our
instances? I will report back if I can get some stack trace info
next time.
Unlikely a deadlock, especially if you're still using WO 5.1 or 5.2.1
or (maybe?) 5.2.2. These versions all have known problems with
CLOSE_WAIT states. Rebooting will work, but so will killing wotaskd
and your apps, which should be faster to recover from. Of course,
your best bet is to upgrade to the latest 5.2 or, better yet, 5.3.


sacha
--
Sacha Michel Mallais Senior Developer / President
Global Village Consulting Inc. http://www.global-village.net/
PGP Key ID: 7D757B65 AIM: smallais
ObAd: read "Practical WebObjects" <fnord>
http://www.global-village.net/products/practical_webobjects
John Pollard
2006-07-25 18:26:33 UTC
Thanks Sacha,
I believe we are on WO5.2.4, though not sure how to confirm that 100% these
days. We do need to upgrade to 5.3, so will try to bring that process
forwards in case it helps.
You are right, I now think I was mistaken about the multiple wotaskd
processes. These are actually our java instances talking to mysqld (at
3306). Not sure if these being in a CLOSE_WAIT state is a problem or not? I
suspect not.
In our nightly script we do call:
/sbin/SystemStarter restart 'WebObjects Services'
but the CLOSE_WAITs do seem to remain afterwards, even with a 5 min sleep
after the WO restart, so then we reboot.
John
John Pollard
2006-07-25 18:40:29 UTC
I have confirmed WO5.2.4 from /Library/Receipts
It is the Developer version though, with a deployment license added, I
wonder if that could be a problem. We deploy on Mac OS X Client boxes, as
there doesn't seem much point in splashing out for Mac OS X Server. Could
this be our undoing on this issue?
Sacha Michel Mallais
2006-07-25 18:49:01 UTC
Post by John Pollard
I have confirmed WO5.2.4 from /Library/Receipts
It is the Developer version though, with a deployment license
added, I wonder if that could be a problem. We deploy on Mac OS X
Client boxes, as there doesn't seem much point in splashing out for
Mac OS X Server. Could this be our undoing on this issue?
I suppose it is possible that there is some difference between the 2
versions, but I don't think so. I don't recommend what you're doing
(simply because spending even a few hours configuring a client box is
usually not worth it if you can buy something that will work
out-of-the-box), but I would expect to see missing classes or something
along those lines if that was your problem.

Have you verified that the CLOSE_WAIT does, in fact, occur when
communicating to your DB? Or is it between wotaskd and the apps? My
experience is with the latter...
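
A rough sketch of how the two cases could be told apart from lsof output. It assumes the setup described earlier in the thread (app instances listening on ports 2001-2050, mysqld on port 3306) and the NAME column format local->remote; the function name classify_close_wait and the labels db, instance and other are hypothetical.

```shell
#!/bin/sh
# Classify CLOSE_WAIT sockets from `lsof -i -P` output on stdin:
#   db       - remote port is the database port (3306 assumed)
#   instance - local port is in the app instance range (2001-2050 assumed)
#   other    - anything else
# Prints "KIND COUNT" lines, sorted by kind.
classify_close_wait() {
    awk -v db=3306 -v lo=2001 -v hi=2050 '
        /\(CLOSE_WAIT\)/ {
            split($(NF - 1), ends, "->")     # NAME column: local->remote
            lport = ends[1]; sub(/^.*:/, "", lport); lport = lport + 0
            rport = ends[2]; sub(/^.*:/, "", rport); rport = rport + 0
            if (rport == db + 0) kind = "db"
            else if (lport >= lo + 0 && lport <= hi + 0) kind = "instance"
            else kind = "other"
            n[kind]++
        }
        END { for (k in n) print k, n[k] }
    ' | sort
}

# Hypothetical usage:
#   /usr/sbin/lsof -i tcp -P | classify_close_wait
```

A growing "instance" count would point at the connections to the app instances rather than at the database.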


sacha
--
Sacha Michel Mallais 400 kg chimp
Global Village Consulting Inc. http://www.global-village.net/
PGP Key ID: 7D757B65 AIM: smallais
"Choke on that, causality!" -- the Professor, "Futurama"
Aurelien Minet
2006-07-25 19:26:29 UTC
Hi all,

I have CLOSE_WAIT connections too, mostly with DB connections, but I also
have transactions in JavaMonitor with no active session.
A TCP connection in CLOSE_WAIT means that the remote end has closed the
connection (sent a FIN) and the local application has not yet closed its
own side of the socket. This state should end when the application tries
to use the file descriptor: if it is a DB connection (which has a timeout),
the application would try to reconnect; if it is a client connection, it
would end the session.
You may want to check the version of your JDBC driver, and check whether it
has a problem with IPv6 (your lsof output reports TCP connections over IPv6
on the loopback).
Putting your application in debug mode may help.

Aurelien
--
Aurelien Minet
Direction des Systemes d'Information
Universite Rene Descartes
John Pollard
2006-07-27 12:07:38 UTC
Am using mysql-connector-java-3.0.16-ga-bin.jar, will look at upgrading,
thanks.
Sacha Michel Mallais
2006-07-25 18:43:58 UTC
Post by John Pollard
I believe we are on WO5.2.4, though not sure how to confirm that
100% these days. We do need to upgrade to 5.3, so will try to bring
that process forwards in case it helps.
Hmm... I've never had CLOSE_WAIT problems with 5.2.4 -- and I've
heard disturbing reports that 5.3 has some problems leaving files in
CLOSE_WAIT (though not as bad as 5.2.2 and 5.2.1). Given that you're
having problems with lsof (and I assume ps too), I'd very strongly
recommend a re-install of the entire OS.
Post by John Pollard
You are right, I now think I was mistaken about the multiple
wotaskd processes. These are actually our java instances talking to
mysqld (at 3306). Not sure if these being in a CLOSE_WAIT state is
a problem or not? I suspect not.
Not sure. I've never noticed CLOSE_WAIT being a problem when
communicating with a database, so maybe this is something different.
What I've seen is that both wotaskd and any WO app can get stuck in
CLOSE_WAIT when talking to each other.
Post by John Pollard
/sbin/SystemStarter restart 'WebObjects Services'
but the CLOSE_WAITS do seem to remain afterwards, even with a 5 min
sleep after the wo restart, so then we reboot
Do you see wotaskd and your apps restart when you issue this
command? Because that is what is required -- and I've had problems
using SystemStarter in the past. If this is restarting wotaskd and
your apps properly and you're still having problems then it's even
more evidence that you need to re-install the OS. One of my clients
had similar problems (restarting wotaskd and apps didn't help) and
_nothing_ we did helped -- short of a reboot, as you are describing
-- until he re-installed everything.
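
To verify that the processes really were replaced, one could compare their PIDs before and after the restart. This is only a sketch: the helper name surviving_pids is hypothetical, and the ps invocation and the pattern used to find wotaskd in the usage comment are assumptions about how the process appears on a given machine.

```shell
#!/bin/sh
# Print every PID that appears in both whitespace-separated lists,
# i.e. processes that were NOT restarted.
surviving_pids() {
    before=$1
    after=$2
    for pid in $before; do
        for other in $after; do
            if [ "$pid" = "$other" ]; then
                echo "$pid"
            fi
        done
    done
}

# Hypothetical usage:
#   before=$(ps -axo pid,command | awk '/[w]otaskd/ { print $1 }')
#   /sbin/SystemStarter restart 'WebObjects Services'
#   after=$(ps -axo pid,command | awk '/[w]otaskd/ { print $1 }')
#   surviving_pids "$before" "$after"
```

Empty output would suggest that every matched process was restarted.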


sacha
--
Sacha Michel Mallais Senior Developer / President
Global Village Consulting Inc. http://www.global-village.net/
PGP Key ID: 7D757B65 AIM: smallais
John Pollard
2006-07-27 12:04:12 UTC
Sacha,

As you say, it is disturbing to hear people report CLOSE_WAITs with 5.3 also.
I have your point about an OS install in mind, thanks for the tip.
At present I am rebooting every night and since doing so, no hiccups yet,
though I do not count this as a long term solution.
wotaskd was definitely restarting.
Though we were restarting wotaskd previously (SystemStarter seems to work
ok), we were killing/restarting the stuck applications. Perhaps scheduling
restarts for every app instance and restarting wotaskd every night is
normal practice and what we should have been doing?
Post by Sacha Michel Mallais
I suppose it is possible that there is some difference between the 2
versions, but I don't think so. I don't recommend what you're doing
(simply because spending even a few hours configuring a client box is
usually not worth it if you can buy something that will work
out-of-the-box), but I would expect to see missing classes or something
along those lines if that was your problem.
I think we were almost forced to keep away from X Server because it came
preinstalled with a version of WO that at the time completely broke our
Java Client applications and there was no supported way of having an older
WO version. I actually find that the installation work needed on X Client
is pretty much the same as on X Server.
Post by Sacha Michel Mallais
Have you verified that the CLOSE_WAIT does, in fact, occur when
communicating to your DB? Or is it between wotaskd and the apps? My
experience is with the latter...
I have been looking into this a bit more. We do have CLOSE_WAITs from some
of our WO apps connecting to mysql, but I haven't seen these grow out of
control (have only seen one per app instance) in the way they do for the
wotaskd CLOSE_WAIT connections. I expect there is something slightly wrong
there, but I don't have any evidence that this stops anything working.
The main problem seems to be when the CLOSE_WAITs happen between wotaskd and
app instances as you say and the app instances become non responsive.

I am yet to see a repeat (since adding nightly reboots) so have not yet had
a chance to get a stack trace.

John
Sacha Michel Mallais
2006-07-27 15:41:30 UTC
Post by John Pollard
As you say, it is disturbing to hear people report CLOSE_WAITs with 5.3 also.
I have your point about an OS install in mind, thanks for the tip.
At present I am rebooting every night and since doing so, no
hiccups yet, though I do not count this as a long term solution.
wotaskd was definitely restarting.
Though we were restarting wotaskd previously (SystemStarter seems
to work ok), we were killing/restarting the stuck applications.
Perhaps scheduling restarts for every app instance and restarting
wotaskd every night is normal practice and what we should have been
doing?
I would say it is normal to schedule your apps to restart on a
regular basis: how often depends on how confident you are in your
code. :-) As for wotaskd, it _should_ run forever without restarting.
Post by John Pollard
Post by Sacha Michel Mallais
I suppose it is possible that there is some difference between the 2
versions, but I don't think so. I don't recommend what you're doing
(simply because spending even a few hours configuring a client box is
usually not worth it if you can buy something that will work
out-of-the-box), but I would expect to see missing classes or something
along those lines if that was your problem.
I think we were almost forced to keep away from X Server because it
came preinstalled with a version of WO that at the time completely
broke our Java Client applications and there was no supported way
of having an older WO version. I actually find that the
installation work needed on X Client is pretty much the same as on
X Server.
Fair enough.


sacha
--
Sacha Michel Mallais Senior Developer / President
Global Village Consulting Inc. http://www.global-village.net/
PGP Key ID: 7D757B65 AIM: smallais
"Good people do not need laws to tell them to act responsibly,
while bad people will find a way around the laws." -- Plato