2005-06-21

  20:02  • Closed ticket [1220692fff]: tk locks up plus 6 other changes  artifact: 0c714c90da  user: davidw

2005-06-16

  05:24  • Ticket [1220692fff]: 4 changes  artifact: 9dfc2acf77  user: dkf
  05:06  • Ticket [1220692fff]: 4 changes  artifact: 9dca40a99b  user: davidw
  00:19  • Ticket [1220692fff]: 4 changes  artifact: f8c86e3691  user: vasiljevic
  00:09  • Ticket [1220692fff]: 4 changes  artifact: b88361bb45  user: davidw

2005-06-15

  22:19  • Ticket [1220692fff]: 4 changes  artifact: d9aa558d8c  user: vasiljevic
  21:46  • Ticket [1220692fff]: 4 changes  artifact: 42e0cacc05  user: davidw
  21:28  • Ticket [1220692fff]: 4 changes  artifact: 8869891b08  user: vasiljevic
  21:20  • Ticket [1220692fff]: 4 changes  artifact: 58a1b574ff  user: davidw
  19:51  • Ticket [1220692fff]: 4 changes  artifact: bb58b51a81  user: dkf
  19:51  • Ticket [1220692fff]: 4 changes  artifact: 4410bed9dc  user: vasiljevic
  19:46  • Ticket [1220692fff]: 4 changes  artifact: 28dc65b01c  user: dkf
  19:27  • Ticket [1220692fff]: 4 changes  artifact: 6017323e36  user: vasiljevic
  19:05  • Ticket [1220692fff]: 4 changes  artifact: ff88560801  user: davidw
  18:41  • Ticket [1220692fff]: 4 changes  artifact: a6c3cd1cb2  user: vasiljevic
  16:57  • Ticket [1220692fff]: 4 changes  artifact: e2f8596f72  user: davidw
  15:58  • Ticket [1220692fff]: 4 changes  artifact: 91ae732aea  user: vasiljevic
  05:52  • Ticket [1220692fff]: 4 changes  artifact: 5cfc278b77  user: davidw
  04:53  • Ticket [1220692fff]: 4 changes  artifact: 8b0c9215db  user: davidw
  04:30  • Ticket [1220692fff]: 4 changes  artifact: b7eaf968fb  user: davidw
  02:54  • Ticket [1220692fff]: 4 changes  artifact: ebd6901829  user: davidw

2005-06-14

  19:49  • New ticket [1220692fff].  artifact: 3bba6c1dc2  user: davidw
Ticket UUID:  1220692
Title:        tk locks up
Type:         Bug               Version:        None
Submitter:    davidw            Created on:     2005-06-14 19:49:29
Subsystem:    70. Event Loop    Assigned To:    hobbs
Priority:     5 Medium          Severity:
Status:       Closed            Last Modified:  2005-06-21 20:02:25
Resolution:   None              Closed By:      davidw
Closed on:    2005-06-21 13:02:25

Description:
This script, when run under strace, locks up Tk on Linux:
# loop.tcl
# Run like so:
# strace tclsh loop.tcl
package require Tk
proc screenupdate {num} {
incr num
.foo configure -text $num
after 10 [list screenupdate $num]
}
button .foo
pack .foo
screenupdate 0
-----
Or better: it seems to reproduce the bug I see in a script that locks
up even without strace. The following script will lock it up without
strace. OK, so it's kind of evil, but still, it seems to be tickling a
race condition or something. Note that it's not just slow; it's locked
up and not doing anything.
# loop.tcl
# Run like so:
# strace tclsh loop.tcl
package require Tk
proc afterburner {} {
after 10 afterburner
}
proc screenupdate {num} {
incr num
.foo configure -text $num
after 10 [list screenupdate $num]
}
button .foo
pack .foo
screenupdate 0
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
-------------
User Comments:
davidw added on 2005-06-21 20:02:25:

Logged In: YES user_id=240

I'm going to go ahead and close this. I'm convinced the problem lies
with Debian's libc. If I discover that not to be the case, we can
always reopen this bug.

dkf added on 2005-06-16 05:24:11:

Logged In: YES user_id=79902

Reviewing the issues you've found, it does indeed look like glibc
2.3.2 is subtly bust in the threading department. This might explain
some bizarre hangs of production Java apps on some of our systems. :-(
Some of the comments in that bugzilla report didn't exactly raise the
spirits either, but then I've long suspected that glibc was not
maintained to the quality levels I'm accustomed to with Tcl...

davidw added on 2005-06-16 05:06:54:

Logged In: YES
user_id=240
Let's compare libc versions. Debian sarge has:
Version: 2.3.2.ds1-22
The last entry in /usr/share/doc/libc6/changelog.gz is dated:
2003-09-19 Ulrich Drepper <drepper@redhat.com>
* elf/Makefile ($(objpfx)librtld.os): Create link
map also for
final build to ease quality control.
So to me it looks like more and more things point to libc6.
I've started a thread on the Debian list - once we have
confirmation there, we can close this bug.
http://lists.debian.org/debian-glibc/2005/06/msg00134.html
vasiljevic added on 2005-06-16 00:19:53: Logged In: YES
user_id=95086
Hm....
This is the test system (SuSE Linux on a Dell 2.4GHz):
zoran@zvlinux:~/sf/core-8-4-branch/unix> uname -a
Linux zvlinux 2.6.4-52-default #1 Wed Apr 7 02:08:30 UTC 2004 i686
i686 i386 GNU/Linux
zoran@zvlinux:~/sf/core-8-4-branch/unix> strace -o dummy ./tclsh
% array get tcl_platform
osVersion 2.6.4-52-default byteOrder littleEndian threaded 1 machine
i686 platform unix os Linux user zoran wordSize 4
% set tcl_patchLevel
8.4.10
This is the test program which exercises timer events,
file events and inter-thread communication events
(using Tcl thread package and thread::broadcast command)
package req Thread
for {set i 0} {$i < 16} {incr i} {
thread::create {
set tack 0
set fl [open /dev/zero]
proc afterburner {} {
if {($::tack % 1000) == 0} {
puts AB:tack([thread::id])-$::tack
}
after 1 afterburner
}
proc screenupdate num {
if {[incr num] % 1000 == 0} {
puts SU:tick([thread::id])-$num
}
thread::broadcast "set ::tack $num"
after 1 screenupdate $num
}
proc readzero {chan} {
if {($::tack % 1000) == 0} {
puts RZ:tack([thread::id])-$::tack
}
after 1
}
fconfigure $fl -blocking 0
fileevent $fl readable [list readzero $fl]
afterburner
screenupdate 1
thread::wait
}
}
If I start this, it will create 16 threads which will chew up lots of CPU
and exercise the notifier *a lot*. This runs happily on my system:
% thread::names
tid0x424b1bb0 tid0x422b0bb0 tid0x420afbb0 tid0x41eaebb0
tid0x41cadbb0 tid0x41aacbb0 tid0x418abbb0 tid0x416aabb0
tid0x414a9bb0 tid0x412a8bb0 tid0x410a7bb0 tid0x40ea6bb0
tid0x40ca5bb0 tid0x40aa4bb0 tid0x408a3bb0 tid0x406a2bb0
tid0x402392a0
SU:tick(tid0x424b1bb0)-124000
AB:tack(tid0x418abbb0)-124000
SU:tick(tid0x408a3bb0)-124000
RZ:tack(tid0x424b1bb0)-124000
RZ:tack(tid0x418abbb0)-124000
RZ:tack(tid0x40ca5bb0)-124000
RZ:tack(tid0x412a8bb0)-124000
RZ:tack(tid0x40ea6bb0)-124000
AB:tack(tid0x412a8bb0)-124000
AB:tack(tid0x40ea6bb0)-124000
AB:tack(tid0x424b1bb0)-124000
SU:tick(tid0x40aa4bb0)-124000
SU:tick(tid0x412a8bb0)-124000
AB:tack(tid0x40ca5bb0)-124000
SU:tick(tid0x416aabb0)-124000
SU:tick(tid0x40ea6bb0)-124000
SU:tick(tid0x41aacbb0)-124000
SU:tick(tid0x422b0bb0)-124000
SU:tick(tid0x41eaebb0)-124000
AB:tack(tid0x416aabb0)-124000
SU:tick(tid0x414a9bb0)-125000
RZ:tack(tid0x408a3bb0)-125000
RZ:tack(tid0x420afbb0)-125000
SU:tick(tid0x418abbb0)-124000
SU:tick(tid0x40ca5bb0)-125000
RZ:tack(tid0x41cadbb0)-125000
RZ:tack(tid0x414a9bb0)-125000
RZ:tack(tid0x422b0bb0)-125000
RZ:tack(tid0x406a2bb0)-125000
RZ:tack(tid0x41eaebb0)-125000
RZ:tack(tid0x40aa4bb0)-125000
(and so on and so on...)
You can observe the alternating thread-ID's and counters.
NOTE: it runs under "strace", swamping my disk and chewing
my CPU:
zoran@zvlinux:/usr/local> ps -efl | grep tclsh
0 S zoran 5979 4125 0 76 0 - 394 wait4 19:04 pts/40 00:00:00
strace -o dummy ./tclsh
0 S zoran 5980 5979 0 75 0 - 48855 schedu 19:04 pts/40
00:00:00 ./tclsh
top - 19:11:03 up 42 min, 2 users, load average: 11.08, 7.83, 4.16
I cannot conclude anything but that this might be
some weird Linux issue on your config. Can you
verify this on some other Linux distribution or
perhaps other processor?
I gave it a try under Solaris/Sparc (2.8) and Darwin/PPC (10.4)
and it also works as expected.
(Confused) Zoran
davidw added on 2005-06-16 00:09:57:

Logged In: YES user_id=240

This bug looks an awful lot like this one:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=108631
although their provided code doesn't hang... I've got an email in to
the Debian libc maintainers to see if they have anything to say.

vasiljevic added on 2005-06-15 22:19:04:

Logged In: YES
user_id=95086
Most interesting....
I do not believe this has anything to do with the problem, but the
code around line 755 in unix/tclUnixNotfy.c would be better written as:
    while (!tsdPtr->eventReady) {
        Tcl_ConditionWait(&tsdPtr->waitCV, &notifierMutex, timePtr);
    }

instead of

    if (!tsdPtr->eventReady) {
        Tcl_ConditionWait(&tsdPtr->waitCV, &notifierMutex, timePtr);
    }
This is more defensive and could save us some headaches
down the road.
The fprintf debugging shows clearly that the notify event
actually never reached the waiter and I could not imagine
why, unless there is really a quirk in the libc.
Let me see if I can reproduce this on my machine.
davidw added on 2005-06-15 21:46:42:

Logged In: YES user_id=240

The reason why I think it's NPTL is that I can't reproduce the crash
on non-NPTL systems, for instance my PPC Linux box, or with the exact
same code when you set the environment variable:

export LD_ASSUME_KERNEL=2.4.1

which "turns off" NPTL. I wouldn't completely rule out a bug in the
kernel or libc6, but before we go that route, let's see if we can
isolate it further so they don't blame it on us.

Here's another data point. I added some debugging fprintf's around the
pthread_cond_wait and broadcast calls (*68 is the notifier thread, *80
is the main thread):

pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_timedwait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
timedwait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_timedwait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
timedwait returned
pthread_cond_timedwait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
timedwait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
*HANGS*

So it never wakes up from its cond_wait, even though the broadcast
happened, supposedly...

vasiljevic added on 2005-06-15 21:28:56:

Logged In: YES user_id=95086

Good (or bad, depending on the perspective) :-) Thanks to you, we have
at least isolated Tk out of the picture. This makes things simpler
(less complex code to dig through...)

So, as you say, this is an NPTL thing? I do have a RH (Enterprise)
Dell box here, so I assume I could give it a test run and observe
(hopefully) the same behaviour...

davidw added on 2005-06-15 21:20:43:

Logged In: YES
user_id=240
The suggested Tcl-only scripts do not *seem* to cause a hang, but of
course, with something that in other situations has run for hours
without hanging, we can only ever prove that it does hang, not that it
doesn't.
This Tcl only script does cause a hang rather quickly (with
strace):
set fl [open /dev/zero]
proc afterburner {} {after 10 afterburner}
proc screenupdate num {
if {[incr num] % 100 == 0} {puts tick-$num}
after 10 screenupdate $num
}
proc readzero {chan} {
}
fconfigure $fl -blocking 0
fileevent $fl readable [list readzero $fl]
afterburner
screenupdate 1;
vwait forever
Stack traces are the same.
Local variables:
(gdb) up
#2 0x400ecfcf in Tcl_WaitForEvent (timePtr=0x804c1a8) at
tclUnixNotfy.c:774
774             Tcl_ConditionWait(&tsdPtr->waitCV, &notifierMutex, myTimePtr);
(gdb) info locals
filePtr = (FileHandler *) 0x0
fileEvPtr = (FileHandlerEvent *) 0x0
mask = 134541992
myTime = {sec = 0, usec = 0}
waitForFiles = 1
myTimePtr = (Tcl_Time *) 0x0
tsdPtr = (ThreadSpecificData *) 0x804eba8
(gdb) thread 2
[Switching to thread 2 (Thread 1085053872 (LWP 19025))]#0
0x4021ea27 in select
() from /lib/tls/libc.so.6
(gdb) up
#1 0x400ed563 in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:993
993 if (select(numFdBits, &readableMask,
&writableMask, &exceptionalMask,
(gdb) info locals
tsdPtr = (ThreadSpecificData *) 0x0
readableMask = {__fds_bits = {8, 0 <repeats 31 times>}}
writableMask = {__fds_bits = {0 <repeats 32 times>}}
exceptionalMask = {__fds_bits = {0 <repeats 32 times>}}
fds = {3, 4}
i = -1
status = 2049
numFdBits = 6
receivePipe = 3
found = 1
poll = {tv_sec = 0, tv_usec = 0}
timePtr = (struct timeval *) 0x0
buf = "\000"
Hrm...
dkf added on 2005-06-15 19:51:58:

Logged In: YES user_id=79902

What are the contents of all the local variables (at the Tcl-library
level, of course) in both threads when they deadlock?

vasiljevic added on 2005-06-15 19:51:15:

Logged In: YES user_id=95086

Yes. I assumed David would understand the *intention* and I just
sketched the thing briefly... The main intention is to filter out the
noise Tk might bring in.

dkf added on 2005-06-15 19:46:31:

Logged In: YES
user_id=79902
I think Zoran means the following script... ;-)
proc afterburner {} {after 10 afterburner}
proc screenupdate num {
if {[incr num] % 100 == 0} {puts tick-$num}
after 10 screenupdate $num
}
afterburner;screenupdate 1; vwait forever
vasiljevic added on 2005-06-15 19:27:52: Logged In: YES
user_id=95086
OK. If I understand this well, you could/should have been able
to reproduce such behaviour even without Tk, by just using the
plain Tcl shell:
proc afterburner {} {
after 10 afterburner
}
proc screenupdate {num} {
incr num
    if {$num % 100 == 0} {
        puts "tick-$num"
    }
after 10 [list screenupdate $num]
}
The above should emit "tick-100", "tick-200", and so on, about once each second.
Can you try this on your machine (maybe employing the strace trick)
and see if this locks-up as well?
davidw added on 2005-06-15 19:05:28: Logged In: YES
user_id=240
It's the same thing:
Thread 2 (Thread 1084599216 (LWP 19682)):
#0 0x401f6a27 in select () from /lib/tls/libc.so.6
#1 0x400c7f9b in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:976
#2 0x400f8b63 in start_thread () from /lib/tls/libpthread.so.0
#3 0x401fd18a in clone () from /lib/tls/libc.so.6
Thread 1 (Thread 1076207744 (LWP 19681)):
#0 0x400fb295 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
#1 0x400c6e15 in Tcl_ConditionWait (condPtr=0x8052558,
mutexPtr=0x400e25c0,
timePtr=0x0) at tclUnixThrd.c:725
#2 0x400c7a07 in Tcl_WaitForEvent (timePtr=0x0) at
tclUnixNotfy.c:755
#3 0x4009df35 in Tcl_DoOneEvent (flags=-3) at
../generic/tclNotify.c:924
#4 0x40accef4 in Tk_MainLoop ()
at /home/davidw/workshop/stabletcl/tk/generic/tkEvent.c:1501
#5 0x40098e5d in Tcl_Main (argc=1, argv=0xbffff9b8,
appInitProc=0x8048746 <Tcl_AppInit>) at
../generic/tclMain.c:485
#6 0x0804873c in main (argc=2, argv=0xbffff9b4) at
tclAppInit.c:90
#0 0x400fb295 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
Here are the final syscalls of both threads too, just for fun:
Main thread:
14:03:46.010856 write(4, "\0", 1) = 1
14:03:46.011039 futex(0x80f7238, FUTEX_WAIT, 10186, NULL) = 0
14:03:46.011719 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.011967 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.012093 gettimeofday({1118837026, 12164},
{4294967176, 0}) = 0
14:03:46.012262 gettimeofday({1118837026, 12292},
{4294967176, 0}) = 0
14:03:46.012326 write(4, "\0", 1) = 1
14:03:46.012384 gettimeofday({1118837026, 12402},
{4294967176, 0}) = 0
14:03:46.012431 clock_gettime(CLOCK_REALTIME, {1118837026,
12452000}) = 0
14:03:46.012477 futex(0x80f7238, FUTEX_WAIT, 10187, {0,
607000}) = -1 ETIMEDOUT (Connection timed out)
14:03:46.014276 write(4, "\0", 1) = 1
14:03:46.014503 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.014721 gettimeofday({1118837026, 14897},
{4294967176, 0}) = 0
14:03:46.014989 gettimeofday({1118837026, 15053},
{4294967176, 0}) = 0
14:03:46.015325 gettimeofday({1118837026, 15497},
{4294967176, 0}) = 0
14:03:46.015636 gettimeofday({1118837026, 15721},
{4294967176, 0}) = 0
14:03:46.015817 write(5,
"\2\30\4\0\16\0\0\5\2\0\0\0\354\354\354\0", 16) = 16
14:03:46.016247 write(4, "\0", 1) = 1
14:03:46.016423 futex(0x80f7238, FUTEX_WAIT, 10188, NULL) = 0
14:03:46.016786 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.016845 gettimeofday({1118837026, 16864},
{4294967176, 0}) = 0
14:03:46.016959 write(5,
"5\30\4\0\17\0\0\5\16\0\0\0058\0\34\0F\216\5\0\17\0\0\5"...,
236) = 236
14:03:46.017577 gettimeofday({1118837026, 17616},
{4294967176, 0}) = 0
14:03:46.017659 write(4, "\0", 1) = 1
14:03:46.043395 gettimeofday({1118837026, 73560},
{4294967176, 0}) = 0
14:03:46.080086 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.084923 futex(0x80f7228, FUTEX_WAKE, 1) = 1
14:03:46.085127 clock_gettime(CLOCK_REALTIME, {1118837026,
85253000}) = 0
14:03:46.085655 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.085793 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.085924 gettimeofday({1118837026, 85983},
{4294967176, 0}) = 0
14:03:46.086069 ioctl(5, FIONREAD, [32]) = 0
14:03:46.086240 read(5,
"\16\0\344\217\16\0\0\5\0\0>\0\0\0\0\0\0\0\0\0\17\0\0\5"...,
32) = 32
14:03:46.086551 gettimeofday({1118837026, 86643},
{4294967176, 0}) = 0
14:03:46.086903 gettimeofday({1118837026, 87058},
{4294967176, 0}) = 0
14:03:46.087353 gettimeofday({1118837026, 87514},
{4294967176, 0}) = 0
14:03:46.087658 write(5,
"\2\30\4\0\16\0\0\5\2\0\0\0\354\354\354\0", 16) = 16
14:03:46.088171 write(4, "\0", 1) = 1
14:03:46.088569 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.089297 futex(0x80f7228, FUTEX_WAKE, 1) = 1
14:03:46.089604 futex(0x80f7238, FUTEX_WAIT, 10191, NULL
<unfinished ...>
Notifier thread:
14:03:46.003650 read(3, "\0", 1) = 1
14:03:46.003698 select(6, [3 5], [], [], {0, 0}) = 0 (Timeout)
14:03:46.003802 futex(0x80f7228, FUTEX_WAIT, 2, NULL) = 0
14:03:46.003931 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.003969 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 0
14:03:46.004010 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.005133 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.005647 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.005779 read(3, "\0", 1) = 1
14:03:46.005916 select(6, [3 5], [], [], NULL) = 1 (in [5])
14:03:46.006136 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.006446 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.007490 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.008408 read(3, "\0", 1) = 1
14:03:46.009190 select(6, [3 5], [], [], NULL) = 1 (in [5])
14:03:46.009451 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.009737 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.010384 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.011252 read(3, "\0", 1) = 1
14:03:46.011397 select(6, [3 5], [], [], {0, 0}) = 0 (Timeout)
14:03:46.011588 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.011846 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.012547 select(6, [3 5], [], [], NULL) = 1 (in [3])
14:03:46.012698 read(3, "\0", 1) = 1
14:03:46.012744 select(6, [3 5], [], [], NULL) = 1 (in [3])
14:03:46.014541 read(3, "\0", 1) = 1
14:03:46.014761 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.016452 read(3, "\0", 1) = 1
14:03:46.016511 select(6, [3 5], [], [], {0, 0}) = 0 (Timeout)
14:03:46.016637 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.016699 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.073447 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.080279 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.080327 read(3, "\0", 1) = 1
14:03:46.080397 select(6, [3 5], [], [], NULL) = 1 (in [5])
14:03:46.080545 futex(0x80f7228, FUTEX_WAIT, 2, NULL) = 0
14:03:46.085179 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.085319 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 0
14:03:46.085444 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.088418 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.088697 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.088813 read(3, "\0", 1) = 1
14:03:46.088945 select(6, [3 5], [], [], {0, 0}) = 1 (in
[5], left {0, 0})
14:03:46.089182 futex(0x80f7228, FUTEX_WAIT, 2, NULL) = 0
14:03:46.089390 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.089435 select(6, [3], [], [], NULL <unfinished ...>
Seems to be pretty much the same as HEAD.
vasiljevic added on 2005-06-15 18:41:30:

Logged In: YES user_id=95086

Try checking out the core-8-4-branch, recompile with threads/symbols,
and give it another try (supplying the stack traces of all threads at
the time of the hang).

davidw added on 2005-06-15 16:57:42:

Logged In: YES
user_id=240
Here you are:
Thread 2 (Thread 1085053872 (LWP 14698)):
#0 0x4021ea27 in select () from /lib/tls/libc.so.6
#1 0x400ed563 in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:993
#2 0x40120b63 in start_thread () from /lib/tls/libpthread.so.0
#3 0x4022518a in clone () from /lib/tls/libc.so.6
Thread 1 (Thread 1076371584 (LWP 14697)):
#0 0x40123295 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
#1 0x400ec3cb in Tcl_ConditionWait (condPtr=0x804eec0,
mutexPtr=0x4010ae5c,
timePtr=0x0) at tclUnixThrd.c:769
#2 0x400ecfcf in Tcl_WaitForEvent (timePtr=0x804c1a8) at
tclUnixNotfy.c:774
#3 0x400b14ac in Tcl_DoOneEvent (flags=-3) at
../generic/tclNotify.c:930
#4 0x40af6a6b in Tk_MainLoop ()
at /home/davidw/workshop/tk/generic/tkEvent.c:2019
#5 0x400a80cd in Tcl_Main (argc=1, argv=0xbffff9b8,
appInitProc=0x8048746 <Tcl_AppInit>) at
../generic/tclMain.c:636
#6 0x0804873c in main (argc=2, argv=0xbffff9b4) at
tclAppInit.c:87
#0 0x40123295 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
Interesting enough:
(gdb) thread 2
[Switching to thread 2 (Thread 1085053872 (LWP 14698))]#0
0x4021ea27 in select
() from /lib/tls/libc.so.6
(gdb) up
#1 0x400ed563 in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:993
993 if (select(numFdBits, &readableMask,
&writableMask, &exceptionalMask,
(gdb) print timePtr
$1 = (struct timeval *) 0x0
So basically, *both* the select and condition have a null
time parameter and block indefinitely.
By the way, to repeat this, I had to be a bit sneaky:
1. Start the tclsh.
2. Attach an strace to it, which seems to slow it down, or do whatever
   else seems to trigger the bug more easily than just waiting for it
   to happen naturally.
3. Detach the strace and attach gdb.
The thing that really jumps out at me is this:
#1 0x400ec3cb in Tcl_ConditionWait (condPtr=0x804eec0,
mutexPtr=0x4010ae5c, timePtr=0x0) at tclUnixThrd.c:769
#2 0x400ecfcf in Tcl_WaitForEvent (timePtr=0x804c1a8) at
tclUnixNotfy.c:774
The timePtr goes to NULL! That's a recipe to get into the
bad condition where both threads wait forever. Why's that
happening.... Looks like this code was fiddled recently.
I'll look at it some more.
vasiljevic added on 2005-06-15 15:58:53:

Logged In: YES user_id=95086

When you run into this, can you please attach gdb to the hung process
and provide:

a. info threads
b. for each thread, the stack backtrace (bt)

When using NPTL, there is only one process in the process list, not
one process per thread (as with LinuxThreads). So just fire up gdb and
say: attach <pid>

You'd need to compile the code with --enable-symbols to get any
meaningful info out of there, of course. Thanks.

davidw added on 2005-06-15 05:52:47:

Logged In: YES user_id=240

export LD_ASSUME_KERNEL=2.4.1 seems to make it not use the NPTL thread
system, reverting to an older but apparently more stable way of doing
things.

davidw added on 2005-06-15 04:53:47:

Logged In: YES user_id=240

Another data point: it seems that x86 Linux uses futexes, whereas PPC
uses another system...

davidw added on 2005-06-15 04:30:07:

Logged In: YES user_id=240

My PPC machine, running Debian stable, does *not* seem to exhibit this
problem, or at least not as readily as x86 systems do. That doesn't
mean it doesn't exist here, but if it really isn't present here,
perhaps that's indicative of a problem that in reality belongs to the
kernel or libc...

davidw added on 2005-06-15 02:54:48:

Logged In: YES user_id=240

Here is some info:

% info patch
8.4.9

Running on Debian stable.

(davidw) 51 % puts $tk_patchLevel
8.4.9

More info as needed.