2005-06-21

  20:02  • Closed ticket [1220692fff]: tk locks up plus 6 other changes  artifact: 0c714c90da  user: davidw

2005-06-16

  05:24  • Ticket [1220692fff]: 4 changes  artifact: 9dfc2acf77  user: dkf
  05:06  • Ticket [1220692fff]: 4 changes  artifact: 9dca40a99b  user: davidw
  00:19  • Ticket [1220692fff]: 4 changes  artifact: f8c86e3691  user: vasiljevic
  00:09  • Ticket [1220692fff]: 4 changes  artifact: b88361bb45  user: davidw

2005-06-15

  22:19  • Ticket [1220692fff]: 4 changes  artifact: d9aa558d8c  user: vasiljevic
  21:46  • Ticket [1220692fff]: 4 changes  artifact: 42e0cacc05  user: davidw
  21:28  • Ticket [1220692fff]: 4 changes  artifact: 8869891b08  user: vasiljevic
  21:20  • Ticket [1220692fff]: 4 changes  artifact: 58a1b574ff  user: davidw
  19:51  • Ticket [1220692fff]: 4 changes  artifact: bb58b51a81  user: dkf
  19:51  • Ticket [1220692fff]: 4 changes  artifact: 4410bed9dc  user: vasiljevic
  19:46  • Ticket [1220692fff]: 4 changes  artifact: 28dc65b01c  user: dkf
  19:27  • Ticket [1220692fff]: 4 changes  artifact: 6017323e36  user: vasiljevic
  19:05  • Ticket [1220692fff]: 4 changes  artifact: ff88560801  user: davidw
  18:41  • Ticket [1220692fff]: 4 changes  artifact: a6c3cd1cb2  user: vasiljevic
  16:57  • Ticket [1220692fff]: 4 changes  artifact: e2f8596f72  user: davidw
  15:58  • Ticket [1220692fff]: 4 changes  artifact: 91ae732aea  user: vasiljevic
  05:52  • Ticket [1220692fff]: 4 changes  artifact: 5cfc278b77  user: davidw
  04:53  • Ticket [1220692fff]: 4 changes  artifact: 8b0c9215db  user: davidw
  04:30  • Ticket [1220692fff]: 4 changes  artifact: b7eaf968fb  user: davidw
  02:54  • Ticket [1220692fff]: 4 changes  artifact: ebd6901829  user: davidw

2005-06-14

  19:49  • New ticket [1220692fff].  artifact: 3bba6c1dc2  user: davidw
Ticket UUID:  1220692
Title:        tk locks up
Type:         Bug               Version:        None
Submitter:    davidw            Created on:     2005-06-14 19:49:29
Subsystem:    70. Event Loop    Assigned To:    hobbs
Priority:     5 Medium          Severity:
Status:       Closed            Last Modified:  2005-06-21 20:02:25
Resolution:   None              Closed By:      davidw
Closed on:    2005-06-21 13:02:25

Description:
This script, when run under strace, locks up Tk on Linux:
# loop.tcl
# Run like so:
# strace tclsh loop.tcl
package require Tk
proc screenupdate {num} {
incr num
.foo configure -text $num
after 10 [list screenupdate $num]
}
button .foo
pack .foo
screenupdate 0
-----
Or better: it seems to reproduce the bug I see in a script that locks
up even without strace. The following script will lock it up without
strace. OK, so it's kind of evil, but still, it seems to be tickling a
race condition or something. Note that it's not just slow; it's locked
up and not doing anything.
# loop.tcl
# Run like so:
# strace tclsh loop.tcl
package require Tk
proc afterburner {} {
after 10 afterburner
}
proc screenupdate {num} {
incr num
.foo configure -text $num
after 10 [list screenupdate $num]
}
button .foo
pack .foo
screenupdate 0
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
afterburner
-------------
User Comments:
davidw added on 2005-06-21 20:02:25:

Logged In: YES user_id=240

I'm going to go ahead and close this. I'm convinced the problem lies
with Debian's libc. If I discover that not to be the case, we can
always reopen this bug.

dkf added on 2005-06-16 05:24:11:

Logged In: YES user_id=79902

Reviewing the issues you've found, it does indeed look like glibc
2.3.2 is subtly bust in the threading department. This might explain
some bizarre hangs of production Java apps on some of our systems. :-(
Some of the comments in that bugzilla report didn't exactly raise the
spirits either, but then I've long suspected that glibc was not
maintained to the quality levels I'm accustomed to with Tcl...

davidw added on 2005-06-16 05:06:54:

Logged In: YES
user_id=240
Let's compare libc versions. Debian sarge has:
Version: 2.3.2.ds1-22
The last entry in /usr/share/doc/libc6/changelog.gz is dated:
2003-09-19 Ulrich Drepper <drepper@redhat.com>
* elf/Makefile ($(objpfx)librtld.os): Create link
map also for
final build to ease quality control.
So to me it looks like more and more things point to libc6.
I've started a thread on the Debian list - once we have
confirmation there, we can close this bug.
http://lists.debian.org/debian-glibc/2005/06/msg00134.html
vasiljevic added on 2005-06-16 00:19:53: Logged In: YES
user_id=95086
Hm....
This is the test system (SuSE Linux on a Dell 2.4GHz):
zoran@zvlinux:~/sf/core-8-4-branch/unix> uname -a
Linux zvlinux 2.6.4-52-default #1 Wed Apr 7 02:08:30 UTC 2004 i686
i686 i386 GNU/Linux
zoran@zvlinux:~/sf/core-8-4-branch/unix> strace -o dummy ./tclsh
% array get tcl_platform
osVersion 2.6.4-52-default byteOrder littleEndian threaded 1 machine
i686 platform unix os Linux user zoran wordSize 4
% set tcl_patchLevel
8.4.10
This is the test program which exercises timer events,
file events and inter-thread communication events
(using Tcl thread package and thread::broadcast command)
package req Thread
for {set i 0} {$i < 16} {incr i} {
thread::create {
set tack 0
set fl [open /dev/zero]
proc afterburner {} {
if {($::tack % 1000) == 0} {
puts AB:tack([thread::id])-$::tack
}
after 1 afterburner
}
proc screenupdate num {
if {[incr num] % 1000 == 0} {
puts SU:tick([thread::id])-$num
}
thread::broadcast "set ::tack $num"
after 1 screenupdate $num
}
proc readzero {chan} {
if {($::tack % 1000) == 0} {
puts RZ:tack([thread::id])-$::tack
}
after 1
}
fconfigure $fl -blocking 0
fileevent $fl readable [list readzero $fl]
afterburner
screenupdate 1
thread::wait
}
}
If I start this, it will create 16 threads which will chew up lots of CPU
and exercise the notifier *a lot*. This runs happily on my system:
% thread::names
tid0x424b1bb0 tid0x422b0bb0 tid0x420afbb0 tid0x41eaebb0
tid0x41cadbb0 tid0x41aacbb0 tid0x418abbb0 tid0x416aabb0
tid0x414a9bb0 tid0x412a8bb0 tid0x410a7bb0 tid0x40ea6bb0
tid0x40ca5bb0 tid0x40aa4bb0 tid0x408a3bb0 tid0x406a2bb0
tid0x402392a0
SU:tick(tid0x424b1bb0)-124000
AB:tack(tid0x418abbb0)-124000
SU:tick(tid0x408a3bb0)-124000
RZ:tack(tid0x424b1bb0)-124000
RZ:tack(tid0x418abbb0)-124000
RZ:tack(tid0x40ca5bb0)-124000
RZ:tack(tid0x412a8bb0)-124000
RZ:tack(tid0x40ea6bb0)-124000
AB:tack(tid0x412a8bb0)-124000
AB:tack(tid0x40ea6bb0)-124000
AB:tack(tid0x424b1bb0)-124000
SU:tick(tid0x40aa4bb0)-124000
SU:tick(tid0x412a8bb0)-124000
AB:tack(tid0x40ca5bb0)-124000
SU:tick(tid0x416aabb0)-124000
SU:tick(tid0x40ea6bb0)-124000
SU:tick(tid0x41aacbb0)-124000
SU:tick(tid0x422b0bb0)-124000
SU:tick(tid0x41eaebb0)-124000
AB:tack(tid0x416aabb0)-124000
SU:tick(tid0x414a9bb0)-125000
RZ:tack(tid0x408a3bb0)-125000
RZ:tack(tid0x420afbb0)-125000
SU:tick(tid0x418abbb0)-124000
SU:tick(tid0x40ca5bb0)-125000
RZ:tack(tid0x41cadbb0)-125000
RZ:tack(tid0x414a9bb0)-125000
RZ:tack(tid0x422b0bb0)-125000
RZ:tack(tid0x406a2bb0)-125000
RZ:tack(tid0x41eaebb0)-125000
RZ:tack(tid0x40aa4bb0)-125000
(and so on and so on...)
You can observe the alternating thread-ID's and counters.
NOTE: it runs under "strace", swamping my disk and chewing
my CPU:
zoran@zvlinux:/usr/local> ps -efl | grep tclsh
0 S zoran 5979 4125 0 76 0 - 394 wait4 19:04 pts/40 00:00:00
strace -o dummy ./tclsh
0 S zoran 5980 5979 0 75 0 - 48855 schedu 19:04 pts/40
00:00:00 ./tclsh
top - 19:11:03 up 42 min, 2 users, load average: 11.08, 7.83, 4.16
I cannot conclude anything but that this might be
some weird Linux issue on your config. Can you
verify this on some other Linux distribution or
perhaps other processor?
I gave it a try under Solaris/Sparc (2.8) and Darwin/PPC (10.4)
and it also works as expected.
(Confused) Zoran
davidw added on 2005-06-16 00:09:57:

Logged In: YES user_id=240

This bug looks an awful lot like this one:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=108631
although their provided code doesn't hang... I've got an email in to
the Debian libc maintainers to see if they have anything to say.

vasiljevic added on 2005-06-15 22:19:04:

Logged In: YES
user_id=95086
Most interesting....
I do not believe this has anything to do with the problem, but the
code around line 755 in unix/tclUnixNotfy.c would be better written as:
    while (!tsdPtr->eventReady) {
        Tcl_ConditionWait(&tsdPtr->waitCV, &notifierMutex, timePtr);
    }

instead of

    if (!tsdPtr->eventReady) {
        Tcl_ConditionWait(&tsdPtr->waitCV, &notifierMutex, timePtr);
    }
This is more defensive and could save us some headaches
down the road.
The fprintf debugging shows clearly that the notify event
actually never reached the waiter and I could not imagine
why, unless there is really a quirk in the libc.
Let me see if I can reproduce this on my machine.
davidw added on 2005-06-15 21:46:42:

Logged In: YES user_id=240

The reason why I think it's NPTL is that I can't reproduce the crash
on non-NPTL systems, for instance my PPC Linux box, or with the exact
same code when you set the environment variable:

export LD_ASSUME_KERNEL=2.4.1

which "turns off" NPTL. I wouldn't completely rule out a bug in the
kernel or libc6, but before we go that route, let's see if we can
isolate it further so they don't blame it on us.

Here's another data point. I added some debugging fprintf's around the
pthread_cond_wait and broadcast calls (*68 is the notifier thread, *80
is the main thread):

pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_timedwait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
timedwait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
cond_wait returned
pthread_cond_timedwait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
timedwait returned
pthread_cond_timedwait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
timedwait returned
pthread_cond_wait: thread 1076375680 mutex 804c7d0
pthread_cond_broadcast: thread 1085057968 condptr 806bb00
*HANGS*

So it never wakes up from its cond_wait, even though the broadcast
happened, supposedly...

vasiljevic added on 2005-06-15 21:28:56:

Logged In: YES user_id=95086

Good (or bad, depending on the perspective) :-) Thanks to you, we have
at least isolated Tk out of the picture. This makes things simpler
(less complex code to dig through...)

So, as you say, this is an NPTL thing? I do have a RH (Enterprise)
Dell box here, so I assume I could give it a test run and observe
(hopefully) the same behaviour...

davidw added on 2005-06-15 21:20:43:

Logged In: YES
user_id=240
The suggested Tcl-only scripts do not *seem* to cause a hang, but of
course, with something that in other situations has run for hours
without hanging, we can only ever prove that it does hang, not that it
doesn't.
This Tcl only script does cause a hang rather quickly (with
strace):
set fl [open /dev/zero]
proc afterburner {} {after 10 afterburner}
proc screenupdate num {
if {[incr num] % 100 == 0} {puts tick-$num}
after 10 screenupdate $num
}
proc readzero {chan} {
}
fconfigure $fl -blocking 0
fileevent $fl readable [list readzero $fl]
afterburner
screenupdate 1;
vwait forever
Stack traces are the same.
Local variables:
(gdb) up
#2 0x400ecfcf in Tcl_WaitForEvent (timePtr=0x804c1a8) at
tclUnixNotfy.c:774
774             Tcl_ConditionWait(&tsdPtr->waitCV, &notifierMutex, myTimePtr);
(gdb) info locals
filePtr = (FileHandler *) 0x0
fileEvPtr = (FileHandlerEvent *) 0x0
mask = 134541992
myTime = {sec = 0, usec = 0}
waitForFiles = 1
myTimePtr = (Tcl_Time *) 0x0
tsdPtr = (ThreadSpecificData *) 0x804eba8
(gdb) thread 2
[Switching to thread 2 (Thread 1085053872 (LWP 19025))]#0
0x4021ea27 in select
() from /lib/tls/libc.so.6
(gdb) up
#1 0x400ed563 in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:993
993 if (select(numFdBits, &readableMask,
&writableMask, &exceptionalMask,
(gdb) info locals
tsdPtr = (ThreadSpecificData *) 0x0
readableMask = {__fds_bits = {8, 0 <repeats 31 times>}}
writableMask = {__fds_bits = {0 <repeats 32 times>}}
exceptionalMask = {__fds_bits = {0 <repeats 32 times>}}
fds = {3, 4}
i = -1
status = 2049
numFdBits = 6
receivePipe = 3
found = 1
poll = {tv_sec = 0, tv_usec = 0}
timePtr = (struct timeval *) 0x0
buf = "\000"
Hrm...
dkf added on 2005-06-15 19:51:58:

Logged In: YES user_id=79902

What are the contents of all the local variables (at the Tcl-library
level, of course) in both threads when they deadlock?

vasiljevic added on 2005-06-15 19:51:15:

Logged In: YES user_id=95086

Yes. I assumed David would understand the *intention* and I just
sketched the thing briefly... The main intention is to filter out the
noise Tk might bring in.

dkf added on 2005-06-15 19:46:31:

Logged In: YES
user_id=79902
I think Zoran means the following script... ;-)
proc afterburner {} {after 10 afterburner}
proc screenupdate num {
if {[incr num] % 100 == 0} {puts tick-$num}
after 10 screenupdate $num
}
afterburner;screenupdate 1; vwait forever
vasiljevic added on 2005-06-15 19:27:52: Logged In: YES
user_id=95086
OK. If I understand this well, you could/should have been able
to reproduce such behaviour even without Tk, by just using the
plain Tcl shell:
proc afterburner {} {
after 10 afterburner
}
proc screenupdate {num} {
incr num
    if {$num % 100 == 0} {
        puts "tick-$num"
    }
after 10 [list screenupdate $num]
}
The above should emit "tick-100", "tick-200", and so on, about once each second.
Can you try this on your machine (maybe employing the strace trick)
and see if this locks-up as well?
davidw added on 2005-06-15 19:05:28: Logged In: YES
user_id=240
It's the same thing:
Thread 2 (Thread 1084599216 (LWP 19682)):
#0 0x401f6a27 in select () from /lib/tls/libc.so.6
#1 0x400c7f9b in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:976
#2 0x400f8b63 in start_thread () from /lib/tls/libpthread.so.0
#3 0x401fd18a in clone () from /lib/tls/libc.so.6
Thread 1 (Thread 1076207744 (LWP 19681)):
#0 0x400fb295 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
#1 0x400c6e15 in Tcl_ConditionWait (condPtr=0x8052558,
mutexPtr=0x400e25c0,
timePtr=0x0) at tclUnixThrd.c:725
#2 0x400c7a07 in Tcl_WaitForEvent (timePtr=0x0) at
tclUnixNotfy.c:755
#3 0x4009df35 in Tcl_DoOneEvent (flags=-3) at
../generic/tclNotify.c:924
#4 0x40accef4 in Tk_MainLoop ()
at /home/davidw/workshop/stabletcl/tk/generic/tkEvent.c:1501
#5 0x40098e5d in Tcl_Main (argc=1, argv=0xbffff9b8,
appInitProc=0x8048746 <Tcl_AppInit>) at
../generic/tclMain.c:485
#6 0x0804873c in main (argc=2, argv=0xbffff9b4) at
tclAppInit.c:90
#0 0x400fb295 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
Here are the final syscalls of both threads too, just for fun:
Main thread:
14:03:46.010856 write(4, "\0", 1) = 1
14:03:46.011039 futex(0x80f7238, FUTEX_WAIT, 10186, NULL) = 0
14:03:46.011719 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.011967 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.012093 gettimeofday({1118837026, 12164},
{4294967176, 0}) = 0
14:03:46.012262 gettimeofday({1118837026, 12292},
{4294967176, 0}) = 0
14:03:46.012326 write(4, "\0", 1) = 1
14:03:46.012384 gettimeofday({1118837026, 12402},
{4294967176, 0}) = 0
14:03:46.012431 clock_gettime(CLOCK_REALTIME, {1118837026,
12452000}) = 0
14:03:46.012477 futex(0x80f7238, FUTEX_WAIT, 10187, {0,
607000}) = -1 ETIMEDOUT (Connection timed out)
14:03:46.014276 write(4, "\0", 1) = 1
14:03:46.014503 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.014721 gettimeofday({1118837026, 14897},
{4294967176, 0}) = 0
14:03:46.014989 gettimeofday({1118837026, 15053},
{4294967176, 0}) = 0
14:03:46.015325 gettimeofday({1118837026, 15497},
{4294967176, 0}) = 0
14:03:46.015636 gettimeofday({1118837026, 15721},
{4294967176, 0}) = 0
14:03:46.015817 write(5,
"\2\30\4\0\16\0\0\5\2\0\0\0\354\354\354\0", 16) = 16
14:03:46.016247 write(4, "\0", 1) = 1
14:03:46.016423 futex(0x80f7238, FUTEX_WAIT, 10188, NULL) = 0
14:03:46.016786 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.016845 gettimeofday({1118837026, 16864},
{4294967176, 0}) = 0
14:03:46.016959 write(5,
"5\30\4\0\17\0\0\5\16\0\0\0058\0\34\0F\216\5\0\17\0\0\5"...,
236) = 236
14:03:46.017577 gettimeofday({1118837026, 17616},
{4294967176, 0}) = 0
14:03:46.017659 write(4, "\0", 1) = 1
14:03:46.043395 gettimeofday({1118837026, 73560},
{4294967176, 0}) = 0
14:03:46.080086 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.084923 futex(0x80f7228, FUTEX_WAKE, 1) = 1
14:03:46.085127 clock_gettime(CLOCK_REALTIME, {1118837026,
85253000}) = 0
14:03:46.085655 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.085793 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.085924 gettimeofday({1118837026, 85983},
{4294967176, 0}) = 0
14:03:46.086069 ioctl(5, FIONREAD, [32]) = 0
14:03:46.086240 read(5,
"\16\0\344\217\16\0\0\5\0\0>\0\0\0\0\0\0\0\0\0\17\0\0\5"...,
32) = 32
14:03:46.086551 gettimeofday({1118837026, 86643},
{4294967176, 0}) = 0
14:03:46.086903 gettimeofday({1118837026, 87058},
{4294967176, 0}) = 0
14:03:46.087353 gettimeofday({1118837026, 87514},
{4294967176, 0}) = 0
14:03:46.087658 write(5,
"\2\30\4\0\16\0\0\5\2\0\0\0\354\354\354\0", 16) = 16
14:03:46.088171 write(4, "\0", 1) = 1
14:03:46.088569 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.089297 futex(0x80f7228, FUTEX_WAKE, 1) = 1
14:03:46.089604 futex(0x80f7238, FUTEX_WAIT, 10191, NULL
<unfinished ...>
Notifier thread:
14:03:46.003650 read(3, "\0", 1) = 1
14:03:46.003698 select(6, [3 5], [], [], {0, 0}) = 0 (Timeout)
14:03:46.003802 futex(0x80f7228, FUTEX_WAIT, 2, NULL) = 0
14:03:46.003931 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.003969 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 0
14:03:46.004010 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.005133 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.005647 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.005779 read(3, "\0", 1) = 1
14:03:46.005916 select(6, [3 5], [], [], NULL) = 1 (in [5])
14:03:46.006136 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.006446 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.007490 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.008408 read(3, "\0", 1) = 1
14:03:46.009190 select(6, [3 5], [], [], NULL) = 1 (in [5])
14:03:46.009451 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.009737 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.010384 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.011252 read(3, "\0", 1) = 1
14:03:46.011397 select(6, [3 5], [], [], {0, 0}) = 0 (Timeout)
14:03:46.011588 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.011846 futex(0x804a880, FUTEX_WAKE, 1) = 1
14:03:46.012547 select(6, [3 5], [], [], NULL) = 1 (in [3])
14:03:46.012698 read(3, "\0", 1) = 1
14:03:46.012744 select(6, [3 5], [], [], NULL) = 1 (in [3])
14:03:46.014541 read(3, "\0", 1) = 1
14:03:46.014761 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.016452 read(3, "\0", 1) = 1
14:03:46.016511 select(6, [3 5], [], [], {0, 0}) = 0 (Timeout)
14:03:46.016637 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 1
14:03:46.016699 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.073447 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.080279 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.080327 read(3, "\0", 1) = 1
14:03:46.080397 select(6, [3 5], [], [], NULL) = 1 (in [5])
14:03:46.080545 futex(0x80f7228, FUTEX_WAIT, 2, NULL) = 0
14:03:46.085179 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.085319 futex(0x80f7238, FUTEX_REQUEUE, 1,
2147483647, 0x804a880) = 0
14:03:46.085444 select(6, [3], [], [], NULL) = 1 (in [3])
14:03:46.088418 futex(0x804a880, FUTEX_WAIT, 2, NULL) = 0
14:03:46.088697 futex(0x804a880, FUTEX_WAKE, 1) = 0
14:03:46.088813 read(3, "\0", 1) = 1
14:03:46.088945 select(6, [3 5], [], [], {0, 0}) = 1 (in
[5], left {0, 0})
14:03:46.089182 futex(0x80f7228, FUTEX_WAIT, 2, NULL) = 0
14:03:46.089390 futex(0x80f7228, FUTEX_WAKE, 1) = 0
14:03:46.089435 select(6, [3], [], [], NULL <unfinished ...>
Seems to be pretty much the same as HEAD.
vasiljevic added on 2005-06-15 18:41:30:

Logged In: YES user_id=95086

Try checking out the core-8-4-branch, recompile with threads/symbols,
and give it another try (supplying the stack traces of all threads at
the time of the hang).

davidw added on 2005-06-15 16:57:42:

Logged In: YES
user_id=240
Here you are:
Thread 2 (Thread 1085053872 (LWP 14698)):
#0 0x4021ea27 in select () from /lib/tls/libc.so.6
#1 0x400ed563 in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:993
#2 0x40120b63 in start_thread () from /lib/tls/libpthread.so.0
#3 0x4022518a in clone () from /lib/tls/libc.so.6
Thread 1 (Thread 1076371584 (LWP 14697)):
#0 0x40123295 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
#1 0x400ec3cb in Tcl_ConditionWait (condPtr=0x804eec0,
mutexPtr=0x4010ae5c,
timePtr=0x0) at tclUnixThrd.c:769
#2 0x400ecfcf in Tcl_WaitForEvent (timePtr=0x804c1a8) at
tclUnixNotfy.c:774
#3 0x400b14ac in Tcl_DoOneEvent (flags=-3) at
../generic/tclNotify.c:930
#4 0x40af6a6b in Tk_MainLoop ()
at /home/davidw/workshop/tk/generic/tkEvent.c:2019
#5 0x400a80cd in Tcl_Main (argc=1, argv=0xbffff9b8,
appInitProc=0x8048746 <Tcl_AppInit>) at
../generic/tclMain.c:636
#6 0x0804873c in main (argc=2, argv=0xbffff9b4) at
tclAppInit.c:87
#0 0x40123295 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
Interesting enough:
(gdb) thread 2
[Switching to thread 2 (Thread 1085053872 (LWP 14698))]#0
0x4021ea27 in select
() from /lib/tls/libc.so.6
(gdb) up
#1 0x400ed563 in NotifierThreadProc (clientData=0x0) at
tclUnixNotfy.c:993
993 if (select(numFdBits, &readableMask,
&writableMask, &exceptionalMask,
(gdb) print timePtr
$1 = (struct timeval *) 0x0
So basically, *both* the select and condition have a null
time parameter and block indefinitely.
By the way, to repeat this, I had to be a bit sneaky:
1. Start the tclsh.
2. Attach an strace to it, which seems to slow it down, or do whatever
   else seems to trigger the bug more easily than just waiting for it
   to happen naturally.
3. Detach the strace and attach gdb.
The thing that really jumps out at me is this:
#1 0x400ec3cb in Tcl_ConditionWait (condPtr=0x804eec0,
mutexPtr=0x4010ae5c, timePtr=0x0) at tclUnixThrd.c:769
#2 0x400ecfcf in Tcl_WaitForEvent (timePtr=0x804c1a8) at
tclUnixNotfy.c:774
The timePtr goes to NULL! That's a recipe to get into the
bad condition where both threads wait forever. Why's that
happening.... Looks like this code was fiddled recently.
I'll look at it some more.
vasiljevic added on 2005-06-15 15:58:53:

Logged In: YES user_id=95086

When you run into this, can you please attach gdb to the hung process
and provide:

a. info threads
b. for each thread, the stack backtrace (bt)

When using NPTL, there is only one process in the process list, not
one process per thread (as with LinuxThreads). So just fire up gdb and
say: attach <pid>

You'd need to compile the code with --enable-symbols to get any
meaningful info out of there, of course. Thanks.

davidw added on 2005-06-15 05:52:47:

Logged In: YES user_id=240

export LD_ASSUME_KERNEL=2.4.1 seems to make it not use the NPTL thread
system, reverting to an older but apparently more stable way of doing
things.

davidw added on 2005-06-15 04:53:47:

Logged In: YES user_id=240

Another data point: it seems that x86 Linux uses futexes, whereas PPC
uses another system...

davidw added on 2005-06-15 04:30:07:

Logged In: YES user_id=240

My PPC machine, running Debian stable, does *not* seem to exhibit this
problem, or at least not as readily as x86 systems do. That doesn't
mean it doesn't exist here, but if it really isn't present here,
perhaps that's indicative of a problem that in reality belongs to the
kernel or libc...

davidw added on 2005-06-15 02:54:48:

Logged In: YES user_id=240

Here is some info:

% info patch
8.4.9

Running on Debian stable.

(davidw) 51 % puts $tk_patchLevel
8.4.9

More info as needed.