Tcl Source Code: View Ticket

Ticket UUID:	3613609
Title:	lsort -nocase does not sort non-ASCII correctly
Type:	Bug	Version:	current: 8.6.0
Submitter:	apnadkarni	Created on:	2013-05-20 17:42:15
Subsystem:	17. Commands I-L	Assigned To:	dkf
Priority:	5 Medium	Severity:
Status:	Closed	Last Modified:	2013-05-22 20:08:06
Resolution:	Fixed	Closed By:	dkf
		Closed on:	2013-05-22 13:08:06
Description:	From the chat I don't expect the following two commands to give the same answer [00:00]apn(wits) 93 % lsort [list \u101 \u100] Ā ā (wits) 94 % lsort -nocase [list \u101 \u100] Ā ā [00:01]apnWith -nocase, I would expect the second command to return ā Ā [00:02]yukonbob exits, stage left! [00:02]apnjust like [00:02]apn(wits) 96 % lsort [list a A] A a (wits) 97 % lsort -nocase [list a A] a A [00:04]dgpfile the bug The issue boils down to MergeLists using strcasecmp which does not handle non-ASCII characters.
User Comments:	dkf added on 2013-05-22 20:08:06: allow_comments - 1 All sorted. Thanks for your help. nijtmans added on 2013-05-22 20:03:02: Thanks! I couldn't have done it better! dkf added on 2013-05-22 20:01:28: Fixed in 8.5 branch; rolling forward into trunk. dkf added on 2013-05-22 19:52:17: I'm just testing the fix nijtmans added on 2013-05-22 19:42:57: ... but if you have a suggestion how to fix this, that's great too. nijtmans added on 2013-05-22 19:40:16: Good catch! I'll have a look. dkf added on 2013-05-22 19:37:22: Found a problem in TclUtfCasecmp(); I get the wrong result from this: lsort -ascii -nocase [list a\u0000a a] The problem comes when two strings are a different length but the character in the longer at the index of the length of the shorter is \u0000; we're saying the wrong string is shorter (or that the two strings are the same). (Evil edge cases 'Я' us!) dkf added on 2013-05-21 19:41:27: Changes to implementation look OK at first glance. Will need to check test suite for whether that's also got enough tests. nijtmans added on 2013-05-21 16:39:03: Slight improvement: if cs = "\xC0\x80" and ct = "\x00", loop would continue after NUL-byte, this should not happen. See: <http://core.tcl.tk/tcl/info/a765f37f78> nijtmans added on 2013-05-21 16:28:28: Proposed solution here: <http://core.tcl.tk/tcl/info/66c30c4369> @dkf, please evaluate dkf added on 2013-05-21 16:22:31: [string compare] uses Tcl_UtfNcasecmp, which does the right thing (well, in 8.6 it does and I've not checked earlier versions). apnadkarni added on 2013-05-21 09:59:01: I wrote a replacement stub along the lines of char a, b; int alen, blen, len; int comparison; a = Tcl_GetStringFromObj(oaP, &alen); b = Tcl_GetStringFromObj(obP, &blen); /* TBD - maybe check first letter before calling ? / // This is much slower than the strcmp method alen = Tcl_NumUtfChars(a, alen); / Num bytes -> num chars / blen = Tcl_NumUtfChars(b, blen); / Num bytes -> num chars / len = alen < blen ? alen : blen; / len is the shorter length */ comparison = (ignorecase ? Tcl_UtfNcasecmp : Tcl_UtfNcmp)(a, b, len); if (comparison == 0) { comparison = alen-blen; } return (comparison > 0) ? 1 : ((comparison < 0) ? -1 : 0); This seems to work correctly (for the single test case!). However, by my measurements, it is almost "3 times slower" than the current strcmp based code. I don't know if that can be fixed. And if it has not bothered anyone else for so many years, well... Failure in the lsearch case is quite likely more important (comparing paths on Windows for instance). string compare seems to do the right thing /Ashok dkf added on 2013-05-21 03:04:24: Also, [lsearch] has an equivalent problem. dkf added on 2013-05-21 03:03:30: Line 4236 of tclCmdIL.c on the trunk has strcasecmp() in use, and it seems that at least some versions of that function don't handle non-ASCII right. (What a surprise, NOT!) There's a Tcl_UtfNcasecmp() but there's no Tcl_UtfCasecmp()…