Fossil: Diff

Differences From Artifact [33df06781c]:

File src/lookslike.c, part of check-in [79341394e2] at 2014-04-25 08:38:56 on branch invalid-utf8: Add a commit warning when a to-be-committed file contains invalid UTF-8 byte sequences. See: [http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences]. This warning can be disabled by the "encoding-glob" setting. Implements determination of the LOOK_INVALID flag when text is otherwise assumed to be UTF-8 and adds test cases for it. (user: jan.nijtmans size: 15544)

To Artifact [16b5e3c23c]:

File src/lookslike.c, part of check-in [636da047cc] at 2014-04-25 15:03:36 on branch invalid-utf8: Fix handling of overlong UTF-8 forms: all overlong forms except 0xC0 0x80 (\u0000) are considered invalid. Run the same test cases as on trunk, which now contains various overlong UTF-8 sequences, as proof that everything is correct. (user: jan.nijtmans size: 15647)

︙			︙
Old lines 134-172, new lines 134-175:

        return flags;
      }

      /*
      ** Checks for proper UTF-8. It uses the method described in:
      **   http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
    - ** except for the "overlong form" which is not considered
    + ** except for the "overlong form" of \u0000 which is not considered
      ** invalid here: Some languages like Java and Tcl use it.
      **
      ** For UTF-8 characters > 7f, the variable 'c2' not necessary means
      ** the previous character. It's number of higher 1-bits indicate
      ** the number of continuation bytes that are expected to be followed.
      ** E.g. when 'c2' has a value in the range 0xc0..0xdf it means that
      ** 'c' is expected to contain the last continuation byte of a UTF-8
      ** character. A value 0xe0..0xef means that after 'c' one more
      ** continuation byte is expected.
      */

      int invalid_utf8(const Blob *pContent){
        const unsigned char *z = (unsigned char *) blob_buffer(pContent);
        unsigned int n = blob_size(pContent);
        unsigned char c, c2;

        if( n==0 ) return 0;  /* Empty file -> OK */
        c = *z;
        while( --n>0 ){
          c2 = c;
          c = *++z;
          if( c2>=0x80 ){
    -       if( (c2<0xC0) || (c2>=0xF8) || ((c&0xC0)!=0x80) ){
    -         return 1; /* Invalid UTF-8 */
    -       }
    -       c = (c2 >= 0xE0) ? (c2<<1) : ' ';
    +       if( (c2!=0xc0) || (c!=0x80) ){
    +         if( ((c2==0xf4) && (c>=0x90)) || (c2<0xc2) || (c2>0xf4)
    +             || ((c&0xc0)!=0x80) ){
    +           return 1; /* Invalid UTF-8 */
    +         }
    +       }
    +       c = (c2 >= 0xe0) ? (c2<<1)+1 : ' ';
          }
        }
        return c>=0x80; /* Last byte must be ASCII. */
      }

      /*
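To make the new-side check easier to experiment with, here is a minimal standalone sketch of the same logic. It assumes a raw buffer and length in place of Fossil's `Blob`; the names `invalid_utf8_buf` and `check_examples` are hypothetical and do not appear in the source. The asserted results were traced by hand against the logic shown in the diff, not taken from Fossil's test suite.

```c
#include <assert.h>
#include <stddef.h>

/*
** Standalone adaptation of the check on the new side of the diff.
** ASSUMPTION: operates on a raw buffer instead of a Fossil Blob.
** Returns 1 if the buffer is judged invalid, 0 otherwise.
*/
int invalid_utf8_buf(const unsigned char *z, size_t n){
  unsigned char c, c2;

  if( n==0 ) return 0;                /* Empty input -> OK */
  c = *z;
  while( --n>0 ){
    c2 = c;
    c = *++z;
    if( c2>=0x80 ){
      /* 0xC0 0x80 (overlong \u0000) is the one tolerated overlong form */
      if( (c2!=0xc0) || (c!=0x80) ){
        if( ((c2==0xf4) && (c>=0x90)) /* would exceed U+10FFFF */
         || (c2<0xc2)                 /* stray continuation or 0xC0/0xC1 */
         || (c2>0xf4)                 /* lead byte beyond 4-byte range */
         || ((c&0xc0)!=0x80) ){       /* next byte is not a continuation */
          return 1;
        }
      }
      /* High bits of c now encode how many continuation bytes remain */
      c = (c2>=0xe0) ? (unsigned char)((c2<<1)+1) : ' ';
    }
  }
  return c>=0x80;                     /* Last byte must be ASCII */
}

/* A few inputs traced by hand against the function above. */
void check_examples(void){
  static const unsigned char ascii[]   = { 'o', 'k' };
  static const unsigned char e_acute[] = { 0xc3, 0xa9 };       /* U+00E9 */
  static const unsigned char euro[]    = { 0xe2, 0x82, 0xac }; /* U+20AC */
  static const unsigned char nul_ol[]  = { 0xc0, 0x80 };       /* tolerated */
  static const unsigned char c1_ol[]   = { 0xc1, 0xbf };       /* overlong */
  static const unsigned char too_big[] = { 0xf4, 0x90, 0x80, 0x80 };
  static const unsigned char cut[]     = { 'a', 0xc3 };        /* truncated */

  assert( invalid_utf8_buf(ascii, sizeof ascii)==0 );
  assert( invalid_utf8_buf(e_acute, sizeof e_acute)==0 );
  assert( invalid_utf8_buf(euro, sizeof euro)==0 );
  assert( invalid_utf8_buf(nul_ol, sizeof nul_ol)==0 );
  assert( invalid_utf8_buf(c1_ol, sizeof c1_ol)==1 );
  assert( invalid_utf8_buf(too_big, sizeof too_big)==1 );
  assert( invalid_utf8_buf(cut, sizeof cut)==1 );
}
```

Calling `check_examples()` from a `main` exercises these cases. By the same hand trace, a 0xE0 lead byte is carried forward as 0xC1, which the `c2<0xc2` test rejects, so this sketch flags a sequence such as E0 A0 80 as invalid; the diff does not say whether that is intended.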
︙			︙
Old lines 188-202, new lines 191-205:

      #define UTF16_LENGTH_MASK_SZ (LENGTH_MASK_SZ-(sizeof(WCHAR_T)-sizeof(char)))
      #define UTF16_LENGTH_MASK ((1<<UTF16_LENGTH_MASK_SZ)-1)

      /*
      ** This macro is used to swap the byte order of a UTF-16 character in the
      ** looks_like_utf16() function.
      */
    - #define UTF16_SWAP(ch) ((((ch) << 8) & 0xFF00) | (((ch) >> 8) & 0xFF))
    + #define UTF16_SWAP(ch) ((((ch) << 8) & 0xff00) | (((ch) >> 8) & 0xff))
      #define UTF16_SWAP_IF(expr,ch) ((expr) ? UTF16_SWAP((ch)) : (ch))

      /*
      ** This function attempts to scan each logical line within the blob to
      ** determine the type of content it appears to contain. The return value
      ** is a combination of one or more of the LOOK_XXX flags (see above):
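The change to `UTF16_SWAP` only lowercases the hex constants, so behavior is unchanged. As a quick illustration of what the two macros compute, here is a small sketch; `normalize_unit` is a hypothetical helper that is not part of the source.

```c
/* Macros as shown on the new side of the diff. */
#define UTF16_SWAP(ch)         ((((ch) << 8) & 0xff00) | (((ch) >> 8) & 0xff))
#define UTF16_SWAP_IF(expr,ch) ((expr) ? UTF16_SWAP((ch)) : (ch))

/*
** Hypothetical helper (an assumption for illustration): return a 16-bit
** code unit in host order, swapping its two bytes only when needs_swap
** is non-zero.
*/
unsigned int normalize_unit(unsigned int unit, int needs_swap){
  return UTF16_SWAP_IF(needs_swap, unit);
}
```

For example, `UTF16_SWAP(0xfffe)` evaluates to `0xfeff`, and `normalize_unit(0x4100, 1)` yields `0x0041` (the letter 'A' read with the opposite byte order).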
︙			︙
Old lines 292-306, new lines 295-309:

      /*
      ** This function returns an array of bytes representing the byte-order-mark
      ** for UTF-8.
      */
      const unsigned char *get_utf8_bom(int *pnByte){
        static const unsigned char bom[] = {
    -     0xEF, 0xBB, 0xBF, 0x00, 0x00, 0x00
    +     0xef, 0xbb, 0xbf, 0x00, 0x00, 0x00
        };
        if( pnByte ) *pnByte = 3;
        return bom;
      }

      /*
      ** This function returns non-zero if the blob starts with a UTF-8
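The hunk ends just as the BOM-detection function begins, so its body is not visible here. As a rough sketch of how `get_utf8_bom()` could be used for such a check, assuming a raw buffer rather than a Fossil `Blob`: the helper name `buf_starts_with_utf8_bom` is hypothetical and is not the function from the source.

```c
#include <string.h>

/*
** Same shape as the function in the diff: returns the UTF-8 BOM bytes
** and, if pnByte is non-NULL, stores the BOM length (3) there.
*/
const unsigned char *get_utf8_bom(int *pnByte){
  static const unsigned char bom[] = {
    0xef, 0xbb, 0xbf, 0x00, 0x00, 0x00
  };
  if( pnByte ) *pnByte = 3;
  return bom;
}

/*
** Hypothetical helper (an assumption, not Fossil's implementation):
** returns non-zero if the first n bytes at z begin with the UTF-8 BOM.
*/
int buf_starts_with_utf8_bom(const unsigned char *z, size_t n){
  int bomSize = 0;
  const unsigned char *bom = get_utf8_bom(&bomSize);

  if( n<(size_t)bomSize ) return 0;   /* Too short to hold a BOM */
  return memcmp(z, bom, (size_t)bomSize)==0;
}
```

A buffer beginning with EF BB BF returns non-zero; a buffer shorter than three bytes, or one starting with any other bytes, returns 0.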
︙			︙