Overview
| Comment: | If the table is encoded as start-value/size, a variable and a comparison can be saved. Should be even faster .... |
|---|---|
| SHA1: | 758e3d318893fe5478bbcade2a582657 |
| User & Date: | jan.nijtmans 2016-06-18 16:50:53.508 |
Context
| 2016-06-26 17:04 | Improve comments ... (Closed-Leaf check-in: 8bdd0abc7a user: jan.nijtmans tags: invalid_utf8_improvements) |
|---|---|
| 2016-06-18 16:50 | If the table is encoded as start-value/size, a variable and a comparison can be saved. Should be even faster .... ... (check-in: 758e3d3188 user: jan.nijtmans tags: invalid_utf8_improvements) |
| 2016-06-18 14:44 | Juggle variables and code around, making it as efficient and readable as possible. Also add more comments. ... (check-in: 7f067f2940 user: jan.nijtmans tags: invalid_utf8_improvements) |
Changes
Changes to src/lookslike.c.
︙
** Its number of high-order 1-bits indicates the number of continuation bytes
** that are expected to follow. E.g. when 'c2' has a value in the range
** 0xc0..0xdf it means that 'c' is expected to contain the last continuation
** byte of a UTF-8 character. A value 0xe0..0xef means that after 'c' one
** more continuation byte is expected.
*/
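The bit-counting idea in the comment above can be illustrated on a plain (unshifted) lead byte. This is only a sketch: `leading_ones` and `continuation_count` are hypothetical helpers that do not appear in src/lookslike.c, and the check-in itself works on a shifted copy of the byte rather than calling anything like them.

```c
/* Hypothetical helper (not part of lookslike.c): count the consecutive
** 1-bits at the top of a byte. */
static int leading_ones(unsigned char b){
  int n = 0;
  while( b & 0x80 ){
    n++;
    b = (unsigned char)(b << 1);  /* drop the bit just counted */
  }
  return n;
}

/* Hypothetical helper: number of continuation bytes announced by a lead
** byte. Returns 0 for ASCII and -1 for a byte that cannot start a
** sequence (a bare continuation byte, or five or more leading 1-bits).
** Only the bit pattern is inspected; the value ranges are not checked. */
static int continuation_count(unsigned char lead){
  int n = leading_ones(lead);
  if( n==0 ) return 0;            /* 0xxxxxxx: single-byte character */
  if( n>=2 && n<=4 ) return n-1;  /* 110xxxxx, 1110xxxx, 11110xxx */
  return -1;
}
```

For example, `continuation_count(0xE2)` yields 2, because 0xE2 is `11100010` in binary and starts with three 1-bits.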
/* Definitions for the various UTF-8 sequence lengths, encoded as the start
 * value and size of the valid range for the byte that follows a given lead
 * byte. */
#define US2A 0x80, 0x01 /* for lead byte 0xC0 */
#define US2B 0x80, 0x40 /* for lead bytes 0xC2-0xDF */
#define US3A 0xA0, 0x20 /* for lead byte 0xE0 */
#define US3B 0x80, 0x40 /* for lead bytes 0xE1-0xEF */
#define US4A 0x90, 0x30 /* for lead byte 0xF0 */
#define US4B 0x80, 0x40 /* for lead bytes 0xF1-0xF3 */
#define US4C 0x80, 0x10 /* for lead byte 0xF4 */
#define US0A 0xFF, 0x00 /* for any other lead byte */
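The check-in comment says the start-value/size encoding saves a variable and a comparison. A minimal sketch of why, assuming a hypothetical helper `in_range` that is not in the source: subtracting the start value and comparing the result as an unsigned number tests both bounds at once, because any byte below the start wraps around to a very large value.

```c
/* Hypothetical helper (not part of lookslike.c): returns 1 when byte b
** lies in the half-open range [start, start+size), using one comparison.
** b and start are promoted to int before the subtraction; a negative
** difference becomes a huge unsigned value and fails the test. */
static int in_range(unsigned char b, unsigned char start, unsigned char size){
  return (unsigned int)(b - start) < size;
}
```

With the US3A pair (0xA0, 0x20), `in_range(0xBF, 0xA0, 0x20)` is 1 and `in_range(0x9F, 0xA0, 0x20)` is 0. With the US0A pair (0xFF, 0x00) the size is zero, so no byte is ever accepted.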
/* a table used for quick lookup of the definition that goes with a
* particular lead byte */
static const unsigned char lb_tab[] = {
US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
︙
unsigned int n = blob_size(pContent);
unsigned char c; /* lead byte to be handled. */
if( n==0 ) return 0; /* Empty file -> OK */
c = *z;
while( --n>0 ){
if( c>=0x80 ){
const unsigned char *def; /* pointer to range table entry */
c <<= 1; /* multiply by 2 and get rid of highest bit */
def = &lb_tab[c]; /* look up the valid range for the next byte */
if( (unsigned int)(*++z-def[0])>=def[1] ){
return LOOK_INVALID; /* Invalid UTF-8 */
}
c = (c>=0xC0) ? (c|3) : ' '; /* determine next lead byte */
} else {
c = *++z;
}
}
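The loop above can be tried in isolation with the following sketch. It makes several assumptions: the full contents of `lb_tab` are not shown in this diff, so the table is rebuilt at run time from the ranges documented in the `US..` defines; `utf8_ok`, `set_range`, and `tab` are hypothetical names; and the final `return c<0x80` is a guess at how a pending lead byte would be rejected after the loop, since that part of the function is elided.

```c
#include <stddef.h>

/* 128 (start,size) pairs, indexed by (lead byte << 1) & 0xFF. */
static unsigned char tab[256];
static int tab_ready = 0;

/* Hypothetical helper: store one (start,size) pair for a span of lead bytes. */
static void set_range(unsigned lo, unsigned hi,
                      unsigned char start, unsigned char size){
  unsigned lead;
  for(lead=lo; lead<=hi; lead++){
    unsigned char i = (unsigned char)(lead << 1);
    tab[i] = start;
    tab[i+1] = size;
  }
}

/* Returns 1 if z[0..n-1] passes the range checks, 0 otherwise. */
static int utf8_ok(const unsigned char *z, size_t n){
  unsigned char c;
  if( !tab_ready ){
    set_range(0x80, 0xFF, 0xFF, 0x00);  /* US0A: default, nothing valid */
    set_range(0xC0, 0xC0, 0x80, 0x01);  /* US2A */
    set_range(0xC2, 0xDF, 0x80, 0x40);  /* US2B */
    set_range(0xE0, 0xE0, 0xA0, 0x20);  /* US3A */
    set_range(0xE1, 0xEF, 0x80, 0x40);  /* US3B */
    set_range(0xF0, 0xF0, 0x90, 0x30);  /* US4A */
    set_range(0xF1, 0xF3, 0x80, 0x40);  /* US4B */
    set_range(0xF4, 0xF4, 0x80, 0x10);  /* US4C */
    tab_ready = 1;
  }
  if( n==0 ) return 1;                  /* empty input is fine */
  c = *z;
  while( --n>0 ){
    if( c>=0x80 ){
      const unsigned char *def;
      c <<= 1;                          /* table index; top bit falls off */
      def = &tab[c];
      if( (unsigned int)(*++z - def[0])>=def[1] ) return 0;
      /* A shifted value >= 0xC0 means more continuation bytes follow;
      ** c|3 turns it into a synthetic lead byte one level shorter. */
      c = (c>=0xC0) ? (unsigned char)(c|3) : ' ';
    }else{
      c = *++z;
    }
  }
  return c<0x80;                        /* assumption: pending lead byte fails */
}
```

Tracing the three-byte sequence E2 82 AC through the loop: 0xE2 shifts to 0xC4, the byte 0x82 passes the (0x80, 0x40) range, and `c` becomes 0xC7, which the next iteration treats like a two-byte lead byte in the 0xC2-0xDF group.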
︙