Fossil

Check-in [758e3d3188]
Overview
Comment: If the table is encoded as start-value/size, a variable and a comparison can be saved. Should be even faster ....
SHA1: 758e3d318893fe5478bbcade2a5826574a07ec62
User & Date: jan.nijtmans 2016-06-18 16:50:53.508
Context
2016-06-26 17:04: Improve comments ... (Closed-Leaf check-in: 8bdd0abc7a user: jan.nijtmans tags: invalid_utf8_improvements)
2016-06-18 16:50: If the table is encoded as start-value/size, a variable and a comparison can be saved. Should be even faster .... ... (check-in: 758e3d3188 user: jan.nijtmans tags: invalid_utf8_improvements)
2016-06-18 14:44: Juggle variables and code around, making it as efficient and readable as possible. Also add more comments. ... (check-in: 7f067f2940 user: jan.nijtmans tags: invalid_utf8_improvements)
Changes
Changes to src/lookslike.c.
Before (old lines 145-166):
** Its number of high-order 1-bits indicates the number of continuation bytes
** that are expected to follow. E.g. when 'c2' has a value in the range
** 0xc0..0xdf it means that 'c' is expected to contain the last continuation
** byte of a UTF-8 character. A value 0xe0..0xef means that after 'c' one
** more continuation byte is expected.
*/

/* definitions for various UTF-8 sequence lengths */

#define US2A  0x7F, 0x80 /* for lead byte 0xC0 */
#define US2B  0x7F, 0xBF /* for lead bytes 0xC2-0xDF */
#define US3A  0x9F, 0xBF /* for lead byte 0xE0 */
#define US3B  0x7F, 0xBF /* for lead bytes 0xE1-0xEF */
#define US4A  0x8F, 0xBF /* for lead byte 0xF0 */
#define US4B  0x7F, 0xBF /* for lead bytes 0xF1-0xF3 */
#define US4C  0x7F, 0x8F /* for lead byte 0xF4 */
#define US0A  0xFF, 0x00 /* for any other lead byte */

/* a table used for quick lookup of the definition that goes with a
 * particular lead byte */
static const unsigned char lb_tab[] = {
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,







After (new lines 145-167):
** Its number of high-order 1-bits indicates the number of continuation bytes
** that are expected to follow. E.g. when 'c2' has a value in the range
** 0xc0..0xdf it means that 'c' is expected to contain the last continuation
** byte of a UTF-8 character. A value 0xe0..0xef means that after 'c' one
** more continuation byte is expected.
*/

/* definitions for various UTF-8 sequence lengths, encoded as start value
 * and size of each valid range belonging to some lead byte*/
#define US2A  0x80, 0x01 /* for lead byte 0xC0 */
#define US2B  0x80, 0x40 /* for lead bytes 0xC2-0xDF */
#define US3A  0xA0, 0x20 /* for lead byte 0xE0 */
#define US3B  0x80, 0x40 /* for lead bytes 0xE1-0xEF */
#define US4A  0x90, 0x30 /* for lead byte 0xF0 */
#define US4B  0x80, 0x40 /* for lead bytes 0xF1-0xF3 */
#define US4C  0x80, 0x10 /* for lead byte 0xF4 */
#define US0A  0xFF, 0x00 /* for any other lead byte */

/* a table used for quick lookup of the definition that goes with a
 * particular lead byte */
static const unsigned char lb_tab[] = {
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
  US0A, US0A, US0A, US0A, US0A, US0A, US0A, US0A,
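The table above holds two bytes per lead byte, and the loop further down indexes it with the lead byte shifted left by one inside an unsigned char. A minimal sketch of that index arithmetic follows; `entry_offset` is a hypothetical helper name used only for illustration and does not appear in src/lookslike.c.

```c
/* Hypothetical helper: offset of the two-byte table entry that belongs
** to a lead byte in the range 0x80..0xFF. Shifting inside an unsigned
** char doubles the value and drops the highest bit, so 0x80 maps to
** offset 0 and 0xFF maps to offset 254. */
static unsigned int entry_offset(unsigned char lead){
  return (unsigned char)(lead << 1);
}
```

Under this scheme 0xC2 lands at offset 132 and 0xF4 at offset 232, so a table of 128 two-byte entries (256 bytes) covers every byte value with the high bit set.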
Before (old lines 187-206):
  unsigned int n = blob_size(pContent);
  unsigned char c; /* lead byte to be handled. */

  if( n==0 ) return 0;  /* Empty file -> OK */
  c = *z;
  while( --n>0 ){
    if( c>=0x80 ){
      unsigned char fb = *++z; /* follow-up byte after lead byte */
      const unsigned char *def; /* pointer to range table*/

      c <<= 1; /* multiply by 2 and get rid of highest bit */
      def = &lb_tab[c]; /* search fb's valid range in table */
      if( (fb<=def[0]) || (fb>def[1]) ){
        return LOOK_INVALID; /* Invalid UTF-8 */
      }
      c = (c>=0xC0) ? (c|3) : ' '; /* determine next lead byte */
    } else {
      c = *++z;
    }
  }
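The "determine next lead byte" step may be easier to follow in isolation. The sketch below repeats the shift and the final assignment from the loop, leaving out the table check; `next_lead` is a hypothetical name, not a function in the source.

```c
/* Hypothetical helper: given a lead byte (or pseudo lead byte) whose
** follow-up byte has just been accepted, return the value that the loop
** would treat as the next lead byte. The table lookup is omitted, so
** this is only meaningful for bytes the table would accept. */
static unsigned char next_lead(unsigned char c){
  c <<= 1;                        /* same shift as in the loop */
  return (c>=0xC0) ? (c|3) : ' '; /* more follow-up bytes, or done */
}
```

With these definitions, 0xC2 yields ' ' (the sequence is complete), 0xE0 yields 0xC3 (one more follow-up byte, checked as for lead bytes 0xC2-0xDF), and 0xF0 yields 0xE3, which in turn yields 0xC7 and then ' '.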







After (new lines 188-206):
  unsigned int n = blob_size(pContent);
  unsigned char c; /* lead byte to be handled. */

  if( n==0 ) return 0;  /* Empty file -> OK */
  c = *z;
  while( --n>0 ){
    if( c>=0x80 ){

      const unsigned char *def; /* pointer to range table*/

      c <<= 1; /* multiply by 2 and get rid of highest bit */
      def = &lb_tab[c]; /* search fb's valid range in table */
      if( (unsigned int)(*++z-def[0])>=def[1] ){
        return LOOK_INVALID; /* Invalid UTF-8 */
      }
      c = (c>=0xC0) ? (c|3) : ' '; /* determine next lead byte */
    } else {
      c = *++z;
    }
  }
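To illustrate why the start-value/size encoding saves a comparison, the following sketch places the old and the new range test side by side and checks that they agree for every follow-up byte. The function names and the `encodings_agree` self-check are hypothetical and written only for this illustration; the table pairs are copied from the US2A..US0A macros shown in the diff.

```c
/* Old encoding: lo is one less than the first valid byte, hi is the
** last valid byte. Two comparisons are needed. */
static int invalid_bounds(unsigned char fb, unsigned char lo,
                          unsigned char hi){
  return (fb<=lo) || (fb>hi);
}

/* New encoding: start is the first valid byte, size is the number of
** valid values. If fb<start the subtraction is negative and wraps to a
** huge unsigned value, so a single comparison covers both bounds. */
static int invalid_start_size(unsigned char fb, unsigned char start,
                              unsigned char size){
  return (unsigned int)(fb - start) >= size;
}

/* Hypothetical self-check: returns 1 if both tests give the same answer
** for all 256 byte values and all macro pairs, otherwise 0. */
static int encodings_agree(void){
  static const unsigned char oldDef[6][2] = {
    {0x7F,0x80}, {0x7F,0xBF}, {0x9F,0xBF},
    {0x8F,0xBF}, {0x7F,0x8F}, {0xFF,0x00}
  };
  static const unsigned char newDef[6][2] = {
    {0x80,0x01}, {0x80,0x40}, {0xA0,0x20},
    {0x90,0x30}, {0x80,0x10}, {0xFF,0x00}
  };
  int i, fb;
  for(i=0; i<6; i++){
    for(fb=0; fb<256; fb++){
      int a = invalid_bounds((unsigned char)fb, oldDef[i][0], oldDef[i][1]);
      int b = invalid_start_size((unsigned char)fb, newDef[i][0], newDef[i][1]);
      if( a!=b ) return 0;
    }
  }
  return 1;
}
```

The US0A pair (0xFF, 0x00) rejects every byte in both encodings: in the old form every byte is <= 0xFF, and in the new form no unsigned value is smaller than a size of 0.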