Fossil

Check-in [fdeebddea7]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Faster resolution of common symbolic tags such as "trunk".
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA3-256: fdeebddea7abf6addf1640da01804f230e9a10c2b081b0aafdcac1642d0412ba
User & Date: drh 2021-01-08 15:16:00.894
An analysis of the SQL problem that this check-in attempts to resolve.

Consider the following schema in which there are a large number objects and each object can have zero or more tags. A single tag might apply to any number of objects. Most tags apply to only a few objects, but a few tags might apply to many objects. One or two tags might apply to 95% of the objects.

The "objects" here are really EVENT table entries and "tags" are the combination of the TAG and TAGXREF tables. Let's assume that there are 1,000,000 objects and 50,000 distinct tagNames. Most objects have exactly one tag. 5% of the objects have two or more tags. Most tagNames are assigned to less than 20 objects, but a few tags are assigned to many objects, and one tag in particular is assigned to about 95% of the objects.

CREATE TABLE object(
  objectId INTEGER PRIMARY KEY,
  value REAL
);
CREATE INDEX object_value ON object(value);
CREATE TABLE tag(
  tagName TEXT,
  objectId INTEGER REFERENCES object ON DELETE CASCADE,
  PRIMARY KEY(tagName, objectId)
);

Given a tagName $TAG, we want to find the object with that tag that has the largest value. The code was computing that objectId as follows:

SELECT objectId
  FROM object JOIN tag USING(objectId)
 WHERE tagName=$TAG
 ORDER BY value DESC
 LIMIT 1;

The previous query works well when the number of objects associated with $TAG are small (at least compared to the total number of objects). But what if 95% of the objects have $TAG? In that case, the query above is inefficient since it needs to search most of the object table to find the object with the largest value. The following query works by scaning objects in decreasing value order looking for the first one that has $TAG:

SELECT objectId
  FROM object
 WHERE EXISTS(SELECT 1 FROM tag
               WHERE tag.objectId=object.objectId
                 AND tag.tagName=$TAG)
 ORDER BY value DESC
 LIMIT 1;

The problem is, we don't know in advance if $TAG is abundant or rare. This check-in attempts to resolve the problem by using the following grotesque query:

SELECT objectId, value
  FROM (SELECT objectId, value FROM object ORDER BY value DESC LIMIT 30) AS ox
 WHERE EXISTS(SELECT 1 FROM tag
               WHERE tag.objectId=ox.objectId
                 AND tag.tagName=$TAG)
UNION ALL
SELECT objectId, value FROM (
  SELECT objectId, value
    FROM object JOIN tag USING(objectId)
   WHERE tagName=$TAG
   ORDER BY object.value DESC
   LIMIT 1
)
LIMIT 1;

The previous query works as follows: The first term of the UNION ALL searches for an object with $TAG amoung the 30 objects with the largest values. If there are no matches, then it falls down into the second term of the UNION ALL which searches all objects with $TAG for the one with the largest value. The query works because SQLite evaluates UNION ALL compounds sequentially. (This is not guaranteed by the SQLite documentation, but it has always been so. Perhaps SQLite should make it a hard guarantee moving forward?)

The grotesque query is still inefficient if, for example, all one million objects have $TAG except for the 50 largest. We could denormalize the schema, making a copy of the object.value into the tag table, and keeping the values in sync using triggers. like this:

CREATE TABLE object(
  objectId INTEGER PRIMARY KEY,
  value REAL
);
CREATE INDEX object_value ON object(value);
CREATE TABLE tag(
  tagName TEXT,
  objectId INTEGER REFERENCES object ON DELETE CASCASE,
  objectValue REAL, -- Copy of object.value
  PRIMARY KEY(tagName, objectId)
);
CREATE INDEX tag_value ON tag(objectValue);
CREATE TRIGGER AFTER UPDATE ON object(value) BEGIN
  UPDATE tag SET objectValue=new.value WHERE objectId=new.objectId;
END;
CREATE TRIGGER AFTER INSERT ON tag BEGIN
  UPDATE tag
     SET objectValue=(SELECT value FROM object WHERE objectId=new.objectId)
   WHERE tagName=new.tagName and objectId=new.objectId;
END;
CREATE TRIGGER AFTER UPDATE ON tag BEGIN
  UPDATE tag
     SET objectValue=(SELECT value FROM object WHERE objectId=new.objectId)
   WHERE tagName=new.tagName and objectId=new.objectId;
END;

In that case, the following query is always efficient:

SELECT objectId
  FROM tag
 WHERE tagName=$TAG
 ORDER BY objectValue DESC
 LIMIT 1;

The downside is that there are now three triggers and an extra index in the schema and the schema is denormalized.

Context
2021-01-08
15:25
In the makefiles, put sqlite3.o early in the dependency list so that on a multithreaded make, it starts earliest. This makes the multithreaded makes finish sooner. check-in: 8ca760ce97 user: drh tags: trunk
15:16
Faster resolution of common symbolic tags such as "trunk". check-in: fdeebddea7 user: drh tags: trunk
14:23
Improved wording on the change log. check-in: a28778e1d8 user: drh tags: trunk
Changes
Unified Diff Ignore Whitespace Patch
Changes to src/name.c.
179
180
181
182
183
184
185











































186
187
188
189
190
191
192
  if( eType==2 && ans>0 ){
    zBr = branch_of_rid(ans);
    ans = compute_youngest_ancestor_in_branch(rid, zBr);
    fossil_free(zBr);
  }
  return ans;
}












































/*
** Convert a symbolic name into a RID.  Acceptable forms:
**
**   *  artifact hash (optionally enclosed in [...])
**   *  4-character or larger prefix of a artifact
**   *  Symbolic Name







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
  if( eType==2 && ans>0 ){
    zBr = branch_of_rid(ans);
    ans = compute_youngest_ancestor_in_branch(rid, zBr);
    fossil_free(zBr);
  }
  return ans;
}

/*
** Find the RID of the most recent object with symbolic tag zTag
** and having a type that matches zType.
**
** Return 0 if there are no matches.
**
** This is a tricky query to do efficiently.  
** If the tag is very common (ex: "trunk") then
** we want to use the query identified below as Q1 - which searching
** the most recent EVENT table entries for the most recent with the tag.
** But if the tag is relatively scarce (anything other than "trunk", basically)
** then we want to do the indexed search show below as Q2.
*/
static int most_recent_event_with_tag(const char *zTag, const char *zType){
  return db_int(0,
    "SELECT objid FROM ("
      /* Q1:  Begin by looking for the tag in the 30 most recent events */
      "SELECT objid"
       " FROM (SELECT * FROM event ORDER BY mtime DESC LIMIT 30) AS ex"
      " WHERE type GLOB '%q'"
        " AND EXISTS(SELECT 1 FROM tagxref, tag"
                     " WHERE tag.tagname='sym-%q'"
                       " AND tagxref.tagid=tag.tagid"
                       " AND tagxref.tagtype>0"
                       " AND tagxref.rid=ex.objid)"
      " ORDER BY mtime DESC LIMIT 1"
    ") UNION ALL SELECT * FROM ("
      /* Q2: If the tag is not found in the 30 most recent events, then using
      ** the tagxref table to index for the tag */
      "SELECT event.objid"
       " FROM tag, tagxref, event"
      " WHERE tag.tagname='sym-%q'"
        " AND tagxref.tagid=tag.tagid"
        " AND tagxref.tagtype>0"
        " AND event.objid=tagxref.rid"
        " AND event.type GLOB '%q'"
      " ORDER BY event.mtime DESC LIMIT 1"
    ") LIMIT 1;",
    zType, zTag, zTag, zType
  );
}


/*
** Convert a symbolic name into a RID.  Acceptable forms:
**
**   *  artifact hash (optionally enclosed in [...])
**   *  4-character or larger prefix of a artifact
**   *  Symbolic Name
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
  int rid = 0;
  int nTag;
  int i;
  int startOfBranch = 0;
  const char *zXTag;     /* zTag with optional [...] removed */
  int nXTag;             /* Size of zXTag */
  const char *zDate;     /* Expanded date-time string */
  const char *zTagPrefix = "sym";

  if( zType==0 || zType[0]==0 ){
    zType = "*";
  }else if( zType[0]=='b' ){
    zType = "ci";
    startOfBranch = 1;
  }







<







266
267
268
269
270
271
272

273
274
275
276
277
278
279
  int rid = 0;
  int nTag;
  int i;
  int startOfBranch = 0;
  const char *zXTag;     /* zTag with optional [...] removed */
  int nXTag;             /* Size of zXTag */
  const char *zDate;     /* Expanded date-time string */


  if( zType==0 || zType[0]==0 ){
    zType = "*";
  }else if( zType[0]=='b' ){
    zType = "ci";
    startOfBranch = 1;
  }
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
      " ORDER BY mtime DESC LIMIT 1",
      fossil_roundup_date(&zTag[4]), zType);
    return rid;
  }

  /* "tag:" + symbolic-name */
  if( memcmp(zTag, "tag:", 4)==0 ){
    rid = db_int(0,
       "SELECT event.objid, max(event.mtime)"
       "  FROM tag, tagxref, event"
       " WHERE tag.tagname='sym-%q' "
       "   AND tagxref.tagid=tag.tagid AND tagxref.tagtype>0 "
       "   AND event.objid=tagxref.rid "
       "   AND event.type GLOB '%q'",
       &zTag[4], zType
    );
    if( startOfBranch ) rid = start_of_branch(rid,1);
    return rid;
  }

  /* root:BR -> The origin of the branch named BR */
  if( strncmp(zTag, "root:", 5)==0 ){
    rid = symbolic_name_to_rid(zTag+5, zType);







<
<
<
<
<
<
<
|
<







341
342
343
344
345
346
347







348

349
350
351
352
353
354
355
      " ORDER BY mtime DESC LIMIT 1",
      fossil_roundup_date(&zTag[4]), zType);
    return rid;
  }

  /* "tag:" + symbolic-name */
  if( memcmp(zTag, "tag:", 4)==0 ){







    rid = most_recent_event_with_tag(&zTag[4], zType);

    if( startOfBranch ) rid = start_of_branch(rid,1);
    return rid;
  }

  /* root:BR -> The origin of the branch named BR */
  if( strncmp(zTag, "root:", 5)==0 ){
    rid = symbolic_name_to_rid(zTag+5, zType);
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412



413
414
415
416
417
418
419
      if( db_step(&q)==SQLITE_ROW ) rid = -1;
    }
    db_finalize(&q);
    if( rid ) return rid;
  }

  if( zType[0]=='w' ){
    zTagPrefix = "wiki";
  }
  /* Symbolic name */
  rid = db_int(0,
    "SELECT event.objid, max(event.mtime)"
    "  FROM tag, tagxref, event"
    " WHERE tag.tagname='%q-%q' "
    "   AND tagxref.tagid=tag.tagid AND tagxref.tagtype>0 "
    "   AND event.objid=tagxref.rid "
    "   AND event.type GLOB '%q'",
    zTagPrefix, zTag, zType
  );




  if( rid>0 ){
    if( startOfBranch ) rid = start_of_branch(rid,1);
    return rid;
  }

  /* Pure numeric date/time */







<
<
<
|
|
|
|
|
|
|
|
|
>
>
>







428
429
430
431
432
433
434



435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
      if( db_step(&q)==SQLITE_ROW ) rid = -1;
    }
    db_finalize(&q);
    if( rid ) return rid;
  }

  if( zType[0]=='w' ){



    rid = db_int(0,
      "SELECT event.objid, max(event.mtime)"
      "  FROM tag, tagxref, event"
      " WHERE tag.tagname='wiki-%q' "
      "   AND tagxref.tagid=tag.tagid AND tagxref.tagtype>0 "
      "   AND event.objid=tagxref.rid "
      "   AND event.type GLOB '%q'",
      zTag, zType
    );
  }else{
    rid = most_recent_event_with_tag(zTag, zType);
  }

  if( rid>0 ){
    if( startOfBranch ) rid = start_of_branch(rid,1);
    return rid;
  }

  /* Pure numeric date/time */