Fossil: Check-in [fdeebddea7]

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview

Comment:	Faster resolution of common symbolic tags such as "trunk".
Downloads:	Tarball \| ZIP archive
Timelines:	family \| ancestors \| descendants \| both \| trunk
Files:	files \| file ages \| folders
SHA3-256:	fdeebddea7abf6addf1640da01804f230e9a10c2b081b0aafdcac1642d0412ba
User & Date:	drh 2021-01-08 15:16:00.894

An analysis of the SQL problem that this check-in attempts to resolve.

Consider the following schema in which there are a large number objects and each object can have zero or more tags. A single tag might apply to any number of objects. Most tags apply to only a few objects, but a few tags might apply to many objects. One or two tags might apply to 95% of the objects.

The "objects" here are really EVENT table entries and "tags" are the combination of the TAG and TAGXREF tables. Let's assume that there are 1,000,000 objects and 50,000 distinct tagNames. Most objects have exactly one tag. 5% of the objects have two or more tags. Most tagNames are assigned to less than 20 objects, but a few tags are assigned to many objects, and one tag in particular is assigned to about 95% of the objects.

CREATE TABLE object(
  objectId INTEGER PRIMARY KEY,
  value REAL
);
CREATE INDEX object_value ON object(value);
CREATE TABLE tag(
  tagName TEXT,
  objectId INTEGER REFERENCES object ON DELETE CASCADE,
  PRIMARY KEY(tagName, objectId)
);

Given a tagName $TAG, we want to find the object with that tag that has the largest value. The code was computing that objectId as follows:

SELECT objectId
  FROM object JOIN tag USING(objectId)
 WHERE tagName=$TAG
 ORDER BY value DESC
 LIMIT 1;

The previous query works well when the number of objects associated with $TAG are small (at least compared to the total number of objects). But what if 95% of the objects have $TAG? In that case, the query above is inefficient since it needs to search most of the object table to find the object with the largest value. The following query works by scaning objects in decreasing value order looking for the first one that has $TAG:

SELECT objectId
  FROM object
 WHERE EXISTS(SELECT 1 FROM tag
               WHERE tag.objectId=object.objectId
                 AND tag.tagName=$TAG)
 ORDER BY value DESC
 LIMIT 1;

The problem is, we don't know in advance if $TAG is abundant or rare. This check-in attempts to resolve the problem by using the following grotesque query:

SELECT objectId, value
  FROM (SELECT objectId, value FROM object ORDER BY value DESC LIMIT 30) AS ox
 WHERE EXISTS(SELECT 1 FROM tag
               WHERE tag.objectId=ox.objectId
                 AND tag.tagName=$TAG)
UNION ALL
SELECT objectId, value FROM (
  SELECT objectId, value
    FROM object JOIN tag USING(objectId)
   WHERE tagName=$TAG
   ORDER BY object.value DESC
   LIMIT 1
)
LIMIT 1;

The previous query works as follows: The first term of the UNION ALL searches for an object with $TAG amoung the 30 objects with the largest values. If there are no matches, then it falls down into the second term of the UNION ALL which searches all objects with $TAG for the one with the largest value. The query works because SQLite evaluates UNION ALL compounds sequentially. (This is not guaranteed by the SQLite documentation, but it has always been so. Perhaps SQLite should make it a hard guarantee moving forward?)

The grotesque query is still inefficient if, for example, all one million objects have $TAG except for the 50 largest. We could denormalize the schema, making a copy of the object.value into the tag table, and keeping the values in sync using triggers. like this:

CREATE TABLE object(
  objectId INTEGER PRIMARY KEY,
  value REAL
);
CREATE INDEX object_value ON object(value);
CREATE TABLE tag(
  tagName TEXT,
  objectId INTEGER REFERENCES object ON DELETE CASCASE,
  objectValue REAL, -- Copy of object.value
  PRIMARY KEY(tagName, objectId)
);
CREATE INDEX tag_value ON tag(objectValue);
CREATE TRIGGER AFTER UPDATE ON object(value) BEGIN
  UPDATE tag SET objectValue=new.value WHERE objectId=new.objectId;
END;
CREATE TRIGGER AFTER INSERT ON tag BEGIN
  UPDATE tag
     SET objectValue=(SELECT value FROM object WHERE objectId=new.objectId)
   WHERE tagName=new.tagName and objectId=new.objectId;
END;
CREATE TRIGGER AFTER UPDATE ON tag BEGIN
  UPDATE tag
     SET objectValue=(SELECT value FROM object WHERE objectId=new.objectId)
   WHERE tagName=new.tagName and objectId=new.objectId;
END;

In that case, the following query is always efficient:

SELECT objectId
  FROM tag
 WHERE tagName=$TAG
 ORDER BY objectValue DESC
 LIMIT 1;

The downside is that there are now three triggers and an extra index in the schema and the schema is denormalized.

Context

2021-01-08
15:25		In the makefiles, put sqlite3.o early in the dependency list so that on a multithreaded make, it starts earliest. This makes the multithreaded makes finish sooner. check-in: 8ca760ce97 user: drh tags: trunk
15:16		Faster resolution of common symbolic tags such as "trunk". check-in: fdeebddea7 user: drh tags: trunk
14:23		Improved wording on the change log. check-in: a28778e1d8 user: drh tags: trunk

Changes

Changes to src/name.c.

︙			︙
179 180 181 182 183 184 185 186 187 188 189 190 191 192	if( eType==2 && ans>0 ){ zBr = branch_of_rid(ans); ans = compute_youngest_ancestor_in_branch(rid, zBr); fossil_free(zBr); } return ans; } /* Convert a symbolic name into a RID. Acceptable forms: ** * artifact hash (optionally enclosed in [...]) ** * 4-character or larger prefix of a artifact ** * Symbolic Name	> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >	179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235	if( eType==2 && ans>0 ){ zBr = branch_of_rid(ans); ans = compute_youngest_ancestor_in_branch(rid, zBr); fossil_free(zBr); } return ans; } /* Find the RID of the most recent object with symbolic tag zTag and having a type that matches zType. Return 0 if there are no matches. This is a tricky query to do efficiently. If the tag is very common (ex: "trunk") then we want to use the query identified below as Q1 - which searching the most recent EVENT table entries for the most recent with the tag. But if the tag is relatively scarce (anything other than "trunk", basically) ** then we want to do the indexed search show below as Q2. / static int most_recent_event_with_tag(const char zTag, const char zType){ return db_int(0, "SELECT objid FROM (" / Q1: Begin by looking for the tag in the 30 most recent events / "SELECT objid" " FROM (SELECT FROM event ORDER BY mtime DESC LIMIT 30) AS ex" " WHERE type GLOB '%q'" " AND EXISTS(SELECT 1 FROM tagxref, tag" " WHERE tag.tagname='sym-%q'" " AND tagxref.tagid=tag.tagid" " AND tagxref.tagtype>0" " AND tagxref.rid=ex.objid)" " ORDER BY mtime DESC LIMIT 1" ") UNION ALL SELECT * FROM (" /* Q2: If the tag is not found in the 30 most recent events, then using ** the tagxref table to index for the tag / "SELECT event.objid" " FROM tag, tagxref, event" " WHERE tag.tagname='sym-%q'" " AND tagxref.tagid=tag.tagid" " AND tagxref.tagtype>0" " AND event.objid=tagxref.rid" " AND event.type GLOB '%q'" " ORDER BY event.mtime DESC LIMIT 1" ") LIMIT 1;", zType, zTag, zTag, zType ); } / Convert a symbolic name into a RID. Acceptable forms: ** * artifact hash (optionally enclosed in [...]) ** * 4-character or larger prefix of a artifact ** * Symbolic Name
︙			︙
223 224 225 226 227 228 229 ~~230~~ 231 232 233 234 235 236 237	int rid = 0; int nTag; int i; int startOfBranch = 0; const char zXTag; / zTag with optional [...] removed / int nXTag; / Size of zXTag / const char zDate; /* Expanded date-time string / ~~const char zTagPrefix = "sym";~~ if( zType==0 \|\| zType[0]==0 ){ zType = "*"; }else if( zType[0]=='b' ){ zType = "ci"; startOfBranch = 1; }	<	266 267 268 269 270 271 272 273 274 275 276 277 278 279	int rid = 0; int nTag; int i; int startOfBranch = 0; const char zXTag; / zTag with optional [...] removed / int nXTag; / Size of zXTag / const char zDate; /* Expanded date-time string / if( zType==0 \|\| zType[0]==0 ){ zType = ""; }else if( zType[0]=='b' ){ zType = "ci"; startOfBranch = 1; }
︙			︙
299 300 301 302 303 304 305 ~~306 307 308 309 310 311 312 313 314~~ 315 316 317 318 319 320 321	" ORDER BY mtime DESC LIMIT 1", fossil_roundup_date(&zTag[4]), zType); return rid; } /* "tag:" + symbolic-name / if( memcmp(zTag, "tag:", 4)==0 ){ ~~rid = db_int(0,~~ ~~"SELECT event.objid, max(event.mtime)"~~ ~~" FROM tag, tagxref, event"~~ ~~" WHERE tag.tagname='sym-%q' "~~ ~~" AND tagxref.tagid=tag.tagid AND tagxref.tagtype>0 "~~ ~~" AND event.objid=tagxref.rid "~~ ~~" AND event.type GLOB '%q'",~~ &zTag[4], zType ); if( startOfBranch ) rid = start_of_branch(rid,1); return rid; } / root:BR -> The origin of the branch named BR */ if( strncmp(zTag, "root:", 5)==0 ){ rid = symbolic_name_to_rid(zTag+5, zType);	< < < < < < < \| <	341 342 343 344 345 346 347 348 349 350 351 352 353 354 355	" ORDER BY mtime DESC LIMIT 1", fossil_roundup_date(&zTag[4]), zType); return rid; } /* "tag:" + symbolic-name / if( memcmp(zTag, "tag:", 4)==0 ){ rid = most_recent_event_with_tag(&zTag[4], zType); if( startOfBranch ) rid = start_of_branch(rid,1); return rid; } / root:BR -> The origin of the branch named BR */ if( strncmp(zTag, "root:", 5)==0 ){ rid = symbolic_name_to_rid(zTag+5, zType);
︙			︙
394 395 396 397 398 399 400 ~~401 402 403 404 405 406 407 408 409 410 411 412~~ 413 414 415 416 417 418 419	if( db_step(&q)==SQLITE_ROW ) rid = -1; } db_finalize(&q); if( rid ) return rid; } if( zType[0]=='w' ){ ~~zTagPrefix = "wiki";~~ } ~~/* Symbolic name /~~ rid = db_int(0, "SELECT event.objid, max(event.mtime)" " FROM tag, tagxref, event" " WHERE tag.tagname='%q-%q' " " AND tagxref.tagid=tag.tagid AND tagxref.tagtype>0 " " AND event.objid=tagxref.rid " " AND event.type GLOB '%q'", ~~zTagPrefix,~~ zTag, zType ); if( rid>0 ){ if( startOfBranch ) rid = start_of_branch(rid,1); return rid; } / Pure numeric date/time */	< < < \| \| \| \| \| \| \| \| \| > > >	428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453	if( db_step(&q)==SQLITE_ROW ) rid = -1; } db_finalize(&q); if( rid ) return rid; } if( zType[0]=='w' ){ rid = db_int(0, "SELECT event.objid, max(event.mtime)" " FROM tag, tagxref, event" " WHERE tag.tagname='wiki-%q' " " AND tagxref.tagid=tag.tagid AND tagxref.tagtype>0 " " AND event.objid=tagxref.rid " " AND event.type GLOB '%q'", zTag, zType ); }else{ rid = most_recent_event_with_tag(zTag, zType); } if( rid>0 ){ if( startOfBranch ) rid = start_of_branch(rid,1); return rid; } /* Pure numeric date/time */
︙			︙