Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
| Comment: | Faster resolution of common symbolic tags such as "trunk". |
|---|---|
| Downloads: | Tarball | ZIP archive |
| Timelines: | family | ancestors | descendants | both | trunk |
| Files: | files | file ages | folders |
| SHA3-256: |
fdeebddea7abf6addf1640da01804f23 |
| User & Date: | drh 2021-01-08 15:16:00.894 |
Consider the following schema in which there are a large number objects and each object can have zero or more tags. A single tag might apply to any number of objects. Most tags apply to only a few objects, but a few tags might apply to many objects. One or two tags might apply to 95% of the objects.
The "objects" here are really EVENT table entries and "tags" are the combination of the TAG and TAGXREF tables. Let's assume that there are 1,000,000 objects and 50,000 distinct tagNames. Most objects have exactly one tag. 5% of the objects have two or more tags. Most tagNames are assigned to less than 20 objects, but a few tags are assigned to many objects, and one tag in particular is assigned to about 95% of the objects.
CREATE TABLE object( objectId INTEGER PRIMARY KEY, value REAL ); CREATE INDEX object_value ON object(value); CREATE TABLE tag( tagName TEXT, objectId INTEGER REFERENCES object ON DELETE CASCADE, PRIMARY KEY(tagName, objectId) );
Given a tagName $TAG, we want to find the object with that tag that has the largest value. The code was computing that objectId as follows:
SELECT objectId FROM object JOIN tag USING(objectId) WHERE tagName=$TAG ORDER BY value DESC LIMIT 1;
The previous query works well when the number of objects associated with $TAG are small (at least compared to the total number of objects). But what if 95% of the objects have $TAG? In that case, the query above is inefficient since it needs to search most of the object table to find the object with the largest value. The following query works by scaning objects in decreasing value order looking for the first one that has $TAG:
SELECT objectId FROM object WHERE EXISTS(SELECT 1 FROM tag WHERE tag.objectId=object.objectId AND tag.tagName=$TAG) ORDER BY value DESC LIMIT 1;
The problem is, we don't know in advance if $TAG is abundant or rare. This check-in attempts to resolve the problem by using the following grotesque query:
SELECT objectId, value FROM (SELECT objectId, value FROM object ORDER BY value DESC LIMIT 30) AS ox WHERE EXISTS(SELECT 1 FROM tag WHERE tag.objectId=ox.objectId AND tag.tagName=$TAG) UNION ALL SELECT objectId, value FROM ( SELECT objectId, value FROM object JOIN tag USING(objectId) WHERE tagName=$TAG ORDER BY object.value DESC LIMIT 1 ) LIMIT 1;
The previous query works as follows: The first term of the UNION ALL searches for an object with $TAG amoung the 30 objects with the largest values. If there are no matches, then it falls down into the second term of the UNION ALL which searches all objects with $TAG for the one with the largest value. The query works because SQLite evaluates UNION ALL compounds sequentially. (This is not guaranteed by the SQLite documentation, but it has always been so. Perhaps SQLite should make it a hard guarantee moving forward?)
The grotesque query is still inefficient if, for example, all one million objects have $TAG except for the 50 largest. We could denormalize the schema, making a copy of the object.value into the tag table, and keeping the values in sync using triggers. like this:
CREATE TABLE object( objectId INTEGER PRIMARY KEY, value REAL ); CREATE INDEX object_value ON object(value); CREATE TABLE tag( tagName TEXT, objectId INTEGER REFERENCES object ON DELETE CASCASE, objectValue REAL, -- Copy of object.value PRIMARY KEY(tagName, objectId) ); CREATE INDEX tag_value ON tag(objectValue); CREATE TRIGGER AFTER UPDATE ON object(value) BEGIN UPDATE tag SET objectValue=new.value WHERE objectId=new.objectId; END; CREATE TRIGGER AFTER INSERT ON tag BEGIN UPDATE tag SET objectValue=(SELECT value FROM object WHERE objectId=new.objectId) WHERE tagName=new.tagName and objectId=new.objectId; END; CREATE TRIGGER AFTER UPDATE ON tag BEGIN UPDATE tag SET objectValue=(SELECT value FROM object WHERE objectId=new.objectId) WHERE tagName=new.tagName and objectId=new.objectId; END;
In that case, the following query is always efficient:
SELECT objectId FROM tag WHERE tagName=$TAG ORDER BY objectValue DESC LIMIT 1;
The downside is that there are now three triggers and an extra index in the schema and the schema is denormalized.
|
2021-01-08
| ||
| 15:25 | In the makefiles, put sqlite3.o early in the dependency list so that on a multithreaded make, it starts earliest. This makes the multithreaded makes finish sooner. check-in: 8ca760ce97 user: drh tags: trunk | |
| 15:16 | Faster resolution of common symbolic tags such as "trunk". check-in: fdeebddea7 user: drh tags: trunk | |
| 14:23 | Improved wording on the change log. check-in: a28778e1d8 user: drh tags: trunk | |
| ︙ | ︙ | |||
179 180 181 182 183 184 185 186 187 188 189 190 191 192 |
if( eType==2 && ans>0 ){
zBr = branch_of_rid(ans);
ans = compute_youngest_ancestor_in_branch(rid, zBr);
fossil_free(zBr);
}
return ans;
}
/*
** Convert a symbolic name into a RID. Acceptable forms:
**
** * artifact hash (optionally enclosed in [...])
** * 4-character or larger prefix of a artifact
** * Symbolic Name
| > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 |
if( eType==2 && ans>0 ){
zBr = branch_of_rid(ans);
ans = compute_youngest_ancestor_in_branch(rid, zBr);
fossil_free(zBr);
}
return ans;
}
/*
** Find the RID of the most recent object with symbolic tag zTag
** and having a type that matches zType.
**
** Return 0 if there are no matches.
**
** This is a tricky query to do efficiently.
** If the tag is very common (ex: "trunk") then
** we want to use the query identified below as Q1 - which searching
** the most recent EVENT table entries for the most recent with the tag.
** But if the tag is relatively scarce (anything other than "trunk", basically)
** then we want to do the indexed search show below as Q2.
*/
static int most_recent_event_with_tag(const char *zTag, const char *zType){
return db_int(0,
"SELECT objid FROM ("
/* Q1: Begin by looking for the tag in the 30 most recent events */
"SELECT objid"
" FROM (SELECT * FROM event ORDER BY mtime DESC LIMIT 30) AS ex"
" WHERE type GLOB '%q'"
" AND EXISTS(SELECT 1 FROM tagxref, tag"
" WHERE tag.tagname='sym-%q'"
" AND tagxref.tagid=tag.tagid"
" AND tagxref.tagtype>0"
" AND tagxref.rid=ex.objid)"
" ORDER BY mtime DESC LIMIT 1"
") UNION ALL SELECT * FROM ("
/* Q2: If the tag is not found in the 30 most recent events, then using
** the tagxref table to index for the tag */
"SELECT event.objid"
" FROM tag, tagxref, event"
" WHERE tag.tagname='sym-%q'"
" AND tagxref.tagid=tag.tagid"
" AND tagxref.tagtype>0"
" AND event.objid=tagxref.rid"
" AND event.type GLOB '%q'"
" ORDER BY event.mtime DESC LIMIT 1"
") LIMIT 1;",
zType, zTag, zTag, zType
);
}
/*
** Convert a symbolic name into a RID. Acceptable forms:
**
** * artifact hash (optionally enclosed in [...])
** * 4-character or larger prefix of a artifact
** * Symbolic Name
|
| ︙ | ︙ | |||
223 224 225 226 227 228 229 | int rid = 0; int nTag; int i; int startOfBranch = 0; const char *zXTag; /* zTag with optional [...] removed */ int nXTag; /* Size of zXTag */ const char *zDate; /* Expanded date-time string */ | < | 266 267 268 269 270 271 272 273 274 275 276 277 278 279 |
int rid = 0;
int nTag;
int i;
int startOfBranch = 0;
const char *zXTag; /* zTag with optional [...] removed */
int nXTag; /* Size of zXTag */
const char *zDate; /* Expanded date-time string */
if( zType==0 || zType[0]==0 ){
zType = "*";
}else if( zType[0]=='b' ){
zType = "ci";
startOfBranch = 1;
}
|
| ︙ | ︙ | |||
299 300 301 302 303 304 305 |
" ORDER BY mtime DESC LIMIT 1",
fossil_roundup_date(&zTag[4]), zType);
return rid;
}
/* "tag:" + symbolic-name */
if( memcmp(zTag, "tag:", 4)==0 ){
| < < < < < < < | < | 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 |
" ORDER BY mtime DESC LIMIT 1",
fossil_roundup_date(&zTag[4]), zType);
return rid;
}
/* "tag:" + symbolic-name */
if( memcmp(zTag, "tag:", 4)==0 ){
rid = most_recent_event_with_tag(&zTag[4], zType);
if( startOfBranch ) rid = start_of_branch(rid,1);
return rid;
}
/* root:BR -> The origin of the branch named BR */
if( strncmp(zTag, "root:", 5)==0 ){
rid = symbolic_name_to_rid(zTag+5, zType);
|
| ︙ | ︙ | |||
394 395 396 397 398 399 400 |
if( db_step(&q)==SQLITE_ROW ) rid = -1;
}
db_finalize(&q);
if( rid ) return rid;
}
if( zType[0]=='w' ){
| < < < | | | | | | | | | > > > | 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 |
if( db_step(&q)==SQLITE_ROW ) rid = -1;
}
db_finalize(&q);
if( rid ) return rid;
}
if( zType[0]=='w' ){
rid = db_int(0,
"SELECT event.objid, max(event.mtime)"
" FROM tag, tagxref, event"
" WHERE tag.tagname='wiki-%q' "
" AND tagxref.tagid=tag.tagid AND tagxref.tagtype>0 "
" AND event.objid=tagxref.rid "
" AND event.type GLOB '%q'",
zTag, zType
);
}else{
rid = most_recent_event_with_tag(zTag, zType);
}
if( rid>0 ){
if( startOfBranch ) rid = start_of_branch(rid,1);
return rid;
}
/* Pure numeric date/time */
|
| ︙ | ︙ |