Overview
| Comment: | Updates to the antibot.wiki page, to discuss the latest enhancements to robot defenses. |
|---|---|
| SHA3-256: | 14e23927cea4e56aea3ec50e7de9251b |
| User & Date: | drh 2025-10-09 13:35:20.340 |
Context
| 2025-10-09 | |
|---|---|
| 18:22 | stash drop help tweak suggested in [forum:d5c5c0f980|forum post d5c5c0f980]. ... (check-in: e2783d0789 user: stephan tags: trunk) |
| 13:35 | Updates to the antibot.wiki page, to discuss the latest enhancements to robot defenses. ... (check-in: 14e23927ce user: drh tags: trunk) |
| 12:55 | New settings to allow robots to download tarballs but only if the corresponding check-in is a leaf or if it has a tag like "release" or "allow-robots". New settings control all of the above. ... (check-in: 4d198d0e12 user: drh tags: trunk) |
Changes
Changes to www/antibot.wiki.
<title>Defense Against Robots</title>

A typical Fossil website can have billions and billions of pages,
and many of those pages (for example diffs and annotations and tarballs)
can be expensive to compute.
If a robot walks a Fossil-generated website,
it can present a crippling bandwidth and CPU load.
A "robots.txt" file can help, but in practice, most robots these
days ignore the robots.txt file, so it won't help much.

A Fossil website is intended to be used interactively by humans, not
walked by robots. This article describes the techniques used by Fossil
to try to welcome human users while keeping out robots.

<h2>Defenses Are Enabled By Default</h2>

In the latest implementations of Fossil, most robot defenses are
enabled by default. You can probably get by with standing up a
public-facing Fossil instance in the default configuration. But
you can also customize the defenses to serve your particular needs.

<h2>Customizing Anti-Robot Defenses</h2>

Admin users can configure robot defenses on the
"Robot Defense Settings" page (/setup_robot).
That page is accessible (to Admin users) from the default menu bar
by clicking on the "Admin" menu choice, then selecting the
"Robot-Defense" link from the list.
| ︙ | ︙ | |||
<h2>Do Not Allow Robot Access To Certain Pages</h2>

The [/help?cmd=robot-restrict|robot-restrict setting] is a comma-separated
list of GLOB patterns for pages for which robot access is prohibited.
The default value is:

<blockquote><pre>
timelineX,diff,annotate,fileage,file,finfo,reports
</pre></blockquote>

Each entry corresponds to the first path element on the URI for a
Fossil-generated page. If Fossil does not know for certain that the
HTTP request is coming from a human, then any attempt to access one of
these pages brings up a JavaScript-powered captcha. The user has to
click the accept button on the captcha once, and that sets a cookie allowing
| ︙ | ︙ | |||
of the most recent changes, but timelines of long-ago changes or that
contain lists of file changes or other harder-to-compute values are
prohibited.
* <b>zip →</b>
The special "zip" keyword also matches "/tarball/" and "/sqlar/".
* <b>zipX →</b>
This is like "zip" in that it restricts access to "/zip/", "/tarball/"
and "/sqlar/" but with exceptions:<ol type="a">
<li><p> If the [/help?cmd=robot-zip-leaf|robot-zip-leaf] setting is
true, then tarballs of leaf check-ins are allowed. This permits
URLs that attempt to download the latest check-in on trunk or
from a named branch, for example.
<li><p> If a check-in has a tag that matches the GLOB list in
[/help?cmd=robot-zip-tag|robot-zip-tag], then tarballs of that
check-in are allowed. This allows check-ins tagged with
"release" or "allow-robots" (for example) to be downloaded
without restriction.
</ol>
The "zipX" restriction is not in the default robot-restrict setting.
This is something you might want to add, depending on your needs.
* <b>diff →</b>
This matches /vdiff/ and /fdiff/ and /vpatch/ and any other page that
is primarily about showing the difference between two check-ins or two
file versions.
* <b>annotate →</b>
This also matches /blame/ and /praise/.
Other special keywords may be added in the future.
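
The keyword matching described above can be illustrated with a small sketch. This is not Fossil's implementation (Fossil is written in C). It is a minimal Python model under stated assumptions: the alias table is hypothetical and lists only the pages named in this article, the extra conditions attached to "timelineX" and "zipX" are not modeled, and the page name is assumed to be the first path element of the URI.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical alias table built only from the keywords described above.
# The "diff" entry is incomplete: the article says it also covers other
# diff-like pages that are not named here.
KEYWORD_ALIASES = {
    "zip": ["zip", "tarball", "sqlar"],
    "diff": ["vdiff", "fdiff", "vpatch"],
    "annotate": ["annotate", "blame", "praise"],
}


def first_path_element(uri: str) -> str:
    """Return the first non-empty path element of a URI, or ""."""
    parts = [p for p in urlsplit(uri).path.split("/") if p]
    return parts[0] if parts else ""


def is_restricted(uri: str, robot_restrict: str) -> bool:
    """Model: does any entry of the comma-separated list match the page?"""
    page = first_path_element(uri)
    for entry in robot_restrict.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Special keywords expand to several pages; anything else is
        # treated as a plain GLOB pattern on the page name.
        patterns = KEYWORD_ALIASES.get(entry, [entry])
        if any(fnmatchcase(page, pat) for pat in patterns):
            return True
    return False
```

In this model, a request for "/blame" is restricted when the list contains "annotate", while a page that matches no entry passes through untouched.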
The default [/help?cmd=robot-restrict|robot-restrict]
setting has been shown in practice to do a good job of keeping
robots from consuming all available CPU and bandwidth while
still allowing humans access to the full power of the site without
having to be logged in.
One possible enhancement is to add "zipX" to the
[/help?cmd=robot-restrict|robot-restrict] setting,
and enable [/help?cmd=robot-zip-leaf|robot-zip-leaf]
and configure [/help?cmd=robot-zip-tag|robot-zip-tag].
Do this if you find that robots downloading lots of
obscure tarballs is causing load issues on your site.
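
The two "zipX" exceptions can be summarized as a small decision function. This is a sketch only, not Fossil's code. The function name and its parameters are hypothetical, and the robot-zip-tag value is assumed to be a comma-separated GLOB list.

```python
from fnmatch import fnmatchcase


def zipx_allows_download(is_leaf: bool, tags: list[str],
                         robot_zip_leaf: bool, robot_zip_tag: str) -> bool:
    """Return True if a tarball request is exempt from the zipX restriction.

    is_leaf         whether the requested check-in is a leaf
    tags            tags attached to the requested check-in
    robot_zip_leaf  modeled value of the robot-zip-leaf setting
    robot_zip_tag   modeled value of the robot-zip-tag setting
                    (assumed here to be comma-separated GLOBs)
    """
    # Exception (a): leaf check-ins are allowed when robot-zip-leaf is true.
    if robot_zip_leaf and is_leaf:
        return True
    # Exception (b): any tag matching the robot-zip-tag GLOB list.
    globs = [g.strip() for g in robot_zip_tag.split(",") if g.strip()]
    return any(fnmatchcase(tag, g) for tag in tags for g in globs)
```

With robot-zip-tag modeled as "release,allow-robots", a non-leaf check-in tagged "release" is downloadable, while an untagged, non-leaf check-in still falls under the restriction.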
<h2>Anti-robot Exception RegExps</h2>
The [/help?cmd=robot-exception|robot-exception setting] under the name
of <b>Exceptions to anti-robot restrictions</b> is a list of
[/re_rules|regular expressions], one per line, that match
URIs that will bypass the captcha and allow robots full access. The
|
| ︙ | ︙ |