Fossil

Check-in [14e23927ce]
Overview
Comment: Updates to the antibot.wiki page, to discuss the latest enhancements to robot defenses.
SHA3-256: 14e23927cea4e56aea3ec50e7de9251bd80530bea902cad175405fc66f246efd
User & Date: drh 2025-10-09 13:35:20.340
Context
2025-10-09
18:22
stash drop help tweak suggested in [forum:d5c5c0f980|forum post d5c5c0f980]. ... (check-in: e2783d0789 user: stephan tags: trunk)
13:35
Updates to the antibot.wiki page, to discuss the latest enhancements to robot defenses. ... (check-in: 14e23927ce user: drh tags: trunk)
12:55
New settings to allow robots to download tarballs but only if the corresponding check-in is a leaf or if it has a tag like "release" or "allow-robots". New settings control all of the above. ... (check-in: 4d198d0e12 user: drh tags: trunk)
Changes
Changes to www/antibot.wiki.
@@ -1,21 +1,30 @@
 <title>Defense Against Robots</title>
 
 A typical Fossil website can have billions and billions of pages,
 and many of those pages (for example diffs and annotations and tarballs)
 can be expensive to compute.
 If a robot walks a Fossil-generated website,
 it can present a crippling bandwidth and CPU load.
+A "robots.txt" file can help, but in practice, most robots these
+days ignore the robots.txt file, so it won't help much.
 
 A Fossil website is intended to be used
 interactively by humans, not walked by robots.  This article
 describes the techniques used by Fossil to try to welcome human
 users while keeping out robots.
 
+<h2>Defenses Are Enabled By Default</h2>
+
+In the latest implementations of Fossil, most robot defenses are
+enabled by default.  You can probably get by with standing up a
+public-facing Fossil instance in the default configuration.  But
+you can also customize the defenses to serve your particular needs.
+
-<h2>Setting Up Anti-Robot Defenses</h2>
+<h2>Customizing Anti-Robot Defenses</h2>
 
 Admin users can configure robot defenses on the
 "Robot Defense Settings" page (/setup_robot).
 That page is accessible (to Admin users) from the default menu bar
 by clicking on the "Admin" menu choice, then selecting the
 "Robot-Defense" link from the list.

@@ -128,15 +137,15 @@
 <h2>Do Not Allow Robot Access To Certain Pages</h2>
 
 The [/help?cmd=robot-restrict|robot-restrict setting] is a comma-separated
 list of GLOB patterns for pages for which robot access is prohibited.
 The default value is:
 
 <blockquote><pre>
-timelineX,diff,annotate,zip,fileage,file,finfo,reports
+timelineX,diff,annotate,fileage,file,finfo,reports
 </pre></blockquote>
 
 Each entry corresponds to the first path element on the URI for a
 Fossil-generated page.  If Fossil does not know for certain that the
 HTTP request is coming from a human, then any attempt to access one of
 these pages brings up a javascript-powered captcha.  The user has to
 click the accept button on the captcha once, and that sets a cookie allowing
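The first-path-element GLOB matching described above can be sketched in Python. This is a hypothetical re-implementation for illustration only (Fossil itself is written in C), and special keywords such as "timelineX" and "zipX" receive extra handling that is not shown here:

```python
# Sketch (not Fossil's actual code) of the robot-restrict check:
# the comma-separated GLOB list is matched against the first path
# element of the request URI, as described above.
from fnmatch import fnmatch
from urllib.parse import urlparse

# Default robot-restrict value from the text above.
ROBOT_RESTRICT = "timelineX,diff,annotate,fileage,file,finfo,reports"

def needs_captcha(uri: str, setting: str = ROBOT_RESTRICT) -> bool:
    """Return True if an unverified (possibly robot) client requesting
    this URI should first be shown the JavaScript captcha."""
    first = urlparse(uri).path.lstrip("/").split("/", 1)[0]
    return any(fnmatch(first, pat) for pat in setting.split(","))

print(needs_captcha("/diff?from=abc&to=def"))  # diff pages are restricted
print(needs_captcha("/timeline"))              # plain recent timeline passes
```

Note that as a plain GLOB, "timelineX" does not match the "timeline" path element, which is consistent with the behavior described below: only the more expensive timeline variants are restricted.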
@@ -154,30 +163,52 @@
      of the most recent changes, but timelines of long-ago changes or that
      contain lists of file changes or other harder-to-compute values are
      prohibited.
 
   *  <b>zip &rarr;</b>
      The special "zip" keyword also matches "/tarball/" and "/sqlar/".
 
+  *  <b>zipX &rarr;</b>
+     This is like "zip" in that it restricts access to "/zip/", "/tarball/",
+     and "/sqlar/", but with exceptions:<ol type="a">
+     <li><p> If the [/help?cmd=robot-zip-leaf|robot-zip-leaf] setting is
+             true, then tarballs of leaf check-ins are allowed.  This permits
+             URLs that attempt to download the latest check-in on trunk or
+             from a named branch, for example.
+     <li><p> If a check-in has a tag that matches the GLOB list in
+             [/help?cmd=robot-zip-tag|robot-zip-tag], then tarballs of that
+             check-in are allowed.  This allows check-ins tagged with
+             "release" or "allow-robots" (for example) to be downloaded
+             without restriction.
+     </ol>
+     The "zipX" restriction is not in the default robot-restrict setting.
+     This is something you might want to add, depending on your needs.
+
   *  <b>diff &rarr;</b>
      This matches /vdiff/ and /fdiff/ and /vpatch/ and any other page that
      is primarily about showing the difference between two check-ins or two
      file versions.
 
   *  <b>annotate &rarr;</b>
      This also matches /blame/ and /praise/.
 
 Other special keywords may be added in the future.
 
 The default [/help?cmd=robot-restrict|robot-restrict]
-setting has been shown in practice to do a great job of keeping
+setting has been shown in practice to do a good job of keeping
 robots from consuming all available CPU and bandwidth while
 still allowing humans access to the full power of the site without
 having to be logged in.
 
+One possible enhancement is to add "zipX" to the
+[/help?cmd=robot-restrict|robot-restrict] setting,
+enable [/help?cmd=robot-zip-leaf|robot-zip-leaf],
+and configure [/help?cmd=robot-zip-tag|robot-zip-tag].
+Do this if you find that robots downloading lots of
+obscure tarballs is causing load issues on your site.
+
 <h2>Anti-robot Exception RegExps</h2>
 
 The [/help?cmd=robot-exception|robot-exception setting], under the name
 <b>Exceptions to anti-robot restrictions</b>, is a list of
 [/re_rules|regular expressions], one per line, that match
 URIs that will bypass the captcha and allow robots full access.  The
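The "zipX" download policy described above can also be sketched in Python. This is a hypothetical illustration, not Fossil's implementation; the setting values shown are example configuration, not defaults:

```python
# Sketch (not Fossil's actual code) of the "zipX" tarball policy:
# an unverified client may download a tarball only when the check-in
# is a leaf (robot-zip-leaf) or carries a tag matching the
# robot-zip-tag GLOB list.
from fnmatch import fnmatch

ROBOT_ZIP_LEAF = True                   # assumed setting values,
ROBOT_ZIP_TAG = "release,allow-robots"  # for illustration only

def tarball_allowed(is_leaf: bool, tags: list) -> bool:
    """Return True if a robot may fetch this check-in's tarball."""
    if ROBOT_ZIP_LEAF and is_leaf:
        return True
    pats = ROBOT_ZIP_TAG.split(",")
    return any(fnmatch(t, p) for t in tags for p in pats)

print(tarball_allowed(True, []))                # latest check-in on a branch
print(tarball_allowed(False, ["release"]))      # tagged release build
print(tarball_allowed(False, ["experimental"])) # blocked; captcha required
```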