Overview
| Comment: | Add the "Defense against Spiders" documentation page. |
|---|---|
| SHA1: | 1e26962d04aa914c447243de691fa88d |
| User & Date: | drh 2013-04-09 14:58:17.543 |
Context
2013-04-10
| 07:47 | Fix minor typo on index page. check-in: 68ed364281 user: mistachkin tags: trunk |
2013-04-09
| 14:58 | Add the "Defense against Spiders" documentation page. check-in: 1e26962d04 user: drh tags: trunk |
| 13:30 | Change the default auto-hyperlink-delay from 0 to 10 milliseconds. check-in: ddd1659677 user: drh tags: trunk |
Changes
Added www/antibot.wiki.
<title>Defense Against Spiders</title>
The website presented by a Fossil server has many hyperlinks.
Even a modest project can have millions of pages in its
tree, and many of those pages (for example diffs, annotations,
and ZIP archives of older check-ins) can be expensive to compute.
If a spider or bot tries to walk a website implemented by
Fossil, it can present a crippling bandwidth and CPU load.
The website presented by a Fossil server is intended to be used
interactively by humans, not walked by spiders. This article
describes the techniques used by Fossil to try to welcome human
users while keeping out spiders.
<h2>The "hyperlink" user capability</h2>
Every Fossil web session has a "user". For random passers-by on the internet
(and for spiders) that user is "nobody". The "anonymous" user is also
available for humans who do not wish to identify themselves. The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.
The "h" or "hyperlink" capability is a permission that can be granted
to users that enables the display of hyperlinks. Most of the hyperlinks
generated by Fossil are suppressed if this capability is missing. So
one simple defense against spiders is to disable the "h" permission for
the "nobody" user. This means that users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks. Spiders do not
normally attempt to log into websites and will therefore
not see most of the hyperlinks and will not try to walk the millions of
historical check-ins and diffs available on a Fossil-generated website.
If the "h" capability is missing from user "nobody" but is present for
user "anonymous", then a message automatically appears at the top of each
page inviting the user to log in as anonymous in order to activate hyperlinks.
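
The capability check described above can be sketched in a few lines of Python. This is only an illustration and not Fossil's implementation: the function names, and the idea of passing capabilities around as a plain string of letters, are assumptions made for the example. Only the "h" letter comes from the text; the other letters used when calling these functions are arbitrary placeholders.

```python
# Illustrative sketch only; these names are hypothetical, not Fossil internals.

def can_see_hyperlinks(capabilities: str) -> bool:
    """Return True if the capability string includes "h" (hyperlink)."""
    return "h" in capabilities


def should_invite_anonymous_login(nobody_caps: str, anonymous_caps: str) -> bool:
    """Decide whether to show the "log in as anonymous" banner.

    The banner only helps when "nobody" lacks "h" but "anonymous" has it.
    """
    return (not can_see_hyperlinks(nobody_caps)
            and can_see_hyperlinks(anonymous_caps))
```

With this sketch, a site where "nobody" has capabilities "z" and "anonymous" has "hz" would show the invitation, while a site where both users have "h" would not.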
Removing the "h" capability from user "nobody" is an effective means
of preventing spiders from walking a Fossil-generated website. But
it can also be annoying to humans, since it requires them to log in.
Hence, Fossil provides other techniques for blocking spiders that
are less cumbersome for humans.
<h2>Automatic hyperlinks based on UserAgent</h2>
Fossil has the ability to selectively enable hyperlinks for users
that lack the "h" capability based on their UserAgent string in the
HTTP request header and on the browser's ability to run Javascript.
The UserAgent string is a text identifier that is included in the header
of most HTTP requests that identifies the specific maker and version of
the browser (or spider) that generated the request. Typical UserAgent
strings look like this:
<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>
The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT. The third
example is the spider used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requestor
as human whereas the last two identify the requestor as a spider.
Note that the UserAgent string is completely under the control
of the requestor and so a malicious spider can forge a UserAgent
string that makes it look like a human. But most spiders truly
seem to desire to "play nicely" on the internet and are quite open
about the fact that they are a spider. And so the UserAgent string
provides a good first guess about whether a request originates
from a human or a spider.
In Fossil, under the Admin/Access menu, there is a setting entitled
"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>".
If this setting is enabled, and if the UserAgent string looks like a
human and not a spider, then Fossil will enable hyperlinks even if
the "h" capability is omitted from the user permissions. This setting
gives humans easy access to the hyperlinks while preventing spiders
from walking the millions of pages on a typical Fossil site.
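
A first-guess classifier of this kind might look like the following Python sketch. The rule Fossil actually applies is not given in this article, so the marker list and the "starts with Mozilla/" test below are assumptions chosen only so that the four example strings shown earlier are sorted the way the text describes.

```python
# Hypothetical heuristic for illustration; not the rule Fossil actually uses.
SPIDER_MARKERS = ("bot", "spider", "crawl", "wget", "curl")


def looks_human(user_agent: str) -> bool:
    """Guess whether a UserAgent string belongs to an interactive browser."""
    ua = user_agent.lower()
    # Well-behaved spiders usually announce themselves in the string.
    if any(marker in ua for marker in SPIDER_MARKERS):
        return False
    # Mainstream browsers begin their UserAgent string with "Mozilla/".
    return ua.startswith("mozilla/")
```

Because the UserAgent string is under the requestor's control, a check like this is only a first filter and can be defeated by a forged string.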
But the hyperlinks are not enabled directly with the setting above.
Instead, the HTML code that is generated contains anchor tags ("<a>")
without "href=" attributes. Then, javascript code is added to the
end of the page that goes back and fills in the "href=" attributes of
the anchor tags with the hyperlink targets, thus enabling the hyperlinks.
This extra step of using javascript to enable the hyperlink targets
is a security measure against spiders that forge a human-looking
UserAgent string. Most spiders do not bother to run javascript and
so to the spider the empty anchor tag will be useless. But all modern
web browsers implement javascript, so hyperlinks will appear
normally for human users.
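
The two-step rendering described above can be sketched as a small Python page generator. This is a minimal illustration under stated assumptions, not the code Fossil emits: the function name render_page, the "a0", "a1" anchor ids, and the shape of the generated script are all hypothetical.

```python
import html
import json


def render_page(links):
    """Render anchors without href, plus a trailing script that adds them.

    `links` is a list of (target_url, label) pairs.
    """
    anchors = []
    targets = {}
    for index, (url, label) in enumerate(links):
        anchor_id = f"a{index}"
        # The anchor carries an id but deliberately no href attribute.
        anchors.append(f'<a id="{anchor_id}">{html.escape(label)}</a>')
        targets[anchor_id] = url
    # Escape "</" so a URL cannot terminate the script element early.
    table = json.dumps(targets).replace("</", "<\\/")
    script = (
        "<script>\n"
        f"var targets = {table};\n"
        "for (var id in targets) {\n"
        "  document.getElementById(id).href = targets[id];\n"
        "}\n"
        "</script>"
    )
    return "\n".join(anchors) + "\n" + script
```

A client that parses the HTML without running the script sees only inert anchors, whereas a browser fills in every target as soon as the script at the end of the page runs.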
<h2>Further defenses</h2>
Recently (as of this writing, in the spring of 2013) the Fossil server
on the SQLite website ([http://www.sqlite.org/src/]) has been hit repeatedly
by Chinese spiders that use forged UserAgent strings to make them look
like normal web browsers and which interpret javascript. We do not
believe these attacks to be nefarious since SQLite is public domain
and the attackers could obtain all information they ever wanted to
know about SQLite simply by cloning the repository. Instead, we
believe these "attacks" are coming from "script kiddies". But regardless
of whether or not malice is involved, these attacks do present
an unnecessary load on the server which reduces the responsiveness of
the SQLite website for well-behaved and socially responsible users.
For this reason, additional defenses against
spiders have been put in place.
On the Admin/Access page of Fossil, just below the
"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>"
setting, there are now two additional subsettings that can be optionally
enabled to control hyperlinks.
The first subsetting waits to run the
javascript that sets the "href=" attributes on anchor tags until after
at least one "mouseover" event has been detected on the <body>
element of the page. The thinking here is that spiders will not be
simulating mouse motion and so no mouseover events will ever occur and
hence the hyperlinks will never become enabled for spiders.
The second new subsetting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags. The default value for this
delay is 10 milliseconds. The idea here is that a spider will try to
render the page immediately and will not wait for delayed scripts
to be run, and thus will never enable the hyperlinks.
These two subsettings can be used separately or together. If used together,
then the delay timer does not start until after the first mouse movement
is detected.
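
How the two subsettings interact can be modeled with a small Python simulation. It is a sketch of the timing rule stated above, not of the javascript that Fossil generates: the class name HrefActivator and its methods are hypothetical, and time is passed in explicitly as milliseconds since page load.

```python
class HrefActivator:
    """Toy model of when the "href=" attributes would be filled in."""

    def __init__(self, delay_ms=10, wait_for_mouse=True):
        self.delay_ms = delay_ms
        self.wait_for_mouse = wait_for_mouse
        # Without mouse gating, the delay timer starts at page load (t=0).
        self.timer_started_at = None if wait_for_mouse else 0

    def on_mouseover(self, now_ms):
        """Record the first mouseover; later events change nothing."""
        if self.timer_started_at is None:
            self.timer_started_at = now_ms

    def hrefs_enabled(self, now_ms):
        """Return True once the delay has elapsed since the timer started."""
        if self.timer_started_at is None:
            return False
        return now_ms - self.timer_started_at >= self.delay_ms
```

In this model a client that never produces a mouseover event never gets hyperlinks when mouse gating is on, no matter how long it waits, while a human who moves the mouse at 500 ms sees the links enabled at 510 ms with the default delay.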
<h2>The ongoing struggle</h2>
Fossil currently does a very good job of providing easy access to humans
while keeping out troublesome robots and spiders. However, spiders and
bots continue to grow more sophisticated, requiring ever more advanced
defenses. This "arms race" is unlikely to ever end. The developers of
Fossil will continue to try to improve the spider defenses of Fossil, so
check back from time to time for the latest releases and updates.
Readers of this page who have suggestions on how to improve the spider
defenses in Fossil are invited to submit their ideas to the Fossil Users
mailing list:
[mailto:fossil-users@lists.fossil-scm.org | fossil-users@lists.fossil-scm.org].
Changes to www/index.wiki.
| ︙ | ︙ | |||
Before (lines 129-135):
helps insure project integrity.
* Fossil contains a [./wikitheory.wiki | built-in wiki].
* An [./event.wiki | Event] is a special kind of wiki page associated
with a point in time rather than a name.
* [./settings.wiki | Settings] control the behaviour of fossil.
* [./ssl.wiki | Use SSL] to encrypt communication with the server.
* There is a
After (lines 129-164):
helps insure project integrity.
* Fossil contains a [./wikitheory.wiki | built-in wiki].
* An [./event.wiki | Event] is a special kind of wiki page associated
with a point in time rather than a name.
* [./settings.wiki | Settings] control the behaviour of fossil.
* [./ssl.wiki | Use SSL] to encrypt communication with the server.
* There is a
[http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users | mailing list]
(with publicly readable
[http://www.mail-archive.com/fossil-users@lists.fossil-scm.org | archives]
available for discussing fossil issues.
* [./stats.wiki | Performance statistics] taken from real-world projects
hosted on fossil.
* How to [./shunning.wiki | delete content] from a fossil repository.
* How Fossil does [./password.wiki | password management].
* On-line [/help | help].
* Documentation on the
[http://www.sqliteconcepts.org/THManual.pdf | TH1 Script Language] used
to configure the ticketing subsystem.
* A free hosting server for Fossil repositories is available at
[http://chiselapp.com/].
* How to [./server.wiki | set up a server] for your repository.
* Customizing the [./custom_ticket.wiki | ticket system].
* Methods to [./checkin_names.wiki | identify a specific check-in].
* [./inout.wiki | Import and export] from and to Git.
* [./fossil-v-git.wiki | Fossil versus Git].
* [./fiveminutes.wiki | Up and running in 5 minutes as a single user]
(contributed by Gilles Ganault on 2013-01-08).
* [./antibot.wiki | How Fossil defense against abuse by spiders and bots]
<h3>Links For Fossil Developer:</h3>
* [./contribute.wiki | Contributing] code or documentation to the
Fossil project.
* [./theory1.wiki | Thoughts On The Design Of Fossil].
* [./pop.wiki | Principles Of Operation]
| ︙ | ︙ |
Changes to www/mkindex.tcl.
Before (lines 1-16):
#!/bin/sh
#
# Run this TCL script to generate a WIKI page that contains a
# permuted index of the various documentation files.
#
# tclsh mkindex.tcl >permutedindex.wiki
#
set doclist {
bugtheory.wiki {Bug Tracking In Fossil}
branching.wiki {Branching, Forking, Merging, and Tagging}
build.wiki {Compiling and Installing Fossil}
checkin_names.wiki {Checkin And Version Names}
checkin.wiki {Check-in Checklist}
changes.wiki {Fossil Changelog}
copyright-release.html {Contributor License Agreement}
After (lines 1-17):
#!/bin/sh
#
# Run this TCL script to generate a WIKI page that contains a
# permuted index of the various documentation files.
#
# tclsh mkindex.tcl >permutedindex.wiki
#
set doclist {
antibot.wiki {Defense against Spiders and Bots}
bugtheory.wiki {Bug Tracking In Fossil}
branching.wiki {Branching, Forking, Merging, and Tagging}
build.wiki {Compiling and Installing Fossil}
checkin_names.wiki {Checkin And Version Names}
checkin.wiki {Check-in Checklist}
changes.wiki {Fossil Changelog}
copyright-release.html {Contributor License Agreement}
| ︙ | ︙ |
Changes to www/permutedindex.wiki.
| ︙ | ︙ | |||
Before (lines 10-28):
<li> [/help | Command-line help]
</ul>
<a name="pindex"></a>
<h2>Permuted Index:</h2>
<ul>
<li><a href="fiveminutes.wiki">5 Minutes as a Single User — Update and Running in</a></li>
<li><a href="tech_overview.wiki">A Technical Overview Of The Design And Implementation Of Fossil</a></li>
<li><a href="copyright-release.html">Agreement — Contributor License</a></li>
<li><a href="delta_encoder_algorithm.wiki">Algorithm — Fossil Delta Encoding</a></li>
<li><a href="fiveminutes.wiki">as a Single User — Update and Running in 5 Minutes</a></li>
<li><a href="faq.wiki">Asked Questions — Frequently</a></li>
<li><a href="password.wiki">Authentication — Password Management And</a></li>
<li><a href="private.wiki">Branches — Creating, Syncing, and Deleting Private</a></li>
<li><a href="branching.wiki">Branching, Forking, Merging, and Tagging</a></li>
<li><a href="bugtheory.wiki">Bug Tracking In Fossil</a></li>
<li><a href="makefile.wiki">Build Process — The Fossil</a></li>
<li><a href="changes.wiki">Changelog — Fossil</a></li>
<li><a href="checkin.wiki">Check-in Checklist</a></li>
<li><a href="checkin_names.wiki">Checkin And Version Names</a></li>
After (lines 10-30):
<li> [/help | Command-line help]
</ul>
<a name="pindex"></a>
<h2>Permuted Index:</h2>
<ul>
<li><a href="fiveminutes.wiki">5 Minutes as a Single User — Update and Running in</a></li>
<li><a href="tech_overview.wiki">A Technical Overview Of The Design And Implementation Of Fossil</a></li>
<li><a href="antibot.wiki">against Spiders and Bots — Defense</a></li>
<li><a href="copyright-release.html">Agreement — Contributor License</a></li>
<li><a href="delta_encoder_algorithm.wiki">Algorithm — Fossil Delta Encoding</a></li>
<li><a href="fiveminutes.wiki">as a Single User — Update and Running in 5 Minutes</a></li>
<li><a href="faq.wiki">Asked Questions — Frequently</a></li>
<li><a href="password.wiki">Authentication — Password Management And</a></li>
<li><a href="antibot.wiki">Bots — Defense against Spiders and</a></li>
<li><a href="private.wiki">Branches — Creating, Syncing, and Deleting Private</a></li>
<li><a href="branching.wiki">Branching, Forking, Merging, and Tagging</a></li>
<li><a href="bugtheory.wiki">Bug Tracking In Fossil</a></li>
<li><a href="makefile.wiki">Build Process — The Fossil</a></li>
<li><a href="changes.wiki">Changelog — Fossil</a></li>
<li><a href="checkin.wiki">Check-in Checklist</a></li>
<li><a href="checkin_names.wiki">Checkin And Version Names</a></li>
| ︙ | ︙ | |||
Before (lines 40-53):
<li><a href="copyright-release.html">Contributor License Agreement</a></li>
<li><a href="concepts.wiki">Core Concepts — Fossil</a></li>
<li><a href="newrepo.wiki">Create A New Fossil Repository — How To</a></li>
<li><a href="private.wiki">Creating, Syncing, and Deleting Private Branches</a></li>
<li><a href="qandc.wiki">Criticisms — Questions And</a></li>
<li><a href="custom_ticket.wiki">Customizing The Ticket System</a></li>
<li><a href="tech_overview.wiki">Databases Used By Fossil — SQLite</a></li>
<li><a href="shunning.wiki">Deleting Content From Fossil — Shunning:</a></li>
<li><a href="private.wiki">Deleting Private Branches — Creating, Syncing, and</a></li>
<li><a href="delta_encoder_algorithm.wiki">Delta Encoding Algorithm — Fossil</a></li>
<li><a href="delta_format.wiki">Delta Format — Fossil</a></li>
<li><a href="tech_overview.wiki">Design And Implementation Of Fossil — A Technical Overview Of The</a></li>
<li><a href="theory1.wiki">Design Of The Fossil DVCS — Thoughts On The</a></li>
<li><a href="embeddeddoc.wiki">Documentation — Embedded Project</a></li>
After (lines 42-56):
<li><a href="copyright-release.html">Contributor License Agreement</a></li>
<li><a href="concepts.wiki">Core Concepts — Fossil</a></li>
<li><a href="newrepo.wiki">Create A New Fossil Repository — How To</a></li>
<li><a href="private.wiki">Creating, Syncing, and Deleting Private Branches</a></li>
<li><a href="qandc.wiki">Criticisms — Questions And</a></li>
<li><a href="custom_ticket.wiki">Customizing The Ticket System</a></li>
<li><a href="tech_overview.wiki">Databases Used By Fossil — SQLite</a></li>
<li><a href="antibot.wiki">Defense against Spiders and Bots</a></li>
<li><a href="shunning.wiki">Deleting Content From Fossil — Shunning:</a></li>
<li><a href="private.wiki">Deleting Private Branches — Creating, Syncing, and</a></li>
<li><a href="delta_encoder_algorithm.wiki">Delta Encoding Algorithm — Fossil</a></li>
<li><a href="delta_format.wiki">Delta Format — Fossil</a></li>
<li><a href="tech_overview.wiki">Design And Implementation Of Fossil — A Technical Overview Of The</a></li>
<li><a href="theory1.wiki">Design Of The Fossil DVCS — Thoughts On The</a></li>
<li><a href="embeddeddoc.wiki">Documentation — Embedded Project</a></li>
| ︙ | ︙ | |||
Before (lines 125-138):
<li><a href="selfcheck.wiki">Self Checks — Fossil Repository Integrity</a></li>
<li><a href="selfhost.wiki">Self Hosting Repositories — Fossil</a></li>
<li><a href="server.wiki">Server — How To Configure A Fossil</a></li>
<li><a href="settings.wiki">Settings — Fossil</a></li>
<li><a href="shunning.wiki">Shunning: Deleting Content From Fossil</a></li>
<li><a href="fiveminutes.wiki">Single User — Update and Running in 5 Minutes as a</a></li>
<li><a href="style.wiki">Source Code Style Guidelines</a></li>
<li><a href="tech_overview.wiki">SQLite Databases Used By Fossil</a></li>
<li><a href="ssl.wiki">SSL with Fossil — Using</a></li>
<li><a href="quickstart.wiki">Start Guide — Fossil Quick</a></li>
<li><a href="stats.wiki">Statistics — Performance</a></li>
<li><a href="style.wiki">Style Guidelines — Source Code</a></li>
<li><a href="foss-cklist.wiki">Successful Open-Source Projects — Checklist For</a></li>
<li><a href="sync.wiki">Sync Protocol — The Fossil</a></li>
After (lines 128-142):
<li><a href="selfcheck.wiki">Self Checks — Fossil Repository Integrity</a></li>
<li><a href="selfhost.wiki">Self Hosting Repositories — Fossil</a></li>
<li><a href="server.wiki">Server — How To Configure A Fossil</a></li>
<li><a href="settings.wiki">Settings — Fossil</a></li>
<li><a href="shunning.wiki">Shunning: Deleting Content From Fossil</a></li>
<li><a href="fiveminutes.wiki">Single User — Update and Running in 5 Minutes as a</a></li>
<li><a href="style.wiki">Source Code Style Guidelines</a></li>
<li><a href="antibot.wiki">Spiders and Bots — Defense against</a></li>
<li><a href="tech_overview.wiki">SQLite Databases Used By Fossil</a></li>
<li><a href="ssl.wiki">SSL with Fossil — Using</a></li>
<li><a href="quickstart.wiki">Start Guide — Fossil Quick</a></li>
<li><a href="stats.wiki">Statistics — Performance</a></li>
<li><a href="style.wiki">Style Guidelines — Source Code</a></li>
<li><a href="foss-cklist.wiki">Successful Open-Source Projects — Checklist For</a></li>
<li><a href="sync.wiki">Sync Protocol — The Fossil</a></li>
| ︙ | ︙ |