DamonHD
Posts:6158
Moderator
Member since: 2006-11-30
Subject: How to ban the bots? UA, IP or behaviour?
Hi,

How do you ban bad bots?

Do you use the UA (User-Agent) string?

Do you use the IP address?

Do you use the bot's behaviour?

I use all three, but in different ways. None of them is foolproof, but overall they seem to do most of what I need with minimal manual 'whack-a-mole' effort.

I show reduced content to all bots that I recognise by UA, with a few exceptions. One clue, for example, is anything with the letters "bot" in the UA!

I use the IP address to ban downloading of significant content from known-bad IP addresses (almost entirely checked against SPAMHAUS's SBL list automatically).
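
A minimal sketch of how an automatic SBL check can work, assuming the standard DNSBL convention (reverse the IPv4 octets, append the list's zone, and look the name up in DNS). The function names are hypothetical and this is not necessarily how the check above is implemented; it also ignores IPv6 and caching.

```python
import ipaddress
import socket

SBL_ZONE = "sbl.spamhaus.org"  # DNS zone for the Spamhaus SBL


def dnsbl_query_name(ip: str, zone: str = SBL_ZONE) -> str:
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str = SBL_ZONE) -> bool:
    """Return True if the list answers with a 127.0.0.x address for this IP."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no record (not listed) or the lookup itself failed
    # Assumption: only 127.0.0.x is treated as a listing; any other answer
    # (for example an error code from the list operator) is ignored.
    return answer.startswith("127.0.0.")
```

For a busy site the result would need caching per IP, and the list operator's usage terms apply to the lookups.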

But I use behaviour as the most potent way to shut badly-behaved bots out entirely, eg trying to open many many connections at once, or ceaseless downloading where a human would have at least had to take a "comfort break".
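
One way the "ceaseless downloading" test could be approximated is a sliding-window request counter per IP. This is only a sketch under assumptions: the class name, the limits, and the in-memory storage are all hypothetical, and it counts requests over time rather than simultaneous connections.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class RateWatcher:
    """Flag clients that make more than max_requests within window_seconds."""

    def __init__(self, max_requests: int = 60, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Record one request; return False while the client is over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        return len(hits) <= self.max_requests
```

A human reader who pauses between pages stays under the limit, while a client that never pauses keeps the window full and keeps getting refused.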

All three tests are applied to every incoming connection on some of my sites.

Rgds

Damon
November 03, 2007 10:59AM
Joshua
Posts:2831
Administrator
Member since: 2007-03-16
Subject: Re: How to ban the bots? UA, IP or behaviour?
On my other forum there are three steps:

- By domain. Don't need members from .ru on my Dutch forum.

- A check against fields that are by default available in the sign-up process for the board SW but not on mine.

- A modified Submit button that sends a unique code. A bot would need to read the page to get that code. Fortunately most don't.

All three are anti-spam measures. I don't do anything against scraping. But most scrapers target some more commercially interesting fields (just a wild guess).
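
The second and third checks in the list above could be sketched roughly as follows. Everything named here is an assumption for illustration: the field names, the per-session token scheme, and the function names are hypothetical and not taken from any particular board software.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)   # hypothetical per-server secret
HONEYPOT_FIELDS = {"website", "icq"}   # hypothetical stock fields the real form omits


def make_token(session_id: str) -> str:
    """Unique code embedded in the page and sent back by the Submit button."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()


def looks_like_bot(form: dict, session_id: str) -> bool:
    """True if the sign-up filled a field the form never showed, or lacks a valid code."""
    if any(form.get(name) for name in HONEYPOT_FIELDS):
        return True
    token = form.get("token", "")
    expected = make_token(session_id)
    return not hmac.compare_digest(token.encode(), expected.encode())
```

A bot that posts the board software's stock field set trips the first test, and one that never fetched the page has no valid code for the second.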
November 03, 2007 11:56AM
GegaBit
Posts:3311
Senior member
Member since: 2006-11-30
Subject: Re: How to ban the bots? UA, IP or behaviour?
Thanks Damon,
Here is what I currently do:
1- Ban known bad bots by user agent inside my .htaccess file



2- Catch badly behaving bots in a bot trap that serves them endless bogus html files with endless bogus email addresses
(Wish I could find a script that would automatically ban repeated IPs from the above bot trap log)

3- Ban by IP manually while monitoring the log file (The most exhausting and nerve-racking part: checking my stats every half hour. I block over 100 IPs per day!)
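
The script wished for in item 2 might look something like this. It is a sketch that assumes the bot trap log is in an Apache-style format where the client IP is the first whitespace-separated field; the function names and the threshold are hypothetical.

```python
import ipaddress
from collections import Counter
from typing import Iterable, List


def repeat_offenders(log_lines: Iterable[str], min_hits: int = 3) -> List[str]:
    """Return IPs that appear at least min_hits times in the trap log."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        try:
            ip = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; skip the line
        counts[str(ip)] += 1
    return sorted(ip for ip, n in counts.items() if n >= min_hits)


def deny_lines(ips: Iterable[str]) -> List[str]:
    """Format the IPs as .htaccess 'Deny from' directives."""
    return [f"Deny from {ip}" for ip in ips]
```

The generated lines would still have to be written into the .htaccess access block, for example from a cron job, and it is worth reviewing them by hand at first in case a legitimate crawler wandered into the trap.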

Quote:
eg trying to open many many connections at once
Exactly how is this done technically, what do I need to install/download/configure to get this one working?
November 03, 2007 12:26PM
GegaBit
Posts:3311
Senior member
Member since: 2006-11-30
Subject: Re: How to ban the bots? UA, IP or behaviour?
To help others, here is my .htaccess file (located inside your public html directory). It is built from years of log monitoring and, I would say, keeps out a fair number of scraping kiddies:

1- Ban known bad bots by user agent inside my .htaccess file


Quote:
SetEnvIfNoCase User-Agent "accelobot" bad_bot
SetEnvIfNoCase User-Agent "AnanziSpider5" bad_bot
SetEnvIfNoCase User-Agent "Anonymouse" bad_bot
SetEnvIfNoCase User-Agent "atSpider" bad_bot
SetEnvIfNoCase User-Agent "Btsearch" bad_bot
SetEnvIfNoCase User-Agent "cazoodle" bad_bot
SetEnvIfNoCase User-Agent "comagent" bad_bot
SetEnvIfNoCase User-Agent "downloader" bad_bot
SetEnvIfNoCase User-Agent "dragon" bad_bot
SetEnvIfNoCase User-Agent "DTS" bad_bot
SetEnvIfNoCase User-Agent "dumping" bad_bot
SetEnvIfNoCase User-Agent "EmbeddedWB" bad_bot
SetEnvIfNoCase User-Agent "Exabot" bad_bot
SetEnvIfNoCase User-Agent "Extractor" bad_bot
SetEnvIfNoCase User-Agent "FAST" bad_bot
SetEnvIfNoCase User-Agent "Fetch" bad_bot
SetEnvIfNoCase User-Agent "Framework" bad_bot
SetEnvIfNoCase User-Agent "FrontPage" bad_bot
SetEnvIfNoCase User-Agent "Gaisbot" bad_bot
SetEnvIfNoCase User-Agent "getinfo" bad_bot
SetEnvIfNoCase User-Agent "H010818" bad_bot
SetEnvIfNoCase User-Agent "Harry" bad_bot
SetEnvIfNoCase User-Agent "heritrix" bad_bot
SetEnvIfNoCase User-Agent "holmes" bad_bot
SetEnvIfNoCase User-Agent "houxou" bad_bot
SetEnvIfNoCase User-Agent "HTTrack" bad_bot
SetEnvIfNoCase User-Agent "IEAutoDiscovery" bad_bot
SetEnvIfNoCase User-Agent "Indy" bad_bot
SetEnvIfNoCase User-Agent "iRc" bad_bot
SetEnvIfNoCase User-Agent "java" bad_bot
SetEnvIfNoCase User-Agent "lanshanbot" bad_bot
SetEnvIfNoCase User-Agent "litefinder" bad_bot
SetEnvIfNoCase User-Agent "Microsoft" bad_bot
SetEnvIfNoCase User-Agent "MSIECrawler" bad_bot
SetEnvIfNoCase User-Agent "mywebsite" bad_bot
SetEnvIfNoCase User-Agent "Nutch" bad_bot
SetEnvIfNoCase User-Agent "Pagebull" bad_bot
SetEnvIfNoCase User-Agent "picsearch" bad_bot
SetEnvIfNoCase User-Agent "Pingdom" bad_bot
SetEnvIfNoCase User-Agent "PlantyNet" bad_bot
SetEnvIfNoCase User-Agent "RedBot" bad_bot
SetEnvIfNoCase User-Agent "scanner" bad_bot
SetEnvIfNoCase User-Agent "sensis" bad_bot
SetEnvIfNoCase User-Agent "sharp" bad_bot
SetEnvIfNoCase User-Agent "Shinchakubin" bad_bot
SetEnvIfNoCase User-Agent "Snapbot" bad_bot
SetEnvIfNoCase User-Agent "snprtz" bad_bot
SetEnvIfNoCase User-Agent "SuperCleaner" bad_bot
SetEnvIfNoCase User-Agent "Swapper" bad_bot
SetEnvIfNoCase User-Agent "TencentTraveler" bad_bot
SetEnvIfNoCase User-Agent "Test" bad_bot
SetEnvIfNoCase User-Agent "Transcoder" bad_bot
SetEnvIfNoCase User-Agent "turnitin" bad_bot
SetEnvIfNoCase User-Agent "Twiceler" bad_bot
SetEnvIfNoCase User-Agent "University" bad_bot
SetEnvIfNoCase User-Agent "voyager" bad_bot
SetEnvIfNoCase User-Agent "WebStripper" bad_bot
SetEnvIfNoCase User-Agent "WinHTTP" bad_bot
SetEnvIfNoCase User-Agent "WinuE" bad_bot
SetEnvIfNoCase User-Agent "worio" bad_bot
SetEnvIfNoCase User-Agent "WOW64" bad_bot
SetEnvIfNoCase User-Agent "XX" bad_bot
SetEnvIfNoCase User-Agent "yetibot" bad_bot
SetEnvIfNoCase User-Agent "YodaoBot" bad_bot
SetEnvIfNoCase User-Agent "Zango" bad_bot
SetEnvIfNoCase User-Agent "ZSEBOT" bad_bot
SetEnvIfNoCase User-Agent "^almaden" bad_bot
SetEnvIfNoCase User-Agent "^bot" bad_bot
SetEnvIfNoCase User-Agent "^CFNetwork" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^Convera" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^findlinks" bad_bot
SetEnvIfNoCase User-Agent "^Firefox$" bad_bot
SetEnvIfNoCase User-Agent "^genieBot" bad_bot
SetEnvIfNoCase User-Agent "^gigabot" bad_bot
SetEnvIfNoCase User-Agent "^HappyFunBot" bad_bot
SetEnvIfNoCase User-Agent "^IRLbot" bad_bot
SetEnvIfNoCase User-Agent "^ISC" bad_bot
SetEnvIfNoCase User-Agent "^jakarta" bad_bot
SetEnvIfNoCase User-Agent "^larbin" bad_bot
SetEnvIfNoCase User-Agent "^link" bad_bot
SetEnvIfNoCase User-Agent "^lwp" bad_bot
SetEnvIfNoCase User-Agent "^Mozilla/3\.0\ \(compatible\)$" bad_bot
SetEnvIfNoCase User-Agent "^Mozilla/4\.0\ \(compatible\;\)$" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Ocelli" bad_bot
SetEnvIfNoCase User-Agent "^Outfox" bad_bot
SetEnvIfNoCase User-Agent "^Pagebull" bad_bot
SetEnvIfNoCase User-Agent "^PHPCrawl" bad_bot
SetEnvIfNoCase User-Agent "^RufusBot" bad_bot
SetEnvIfNoCase User-Agent "^ShopWiki" bad_bot
SetEnvIfNoCase User-Agent "^Snapbot" bad_bot
SetEnvIfNoCase User-Agent "^Sogou" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^user" bad_bot
SetEnvIfNoCase User-Agent "^Webaroo" bad_bot
SetEnvIfNoCase User-Agent "^WebCopier" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^ZoomSpider" bad_bot
SetEnvIfNoCase User-Agent ^$ bad_bot
SetEnvIfNoCase User-Agent ^-$ bad_bot

<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>


The only problem is that those who forge the user agent and appear as a normal browser in the logs are not caught; this needs a script that is beyond my capabilities.
November 03, 2007 12:32PM
DamonHD
Posts:6158
Moderator
Member since: 2006-11-30
Subject: Re: How to ban the bots? UA, IP or behaviour?
GB, my list has a lot in common with yours!

Rgds

Damon
November 03, 2007 06:25PM
GegaBit
Posts:3311
Senior member
Member since: 2006-11-30
Subject: Re: How to ban the bots? UA, IP or behaviour?
Damon,
How exactly do you catch those that forge the user agent to one that is common? What script do you use to identify & block them?
November 04, 2007 05:10PM
DamonHD
Posts:6158
Moderator
Member since: 2006-11-30
Subject: Re: How to ban the bots? UA, IP or behaviour?
Hi,

I can't, which is why I *also* have the other two methods, eg if the request comes from a known SPAMmer or the client is sucking pages like gravity is going out of fashion, then a forged UA isn't going to help them much...

Look at an engineering graph of a Safe Operating Area (SOA) and you'll see all sorts of bits chopped off here and there, and what's left is probably OK. That's what I'm doing: not relying on just one defence.
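
To make the layering concrete, here is a small sketch of how the three tests might combine into one decision. The function name, the outcome labels, and the order of the checks are assumptions for illustration, not a description of the actual setup.

```python
import re

BOT_UA = re.compile(r"bot", re.IGNORECASE)  # the "bot" clue mentioned earlier in the thread


def classify(user_agent: str, ip_listed: bool, over_rate_limit: bool) -> str:
    """Combine behaviour, IP and UA checks; the strictest matching rule wins."""
    if over_rate_limit:
        return "block"      # badly behaved clients are shut out entirely
    if ip_listed:
        return "minimal"    # known-bad IPs get no significant content
    if BOT_UA.search(user_agent):
        return "reduced"    # recognised bots get reduced content
    return "full"
```

A forged browser UA only gets past the last test; the behaviour and IP checks do not look at the UA at all.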

Rgds

Damon



Edited 1 time(s). Last edit at 11/04/2007 09:38PM by DamonHD.
November 04, 2007 09:36PM