
Anti-crawler guide: using Apache/Nginx/PHP to block certain User Agents from crawling your site

   

Last night I opened the BaoTa panel dashboard and saw upstream and downstream traffic running at nearly 7,000 KB/s; in less than half an hour, 10 GB of traffic had been consumed. At first I suspected the server had been hacked, so I reinstalled the system and restored my data, but the problem persisted.

After installing a site traffic-monitoring plugin, I found the culprits: Googlebot and a handful of unidentified IPs were hammering one of my sites. Googlebot was by far the worst, crawling nonstop!

Method 1: Block with the Cloudflare CDN firewall

Firewall rule expression:

(http.user_agent contains "Googlebot") or (http.user_agent contains "SemrushBot") or (http.user_agent contains "AhrefsBot") or (http.user_agent contains "DotBot")
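In the Cloudflare dashboard, this expression goes into a custom firewall rule with the action set to Block. If you prefer to script it, here is a minimal sketch assuming the legacy Cloudflare Firewall Rules API; ZONE_ID and API_TOKEN are placeholders you must replace with your own zone ID and API token:

# Sketch only: assumes the legacy Firewall Rules API endpoint is still available.
curl -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/firewall/rules" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '[{"filter":{"expression":"(http.user_agent contains \"Googlebot\") or (http.user_agent contains \"SemrushBot\") or (http.user_agent contains \"AhrefsBot\") or (http.user_agent contains \"DotBot\")"},"action":"block","description":"Block junk crawler UAs"}]'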

Method 2: Deny specific UAs (User Agents) in Nginx

With the BaoTa panel, the steps are as follows:

1. In the directory /www/server/nginx/conf, create a new file named agent_deny.conf.

The code is as follows:

# Block scraping tools such as Scrapy, curl, and HttpClient
if ($http_user_agent ~* "(Scrapy|Curl|HttpClient)") {
    return 403;
}
# Block the listed UAs, as well as requests with an empty UA
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|DotBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|Googlebot|JikeSpider|SemrushBot|^$") {
    return 403;
}
# Block any request method other than GET, HEAD, or POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}

2. In the site settings, around line 7 of the site's configuration, add: include agent_deny.conf;
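For reference, the include ends up inside the site's server block, roughly like this (example.com is a placeholder for your own domain):

server {
    listen 80;
    server_name example.com;    # placeholder: your own domain

    # Pull in the UA blacklist created in step 1
    include agent_deny.conf;

    # ... the rest of the site configuration ...
}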

Note: if your site publishes content through the LocoySpider (火车头) collector, the code above will respond with 403 and publishing will fail, because LocoySpider sends an empty User-Agent. If you want to keep publishing with LocoySpider, use the following code instead:

# Block scraping tools such as Scrapy, curl, and HttpClient
if ($http_user_agent ~* "(Scrapy|Curl|HttpClient)") {
    return 403;
}
# Block the listed UAs. Empty-UA requests are allowed through, so tools like LocoySpider can still publish.
if ($http_user_agent ~ "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|YandexBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|Googlebot|JikeSpider|SemrushBot") {
    return 403;
}
# Block any request method other than GET, HEAD, or POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
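Whichever variant you use, check the configuration for syntax errors and reload Nginx afterwards so the rules take effect:

# Validate the configuration, then reload Nginx
nginx -t && nginx -s reload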

3. Verifying the effect

If you are on a VPS, testing is very simple: use curl -A to spoof the crawler's User-Agent. For example, to simulate Googlebot, Baiduspider, and an empty UA (replace xxx.com with your own domain):

curl -I -A 'Googlebot' xxx.com
curl -I -A 'Baiduspider' xxx.com
curl -I -A '' xxx.com

Googlebot and the empty-UA request both come back with a 403 Forbidden, while Baiduspider gets a 200, which shows the rules are working.
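For reference, the responses should look roughly like this (the exact protocol version and headers depend on your server):

$ curl -I -A 'Googlebot' xxx.com
HTTP/1.1 403 Forbidden
$ curl -I -A 'Baiduspider' xxx.com
HTTP/1.1 200 OK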

4. Junk spider IP ranges

Googlebot ranges to block outright:

66.249.79.0/24
66.249.68.0/24

AhrefsBot range to block:

51.222.253.0/24

Unidentified IPs:

39.173.116.0/24
85.208.96.0/24
185.191.171.0/24

There is even a Huawei spider (PetalBot)! Still deciding whether to block it:

114.119.130.102
114.119.135.32
114.119.0.0/16
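One way to drop these ranges at the Nginx level is with deny directives. A minimal sketch, using a hypothetical deny_ips.conf that you would include in the server block just like agent_deny.conf:

# /www/server/nginx/conf/deny_ips.conf (hypothetical file name)
deny 66.249.79.0/24;     # Googlebot
deny 66.249.68.0/24;     # Googlebot
deny 51.222.253.0/24;    # AhrefsBot
deny 39.173.116.0/24;    # unidentified
deny 85.208.96.0/24;     # unidentified
deny 185.191.171.0/24;   # unidentified
# Huawei spider, uncomment if you decide to block it:
# deny 114.119.0.0/16;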

5. Blocking junk spiders with robots.txt

Keep in mind that robots.txt is purely advisory: compliant crawlers such as Googlebot honor it, but many junk spiders ignore it, so the UA and IP rules above remain the actual enforcement.

User-agent: Googlebot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: SemrushBot-SA
Disallow: /
User-agent: SemrushBot-BA
Disallow: /
User-agent: SemrushBot-SI
Disallow: /
User-agent: SemrushBot-SWA
Disallow: /
User-agent: SemrushBot-CT
Disallow: /
User-agent: SemrushBot-BM
Disallow: /
User-agent: SemrushBot-SEOAB
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: Uptimebot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: ZoominfoBot
Disallow: /
User-agent: Mail.Ru
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: ExtLinksBot
Disallow: /
User-agent: aiHitBot
Disallow: /
User-agent: Researchscan
Disallow: /
User-agent: DnyzBot
Disallow: /
User-agent: spbot
Disallow: /
User-agent: YandexBot
Disallow: /
User-agent: *
Disallow:
CC BY-NC-SA 4.0; please attribute when reposting.
Last updated 2024-11-20 16:13