昨晚打开宝塔面板后台查看,上行与下行几乎到了7000KB的速度,不到半小时,就10G流量了,开始以为是被黑了?重载系统,恢复数据后还是一样。
下载一个网站监控报表插件一看,好家伙,是谷歌蜘蛛和一些不明ip在搞我其中一个网站!!!最狠的就是谷歌蜘蛛了,疯狂在爬!
方法一:cloudflare cdn 防火墙阻止
表达式:
(http.user_agent contains “Googlebot”) or (http.user_agent contains “SemrushBot”) or (http.user_agent contains “AhrefsBot”) or (http.user_agent contains “DotBot”)
方法二:禁止指定UA(用户代理)
宝塔面板下使用方法如下:
1、找到文件目录 /www/server/nginx/conf
文件夹下面,新建一个文件agent_deny.conf
代码如下:
#禁止Scrapy等工具的抓取
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}
#禁止指定UA及UA为空的访问
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|DotBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|Googlebot|JikeSpider|SemrushBot|^$" ) {
return 403;
}
#禁止非GET|HEAD|POST方式的抓取
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}
2、找到网站设置里面的第7行左右 写入代码:include agent_deny.conf;
如果你网站使用火车头采集发布,使用以上代码会返回403错误,发布不了的。如果想使用火车头采集发布,请使用下面的代码:
#禁止Scrapy等工具的抓取
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}
#禁止指定UA访问。UA为空的可以访问,比如火车头可以正常发布。
if ($http_user_agent ~ "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|YandexBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|Googlebot|JikeSpider|SemrushBot" ) {
return 403;
}
#禁止非GET|HEAD|POST方式的抓取
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}
三、测试效果
如果是vps,那非常简单,使用curl -A 模拟抓取即可,比如: 模拟谷歌蜘蛛抓取:替换为你的域名
curl -I -A 'Googlebot' xxx.com
curl -I -A 'Baiduspider' xxx.com
谷歌蜘蛛和UA为空的返回是403禁止访问标识,而百度蜘蛛则成功返回200,说明生效!
四、垃圾蜘蛛IP段
直接屏蔽:谷歌蜘蛛
66.249.79.0/24
66.249.68.0/24
屏蔽AhrefsBot蜘蛛:
51.222.253.0/24
不明的IP:
39.173.116.0/24
85.208.96.0/24
185.191.171.0/24
居然还有华为蜘蛛!!!是否屏蔽思考中。。。
114.119.130.102
114.119.135.32
114.119.0.0/16
robots.txt 屏蔽垃圾蜘蛛
User-agent: Googlebot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: SemrushBot-SA
Disallow: /
User-agent: SemrushBot-BA
Disallow: /
User-agent: SemrushBot-SI
Disallow: /
User-agent: SemrushBot-SWA
Disallow: /
User-agent: SemrushBot-CT
Disallow: /
User-agent: SemrushBot-BM
Disallow: /
User-agent: SemrushBot-SEOAB
Disallow: /
user-agent: AhrefsBot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: Uptimebot
Disallow: /
User-agent: MegaIndex.ru
Disallow: /
User-agent: ZoominfoBot
Disallow: /
User-agent: Mail.Ru
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: ExtLinksBot
Disallow: /
User-agent: aiHitBot
Disallow: /
User-agent: Researchscan
Disallow: /
User-agent: DnyzBot
Disallow: /
User-agent: spbot
Disallow: /
User-agent: YandexBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: SemrushBot-SA
Disallow: /
User-agent: SemrushBot-BA
Disallow: /
User-agent: SemrushBot-SI
Disallow: /
User-agent: SemrushBot-SWA
Disallow: /
User-agent: SemrushBot-CT
Disallow: /
User-agent: SemrushBot-BM
Disallow: /
User-agent: SemrushBot-SEOAB
Disallow: /
User-agent: *
Disallow: