
Blocking Marketing Spiders from Crawling the Site

Today I was looking through the site's access logs and noticed a spider with the User Agent Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) crawling the site extremely frequently, and judging from the twenty-odd MB of logs, it had been at it for quite some time.

Site access log

Judging by the pages it requests, this spider seems to take parameters from previously crawled pages, recombine them at random, and crawl the results, so the URLs it hits are long strings that almost all return 404. With a request landing every few seconds, the legitimate entries in the log were basically drowned in tens of thousands of lines of junk-spider records.

At first I assumed that, being a spider, it would respect robots.txt, so adding a rule there seemed like enough:

User-agent: SemrushBot
Disallow: /

A quick web search revealed that this thing apparently does not respect robots.txt 😅 (the official page claims the bot strictly obeys robots.txt, but reports around the web say otherwise), and it is not just SemrushBot: many marketing spiders ignore robots.txt. So the only option left was to block them with Nginx. In the BT panel's free Nginx firewall, open the global settings, click the User-Agent filter rules, and add the regular expression below (compiled from various sources online; I had not expected so many, and I also threw in some other useless UAs while I was at it. Before using it, check whether it contains any UAs you actually need):

(nmap|NMAP|HTTrack|sqlmap|Java|zgrab|Go-http-client|CensysInspect|leiki|webmeup|Python|python|curl|Curl|wget|Wget|toutiao|Barkrowler|AhrefsBot|a Palo Alto|ltx71|censys|DotBot|MauiBot|MegaIndex.ru|BLEXBot|ZoominfoBot|ExtLinksBot|hubspot|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|JikeSpider|SemrushBot)
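
If you are not on the BT panel, the same filter can go straight into the Nginx server block. Below is a minimal sketch of that idea (the regex is abbreviated here; substitute the full pattern above. Returning 444 makes Nginx drop the connection without sending any response, wasting as little of the server as possible):

# inside server { }: drop requests whose User-Agent matches the blacklist
if ($http_user_agent ~* "(SemrushBot|AhrefsBot|MJ12bot|DotBot|Bytespider)") {
    return 444;
}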

After a few seconds you can already see blocked requests coming in.

Blocked requests

Later, watching the block counter keep climbing, I realized that a request consumes server resources the moment it arrives, whether it gets a 404 or a 444 back, so absorbing a hit every few seconds is not sustainable in the long run. Digging through these spiders' IPs showed they were all foreign nodes, and since my site's overseas traffic already resolves to Cloudflare, I could conveniently let Cloudflare stand in the middle and block them before they ever reach the site.

Spider IPs

In Cloudflare, open the dashboard for the domain, go to Security → WAF, and click to create a rule. Paste the following expression into the Expression Preview (as before, check that it does not block any UAs you actually need):

(http.user_agent contains "SemrushBot") or (http.user_agent contains "FeedDemon") or (http.user_agent contains "Indy Library") or (http.user_agent contains "Alexa Toolbar") or (http.user_agent contains "AskTbFXTV") or (http.user_agent contains "AhrefsBot") or (http.user_agent contains "CrawlDaddy") or (http.user_agent contains "CoolpadWebkit") or (http.user_agent contains "Java") or (http.user_agent contains "Feedly") or (http.user_agent contains "UniversalFeedParser") or (http.user_agent contains "ApacheBench") or (http.user_agent contains "Microsoft URL Control") or (http.user_agent contains "Swiftbot") or (http.user_agent contains "ZmEu") or (http.user_agent contains "jaunty") or (http.user_agent contains "Python-urllib") or (http.user_agent contains "lightDeckReports Bot") or (http.user_agent contains "YYSpider") or (http.user_agent contains "DigExt") or (http.user_agent contains "HttpClient") or (http.user_agent contains "MJ12bot") or (http.user_agent contains "heritrix") or (http.user_agent contains "Bytespider") or (http.user_agent contains "Ezooms") or (http.user_agent contains "JikeSpider") or (http.user_agent contains "HTTrack") or (http.user_agent contains "Apache-HttpClient") or (http.user_agent contains "harvest") or (http.user_agent contains "audit") or (http.user_agent contains "dirbuster") or (http.user_agent contains "pangolin") or (http.user_agent contains "nmap") or (http.user_agent contains "sqln") or (http.user_agent contains "hydra") or (http.user_agent contains "libwww") or (http.user_agent contains "BBBike") or (http.user_agent contains "sqlmap") or (http.user_agent contains "w3af") or (http.user_agent contains "owasp") or (http.user_agent contains "Nikto") or (http.user_agent contains "fimap") or (http.user_agent contains "havij") or (http.user_agent contains "BabyKrokodil") or (http.user_agent contains "netsparker") or (http.user_agent contains "httperf") or (http.user_agent contains " SF/") or (http.user_agent contains "zgrab") or (http.user_agent contains "NMAP") or (http.user_agent contains "Go-http-client") or (http.user_agent contains "CensysInspect") or (http.user_agent contains "leiki") or (http.user_agent contains "webmeup") or (http.user_agent contains "Python") or (http.user_agent contains "python") or (http.user_agent contains "wget") or (http.user_agent contains "Wget") or (http.user_agent contains "toutiao") or (http.user_agent contains "Barkrowler") or (http.user_agent contains "a Palo Alto") or (http.user_agent contains "ltx71") or (http.user_agent contains "censys") or (http.user_agent contains "DotBot") or (http.user_agent contains "MauiBot") or (http.user_agent contains "MegaIndex.ru") or (http.user_agent contains "BLEXBot") or (http.user_agent contains "ZoominfoBot") or (http.user_agent contains "ExtLinksBot") or (http.user_agent contains "hubspot")

Then set the action to Block and save.

Cloudflare custom rule

At this point, when the block counter in the BT panel's free Nginx firewall stops rising while Cloudflare's firewall rule hit count climbs rapidly, the junk spiders are being successfully intercepted at Cloudflare.
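
If you do not want to wait for the spiders to confirm the rule works, you can spoof a blocked UA with curl (example.com stands in for your own domain; a matching request should come back with Cloudflare's 403 block response instead of your normal page):

curl -I -A "SemrushBot" https://example.com/
# expected: HTTP/2 403 instead of 200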

Update After Waking Up

By the time I woke up, it had already been blocked more than two thousand times. This thing really is relentless 😅.

Cloudflare custom rule

Custom Block Page

I only recently took a look at Cloudflare's default block page, and it is just too ugly... so I wanted to replace it with a custom one.

Cloudflare's default block page

It turned out that custom pages require upgrading to the Pro plan. Well, a freeloader is never going to pay just for a custom page, so the only way left is a somewhat clumsy one: simply redirect every visitor matching the User-Agent expression above to a custom page.

Note

🔔 Please note: pages hosted on Vercel cannot be reached from mainland China. Since only this site's overseas traffic goes through Cloudflare, reachability from the mainland does not matter here. If you need the page accessible from mainland China as well, use a different hosting platform.

We need to write a static block page ourselves and host it on Vercel.

Detailed Vercel Deployment Steps

First, create a new project in Vercel.

Create a new project

Then, on the left, you can import your custom page from an existing repository, or clone a template on the right. Since I did not have a custom page yet, I clicked the Browse All Templates link in the clone-template section to pick one.

Clone a template

Since I am writing the page with React, I picked the Create React App template and clicked the Deploy button to use it.

Next, under Create Git Repository, click GitHub to connect your GitHub account, enter a custom repository name, keep Create private Git Repository checked (the default) to make the repository private, and click Create; the deployment then runs automatically.

Deploy

When the Congratulations! message appears, the deployment has succeeded.

Deployed successfully

The last step is to push your code to the newly created repository on GitHub; once the push completes, Vercel automatically updates the page.

Below is the custom page I wrote in React; feel free to use it as-is.

src/App.js:

import React, { Component } from 'react'
import './App.css';

export default class App extends Component {
  // Strings for the current language; handleTranslations toggles them between en and zh.
  // Note: the key must be `symbols` (the original had `footer` here, which render()
  // never reads, so the footer line was blank until the first language toggle).
  state = {
    tran: {
      lang: 'en',
      title: 'This request has been blocked',
      contents_1: 'Some of your characteristics exist in the blacklist and this request has been blocked.',
      contents_2: 'If you think this is a false alarm, please contact me promptly.',
      symbols: '@Vinking Security Center',
      tips: 'Details have been saved to continuously optimize the Security Center'
    }
  }
  handleTranslations = () => {
    const { lang } = this.state.tran
    const newState = (lang === 'en') ? {
      lang: 'zh',
      title: '请求已被拦截',
      contents_1: '您的一些特征存在于黑名单中,此次请求已被拦截。',
      contents_2: '如果您觉得这是误报,请及时联系我。',
      symbols: '@ Vinking 安全中心',
      tips: '详细信息已保存以持续优化安全中心'
    } : {
      lang: 'en',
      title: 'This request has been blocked',
      contents_1: 'Some of your characteristics exist in the blacklist and this request has been blocked.',
      contents_2: 'If you think this is a false alarm, please contact me promptly.',
      symbols: '@Vinking Security Center',
      tips: 'Details have been saved to continuously optimize the Security Center'
    }
    document.title = newState.title
    this.setState({ tran: newState })
  }
  render() {
    const { title, contents_1, contents_2, symbols, tips } = this.state.tran
    return (
      <div className="content">
        <div className="card">
          <div className="cardHeader">
            <div>{title}</div>
            <div className='translation' onClick={this.handleTranslations}>
              <svg xmlns="http://www.w3.org/2000/svg" width="15" height="15" viewBox="0 0 1024 1024"><path fill="#f8f9fa" d="M608 416h288c35.36 0 64 28.48 64 64v416c0 35.36-28.48 64-64 64H480c-35.36 0-64-28.48-64-64V608H128c-35.36 0-64-28.48-64-64V128c0-35.36 28.48-64 64-64h416c35.36 0 64 28.48 64 64v288zm0 64v64c0 35.36-28.48 64-64 64h-64v256.032C480 881.696 494.304 896 511.968 896H864a31.968 31.968 0 0 0 31.968-31.968V512A31.968 31.968 0 0 0 864 480.032H608zM128 159.968V512c0 17.664 14.304 31.968 31.968 31.968H512A31.968 31.968 0 0 0 543.968 512V160a31.968 31.968 0 0 0-31.936-32H160a31.968 31.968 0 0 0-32 31.968zm64 244.288V243.36h112.736V176h46.752c6.4.928 9.632 1.824 9.632 2.752a10.56 10.56 0 0 1-1.376 4.128c-2.752 7.328-4.128 16.032-4.128 26.112v34.368h119.648v156.768h-50.88v-20.64h-68.768V497.76h-49.504V379.488h-67.36v24.768H192zm46.72-122.368v60.48h67.392V281.92h-67.36zm185.664 60.48V281.92h-68.768v60.48h68.768zm203.84 488H576L668.128 576h64.64l89.344 254.4h-54.976l-19.264-53.664H647.488l-19.232 53.632zm33.024-96.256h72.864l-34.368-108.608h-1.376l-37.12 108.608zM896 320h-64a128 128 0 0 0-128-128v-64a192 192 0 0 1 192 192zM128 704h64a128 128 0 0 0 128 128v64a192 192 0 0 1-192-192z" /></svg>
            </div>
          </div>
          <div className="cardDesc">
            <span className="red">{contents_1}</span>
            <br />
            {contents_2}
          </div>
          <div className="cardSymbols">
            <div>{symbols}</div>
          </div>
        </div>
        <div className="tips">{tips}</div>
      </div>
    )
  }
}
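
App.css is not included in the post; the stylesheet below is just a minimal placeholder I am assuming so the component renders sensibly. Restyle it however you like.

src/App.css:

/* Minimal placeholder styles (my assumption; not from the original post) */
body { margin: 0; background: #07092f; color: #f8f9fa; font-family: sans-serif; }
.content { display: flex; flex-direction: column; align-items: center; justify-content: center; min-height: 100vh; }
.card { max-width: 420px; padding: 24px; border-radius: 8px; background: rgba(255, 255, 255, 0.06); }
.cardHeader { display: flex; justify-content: space-between; align-items: center; font-size: 18px; }
.translation { cursor: pointer; }
.cardDesc { margin: 16px 0; line-height: 1.6; }
.red { color: #ff6b6b; }
.cardSymbols { font-size: 12px; opacity: 0.8; }
.tips { margin-top: 16px; font-size: 12px; opacity: 0.6; }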

public/index.html:

<!doctype html>
<html>

<head>
  <meta charset="utf-8">
  <meta name="viewport"
    content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no, shrink-to-fit=no">
  <meta name="theme-color" content="#07092f">
  <title>This request has been blocked</title>
  <meta name="description" content="This request has been blocked.">
</head>

<body>
  <div id="root"></div>
</body>

</html>
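
The Create React App template already ships with an entry file, but for completeness, here is a minimal src/index.js that mounts App into the #root div above (a sketch assuming React 18):

import React from 'react';
import ReactDOM from 'react-dom/client';
import App from './App';

// Mount the block page into the #root div from public/index.html
ReactDOM.createRoot(document.getElementById('root')).render(<App />);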

Back in Cloudflare, open the Rules section, click Redirect Rules, and create a rule. Paste the spider-blocking expression from above into the expression editor, choose the Static URL type, fill in the link of the page hosted on Vercel, set the status code to 301, and uncheck Preserve query string. With that, the custom block page is in place.

Cloudflare redirect rule
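
You can verify the redirect the same way as the WAF rule (example.com and the vercel.app address are placeholders; if the redirect rule matches the request, the response should be a 301 pointing at the Vercel page):

curl -I -A "SemrushBot" https://example.com/
# expected: HTTP/2 301 with a Location header such as https://your-project.vercel.app/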

This post is synced by Mix Space to xLog.
The original link is https://www.vinking.top/posts/codes/blocking-bots-with-cloudflare

