Vinking

Blocking Marketing Bots from Crawling the Site

Looking through the site's access logs today, I noticed a spider with the User-Agent Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) crawling the site very aggressively, and judging from the 20-plus MB of logs, it had clearly been at it for quite a while.

Site access log

Judging by the pages it requests, this spider seems to take parameters from previously crawled pages, recombine them at random, and request the results, so the URLs it hits are long strings of junk that almost always return 404. With a request landing every few seconds, the legitimate entries in the log are all but drowned out by tens of thousands of lines of spider noise.
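To see at a glance which User-Agents dominate a log like this, a small script can tally them. A minimal Node.js sketch, assuming the common Nginx combined log format in which the User-Agent is the last double-quoted field on each line:

```javascript
// Tally User-Agent strings in an Nginx access log (combined format).
// Assumes the User-Agent is the last "..."-quoted field on each line.
function tallyUserAgents(logText) {
  const counts = {};
  for (const line of logText.split('\n')) {
    const fields = line.match(/"[^"]*"/g); // all double-quoted fields
    if (!fields || fields.length < 3) continue; // skip malformed lines
    const ua = fields[fields.length - 1].slice(1, -1); // strip the quotes
    counts[ua] = (counts[ua] || 0) + 1;
  }
  // Sort descending so the noisiest spiders come first
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}
```

Read the log file with `fs.readFileSync` and feed it to `tallyUserAgents`; the noisiest bot floats to the top of the result.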

My first thought was that a spider ought to respect robots.txt, so I figured adding a rule there would do the trick:

User-agent: SemrushBot
Disallow: /

It turns out, after some searching, that this thing apparently does not respect robots.txt 😅 (the official page claims the bot strictly follows robots.txt, but reports around the web say otherwise), and it's not just SemrushBot: many marketing spiders ignore robots.txt. So the only option left was to block it with Nginx. In the BaoTa panel's free Nginx firewall, open the global settings, click User-Agent filter rules, and add the regular expression below (compiled from various sources online; I hadn't expected there to be so many, and I threw in some other useless UAs while I was at it. Before using it, check that it doesn't block any UA you actually need):

(nmap|NMAP|HTTrack|sqlmap|Java|zgrab|Go-http-client|CensysInspect|leiki|webmeup|Python|python|curl|Curl|wget|Wget|toutiao|Barkrowler|AhrefsBot|a Palo Alto|ltx71|censys|DotBot|MauiBot|MegaIndex.ru|BLEXBot|ZoominfoBot|ExtLinksBot|hubspot|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|JikeSpider|SemrushBot)
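If you're not using the BaoTa panel, the same filter can be applied in a plain Nginx server block. A minimal sketch (the keyword list is abbreviated here; substitute the full regex above), returning 444 so Nginx closes the connection without sending any response:

```nginx
# Inside the server { } block: drop requests whose User-Agent matches
# the blacklist. 444 is Nginx-specific: close the connection silently.
if ($http_user_agent ~* "(SemrushBot|AhrefsBot|MJ12bot|DotBot|BLEXBot)") {
    return 444;
}
```

Note that `~*` matches case-insensitively; use `~` if you want the case-sensitive behavior of the original list.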

After a few seconds you should start to see interception data.

Interception data

Watching the intercept counter keep climbing, it occurred to me that once a request is made it consumes server resources whether it gets a 404 or a 444, so being hit every few seconds wasn't sustainable long term. Digging into the spiders' IPs, they were all overseas nodes, and my site's overseas traffic already resolves to Cloudflare, so Cloudflare could conveniently stand in the middle and block them before they ever reach the site.

Spider IPs

In Cloudflare, open the dashboard for the domain, go to Security → WAF, and click to add a rule. Paste the expression below into the expression editor (again, check that it doesn't block any UA you need before using it):

(http.user_agent contains "SemrushBot") or (http.user_agent contains "FeedDemon") or (http.user_agent contains "Indy Library") or (http.user_agent contains "Alexa Toolbar") or (http.user_agent contains "AskTbFXTV") or (http.user_agent contains "AhrefsBot") or (http.user_agent contains "CrawlDaddy") or (http.user_agent contains "CoolpadWebkit") or (http.user_agent contains "Java") or (http.user_agent contains "Feedly") or (http.user_agent contains "UniversalFeedParser") or (http.user_agent contains "ApacheBench") or (http.user_agent contains "Microsoft URL Control") or (http.user_agent contains "Swiftbot") or (http.user_agent contains "ZmEu") or (http.user_agent contains "jaunty") or (http.user_agent contains "Python-urllib") or (http.user_agent contains "lightDeckReports Bot") or (http.user_agent contains "YYSpider") or (http.user_agent contains "DigExt") or (http.user_agent contains "HttpClient") or (http.user_agent contains "MJ12bot") or (http.user_agent contains "heritrix") or (http.user_agent contains "Bytespider") or (http.user_agent contains "Ezooms") or (http.user_agent contains "JikeSpider") or (http.user_agent contains "HTTrack") or (http.user_agent contains "Apache-HttpClient") or (http.user_agent contains "harvest") or (http.user_agent contains "audit") or (http.user_agent contains "dirbuster") or (http.user_agent contains "pangolin") or (http.user_agent contains "nmap") or (http.user_agent contains "sqln") or (http.user_agent contains "hydra") or (http.user_agent contains "libwww") or (http.user_agent contains "BBBike") or (http.user_agent contains "sqlmap") or (http.user_agent contains "w3af") or (http.user_agent contains "owasp") or (http.user_agent contains "Nikto") or (http.user_agent contains "fimap") or (http.user_agent contains "havij") or (http.user_agent contains "BabyKrokodil") or (http.user_agent contains "netsparker") or (http.user_agent contains "httperf") or (http.user_agent contains " SF/") or (http.user_agent contains "zgrab") or 
(http.user_agent contains "NMAP") or (http.user_agent contains "Go-http-client") or (http.user_agent contains "CensysInspect") or (http.user_agent contains "leiki") or (http.user_agent contains "webmeup") or (http.user_agent contains "Python") or (http.user_agent contains "python") or (http.user_agent contains "wget") or (http.user_agent contains "Wget") or (http.user_agent contains "toutiao") or (http.user_agent contains "Barkrowler") or (http.user_agent contains "a Palo Alto") or (http.user_agent contains "ltx71") or (http.user_agent contains "censys") or (http.user_agent contains "DotBot") or (http.user_agent contains "MauiBot") or (http.user_agent contains "MegaIndex.ru") or (http.user_agent contains "BLEXBot") or (http.user_agent contains "ZoominfoBot") or (http.user_agent contains "ExtLinksBot") or (http.user_agent contains "hubspot")
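Cloudflare's contains operator is a case-sensitive substring match (which is why the expression lists both "Python" and "python", "wget" and "Wget"), so before deploying it can be worth sanity-checking a few real User-Agents locally. A rough simulation in JavaScript; the keyword list here is abbreviated, so paste in the full set from the expression above before relying on it:

```javascript
// Simulate the Cloudflare rule: each (http.user_agent contains "kw")
// clause is a case-sensitive substring test, OR-ed together.
const BLOCKED_KEYWORDS = [
  'SemrushBot', 'AhrefsBot', 'MJ12bot', 'Bytespider', 'Python-urllib'
]; // abbreviated for illustration — use the full keyword set

function isBlocked(userAgent) {
  return BLOCKED_KEYWORDS.some(kw => userAgent.includes(kw));
}
```

For example, `isBlocked('Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)')` is true, while a normal desktop browser UA passes through.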

Then set the action to Block and save.

Cloudflare custom rule

At this point, if the interception counter in the BaoTa Nginx free firewall stops climbing while the hit count on the Cloudflare firewall rule shoots up, it means Cloudflare is now successfully intercepting these junk spiders.

Update after a night's sleep

By the time I woke up it had already blocked over two thousand requests. This thing really is relentless 😅.

Cloudflare custom rule

Custom block page

I only recently noticed Cloudflare's default block page, and it's honestly ugly... so I wanted to replace it with a custom one.

Cloudflare's default block page

It turns out custom pages require upgrading to the Pro plan. Well, as a freeloader I'm certainly not going to upgrade just for a custom page, so I had to settle for a clumsier approach: simply redirect every visitor matching the User-Agent list above to a custom page.

Note

🔔 Please note: pages hosted on Vercel are not reachable from mainland China. Since only this site's overseas traffic goes through Cloudflare, mainland reachability doesn't matter here. If you need the page reachable from the mainland as well, use a different hosting platform.

We need to write a static block page ourselves and host it on Vercel.

Detailed Vercel deployment steps

First, create a new project in Vercel.

Create a new project

On the left you can import your custom page from an existing repository, or clone a template on the right. Since I don't have a custom page yet, I clicked the Browse All Templates link in the clone-template section to pick one.

Clone a template

Since I'm writing the page in React, I used the Create React App template. Click the Deploy button to use it.

Next, under Create Git Repository, click GitHub to connect your GitHub account, enter a repository name, leave Create private Git Repository checked to keep it private, and click Create; deployment then runs automatically.

Deploy

When you see the Congratulations! message, the deployment has succeeded.

Deployed successfully

Finally, push your code to the newly created repository on GitHub; once it's pushed, Vercel will update the page automatically.

Below is the custom page I wrote in React; feel free to use it as-is.

\src\App.js file:

import React, { Component } from 'react'
import './App.css';

export default class App extends Component {
  state = {
    tran: {
      lang: 'en',
      title: 'This request has been blocked',
      contents_1: 'Some of your characteristics exist in the blacklist and this request has been blocked.',
      contents_2: 'If you think this is a false alarm, please contact me promptly.',
      symbols: '@Vinking Security Center',
      tips: 'Details have been saved to continuously optimize the Security Center'
    }
  }
  handleTranslations = () => {
    const { lang } = this.state.tran
    const newState = (lang === 'en') ? {
      lang: 'zh',
      title: '請求已被攔截',
      contents_1: '您的一些特徵存在於黑名單中,此次請求已被攔截。',
      contents_2: '如果您覺得這是誤報,請及時聯繫我。',
      symbols: '@ Vinking 安全中心',
      tips: '詳細信息已保存以持續優化安全中心'
    } : {
      lang: 'en',
      title: 'This request has been blocked',
      contents_1: 'Some of your characteristics exist in the blacklist and this request has been blocked.',
      contents_2: 'If you think this is a false alarm, please contact me promptly.',
      symbols: '@Vinking Security Center',
      tips: 'Details have been saved to continuously optimize the Security Center'
    }
    document.title = newState.title
    this.setState({ tran: newState })
  }
  render() {
    const { title, contents_1, contents_2, symbols, tips } = this.state.tran
    return (
      <div className="content">
        <div className="card">
          <div className="cardHeader">
            <div>{title}</div>
            <div className='translation' onClick={this.handleTranslations}>
              <svg xmlns="http://www.w3.org/2000/svg" width="15" height="15" viewBox="0 0 1024 1024"><path fill="#f8f9fa" d="M608 416h288c35.36 0 64 28.48 64 64v416c0 35.36-28.48 64-64 64H480c-35.36 0-64-28.48-64-64V608H128c-35.36 0-64-28.48-64-64V128c0-35.36 28.48-64 64-64h416c35.36 0 64 28.48 64 64v288zm0 64v64c0 35.36-28.48 64-64 64h-64v256.032C480 881.696 494.304 896 511.968 896H864a31.968 31.968 0 0 0 31.968-31.968V512A31.968 31.968 0 0 0 864 480.032H608zM128 159.968V512c0 17.664 14.304 31.968 31.968 31.968H512A31.968 31.968 0 0 0 543.968 512V160a31.968 31.968 0 0 0-31.936-32H160a31.968 31.968 0 0 0-32 31.968zm64 244.288V243.36h112.736V176h46.752c6.4.928 9.632 1.824 9.632 2.752a10.56 10.56 0 0 1-1.376 4.128c-2.752 7.328-4.128 16.032-4.128 26.112v34.368h119.648v156.768h-50.88v-20.64h-68.768V497.76h-49.504V379.488h-67.36v24.768H192zm46.72-122.368v60.48h67.392V281.92h-67.36zm185.664 60.48V281.92h-68.768v60.48h68.768zm203.84 488H576L668.128 576h64.64l89.344 254.4h-54.976l-19.264-53.664H647.488l-19.232 53.632zm33.024-96.256h72.864l-34.368-108.608h-1.376l-37.12 108.608zM896 320h-64a128 128 0 0 0-128-128v-64a192 192 0 0 1 192 192zM128 704h64a128 128 0 0 0 128 128v64a192 192 0 0 1-192-192z" /></svg>
            </div>
          </div>
          <div className="cardDesc">
            <span className="red">{contents_1}</span>
            <br />
            {contents_2}
          </div>
          <div className="cardSymbols">
            <div>{symbols}</div>
          </div>
        </div>
        <div className="tips">{tips}</div>
      </div>
    )
  }
}

\public\index.html file:

<!doctype html>
<html>

<head>
  <meta charset="utf-8">
  <meta name="viewport"
    content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no, shrink-to-fit=no">
  <meta name="theme-color" content="#07092f">
  <title>This request has been blocked</title>
  <meta name="description" content="This request has been blocked.">
</head>

<body>
  <div id="root"></div>
</body>

</html>

Back in Cloudflare, go to the Rules section, open Redirect Rules, and create a rule. Paste the bot-blocking expression from above into the expression editor, set the type to Static URL, fill in the link of the Vercel-hosted page, set the status code to 301, and untick Preserve query string. That gives you a custom block page.

Cloudflare redirect

This post is synced from Mix Space to xLog.
The original link is https://www.vinking.top/posts/codes/blocking-bots-with-cloudflare

