Advanced Anti-Scraping Countermeasures: IP Proxy Pools, Request Fingerprints, and Font Obfuscation in Practice

Published: 2026/6/26 21:01:32

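Entry-level anti-scraping measures (User-Agent checks, IP bans, rate limits) are easy to get around, but at the next level sites have more tricks. This post covers the three most practical countermeasures.

I. IP Proxy Pool: Getting Past IP Bans

Crawl a little too fast and the site bans your IP. The fix is to maintain a proxy pool and switch automatically whenever an IP gets banned.

1. Where to get free proxies

    import requests
    from bs4 import BeautifulSoup

    def fetch_free_proxies():
        """Collect proxy IPs from a free proxy site (example)."""
        proxies = []
        url = "https://www.free-proxy-list.com/"
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        for row in soup.select("table tr")[1:]:  # skip the header row
            cols = row.select("td")
            if len(cols) >= 2:
                ip = cols[0].text.strip()
                port = cols[1].text.strip()
                # Include the scheme so requests can use the value directly
                proxies.append(f"http://{ip}:{port}")
        return proxies

Note that free proxies are very poor quality: about 90% don't work and the rest are slow. They will do for small jobs, but for large-scale crawling you should buy paid proxies.

2. Checking whether a proxy works

    import threading
    import requests

    def check_proxy(proxy):
        """Test whether a proxy is usable."""
        try:
            resp = requests.get(
                "http://httpbin.org/ip",
                proxies={"http": proxy, "https": proxy},
                timeout=5,
            )
            if resp.status_code == 200:
                print(f"✅ {proxy} works - {resp.text.strip()}")
                return True
        except requests.RequestException:
            pass
        return False

    # Validate all proxies with multiple threads
    def validate_all(proxies):
        valid = []
        threads = []
        lock = threading.Lock()

        def check(p):
            if check_proxy(p):
                with lock:
                    valid.append(p)

        for p in proxies:
            t = threading.Thread(target=check, args=(p,))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
        return valid

3. Proxy pool manager

    import random
    import time
    import requests

    class ProxyPool:
        """A simple proxy pool manager."""

        def __init__(self):
            self.proxies = []
            self.blacklist = set()

        def add_proxy(self, proxy):
            # Skip blacklisted proxies and avoid duplicates
            if proxy not in self.blacklist and proxy not in self.proxies:
                self.proxies.append(proxy)

        def get_proxy(self):
            """Return a random proxy, or None if the pool is empty."""
            if not self.proxies:
                return None
            return random.choice(self.proxies)

        def mark_bad(self, proxy):
            """Mark a proxy as unusable and remove it from the pool."""
            if proxy in self.proxies:
                self.proxies.remove(proxy)
                self.blacklist.add(proxy)
                print(f"Proxy {proxy} removed, {len(self.proxies)} left")

        def request_with_retry(self, url, max_retries=5):
            """Send a request through the pool, retrying with a new proxy on failure."""
            for i in range(max_retries):
                proxy = self.get_proxy()
                if not proxy:
                    print("Proxy pool is empty, waiting for a refill...")
                    return None
                try:
                    resp = requests.get(
                        url,
                        proxies={"http": proxy, "https": proxy},
                        timeout=10,
                        headers={"User-Agent": "Mozilla/5.0"},
                    )
                    if resp.status_code == 200:
                        return resp
                    self.mark_bad(proxy)
                except requests.RequestException:
                    self.mark_bad(proxy)
                time.sleep(0.5)
            print(f"All {max_retries} retries failed: {url}")
            return None

4. Paid proxy recommendations

Free proxies are unstable. For real work, buy them:

- 快代理: the largest proxy provider in China, with dynamic proxies billed by usage
- 芝麻代理: cheap, a few yuan gets you a day of use
- 站大爷: supports dynamic, on-demand extraction

A few dozen yuan usually lasts several months and saves a lot of hassle.

II. Request Fingerprints: Looking Like a Real Browser

Modern anti-scraping systems look at more than IP and UA. They also check the TLS fingerprint, the order of the headers, WebDriver traces, and so on.

1. Full request-header spoofing

    headers = {
        # A real Chrome UA includes "(KHTML, like Gecko)"
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        # requests only decodes "br" responses when the brotli package is installed
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Referer": "https://www.google.com/",
        "Sec-Ch-Ua": '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        # "none" means a typed URL or bookmark; with a Google Referer the
        # consistent value is "cross-site"
        "Sec-Fetch-Site": "cross-site",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
    }

2. Random rotation

    import random
    from fake_useragent import UserAgent

    ua = UserAgent()

    class RandomHeaders:
        """Generate randomized request headers."""

        @staticmethod
        def get_headers():
            user_agents = [
                ua.chrome,
                ua.edge,
                ua.firefox,
            ]
            return {
                "User-Agent": random.choice(user_agents),
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": random.choice([
                    "zh-CN,zh;q=0.9,en;q=0.8",
                    "en-US,en;q=0.9,zh-CN;q=0.8",
                    "zh-CN,zh;q=0.9",
                ]),
                "Referer": random.choice([
                    "https://www.baidu.com/",
                    "https://www.google.com/",
                    "https://www.bing.com/",
                ]),
            }

3. Reusing a requests session

    import requests

    # Creating a new session for every request is easy to spot.
    # The right approach is to reuse one Session.
    session = requests.Session()
    session.headers.update(RandomHeaders.get_headers())

    # Consecutive requests keep the same fingerprint
    resp1 = session.get("https://example.com/page1")
    resp2 = session.get("https://example.com/page2")

III. Font Obfuscation: One of the Most Common Anti-Scraping Tricks

Many sites (大众点评, 猫眼电影, 58同城) use a custom font to obfuscate digits, so what you scrape looks like garbage.

How it works

    Normal digits:  0 1 2 3 4 5 6 7 8 9
    Obfuscated:     ➀ ➁ ➂ ➃ ➄ ➅ ➆ ➇ ➈ ➉

- The page loads a custom font file (.woff or .ttf) through @font-face
- Inside the font file, the character mapping has been scrambled
- So the browser renders the right digits, but the text you scrape is garbage

Solution

    import io
    import re
    import requests
    from fontTools.ttLib import TTFont

    def parse_font_anti_spider(html_text, font_url):
        """Decode font-obfuscated digits in html_text."""
        # 1. Download the font file (kept in memory, no temp file needed)
        resp = requests.get(font_url, timeout=10)

        # 2. Parse the font's character map
        font = TTFont(io.BytesIO(resp.content))
        cmap = font.getBestCmap()

        # 3. Build the mapping
        # cmap is {Unicode code point: glyph name},
        # e.g. {0xe001: "one", 0xe002: "two", ...}
        # The glyph name tells us which digit the code point stands for
        unicode_to_digit = {}
        num_map = {
            "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
            "six": "6", "seven": "7", "eight": "8", "nine": "9", "zero": "0",
        }
        for unicode_val, glyph_name in cmap.items():
            # Glyph names may be meaningful ("one") or opaque ("uniE001");
            # only the meaningful ones can be decoded this way
            digit = num_map.get(glyph_name.lower())
            if digit is not None:
                unicode_to_digit[unicode_val] = digit

        # 4. Replace the obfuscated characters, whether they appear as raw
        #    characters or as &#x....; entities in the HTML source
        for unicode_val, digit in unicode_to_digit.items():
            html_text = html_text.replace(chr(unicode_val), digit)
            html_text = re.sub(rf"&#x0*{unicode_val:x};", digit, html_text, flags=re.IGNORECASE)
        return html_text

Dynamically generated fonts are harder

Many sites have moved on to dynamic fonts: every request generates a new font file, and the mapping is different each time.

    # Solution outline
    # 1. Download the latest font file on every request
    # 2. Identify the digits by comparing glyph outlines
    # A simpler option is OCR:
    # from PIL import Image
    # import pytesseract
    # but it is less accurate than parsing the font file directly

IV. Avoiding Selenium Detection

When you use Selenium, a site can tell you are an automation tool by checking window.navigator.webdriver.

1. Hiding WebDriver traces

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    driver = webdriver.Chrome(options=options)

    # Inject JS that overrides navigator properties before any page script runs
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
            Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
            Object.defineProperty(navigator, 'languages', { get: () => ['zh-CN', 'zh'] });
        """
    })

2. Using undetected-chromedriver

There is a dedicated library that handles most of these detection checks:

    pip install undetected-chromedriver

    import undetected_chromedriver as uc

    driver = uc.Chrome()
    driver.get("https://example.com")
    # WebDriver traces are hidden automatically, which is far more stable than manual configuration

V. Real-World Workflow: How to Pick a Strategy

| Anti-scraping level | Typical signs | Countermeasure |
|---|---|---|
| ⭐ Entry | IP bans, rate limits | Add delays + rotate UA |
| ⭐⭐ Basic | Header checks, CAPTCHAs | Session reuse + proxy pool |
| ⭐⭐⭐ Intermediate | Font obfuscation, dynamic loading | Download and parse the font + call the underlying API |
| ⭐⭐⭐⭐ Advanced | WebDriver detection, risk-control systems | undetected-chromedriver + behavior simulation |
| ⭐⭐⭐⭐⭐ Top | Slider CAPTCHAs, human verification | CAPTCHA-solving platform + mouse-trajectory simulation |

Core principle: stronger countermeasures are not automatically better; use just enough. A 3-second delay plus UA rotation solves 80% of the problems, so there is no need to start with a distributed proxy pool.

If this was useful, like and follow 【张老师技术栈】. Weekly hands-on posts on Java, Python, and scraping, so the visit is worth your time.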
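
The core principle above (a delay of about 3 seconds plus UA rotation) can be sketched roughly as follows. This is a minimal sketch, not the author's code: `USER_AGENTS`, `jittered_delay`, `polite_headers`, and `crawl` are hypothetical names, the random jitter around the base delay is an added assumption, and `fetch` is whatever function you supply to do the real request (for example a thin wrapper around `session.get`).

```python
import random
import time

# Illustrative UA strings (assumption: swap in whatever list you maintain)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36 Edg/125.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
]


def polite_headers():
    """Pick a random User-Agent for the next request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def jittered_delay(base=3.0, jitter=1.0):
    """Seconds to wait: base plus or minus up to `jitter`, never negative."""
    return max(0.0, base + random.uniform(-jitter, jitter))


def crawl(urls, fetch, base=3.0, jitter=1.0, sleep=time.sleep):
    """Call fetch(url, headers) for each URL, pausing between requests.

    `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            sleep(jittered_delay(base, jitter))
        results.append(fetch(url, polite_headers()))
    return results


# Demo with a stub fetch and a recording sleep, so nothing touches the network
def fake_fetch(url, headers):
    return f"{url} fetched as {headers['User-Agent'][:40]}..."


waits = []
pages = crawl(["https://example.com/1", "https://example.com/2"], fake_fetch, sleep=waits.append)
print(pages)
print(waits)
```

Keeping the pacing logic separate from the HTTP call means the same loop works whether `fetch` goes through a plain Session or through the proxy pool from section I.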

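For the dynamic-font case in section III, "identify the digits by comparing glyph outlines" could look something like the sketch below. It assumes that the site only reshuffles code points and glyph names while the glyph shapes stay (almost) the same between font files, which is not guaranteed. Outlines are plain lists of (x, y) points here; with fontTools you would typically read them from a TrueType font via `font["glyf"][glyph_name].coordinates`. The names `normalize`, `outline_distance`, `match_digit`, and the `REFERENCE` data are hypothetical, and the reference outlines are made-up shapes, not real glyph data.

```python
from math import hypot


def normalize(points):
    """Shift and scale an outline so position and size do not matter."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    scale = max(max(xs) - min_x, max(ys) - min_y) or 1
    return [((x - min_x) / scale, (y - min_y) / scale) for x, y in points]


def outline_distance(a, b):
    """Mean point-to-point distance between two normalized outlines.

    Outlines with different point counts are treated as not comparable.
    """
    if not a or len(a) != len(b):
        return float("inf")
    na, nb = normalize(a), normalize(b)
    total = sum(hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(na, nb))
    return total / len(a)


def match_digit(unknown, references, tolerance=0.05):
    """Return the digit with the closest reference outline, or None if none is within tolerance."""
    best_digit, best_dist = None, tolerance
    for digit, ref_points in references.items():
        dist = outline_distance(unknown, ref_points)
        if dist <= best_dist:
            best_digit, best_dist = digit, dist
    return best_digit


# Hypothetical reference outlines, labeled by hand once from a known font
REFERENCE = {
    "1": [(10, 0), (20, 0), (20, 100), (10, 100)],
    "7": [(0, 100), (60, 100), (25, 0), (15, 0), (45, 90), (0, 90)],
}

# A glyph from a "new" font: the "7" shape, moved, doubled in size, slightly perturbed
unknown = [(100, 300), (220, 300), (150, 100), (130, 100), (190, 281), (100, 280)]
print(match_digit(unknown, REFERENCE))
```

Once every glyph in the freshly downloaded font has been matched to a digit this way, the resulting code-point-to-digit mapping can be applied to the page text exactly as in step 4 of `parse_font_anti_spider`.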