A Simple Crawler in PHP: file_get_contents and Regular Expressions

Published: 2026-05-19 23:48:39

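This article looks at building a lightweight web crawler with PHP's `file_get_contents` plus regular expressions, covering both the essentials and the practical traps. It explains the dependency on `allow_url_fopen` and the need to switch to cURL when that setting is disabled. It addresses common mistakes in parsing HTML with regex, such as a missing `s` modifier and unbounded patterns that over-match or miss content, and suggests more robust patterns. It compares regex with DOMDocument and argues that for simple pages with fixed fields, regex remains efficient thanks to fast startup, no dependencies, and short code. Finally, it covers baseline safeguards (a User-Agent header, request delays, timeout and redirect limits, strict URL validation) and points out that regex cannot work on JavaScript-rendered pages, so no amount of parameter tuning will help there.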

How to build a simple crawler in PHP using file_get_contents and regular expressions

Can file_get_contents fetch web pages directly?

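Yes, under two conditions: the target site permits access, and the PHP configuration has `allow_url_fopen` enabled (it is on by default, but some shared hosts disable it). When it is disabled, `file_get_contents` returns `false` for a remote URL and emits a warning ending in `failed to open stream: no suitable wrapper could be found`. `allow_url_fopen` can only be set in php.ini, not at runtime with `ini_set`, so if you cannot change the server configuration you must switch to cURL; `file_get_contents` is unusable for remote URLs in that case.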

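In practice, check the setting first:

// ini_get() returns "1" when the setting is on and "" or "0" when it is off
if (!ini_get('allow_url_fopen')) {
    die('file_get_contents cannot be used for remote URLs');
}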
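
The check above can be extended into a fetch helper that falls back to cURL. This is a minimal sketch, not a definitive implementation: the function name `fetchPage` and the cURL limits (3 redirects, 10 seconds) are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper (not from the article): fetch a URL with
// file_get_contents when allowed, otherwise fall back to cURL.
function fetchPage(string $url): ?string
{
    // Normalize the ini value ("1", "0", "" or false) to a real boolean.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $body = @file_get_contents($url);
        return $body === false ? null : $body;
    }

    // The cURL extension may also be missing; report failure instead of crashing.
    if (!function_exists('curl_init')) {
        return null;
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 3,    // assumed limit
        CURLOPT_TIMEOUT        => 10,   // assumed limit, in seconds
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}
```

Returning `null` on any failure keeps the caller's check to a single comparison, whichever transport was used.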

Common pitfalls when matching HTML tags with regex

When extracting titles, links, and similar fields with `preg_match`, the most common mistake is an unbounded pattern such as `/<title>(.*?)<\/title>/i`. The quantifier is lazy, but without the `s` modifier `.` stops at newlines, and nothing constrains where the match begins, so the pattern easily mis-matches or misses content when it meets line breaks, comments, or tags embedded in scripts (for example `<script>document.write('<title>...')</script>`).

A more robust approach:

- Add the `s` modifier so that `.` matches newlines: `/<title>(.*?)<\/title>/is`
- Replace `.*?` with an explicit character class, for example to extract an href: `/href=["\']([^"\'>\s]+)["\']/i`
- Avoid matching complex multi-line structures (such as an entire `<table>`); regex is not an HTML parser

Why use regex instead of DOMDocument

DOMDocument is a valid choice; the two simply suit different scenarios. If you only need a few fixed fields from a page with a stable, flat structure (the temperature on a weather page, the `<link>` value in an RSS feed), the `file_get_contents` + `preg_match` combination starts fast, needs few dependencies, and takes little code. `DOMDocument` has to build the full document tree. It tolerates malformed HTML well, but it costs more, and encoding mismatches (for example a page that declares `gb2312` while its content is actually `utf-8`) can easily cause parse failures or garbled text.

If you do extract links with regex, note two points:

- Convert the page to UTF-8 with `mb_convert_encoding` before matching, otherwise Chinese URLs may be truncated
- Run `filter_var($url, FILTER_VALIDATE_URL)` on each match as a basic check, and apply `urldecode` only when you need the readable form; the filter accepts only absolute ASCII URLs, so validate the still-encoded value

Safeguards every simple crawler needs

Even in local debugging, unrestricted `file_get_contents` requests can trigger the target site's anti-crawling measures or get your own IP blocked. The minimum viable safeguards are:

- Send a browser-like request header: `stream_context_create(['http' => ['header' => "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36\r\n"]])`
- Add a delay between requests with `sleep(1)`; do not fire requests back to back
- Cap redirects and set a timeout: `'max_redirects' => 3, 'timeout' => 10, 'ignore_errors' => true`. These options do not limit the response size; to cap the body, pass a maximum byte count as the `length` argument of `file_get_contents`
- Always test the return value with `=== false` rather than only checking its length: a failed request returns `false`, while an empty string is a valid empty response, not an error

The point that really matters: regex can never reliably handle dynamically rendered pages (content generated by React or Vue, for example). The data you want is simply not in the page source, so no context option or modifier will help.
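
The regex advice from the pitfalls section can be sketched as follows. The sample HTML and the function names `extractTitle` and `extractAbsoluteLinks` are made up for illustration, and decoding HTML entities with `html_entity_decode` before validation is an added assumption, not a step the article prescribes.

```php
<?php
// Made-up sample page: the title spans several lines, one link is relative.
$html = "<html><head><title>\n  Demo Page\n</title></head>"
      . "<body><a href=\"https://example.com/a?x=1&amp;y=2\">A</a> "
      . "<a href='/relative/path'>B</a></body></html>";

// Hypothetical helper: return the trimmed <title> text, or null if absent.
function extractTitle(string $html): ?string
{
    // "s" lets "." cross newlines, "i" ignores tag case,
    // and [^>]* tolerates attributes on the opening tag.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m) === 1) {
        return trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null;
}

// Hypothetical helper: collect href values that pass FILTER_VALIDATE_URL.
function extractAbsoluteLinks(string $html): array
{
    // The negated character class stops at quotes, ">" and whitespace.
    preg_match_all('/href=["\']([^"\'>\s]+)["\']/i', $html, $m);

    $links = [];
    foreach ($m[1] as $raw) {
        // Attribute values are HTML-escaped ("&amp;"), so decode them first.
        $url = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
            $links[] = $url;
        }
    }
    return array_values(array_unique($links));
}
```

Relative links such as `/relative/path` are dropped here because `FILTER_VALIDATE_URL` requires a scheme; a real crawler would resolve them against the page URL before validating.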
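
The safeguards listed above can be combined into one helper. This is a sketch under stated assumptions: the names `buildCrawlerContext` and `politeFetch`, the 1 MiB default cap, and the http/https scheme whitelist are illustrative choices, while the context option values are the ones quoted in the article.

```php
<?php
// Hypothetical helper: build a stream context carrying the article's options.
function buildCrawlerContext()
{
    return stream_context_create([
        'http' => [
            'header'        => "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36\r\n",
            'max_redirects' => 3,
            'timeout'       => 10,   // seconds
            'ignore_errors' => true, // still return the body on 4xx/5xx responses
        ],
    ]);
}

// Hypothetical helper: validate, wait, then fetch at most $maxBytes bytes.
function politeFetch(string $url, int $maxBytes = 1048576, int $delaySeconds = 1): ?string
{
    // Reject anything that is not an absolute http(s) URL before touching the network.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (filter_var($url, FILTER_VALIDATE_URL) === false
        || !in_array($scheme, ['http', 'https'], true)) {
        return null;
    }

    sleep($delaySeconds); // space requests out

    // The fifth argument (length) caps how many bytes are read from the response.
    $body = @file_get_contents($url, false, buildCrawlerContext(), 0, $maxBytes);

    // false means the request failed; an empty string is a valid empty body.
    return $body === false ? null : $body;
}
```

Because `ignore_errors` is enabled, an error page from the server still comes back as a string, so callers that care about the HTTP status need to inspect the response headers separately.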