How to Extract HTML Content with Regular Expressions

Date: 2025-07-01 12:53:55

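Regular expressions can extract specific content from HTML, but they are not the best tool for the job; a library such as BeautifulSoup is recommended. 1. To extract the text inside a tag, use a regex like <title.*?>(.*?)</title>, where the capture group picks out the content you want. 2. To extract an attribute value such as an image's src, use <img.*?src="(.*?)".*?>; the variant src=(['"])(.*?)\1 handles both single and double quotes. 3. To match the content of a tag with a specific class, such as <div class="content">...</div>, use <div class="content".*?>([\s\S]*?)</div>, although nested structures can make the match fail. When testing, use real data and prefer non-greedy matching, and for complex structures reach for an HTML parsing library first.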

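How do you use regular expressions to extract specific content from HTML?

Extracting specific content from HTML is a common need when working with web data. Regular expressions (regex) are not the best tool for parsing HTML (BeautifulSoup or a similar library is recommended), but in simple cases they are still a quick and effective option.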

Matching the text inside a tag

If you only want the text between a pair of tags, such as the title inside the <title> tag, you can use the following regex:

    <title.*?>(.*?)</title>

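This expression means:

  • .*? matches any characters non-greedily, that is, as few as possible
  • (.*?) is a capture group that extracts the content you actually want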
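
To see why the non-greedy form matters, here is a minimal Python sketch using the standard re module. The sample string and its <b> tags are made up for illustration and do not come from the article's own examples.

```python
import re

# Hypothetical sample with two <b> elements on one line.
html = "<b>first</b> and <b>second</b>"

# Greedy: .* runs on to the last </b>, so both elements end up in one match.
greedy = re.findall(r"<b>(.*)</b>", html)

# Non-greedy: .*? stops at the first </b> it reaches, one match per element.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)
print(lazy)
```

The greedy pattern returns a single oversized match, while the non-greedy one returns each element's text separately.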

For example, given this HTML:

    <title>This is the page title to extract</title>

the regex extracts "This is the page title to extract".
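
A minimal Python sketch of the title extraction described above, assuming the standard re module. The sample page and the helper name extract_title are hypothetical. The re.DOTALL and re.IGNORECASE flags are an addition, so that a title spanning several lines or written as <TITLE> still matches.

```python
import re

# Hypothetical sample page; the attribute on <title> only exercises the .*? part.
html = """<html>
<head>
<title lang="en">
  This is the page title to extract
</title>
</head>
</html>"""

# re.DOTALL lets "." cross line breaks; re.IGNORECASE also accepts <TITLE>.
TITLE_RE = re.compile(r"<title.*?>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(page)
    return match.group(1).strip() if match else None


print(extract_title(html))
print(extract_title("<p>no title here</p>"))
```

Using search rather than findall returns only the first match, so the no-match case has to be checked before calling group.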

⚠️ Note: if the page contains more than one <title> tag or has a complex structure, the regex may match the wrong thing. In that case, narrow the match down using the surrounding context or some other check.

Extracting the value of a specific attribute

Sometimes you need to pull the value of an attribute out of an HTML tag, for example the src of every image:

    <img.*?src="(.*?)".*?>

This extracts the image address from HTML like the following:

    <img src="/images/logo.png" alt="Logo">

The result is /images/logo.png.

Tips:

  • If you are not sure which quote style is used, src=(['"])(.*?)\1 handles both single and double quotes. With this pattern the attribute value is in the second capture group.
  • Watch out for escape characters. In Python, use a raw string (r'') so that backslashes are not treated as escapes.

Matching the content of a tag with a specific class

Want to extract the content under a particular class, for example the whole block inside <div class="content">...</div>?

    <div class="content".*?>([\s\S]*?)</div>

Here [\s\S]*? matches any character, including line breaks.

⚠️ Risks:

  • Nested HTML easily breaks this regex. If the block contains further </div> tags, the non-greedy match stops at the first one.
  • An HTML parser is the more robust option and avoids problems such as unclosed tags or a changed attribute order.

Some practical advice

  • Test your regex against real data samples, not just the ideal case
  • Prefer non-greedy matching (.*?); otherwise it is easy to match too much
  • For complex HTML structures, reach for a dedicated parsing library first rather than forcing a regex
  • Regex is only one tool among several and does not suit every HTML parsing task

That is basically it. Extracting HTML content with regex is not complicated, but the details are easy to get wrong, so testing often and inspecting what actually matched is the key.
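
The attribute and class patterns discussed in this article can be tried out with a short Python sketch like the one below. The sample HTML is made up for illustration. The image pattern is a slight variation on the article's: it uses [^>]*? instead of .*? so that the match cannot run past the end of the <img> tag, and it applies the quote back-reference so that both quote styles work.

```python
import re

# Hypothetical sample mixing double and single quotes, plus a nested <div>.
html = """
<img src="/images/logo.png" alt="Logo">
<img alt='Banner' src='/images/banner.jpg'>
<div class="content">
  <p>Intro</p>
  <div class="note">Nested</div>
  <p>Outro</p>
</div>
"""

# Group 1 captures the opening quote, \1 requires the same quote to close,
# and group 2 holds the attribute value.
IMG_SRC_RE = re.compile(r"""<img\b[^>]*?\ssrc=(['"])(.*?)\1""", re.IGNORECASE)
srcs = [m.group(2) for m in IMG_SRC_RE.finditer(html)]
print(srcs)

# The article's class pattern, unchanged. [\s\S]*? crosses line breaks but
# stops at the first </div>, which here belongs to the nested block.
DIV_RE = re.compile(r'<div class="content".*?>([\s\S]*?)</div>')
block = DIV_RE.search(html).group(1)
print(block)
```

The second result is cut off at the nested block's closing tag, so the final paragraph is missing. This is the nesting problem the article warns about, and the reason a parser is the safer choice for anything beyond flat markup.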