How to Extract HTML Content with Regular Expressions

Published: 2025-07-01 12:53:55

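Regular expressions can extract specific content from HTML, but they are not the best tool for the job; a parsing library such as BeautifulSoup is recommended. 1. To extract the text inside a tag, use a pattern like <title.*?>(.*?)</title>, where the capture group holds the content you want. 2. To extract an attribute value such as an image's src, use <img.*?src="(.*?)".*?>; the variant src=(['\"])(.*?)\1 accepts both single and double quotes. 3. To match the content of a tag with a specific class, such as <div class="content">...</div>, use <div class="content".*?>([\s\S]*?)</div>, although nested structures can make the match fail. When testing, use real data and prefer non-greedy quantifiers, and for complex structures reach for an HTML parsing library first.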

How do you use regular expressions to extract specific content from HTML?

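Extracting specific content from HTML is a common need when processing web data. Regular expressions (regex) are not the best tool for parsing HTML (BeautifulSoup or a similar library is recommended), but in simple cases they are still a quick and effective option.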

Matching the text inside a tag

If you only want the text between a pair of tags, for example the title inside the <title> tag, you can use a regex like this:

    <title.*?>(.*?)</title>

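This expression breaks down as follows:

  • .*? matches any characters non-greedily, that is, as few as possible
  • (.*?) is a capture group that extracts the content you actually want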

For example, given this HTML:

    <title>This is the page title to extract</title>

the regex extracts "This is the page title to extract".
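As a minimal sketch, the title pattern above could be applied with Python's built-in re module roughly as follows. The sample markup, the helper name extract_title, and the IGNORECASE and DOTALL flags are assumptions added for illustration; they are not part of the original pattern.

```python
import re

# Hypothetical sample page; the title text is illustrative only.
html = '<html><head><title lang="en">Sample page title</title></head></html>'

# <title.*?> matches the opening tag, including any attributes;
# (.*?) lazily captures everything up to the first </title>.
# IGNORECASE and DOTALL are assumptions added here so that <TITLE>
# and titles spanning several lines are also matched.
TITLE_RE = re.compile(r"<title.*?>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(page)
    return match.group(1).strip() if match else None


print(extract_title(html))
```

Because search() returns only the first match, a page with several <title> tags would still need extra context to pick the right one.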

⚠️ Note: if the page contains several <title> tags or has a complex structure, the pattern may match the wrong one. In that case, use the surrounding context or another check to pick the right match.

Extracting the value of a specific attribute

Sometimes you need the value of an attribute from an HTML tag, for example the src of every image:

    <img.*?src="(.*?)".*?>

This pulls the image address out of HTML like the following:

    <img src="/images/logo.png" alt="Logo">

The result is /images/logo.png

Tips:

  • If you are not sure which quote style is used, src=(['\"])(.*?)\1 accepts both single and double quotes. The first group captures the opening quote and \1 requires the same closing quote, so the attribute value ends up in the second group.
  • Watch out for escape characters. In Python, for example, use a raw string (r'') so that backslashes are not interpreted as string escapes.

Matching the content of a tag with a specific class

Want to extract the content under a particular class, for example the whole block inside <div class="content">...</div>?

    <div class="content".*?>([\s\S]*?)</div>

Here [\s\S]*? is used to match every character including newlines, which . does not match by default.

⚠️ Risks:

  • Nested HTML easily breaks this kind of regex. If the block contains further </div> tags, the non-greedy match stops at the first one.
  • An HTML parser is the more robust option and avoids problems such as unclosed tags or attributes appearing in a different order.

Some practical advice

  • Test your regex against real data samples wherever possible, not just the ideal case
  • Prefer non-greedy quantifiers (.*?), otherwise it is very easy to match too much
  • For complex HTML structures, consider a dedicated parsing library first instead of forcing a regex to work
  • Regex is only one tool among several and does not suit every HTML parsing scenario

That is basically it. Extracting HTML content with regex is not complicated, but the details are easy to get wrong; testing often and inspecting the match results is what matters.
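The attribute and class patterns from the sections above can be tried out with a short Python sketch like the one below. The sample markup and the helper names extract_img_srcs and extract_content are hypothetical, and the IGNORECASE flag is an added assumption; the patterns themselves are the ones given in the article.

```python
import re

# Hypothetical sample markup: both quote styles and a nested <div>.
html = """
<img src="/images/logo.png" alt="Logo">
<img alt='Banner' src='/images/banner.jpg'>
<div class="content">
  <p>First paragraph</p>
  <div class="inner">nested</div>
  <p>tail</p>
</div>
"""

# Group 1 captures the opening quote and \1 requires the same closing
# quote, so the attribute value is in group 2 (not group 1).
IMG_SRC_RE = re.compile(r"""<img.*?src=(['"])(.*?)\1""", re.IGNORECASE)


def extract_img_srcs(page):
    """Return the src value of every <img> tag the pattern can find."""
    return [m.group(2) for m in IMG_SRC_RE.finditer(page)]


# [\s\S]*? spans newlines but stops at the FIRST </div> it meets.
CONTENT_RE = re.compile(r'<div class="content".*?>([\s\S]*?)</div>')


def extract_content(page):
    """Return the captured block for div.content, or None if absent."""
    m = CONTENT_RE.search(page)
    return m.group(1) if m else None


print(extract_img_srcs(html))
print(extract_content(html))
```

On this sample the second function returns a block that ends at the inner </div>, so the trailing paragraph is lost. That is the nesting problem described in the risk notes, and it is the point at which a real HTML parser becomes the safer choice.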