
How to Extract Specific HTML Content with Regular Expressions

Date: 2025-08-17 09:04:26


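Regular expressions can extract specific content from HTML, but they are not the best tool for the job; a library such as BeautifulSoup is recommended.

1. To extract the text inside a tag, use a pattern like <title.*?>(.*?)</title>; the capture group holds the content you want.
2. To extract an attribute value such as an image's src, use <img.*?src="(.*?)".*?>; the variant src=(['"])(.*?)\1 accepts both single and double quotes.
3. To match the content of a tag with a specific class, such as <div class="content">...</div>, use <div class="content".*?>([\s\S]*?)</div>, although nested structures can make the match fail.

When testing, use real data and prefer non-greedy quantifiers. For complex structures, reach for an HTML parsing library first.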

How do you extract specific content from HTML with regular expressions?

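Extracting specific content from HTML is a common need when working with web data. Regular expressions (regex) are not the best tool for parsing HTML (BeautifulSoup or a similar library is recommended), but in simple cases they are still a quick and effective option.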
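For comparison, here is a minimal sketch of the parser-based route. The article recommends BeautifulSoup; this sketch swaps in Python's standard-library html.parser so that it runs without installing anything. The class name TitleAndImageParser and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class TitleAndImageParser(HTMLParser):
    """Hypothetical helper: collects the <title> text and every <img> src."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.image_sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # dealt with quote style and attribute order.
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.image_sources.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self.in_title:
            self.title += data


# Invented sample page: note the single quotes and the src placed last.
sample_html = """
<html><head><title>Page title to extract</title></head>
<body><img alt='Logo' src='/images/logo.png'></body></html>
"""

parser = TitleAndImageParser()
parser.feed(sample_html)
parser.close()

print(parser.title)
print(parser.image_sources)
```

The parser does not care which quotes the page uses or where src sits among the attributes, which is exactly where hand-written patterns tend to need extra care.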

Matching the text inside a tag

If you only need the text between a pair of tags, for example the page title inside the <title> tag, a pattern like this will do:

    <title.*?>(.*?)</title>

In this pattern:

- .*? matches any characters non-greedily
- (.*?) is a capture group that holds the content you actually want

For example, given this HTML:

    <title>The page title to extract</title>

the pattern extracts "The page title to extract".

Note: if the page contains more than one <title> tag, or the structure is complex, the pattern may match the wrong one, and you will need surrounding context or some other check to tell them apart. Also keep in mind that in most regex flavors . does not match a line break by default, so a title that spans several lines needs a dot-all flag (re.DOTALL in Python).

Extracting an attribute value

Sometimes you need the value of an attribute, such as the src of every image:

    <img.*?src="(.*?)".*?>

Applied to HTML like this:

    <img src="/images/logo.png" alt="Logo">

the result is /images/logo.png.

Tips:

- If you are not sure which quote style the page uses, src=(['"])(.*?)\1 accepts both single and double quotes. The quote character is captured as group 1, so the attribute value moves to group 2.
- Watch the escaping. In Python, write the pattern as a raw string (r'...') so that backslashes such as \1 reach the regex engine unchanged.

Matching the content of a tag with a specific class

To extract the content under a given class, for example the whole block inside <div class="content">...</div>:

    <div class="content".*?>([\s\S]*?)</div>

Here [\s\S]*? matches any character, including line breaks.

Risks:

- Nested HTML easily breaks this pattern. If the block contains inner </div> tags, the non-greedy match stops at the first one and the captured block is cut short.
- An HTML parser is the more robust choice, because it copes with problems such as unclosed tags and attributes that appear in a different order.

Practical advice

- Test patterns against real data samples, not just the ideal case.
- Prefer non-greedy quantifiers (.*?); a greedy .* runs on to the last possible closing tag and easily matches too much.
- When the HTML structure is complex, reach for a dedicated parsing library first instead of forcing a regex to fit.
- Regex is only one tool among several and does not suit every HTML parsing task.

That is essentially all there is to it. Extracting HTML content with regex is not complicated, but the details are easy to get wrong, so test often and inspect what each pattern actually matched.
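The patterns discussed in this article can be tried in Python roughly as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the sample page and all variable names are invented for illustration, and the image pattern combines the <img.*? prefix with the quote-tolerant src part.

```python
import re

# Hypothetical sample page, invented for this sketch.
sample_html = """<html>
<head><title>Page title to extract</title></head>
<body>
<img src="/images/logo.png" alt="Logo">
<img alt='Banner' src='/images/banner.jpg'>
<div class="content">
  <p>First paragraph.</p>
  <div class="note">Nested block</div>
  <p>Second paragraph.</p>
</div>
</body>
</html>"""

# 1. Text inside <title>. re.DOTALL lets "." cross line breaks.
TITLE_RE = re.compile(r"<title.*?>(.*?)</title>", re.DOTALL)

# 2. src of every <img>, with either quote style.
#    Group 1 is the quote character, group 2 is the attribute value.
IMG_SRC_RE = re.compile(r"""<img.*?src=(['"])(.*?)\1""")

# 3. Content of <div class="content">, matched non-greedily.
CONTENT_RE = re.compile(r'<div class="content".*?>([\s\S]*?)</div>')

title_match = TITLE_RE.search(sample_html)
title = title_match.group(1).strip() if title_match else None

image_sources = [m.group(2) for m in IMG_SRC_RE.finditer(sample_html)]

# The nested <div class="note"> closes first, so the capture stops there
# and never reaches the second paragraph.
content_match = CONTENT_RE.search(sample_html)
content = content_match.group(1) if content_match else None

# Greedy versus non-greedy on two adjacent paragraphs.
two_paragraphs = "<p>a</p><p>b</p>"
greedy = re.findall(r"<p>(.*)</p>", two_paragraphs)
lazy = re.findall(r"<p>(.*?)</p>", two_paragraphs)

print(title)
print(image_sources)
print("Second paragraph." in content)  # False: the block was cut short
print(greedy)
print(lazy)
```

The truncated content capture is the nesting problem described under the risks: the pattern has no way to count opening and closing tags, which is the main reason to prefer a parser for anything beyond flat markup.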