
How to Extract HTML Content with Regular Expressions

Published: 2025-06-28 14:49:53


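Regular expressions can extract specific content from HTML, but they are not the best tool for the job; a parsing library such as BeautifulSoup is recommended. 1. To extract the text inside a tag, use a regex such as <title.*?>(.*?)</title>, where the capture group holds the content you want. 2. To extract an attribute value such as an image's src, use <img.*?src="(.*?)".*?>; src=(['\"])(.*?)\1 handles both single and double quotes. 3. To match the content of a tag with a specific class, such as <div class="content">...</div>, use <div class="content".*?>([\s\S]*?)</div>, although nested structures can make the match fail. When testing, use real data and prefer non-greedy quantifiers; for complex structures, reach for an HTML parsing library first.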

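How do you use regular expressions to extract specific content from HTML?

Extracting specific content from HTML is a common need when working with web data. Regular expressions (regex) are not the best tool for parsing HTML (BeautifulSoup or a similar library is recommended), but in simple cases they remain a quick and effective option.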
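To make the parser recommendation concrete, here is a minimal sketch of the parser-based route. It uses Python's standard-library `html.parser` module as a stand-in for BeautifulSoup so that it runs without installing anything. The class name `TitleParser`, the helper `extract_title`, and the sample HTML are illustrative assumptions, not part of the original article.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False  # True while the parser is inside a title element
        self.title_parts = []   # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the page title, or None if no title text was found."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.title_parts).strip()
    return text or None


# Hypothetical sample document used only for illustration.
sample = "<html><head><TITLE lang='en'>Example page</TITLE></head><body><p>Hi</p></body></html>"
print(extract_title(sample))  # Example page
```

Because the parser tracks tags as events instead of scanning raw characters, upper-case tag names and extra attributes need no special handling here.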

Matching the text inside a tag

If you only want the text between a pair of tags, such as the title inside the <title> tag, you can use the following regex:

    <title.*?>(.*?)</title>

In this expression:

  • .*? matches any run of characters non-greedily (by default, . does not match a newline)
  • (.*?) is a capture group that extracts the content you actually want

For example, given this HTML:

    <title>This is the page title to extract</title>

the regex extracts "This is the page title to extract".
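The title pattern above can be tried in Python roughly as follows. This is a minimal sketch: the function name `extract_title` and the sample string are illustrative, and the `re.IGNORECASE` and `re.DOTALL` flags are additions not mentioned in the article (they let `<TITLE>` and multi-line titles match).

```python
import re

# Pattern from the text above, written as a raw string.
# IGNORECASE and DOTALL are assumptions added for this sketch.
TITLE_RE = re.compile(r"<title.*?>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # group(1) is the capture group (.*?) holding the title text.
    return match.group(1).strip()


sample_html = "<html><head><title>This is the page title to extract</title></head></html>"
print(extract_title(sample_html))  # This is the page title to extract
```

`search` returns only the first match, so a page with several `<title>` tags would need `finditer` plus some extra check to pick the right one.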

⚠️ Note: if the page contains several <title> tags or has a complex structure, the pattern may match the wrong thing; in that case, use surrounding context or another check to confirm the result.

Extracting the value of a specific attribute

Sometimes you need the value of an attribute from HTML tags, for example the src of every image:

    <img.*?src="(.*?)".*?>

This extracts the image address from HTML such as:

    <img src="/images/logo.png" alt="Logo">

The result is /images/logo.png.

Tips:

  • If you are not sure which quote style is used, src=(['\"])(.*?)\1 handles both single and double quotes; the attribute value is then in the second capture group
  • Watch out for escape characters; in Python, write the pattern as a raw string r'' so that backslashes are not interpreted as string escapes

Matching the content of a tag with a specific class

Want the content under a particular class, for example the whole block in <div class="content">...</div>?

    <div class="content".*?>([\s\S]*?)</div>

Here [\s\S]*? matches any character, including line breaks.

⚠️ Risks:

  • Nested HTML easily breaks this kind of regex, for example when the block contains more </div> tags of its own
  • An HTML parser is the more robust choice and avoids problems such as unclosed tags or attributes appearing in a different order

Some practical advice

  • Test your regex against real data samples, not only the ideal case
  • Prefer non-greedy quantifiers (.*?); otherwise it is easy to match far too much
  • For complex HTML structures, reach for a dedicated parsing library first instead of forcing a regex to work
  • Regex is only one tool and does not suit every HTML parsing task

That is essentially it. Extracting HTML content with regex is not complicated, but the details are easy to get wrong, so test often and inspect what actually matched.
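The attribute and class patterns from the sections above can be exercised with a short Python sketch. The function names and sample strings are illustrative assumptions. The image pattern is the quote-agnostic variant from the tips, and the last call shows the nesting risk: the lazy match stops at the first `</div>`.

```python
import re

# Quote-agnostic src pattern: group 1 is the quote character, group 2 the value.
# A triple-quoted raw string lets the pattern contain both ' and ".
IMG_SRC_RE = re.compile(r"""<img.*?src=(['"])(.*?)\1""", re.IGNORECASE)

# Class pattern from the text: lazily capture everything up to the first </div>.
CONTENT_RE = re.compile(r'<div class="content".*?>([\s\S]*?)</div>')


def extract_img_srcs(html):
    """Return the src value of every <img> tag the pattern finds."""
    return [m.group(2) for m in IMG_SRC_RE.finditer(html)]


def extract_content(html):
    """Return the captured inner text of the first content div, or None."""
    m = CONTENT_RE.search(html)
    return m.group(1) if m else None


images = """<img src="/images/logo.png" alt="Logo"><img alt='x' src='/images/a.jpg'>"""
print(extract_img_srcs(images))  # ['/images/logo.png', '/images/a.jpg']

flat = '<div class="content">\n<p>Hello</p>\n</div>'
print(repr(extract_content(flat)))  # '\n<p>Hello</p>\n'

# Nested case: the capture ends at the inner </div>, so <p>B</p> is lost.
nested = '<div class="content"><div class="inner">A</div><p>B</p></div>'
print(extract_content(nested))  # <div class="inner">A
```

The truncated result in the nested case is the failure mode the risk notes describe; a regex of this shape cannot count matching open and close tags, which is why a parser is the safer choice for such input.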