
How to Use Beautiful Soup to Extract Data from the Public Web

Source: dev.to

Date: 2024-08-01 18:45:40


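Beautiful Soup is a Python library for extracting data from web pages. It builds a parse tree from HTML and XML documents, which makes it easy to pull out the information you need.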

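Beautiful Soup provides several key features for web scraping:

Navigating the parse tree: You can easily traverse the parse tree and search for elements, tags, and attributes.
Modifying the parse tree: You can add, remove, and update tags and attributes.
Output formatting: You can convert the parse tree back into a string, which makes it easy to save modified content.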
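
The three features above can be sketched on a small in-memory document. The HTML snippet, tag names, and class names below are made up for illustration; only the Beautiful Soup calls themselves come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<html><body>
  <h1 id="title">Old title</h1>
  <p class="intro">Hello, <a href="/about">about us</a>.</p>
  <p class="ad">Buy now!</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigating: search by tag name, attribute, or CSS class.
link = soup.find("a")
link_target = link["href"]
intro_text = soup.find("p", class_="intro").get_text()

# Modifying: update text and attributes, and remove a tag entirely.
soup.find("h1").string = "New title"
link["href"] = "/contact"
soup.find("p", class_="ad").decompose()

# Output: convert the modified tree back into a string.
result = str(soup)
print(result)
```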

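To use Beautiful Soup, install the library and choose a parser. html.parser ships with Python's standard library and needs no installation; lxml is a separate package. You can install Beautiful Soup and lxml with pip: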

# Install Beautiful Soup and the lxml parser using pip.
pip install beautifulsoup4 lxml
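
To confirm the installation and see how the parser is chosen, a minimal sketch follows. The helper name make_soup is hypothetical; it tries lxml first and falls back to the built-in html.parser if lxml is not installed.

```python
import bs4
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


print("Beautiful Soup version:", bs4.__version__)
soup = make_soup("<p>It works</p>")
print(soup.p.get_text())
```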

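Handling Pagination

When a website spreads its content across multiple pages, you need to handle pagination to scrape all of the data.

Identify the pagination structure: Inspect the site to see how pagination works (for example, a "next page" button or numbered links).
Iterate over pages: Use a loop to visit each page and scrape its data.
Update the URL or parameters: Modify the URL or query parameters to fetch the next page.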

import requests
from bs4 import BeautifulSoup

base_url = 'https://example-blog.com/page/'
page_number = 1
all_titles = []

while True:
    # Construct the URL for the current page
    url = f'{base_url}{page_number}'
    response = requests.get(url, timeout=10)
    if not response.ok:
        break  # Stop on an HTTP error (for example, a 404 past the last page)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all article titles on the current page
    titles = soup.find_all('h2', class_='article-title')
    if not titles:
        break  # Exit the loop if no titles are found (end of pagination)

    # Extract and store the titles
    for title in titles:
        all_titles.append(title.get_text())

    # Move to the next page
    page_number += 1

# Print all collected titles
for title in all_titles:
    print(title)
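
When a site exposes a "next page" link instead of predictable page numbers, an alternative is to follow that link until it disappears. The sketch below assumes the link carries the class "next", which is a guess and not something the example site is known to use. The function names parse_page and crawl are hypothetical, and fetching is passed in as a callable so the example runs against an in-memory stand-in for a site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_page(html, page_url):
    """Return (titles, next_url) for one page; next_url is None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    titles = [h2.get_text(strip=True)
              for h2 in soup.find_all("h2", class_="article-title")]
    # Assumption: the site marks its next-page link with class "next".
    next_link = soup.find("a", class_="next")
    if next_link and next_link.get("href"):
        return titles, urljoin(page_url, next_link["href"])
    return titles, None


def crawl(start_url, fetch, max_pages=100):
    """Follow next-page links from start_url; fetch(url) must return HTML text."""
    all_titles, url, seen = [], start_url, set()
    # The seen set and max_pages guard against link loops and runaway crawls.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        titles, url = parse_page(fetch(url), url)
        all_titles.extend(titles)
    return all_titles


# In-memory stand-in for a real site, so the sketch runs without network access.
fake_site = {
    "https://example-blog.com/page/1":
        '<h2 class="article-title">First</h2><a class="next" href="/page/2">Next</a>',
    "https://example-blog.com/page/2":
        '<h2 class="article-title">Second</h2>',
}
titles = crawl("https://example-blog.com/page/1", fake_site.__getitem__)
print(titles)
```

Against a live site, the fetch argument could be something like lambda u: requests.get(u, timeout=10).text.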

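Extracting Nested Data

Sometimes the data you need is nested several tags deep. Here is how to handle nested data extraction.

Navigate to the parent tag: Find the parent tag that contains the nested data.
Extract the nested tags: Within each parent tag, find the nested tags.
Iterate over the nested tags: Loop over the nested tags to extract the information you need.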

import requests
from bs4 import BeautifulSoup

url = 'https://example-blog.com/post/123'
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.content, 'html.parser')

# Find the comments section
comments_section = soup.find('div', class_='comments')

# Extract individual comments (none if the section is missing)
comments = comments_section.find_all('div', class_='comment') if comments_section else []

for comment in comments:
    # Extract author and content from each comment
    author = comment.find('span', class_='author').get_text()
    content = comment.find('p', class_='content').get_text()
    print(f'Author: {author}\nContent: {content}\n')
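
Real pages often omit some nested tags, and calling get_text() on a missing tag raises an AttributeError. A minimal sketch of a more defensive variant follows, using CSS selectors and a hypothetical helper text_or_default. The HTML is invented for illustration and mirrors the class names used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second comment has no author tag.
html = """
<div class="comments">
  <div class="comment">
    <span class="author">Ana</span>
    <p class="content">Great post!</p>
  </div>
  <div class="comment">
    <p class="content">No author here.</p>
  </div>
</div>
"""


def text_or_default(parent, selector, default=""):
    """Return the stripped text of the first match under parent, or default."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else default


soup = BeautifulSoup(html, "html.parser")

# select() returns an empty list when the comments section is absent,
# so no separate None check is needed for the parent.
comments = [
    {
        "author": text_or_default(c, "span.author", "anonymous"),
        "content": text_or_default(c, "p.content"),
    }
    for c in soup.select("div.comments div.comment")
]
print(comments)
```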

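Handling AJAX Requests

Many modern websites load data dynamically with AJAX. Such data does not appear in the initial HTML, so it calls for a different technique: use the browser's developer tools to watch the network requests the page makes, then replicate those requests in your scraper. These endpoints typically return JSON, so there is often no HTML to parse.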

import requests

# URL of the API endpoint that provides the AJAX data
ajax_url = 'https://example.com/api/data?page=1'
response = requests.get(ajax_url, timeout=10)
response.raise_for_status()  # Fail early on an HTTP error
data = response.json()

# Extract and print data from the JSON response
for item in data['results']:
    print(item['field1'], item['field2'])
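
AJAX endpoints are usually paginated too. The sketch below collects every page of results, assuming the endpoint accepts a page parameter and returns an empty results list after the last page. Both are assumptions about a hypothetical API, as are the function names fetch_page and collect_results.

```python
import requests

# Assumption: same hypothetical endpoint as above, with a 'page' query parameter.
API_URL = "https://example.com/api/data"


def fetch_page(page):
    """Fetch one page of JSON from the endpoint and return it as a dict."""
    response = requests.get(API_URL, params={"page": page}, timeout=10)
    response.raise_for_status()
    return response.json()


def collect_results(fetch=fetch_page, max_pages=50):
    """Gather 'results' from successive pages until a page comes back empty."""
    rows = []
    for page in range(1, max_pages + 1):
        results = fetch(page).get("results", [])
        if not results:
            break  # Assumed end-of-data signal: an empty or missing list
        rows.extend(results)
    return rows
```

Calling collect_results() with no arguments would hit the endpoint; passing a different fetch callable makes the loop easy to try without network access.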

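Risks of Web Scraping

Web scraping calls for careful attention to legal, technical, and ethical risks. With appropriate safeguards in place, you can mitigate these risks and scrape responsibly and effectively.

Terms of service violations: Many websites explicitly prohibit scraping in their terms of service (ToS). Violating these terms may lead to legal action.
Intellectual property issues: Scraping content without permission may infringe intellectual property rights and lead to legal disputes.
IP blocking: Websites may detect and block IP addresses that show scraping behavior.
Account bans: If you scrape a site that requires user authentication, the account used for scraping may be banned.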
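
Two common safeguards against the technical risks above are honoring a site's robots.txt and pausing between requests. The sketch below uses the standard library's urllib.robotparser on an invented robots.txt; the user agent name and the helper functions are hypothetical. Note that robots.txt is separate from a site's terms of service, so passing this check does not settle the legal questions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the site.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def wait_between_requests(rules, agent, default=1.0):
    """Sleep for the site's Crawl-delay if declared, else for a default pause."""
    time.sleep(rules.crawl_delay(agent) or default)


agent = "my-scraper"  # hypothetical user agent name
rules = build_rules(robots_txt)

public_ok = rules.can_fetch(agent, "https://example-blog.com/page/1")
private_ok = rules.can_fetch(agent, "https://example-blog.com/private/data")
delay = rules.crawl_delay(agent)
print(public_ok, private_ok, delay)
```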

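Beautiful Soup is a powerful library that simplifies web scraping by providing an easy-to-use interface for navigating and searching HTML and XML documents. It handles a wide range of parsing tasks, which makes it a useful tool for anyone who wants to extract data from the web.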

