A Complete Beginner's Guide to Python Web Scraping

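1. What Is a Web Crawler

A web crawler is a program that retrieves web page content automatically. It mimics what a browser does: it sends a request to a server, receives the response, and parses out the data it needs.

How a crawler works

(1) Send an HTTP request
(2) Receive the server's response
(3) Parse the response content
(4) Extract the required data
(5) Save the data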
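The five steps can be sketched end to end with nothing but the standard library. This is a minimal illustration, not a production crawler: the request and response steps are replaced by a hard-coded string so the sketch runs offline, and the names `SAMPLE_HTML`, `HeadingParser`, `extract_headings`, and `to_json` are made up for this example.

```python
import json
from html.parser import HTMLParser

# Steps 1-2 (request and response) are stubbed with a fixed string so the
# sketch runs offline; a real crawler would fetch this text over HTTP.
SAMPLE_HTML = "<html><body><h1>First</h1><h1>Second</h1></body></html>"


class HeadingParser(HTMLParser):
    """Step 3: parse the response and collect the text of every <h1>."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data.strip())


def extract_headings(html):
    """Step 4: return the extracted data as a plain Python list."""
    parser = HeadingParser()
    parser.feed(html)
    return parser.headings


def to_json(records):
    """Step 5: serialize the data (to a string here, instead of a file)."""
    return json.dumps(records, ensure_ascii=False)


headings = extract_headings(SAMPLE_HTML)
print(to_json(headings))  # ["First", "Second"]
```

The later sections swap each stub for a real tool: Requests for steps 1 and 2, BeautifulSoup for steps 3 and 4, and the `json` and `csv` modules for step 5.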

2. Setting Up the Environment

2.1 Installing Python

Make sure Python 3.7 or later is installed.

2.2 Installing Dependencies

    pip install requests
    pip install beautifulsoup4
    pip install lxml
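After installing, it can be worth confirming that the packages are importable from the interpreter you intend to use. The sketch below is one way to do that with the standard library; the helper name `missing_modules` is made up for this example. Note that the import name can differ from the pip name: `beautifulsoup4` is imported as `bs4`.

```python
import importlib.util

# Import names, not pip names: beautifulsoup4 is imported as bs4.
REQUIRED = ["requests", "bs4", "lxml"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```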
                    

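3. HTTP Basics

HTTP is the protocol a crawler uses to communicate with a server, so a working understanding of it is the foundation for writing a good crawler.

3.1 Request Methods

Method	Purpose
GET	Retrieve a resource
POST	Submit data
PUT	Update a resource
DELETE	Delete a resource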
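To make the table concrete, here is a small sketch that builds one request object per method without sending anything over the network. It uses the standard library's `urllib.request.Request` rather than Requests, and the URL is only an assumed test endpoint.

```python
from urllib.request import Request

url = "https://httpbin.org/anything"  # assumed test endpoint; nothing is sent

get_req = Request(url)                                    # no body: GET
post_req = Request(url, data=b"name=test")                # body present: POST
put_req = Request(url, data=b"name=test", method="PUT")   # explicit method
delete_req = Request(url, method="DELETE")                # explicit method

for req in (get_req, post_req, put_req, delete_req):
    print(req.get_method(), req.full_url)
```

A request with no body defaults to GET, and one with a body defaults to POST; PUT and DELETE have to be named explicitly.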

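3.2 Status Codes

Status code	Meaning
200	Request succeeded
301/302	Redirect (301 is permanent, 302 is temporary)
400	Bad request (for example, invalid parameters)
403	Access forbidden
404	Resource not found
500	Internal server error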
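A crawler usually has to decide what to do with each status code. The sketch below shows one possible policy; the function name `classify_status`, the category labels, and the choice of which codes count as retryable are assumptions made for illustration, not a standard.

```python
from http import HTTPStatus

# Codes treated as temporary failures in this sketch (an assumed policy).
RETRYABLE = {429, 500, 502, 503, 504}


def classify_status(code):
    """Map a status code to a coarse action a crawler might take."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if code in RETRYABLE:
        return "retry"
    if 400 <= code < 500:
        return "client_error"
    if 500 <= code < 600:
        return "server_error"
    return "unknown"


for code in (200, 301, 403, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

`http.HTTPStatus` supplies the standard reason phrase for each code, so the table above does not have to be retyped in code.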

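4. The Requests Library

Requests is one of the most popular HTTP libraries for Python. It is simple to use and full-featured.

4.1 Sending a GET Request

    import requests

    # Send a GET request (the timeout keeps the call from hanging indefinitely)
    response = requests.get('https://api.github.com', timeout=10)

    # Check the status code
    print(response.status_code)  # 200 on success

    # Inspect the response body
    print(response.text)  # JSON or HTML, depending on the endpoint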
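GET requests often carry query parameters. As a hedged sketch, the code below builds a request without sending it, to show how Requests encodes a `params` dictionary into the URL. The search endpoint and its parameters are used purely as an illustration, and `fetch_json` is a hypothetical helper that is defined but not called here.

```python
import requests

# Build (but do not send) a GET request to see how params are encoded.
request = requests.Request(
    "GET",
    "https://api.github.com/search/repositories",
    params={"q": "python", "per_page": 5},
)
prepared = request.prepare()
print(prepared.url)
# https://api.github.com/search/repositories?q=python&per_page=5


def fetch_json(url, params=None, timeout=10):
    """Hypothetical helper: send the GET request and decode a JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()
```

Passing `params` instead of concatenating strings leaves the URL encoding to the library.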
                    

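4.2 Sending a POST Request

    import requests

    # Send a POST request with form-encoded data
    data = {'username': 'admin', 'password': '123456'}
    response = requests.post('https://httpbin.org/post', data=data, timeout=10)

    # httpbin echoes the request back as JSON
    print(response.json())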
                    

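4.3 Adding Request Headers

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json'
    }

    # api.example.com is a placeholder; substitute a real endpoint
    response = requests.get('https://api.example.com', headers=headers, timeout=10)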
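When many requests share the same headers, a `requests.Session` can hold them once instead of repeating the `headers=` argument on every call. The following is a minimal sketch under that assumption; the User-Agent string and the `get` wrapper are invented for the example, and no request is actually sent.

```python
import requests

session = requests.Session()

# Headers set here are merged into every request made through the session.
session.headers.update({
    "User-Agent": "my-crawler/0.1",  # hypothetical identifier for this example
    "Accept": "application/json",
})


def get(url, **kwargs):
    """Hypothetical wrapper: apply a default timeout to every GET."""
    kwargs.setdefault("timeout", 10)
    return session.get(url, **kwargs)


print(session.headers["User-Agent"])
```

A session also reuses the underlying connection and keeps cookies between requests, which tends to matter once a crawler fetches more than a handful of pages from one site.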
                    

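5. Parsing HTML

BeautifulSoup is one of the most popular HTML parsing libraries for Python.

5.1 Basic Usage

    from bs4 import BeautifulSoup

    html = '''
    <html>
    <body>
        <h1>Title</h1>
        <p class="content">Content</p>
    </body>
    </html>
    '''

    soup = BeautifulSoup(html, 'lxml')

    # Find elements by tag name and by class
    title = soup.find('h1').text
    content = soup.find('p', class_='content').text

    print(title)    # Title
    print(content)  # Content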

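5.2 CSS Selectors

    # Use CSS selectors (assumes the page contains elements with class "item",
    # each holding a ".title" element and a link)
    items = soup.select('.item')

    for item in items:
        title = item.select_one('.title').text
        link = item.select_one('a')['href']
        print(f"{title}: {link}")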
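The selector snippet needs a page that actually contains `.item` elements. Below is a self-contained sketch with invented sample markup, using the built-in `html.parser` backend so that lxml is not required. It also guards against items that lack a title or a link, since `select_one` returns `None` when nothing matches. The helper name `extract_items` is an assumption for this example.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the third item deliberately has no link.
html = """
<ul>
  <li class="item"><span class="title">First post</span><a href="/posts/1">read</a></li>
  <li class="item"><span class="title">Second post</span><a href="/posts/2">read</a></li>
  <li class="item"><span class="title">No link</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_items(soup):
    """Return (title, href) pairs, skipping incomplete items."""
    results = []
    for item in soup.select(".item"):
        title = item.select_one(".title")
        link = item.select_one("a[href]")
        if title is None or link is None:
            continue  # select_one returns None when nothing matches
        results.append((title.get_text(strip=True), link["href"]))
    return results


print(extract_items(soup))
# [('First post', '/posts/1'), ('Second post', '/posts/2')]
```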
                    

6. Saving Data

6.1 Saving as JSON

    import json

    data = [
        {'name': '张三', 'age': 25},
        {'name': '李四', 'age': 30}
    ]

    # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
    with open('data.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
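To check that a saved file can be read back, a round trip is a quick test. The sketch below writes to a temporary directory so it leaves no files behind; `save_json` and `load_json` are helper names invented for this example.

```python
import json
import tempfile
from pathlib import Path

data = [{"name": "张三", "age": 25}, {"name": "李四", "age": 30}]


def save_json(records, path):
    """Write records as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_json(path):
    """Read the records back from a UTF-8 JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.json"
    save_json(data, path)
    raw = path.read_text(encoding="utf-8")
    restored = load_json(path)

print("张三" in raw)      # True: the name is stored as-is, not escaped
print(restored == data)   # True: the round trip preserves the data
```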
                    

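6.2 Saving as CSV

    import csv

    # data is the list of dicts defined in section 6.1
    # newline='' prevents blank lines between rows on Windows
    with open('data.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'age'])
        writer.writeheader()
        writer.writerows(data)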
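Scraped records are rarely uniform: some have extra keys and some are missing fields. By default `csv.DictWriter` raises `ValueError` on an unexpected key. The sketch below shows one way to tolerate both cases; it writes to an in-memory buffer, and the rows and the helper name `to_csv` are invented for the example.

```python
import csv
import io

rows = [
    {"name": "张三", "age": 25, "url": "https://example.com/1"},  # extra key
    {"name": "李四"},                                              # missing age
]


def to_csv(records, fieldnames):
    """Serialize records to CSV text, ignoring extras and blanking gaps."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # drop keys that are not in fieldnames
        restval="",             # fill missing fields with an empty string
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


text = to_csv(rows, ["name", "age"])
print(text)

# Reading back: every value is returned as a string, including numbers.
print(list(csv.DictReader(io.StringIO(text))))
```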
                    

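7. Hands-On Project

7.1 Collecting News Headlines

    import requests
    from bs4 import BeautifulSoup

    # news.example.com is a placeholder; substitute the site you want to scrape
    url = 'https://news.example.com'
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    response.encoding = 'utf-8'

    soup = BeautifulSoup(response.text, 'lxml')

    # Extract the news headlines
    titles = soup.select('.news-title')

    for title in titles:
        print(title.text.strip())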
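Separating the parsing from the fetching makes the extraction logic easy to test without a network connection. Here is a hedged sketch of that idea: `extract_titles` is a hypothetical helper, the sample markup is invented, and the `.news-title` class is carried over from the placeholder site above. It uses the built-in `html.parser` backend and also drops empty and duplicate headlines.

```python
from bs4 import BeautifulSoup


def extract_titles(html, selector=".news-title"):
    """Return unique, non-empty headline texts in document order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    titles = []
    for node in soup.select(selector):
        text = node.get_text(strip=True)
        if text and text not in seen:
            seen.add(text)
            titles.append(text)
    return titles


# Invented sample markup standing in for a downloaded page.
sample = """
<div class="news-title"> Markets rally </div>
<div class="news-title">Markets rally</div>
<div class="news-title"></div>
<div class="news-title">New library released</div>
"""

print(extract_titles(sample))
# ['Markets rally', 'New library released']
```

In a real run, the `html` argument would be `response.text` from the request shown above.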
                    

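8. Summary

Getting started with Python web scraping is not hard. Once you have the following core skills, you can begin real projects:

- Understand the HTTP protocol and its request methods
- Use Requests fluently to send requests
- Parse HTML with BeautifulSoup
- Save data to files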
