Python Web Scraping: The BeautifulSoup Module

Installing the BeautifulSoup and Requests modules

pip3 install beautifulsoup4 requests

BeautifulSoup selector rules (CSS selectors passed to select())

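Plain characters	Matched literally, as in a CSS selector (select() does not take regular expressions)
>	Descends one level at a time (parent > child)
.class	Selects tags whose class attribute contains the name class [several class values can be chained with .]
#id	Selects the tag whose id attribute equals id
div	Selects div tags (any tag name works the same way, e.g. p)
[attr]	Selects tags that have the attribute attr
[attr=value]	Selects tags whose attribute attr equals value
tag.class	Selects tags with the given tag name whose class attribute contains class
,	Selects several different tags at once (separate the selectors with commas)
text	Property of a matched tag (not a selector): returns the text inside the tag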
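The rules above can be tried out without any network access. The following is a minimal sketch that runs each kind of selector against a small inline HTML string. The markup and the variable names (SAMPLE_HTML, by_class, and so on) are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only to illustrate the selector rules
SAMPLE_HTML = """
<div id="main" class="box wide">
  <div class="h-title">Sample headline</div>
  <p class="note">first note</p>
  <a href="/a" title="link A">A</a>
  <a>B</a>
</div>
<span class="note">second note</span>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_tag = soup.select("div")                   # tag name: every div
by_class = soup.select(".box.wide")           # two class values chained with .
by_id = soup.select("#main")                  # id attribute
by_tag_class = soup.select("div.h-title")     # tag name plus class
children = soup.select("#main > p")           # direct children only
has_attr = soup.select("a[href]")             # attribute is present
attr_equals = soup.select('a[title="link A"]')  # attribute equals a value
union = soup.select("p.note, span.note")      # several selectors at once

# select() always returns a list, so index it before reading .text
title_text = by_tag_class[0].text
print(title_text)
```

Each call returns a list of matching tags, which is why the text is read from an indexed element rather than from the result of select() itself.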

Small example: extracting the headline of a Xinhuanet article

# Import the packages
import requests
from bs4 import BeautifulSoup

# Build the URL and the User-Agent request header
ua = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
url = 'http://www.xinhuanet.com/2020-03/28/c_1125782104.htm'

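# Send a GET request to the URL with the custom headers
response = requests.get(url, headers=ua, timeout=10)
# Set the character encoding used to decode the response body
response.encoding = 'utf-8'
# Get the decoded page source as a string
res = response.text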

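# Parse the page source into a BeautifulSoup object
html = BeautifulSoup(res, "html.parser")
# select() returns a list of matching tags; take the text of the first match
title = html.select('div.h-title')[0].text

print(title)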
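Indexing the result of select() with [0] raises an IndexError when the page layout changes and nothing matches. One possible way to guard against that is sketched below with select_one(), which returns None instead of an empty list. The helper name extract_title and its default selector are assumptions made for this illustration, not part of BeautifulSoup.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_title(page_source: str, selector: str = "div.h-title") -> Optional[str]:
    """Return the text of the first tag matching selector, or None if absent.

    Hypothetical helper for illustration; the default selector mirrors the
    one used in the example above.
    """
    soup = BeautifulSoup(page_source, "html.parser")
    tag = soup.select_one(selector)  # first match, or None when nothing matches
    if tag is None:
        return None
    # strip=True trims leading and trailing whitespace from the text
    return tag.get_text(strip=True)


# Usage with inline HTML, so no network request is needed
print(extract_title('<div class="h-title"> Sample headline </div>'))
print(extract_title("<p>no title here</p>"))
```

In the script above, the same helper could be called as extract_title(res) in place of the select(...)[0].text line.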