Study notes from Cui Qingcai's web-scraping course.

I. The BeautifulSoup Library in Detail

1. What is BeautifulSoup?

Beautiful Soup is a Python library for parsing HTML and XML. It makes it easy to extract data from web pages without having to write regular expressions.

2. Installation

pip install beautifulsoup4
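
The examples in these notes use the lxml parser throughout; assuming pip is available, it can be installed the same way:

pip install lxml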

3. Parsers

  • Python standard library: BeautifulSoup(markup, "html.parser"). Advantages: built into Python, moderate speed, reasonable tolerance of malformed documents. Disadvantages: poor error tolerance in versions before Python 2.7.3 and Python 3.2.2.
  • lxml HTML parser: BeautifulSoup(markup, "lxml"). Advantages: very fast, good error tolerance. Disadvantages: requires the C-based lxml library.
  • lxml XML parser: BeautifulSoup(markup, "xml"). Advantages: very fast, the only XML parser in this group. Disadvantages: requires the C-based lxml library.
  • html5lib: BeautifulSoup(markup, "html5lib"). Advantages: best error tolerance, parses documents the way a browser does, produces valid HTML5. Disadvantages: very slow, and it is an external Python dependency.
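
A minimal sketch of how the second argument to BeautifulSoup selects the parser (assuming lxml and html5lib have been pip-installed; html.parser needs no extra install):

from bs4 import BeautifulSoup

markup = "<p>Hello</p>"
print(BeautifulSoup(markup, "html.parser").p.string)  # built-in parser, no extra install
print(BeautifulSoup(markup, "lxml").p.string)         # fast, C-based
print(BeautifulSoup(markup, "html5lib").p.string)     # most browser-like, slowest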

4. Basic Usage

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)  # print the title text
print(soup.prettify())    # pretty-print the parsed document

5. Tag Selectors

# Selecting elements
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)  # only the first matching tag is returned
# Get the tag name
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # print the name of the tag
# Get attributes
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])  # both forms return the attribute value
# Get text content
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

6. Nested Selection

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)
# Child nodes and descendant nodes
from bs4 import BeautifulSoup

html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # direct children returned as a list

print(soup.p.children)  # unlike contents, children is an iterator, so loop over it to read the items
for i, child in enumerate(soup.p.children):
    print(i, child)

print(soup.p.descendants)  # all descendant nodes, also an iterator
for i, child in enumerate(soup.p.descendants):
    print(i, child)
# Parent and ancestor nodes
from bs4 import BeautifulSoup

html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # the entire parent tag

print(list(enumerate(soup.a.parents)))  # all ancestor nodes, including the direct parent
# Sibling nodes
from bs4 import BeautifulSoup

html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''

soup = BeautifulSoup(html, 'lxml')

print(list(enumerate(soup.a.next_siblings)))      # siblings after this tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before this tag

7. Standard Selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.

# Search by name
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading" name="elements">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))     # find_all returns a list
print(soup.find_all('ul')[0])

for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
#attrs
from bs4 import BeautifulSoup

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={"id":"list-1"}))
print(soup.find_all(attrs={"name":"elements"}))
# Common attributes as keyword arguments
from bs4 import BeautifulSoup

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # note: class is a Python keyword, so a trailing underscore is required
#text
from bs4 import BeautifulSoup

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))
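
Note: as far as I know, since Beautiful Soup 4.4 the text argument also has the alias string, so the same query can be written as:

print(soup.find_all(string='Foo'))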

find(name, attrs, recursive, text, **kwargs)

find() returns a single element (the first match), while find_all() returns all matching elements.

from bs4 import BeautifulSoup  

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body" name="elements">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))  # returns None when nothing matches

Other commonly used methods (a short sketch follows the list):

  • find_parents() returns all ancestor nodes; find_parent() returns the direct parent
  • find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling
  • find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling
  • find_all_next() returns all matching nodes after the current node; find_next() returns the first match after it
  • find_all_previous() returns all matching nodes before the current node; find_previous() returns the first match before it
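
A minimal sketch of a few of these methods, using a trimmed-down version of the Dormouse markup from the earlier examples (the exact output depends on that markup):

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters;
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
'''
soup = BeautifulSoup(html, 'lxml')
a = soup.find('a', id='link2')                  # the "Lacie" link
print(a.find_parent('p')['class'])              # nearest matching ancestor -> ['story']
print(a.find_next_sibling('a')['id'])           # first following sibling <a> -> link3
print(a.find_previous_sibling('a')['id'])       # first preceding sibling <a> -> link1
print([t['id'] for t in a.find_all_next('a')])  # every <a> after this one -> ['link3']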

8. CSS Selectors

Pass a CSS selector string directly to select() to make a selection.

from bs4 import BeautifulSoup 

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body"name="elelments">
<ul class="list"Id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small"Id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
<div>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading')) #传入css选择器
print(soup.select('ul li')) # 传入标签
print(soup.select('#list-2 .element')) #传入id
print(type(soup.select('ul')[0]))
# Nested selection
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body" name="elements">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
# passing a single combined selector (e.g. 'ul li') is usually more convenient than nesting like this
# Get attributes
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body" name="elements">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # index with the attribute name directly
    print(ul.attrs['id'])  # or go through the attrs dict; both work
# Get text content
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body" name="elements">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.string)      # the string attribute gives the text
    print(li.get_text())  # the get_text() method does too; both work
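
One difference worth noting between the two: .string returns None when a tag has more than one child node, while get_text() concatenates the text of all descendants. A small illustration:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hi <b>there</b></p>', 'lxml')
print(soup.p.string)      # None, because <p> has mixed children
print(soup.p.get_text())  # 'Hi there'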

Summary:

(1) The 'lxml' parser is recommended; use html.parser when necessary.

(2) Tag selectors offer only weak filtering, but they are fast.

(3) find() and find_all() are recommended for matching a single result or multiple results.

(4) If you are comfortable with CSS selectors, use select().

(5) Remember the common methods for getting attribute values and text.

To be continued…

Last updated: 2018-08-14 17:43

Original link: http://pythonfood.github.io/2018/07/02/爬虫-beautifulsoup库/
