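Study notes from Cui Qingcai's web scraping course.

I. PyQuery in Detail

1. What Is PyQuery?

PyQuery is a powerful and flexible HTML parsing library. If regular expressions feel too tedious to write, if BeautifulSoup's syntax is hard to remember, and if you already know jQuery's syntax, PyQuery is an excellent choice.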

http://pyquery.readthedocs.io

2. Installation

pip install pyquery

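3. Initialization

To use PyQuery, you first pass in HTML text to create a PyQuery object. There are several ways to do this: pass a string directly, pass a URL, pass a file name, and so on.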

# Three ways to initialize
from pyquery import PyQuery as pq

# 1. Initialize from a string
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
doc1 = pq(html)
print(doc1('li'))

# 2. Initialize from a URL (requires network access)
doc2 = pq(url='http://www.baidu.com')
print(doc2('head'))

# 3. Initialize from a file (demo.html must exist in the working directory)
doc3 = pq(filename='demo.html')
print(doc3('li'))

4. CSS Selectors

from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
# A space expresses nesting: the selector names an id, then a class, then a tag
print(doc('#container .list li'))
print(type(doc('#container .list li')))  # the result is itself a PyQuery object

5. Finding Elements

# Child elements
from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
# find() searches all descendant nodes; its argument is a CSS selector
items = doc.find('.list')
print(items)
print(type(items))
lis = doc.find('li')
print(lis)
print(type(lis))

child1 = items.children()  # children() returns the direct child nodes
print(child1)
print(type(child1))

child2 = items.children('.active')  # keep only the children that match the selector
print(child2)
print(type(child2))
# Parent elements
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
items = doc('.list')  # first select the node whose class is "list"

container = items.parent()  # parent() returns the direct parent node
print(container)
print(type(container))

parents = items.parents()  # parents() returns all ancestor nodes
print(parents)
print(type(parents))
parent = items.parents('.wrap')  # keep only the ancestors that match the selector
print(parent)
# Sibling elements
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
# Note: there is no space between .item-0 and .active,
# so this matches a node that carries both classes
li = doc('.list .item-0.active')

print(li.siblings())  # siblings() returns the sibling nodes

print(li.siblings('.active'))  # keep only the siblings that match the selector
# Iteration
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
li = doc('.item-0.active')
print(li)  # a single-node result can be printed directly
print(str(li))  # or converted to a string

lis = doc('li').items()  # for a multi-node result, call items() to iterate over the nodes
print(type(lis))  # items() returns a generator
for item in lis:
    print(item, type(item))  # each item is a PyQuery object

6. Getting Information

# Getting attributes
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)
a = doc('.item-0.active a')

print(a, type(a))
print(a.attr('href'))  # call the attr() method to get an attribute value
print(a.attr.href)  # or read it through the attr property; both forms work
# Getting text
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''

doc = pq(html)

a = doc('.item-0.active a')
print(a, type(a))
print(a.text())  # text() returns the plain text inside the node

li = doc('.item-0.active')
print(li)
print(li.html())  # html() returns the HTML inside the node

# Note the behavior when the selection contains several nodes
lis = doc('li')
print(lis.html())  # html() returns the inner HTML of the first li only
print(lis.text())  # text() returns the text of every li, separated by spaces
print(type(lis.text()))  # the result of text() is a single string

7. DOM Manipulation

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)

li = doc('.item-0.active')
print(li)
li.removeClass('active')  # removeClass() removes a class
print(li)
li.addClass('active')  # addClass() adds a class
print(li)

# attr(name, value) sets an attribute; with only the name, it returns the current value
li.attr('name', 'link')
print(li)

# text(value) replaces the node's text; with no argument, it returns the text
li.text('changed item')
print(li)

# html(value) replaces the node's inner HTML; with no argument, it returns the inner HTML
li.html('<span>changed item</span>')
print(li)
# Removing nodes with remove()
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''
doc = pq(html)

wrap = doc('.wrap')
# text() returns all of the text, including the text inside the p node.
# How can we extract only 'Hello, World'?
print(wrap.text())

wrap.find('p').remove()  # remove() deletes the node
print(wrap.text())  # now only 'Hello, World' is left inside wrap

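There are many more node-manipulation methods, such as append(), empty(), and prepend(), and they work much like their jQuery counterparts. Official documentation: http://pyquery.readthedocs.io/en/latest/api.html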

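8. Pseudo-class Selectors

Another important reason CSS selectors are so powerful is their support for a wide range of pseudo-class selectors, for example selecting the first node, the last node, odd- or even-numbered nodes, or nodes that contain a given text.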

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)

li = doc('li:first-child')  # the first child
print(li)
li1 = doc('li:last-child')  # the last child
print(li1)
li2 = doc('li:nth-child(2)')  # the child at a given position, here the second
print(li2)
li3 = doc('li:gt(2)')  # nodes whose zero-based index is greater than 2
print(li3)
li4 = doc('li:nth-child(2n)')  # children at even positions
print(li4)
li5 = doc('li:contains(second)')  # nodes whose text contains "second"
print(li5)

For more on CSS selectors, see http://www.w3school.com.cn/css/index.asp

To be continued…

Last updated: 2018-08-14 17:47

Original link: http://pythonfood.github.io/2018/07/02/爬虫-pyquery库/
