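Study notes from Cui Qingcai's web-scraping course.
I. The PyQuery Library in Detail
1. What is PyQuery?
PyQuery is a powerful and flexible web-page parsing library. If regular expressions feel too cumbersome to write, if BeautifulSoup's syntax is hard to remember, and if you already know jQuery's syntax, then PyQuery is an excellent choice!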
http://pyquery.readthedocs.io
2. Installation
pip install pyquery
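3. Initialization
As with other parsing libraries, you initialize PyQuery by passing in HTML text to build a PyQuery object. There are several ways to do this: pass a string directly, pass a URL, pass a file name, and so on.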
    from pyquery import PyQuery as pq

    html = '''
    <div>
        <ul>
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    # Initialize from an HTML string
    doc1 = pq(html)
    print(doc1('li'))

    # Initialize from a URL (PyQuery requests the page itself)
    doc2 = pq(url='http://www.baidu.com')
    print(doc2('head'))

    # Initialize from a local file
    doc3 = pq(filename='demo.html')
    print(doc3('li'))
4. CSS Selectors

    from pyquery import PyQuery as pq

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''

    doc = pq(html)
    # Select every <li> under class "list" inside the node with id "container"
    print(doc('#container .list li'))
    # The result is itself a PyQuery object
    print(type(doc('#container .list li')))
5. Finding Elements

    from pyquery import PyQuery as pq

    html = '''
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    '''
    doc = pq(html)

    # find() searches all descendant nodes
    items = doc.find('.list')
    print(items)
    print(type(items))
    lis = doc.find('li')
    print(lis)
    print(type(lis))

    # children() returns direct child nodes only
    child1 = items.children()
    print(child1)
    print(type(child1))

    # children() with a selector filters the direct children
    child2 = items.children('.active')
    print(child2)
    print(type(child2))
    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    items = doc('.list')

    # parent() returns the direct parent node
    container = items.parent()
    print(container)
    print(type(container))

    # parents() returns all ancestor nodes; a selector filters them
    parents = items.parents()
    print(parents)
    print(type(parents))
    parent = items.parents('.wrap')
    print(parent)
    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    li = doc('.list .item-0.active')

    # siblings() returns all sibling nodes
    print(li.siblings())

    # siblings() with a selector filters the siblings
    print(li.siblings('.active'))
    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)

    # A single matched node can be printed or converted to a string directly
    li = doc('.item-0.active')
    print(li)
    print(str(li))

    # For multiple nodes, items() returns a generator of PyQuery objects
    lis = doc('li').items()
    print(type(lis))
    for item in lis:
        print(item, type(item))
6. Getting Information

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)
    a = doc('.item-0.active a')

    # Get an attribute, either by method call or by attribute access
    print(a, type(a))
    print(a.attr('href'))
    print(a.attr.href)
    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''

    doc = pq(html)

    # text() returns the plain text inside the node
    a = doc('.item-0.active a')
    print(a, type(a))
    print(a.text())

    # html() returns the HTML inside the node
    li = doc('.item-0.active')
    print(li)
    print(li.html())

    # With multiple nodes, html() returns the inner HTML of the first node only,
    # while text() returns the text of all nodes as one space-separated string
    lis = doc('li')
    print(lis.html())
    print(lis.text())
    print(type(lis.text()))
7. DOM Manipulation

    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    doc = pq(html)

    # addClass() and removeClass() modify the class attribute
    li = doc('.item-0.active')
    print(li)
    li.removeClass('active')
    print(li)
    li.addClass('active')
    print(li)

    # attr() with two arguments sets an attribute
    li.attr('name', 'link')
    print(li)

    # text() with an argument replaces the node's text
    li.text('changed item')
    print(li)

    # html() with an argument replaces the node's inner HTML
    li.html('<span>changed item</span>')
    print(li)
    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        Hello, World
        <p>This is a paragraph.</p>
    </div>
    '''
    doc = pq(html)

    # Text of the whole node, including the <p>
    wrap = doc('.wrap')
    print(wrap.text())

    # remove() deletes the <p>, leaving only "Hello, World"
    wrap.find('p').remove()
    print(wrap.text())
There are many other node-manipulation methods, such as append(), empty(), and prepend(), and they are used the same way as in jQuery. Official documentation: http://pyquery.readthedocs.io/en/latest/api.html.
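8. Pseudo-class Selectors
Another important reason CSS selectors are so powerful is that they support a wide variety of pseudo-class selectors, for example selecting the first node, the last node, odd- or even-numbered nodes, or nodes that contain a given piece of text.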
    from pyquery import PyQuery as pq

    html = '''
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    '''
    doc = pq(html)

    # First <li>
    li = doc('li:first-child')
    print(li)
    # Last <li>
    li1 = doc('li:last-child')
    print(li1)
    # Second <li>
    li2 = doc('li:nth-child(2)')
    print(li2)
    # Every <li> after the third (index greater than 2, counting from 0)
    li3 = doc('li:gt(2)')
    print(li3)
    # Even-positioned <li> nodes
    li4 = doc('li:nth-child(2n)')
    print(li4)
    # <li> nodes containing the text "second"
    li5 = doc('li:contains(second)')
    print(li5)
For more on CSS selectors, see http://www.w3school.com.cn/css/index.asp
More updates to come…