Study notes from teacher Cui Qingcai's web-scraping course.
I. The urllib Library in Detail
1. The urllib Library
- urllib.request: the request module
- urllib.error: the exception-handling module
- urllib.parse: the URL-parsing module (splitting, joining, etc.)
- urllib.robotparser: the robots.txt-parsing module

The signature of urlopen:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
```python
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
```python
import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
```
```python
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())
```
```python
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
```
2. Responses
```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))
```
```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
```
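A single header can also be read by name with `getheader`; a minimal sketch (an addition, not in the original notes):

```python
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
# Fetch one header by name instead of the full list
print(response.getheader('Server'))
```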
3. Request
```python
import urllib.request

request = urllib.request.Request('https://www.python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```
```python
import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'Host': 'httpbin.org'
}
params = {'name': 'Germey'}
data = bytes(urllib.parse.urlencode(params), encoding='utf8')

request = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```
```python
import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
params = {'name': 'Germey'}
data = bytes(urllib.parse.urlencode(params), encoding='utf8')

request = urllib.request.Request(url=url, data=data, method='POST')
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64)')
request.add_header('Host', 'httpbin.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```
4. Handler
https://docs.python.org/3/library/urllib.request.html#module-urllib.request
```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
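Proxies are only one kind of handler. As a further illustration (an addition to the original notes), passing `debuglevel=1` to `HTTPHandler` makes the underlying `http.client` print the raw HTTP exchange, which is handy when debugging openers:

```python
import urllib.request

# debuglevel=1 prints the outgoing request and incoming response on stdout
http_handler = urllib.request.HTTPHandler(debuglevel=1)
opener = urllib.request.build_opener(http_handler)
response = opener.open('http://httpbin.org/get')
```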
5. Cookies
A cookie is a small text file stored on the client that records the user's identity and keeps the login state alive.
```python
import urllib.request
import http.cookiejar

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)
```
```python
import urllib.request
import http.cookiejar

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
```
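For completeness (this mirrors the LWP load example further below and is not in the original notes), a Mozilla-format cookie file can be loaded back into a fresh jar with `load`:

```python
import urllib.request
import http.cookiejar

# Reload the cookies saved above from cookie.txt
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
```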
```python
import urllib.request
import http.cookiejar

filename = 'cookie_LWP.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
```
```python
import urllib.request
import http.cookiejar

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie_LWP.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
6. Exception Handling
https://docs.python.org/3/library/urllib.error.html#module-urllib.error
```python
from urllib import request
from urllib import error

try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')
except error.URLError as e:
    print(e.reason)
```
```python
from urllib import request
from urllib import error

try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
```
```python
import socket
from urllib import request
from urllib import error

try:
    response = request.urlopen('http://www.baidu.com', timeout=0.01)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Time Out')
```
7. URL Parsing
https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse
(1) urlparse
Splits a URL into its six components (scheme, netloc, path, params, query, fragment) and returns them as a named tuple.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
```python
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/s?wd=urllib&ie=UTF-8')
print(type(result))
print(result)

# scheme= supplies a default, used only when the URL itself carries no scheme
result1 = urlparse('www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')
print(result1)

# The URL's own scheme (http) wins over the scheme= default
result2 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')
print(result2)

# allow_fragments=True (the default) keeps #comment in the fragment field
result3 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8#comment', allow_fragments=True)
print(result3)

# allow_fragments=False folds the fragment into the preceding field
result4 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8#comment', allow_fragments=False)
print(result4)
```
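The `ParseResult` returned above is a named tuple, so its parts can be read by attribute or by index; a small sketch (added for illustration):

```python
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/s?wd=urllib&ie=UTF-8')
# Equivalent access by field name and by position
print(result.scheme, result[0])
print(result.netloc, result.query)
```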
(2) urlunparse
The inverse of urlparse: assembles a URL from a sequence of its six components.
```python
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
result = urlunparse(data)
print(result)
```
(3) urljoin
Joins a base URL with another URL; any component present in the second argument overrides the corresponding component of the base.
```python
from urllib.parse import urljoin

result = urljoin('http://www.baidu.com', 'FQA.html')
print(result)

result1 = urljoin('http://www.baidu.com', 'http://www.taobao.com/FQA.html')
print(result1)

result2 = urljoin('http://www.baidu.com/about', 'https://www.taobao.com/FQA.html')
print(result2)
```
(4) urlencode
Serializes a dict into GET request parameters.
```python
from urllib.parse import urlencode

params = {
    'name': 'Arise',
    'age': '21'
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
```
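`urlencode` works on whole dictionaries; for a single value, `quote` and `unquote` (also in `urllib.parse`, though not covered in the original notes) percent-encode and decode one string, e.g. for non-ASCII keywords:

```python
from urllib.parse import quote, unquote

# Percent-encode a non-ASCII search keyword for use in a URL
url = 'http://www.baidu.com/s?wd=' + quote('你好')
print(url)

# And decode it back
print(unquote(url))
```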
8. robotparser
Parses robots.txt files (covered only for general awareness).
https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser
```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.musi-cal.com/robots.txt')
rp.read()

# request_rate/crawl_delay return None if robots.txt has no such directive
rrate = rp.request_rate('*')
print(rrate.requests)
print(rrate.seconds)
print(rp.crawl_delay('*'))
print(rp.can_fetch('*', 'http://www.musi-cal.com/cgi-bin/search?city=San+Francisco'))
print(rp.can_fetch('*', 'http://www.musi-cal.com/'))
```
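In an actual crawler, `can_fetch` is typically consulted before each request; a minimal sketch against the same example site (an addition, not from the course):

```python
import urllib.request
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.musi-cal.com/robots.txt')
rp.read()

url = 'http://www.musi-cal.com/'
# Only request pages that robots.txt allows for our user agent
if rp.can_fetch('*', url):
    response = urllib.request.urlopen(url)
    print(response.status)
```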
To be continued…