Life is short, I use Python.

I. The requests module

Install the requests library before use: pip install requests.

The seven main methods of the requests library:

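Method	Description
requests.request()	Builds and sends a request; the base method behind all of the methods below
requests.get()	Fetches a resource, typically an HTML page (HTTP GET)
requests.head()	Fetches only the response headers of a resource (HTTP HEAD)
requests.post()	Submits data to a URL (HTTP POST)
requests.put()	Submits a full replacement of a resource (HTTP PUT)
requests.patch()	Submits a partial modification of a resource (HTTP PATCH)
requests.delete()	Requests deletion of a resource (HTTP DELETE)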

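requests.request(method, url, **kwargs) builds and sends a request to the server:

method: "GET", "POST", "HEAD", "PUT", "PATCH", "DELETE"
url: the URL to request
**kwargs: parameters that control the request
- params: a dictionary or byte sequence appended to the URL as the query string, so key/value pairs are added in the form ?key1=value1&key2=value2
  Example: kw = {'key1': 'value1', 'key2': 'value2'}
  r = requests.request('GET', 'http://www.python123.io/ws', params=kw)
- data: a dictionary, byte sequence, or file object submitted to the server as the body of the request. Unlike params, data is not placed in the URL; it travels in the request body. A string is also accepted.
- json: JSON data sent as the request body. JSON is very common in HTTP-based web development and is one of the most widely used data formats over HTTP.
  Example: kv = {'key1': 'value1'}
  r = requests.request('POST', 'http://python123.io/ws', json=kv)
- headers: a dictionary of HTTP headers sent with the request. It lets you customize the request headers, for example to imitate a particular browser when accessing a URL.
  Example: hd = {'user-agent': 'Chrome/10'}
  r = requests.request('POST', 'http://python123.io/ws', headers=hd)
- cookies: a dictionary or CookieJar of cookies to send with the request
- auth: a tuple, used for HTTP authentication
- files: a dictionary, used to upload files to the server.
  Example: fs = {'files': open('data.txt', 'rb')}
  r = requests.request('POST', 'http://python123.io/ws', files=fs)
- timeout: the timeout in seconds. For example, when sending a GET request, if no response arrives within the timeout, a timeout exception is raised.
- proxies: a dictionary that sets the proxy servers used for the request.
- allow_redirects: a switch that controls whether redirects are followed; defaults to True.
- stream: a switch that controls when the response body is downloaded. With the default, False, the body is downloaded immediately; with True, the download is deferred until the content is accessed.
- verify: a switch that controls whether the server's SSL certificate is verified; defaults to True.
- cert: the path to a local SSL client certificate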
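
How params, json, and headers end up in the outgoing request can be inspected without touching the network. The sketch below assumes the requests library is installed and uses requests.Request(...).prepare(), which builds a request but does not send it. The kw, kv, and hd values are illustrative only.

```python
import requests

# Illustrative values only.
kw = {'key1': 'value1', 'key2': 'value2'}
kv = {'key1': 'value1'}
hd = {'user-agent': 'Chrome/10'}

# prepare() builds the request without sending it.
get_req = requests.Request('GET', 'http://www.python123.io/ws', params=kw).prepare()
post_req = requests.Request('POST', 'http://python123.io/ws', json=kv, headers=hd).prepare()

print(get_req.url)                       # params are encoded into the query string
print(post_req.headers['Content-Type'])  # json= sets a JSON content type
print(post_req.headers['User-Agent'])    # header lookup is case-insensitive
print(post_req.body)                     # the JSON-encoded request body
```

Preparing a request this way is handy for debugging, because it shows exactly what would be sent before any traffic leaves the machine.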

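The call returns a Response object containing the server resource, with the following attributes and methods:

Attribute	Description
r.status_code	The HTTP status of the response; 200 means the request succeeded.
r.text	The response body as a string, i.e. the content of the returned page
r.encoding	The encoding of the response body as guessed from the HTTP headers
r.apparent_encoding	The encoding of the response body as detected from the content itself (a fallback encoding)
r.content	The response body in binary form
r.raw	The raw response body, read with r.raw.read() (send the request with stream=True)
r.headers	The response headers stored in a dictionary-like object. It is special in that keys are case-insensitive; r.headers.get(key) returns None if the key does not exist
r.json()	The JSON decoder built into Requests
r.raise_for_status()	Raises an exception for a failed request (a 4xx or 5xx response)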
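
Two rows of the table can be checked offline. This is a minimal sketch, assuming the requests library is installed: CaseInsensitiveDict is the class requests uses for headers, and the Response object here is built by hand (an assumption made purely so the example runs without a server).

```python
import requests
from requests.structures import CaseInsensitiveDict

# Response headers are stored in a case-insensitive dictionary.
headers = CaseInsensitiveDict({'Content-Type': 'text/html; charset=utf-8'})
content_type = headers['content-type']   # lookup ignores case
missing = headers.get('X-Not-There')     # .get() returns None for a missing key

# A hand-built Response, used only so the sketch runs offline.
resp = requests.Response()
resp.status_code = 404

try:
    resp.raise_for_status()              # 4xx and 5xx raise HTTPError
    outcome = 'ok'
except requests.exceptions.HTTPError:
    outcome = 'HTTPError'

print(content_type, missing, outcome)
```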

# The requests library

import requests

requests.get('http://httpbin.org/get')  # send a GET request
requests.post('http://httpbin.org/post', data={"key": "value"})  # send a POST request (add more key/value pairs as needed)
requests.delete('http://httpbin.org/delete')  # send a DELETE request
requests.put('http://httpbin.org/put', data={"key": "value"})  # send a PUT request (add more key/value pairs as needed)
requests.head('http://httpbin.org/get')  # send a HEAD request
requests.options('http://httpbin.org/get')  # send an OPTIONS request
# GET request parameters
data = {
    'name':'asr',
    'age':'12'
}
response = requests.get('http://httpbin.org/get', params=data)

# Add headers
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; rv:60.0) Gecko/20100101 Firefox/60.0'}
response = requests.get('http://httpbin.org/get', headers=headers)

# Add cookies
cookies = {'sessionId':'Jsession3123131'}
response = requests.get('http://httpbin.org/cookies', cookies=cookies)

# Timeout settings
response = requests.get('https://www.taobao.com', timeout=1)  # a single timeout, in seconds
response = requests.get('https://www.taobao.com', timeout=(5.11,30))  # pass a tuple to set the connect and read timeouts separately
response = requests.get('https://www.taobao.com', timeout=None)  # to wait indefinitely, set timeout to None or omit the argument
response = requests.get('https://www.taobao.com')

# Redirects
response = requests.get('http://github.com', allow_redirects=False)  # do not follow redirects
response = requests.head('http://github.com', allow_redirects=True)  # follow redirects (the default for get(); head() does not follow them unless asked)
# Proxy settings
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('https://www.taobao.com', proxies=proxies)

# Authentication
response = requests.get('http://120.27.34.24:9001',auth=('username', 'password'))
# The call above is shorthand; it actually uses requests.auth.HTTPBasicAuth
from requests.auth import HTTPBasicAuth
response = requests.get('http://120.27.34.24:9001',auth=HTTPBasicAuth('username', 'password'))

# Certificate verification
# 1. Suppress the warning through requests.packages.urllib3
from requests.packages import urllib3
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn',verify=False)
# 2. Suppress the warning by capturing warnings into the log
import logging
logging.captureWarnings(True)
response = requests.get('https://www.12306.cn',verify=False)
# 3. Specify a local certificate to use as the client certificate; this can be a single file (containing the key and the certificate) or a tuple of two file paths
response = requests.get('https://www.12306.cn',cert=('/path/server.crt', '/path/key'))

# POST a form
data = {
    'name':'jk',
    'age':18
}
response = requests.post('http://httpbin.org/post',data=data)

# POST JSON
import json
data = {
    'name':'jk',
    'age':18
}
response = requests.post("http://httpbin.org/post",data=json.dumps(data))  # option 1
response = requests.post("http://httpbin.org/post",json=data)  # option 2

# File upload
files = {
    'file':open('favicon.ico','rb')
}
response = requests.post('http://httpbin.org/post',files=files)

# Keeping a session
s = requests.Session()  # create a Session object and send two GET requests with it (as if they came from the same browser)
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')

# Exception handling
from requests.exceptions import ReadTimeout,HTTPError,RequestException
try:
    r = requests.get('https://www.taobao.com',timeout=0.1)
except ReadTimeout:
    print('ReadTimeout')
except HTTPError:
    print('HTTPError')
except RequestException:
    print('RequestException')

# Response content
import requests
import json
response = requests.get('http://github.com')
text = response.text  # text data (str)
content = response.content  # binary response data
# json_data = response.json()  # JSON response data; call this only when the body is JSON (this page is HTML, so it would raise an error)
status_code = response.status_code  # response status code
headers = response.headers  # response headers
cookies = response.cookies  # response cookies
url = response.url  # page URL
history = response.history  # history tracks redirects
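
The exception handling in the listing above can be folded into a small helper. This is a sketch under stated assumptions: fetch_text is a hypothetical name, not part of requests, and the 5-second default timeout is arbitrary. RequestException is the base class of the library's exceptions, so catching it covers timeouts, HTTP errors raised by raise_for_status(), and malformed URLs.

```python
import requests
from requests.exceptions import RequestException

def fetch_text(url, timeout=5):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # turn 4xx/5xx responses into HTTPError
        return response.text
    except RequestException as exc:
        print('request failed:', exc)
        return None

# A URL without a scheme is rejected before any network traffic (MissingSchema).
result = fetch_text('not-a-valid-url')
```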

1. Sending requests

import requests

# 1. Send a GET request
res = requests.get("http://httpbin.org/get")

# 2. Send a POST request (add more key/value pairs as needed)
res = requests.post("http://httpbin.org/post", data={"key": "value"})

# 3. Send a HEAD request
res = requests.head("http://httpbin.org/get")

# 4. Send a PUT request (add more key/value pairs as needed)
res = requests.put("http://httpbin.org/put", data={"key": "value"})

# 5. Send a DELETE request
res = requests.delete("http://httpbin.org/delete")

# 6. Send a PATCH request
res = requests.patch("http://httpbin.org/patch")

2. Building request parameters

(1) GET request parameters

import requests

def function():
	# 1. Request parameters
	payload = {'key1':'value1', 'key2':'value2', 'key3':['value3','value4']}  # Note: a value may be a list

	# 2. The resulting URL is http://httpbin.org/get?key1=value1&key2=value2&key3=value3&key3=value4
	res = requests.get("http://httpbin.org/get", params=payload)  # Note: the argument is params

	print(res.text)

(2) POST a form

import requests

def function():
	# 1. Form data
	payload1 = {'key1':'value1','key2':'value2'}  # Note: use a dictionary when each key has one value
	payload2 = [('key1', 'value1'), ('key1', 'value2')]  # Note: use a list of tuples when a key has several values

	# 2. Send the requests
	res1 = requests.post("http://httpbin.org/post",data=payload1)  # Note: the argument is data
	res2 = requests.post("http://httpbin.org/post",data=payload2)

	print(res1.text)
	print(res2.text)

(3) POST JSON

import json
import requests

def function():
	# 1. The JSON object to submit
	payload = {'key1': 'value1', 'key2': 'value2'}

	# 2. Send the request
	res = requests.post("http://httpbin.org/post",data=json.dumps(payload))  # option 1
	res = requests.post("http://httpbin.org/post",json=payload)  # option 2
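
The form and JSON variants differ in both the body and the Content-Type header. The sketch below, assuming requests is installed, prepares the three requests without sending them so the difference can be seen offline; the variable names are illustrative.

```python
import json
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
url = 'http://httpbin.org/post'

# prepare() builds each request without sending it.
form_req = requests.Request('POST', url, data=payload).prepare()
raw_req = requests.Request('POST', url, data=json.dumps(payload)).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

print(form_req.body, form_req.headers.get('Content-Type'))  # form-encoded body
print(raw_req.body, raw_req.headers.get('Content-Type'))    # JSON text, no content type set
print(json_req.body, json_req.headers.get('Content-Type'))  # JSON body with JSON content type
```

Passing a pre-serialized string through data= leaves the Content-Type unset, so json= is usually the more convenient choice when the server expects application/json.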

(4) Building request headers

import requests

def function():

	# 1. Build the request headers
	headers = {'user-agent':'my-app/0.0.1'}
	# 2. Define the URL
	url = "https://api.github.com/some/endpoint"

	# 3. Send the request
	res = requests.get(url, headers=headers)

(5) Sending cookies

import requests

def function():

	# 1. Build the cookies
	cookies = dict(sessionId='Jsession3123131')
	# 2. Define the URL
	url = "http://httpbin.org/cookies"

	# 3. Send the request
	res = requests.get(url, cookies=cookies)

(6) Setting a request timeout

import requests

def function():

	# Set the timeout: the request fails if there is no response within 0.1 seconds
	requests.get('http://github.com',timeout=0.1)

(7) Basic authentication

(HTTP Basic Auth)

import requests
from requests.auth import HTTPBasicAuth

# Authentication
r = requests.get('https://httpbin.org/hidden-basic-auth/user/passwd', auth=HTTPBasicAuth('user', 'passwd'))
# r = requests.get('https://httpbin.org/hidden-basic-auth/user/passwd', auth=('user', 'passwd'))    # shorthand

print(r.json())

3. Response content

import requests

def function():

	# Send the request
	url = "https://api.github.com/some/endpoint"
	payload = {'key1':'value1','key2':'key2'}
	headers = {'user-agent':'my-app/0.0.1'}
	cookies = dict(sessionId='Jsession3123131')
	res = requests.post(url,data=payload,headers=headers,cookies=cookies)

	# Text data; this is what you want in most cases
	text = res.text

	# Raw binary response data
	content = res.content

	# JSON response data, decoded by the built-in JSON decoder
	json_data = res.json()

	# Response status code
	status_code= res.status_code

	# Response headers
	headers = res.headers

	# Response cookies
	cookies = res.cookies

	# Use the history attribute of the response object to trace redirects
	res.history

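II. The urllib module

urllib is Python's built-in HTTP request library:

urllib.request: opens and reads URLs
urllib.error: contains the exceptions raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files (for web spiders)

1. urllib.request

(1) Basic usage
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the URL to open
data: the data to submit by POST
timeout: the timeout for accessing the site

The object returned by urlopen provides these methods:

geturl(): returns the URL of the resource retrieved, commonly used to check whether a redirect was followed.
info(): returns an HTTPMessage object holding the headers sent by the remote server
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the URL was not found
read(), readline(), readlines(): return the page content
fileno(), close(): operate on the HTTPResponse object; they are used exactly like the corresponding file object methods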

from urllib import request 
 
page = request.urlopen("http://www.baidu.com/")  

print(page.info())  
print(page.geturl())  
print(page.getcode())  
print(page.read())
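
The same methods can be tried without a network connection. This sketch assumes a data: URL as the target, chosen only because urlopen() can serve it locally. The object returned for a data: URL offers geturl(), info(), and read() just like an HTTP response.

```python
from urllib import request

# A data: URL (an assumption made so the sketch runs offline) is handled by
# urlopen() without any network access.
with request.urlopen('data:text/plain,hello%20world') as page:
    final_url = page.geturl()                    # URL that was actually opened
    content_type = page.info()['Content-Type']   # header lookup on the message object
    body = page.read()                           # raw bytes of the body

text = body.decode('utf-8')
print(final_url, content_type, text)
```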

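(2) Setting a proxy
urllib.request.build_opener([handler, …])
urllib.request.install_opener(opener)

Crawlers often need a proxy IP. The steps are:
1) Set up the proxy with urllib.request.ProxyHandler.
2) Wrap the handler in an opener with urllib.request.build_opener().
3) Install the opener globally with urllib.request.install_opener().
4) Access the page with urllib.request.urlopen.

from urllib import request

# ProxyHandler maps a URL scheme to a proxy URL. urllib has no built-in SOCKS support, so an HTTP proxy listening on localhost:1080 is assumed here.
proxy_support = request.ProxyHandler({'http': 'http://localhost:1080'})
opener = request.build_opener(proxy_support)
request.install_opener(opener)

a = request.urlopen("http://www.baidu.com/").read().decode("utf8")

print(a)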

(3) Using Request
urllib.request.Request(url, data=None, headers={}, method=None)
Wrap the request in a Request object, then fetch the page with urlopen().

from urllib import request  

url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'  
headers = {  
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '  
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',  
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',  
    'Connection': 'keep-alive'  
}  

req = request.Request(url, headers=headers)  
page = request.urlopen(req).read()  
page = page.decode('utf-8')

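2. urllib.error

Common exceptions raised by urllib.request:

error.URLError: a subclass of OSError. Possible causes of a URLError:
- No network connection, i.e. the local machine cannot reach the internet
- The specific server cannot be reached
- The server does not exist
error.HTTPError: a subclass of URLError. urlopen raises an HTTPError for response status codes it cannot handle. Common HTTP status codes, for reference:
- 101: Switching Protocols. After sending the final blank line of this response, the server switches to the protocols defined in the Upgrade header. This should only be done when switching to the new protocol is advantageous.
- 102: Processing. A status code from the WebDAV extension (RFC 2518), indicating that processing will continue.
- 200: Request succeeded. Handling: read the response content and process it.
- 201: Request completed and a new resource was created. The URI of the new resource is available in the response entity. Handling: not encountered in crawlers.
- 202: Request accepted, but processing is not yet complete. Handling: block and wait.
- 204: The server fulfilled the request but returned no new information. If the client is a user agent, it does not need to update its document view. Handling: discard.
- 300: Not used directly by HTTP/1.0 applications; it only serves as the default interpretation of 3XX responses. Several resources are available for the request. Handling: process further if the program can handle it, otherwise discard.
- 301: The requested resource has been assigned a permanent URL, through which it can be accessed in the future. Handling: redirect to the assigned URL.
- 302: The requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.
- 304: The requested resource has not been modified. Handling: discard.
- 400: Bad request. Handling: discard.
- 401: Unauthorized. Handling: discard.
- 403: Forbidden. Handling: discard.
- 404: Not found. Handling: discard.
- 500: Internal server error. The server encountered an unexpected condition that prevented it from fulfilling the request. This usually happens when there is a bug in the server-side code.
- 501: Not implemented. The server does not support a feature required by the request, for example when it does not recognize the request method and cannot support it for any resource.
- 502: Bad gateway. A server acting as a gateway or proxy received an invalid response from the upstream server while trying to fulfil the request.
- 503: Service unavailable. The server cannot currently handle the request because of temporary maintenance or overload. The condition is temporary and will clear after some time.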

from urllib import request
from urllib import error

if __name__ == "__main__":
    # A URL that does not exist
    url = "http://www.test1.com/test2.html"
    req = request.Request(url)

    try:
        response = request.urlopen(req)
        # html = response.read()
    # To catch both HTTPError and URLError, put HTTPError before URLError
    except error.HTTPError as e:
        print(e.code)
    except error.URLError as e:
        print(e.reason)
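
The ordering rule follows from the class hierarchy: HTTPError is a subclass of URLError, so an except clause for URLError placed first would also catch every HTTPError. The sketch below shows this offline; try_open is a hypothetical helper, and the file: URL pointing at a missing path is an assumption chosen so that a URLError is raised without network access.

```python
from urllib import error, request

def try_open(url):
    """Classify the outcome of opening a URL (hypothetical helper)."""
    try:
        with request.urlopen(url, timeout=5) as resp:
            return 'ok', resp.getcode()
    except error.HTTPError as e:      # the more specific class goes first
        return 'http_error', e.code
    except error.URLError as e:       # the base class catches everything else
        return 'url_error', str(e.reason)

# A file: URL that points nowhere fails locally with URLError.
result = try_open('file:///this/path/does/not/exist.txt')
print(result)
print(issubclass(error.HTTPError, error.URLError))
```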

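3. urllib.parse

(1) urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Parses a URL into six components and returns a named tuple. The general URL format is: scheme://netloc/path;parameters?query#fragment

The elements of the returned tuple are:

Attribute	Index	Value	Value if not present
scheme	0	URL scheme specifier	the scheme parameter
netloc	1	Network location part	empty string
path	2	Hierarchical path	empty string
params	3	Parameters for the last path element	empty string
query	4	Query component	empty string
fragment	5	Fragment identifier	empty string
username		User name	None
password		Password	None
hostname		Host name	None
port		Port number	None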

from urllib import parse

o = parse.urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')

print(o)

# Result: ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
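
The parsed result can be read either by index or by attribute name, and the extra attributes (hostname, port, and so on) are derived from netloc. A small sketch, using an example URL extended with a query string and fragment for illustration:

```python
from urllib import parse

o = parse.urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?lang=en#intro')

print(o.scheme, o[0])        # the attribute and index 0 refer to the same element
print(o.netloc)              # host and port together
print(o.hostname, o.port)    # split out of netloc; port is an int
print(o.path)
print(o.query, o.fragment)
```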

(2) urllib.parse.urljoin(base, url, allow_fragments=True)
Combines a base URL with another URL to form a complete URL.

from urllib import parse

a = parse.urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')

print(a)

# Result: http://www.cwi.nl/%7Eguido/FAQ.html
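
How urljoin() resolves the second argument depends on its form. The sketch below runs a few representative cases against the same base URL; example.com is a placeholder domain.

```python
from urllib import parse

base = 'http://www.cwi.nl/%7Eguido/Python.html'

sibling = parse.urljoin(base, 'FAQ.html')       # relative: replaces the last path segment
rooted = parse.urljoin(base, '/FAQ.html')       # leading slash: resolved from the site root
parent = parse.urljoin(base, '../FAQ.html')     # '..' climbs one directory
absolute = parse.urljoin(base, 'http://example.com/index.html')  # absolute URL wins outright

for joined in (sibling, rooted, parent, absolute):
    print(joined)
```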

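(3) urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None)
urlencode() converts a dictionary (or a sequence of two-element tuples) into a percent-encoded query string, which can be appended to a URL or submitted as request data.

from urllib import parse

data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')

# After urlencode() and encode(), data is b'first=true&pn=1&kd=Python'
# Appended to the URL as a query string, it gives http://www.lagou.com/jobs/positionAjax.json?first=true&pn=1&kd=Python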
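
A few more properties of urlencode() are worth seeing side by side: pairs are joined with &, spaces become +, list values need doseq=True, and parse_qs() reverses the encoding. A minimal sketch with illustrative values:

```python
from urllib import parse

query = parse.urlencode({'first': 'true', 'pn': 1, 'kd': 'Python'})
spaced = parse.urlencode({'kd': 'Python 3'})                          # spaces are encoded as '+'
multi = parse.urlencode({'key3': ['value3', 'value4']}, doseq=True)   # one pair per list item
decoded = parse.parse_qs(query)                                       # back to a dict of lists
body = query.encode('utf-8')                                          # bytes, as POST data requires

print(query)
print(spaced)
print(multi)
print(decoded)
print(body)
```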

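4. urllib.robotparser

robotparser implements a parser for robots.txt files. It reads the format and content of a robots.txt file and provides methods to check whether a given User-Agent may access a given resource on the site. When writing a web spider, this module helps keep the spider from fetching useless or duplicate information and from getting trapped in the endless loops of dynamic asp/php pages.

Put simply, robots.txt is a plain-text file that every site should have; it tells spiders what they may and may not crawl. Well-behaved spiders (also called crawlers) follow these rules, so the file can be used to control their access.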
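
A minimal sketch of the check, assuming a hypothetical robots.txt and a made-up crawler name. The rules are fed to parse() from memory so the example runs offline; against a real site you would call set_url() and read() instead.

```python
from urllib import robotparser

# Hypothetical robots.txt content, parsed from memory so no network is needed.
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# 'MyCrawler' is a made-up User-Agent name; it falls under the '*' rules.
allowed = rp.can_fetch('MyCrawler', 'http://example.com/index.html')
blocked = rp.can_fetch('MyCrawler', 'http://example.com/private/data.html')

print(allowed, blocked)
```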

5. GET requests

import urllib.request  
import urllib.parse  
  
data = {}  
data['word'] = 'python3'  
url_values = urllib.parse.urlencode(data)    
  
url = "http://www.baidu.com/s?"  
full_url = url + url_values  
  
data = urllib.request.urlopen(full_url).read()  
z_data = data.decode('UTF-8')  
print(z_data)

6. POST data

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The data argument of urlopen() defaults to None; when data is not None, urlopen() submits the request with POST.

from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'  
headers = {  
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '  
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',  
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',  
    'Connection': 'keep-alive'  
}  
data = {  
    'first': 'true',  
    'pn': 1,  
    'kd': 'Python'  
}  
data = parse.urlencode(data).encode('utf-8')  # POST data must be bytes or an iterable of bytes, not str, so it has to be encoded with encode()

req = request.Request(url, headers=headers, data=data)  

page = request.urlopen(req).read()  
page = page.decode('utf-8')  
print(page)
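
The rule that a non-None data argument switches the request to POST can be verified without sending anything, because a Request object reports its method through get_method(). In this sketch the User-Agent value is an illustrative placeholder.

```python
from urllib import parse, request

url = 'http://www.lagou.com/jobs/positionAjax.json?'
form = parse.urlencode({'first': 'true', 'pn': 1, 'kd': 'Python'}).encode('utf-8')

get_req = request.Request(url)                      # no data: GET
post_req = request.Request(url, data=form,
                           headers={'User-Agent': 'my-app/0.0.1'})  # data present: POST

print(get_req.get_method())
print(post_req.get_method())
print(post_req.data)                        # the encoded bytes that would be sent
print(post_req.get_header('User-agent'))    # Request capitalizes only the first letter of header names
```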

To be continued…

Last updated: 2018-12-04 21:48

Original link: http://pythonfood.github.io/2017/12/30/python网络请求/


Python Food

A snake that grows up slowly

Python network requests