Python Web Crawler Notes 2 -- Requests Library in Practice

Date: 2018-08-20 17:43:23

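This series of Python web crawler notes records what the author learned from Professor Song Tian's course "Python Web Crawling and Information Extraction" (《Python网络爬虫与信息提取》), together with the author's own crawling practice.

Course link: Python Web Crawling and Information Extraction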

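Reference documentation:

Requests official documentation (English)

Requests official documentation (Chinese)

Beautiful Soup official documentation

re official documentation

Scrapy official documentation (English)

Scrapy official documentation (Chinese)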

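1. Robots Protocol

Purpose: a website uses it to tell web crawlers which pages may be crawled and which may not.

Form: a robots.txt file in the website's root directory.

Basic syntax of the Robots protocol:

User-agent: the crawler the rules apply to; * means all crawlers

Disallow: a directory that must not be crawled; / means the root directory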

# JD Robots protocol: /robots.txt
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
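As a rough illustration of how a crawler might consult such rules before fetching a page, the sketch below feeds an excerpt of the JD rules above into Python's standard-library `urllib.robotparser`. The helper name `build_parser` and the crawler name `MyCrawler` are made up for this example. Note that the standard-library parser matches rules by simple path prefix, so wildcard patterns such as `/pop/*.html` may not be evaluated the way a wildcard-aware crawler would evaluate them.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the JD robots.txt rules listed above.
JD_ROBOTS = """\
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*

User-agent: EtaoSpider
Disallow: /
"""


def build_parser(robots_text):
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(JD_ROBOTS)

# EtaoSpider is disallowed from the whole site.
print(parser.can_fetch('EtaoSpider', '/100000822981.html'))  # False
# A crawler without its own entry falls back to the "User-agent: *" rules.
print(parser.can_fetch('MyCrawler', '/100000822981.html'))   # True
```

In a real crawler one would more likely call `set_url()` and `read()` on the parser to download the live robots.txt instead of embedding a copy.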

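2. Crawling a JD Product Page

Visit the JD website and get the URL of the product to crawl.

Use the general code framework for fetching a web page.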

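import requests


def get_html_text(url):
    """General code framework for fetching a web page."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding guessed from the content
        return r.text
    except requests.RequestException:
        return 'An exception occurred'


def jd_goods():
    """Fetch a product page on JD, using the Huawei Mate 20 as an example."""
    # site-relative path; prepend the site's scheme and host before running
    url = '/100000822981.html'
    print(get_html_text(url))


if __name__ == '__main__':
    jd_goods()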

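3. Crawling an Amazon Product Page

Visit the Amazon website and get the URL of the product to crawl.

Amazon rejects requests that do not come from a browser, so set the User-Agent field in the request headers to make the request look like it was sent by a browser.

Modify the general code framework for fetching a web page accordingly.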

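import requests


def amazon_goods():
    """Fetch a product page on Amazon, using the Kindle as an example."""
    # site-relative path; prepend the site's scheme and host before running
    url = '/gp/product/B07746N2J9'
    try:
        # Send a browser-like User-Agent so the request is not rejected.
        hd = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=hd, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[0:1000])  # print only the first 1000 characters
    except requests.RequestException:
        print('Fetch failed')


if __name__ == '__main__':
    amazon_goods()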

4. Submitting Search Keywords to Baidu/360

Baidu keyword interface: /s?wd=keyword

360 keyword interface: /s?q=keyword

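import requests


def baidu_search():
    """Submit a keyword query to the Baidu search engine."""
    # site-relative path; prepend the site's scheme and host before running
    url = '/s'
    try:
        kv = {'wd': 'python'}  # sent as the query string ?wd=python
        r = requests.get(url, params=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text)
    except requests.RequestException:
        print('Fetch failed')


if __name__ == '__main__':
    baidu_search()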
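To see what a keyword dictionary like the one passed through `params` above turns into, the following sketch builds the query string by hand with the standard library's `urllib.parse.urlencode`. It is only meant to show the shape of the resulting URL and is not a description of how Requests works internally; the helper name `build_search_url` is invented for this example.

```python
from urllib.parse import urlencode


def build_search_url(path, params):
    """Append URL-encoded query parameters to a path."""
    return path + '?' + urlencode(params)


# Baidu-style keyword parameter.
print(build_search_url('/s', {'wd': 'python'}))           # /s?wd=python
# 360-style keyword parameter.
print(build_search_url('/s', {'q': 'python'}))            # /s?q=python
# Spaces are encoded as '+' by urlencode.
print(build_search_url('/s', {'wd': 'python requests'}))  # /s?wd=python+requests
```

Passing the dictionary via `params` is usually preferable to concatenating strings, because special characters in the keyword are escaped for you.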

5. Crawling and Saving Web Images

Get the URL of the image.

Set the path where the image will be saved.

Download the image and save it.

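import os

import requests


def download_image():
    """Download an image, using the Baidu logo as an example."""
    # site-relative path; prepend the site's scheme and host before running
    url = '/img/bd_logo1.png'
    root = 'E:/pics/'
    path = root + url.split('/')[-1]  # name the file after the last URL segment
    try:
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # do not save an error page as an image
            with open(path, 'wb') as f:  # the with block closes the file
                f.write(r.content)
            print('Image saved')
        else:
            print('Image already exists')
    except (requests.RequestException, OSError):
        print('Fetch failed')


if __name__ == '__main__':
    download_image()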
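The code above names the saved file with `url.split('/')[-1]`, which works for the logo URL but would keep a query string (for example `?v=2`) in the file name. A slightly more defensive sketch, using only the standard library, is shown below; the function name `filename_from_url` and the fallback name `download.bin` are assumptions made for this example.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default='download.bin'):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path          # drops '?query' and '#fragment'
    name = posixpath.basename(path)    # text after the final '/'
    return name or default             # fall back when the path ends in '/'


print(filename_from_url('/img/bd_logo1.png'))      # bd_logo1.png
print(filename_from_url('/img/bd_logo1.png?v=2'))  # bd_logo1.png
print(filename_from_url('/img/'))                  # download.bin
```

The result can then be joined to the save directory with `os.path.join(root, name)` instead of string concatenation.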

6. IP Address Attribution Lookup

138 IP address attribution lookup interface: /ip.asp?ip=ipaddress

To find the IP address of the Baidu website, run in cmd: nslookup

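import requests


def ip_attribution():
    """Look up the attribution of Baidu's IP address using the 138 interface."""
    # site-relative path; prepend the site's scheme and host before running
    url = '/ip.asp?ip='
    ip = '14.215.177.39'
    try:
        r = requests.get(url + ip, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text)
    except requests.RequestException:
        print('Lookup failed')


if __name__ == '__main__':
    ip_attribution()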
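Before sending an address to the lookup interface, it may be worth checking that the string is a well-formed IP address, so that typos fail locally instead of producing a confusing response page. The sketch below does this with the standard-library `ipaddress` module; the helper names `is_valid_ip` and `build_lookup_url` are hypothetical and not part of the original notes.

```python
import ipaddress


def is_valid_ip(text):
    """Return True if text is a syntactically valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def build_lookup_url(base, ip):
    """Append a validated IP address to the lookup interface path."""
    if not is_valid_ip(ip):
        raise ValueError('not a valid IP address: ' + repr(ip))
    return base + ip


print(is_valid_ip('14.215.177.39'))                      # True
print(is_valid_ip('14.215.177.399'))                     # False
print(build_lookup_url('/ip.asp?ip=', '14.215.177.39'))  # /ip.asp?ip=14.215.177.39
```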
