
[Web Scraping][Python][Beginner][Page Source][Baidu Images][Douban Top 250]

Date: 2020-08-13 11:30:39


The Robots protocol: check a site's crawling rules and comply with relevant laws and regulations

The Robots protocol (also known as the crawler protocol or robot protocol), formally the Robots Exclusion Protocol, is how a website tells crawlers which pages may be fetched and which may not. robots.txt is the first file a search engine checks when visiting a site: it tells spider programs which files on the server may be viewed.
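The check can also be done programmatically. Below is a minimal sketch using Python's standard urllib.robotparser module (the example.com URLs are only placeholders, not from the original article):

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt (placeholder host).
rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')
rp.read()  # download and parse robots.txt

# can_fetch(user_agent, url) returns True if that agent may crawl the URL.
print(rp.can_fetch('*', 'https://www.example.com/some/page'))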

Fetching a page's source code

If the request fails after you enter the address, the site does not allow crawlers. If entering the address only opens the page in the browser, move the cursor back to the end of the address, type a space, and then press Enter.
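Before the full script below, here is a minimal sketch of fetching a page's source with requests (the URL is only a placeholder):

import requests

response = requests.get('https://www.example.com', timeout=10)
response.encoding = response.apparent_encoding  # guess the encoding for non-UTF-8 pages
print(response.status_code)   # 200 means the request succeeded
print(response.text[:200])    # first 200 characters of the page source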

import requests  # Python HTTP client library, frequently used for writing crawlers and testing server responses
import re        # regular-expression module, used to extract the content we need
import random    # random module, used here to generate numeric file names

def spiderPic(html, keyword):
    print('Searching for files matching: ' + keyword + ', downloading from Baidu, please wait .....')
    for addr in re.findall('"objURL":"(.*?)"', html, re.S):
        print('Now crawling URL: ' + str(addr)[0:30] + ' ....')
        try:
            pics = requests.get(addr, timeout=10)  # request the image URL (10 s timeout)
        except requests.exceptions.ConnectionError:
            print('The current URL request failed!')
            continue
        fq = open('S:\\python\\search\\img\\' + str(random.randrange(0, 1000, 4)) + '.jpg', 'wb')
        fq.write(pics.content)
        fq.close()

# main entry point
if __name__ == '__main__':
    print('Great!')
    word = input('Enter the keyword for the files you want to crawl: ')
    # NOTE: the host was stripped when this article was republished; the path
    # matches Baidu image search (presumably https://image.baidu.com).
    result = requests.get('/search/flip?tn=baiduimage&ie=utf-8&word=' + word)
    # call the function
    spiderPic(result.text, word)
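Note that random.randrange(0, 1000, 4) can only produce 250 distinct file names, so downloads may overwrite each other. A minimal sketch of one collision-free alternative (a simple counter; not part of the original script):

import itertools

counter = itertools.count()  # yields 0, 1, 2, ... without repeating

def next_filename(directory='S:\\python\\search\\img\\'):
    # Each call returns a fresh .jpg path that cannot collide with an earlier one.
    return directory + str(next(counter)) + '.jpg'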

Hands-on example

Douban Top 250 movies

Original article: /qq_36759224/article/details/101572275

Runs successfully.

import requests
from lxml import etree
import csv
import re
import time
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}

def index_pages(number):
    # NOTE: the host was stripped when this article was republished; the path
    # matches the Douban Top 250 list (presumably https://movie.douban.com).
    url = '/top250?start=%s&filter=' % number
    index_response = requests.get(url=url, headers=headers)
    tree = etree.HTML(index_response.text)
    m_urls = tree.xpath("//li/div/div/a/@href")
    return m_urls

def parse_pages(url):
    movie_pages = requests.get(url=url, headers=headers)
    parse_movie = etree.HTML(movie_pages.text)
    # ranking
    ranking = parse_movie.xpath("//span[@class='top250-no']/text()")
    # title
    name = parse_movie.xpath("//h1/span[1]/text()")
    # rating
    score = parse_movie.xpath("//div[@class='rating_self clearfix']/strong/text()")
    # number of votes
    value = parse_movie.xpath("//span[@property='v:votes']/text()")
    number = [" ".join(['Votes:'] + value)]
    # value = parse_movie.xpath("//a[@class='rating_people']")
    # string = [value[0].xpath('string(.)')]
    # number = [a.strip() for a in string]
    # print(number)
    # genres
    value = parse_movie.xpath("//span[@property='v:genre']/text()")
    types = [" ".join(['Genres:'] + value)]
    # production country/region (the regex matches the Chinese label in Douban's markup)
    value = re.findall('<span class="pl">制片国家/地区:</span>(.*?)<br/>', movie_pages.text)
    country = [" ".join(['Country:'] + value)]
    # language (again matching the Chinese label in the page source)
    value = re.findall('<span class="pl">语言:</span>(.*?)<br/>', movie_pages.text)
    language = [" ".join(['Language:'] + value)]
    # release date
    value = parse_movie.xpath("//span[@property='v:initialReleaseDate']/text()")
    date = [" ".join(['Release date:'] + value)]
    # runtime (this local name shadows the time module, but only inside this function)
    value = parse_movie.xpath("//span[@property='v:runtime']/text()")
    time = [" ".join(['Runtime:'] + value)]
    # alternative titles
    value = re.findall('<span class="pl">又名:</span>(.*?)<br/>', movie_pages.text)
    other_name = [" ".join(['Also known as:'] + value)]
    # director
    value = parse_movie.xpath("//div[@id='info']/span[1]/span[@class='attrs']/a/text()")
    director = [" ".join(['Director:'] + value)]
    # screenwriter
    value = parse_movie.xpath("//div[@id='info']/span[2]/span[@class='attrs']/a/text()")
    screenwriter = [" ".join(['Screenwriter:'] + value)]
    # cast
    value = parse_movie.xpath("//div[@id='info']/span[3]")
    performer = [value[0].xpath('string(.)')]
    # URL (movie_url is read from the global scope set in the main loop below)
    m_url = ['Douban link:' + movie_url]
    # IMDb link
    value = parse_movie.xpath("//div[@id='info']/a/@href")
    imdb_url = [" ".join(['IMDb link:'] + value)]
    # save the movie poster
    poster = parse_movie.xpath("//div[@id='mainpic']/a/img/@src")
    response = requests.get(poster[0])
    name2 = re.sub(r'[A-Za-z\:\s]', '', name[0])
    poster_name = str(ranking[0]) + ' - ' + name2 + '.jpg'
    dir_name = 'douban_poster'
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    poster_path = dir_name + '/' + poster_name
    with open(poster_path, "wb") as f:
        f.write(response.content)
    return zip(ranking, name, score, number, types, country, language, date, time, other_name,
               director, screenwriter, performer, m_url, imdb_url)

def save_results(data):
    with open('douban.csv', 'a', encoding="utf-8-sig") as fp:
        writer = csv.writer(fp)
        writer.writerow(data)

if __name__ == '__main__':
    num = 0
    for i in range(0, 250, 25):
        movie_urls = index_pages(i)
        for movie_url in movie_urls:
            results = parse_pages(movie_url)
            for result in results:
                num += 1
                save_results(result)
                print('Movie record No. ' + str(num) + ' saved!')
            time.sleep(3)
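Because save_results() opens douban.csv in append mode and writes only data rows, the file ends up without a header. A minimal sketch that writes one before the main loop starts (the column names are inferred from the fields collected above, not taken from the original article):

import csv

def write_header():
    # Run once before scraping; the order mirrors the zip() in parse_pages().
    header = ['ranking', 'title', 'score', 'votes', 'genres', 'country', 'language',
              'release_date', 'runtime', 'aka', 'director', 'screenwriter',
              'cast', 'douban_url', 'imdb_url']
    with open('douban.csv', 'w', encoding='utf-8-sig', newline='') as fp:
        csv.writer(fp).writerow(header)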
