300字范文 > Python爬虫实战 | (1) 爬取猫眼电影官网的TOP100电影榜单

Python爬虫实战 | (1) 爬取猫眼电影官网的TOP100电影榜单

时间：2019-04-07 02:31:09

在本篇博客中，我们将使用requests+正则表达式来爬取猫眼电影官网的TOP100电影榜单，获取每部电影的片名，主演，上映日期，评分和封面等内容。

打开猫眼Top100，分析URL的变化：发现Top100榜总共包含10页，每页10部电影，并且每一页的URL都是有规律的，如第2页为/board/4?offset=10，第三页为/board/4?offset=20。由此可得第n页为/board/4?offset=(n-1)*10。接下来我们通过爬虫四部曲，来对其进行爬取：

首先搭建起程序的主题框架：

import csvimport jsonimport timeimport reimport requestsfrom requests import RequestExceptiondef get_one_page(url):passdef parse_one_page(html):passdef write_tofile(content):passif __name__ == '__main__':for i in range(10):url = '/board/4?offset='+str(i*10)#发送请求获取响应html = get_one_page(url)#解析响应内容content = parse_one_page(html)#数据存储write_tofile(content)#每个页面间隔1stime.sleep(1)

然后逐一补全上述函数。首先是请求页面的函数：

def get_one_page(url):try:#添加User-Agent，放在headers中，伪装成浏览器headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}response = requests.get(url,headers=headers)if response.status_code == 200:response.encoding = response.apparent_encodingreturn response.textreturn Noneexcept RequestException:return None

其中User-Agent的值，可以打开Chrome浏览器，随便打开一个页面，右键检查，在Network中打开一个请求，在Headers中便可以找到，把值直接copy过去就好。

然后编写页面解析函数，首先用Chrome打开Top100榜页面，右键检查，在element选项卡中定位页面元素：

点击左上方的箭头图标，此时将鼠标移动到页面的任意位置，该位置对应的html代码就会在Elements中被定位，退出箭头模式可以按esc。我们发现每部电影都被包含在dd标签中，所有我们想要的信息也都在其中：

接下来我们可以把dd标签内所有的html代码全部copy下来，把冗余的部分用.*?过滤掉，留一些标识来帮助我们定位想要的信息，把想要的信息放在(.*?)中，进行提取：

def parse_one_page(html):pattern = pile('<dd>.*?board-index.*?">(.*?)' #序号+'.*?data-src="(.*?)"' #图片地址+'.*?class="name"><a.*?>(.*?)</a>' #影片名+'.*?class="star">(.*?)' #主演+'.*?class="releasetime">(.*?)'#上映时间+'.*?class="score">(.*?)'#评分整数+'.*?class="fraction">(.*?)'#评分小数+'.*?</dd>',re.S)items = pattern.findall(html) #返回一个数组content = []for item in items:content.append({'index':item[0],'image':item[1],'title':item[2],'actor':item[3],'time':item[4],'score':item[5]+item[6]})return content

最后进行数据存储：

def write_tofile(content):'''#存为json形式文本文件# for item in content:#with open('result.txt','a',encoding='utf-8') as f:# f.write(json.dumps(item,ensure_ascii=False)+'\n')'''#存为csv文件for item in content:print(item)with open('result.csv','a',encoding='utf-8') as f:writer = csv.writer(f)writer.writerow([item['index'],item['image'],item['title'],item['actor'],item['time'],item['score']])f.close()

完整代码：

import csvimport jsonimport timeimport reimport requestsfrom requests import RequestExceptiondef get_one_page(url):try:#添加User-Agent，放在headers中，伪装成浏览器headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}response = requests.get(url,headers=headers)if response.status_code == 200:response.encoding = response.apparent_encodingreturn response.textreturn Noneexcept RequestException:return Nonedef parse_one_page(html):pattern = pile('<dd>.*?board-index.*?">(.*?)' #序号+'.*?data-src="(.*?)"' #图片地址+'.*?class="name"><a.*?>(.*?)</a>' #影片名+'.*?class="star">(.*?)' #主演+'.*?class="releasetime">(.*?)'#上映时间+'.*?class="score">(.*?)'#评分整数+'.*?class="fraction">(.*?)'#评分小数+'.*?</dd>',re.S)items = pattern.findall(html) #返回一个数组content = []for item in items:content.append({'index':item[0],'image':item[1],'title':item[2],'actor':item[3],'time':item[4],'score':item[5]+item[6]})return contentdef write_tofile(content):'''#存为json形式文本文件# for item in content:#with open('result.txt','a',encoding='utf-8') as f:# f.write(json.dumps(item,ensure_ascii=False)+'\n')'''#存为csv文件for item in content:print(item)with open('result.csv','a',encoding='utf-8') as f:writer = csv.writer(f)writer.writerow([item['index'],item['image'],item['title'],item['actor'],item['time'],item['score']])f.close()if __name__ == '__main__':#写入csv文件头部with open('result.csv','w',encoding='utf-8') as f:writer = csv.writer(f)writer.writerow(['序号', '图片地址', '影片名', '主演', '上映时间', '评分'])f.close()for i in range(10):url = '/board/4?offset='+str(i*10)print(url)#发送请求获取响应html = get_one_page(url)#解析响应内容content = parse_one_page(html)#数据存储write_tofile(content)#每个页面间隔1stime.sleep(1)

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。