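Python Web Scraping for Beginners (1): Scraping the Douban Movie Top 250 (requests + XPath + csv)

Time: 2021-11-20 23:54:22

The Douban Movie Top 250 list is a good first project for learning web scraping. Before writing any code, it helps to be clear about the overall workflow of a scraper.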

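I. Workflow Analysis

1. Request the page and get the server's response.

2. Study the structure of the page source and extract the required fields with XPath or another parsing method.

3. Save the extracted data to a file with the csv module (beginners can write to a plain txt file first, which is simpler).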

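II. Implementation and Walkthrough

1. Request the page

Set the request headers, set the encoding used to decode the response, and fetch the page with the get method from requests.

How to find the headers:

Open the page, press F12 (or open the browser's developer tools), go to the Network tab, select a request, and look for User-Agent under Headers.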
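To see what setting this header does to the outgoing request, it can help to build a request and inspect it without sending anything. The sketch below is only an illustration: it uses the standard library's urllib.request instead of requests so that it runs without a network connection or extra installs, and the URL and the build_request helper are placeholders, not part of the original code.

```python
from urllib.request import Request

# Placeholder URL for illustration; no request is actually sent.
URL = 'https://example.com/top250'
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36')


def build_request(url, user_agent):
    """Return an unsent GET request that carries the given User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})


req = build_request(URL, USER_AGENT)
# urllib stores header names capitalized as 'User-agent'.
print(req.get_method(), req.full_url)
print(req.get_header('User-agent'))
```

With requests, the same dictionary is simply passed as the headers argument of requests.get.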

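import requests

def get_response(url):
    # Send a browser-like User-Agent; a site may reject the default python-requests one.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    response.encoding = 'UTF-8'
    return response.text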

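2. Analyze the page source

Looking at the page source shows that each movie sits inside its own li tag, which makes the structure easy to work with.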
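As a warm-up for the XPath expressions used in this step, here is a minimal sketch on a made-up, simplified fragment. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, rather than lxml, so that it runs without extra installs. The markup and titles are hypothetical and only mimic the "one li per movie" shape described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup shaped like one page of the list.
SAMPLE = """
<ol>
  <li><div class="item">
    <span class="title">Movie A</span>
    <span class="title"> / Alt Title A</span>
  </div></li>
  <li><div class="item">
    <span class="title">Movie B</span>
  </div></li>
</ol>
"""

root = ET.fromstring(SAMPLE)
# Select every div with class "item" that is a direct child of an li.
items = root.findall(".//li/div[@class='item']")
# find() returns the first matching span inside each item.
names = [item.find(".//span[@class='title']").text for item in items]
print(names)  # ['Movie A', 'Movie B']
```

With lxml, the equivalent calls are etree.HTML(html) followed by xpath(...), which also tolerates real-world HTML that is not well-formed XML.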

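from lxml import etree

def get_nodes(html):
    tree = etree.HTML(html)
    nodes = tree.xpath('//li/div[@class="item"]')  # one node per movie
    infos = []
    for node in nodes:
        key = {}
        # xpath() returns a list; take the first title instead of str()-ing the list.
        titles = node.xpath('.//span[@class="title"][1]/text()')
        key['movieName'] = titles[0].strip() if titles else None
        print(key['movieName'])
        info = node.xpath('.//div[@class="bd"]/p/text()')
        if len(info) < 2:
            continue  # unexpected layout, skip this entry
        firstInfo, secondInfo = info[0], info[1]
        # firstInfo holds "导演: ... 主演: ..."; the cast part may be missing.
        director, _, actors = firstInfo.partition('主演:')
        # str.strip('导演:') would strip a character set, so drop the label explicitly.
        key['director'] = director.replace('导演:', '', 1).strip()
        key['actors'] = actors.strip() or None
        # secondInfo holds "time / country / ...".
        parts = [part.strip() for part in secondInfo.split('/')]
        key['time'] = parts[0]
        key['country'] = parts[1] if len(parts) > 1 else None
        infos.append(key)
    return infos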

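This gives us the movie title, director, cast, release time, and country for each entry. (If the XPath syntax is unfamiliar, the W3School site has a reference; the syntax is simple and easy to learn.)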
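One detail worth knowing when cleaning the extracted text: str.strip takes a set of characters, not a prefix, so stripping '导演:' can also remove characters that belong to a name. The sketch below shows the pitfall and one possible way to split the credits line. The helper names and the sample line (including the people's names) are made up for illustration.

```python
def strip_prefix(text, prefix):
    """Remove prefix from text only if text actually starts with it."""
    return text[len(prefix):] if text.startswith(prefix) else text


def split_credits(line):
    """Split a '导演: ... 主演: ...' line into (director, actors)."""
    # partition() never raises: if '主演:' is absent, actor_part is ''.
    director_part, _, actor_part = line.partition('主演:')
    director = strip_prefix(director_part.strip(), '导演:').strip()
    actors = actor_part.strip() or None
    return director, actors


# Pitfall: strip() treats its argument as a set of characters.
print('导演:演小明'.strip('导演:'))          # 小明  (the leading 演 of the name is lost)
print(strip_prefix('导演:演小明', '导演:'))  # 演小明

# Hypothetical sample line; \xa0 is the non-breaking space used as padding.
SAMPLE_LINE = '\n      导演: 张三 San Zhang\xa0\xa0\xa0主演: 李四 Si Li / 王五 Wu Wang\n'
print(split_credits(SAMPLE_LINE))
print(split_credits('导演: 张三 San Zhang'))  # no cast listed
```

Using partition instead of split(...)[1] means an entry without a cast still yields a row, with None in the cast column, rather than raising an IndexError.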

3. Write the information to a CSV file

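import csv
import os

def save_file(infos):
    path = 'DouBanMovieT250.csv'
    headers = ['Title', 'Director', 'Cast', 'Release time', 'Country']
    # This function runs once per page, so write the header only for a new or empty file.
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='UTF-8', newline='') as fp:
        writer = csv.writer(fp)
        if need_header:
            writer.writerow(headers)
        for key in infos:
            writer.writerow([key['movieName'], key['director'], key['actors'], key['time'], key['country']])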

Here csv.writer(fp) first creates a writer object, and we then call its writerow method to write each row to the file. With that, the scraper for the first page is complete.
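Because a multi-page run calls the save step once per page while appending to the same file, the header row can end up repeated. The following is a minimal sketch of one way to write it only once, using csv.DictWriter. The append_rows helper, the field list, and the sample rows are illustrative assumptions, and the demo writes to a temporary directory rather than the real output file.

```python
import csv
import os
import tempfile

FIELDS = ['movieName', 'director', 'actors', 'time', 'country']


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8', newline='') as fp:
        writer = csv.DictWriter(fp, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


# Demo with made-up rows, simulating two pages saved one after the other.
demo_path = os.path.join(tempfile.mkdtemp(), 'movies.csv')
page_one = [{'movieName': 'Movie A', 'director': 'Director A', 'actors': 'Actor A',
             'time': '1994', 'country': 'Country A'}]
page_two = [{'movieName': 'Movie B', 'director': 'Director B', 'actors': None,
             'time': '1993', 'country': 'Country B'}]
append_rows(demo_path, page_one)
append_rows(demo_path, page_two)

with open(demo_path, encoding='utf-8', newline='') as fp:
    saved = list(csv.reader(fp))
print(saved)
```

The file ends up with one header row followed by one row per movie. A None value is written as an empty cell.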

Of course, we want more than just the first page, so next we look at how to scrape across pages.

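if __name__ == '__main__':
    # Assumption: the host is missing from the URL in the original post;
    # Douban serves this list from movie.douban.com.
    base_url = 'https://movie.douban.com'
    urls = [base_url + '/top250?start={}'.format(i) for i in range(0, 226, 25)]
    for url in urls:
        html = get_response(url)
        infos = get_nodes(html)
        save_file(infos)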

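Inspecting the page's GET requests shows that the start parameter changes in a fixed pattern, increasing by 25 on each page (0, 25, ..., 225). We can therefore generate every page's request parameter directly with a for loop.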
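The pagination pattern can be checked on its own before any request is sent. The sketch below builds the ten page URLs with urllib.parse. The page_urls helper and the BASE_URL value are assumptions for illustration: the original post shows only the /top250?start=... path, and the host is supplied here as Douban's movie site.

```python
from urllib.parse import urlencode, urljoin

# Assumption: the host is not given in the original post.
BASE_URL = 'https://movie.douban.com'


def page_urls(total=250, page_size=25):
    """Return one list-page URL per block of page_size movies."""
    return [
        urljoin(BASE_URL, '/top250?' + urlencode({'start': start}))
        for start in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))  # 10
print(urls[0])    # https://movie.douban.com/top250?start=0
print(urls[-1])   # https://movie.douban.com/top250?start=225
```

range(0, 250, 25) produces the same offsets as the range(0, 226, 25) used in the article: 0, 25, and so on up to 225.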

That completes the Douban movie scraper. Running it produces DouBanMovieT250.csv with one row per movie.

Complete code

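import csv
import os

import requests
from lxml import etree


def get_response(url):
    # Send a browser-like User-Agent; a site may reject the default python-requests one.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    response.encoding = 'UTF-8'
    return response.text


def get_nodes(html):
    tree = etree.HTML(html)
    nodes = tree.xpath('//li/div[@class="item"]')  # one node per movie
    infos = []
    for node in nodes:
        key = {}
        # xpath() returns a list; take the first title instead of str()-ing the list.
        titles = node.xpath('.//span[@class="title"][1]/text()')
        key['movieName'] = titles[0].strip() if titles else None
        print(key['movieName'])
        info = node.xpath('.//div[@class="bd"]/p/text()')
        if len(info) < 2:
            continue  # unexpected layout, skip this entry
        firstInfo, secondInfo = info[0], info[1]
        # firstInfo holds "导演: ... 主演: ..."; the cast part may be missing.
        director, _, actors = firstInfo.partition('主演:')
        # str.strip('导演:') would strip a character set, so drop the label explicitly.
        key['director'] = director.replace('导演:', '', 1).strip()
        key['actors'] = actors.strip() or None
        # secondInfo holds "time / country / ...".
        parts = [part.strip() for part in secondInfo.split('/')]
        key['time'] = parts[0]
        key['country'] = parts[1] if len(parts) > 1 else None
        infos.append(key)
    return infos


def save_file(infos):
    path = 'DouBanMovieT250.csv'
    headers = ['Title', 'Director', 'Cast', 'Release time', 'Country']
    # This function runs once per page, so write the header only for a new or empty file.
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='UTF-8', newline='') as fp:
        writer = csv.writer(fp)
        if need_header:
            writer.writerow(headers)
        for key in infos:
            writer.writerow([key['movieName'], key['director'], key['actors'], key['time'], key['country']])


if __name__ == '__main__':
    # Assumption: the host is missing from the URL in the original post;
    # Douban serves this list from movie.douban.com.
    base_url = 'https://movie.douban.com'
    urls = [base_url + '/top250?start={}'.format(i) for i in range(0, 226, 25)]
    for url in urls:
        html = get_response(url)
        infos = get_nodes(html)
        save_file(infos)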

Make a little progress every day. Keep going!
