
Python crawler: CSV file operations and scraping Douban movie reviews to generate a word cloud

Posted: 2019-12-28 14:07:19

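The script below requests Douban's "best reviews" pages with a browser-style request header, parses each review block with BeautifulSoup, and writes the author, rating, and short content of every review into a CSV file.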

import requests
from bs4 import BeautifulSoup
import csv

# Fetch one page of reviews and collect them into reviewList
def getCommentByPage(url):
    # response = requests.get(url)
    # print(response.status_code)  # Douban blocks bare requests and returns error code 418
    # The script has to imitate a browser: a browser sends extra request
    # information along with the request, while a plain Python script does not.
    # 1. Set a request header (pretend to be a browser).
    # Because Douban uses anti-crawler checks, the header usually needs the
    # browser information (User-Agent) and the user information (Cookie).
    # Format: variable = {"key1": "value1", "key2": "value2"}; if a value
    # contains double quotes, wrap it in single quotes instead.
    header = {
        # Browser information
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36",
        # User information
        'Cookie': 'll="118104"; bid=lcfxfqam0UI; __utmz=30149280.1600249245.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; gr_user_id=84e4d664-c0db-4b2f-9063-5db2220d00b9; _vwo_uuid_v2=D5A4EEC3511DE982EE357AC0DEFCBDD68|a2199e2a903fdca1c047a02bf8ff8a91; viewed="34710120"; __utmz=223695111.1600249273.1.1.utmcsr=|utmccn=(referral)|utmcmd=referral|utmcct=/; __yadk_uid=pWo4BT7vwgPHXL7Bg7HkUFrerRqxQMVc; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1600305602%2C%22https%3A%2F%%2F%22%5D; _pk_ses.100001.4cf6=*; ap_v=0,6.0; __utma=30149280.132873610.1600249245.1600249245.1600305602.2; __utmc=30149280; __utma=223695111.407950405.1600249273.1600249273.1600305602.2; __utmc=223695111; __utmt=1; ct=y; __utmt=1; __utmb=30149280.1.10.1600305602; __gads=ID=ca7fb5b75ca8:T=1600249841:S=ALNI_MazhI6AdCCbmBvG3dw7KIZ7zE6hAg; __utmb=223695111.10.10.1600305602; _pk_id.100001.4cf6=0741429674f84c3c.1600249273.2.1600305674.1600250207.'
    }
    # Send the request with the header attached
    response = requests.get(url, headers=header)
    if response.status_code != 200:
        print("Request failed")
    else:
        # Parse the page with BeautifulSoup and the html5lib parser
        bs = BeautifulSoup(response.content, "html5lib")
        # Find every review block: <div class="review-item">
        reviewItemList = bs.find_all("div", attrs={"class": "review-item"})
        for reviewItem in reviewItemList:
            # Author
            author = reviewItem.find("a", attrs={"class": "name"}).text
            # Rating; some reviews have no rating, so guard against None
            rating = reviewItem.find("span", attrs={"class": "main-title-rating"})
            star = rating.get("title") if rating is not None else ""
            # Short review text, with newlines replaced by spaces
            content = reviewItem.find("div", attrs={"class": "short-content"}).text.replace("\n", " ")
            # Store author, rating, and content as one row in reviewList
            reviewList.append([author, star, content])

# Write the collected reviews into a CSV file
def writeReview():
    with open("最受欢迎影评.csv", "w", newline="", encoding="utf-8") as fileW:  # "most popular reviews" CSV
        csvW = csv.writer(fileW)
        csvW.writerows(reviewList)

# Entry point of the program
if __name__ == '__main__':
    # Store the reviews; names defined here are module-level globals,
    # so getCommentByPage() and writeReview() can use reviewList directly
    reviewList = []
    # Crawl the first 10 pages of "best" reviews, 20 reviews per page
    for i in range(10):
        baseUrl = "https://movie.douban." \
                  "com/review/best/?start={}".format(i * 20)
        getCommentByPage(baseUrl)
    # Write the short reviews to the CSV file
    writeReview()
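The title also promises a word cloud built from the scraped reviews, but that step is not shown in the code above. The sketch below is one way to do it, assuming the jieba and wordcloud packages are installed (pip install jieba wordcloud) and that a Chinese-capable font file such as simhei.ttf is available; the font path, output filename, and the makeWordCloud name are illustrative assumptions, not part of the original script.

import csv
import jieba
from wordcloud import WordCloud

# A minimal sketch (not from the original article): read the CSV written by
# writeReview(), segment the Chinese text with jieba, and render a word cloud.
def makeWordCloud():
    # Read the review rows; column index 2 holds the short review content
    with open("最受欢迎影评.csv", "r", encoding="utf-8") as fileR:
        rows = list(csv.reader(fileR))
    text = " ".join(row[2] for row in rows if len(row) >= 3)
    # Segment the text so the word cloud counts Chinese words, not single characters
    words = " ".join(jieba.cut(text))
    # font_path is an assumption: point it at any font that can display Chinese
    wc = WordCloud(font_path="simhei.ttf", width=800, height=600,
                   background_color="white")
    wc.generate(words)
    wc.to_file("review_wordcloud.png")  # save the word cloud image

if __name__ == '__main__':
    makeWordCloud()

Run this after the crawler has produced the CSV file: the more often a word appears in the review text, the larger it is drawn in review_wordcloud.png.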
