Python Web Scraping: A Beginner Scrapes the Bilibili Animation Ranking

Date: 2019-10-17 07:29:35

Scraping the Bilibili Animation Ranking

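Preface

Like my previous post on scraping CSDN article content, this one is a small exercise: it scrapes the title, link, and overall score (综合得分) of each entry on Bilibili's animation ranking and saves the results to an Excel file. The code is commented this time and the method is simple, so it should be good practice for beginners.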
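To try the extraction step without touching the network, here is a minimal sketch that runs on an inline HTML fragment. The fragment is made up: it only imitates the structure the script below relies on (a div with class "info", an a tag with class "title", and a score div followed by the text 综合得分), and the example.com links and sample titles are placeholders. It uses only the standard re module, so it is a simplified stand-in for the BeautifulSoup version, not a replacement for it.

```python
import re

# Made-up fragment imitating the structure the scraper expects (assumption).
SAMPLE_HTML = """
<div class="info"><a class="title" href="//example.com/video/1">Sample A</a>
<div class="pts"><div>12345</div>综合得分</div></div>
<div class="info"><a class="title" href="//example.com/video/2">Sample B</a>
<div class="pts"><div>6789</div>综合得分</div></div>
"""

# Capture the href and the link text of every <a class="title"> tag.
TITLE_LINK = re.compile(r'<a class="title" href="([^"]+)">(.+?)</a>')
# Capture the number in the <div> that sits right before the score label.
SCORE = re.compile(r"<div>(.+?)</div>综合得分")

links_and_titles = TITLE_LINK.findall(SAMPLE_HTML)
scores = SCORE.findall(SAMPLE_HTML)

# Pair each (href, title) with its score, in page order.
records = [
    (title, href, score)
    for (href, title), score in zip(links_and_titles, scores)
]

for record in records:
    print(record)
```

Regular expressions are brittle against real-world HTML, which is one reason the full script leans on BeautifulSoup for locating the tags.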

1. The Page to Scrape

2. Code

import re

from bs4 import BeautifulSoup
import pandas as pd
import requests

# Headers that make the request look like it comes from a browser
header = {
    'Host': '',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/0101 Firefox/84.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': '/',
    'Connection': 'keep-alive',
    'Cookie': '_uuid=F8EE5482-2CE1-3426-0C4C-52ABF60EEF6896527infoc; buvid3=FD3F5860-5D46-461A-B7DC-313A2F98C59A58475infoc; finger=-863946392',
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'max-age=0',
    'TE': 'Trailers',
}

# URL of the page to scrape (the Bilibili animation ranking).
# The scheme and host are missing here, as are the Host and Referer values
# above; requests.get needs a full URL to run.
url = "/v/popular/rank/douga?spm_id_from=333.851.b_62696c695f7265706f72745f646f756761.39"

# Fetch the page and parse the HTML
response = requests.get(url, headers=header).text
soup = BeautifulSoup(response, features='html.parser')

# Find every <div class="info">; these hold the data we want, plus some extra markup
all_content = soup.find_all('div', {'class': "info"})

# Use a regular expression to pull out each animation's overall score
score = re.findall(r"<div>(.+?)</div>综合得分", str(all_content))

all_hrefs = []
all_titles = []

# all_content is a list, so loop over it to extract each link and title
for one_content in all_content:
    # The <a class="title"> tag carries both the title and the link
    ones = one_content.find_all('a', {"class": 'title'})
    for one in ones:
        # Link: strip the leading '//' of the protocol-relative URL
        one_href = one['href']
        one_href = one_href.replace('//', '')
        # Title: the text between the opening and closing <a> tags
        one_title = re.findall(r">(.+?)</a>", str(one))
        # Append the link and the title to their lists
        all_hrefs.append(one_href)
        all_titles.append(one_title[0])

# Put the three lists in a dictionary (named data so it does not shadow the built-in dict)
data = {
    'title (animation name)': all_titles,
    'href (animation link)': all_hrefs,
    'score (overall score)': score,
}

# Convert to a DataFrame
df = pd.DataFrame(data)
# Start the index at 1 instead of 0
df.index = df.index + 1
# Save to an Excel file
df.to_excel('E:/output/bilibili_动画.xlsx')
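The script strips the leading // from each link, which leaves a host and path with no scheme. If clickable links are wanted in the spreadsheet, one alternative is to resolve each href against a base URL with the standard library's urljoin. This is only a sketch: the function name absolutize and the example.com base are hypothetical placeholders, not part of the original script.

```python
from urllib.parse import urljoin


def absolutize(href: str, base: str = "https://example.com/") -> str:
    """Resolve a possibly protocol-relative or site-relative href against base.

    The default base is a placeholder (assumption); pass the real page URL.
    """
    return urljoin(base, href)


# A protocol-relative link inherits the scheme of the base URL.
print(absolutize("//example.com/video/abc123"))
# A site-relative link inherits both scheme and host.
print(absolutize("/video/abc123"))
# An already absolute link is returned unchanged.
print(absolutize("http://other.example.org/x"))
```

Compared with replace('//', ''), this keeps the scheme and also leaves any // that appears later in a URL untouched.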

Results
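One thing worth checking before trusting the output: the titles, links, and scores are collected separately, and pandas raises a ValueError if the lists handed to DataFrame differ in length (for instance, if one entry had no score). Below is a small sketch of a length check plus a 1-based rank. The helper names build_rows and rows_to_csv are hypothetical, the sample data is made up, and CSV via the standard library stands in for the Excel step so the sketch runs without pandas.

```python
import csv
import io


def build_rows(titles, hrefs, scores):
    """Pair up the three lists, refusing to continue if their lengths differ."""
    if not (len(titles) == len(hrefs) == len(scores)):
        raise ValueError(
            f"length mismatch: {len(titles)} titles, "
            f"{len(hrefs)} links, {len(scores)} scores"
        )
    # Rank starts at 1, like the shifted DataFrame index in the script.
    return [
        {"rank": i, "title": t, "href": h, "score": s}
        for i, (t, h, s) in enumerate(zip(titles, hrefs, scores), start=1)
    ]


def rows_to_csv(rows):
    """Serialize the rows to CSV text (a stand-in for df.to_excel)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["rank", "title", "href", "score"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data, not real ranking entries.
rows = build_rows(
    ["Sample A", "Sample B"],
    ["example.com/a", "example.com/b"],
    ["100", "90"],
)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Failing early with a message that names the three counts makes it easier to see which selector missed an entry.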

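Summary

After a few rounds of practice I'm finding this pretty fun. Next time I plan to try something a bit more technically demanding, haha.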
