Python Web Scraping for Beginners: Scraping Basketball Game Data from a US Sports Site (selenium + XPath)

Date: 2019-04-21 22:16:41

1. Examining the URL and the data to scrape

2. Analyzing the page structure

3. Scraping the data

1. Examining the URL and the data to scrape

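The URL ends with a-index.html. Opening a few more pages shows that every link ends in the form "lowercase letter-index.html", so we build each URL by appending that suffix to the base URL and store the finished URLs in a container (a list). Next, the data itself: on inspection, every column on the page is needed and nothing can be dropped. (If a column were useless, we would simply skip it while scraping.) During scraping, the link ending in x-index.html could not be opened. The cause turned out to be that there is no such link: on the site, the letter x is shown in black text. So this URL should not be constructed in the first place.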

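# Build the page URLs by appending each suffix to the base URL
def getUrls(self, head_url):
    urls = []
    for i in range(97, 123):      # ASCII codes for 'a' to 'z'
        if i == 120:              # 'x' has no page, so skip it
            continue
        urls.append(head_url + chr(i) + '-index.html')
    return urls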
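The same list can also be built from the letters themselves rather than from ASCII codes. The following is a minimal standalone sketch, not the author's code: build_urls, its skip parameter, and the example.com base URL are names assumed for illustration.

```python
import string


def build_urls(head_url, skip=("x",)):
    """Return one index URL per lowercase letter, leaving out the skipped letters."""
    return [
        head_url + letter + "-index.html"
        for letter in string.ascii_lowercase
        if letter not in skip
    ]


if __name__ == "__main__":
    # Placeholder base URL, assumed for illustration only
    urls = build_urls("https://example.com/coaches/")
    print(len(urls))   # 25
    print(urls[0])     # https://example.com/coaches/a-index.html
```

Keeping the skipped letters in a parameter makes it easy to exclude another letter if its page turns out to be missing as well.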

Note that useless data of this kind should not be scraped.

2. Analyzing the page structure

Press F12 to view the page source and find the part that holds the data to scrape. It is easy to see that all of it sits inside tbody.

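Next, we look at exactly which tags hold the data. Inspecting the tags inside each tr shows that each td corresponds to one value; the only difference is that two of the values sit in an a tag under the td.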

# Use xpath to select only the useful rows
trTags = tree.xpath('//tr[@data-row and not(@class)]')
for tag in trTags:
    try:
        item = []
        # xpath returns a list, so take [0]; otherwise the value keeps its [] brackets
        item.append(tag.xpath('./th[1]/text()')[0])
        # this value sits inside an a tag
        item.append(tag.xpath('./td[1]/a/text()')[0])
        item.append(tag.xpath('./td[2]/text()')[0])
        item.append(tag.xpath('./td[3]/text()')[0])
        item.append(tag.xpath('./td[4]/a/text()')[0])
        item.append(tag.xpath('./td[5]/text()')[0])
        item.append(tag.xpath('./td[6]/text()')[0])
        item.append(tag.xpath('./td[7]/text()')[0])
        item.append(tag.xpath('./td[8]/text()')[0])
        item.append(tag.xpath('./td[9]/text()')[0])
        item.append(tag.xpath('./td[10]/text()')[0])
        item.append(tag.xpath('./td[11]/text()')[0])
        item.append(tag.xpath('./td[12]/text()')[0])
        item.append(tag.xpath('./td[13]/text()')[0])
        item.append(tag.xpath('./td[14]/text()')[0])
        items.append(item)
    except IndexError:
        # an empty cell makes xpath return [], so [0] fails; skip that row
        continue
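To see what the row filter and the per-cell lookups do, here is a small self-contained sketch on a made-up table fragment. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages; ElementTree supports only a subset of XPath, so the not(@class) test is done in Python instead. The sample markup, the values in it, and the function extract_rows are assumptions for illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: two data rows and one row carrying a class attribute
SAMPLE = """
<table>
  <tbody>
    <tr data-row="0"><th>1</th><td><a href="#">Coach A</a></td><td>10</td></tr>
    <tr data-row="1" class="thead"><th>Rk</th><td>Coach</td><td>W</td></tr>
    <tr data-row="2"><th>2</th><td><a href="#">Coach B</a></td><td>7</td></tr>
  </tbody>
</table>
"""


def extract_rows(markup):
    """Return [rank, name, wins] for every tr that has data-row and no class."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.findall(".//tr[@data-row]"):
        if "class" in tr.attrib:
            continue  # same effect as not(@class) in the lxml expression
        rank = tr.find("./th[1]").text
        name = tr.find("./td[1]/a").text  # this value sits inside an a tag
        wins = tr.find("./td[2]").text
        rows.append([rank, name, wins])
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

The row with a class attribute is dropped, and only the two data rows remain.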

3. Scraping the data

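At first I used Requests to fetch the pages, but when I parsed the response I found that the returned HTML did not contain the data I needed. After looking into it, the reason is that this kind of page loads its data dynamically with JavaScript. My solution for such pages is the third-party package selenium, a browser automation tool normally used for automated web testing.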
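The selenium step can be sketched as a small helper, under stated assumptions: fetch_rendered_html and its driver_factory parameter are hypothetical names, not part of selenium. The helper accepts any callable that returns a driver object offering get, page_source and quit, so it can be handed webdriver.Chrome in real use and a stub when no browser is available. The try/finally makes sure the browser is closed even if loading the page fails.

```python
def fetch_rendered_html(url, driver_factory):
    """Open url in a browser driver and return the rendered page source."""
    browser = driver_factory()
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even on errors


class FakeDriver:
    """Stub standing in for a real browser driver (illustration only)."""

    def __init__(self):
        self.closed = False
        self.page_source = ""

    def get(self, url):
        self.page_source = "<html><body>rendered: " + url + "</body></html>"

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    driver = FakeDriver()
    html = fetch_rendered_html("https://example.com/a-index.html", lambda: driver)
    print(html)
    print(driver.closed)
```

With selenium and a Chrome driver installed, the real call would be fetch_rendered_html(url, webdriver.Chrome) after from selenium import webdriver.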

To install and use selenium, see "Installing selenium and the Chrome plugin".

import csv

from lxml import etree
from selenium import webdriver
# needed only by the commented-out double-click code in spider()
from selenium.webdriver.common.action_chains import ActionChains


class GetTiebaInfo(object):
    def __init__(self, url):
        self.url = url
        self.urls = self.getUrls(self.url)
        self.items = self.spider(self.urls)
        self.pipelines(self.items)

    def getUrls(self, head_url):
        urls = []
        for i in range(97, 123):      # ASCII codes for 'a' to 'z'
            if i == 120:              # 'x' has no page, so skip it
                continue
            urls.append(head_url + chr(i) + '-index.html')
        return urls

    # Spider module
    def spider(self, urls):
        items = []
        for url in urls:
            print(url)
            browser = webdriver.Chrome()
            browser.maximize_window()
            browser.get(url)
            # left_click = browser.find_element_by_css_selector(".ranker.poptip.sort_default_asc.center")  # button handling
            # ActionChains(browser).double_click(left_click).perform()
            tree = etree.HTML(browser.page_source)
            browser.quit()  # close this page's browser before opening the next one
            trTags = tree.xpath('//tr[@data-row and not(@class)]')
            for tag in trTags:
                try:
                    item = []
                    item.append(tag.xpath('./th[1]/text()')[0])
                    item.append(tag.xpath('./td[1]/a/text()')[0])
                    item.append(tag.xpath('./td[2]/text()')[0])
                    item.append(tag.xpath('./td[3]/text()')[0])
                    item.append(tag.xpath('./td[4]/a/text()')[0])
                    item.append(tag.xpath('./td[5]/text()')[0])
                    item.append(tag.xpath('./td[6]/text()')[0])
                    item.append(tag.xpath('./td[7]/text()')[0])
                    item.append(tag.xpath('./td[8]/text()')[0])
                    item.append(tag.xpath('./td[9]/text()')[0])
                    item.append(tag.xpath('./td[10]/text()')[0])
                    item.append(tag.xpath('./td[11]/text()')[0])
                    item.append(tag.xpath('./td[12]/text()')[0])
                    item.append(tag.xpath('./td[13]/text()')[0])
                    item.append(tag.xpath('./td[14]/text()')[0])
                    # if tag.xpath('./td[15]/text()'):
                    #     item.append(tag.xpath('./td[15]/text()')[0])
                    # else:
                    #     item.append('')
                    items.append(item)
                except IndexError:
                    # an empty cell makes xpath return [], so [0] fails; skip that row
                    continue
        return items

    # Post-process the scraped data
    def pipelines(self, items):
        file_name = 'basketball.csv'
        with open(file_name, 'a', errors='ignore', newline='', encoding='utf-8') as f:
            f_csv = csv.writer(f)
            f_csv.writerows(items)
        print('Finished writing')


if __name__ == '__main__':
    url = u'https://www.sports-/cbb/coaches/'
    GTI = GetTiebaInfo(url)
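One thing to keep in mind about pipelines: the file is opened in 'a' (append) mode, so running the script a second time adds every row to basketball.csv again. The sketch below shows the alternative of opening the file in 'w' mode, so that each run starts from an empty file, and then reads the file back to check the result. write_rows, read_rows, the temporary path and the sample rows are assumptions for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Overwrite path with rows; newline='' prevents blank lines on Windows."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def read_rows(path):
    """Read the CSV file back as a list of rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


if __name__ == "__main__":
    rows = [["1", "Coach A", "10"], ["2", "Coach B", "7"]]  # made-up sample rows
    path = os.path.join(tempfile.mkdtemp(), "basketball.csv")
    write_rows(path, rows)
    write_rows(path, rows)  # a second run overwrites instead of duplicating
    print(read_rows(path) == rows)
```

Append mode remains the right choice if the file is meant to accumulate rows across several runs.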
