ZOL (Zhongguancun Online) search pages: finding the hosts worth crawling

Time: 2021-11-03 04:19:15

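The task is to collect the URLs of posts from a forum and, based on statistics over the collected URLs, to provide parsing support for the hosts that occur most often. This post records the statistics part only.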

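Taking the search results pages of ZOL (Zhongguancun Online) as an example, the first 5 pages of results for Huawei and for Xiaomi are fetched and counted.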

// ZolGetNewsInfo.js
var utils = require('utils');
var casper = require('casper').create({
    viewportSize: {width: 800, height: 600},
    pageSettings: {
        loadImages: false,
        loadPlugins: false,
        userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'
    },
    verbose: true,
    logLevel: 'info',
    stepTimeout: 300000,
    onStepTimeout: function(timeout, stepNum) {
        // On a step timeout, report it and dump what has been collected so far
        casper.echo('step' + stepNum + ': time out!');
        utils.dump(urls);
    }
});

var aList = [];  // links found on the current results page
var urls = [];   // links collected across all pages
var i = 1;       // current page number

// Collect the href of every result link (an <a> inside an <h3>)
function getUrls() {
    aList = casper.getElementsAttribute('h3 a', 'href');
    urls = urls.concat(aList);
}

casper.start().repeat(5, function() {
    // The base search URL is the first command-line argument
    var base = casper.cli.get(0);
    var url = base + '&page=' + i;
    i++;
    casper.log('current url ' + url, 'info');
    casper.open(url);
    // Wait one second before reading the links
    casper.wait(1000, function() {
        getUrls();
    });
    casper.then(function() {
        utils.dump(aList);
    });
});

casper.run();

Run the program and redirect its output into a txt file.

$ casperjs ZolGetNewsInfo.js /s/article_more.php?kword=xiaomi > zol.txt
$ casperjs ZolGetNewsInfo.js /s/article_more.php?kword=huawei >> zol.txt

Next, process the results: clean up the format and count the URLs.

# getUrl.py
def get_url_from_casper_file(file_path):
    """Read the CasperJS dump and return the URL found on each URL line."""
    url_list = []
    with open(file_path) as f:
        for each_line in f:
            # utils.dump prints each URL as an indented, quoted string,
            # followed by a comma unless it is the last item of its array
            line = each_line.strip()
            if line.startswith('"http'):
                url_list.append(line.rstrip(',').strip('"'))
    return url_list


def deduplicate_sort_urls(url_list):
    """Drop duplicates and sort, so URLs with the same prefix are adjacent."""
    return sorted(set(url_list))


def get_host_prefix(url):
    """Use the first 20 characters of the URL as a rough stand-in for its host."""
    return url[:20]


urls = get_url_from_casper_file('../data/zol.txt')
clean_urls = deduplicate_sort_urls(urls)
print(len(clean_urls))

# Walk the sorted list and count runs of URLs that share the same prefix
count = 0
host = ''
for item in clean_urls:
    if count == 0:
        host = get_host_prefix(item)
        count += 1
    elif host == get_host_prefix(item):
        count += 1
    else:
        print('host: ' + host + ', number of urls: ' + str(count))
        count = 1
        host = get_host_prefix(item)
print('host: ' + host + ', number of urls: ' + str(count))
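
The 20-character prefix returned by get_host_prefix is only a rough stand-in for the host, so long host names get cut off and short ones pick up part of the path. As a minimal sketch of an alternative (not the approach used in this post), the Python standard library can extract the host name directly. The function name count_hosts and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(urls):
    """Count how many distinct URLs fall under each host name."""
    hosts = []
    for url in set(urls):  # de-duplicate first, as getUrl.py does
        host = urlsplit(url).hostname  # None when the URL has no network location
        if host:
            hosts.append(host)
    return Counter(hosts)


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only
    sample = [
        "http://a.example.com/1.html",
        "http://a.example.com/2.html",
        "http://a.example.com/2.html",
        "http://b.example.com/x.html",
    ]
    for host, n in count_hosts(sample).most_common():
        print("host: " + host + ", number of urls: " + str(n))
```

Counting on the parsed host name avoids depending on the sort order of the list, at the cost of treating every subdomain as its own host.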

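The ZOL search results page shows 10 results per page. With the 2 search URLs above and 5 pages each, that gives 10*5*2=100 result URLs, of which 6 were duplicates and were removed, leaving 94.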
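
This arithmetic can be checked against the collected list itself. The following is a minimal sketch, assuming the URLs are already in a Python list; summarize_duplicates is a hypothetical helper name, not part of getUrl.py.

```python
def summarize_duplicates(urls):
    """Return (total, unique, duplicates) for a list of URLs."""
    total = len(urls)
    unique = len(set(urls))  # a set keeps one copy of each URL
    return total, unique, total - unique


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only
    sample = ["http://a.example.com/1", "http://a.example.com/1", "http://b.example.com/2"]
    total, unique, duplicates = summarize_duplicates(sample)
    print("total: %d, unique: %d, duplicates: %d" % (total, unique, duplicates))
```

For the run described here, the three values would be 100, 94 and 6.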

Running python getUrl.py gives the following result:

94
host: , number of urls: 2
host: http://dealer.zol.co, number of urls: 1
host: , number of urls: 13
host: http://mobile.zol.co, number of urls: 54
host: .c, number of urls: 8
host: ., number of urls: 1
host: .c, number of urls: 7
host: http://server.zol.co, number of urls: 1
host: http://smartwear.zol, number of urls: 1
host: , number of urls: 4
host: /, number of urls: 2

Process finished with exit code 0

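Hosts that appear only once or twice will not get dedicated support. They are rare in this sample, and since the sample is random, it is reasonable to assume they will be just as rare in the future. Skipping these URLs should not affect the crawl rate much.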
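
To put a number on this, the share of URLs that stays covered can be computed from the per-host counts. The sketch below is only an illustration: the function name coverage and the min_count parameter are assumptions, while the counts are copied from the output lines of the run in this post.

```python
def coverage(counts, min_count):
    """Return (kept, total): URLs under hosts seen at least min_count times, and all URLs."""
    total = sum(counts)
    kept = sum(c for c in counts if c >= min_count)
    return kept, total


if __name__ == "__main__":
    # Per-host URL counts from the getUrl.py output, in output order
    counts = [2, 1, 13, 54, 8, 1, 7, 1, 1, 4, 2]
    # Hosts seen only once or twice are skipped, so the threshold is 3
    kept, total = coverage(counts, min_count=3)
    print("covered %d of %d urls (%.1f%%)" % (kept, total, 100.0 * kept / total))
```

With these counts, supporting only the hosts seen 3 times or more still covers 86 of the 94 URLs, about 91.5%.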
