Converting HTML Tables to CSV Files with Python

Date: 2022-05-06 11:09:52

Backend Development | Python Tutorials

python, HTML table, conversion, CSV file

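This article walks through a Python script that converts the tables in an HTML file into a CSV file. It is shared here for reference; the details follow.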

Usage: python html2csv.py *.html

The script is written for Python 2 and uses the HTMLParser module (renamed html.parser in Python 3).
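To make the callback style of this parser concrete, here is a minimal sketch written for Python 3, where the module is html.parser. The class name TagLogger and the sample markup are hypothetical and only serve the illustration. The parser calls handle_starttag, handle_endtag and handle_data as it walks the markup, which is the same mechanism the script relies on to detect rows and cells.

```python
# Python 3 sketch: the module is html.parser here (HTMLParser in Python 2).
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records the parser events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; tag names arrive lowercased.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for every closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for text between tags; whitespace-only text is skipped.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = TagLogger()
logger.feed("<table><tr><td>a</td><td>b</td></tr></table>")
logger.close()
for event in logger.events:
    print(event)
```

A table converter only needs to react to the tr and td events and to collect the text that arrives while a cell is open.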

#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
# Hello, this program is written in Python
# Note: this is Python 2 code (print statement, HTMLParser module).
programname = 'html2csv - version 2002-09-20'

import sys, getopt, os.path, glob, HTMLParser, re

try:
    import psyco ; psyco.jit()  # If present, use psyco to accelerate the program
except:
    pass

def usage(progname):
    ''' Display program usage. '''
    progname = os.path.split(progname)[1]
    if os.path.splitext(progname)[1] in ['.py', '.pyc']:
        progname = 'python ' + progname
    return '''%s
A coarse HTML tables to CSV (Comma-Separated Values) converter.

Syntax    : %s source.html

Arguments : source.html is the HTML file you want to convert to CSV.
            By default, the file will be converted to csv with the same
            name and the csv extension (source.html -> source.csv)
            You can use * and ?.

Examples  : %s mypage.html
          : %s *.html

This program is public domain.
Author : Sebastien SAUVAGE
''' % (programname, progname, progname, progname)

class html2csv(HTMLParser.HTMLParser):
    ''' A basic parser which converts HTML tables into CSV.
        Feed HTML with feed(). Get CSV with getCSV(). (See example below.)
        All tables in HTML will be converted to CSV (in the order they occur
        in the HTML file).
        You can process very large HTML files by feeding this class with chunks
        of html while getting chunks of CSV by calling getCSV().
        Should handle badly formatted html (missing or extraneous
        <tr>, <td>, </td>, </tr> tags...).
        This parser uses HTMLParser from the HTMLParser module,
        not HTMLParser from the htmllib module.
        Example: parser = html2csv()
                 parser.feed( open('mypage.html','rb').read() )
                 open('mytables.csv','w+b').write( parser.getCSV() )
        This class is public domain.
        Author: Sébastien SAUVAGE
        Versions:
           2002-09-19 : - First version
           2002-09-20 : - now uses HTMLParser.HTMLParser instead of htmllib.HTMLParser.
                        - now parses command-line.
        To do:
            - handle <PRE> tags
            - convert html entities (&name; and &#ref;) to Ascii.
    '''
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.CSV = ''      # The CSV data
        self.CSVrow = ''   # The current CSV row being constructed from HTML
        self.inTD = 0      # Used to track if we are inside or outside a <TD>...</TD> tag.
        self.inTR = 0      # Used to track if we are inside or outside a <TR>...</TR> tag.
        self.re_multiplespaces = re.compile(r'\s+')  # regular expression used to remove spaces in excess
        self.rowCount = 0  # CSV output line counter.

    def handle_starttag(self, tag, attrs):
        if   tag == 'tr': self.start_tr()
        elif tag == 'td': self.start_td()

    def handle_endtag(self, tag):
        if   tag == 'tr': self.end_tr()
        elif tag == 'td': self.end_td()

    def start_tr(self):
        if self.inTR: self.end_tr()  # <TR> implies </TR>
        self.inTR = 1

    def end_tr(self):
        if self.inTD: self.end_td()  # </TR> implies </TD>
        self.inTR = 0
        if len(self.CSVrow) > 0:
            self.CSV += self.CSVrow[:-1]  # drop the trailing comma
            self.CSVrow = ''
        self.CSV += '\n'
        self.rowCount += 1

    def start_td(self):
        if not self.inTR: self.start_tr()  # <TD> implies <TR>
        self.CSVrow += '"'
        self.inTD = 1

    def end_td(self):
        if self.inTD:
            self.CSVrow += '",'
            self.inTD = 0

    def handle_data(self, data):
        if self.inTD:
            # Normalize whitespace and double any quote characters for CSV.
            self.CSVrow += self.re_multiplespaces.sub(' ', data.replace('\t', ' ').replace('\n', '').replace('\r', '').replace('"', '""'))

    def getCSV(self, purge=False):
        ''' Get output CSV.
            If purge is true, getCSV() will return all remaining data,
            even if <td> or <tr> are not properly closed.
            (You would typically call getCSV with purge=True when you do not have
            any more HTML to feed and you suspect dirty HTML (unclosed tags). '''
        if purge and self.inTR: self.end_tr()  # This will also end_td and append last CSV row to output CSV.
        dataout = self.CSV[:]
        self.CSV = ''
        return dataout

if __name__ == "__main__":
    try:  # Put getopt in place for future usage.
        opts, args = getopt.getopt(sys.argv[1:], None)
    except getopt.GetoptError:
        print usage(sys.argv[0])  # print help information and exit:
        sys.exit(2)
    if len(args) == 0:
        print usage(sys.argv[0])  # print help information and exit:
        sys.exit(2)
    print programname
    html_files = glob.glob(args[0])
    for htmlfilename in html_files:
        outputfilename = os.path.splitext(htmlfilename)[0] + '.csv'
        parser = html2csv()
        print 'Reading %s, writing %s...' % (htmlfilename, outputfilename)
        try:
            htmlfile = open(htmlfilename, 'rb')
            csvfile = open(outputfilename, 'w+b')
            data = htmlfile.read(8192)
            while data:
                parser.feed(data)
                csvfile.write(parser.getCSV())
                sys.stdout.write('%d CSV rows written.\r' % parser.rowCount)
                data = htmlfile.read(8192)
            csvfile.write(parser.getCSV(True))
            csvfile.close()
            htmlfile.close()
        except:
            print 'Error converting %s        ' % htmlfilename
            try:    htmlfile.close()
            except: pass
            try:    csvfile.close()
            except: pass
    print 'All done.                      '
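For readers on Python 3, the same idea can be sketched with html.parser and the standard csv module. This is an illustrative rewrite under stated assumptions, not the original author's code: the names TableToCSV and html_to_csv are hypothetical, th cells are treated like td cells (the original script only handles td), entities are decoded through convert_charrefs, and quoting is delegated to csv.writer rather than done by hand.

```python
import csv
import io
from html.parser import HTMLParser


class TableToCSV(HTMLParser):
    """Hypothetical Python 3 sketch: collect <td>/<th> text for each <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # cells of the row being built, or None outside a row
        self._cell = None    # text fragments of the open cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()          # <tr> implies </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()         # <td> implies </td>
            if self._row is None:    # <td> implies <tr>
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "tr":
            self._end_row()
        elif tag in ("td", "th"):
            self._end_cell()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _end_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace and trim the ends of the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def close(self):
        super().close()
        self._end_row()              # flush a row left open by unclosed tags


def html_to_csv(html_text):
    """Return CSV text for every table row found in html_text."""
    parser = TableToCSV()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(parser.rows)
    return out.getvalue()


# The second row deliberately omits </td>, </tr> to mimic dirty HTML.
sample = '<table><tr><td>Name</td><td>Say "hi"</td></tr><tr><td>A &amp; B<td>2</table>'
print(html_to_csv(sample))
```

Letting csv.writer handle quoting avoids the manual quote doubling and trailing-comma trimming of the original, at the cost of holding all rows in memory instead of streaming them chunk by chunk.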
