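I wanted to use XPath to extract data (titles, body text, and so on) from pages fetched by a crawler, so I spent a day getting familiar with Python. Today I tried installing the libxml2 module on Windows, and this post is a short record of what I learned.
When installing an extension module, Python can use a helper toolkit, Setuptools, to install new Python packages and to manage the ones already installed. Installers for different platforms are available at http://pypi.python.org/pypi/setuptools; the toolkit mainly consists of Python Eggs and Easy Install. From what I found online, Easy Install seems to be the more commonly used of the two, and http://peak.telecommunity.com/DevCenter/EasyInstall introduces it as follows: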
- Easy Install is a python module (easy_install) bundled with setuptools that lets you automatically download, build, install, and manage Python packages.
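In other words, Easy Install is a Python module that makes it easy to install extension Python modules.
The sections below go through preparation, installation, and configuration step by step.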
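Preparation
The required packages and their download locations are:
Python 2.6 (the official Python site would not open for me and I no longer remember where I downloaded it, so search online for a copy)
libxml2-python-2.7.7.win32-py2.7.exe (http://xmlsoft.org/sources/win32/python/libxml2-python-2.7.7.win32-py2.7.exe, listed at http://xmlsoft.org/sources/win32/python/). Note that this file name indicates a build for Python 2.7; with Python 2.6, a py2.6 build from the same directory, if one is available, is presumably the matching choice.
setuptools-0.6c11.win32-py2.6.exe (http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11.win32-py2.6.exe#md5=1509752c3c2e64b5d0f9589aafe053dc, listed at http://pypi.python.org/pypi/setuptools#downloads)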
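The setuptools link above ends in an `#md5=...` fragment, which appears to be the checksum published for that file. A minimal sketch of checking a downloaded installer against such a value with the standard `hashlib` module could look like this. The function name `md5_of_file` and the throwaway demo file are hypothetical and exist only for illustration.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so a large installer need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file (hypothetical). For the real installer, pass its
# path and compare the result with the value after "#md5=" in the link.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"hello")
demo_digest = md5_of_file(demo_path)
print(demo_digest)
```

If the digest differs from the published value, the download is probably incomplete or corrupted and should be fetched again.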
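Installation
Step 1: Install Python
On Windows, running the installer's .exe file is all it takes, so I will not go into detail.
Step 2: Install the Easy Install tool
Python must already be installed. Then run the setuptools-0.6c11.win32-py2.6.exe prepared above; it finds the Python installation directory automatically and installs the toolkit into the corresponding directory. My scripts directory, for example, is E:\Program Files\Python26\Scripts. To verify: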
- E:\>cd E:\Program Files\Python26\Scripts
- E:\Program Files\Python26\Scripts>easy_install --help
- Global options:
- --verbose (-v)  run verbosely (default)
- --quiet (-q)  run quietly (turns verbosity off)
- --dry-run (-n)  don't actually do anything
- --help (-h)  show detailed help message
- Options for 'easy_install' command:
- --prefix  installation prefix
- --zip-ok (-z)  install package as a zipfile
- --multi-version (-m)  make apps have to require() a version
- --upgrade (-U)  force upgrade (searches PyPI for latest versions)
- --install-dir (-d)  install package to DIR
- --script-dir (-s)  install scripts to DIR
- --exclude-scripts (-x)  Don't install scripts
- --always-copy (-a)  Copy all needed packages to install dir
- --index-url (-i)  base URL of Python Package Index
- --find-links (-f)  additional URL(s) to search for packages
- --delete-conflicting (-D)  no longer needed; don't use this
- --ignore-conflicts-at-my-risk  no longer needed; don't use this
- --build-directory (-b)  download/extract/build in DIR; keep the results
- --optimize (-O)  also compile with optimization: -O1 for "python -O", -O2 for "python -OO", and -O0 to disable [default: -O0]
- --record  filename in which to record list of installed files
- --always-unzip (-Z)  don't install as a zipfile, no matter what
- --site-dirs (-S)  list of directories where .pth files work
- --editable (-e)  Install specified packages in editable form
- --no-deps (-N)  don't install dependencies
- --allow-hosts (-H)  pattern(s) that hostnames must match
- --local-snapshots-ok (-l)  allow building eggs from local checkouts
- usage: easy_install-script.py [options] requirement_or_url ...
- or: easy_install-script.py --help
If the easy_install options listed above appear, the installation succeeded.
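The same check can be approximated from inside Python. The sketch below assumes the usual Windows layout, where scripts are placed in a `Scripts` folder under the interpreter's prefix (on other platforms the folder is typically `bin`). The helper names `scripts_dir` and `find_easy_install` are hypothetical.

```python
import os
import sys


def scripts_dir(prefix=None):
    """Return the Scripts directory expected under a Windows Python prefix."""
    return os.path.join(prefix or sys.prefix, "Scripts")


def find_easy_install(directory):
    """Return the easy_install files found in directory (empty list if none)."""
    if not os.path.isdir(directory):
        return []
    return sorted(name for name in os.listdir(directory)
                  if name.lower().startswith("easy_install"))


target = scripts_dir()
print(target)
# An empty list suggests setuptools did not install into this interpreter.
print(find_easy_install(target))
```

Running it with the interpreter that setuptools was installed into should list files such as the easy_install launcher and its script, if the installer placed them there.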
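Step 3: Install libxml2
libxml2 is installed by running libxml2-python-2.7.7.win32-py2.7.exe. In my case this only unpacked the module into the corresponding directory, and I still could not use it from Python code, so I also installed the lxml library through Easy Install. lxml is a Python binding built on the C libraries libxml2 and libxslt, which makes HTML and XML parsing fast; see http://lxml.de/index.html for details. Installing lxml uses the Easy Install script, which on my machine is in E:\Program Files\Python26\Scripts. Run:
E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
The installation output looks like this: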
- E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
- Searching for lxml==2.2.2
- Reading http://pypi.python.org/simple/lxml/
- Reading http://codespeak.net/lxml
- Best match: lxml 2.2.2
- Downloading http://pypi.python.org/packages/2.6/l/lxml/lxml-2.2.2-py2.6-win32.egg#md5=dc73ae17e486037580371077efdc13e9
- Processing lxml-2.2.2-py2.6-win32.egg
- creating e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
- Extracting lxml-2.2.2-py2.6-win32.egg to e:\program files\python26\lib\site-packages
- Adding lxml 2.2.2 to easy-install.pth file
- Installed e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
- Processing dependencies for lxml==2.2.2
- Finished processing dependencies for lxml==2.2.2
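To confirm that the interpreter can actually import the freshly installed package, a small check along the following lines may help. It is only a sketch: the function name `lxml_versions` is hypothetical, while `LXML_VERSION` and `LIBXML_VERSION` are version tuples exposed by `lxml.etree`.

```python
def lxml_versions():
    """Return (lxml version, libxml2 version) tuples, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return etree.LXML_VERSION, etree.LIBXML_VERSION


versions = lxml_versions()
if versions is None:
    print("lxml is not importable from this interpreter")
else:
    # Each entry is a tuple of integers describing one version number.
    print("lxml %s, libxml2 %s" % versions)
```

If the first message appears even though easy_install reported success, the script was probably run with a different Python installation than the one lxml was installed into.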
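Extracting with XPath
Now let us use XPath to extract data from a web page. I used a Python IDE, EasyEclipse for Python (Version: 1.3.1), which can create a Pydev Project directly; see its documentation for details on how to use it.
The following Python code checks that XPath can be used to extract targeted data from a web page: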
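- import codecs
- from lxml import etree
- def readFile(file, decoding):
-     # Return the decoded file contents, or '' if reading or decoding fails.
-     try:
-         with codecs.open(file, 'r', decoding) as f:
-             return f.read()
-     except (IOError, UnicodeError):
-         return ''
- def extract(file, decoding, xpath):
-     # Parse the HTML text and return the nodes matched by the XPath expression.
-     html = readFile(file, decoding)
-     tree = etree.HTML(html)
-     return tree.xpath(xpath)
- if __name__ == '__main__':
-     sections = extract('peak.txt', 'utf-8', "//h3//a[@class='toc-backref']")
-     for title in sections:
-         # .text holds only the text before the element's first child tag.
-         print(title.text)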
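First, download the source of the page http://peak.telecommunity.com/DevCenter/EasyInstall and save it to the file peak.txt, encoded as UTF-8.
Then the script reads the file, uses XPath to extract the heading of each section on the page, and prints the headings to the console. The output is: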
- Troubleshooting
- Windows Notes
- Multiple Python Versions
- Restricting Downloads with
- Installing on Un-networked Machines
- Packaging Others’ Projects As Eggs
- Creating your own Package Index
- Password-Protected Sites
- Controlling Build Options
- Editing and Viewing Source Packages
- Dealing with Installation Conflicts
- Compressed Installation
- Administrator Installation
- Mac OS X "User" Installation
- Creating a "Virtual" Python
- "Traditional"
- Backward Compatibility
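Two entries in the output above, "Restricting Downloads with" and "Traditional", look cut off. A likely explanation is that those headings contain child elements, and an element's `.text` attribute holds only the text before its first child. The sketch below illustrates the effect on made-up markup modeled on the headings above (the sample HTML is an assumption, not the real page source). It uses `fromstring`, `findall`, and `itertext`, which both lxml and the standard library's ElementTree provide, so it also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this article
except ImportError:
    # Fallback so the sketch still runs without lxml (Python 2.7 or later).
    import xml.etree.ElementTree as etree

# Hypothetical markup: the second heading wraps part of its title in <tt>.
SAMPLE = (
    "<div>"
    "<h3><a class='toc-backref' href='#id1'>Windows Notes</a></h3>"
    "<h3><a class='toc-backref' href='#id2'>Restricting Downloads with "
    "<tt>--allow-hosts</tt></a></h3>"
    "</div>"
)


def full_text(element):
    """Join every text node inside element, including text in child tags."""
    return "".join(element.itertext())


root = etree.fromstring(SAMPLE)
links = root.findall(".//h3/a[@class='toc-backref']")
first_only = [a.text for a in links]        # text before the first child only
complete = [full_text(a) for a in links]    # text of the whole subtree
print(first_only)
print(complete)
```

With lxml, the elements returned by `tree.xpath(...)` behave the same way, so replacing `title.text` with `"".join(title.itertext())` in the script should print the complete headings.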
If you know XPath well enough, then with the help of libxml2 you can extract whatever content you want from a web page.