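[python] Installing libxml2 on Windows and Using XPath in Python - 微慑信息网 - VulSee.com

To use XPath for extracting data (such as titles and body text) from web pages fetched by a crawler, I spent a day getting familiar with Python. Today I tried installing the libxml2 module on Windows, and this post is a brief record of what I learned along the way.

When installing an extension module, Python can use a helper toolkit (Setuptools) to install new Python packages and to manage packages that are already installed. Installers for different platforms are available at http://pypi.python.org/pypi/setuptools; the tooling mainly consists of Python Eggs and Easy Install. From what I found online, Easy Install appears to be the more commonly used of the two, and http://peak.telecommunity.com/DevCenter/EasyInstall introduces it as follows: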

Easy Install is a python module (easy_install) bundled with setuptools that lets you automatically download, build, install, and manage Python packages.  

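In other words, Easy Install is a Python module that makes it easy to install additional Python modules.

Let's go through preparation, installation, and configuration step by step.

Preparation

The required packages and their download locations are:

Python 2.6 (the official Python site did not seem to open for me, and I no longer remember where I downloaded it from; search for it online)
libxml2-python-2.7.7.win32-py2.7.exe (http://xmlsoft.org/sources/win32/python/libxml2-python-2.7.7.win32-py2.7.exe, http://xmlsoft.org/sources/win32/python/). Note that the py2.7 suffix in this file name indicates a build for Python 2.7; pick the build that matches your Python version from the directory listing.
setuptools-0.6c11.win32-py2.6.exe (http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11.win32-py2.6.exe#md5=1509752c3c2e64b5d0f9589aafe053dc, http://pypi.python.org/pypi/setuptools#downloads)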

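Installation

Step 1: Install Python

On Windows, simply run the installer's .exe file; I won't go into detail here.

Step 2: Install the Easy Install tool

With Python already installed, run the setuptools-0.6c11.win32-py2.6.exe prepared above. It locates the Python installation directory automatically and installs the tools into the corresponding directory. For example, my scripts directory is E:\Program Files\Python26\Scripts. To verify: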

E:\>cd E:\Program Files\Python26\Scripts
E:\Program Files\Python26\Scripts>easy_install --help

Global options:
  --verbose (-v)  run verbosely (default)
  --quiet (-q)    run quietly (turns verbosity off)
  --dry-run (-n)  don't actually do anything
  --help (-h)     show detailed help message

Options for 'easy_install' command:
  --prefix                       installation prefix
  --zip-ok (-z)                  install package as a zipfile
  --multi-version (-m)           make apps have to require() a version
  --upgrade (-U)                 force upgrade (searches PyPI for latest
                                 versions)
  --install-dir (-d)             install package to DIR
  --script-dir (-s)              install scripts to DIR
  --exclude-scripts (-x)         Don't install scripts
  --always-copy (-a)             Copy all needed packages to install dir
  --index-url (-i)               base URL of Python Package Index
  --find-links (-f)              additional URL(s) to search for packages
  --delete-conflicting (-D)      no longer needed; don't use this
  --ignore-conflicts-at-my-risk  no longer needed; don't use this
  --build-directory (-b)         download/extract/build in DIR; keep the
                                 results
  --optimize (-O)                also compile with optimization: -O1 for
                                 "python -O", -O2 for "python -OO", and -O0 to
                                 disable [default: -O0]
  --record                       filename in which to record list of installed
                                 files
  --always-unzip (-Z)            don't install as a zipfile, no matter what
  --site-dirs (-S)               list of directories where .pth files work
  --editable (-e)                Install specified packages in editable form
  --no-deps (-N)                 don't install dependencies
  --allow-hosts (-H)             pattern(s) that hostnames must match
  --local-snapshots-ok (-l)      allow building eggs from local checkouts

usage: easy_install-script.py [options] requirement_or_url ...
   or: easy_install-script.py --help

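If you see the easy_install command options listed above, the installation succeeded.

Step 3: Install libxml2

Install libxml2 by running libxml2-python-2.7.7.win32-py2.7.exe. This only unpacks the libxml2 modules into the corresponding directory; the code in this post is written against lxml, so the lxml library also has to be installed through Easy Install. lxml is a Python library built on the C libraries libxml2 and libxslt, which makes parsing HTML and XML fast; see http://lxml.de/index.html for details. Installing lxml uses the Easy Install script. For example, with my scripts directory E:\Program Files\Python26\Scripts, run:

E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2

The installation output looks like this: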

E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
Searching for lxml==2.2.2
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 2.2.2
Downloading http://pypi.python.org/packages/2.6/l/lxml/lxml-2.2.2-py2.6-win32.egg#md5=dc73ae17e486037580371077efdc13e9
Processing lxml-2.2.2-py2.6-win32.egg
creating e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
Extracting lxml-2.2.2-py2.6-win32.egg to e:\program files\python26\lib\site-packages
Adding lxml 2.2.2 to easy-install.pth file

Installed e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg

Processing dependencies for lxml==2.2.2
Finished processing dependencies for lxml==2.2.2
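
A quick way to confirm that the interpreter can actually import the freshly installed package is to print its version information. The sketch below is not part of the original walkthrough; the helper name describe_lxml is made up for illustration, and it assumes the script is run with the same Python installation that easy_install targeted.

```python
from lxml import etree


def describe_lxml():
    """Return the lxml version and the libxml2 version in use, as dotted strings."""
    # Both attributes are tuples of integers exposed by lxml.etree.
    lxml_version = ".".join(str(part) for part in etree.LXML_VERSION)
    libxml_version = ".".join(str(part) for part in etree.LIBXML_VERSION)
    return lxml_version, libxml_version


lxml_version, libxml_version = describe_lxml()
print("lxml %s (libxml2 %s)" % (lxml_version, libxml_version))
```

If the import fails with an ImportError, the package most likely ended up in a different Python installation than the one running the script.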

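Extracting with XPath

Now let's use XPath to extract data from a web page. I used a Python IDE, EasyEclipse for Python (Version: 1.3.1), which can create a Pydev Project directly; consult its documentation for details on usage.

The following Python code verifies that XPath can be used to extract specific data from a web page: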

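import codecs
from lxml import etree

def readFile(file, decoding):
    # Read the whole file and decode it to a unicode string.
    # Errors (missing file, wrong encoding) are left to propagate
    # instead of being silently swallowed.
    with codecs.open(file, 'r', decoding) as f:
        return f.read()

def extract(file, decoding, xpath):
    # Parse the HTML (the lxml HTML parser tolerates malformed markup)
    # and evaluate the XPath expression against the resulting tree.
    html = readFile(file, decoding)
    tree = etree.HTML(html)
    return tree.xpath(xpath)

if __name__ == '__main__':
    sections = extract('peak.txt', 'utf-8', "//h3//a[@class='toc-backref']")
    for title in sections:
        print(title.text)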

First, download the source of the page http://peak.telecommunity.com/DevCenter/EasyInstall and save it to the file peak.txt, encoded as UTF-8.
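
The download can also be scripted rather than done by hand. The following is only a sketch: save_page is a hypothetical helper, it writes the bytes exactly as the server sends them (so the file is UTF-8 only if the page is served as UTF-8), and the page may no longer be available at that address.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2, as used in this post


def save_page(url, path):
    """Download url, write the raw bytes to path, and return the byte count."""
    response = urlopen(url)
    try:
        data = response.read()
    finally:
        response.close()
    with open(path, "wb") as out:
        out.write(data)
    return len(data)


# Example call, using the URL and file name from this post:
# save_page("http://peak.telecommunity.com/DevCenter/EasyInstall", "peak.txt")
```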

Then the Python script reads the file, uses XPath to extract the title of each section on the page, and prints the titles to the console. The output is:

Troubleshooting  
Windows Notes  
Multiple Python Versions  
Restricting Downloads with   
Installing on Un-networked Machines  
Packaging Others’ Projects As Eggs  
Creating your own Package Index  
Password-Protected Sites  
Controlling Build Options  
Editing and Viewing Source Packages  
Dealing with Installation Conflicts  
Compressed Installation  
Administrator Installation  
Mac OS X "User" Installation  
Creating a "Virtual" Python  
"Traditional"   
Backward Compatibility  
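
Two of the titles above end abruptly. A likely cause is that an element's text attribute in lxml holds only the text before its first child element, so anything inside a nested tag is left out. The sketch below shows the difference on a made-up fragment; the markup, the --some-option placeholder, and the helper name link_titles are assumptions for illustration, not taken from the real page.

```python
from lxml import etree

# Hypothetical fragment: the first link contains a nested <tt> element.
SAMPLE = (
    "<html><body>"
    "<h3><a class='toc-backref' href='#id1'>Restricting Downloads with "
    "<tt>--some-option</tt></a></h3>"
    "<h3><a class='toc-backref' href='#id2'>Windows Notes</a></h3>"
    "</body></html>"
)


def link_titles(html, xpath="//h3//a[@class='toc-backref']"):
    """Return (text_only, full_text) pairs for each element matched by xpath."""
    tree = etree.HTML(html)
    pairs = []
    for node in tree.xpath(xpath):
        text_only = node.text or ""           # text before the first child element
        full_text = "".join(node.itertext())  # text of the element and its descendants
        pairs.append((text_only, full_text))
    return pairs


for text_only, full_text in link_titles(SAMPLE):
    print("%r -> %r" % (text_only, full_text))
```

Joining itertext() keeps the nested text, at the cost of also pulling in text from any other child elements.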

If you are familiar enough with XPath, then with the help of libxml2 you can extract any content you want from a web page.
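
As an illustration of the title and body-text extraction mentioned at the start of this post, here is a minimal sketch on an inline page. The HTML, the div ids, and the helper name extract_article are invented for the example; a real page needs XPath expressions that match its own structure.

```python
from lxml import etree

# Hypothetical page standing in for a crawled document.
PAGE = """
<html>
  <head><title>Sample article</title></head>
  <body>
    <div id="content">
      <p>First paragraph.</p>
      <p>Second <b>paragraph</b>.</p>
    </div>
    <div id="footer"><p>Site footer</p></div>
  </body>
</html>
"""


def extract_article(html):
    """Return the page title and the paragraphs inside the div with id 'content'."""
    tree = etree.HTML(html)
    # The XPath string() function flattens a node to its concatenated text.
    title = tree.xpath("string(//title)")
    paragraphs = [
        p.xpath("string(.)").strip()
        for p in tree.xpath("//div[@id='content']//p")
    ]
    return title, paragraphs


title, paragraphs = extract_article(PAGE)
print(title)
for paragraph in paragraphs:
    print(paragraph)
```

Restricting the paragraph query to one container is what keeps the footer text out of the result.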
