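Recently I had an interface that needs to fetch the content of a URL and then extract keywords from it. Since every web page is formatted differently, reading the content the traditional way means writing extraction rules in advance for each layout.
Today I asked fun, one of the experts in our group chat, about it.
His answer: he currently uses Arc90's readability.js and ReaderArticleFinderJS from Apple Safari's Reader mode.
As it happens, Python has a readability port too,
so I just used that directly.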
pip install readability-lxml
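# coding: utf-8
import requests
from readability import Document

# Fetch the page; the timeout avoids hanging forever, and HTTP errors raise early
response = requests.get('https://vulsee.com/archives/vulsee_2022/0107_16048.html', timeout=10)
response.raise_for_status()

# Let readability work out which part of the HTML is the actual article
doc = Document(response.text)

title = doc.title()      # page title
print(title)
content = doc.summary()  # main content, returned as cleaned-up HTML
print(content)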
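
Since doc.summary() returns HTML and the goal was keywords, here is a rough sketch of the step that comes next. It is not part of readability: html_to_text and top_keywords are hypothetical helpers I made up for illustration, they use only the standard library, and the "keywords" are nothing more than the most frequent words.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Hypothetical helper: strip tags from HTML such as the doc.summary() string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace into single spaces
    return ' '.join(' '.join(parser.parts).split())


def top_keywords(text, n=5, min_len=3):
    """Hypothetical helper: the n most frequent words with at least min_len characters."""
    words = re.findall(r'\w+', text.lower())
    counts = Counter(w for w in words if len(w) >= min_len)
    return [word for word, _ in counts.most_common(n)]
```

With the doc object from the script above, this would be called as top_keywords(html_to_text(doc.summary())). Plain word counting like this only makes sense for space-separated languages and does no stop-word filtering; for Chinese pages a proper word segmenter would have to replace the regex.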