Python Crawler: Scraping Jokes from Qiushibaike (糗事百科)

Time: 2019-08-13 17:30:06

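Today I wrote a Python crawler that automatically scrapes jokes from Qiushibaike. Because Qiushibaike does not require a login, scraping it is fairly simple. The program prints one joke each time you press Enter. The code is based on /990.html, but that blogger's code seemed to have some problems, so I modified it until it ran successfully. The code (Python 2) is as follows: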

# -*- coding:utf-8 -*-
__author__ = 'Jz'
import urllib2
import re

# Qiushibaike spider class
class QSBK:
    # Initialization
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'
        self.headers = {'User-Agent': self.user_agent}
        # Each element of joke holds the jokes from one page
        self.joke = []
        # Flag that controls whether the program keeps running
        self.enable = False

    def getPage(self, pageIndex):
        try:
            URL = '/hot/page/' + str(pageIndex)  # the site host is missing from the source text
            request = urllib2.Request(url = URL, headers = self.headers)
            response = urllib2.urlopen(request)
            pageContent = response.read().decode('utf-8')
            return pageContent
        except urllib2.URLError as e:
            if hasattr(e, 'reason'):
                print '段子抓取失败，失败原因：', e.reason
                return None

    def getJokeList(self, pageIndex):
        pageContent = self.getPage(pageIndex)
        if not pageContent:
            print '段子获取失败...'
            return None
        # The third group captures the markup used to tell whether a joke has an image attached
        pattern = re.compile(r'<div.*?class="author">.*?<a.*?>.*?<img.*?/>\n(.*?)\n</a>.*?</div>.*?<div class="content">\n\n(.*?)\n.*?</div>' +
            r'(.*?)class="stats">.*?<span.*?class="stats-vote"><i.*?class="number">(.*?)</i>'
            , re.S)
        jokes = re.findall(pattern, pageContent)
        pageJokes = []
        # Filter out jokes that come with an image
        for joke in jokes:
            hasImg = re.search('img', joke[2])
            # joke[0] is the poster, joke[1] the joke text, joke[3] the number of upvotes
            if not hasImg:
                pageJokes.append([joke[0].strip(), joke[1].strip(), joke[3].strip()])
        return pageJokes

    def loadPage(self):
        if self.enable:
            # If fewer than two unread pages are buffered, load one more page
            if len(self.joke) < 2:
                pageJokes = self.getJokeList(self.pageIndex)
                if pageJokes:
                    self.joke.append(pageJokes)
                    self.pageIndex += 1

    # Print one joke each time Enter is pressed
    def getOneJoke(self, pageJokes, page):
        jokes = pageJokes
        for joke in jokes:
            userInput = raw_input('请输入回车键或Q/q: ')
            self.loadPage()
            if userInput == 'Q' or userInput == 'q':
                self.enable = False
                print '退出爬虫...'
                return
            print u'段子内容:%s\n第%d页\t发布人:%s\t赞:%s' % (joke[1], page, joke[0], joke[2])

    def start(self):
        print '正在从糗事百科抓取段子，按回车键查看新段子，按Q/q退出...'
        self.enable = True
        self.loadPage()
        page = 0
        while self.enable:
            if len(self.joke) > 0:
                pageJokes = self.joke[0]
                page += 1
                # Remove the page of jokes that is about to be read from the buffer
                del self.joke[0]
                self.getOneJoke(pageJokes, page)

spider = QSBK()
spider.start()
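The two core steps of the class, sending a browser-like User-Agent and using an extra capture group to drop posts that carry an image, can be tried offline with a small sketch. This is written for Python 3 (urllib.request instead of urllib2) and is only an illustration under assumptions: BASE_URL is a placeholder host, the names build_request, extract_text_jokes, JOKE_PATTERN and SAMPLE_HTML are hypothetical, and the pattern is a simplified stand-in for the one in the post, matched against made-up markup rather than the real page.

```python
import re
from urllib.request import Request

# Placeholder host (assumption): the real site URL is not given in the post.
BASE_URL = 'http://example.com/hot/page/'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'


def build_request(page_index):
    """Build (but do not send) a request that carries a browser-like User-Agent."""
    return Request(BASE_URL + str(page_index), headers={'User-Agent': USER_AGENT})


# Simplified pattern (assumption). Groups: author, content, everything between
# the content and the stats block (only inspected for an <img>), vote count.
JOKE_PATTERN = re.compile(
    r'<div class="author">(.*?)</div>.*?'
    r'<div class="content">(.*?)</div>'
    r'(.*?)<div class="stats">.*?<i class="number">(.*?)</i>',
    re.S,
)


def extract_text_jokes(html):
    """Return [author, content, votes] for every post that has no image."""
    jokes = []
    for author, content, extra, votes in JOKE_PATTERN.findall(html):
        if 'img' in extra:  # the third group holds the thumbnail markup, if any
            continue
        jokes.append([author.strip(), content.strip(), votes.strip()])
    return jokes


# Made-up sample markup, only loosely modeled on the structure the post matches.
SAMPLE_HTML = '''
<div class="author"> alice </div>
<div class="content"> text-only joke </div>
<div class="stats"><i class="number">12</i></div>
<div class="author"> bob </div>
<div class="content"> joke with picture </div>
<div class="thumb"><img src="x.jpg"/></div>
<div class="stats"><i class="number">34</i></div>
'''

if __name__ == '__main__':
    print(build_request(1).get_header('User-agent'))
    print(extract_text_jokes(SAMPLE_HTML))
```

Capturing the span between the content and the stats block as its own group keeps the image check separate from the main match, so the same pattern handles posts with and without a thumbnail.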

The code is commented. A few points deserve attention:

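1. A User-Agent header must be added so that the request looks like it comes from a browser; otherwise the page content cannot be fetched.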

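2. When writing the regular expression, the content between the joke text and the stats block must be captured as its own group (the third group in the pattern) so that it can be checked for an attached image.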

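3. When formatting a joke for output in getOneJoke (the print statement on line 68), the format string must be prefixed with u; otherwise the following error is raised: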

Traceback (most recent call last):
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 84, in <module>
    spider.start()
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 81, in start
    self.getOneJoke(pageJokes, page)
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 68, in getOneJoke
    print '段子内容:%s\n第%d页\t发布人:%s\t赞:%s' % (joke[1], page, joke[0], joke[2])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 3: ordinal not in range(128)

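This happens because the page content was decoded from UTF-8, so joke[0] and the other fields are unicode objects, whereas an unprefixed string literal in Python 2 is a byte string. When the two are mixed in a formatting operation, Python 2 implicitly decodes the byte string with its default codec, ASCII, and that fails on the Chinese characters. Prefixing the format string with u makes it a unicode object as well, so no implicit decoding takes place.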
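The failure can be imitated in Python 3, where the decoding step has to be written out explicitly. The sketch below is an illustration, not the original program: template_bytes stands in for a Python 2 byte-string literal, joke_text for a unicode field such as joke[1], and the two helper functions are hypothetical names.

```python
# Stand-in for an unprefixed Python 2 literal: UTF-8 encoded bytes.
template_bytes = '段子内容:%s'.encode('utf-8')
# Stand-in for a unicode object such as joke[1].
joke_text = '一条段子'


def format_like_python2(template, text):
    """Imitate Python 2's implicit step: decode the byte template as ASCII, then format."""
    return template.decode('ascii') % text  # raises UnicodeDecodeError on non-ASCII bytes


def format_with_unicode_template(template, text):
    """Counterpart of the u'' prefix: the template is already proper text before formatting."""
    return template.decode('utf-8') % text


try:
    format_like_python2(template_bytes, joke_text)
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print('implicit ASCII decode failed:', err)

print(format_with_unicode_template(template_bytes, joke_text))
```

The first call fails with the same kind of UnicodeDecodeError as in the traceback, while the second formats the text normally, which is the effect the u prefix has in the Python 2 code.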
