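Which blog platform are Python tutorials written on? Python crawler tutorial for beginners: the secret of the recommended blog ranking on the cnblogs (博客园) home page...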

Time: 2022-12-29 21:23:23

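1. Preface

I registered on cnblogs (博客园) more than five years ago, but only recently started blogging here in earnest. (Only after joining did I find that everyone here is talented and pleasant to talk to; I really like it here...) Because my posts are all about software testing, though, they have never drawn much traffic. The recommended blog ranking on the home page made me curious about what these top bloggers write that is popular enough to get recommended. So I used Python to crawl these 100 recommended blogs, did a simple analysis of each blog's post categories, most-read ranking, most-commented ranking, and most-recommended ranking, and finally aggregated the statistics and generated word clouds. This also happens to make a good introductory Python crawler tutorial.

2. Environment setup

2.1 Operating system and browser version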

Windows 10

Chrome 62

2.2 Python version

Python 2.7

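2.3 Libraries used

1. requests: HTTP library

2. re: regular expressions

3. json: JSON data handling

4. BeautifulSoup: HTML data extraction

5. jieba: Chinese word segmentation

6. wordcloud: word cloud generation

7. concurrent.futures: asynchronous concurrency

The third-party modules can be installed with pip (re and json ship with Python), as follows: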

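pip install requests

pip install beautifulsoup4

pip install jieba

pip install wordcloud

pip install futures

The BeautifulSoup code in section 3 uses the 'lxml' parser, which additionally requires pip install lxml.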
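Before moving on, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch, not part of the original tutorial; the helper name check_modules is made up for illustration. Note that the import names differ from some pip package names.

```python
import importlib

# Import names, not pip package names: beautifulsoup4 is imported as bs4,
# and on Python 2 the futures package provides concurrent.futures
# (on Python 3 it is part of the standard library).
MODULES = ['requests', 're', 'json', 'bs4', 'jieba', 'wordcloud',
           'concurrent.futures']


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == '__main__':
    for name, ok in sorted(check_modules(MODULES).items()):
        print('{0}: {1}'.format(name, 'ok' if ok else 'missing'))
```

Any module reported as missing can then be installed with the matching pip command.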

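3. Writing the crawler

With the environment ready, we can start writing the crawler. Before writing any code, though, we first need to analyze the pages we want to crawl.

3.1 Page analysis

3.1.1 Recommended blog ranking on the cnblogs home page

1. Launch Chrome, press F12 to open the developer tools, and open the cnblogs home page: /

2. Click Network on the right and select the XHR type; clicking any request in the list below shows its detailed HTTP request information.

3. Select the Response tab on the right for each request in turn, check the responses, and pick out the endpoint we need. Here we find the UserStats endpoint, which returns the "recommended blog ranking" information we want.

4. Click Headers on the right to see the endpoint details. It is a simple HTTP GET endpoint that takes no parameters: /aggsite/UserStats

5. A simple request written with requests is therefore enough to fetch the home page's "recommended blog ranking" information: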

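#coding:utf-8

import requests

# The site host is missing from the URL in the source; prepend it before running.
r = requests.get('/aggsite/UserStats')
print(r.text)

The response is as follows:

Q&A expert ranking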

LauncherAstar幻天芒dudu爱编程的大叔邀月吴瑞祥丁学Gray Zhangeaglet

» More Q&A experts

Latest recommended blogs

雨夜朦胧, 枕边书, sparkdev, 悦光阴, Emrys5

» More recommended blogs

Recommended blog ranking

1. Artech, 2. 路过秋天, 3. 数据之巅, 4. 腾飞(Jesse), 5. tkbSimplest, 6. 圣殿骑士, 7. CareySon, 8. 三生石上(FineUI控件), 9. 葡萄城控件技术团队, 10. 一线码农, 11. Vamei, 12. 农码一生, 13. 张善友, 14. 小坦克, 15. ChokCoco, 16. Jimmy Zhang, 17. Edison Chou, 18. KenshinCui, 19. 滴答的雨, 20. , 21. 司徒正美, 22. 【艾伦】, 23. 请叫我头头哥, 24. Savorboard, 25. 桦仔, 26. 刘哇勇, 27. 匠心十年, 28. keepfool, 29. 左潇龙, 30. stoneniqiu, 31. 深蓝色右手, 32. mindwind, 33. 焰尾迭, 34. 道法自然, 35. netfocus, 36. 纯洁的微笑, 37. snandy, 38. Jeffcky, 39. JustRun, 40. , 41. wolfy, 42. EtherDream, 43. 王清培, 44. 潇湘隐者, 45. 陈希章, 46. 自由飞, 47. 李永京, 48. 周见智, 49. 木宛城主, 50. 冠军, 51. dotNetDR_, 52. 邀月, 53. Barret Lee, 54. 程兴亮, 55. sparkdev, 56. 计算机的潜意识, 57. 慕容小匹夫, 58. 【当耐特】, 59. vajoy, 60. 菩提树下的杨过, 61. Todd Wei, 62. 黄博文, 63. LoveJenny, 64. webabcd, 65. 悦光阴, 66. 风尘浪子, 67. 木小楠, 68. 玉开, 69. 农民伯伯, 70. Terry_龙, 71. BIT祝威, 72. beautifulzzzz, 73. 刘冬.NET, 74. 传说中的弦哥, 75. 最课程陆敏技, 76. 韩子迟, 77. 代震军, 78. hystar, 79. 随它去吧, 80. 岑安, 81. skyme, 82. DebugLZQ, 83. 灵感之源, 84. 金色海洋(jyk)阳光男孩, 85. 银河, 86. lovecindywang, 87. zdd, 88. foreach_break, 89. BloodyAngel, 90. JeffWong, 91. porschev, 92. 坚强2002, 93. 飘扬的红领巾, 94. 啊汉, 95. 万一, 96. 丁浪, 97. 心态要好, 98. 1-2-3, 99. 程序诗人, 100. SoftwareTeacher

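» More recommended blogs

» Blog list (by points)

As you can see, the response is HTML. There are two ways to extract the "recommended blog ranking" from it: parse the HTML with Beautiful Soup, or filter the content with a regular expression. The code is as follows: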

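#coding:utf-8

import requests
import re
import json
from bs4 import BeautifulSoup

# Fetch the recommended blog list.
# The site host is missing from the URL in the source; prepend it before running.
r = requests.get('/aggsite/UserStats')

# Option 1: parse with BeautifulSoup
soup = BeautifulSoup(r.text, 'lxml')
users = [(i.text, i['href']) for i in soup.select('#blogger_list > ul > li > a') if 'AllBloggers.aspx' not in i['href'] and 'expert' not in i['href']]
print(json.dumps(users, ensure_ascii=False))

# Option 2: use a regular expression.
# The original pattern was garbled in extraction (only '(.+)' survives); the pattern
# below is a reconstruction that assumes plain <a href="...">name</a> links.
user_re = re.compile(r'<a href="([^"]+)"[^>]*>(.+?)</a>')
users = [(name, url) for url, name in re.findall(user_re, r.text) if 'AllBloggers.aspx' not in url and 'expert' not in url]
print(json.dumps(users, ensure_ascii=False))

The output is as follows: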

[["Artech", "/artech/"], ["路过秋天", "/cyq1162/"], ["数据之巅", "/asxinyu/"], ["腾飞(Jesse)", "/jesse/"], ["tkbSimplest", "/farb/"], ["圣殿骑士", "/KnightsWarrior/"], ["CareySon", "/CareySon/"], ["三生石上(FineUI控件)", "/sanshi/"], ["葡萄城控件技术团队", "/powertoolsteam/"], ["一线码农", "/huangxincheng/"], ["Vamei", "/vamei/"], ["农码一生", "/zhaopei/"], ["张善友", "/shanyou/"], ["小坦克", "/TankXiao/"], ["ChokCoco", "/coco1s/"], ["Jimmy Zhang", "/JimmyZhang/"], ["Edison Chou", "/edisonchou/"], ["KenshinCui", "/kenshincui/"], ["滴答的雨", "/heyuquan/"], ["", "/insus/"], ["司徒正美", "/rubylouvre/"], ["【艾伦】", "/aaronjs/"], ["请叫我头头哥", "/toutou/"], ["Savorboard", "/savorboard/"], ["桦仔", "/lyhabc/"], ["刘哇勇", "/Wayou/"], ["匠心十年", "/gaochundong/"], ["keepfool", "/keepfool/"], ["左潇龙", "/zuoxiaolong/"], ["stoneniqiu", "/stoneniqiu/"], ["深蓝色右手", "/alamiye010/"], ["mindwind", "/mindwind/"], ["焰尾迭", "/yanweidie/"], ["道法自然", "/baihmpgy/"], ["netfocus", "/netfocus/"], ["纯洁的微笑", "/ityouknow/"], ["snandy", "/snandy/"], ["Jeffcky", "/CreateMyself/"], ["JustRun", "/JustRun1983/"], ["", "/daxnet/"], ["wolfy", "/wolf-sun/"], ["EtherDream", "/index-html/"], ["王清培", "/wangiqngpei557/"], ["潇湘隐者", "/kerrycode/"], ["陈希章", "/chenxizhang/"], ["自由飞", "/freeflying/"], ["李永京", "/lyj/"], ["周见智", "/xiaozhi_5638/"], ["木宛城主", "/OceanEyes/"], ["冠军", "/haogj/"], ["dotNetDR_", "/highend/"], ["邀月", "/downmoon/"], ["Barret Lee", "/hustskyking/"], ["程兴亮", "/chengxingliang/"], ["sparkdev", "/sparkdev/"], ["计算机的潜意识", "/subconscious/"], ["慕容小匹夫", "/murongxiaopifu/"], ["【当耐特】", "/iamzhanglei/"], ["vajoy", "/vajoy/"], ["菩提树下的杨过", "/yjmyzz/"], ["Todd Wei", "/weidagang2046/"], ["黄博文", "/huang0925/"], ["LoveJenny", "/LoveJenny/"], ["webabcd", "/webabcd/"], ["悦光阴", "/ljhdo/"], ["风尘浪子", "/leslies2/"], ["木小楠", "/liuhaorain/"], ["玉开", "/yukaizhao/"], ["农民伯伯", "/over140/"], ["Terry_龙", "/TerryBlog/"], ["BIT祝威", "/bitzhuwei/"], ["beautifulzzzz", "/zjutlitao/"], ["刘冬.NET", "/GoodHelper/"], ["传说中的弦哥", "/legendxian/"], ["最课程陆敏技", "/luminji/"], ["韩子迟", "/zichi/"], 
["代震军", "/daizhj/"], ["hystar", "/lsxqw/"], ["随它去吧", "/dowinning/"], ["岑安", "/hongru/"], ["skyme", "/skyme/"], ["DebugLZQ", "/DebugLZQ/"], ["灵感之源", "/unruledboy/"], ["金色海洋(jyk)阳光男孩", "/jyk/"], ["银河", "/skyivben/"], ["lovecindywang", "/lovecindywang/"], ["zdd", "/graphics/"], ["foreach_break", "/foreach-break/"], ["BloodyAngel", "/zgynhqf/"], ["JeffWong", "/jeffwongishandsome/"], ["porschev", "/zhongweiv/"], ["坚强2002", "/me-sa/"], ["飘扬的红领巾", "/leefreeman/"], ["啊汉", "/hlxs/"], ["万一", "/del/"], ["丁浪", "/dinglang/"], ["心态要好", "/oppoic/"], ["1-2-3", "/1-2-3/"], ["程序诗人", "/scy251147/"], ["SoftwareTeacher", "/xinz/"]]

[["雨夜朦胧", "/RainingNight/"], ["枕边书", "/zhenbianshu/"], ["sparkdev", "/sparkdev/"], ["悦光阴", "/ljhdo/"], ["Emrys5", "/emrys5/"], ["Artech", "/artech/"], ["路过秋天", "/cyq1162/"], ["数据之巅", "/asxinyu/"], ["腾飞(Jesse)", "/jesse/"], ["tkbSimplest", "/farb/"], ["圣殿骑士", "/KnightsWarrior/"], ["CareySon", "/CareySon/"], ["三生石上(FineUI控件)", "/sanshi/"], ["葡萄城控件技术团队", "/powertoolsteam/"], ["一线码农", "/huangxincheng/"], ["Vamei", "/vamei/"], ["农码一生", "/zhaopei/"], ["张善友", "/shanyou/"], ["小坦克", "/TankXiao/"], ["ChokCoco", "/coco1s/"], ["Jimmy Zhang", "/JimmyZhang/"], ["Edison Chou", "/edisonchou/"], ["KenshinCui", "/kenshincui/"], ["滴答的雨", "/heyuquan/"], ["", "/insus/"], ["司徒正美", "/rubylouvre/"], ["【艾伦】", "/aaronjs/"], ["请叫我头头哥", "/toutou/"], ["Savorboard", "/savorboard/"], ["桦仔", "/lyhabc/"], ["刘哇勇", "/Wayou/"], ["匠心十年", "/gaochundong/"], ["keepfool", "/keepfool/"], ["左潇龙", "/zuoxiaolong/"], ["stoneniqiu", "/stoneniqiu/"], ["深蓝色右手", "/alamiye010/"], ["mindwind", "/mindwind/"], ["焰尾迭", "/yanweidie/"], ["道法自然", "/baihmpgy/"], ["netfocus", "/netfocus/"], ["纯洁的微笑", "/ityouknow/"], ["snandy", "/snandy/"], ["Jeffcky", "/CreateMyself/"], ["JustRun", "/JustRun1983/"], ["", "/daxnet/"], ["wolfy", "/wolf-sun/"], ["EtherDream", "/index-html/"], ["王清培", "/wangiqngpei557/"], ["潇湘隐者", "/kerrycode/"], ["陈希章", "/chenxizhang/"], ["自由飞", "/freeflying/"], ["李永京", "/lyj/"], ["周见智", "/xiaozhi_5638/"], ["木宛城主", "/OceanEyes/"], ["冠军", "/haogj/"], ["dotNetDR_", "/highend/"], ["邀月", "/downmoon/"], ["Barret Lee", "/hustskyking/"], ["程兴亮", "/chengxingliang/"], ["sparkdev", "/sparkdev/"], ["计算机的潜意识", "/subconscious/"], ["慕容小匹夫", "/murongxiaopifu/"], ["【当耐特】", "/iamzhanglei/"], ["vajoy", "/vajoy/"], ["菩提树下的杨过", "/yjmyzz/"], ["Todd Wei", "/weidagang2046/"], ["黄博文", "/huang0925/"], ["LoveJenny", "/LoveJenny/"], ["webabcd", "/webabcd/"], ["悦光阴", "/ljhdo/"], ["风尘浪子", "/leslies2/"], ["木小楠", "/liuhaorain/"], ["玉开", "/yukaizhao/"], ["农民伯伯", "/over140/"], ["Terry_龙", "/TerryBlog/"], ["BIT祝威", "/bitzhuwei/"], 
["beautifulzzzz", "/zjutlitao/"], ["刘冬.NET", "/GoodHelper/"], ["传说中的弦哥", "/legendxian/"], ["最课程陆敏技", "/luminji/"], ["韩子迟", "/zichi/"], ["代震军", "/daizhj/"], ["hystar", "/lsxqw/"], ["随它去吧", "/dowinning/"], ["岑安", "/hongru/"], ["skyme", "/skyme/"], ["DebugLZQ", "/DebugLZQ/"], ["灵感之源", "/unruledboy/"], ["金色海洋(jyk)阳光男孩", "/jyk/"], ["银河", "/skyivben/"], ["lovecindywang", "/lovecindywang/"], ["zdd", "/graphics/"], ["foreach_break", "/foreach-break/"], ["BloodyAngel", "/zgynhqf/"], ["JeffWong", "/jeffwongishandsome/"], ["porschev", "/zhongweiv/"], ["坚强2002", "/me-sa/"], ["飘扬的红领巾", "/leefreeman/"], ["啊汉", "/hlxs/"], ["万一", "/del/"], ["丁浪", "/dinglang/"], ["心态要好", "/oppoic/"], ["1-2-3", "/1-2-3/"], ["程序诗人", "/scy251147/"], ["SoftwareTeacher", "/xinz/"]]

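For the BeautifulSoup approach, we use the CSS selector method .select to find all a elements under id="blogger_list" > ul > li, and filter the result to drop the "More recommended blogs" and "Blog list (by points)" links.

The regular expression approach works the same way: we first build a regular expression that matches the links, then use re.findall to find all matches, and again filter the result to drop the "More recommended blogs" and "Blog list (by points)" links.

That completes the first step: we now have the list of recommended blogs from the home page.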
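The find-then-filter step can be tried offline with the standard library alone. The sketch below is an illustration under assumptions: SAMPLE_HTML is a made-up fragment standing in for the real response, the hrefs of the two navigation links are guesses based on the filter strings, and the pattern and the helper name extract_bloggers are not taken from the original article.

```python
import re

# Hypothetical fragment; the real UserStats markup may differ.
SAMPLE_HTML = '''
<div id="blogger_list">
<ul>
<li>1. <a href="/artech/">Artech</a></li>
<li>7. <a href="/CareySon/">CareySon</a></li>
<li><a href="/expert/">More recommended blogs</a></li>
<li><a href="/AllBloggers.aspx">Blog list (by points)</a></li>
</ul>
</div>
'''

# Group 1 captures the link target, group 2 the link text.
LINK_RE = re.compile(r'<a href="([^"]+)"[^>]*>(.+?)</a>')


def extract_bloggers(html):
    """Return (name, url) pairs, skipping the two navigation links."""
    return [(name, url) for url, name in LINK_RE.findall(html)
            if 'AllBloggers.aspx' not in url and 'expert' not in url]


print(extract_bloggers(SAMPLE_HTML))
# prints [('Artech', '/artech/'), ('CareySon', '/CareySon/')]
```

Because the pattern has two groups, re.findall returns (url, name) tuples, which is why the comprehension unpacks them in that order before swapping them.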
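The library list includes concurrent.futures, so the (name, url) pairs obtained here can presumably be processed in parallel. The following is only a sketch of that idea: analyse_blog is a hypothetical placeholder for the real per-blog work, and the pool size of 4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

# First few (name, url) pairs from the result above.
users = [('Artech', '/artech/'), ('CareySon', '/CareySon/'), ('Vamei', '/vamei/')]


def analyse_blog(user):
    """Hypothetical stand-in for fetching and parsing one blog.

    Here it only derives the blog's path name from its URL.
    """
    name, url = user
    return name, url.strip('/')


def analyse_all(users, max_workers=4):
    """Run analyse_blog over all users in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyse_blog, users))


print(analyse_all(users))
# prints [('Artech', 'artech'), ('CareySon', 'CareySon'), ('Vamei', 'vamei')]
```

Executor.map returns results in the order of the inputs, so the output list lines up with the ranking even though the calls run concurrently.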

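3.1.2 Blog post categories

1. In the same way, open a blog page (for example my own blog: /lovesoo/) with the Chrome developer tools and analyze it.

2. We find the endpoint sidecolumn.aspx, which returns the information we need: the post categories.

3. Click Headers to see how the endpoint is called. It is also a GET endpoint; its path contains the blog user name, and it takes the parameter blogApp=<user name>: /lovesoo/mvc/blog/sidecolumn.aspx?blogApp=lovesoo

4. Example code that sends the GET request with requests to fetch the post categories: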

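#coding:utf-8

import requests

user = 'lovesoo'
# The site host is missing from the URL in the source; prepend it before running.
url = '/{0}/mvc/blog/sidecolumn.aspx'.format(user)
blogApp = user
payload = dict(blogApp=blogApp)
r = requests.get(url, params=payload)
print(r.text)

The response is as follows: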

Search
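To see which path and query string the params argument in the request above corresponds to, the same URL can be assembled with the standard library. This is a minimal sketch for illustration, not how requests works internally; the helper name sidecolumn_url is made up, and the host is left out because the source omits it.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def sidecolumn_url(user):
    """Build the sidecolumn.aspx path and query string for a blog user name."""
    path = '/{0}/mvc/blog/sidecolumn.aspx'.format(user)
    return path + '?' + urlencode({'blogApp': user})


print(sidecolumn_url('lovesoo'))
# prints /lovesoo/mvc/blog/sidecolumn.aspx?blogApp=lovesoo
```

The result matches the request shown in the Headers panel in step 3, with the user name appearing both in the path and in the blogApp parameter.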
