
Scraping Douban movie reviews with a Python crawler: automatic login via cookies

Date: 2020-11-14 22:42:38

Logging in to the site too often can get your account locked by Douban...

Page request

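Automatic login is done with cookies. The cookies are masked here because they are tied to my account. To get your own, log in through the browser as usual and then look up the Cookie value in the request information for the page (for example in the browser's developer tools).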

import requests

def askURL(url):
    # Pretend to be a regular desktop browser
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}
    # Cookie value copied from the logged-in browser (masked here)
    cookies = {"Cookie": ' ***********************'}
    # Earlier urllib version, kept for reference:
    # request = urllib.request.Request(url, headers=head)
    # html = ""
    # response = urllib.request.urlopen(request)
    # html = response.read().decode("utf-8")
    html = requests.get(url, cookies=cookies, headers=head)
    print("Response received")
    return html.text
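The `cookies` argument of `requests.get` maps individual cookie names to values, so one alternative to passing the whole string under a single "Cookie" key is to split the copied header into name/value pairs first. The following is only a minimal sketch of that idea; `cookie_header_to_dict` is a hypothetical helper name and the cookie names and values in `raw` are made up for illustration.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for part in raw_cookie.split(";"):
        part = part.strip()
        # Skip empty fragments and fragments without a name=value shape
        if not part or "=" not in part:
            continue
        # Split on the first '=' only, since values may contain '='
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up header string, standing in for the one copied from the browser
raw = "sessionid=abc123; token=x=y; theme=dark"
parsed = cookie_header_to_dict(raw)
print(parsed)
```

The resulting dict could then be passed as `requests.get(url, cookies=parsed, headers=head)`.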

Code snippets for fetching the data

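While browsing the Douban reviews, I found that I could not retrieve all of the comments.

If I am reading the page correctly, there should be more than 310,000 comments, but when actually fetching them, nothing comes back after page 26.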

Regular expressions (re)

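import re

# Comment text (re.S lets the match span line breaks)
findCritic = re.compile(r'<span class="short">(.*?)</span>', re.S)
# User name, taken from the title attribute of the link
findUser = re.compile(r'<a href=.*? title="(.*?)">', re.S)
# Rating, taken from the class attribute of the rating span
findScore = re.compile(r'<span class="(.*?)" title=')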
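To make it easier to see what each pattern captures, here is a small sketch that runs the three expressions against a single comment block. The HTML in `sample_item` is a made-up fragment written for this illustration, not markup copied from Douban, so the real page may differ.

```python
import re

findCritic = re.compile(r'<span class="short">(.*?)</span>', re.S)
findUser = re.compile(r'<a href=.*? title="(.*?)">', re.S)
findScore = re.compile(r'<span class="(.*?)" title=')

# Hypothetical comment block, for illustration only
sample_item = '''
<div class="comment-item">
  <a href="https://example.com/people/demo/" title="demo_user"><img src="x.jpg"></a>
  <span class="allstar50 rating" title="Highly recommended"></span>
  <span class="short">A made-up
two-line comment.</span>
</div>
'''

users = findUser.findall(sample_item)      # list of title attribute values
scores = findScore.findall(sample_item)    # list of class attribute values
critics = findCritic.findall(sample_item)  # list of comment texts

print(users, scores, critics)
```

Each `findall` call returns a list, which is why the scraping code either stores the list as it is or takes element `[0]`.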

The scraping function

from bs4 import BeautifulSoup

def getDate(base_url):
    datelist = []
    for i in range(0, 25):
        # Each page holds 20 comments; the offset is appended to the base URL
        url = base_url + str(i * 20)
        html = askURL(url)
        print("Page {0}".format(i + 1))
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="comment-item"):
            date = []
            item = str(item)
            # print(item)
            user = re.findall(findUser, item)
            date.append(user)
            # Raises IndexError if a comment has no rating span
            score = re.findall(findScore, item)[0]
            date.append(score)
            critic = re.findall(findCritic, item)
            date.append(critic)
            datelist.append(date)
    return datelist
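Since the site stops returning comments after a certain page, one possible variation is to stop as soon as a page comes back empty instead of hard-coding the page count. This is a sketch only: `collect_pages`, `fetch_page` and `parse_items` are hypothetical names, and the demo uses stand-in functions rather than real network requests.

```python
def collect_pages(fetch_page, parse_items, page_size=20, max_pages=1000):
    """Collect items page by page until a page yields nothing."""
    results = []
    for page in range(max_pages):
        html = fetch_page(page * page_size)  # offset-based paging
        items = parse_items(html)
        if not items:
            # An empty page means there is nothing more to fetch
            break
        results.extend(items)
    return results


# Stand-ins for the real request and parsing steps
fake_pages = {0: "a,b", 20: "c", 40: ""}

def fake_fetch(start):
    return fake_pages.get(start, "")

def fake_parse(html):
    return [x for x in html.split(",") if x]

collected = collect_pages(fake_fetch, fake_parse)
print(collected)
```

In the real script, `fetch_page` would wrap `askURL(base_url + str(offset))` and `parse_items` would wrap the BeautifulSoup and regex steps.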

Saving to the database

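Usernames that contain single quotes gave me a real headache here. I use str.replace() to first turn the single quotes into spaces, then turn any double quotes into single quotes, and finally wrap each value in double quotes. Only then does the string fit the format of the insert statement.

If anyone has a better solution, please let me know in the comments.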

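import sqlite3

def saveDate_DB(datelist, dbpath):
    # init_DB creates the table; its definition is not shown in this post
    init_DB(dbpath)
    conn = sqlite3.connect(dbpath)
    cursor = conn.cursor()
    for date in datelist:
        for index in range(len(date)):
            date[index] = str(date[index])
            # Replace single quotes with spaces, then double quotes with single quotes
            date[index] = date[index].replace("'", " ")
            date[index] = date[index].replace('"', "'")
            # Wrap the value in double quotes for the insert statement
            date[index] = '"' + str(date[index]) + '"'
        sql = '''insert into bawangbieji(author ,score ,critics)values(%s)''' % ",".join(date)
        # print(sql)
        cursor.execute(sql)
    conn.commit()
    conn.close()
    print("Saved to database", dbpath)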
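One common way around the quoting problem is to let `sqlite3` bind the values itself through `?` placeholders, so quotes inside a username or comment never need to be replaced. The sketch below assumes three text columns, since the table definition is not shown in the post; `save_rows` is a hypothetical helper and the sample row is made up.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert (author, score, critics) tuples using bound parameters."""
    # Assumed schema; the original init_DB may define the table differently
    conn.execute(
        "create table if not exists bawangbieji "
        "(author text, score text, critics text)"
    )
    # The ? placeholders are filled in by sqlite3, so no manual escaping is needed
    conn.executemany(
        "insert into bawangbieji (author, score, critics) values (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
rows = [("O'Brien", "allstar50 rating", 'He said "great" and it\'s true')]
save_rows(conn, rows)
stored = conn.execute(
    "select author, score, critics from bawangbieji"
).fetchall()
print(stored)
conn.close()
```

The values come back exactly as they were stored, with single and double quotes intact.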
