
Scraping Second-Hand and New Home Listings from Anjuke with a Crawler: Would You Buy New or Second-Hand?

Date: 2021-02-08 14:16:55

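This article explains how to scrape the second-hand and new home listings from the home-buying section of Anjuke. The extracted information is saved to plain text files; it can also be exported to CSV or stored in MongoDB.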
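As an illustration of the CSV option, here is a minimal sketch using only the standard library. The column names in `FIELDS` and the sample row are made up for this example; the script in this article stores each listing as a single block of text rather than as separate fields.

```python
import csv
import io

# Hypothetical column names; the article's script does not split listings into fields.
FIELDS = ["title", "district", "kind", "information", "url"]


def rows_to_csv(rows):
    """Serialize a list of listing dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample row, only to show the output shape.
sample = [{
    "title": "Example listing",
    "district": "包河",
    "kind": "二手房",
    "information": "3 rooms, 98 m2",
    "url": "/sale/example",
}]
csv_text = rows_to_csv(sample)
```

To write a file instead of a string, the same `DictWriter` can be given a handle opened with `open(path, "w", newline="", encoding="utf-8-sig")`; the `utf-8-sig` encoding is a suggestion that helps spreadsheet programs detect the Chinese text correctly.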

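Extracting information from the site's HTML is fairly simple, with nothing unusual about it, so it works well as an introductory project that lets beginners quickly get a feel for how crawlers work.

If you find this useful, please give it a like. Writing takes effort, thank you.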

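1. Page analysis

We start from the site's home page and extract information step by step until we reach the individual property details.

Taking second-hand homes as the example, we analyze the page source.

(The site's other sections, such as selling and renting, can be scraped the same way.)

First we inspect the page source to find the markup and links that correspond to second-hand and new homes.

In the HTML we can find the URL for each district, so extracting the href attribute is enough to jump to the listings for that district.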
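The idea of pulling the href out of each district link can be sketched as follows. This example uses the standard library's `html.parser` instead of pyquery so that it runs without extra packages, and the HTML fragment is made up to resemble a list of district links; the real page structure may differ. With pyquery the equivalent is `doc('.areas a').items()` followed by `.attr('href')` and `.text()`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up fragment shaped like a block of district links.
SAMPLE = '''
<div class="areas">
  <a href="/sale/baohequ/">包河</a>
  <a href="/sale/shushanqu/">蜀山</a>
</div>
'''

collector = LinkCollector()
collector.feed(SAMPLE)
district_links = collector.links
```

Each pair gives the address to request next and the district name, which the script later uses as a folder name.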

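Taking Baohe District as the example, we reach its listing page; iterating over the li nodes on that page yields the URLs of all the property detail pages.

Here we scrape only the first page. To scrape more pages, just change the URL. Watching how the URL changes shows that each later page only increments the page segment: p2, then p3, and so on.

From: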

/sale/baohequ/?from=SearchBar

To:

/sale/baohequ/p2/#filtersort
/sale/baohequ/p3/#filtersort
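The URL pattern above can be turned into a small helper. This is a sketch under two assumptions: that the first page is the plain district path, and that the `?from=SearchBar` query and the `#filtersort` fragment can be dropped (a fragment is never sent to the server; that the query is optional is a guess). The function name `page_url` is introduced here for illustration only.

```python
def page_url(district_path, page):
    """Return the listing path for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    base = district_path.rstrip("/") + "/"
    # Page 1 is the plain district path; later pages append p2/, p3/, ...
    return base if page == 1 else f"{base}p{page}/"


pages = [page_url("/sale/baohequ/", n) for n in range(1, 4)]
```

Looping over such a list and calling the listing function on each entry would extend the crawl beyond the first page.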

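Finally, we extract the information from each property's detail page.

2. Code

The code is fairly simple and needs little explanation. If anything is unclear, leave a comment below.

The extracted title is used as the file name. Some titles contain characters that are illegal in file names, so replace is used to strip them.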

from pyquery import PyQuery as pq
import requests
import os

# Optional MongoDB storage; uncomment to use it instead of text files.
# import pymongo
# client = pymongo.MongoClient(host='localhost', port=27017)
# db = client.安居客      # select the database; MongoDB creates it on first write
# collection = db.合肥


def gethtml(url):
    """Fetch a page and return its HTML, or None on a non-200 response."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) '
                      'Gecko/20100101 Firefox/80.0'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None


def getaddress(html):
    """Walk the home page: the first block is second-hand homes, the rest are new homes."""
    doc = pq(html)
    old_news = doc('#content_Rd1 div.clearfix .details').items()
    for i, old_new in enumerate(old_news):
        if i == 0:
            print('二手房')  # second-hand homes
            old_houses = old_new.find('.areas a').items()
            for old_house in old_houses:
                house_url = old_house.attr('href')
                address = old_house.text()
                old_house_list(house_url, address)
        else:
            print('新房')  # new homes
            new_houses = old_new.find('.areas a').items()
            for new_house in new_houses:
                house_url = new_house.attr('href')
                address = new_house.text()
                new_house_list(house_url, address)


def old_house_list(url, address):
    """Collect the detail-page URLs from one district's second-hand listing page."""
    html = gethtml(url)
    if html is None:
        return
    doc = pq(html)
    items = doc('#houselist-mod-new .list-item .house-details .house-title').items()
    for item in items:
        url = item.find('a').attr('href')
        old_house_information(url, address)


def new_house_list(url, address):
    """Collect the detail-page URLs from one district's new-home listing page."""
    url = 'https:' + url  # new-home links are protocol-relative
    html = gethtml(url)
    if html is None:
        return
    doc = pq(html)
    items = doc('.key-list .item-mod').items()
    for item in items:
        url = item.find('a').attr('href')
        new_house_information(url, address)


def old_house_information(url, address):
    """Extract the title and details of one second-hand home and save them."""
    html = gethtml(url)
    if html is None:
        return
    doc = pq(html)
    title = doc('#content .clearfix h3').text()
    information = doc('.houseInfo-wrap .houseInfo-detail-item').text().replace('\n', '')
    old = '二手房'
    write_txt(title, information, address, old, url)


def new_house_information(url, address):
    """Extract the title and details of one new development and save them."""
    html = gethtml(url)
    if html is None:
        return
    doc = pq(html)
    title = doc('.basic-info .basic-fst').text().replace('\n', '')
    information = doc('.basic-parms').text().replace('\n', '')
    # Remove button labels that get mixed into the details text.
    for label in ('变价通知我', '全部户型', '开盘通知我', '查看地图'):
        information = information.replace(label, '')
    new = '新盘'
    write_txt(title, information, address, new, url)


def write_txt(title, content, address, infor, url):
    """Save one listing to 安居客/<kind>/<district>/<title>.txt, skipping existing files."""
    house_path = os.path.join('安居客', infor, address)
    os.makedirs(house_path, exist_ok=True)
    if title:
        # Strip characters from the title that are not allowed in file names.
        name = title.replace(' ', '').replace('|', '').replace('*', '')
        file_path = os.path.join(house_path, '{0}.{1}'.format(name, 'txt'))
        if not os.path.exists(file_path):
            with open(file_path, 'w', encoding='utf-8') as f:
                print('正在爬取 ' + address + ' ' + infor + ' ' + ' ' + title)  # "scraping ..."
                f.write(content + '\n')
                f.write(url)
        else:
            print('已爬取 ' + address + ' ' + infor + ' ' + ' ' + title)  # "already scraped ..."


if __name__ == "__main__":
    url = '/'  # path of the site's home page; prepend the host of the city site
    html = gethtml(url)
    if html:
        getaddress(html)
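The chained replace calls in write_txt remove only spaces, `|` and `*`, so a title containing another forbidden character such as `/` or `?` could still fail when the file is created. A possible alternative, not part of the original script, is a single regular expression. The helper name `safe_filename` and the choice of characters (those Windows rejects in file names, plus whitespace) are assumptions made for this sketch.

```python
import re

# Characters that Windows forbids in file names, plus any whitespace.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\s]+')


def safe_filename(title, extension="txt"):
    """Drop forbidden characters from a title and append the extension."""
    cleaned = _ILLEGAL.sub("", title)
    return f"{cleaned}.{extension}"


name = safe_filename("三室 两厅|精装*修")
```

In write_txt this would replace the `title.replace(...)` chain while leaving the rest of the function unchanged.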