iFLYTEK: Hotel Price Prediction Challenge!

Date: 2019-03-15 02:59:39

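Datawhale practical guide

Track: data mining, iFLYTEK competitions

Competition name: Hotel Accommodation Price Prediction Challenge

Competition type: data mining

Competition link 👇: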

/topic/info?type=accommodation-price&ch=vWxQGFU

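Background

With the domestic epidemic currently under control, tourism markets at home and abroad are gradually recovering and the market structure is returning to normal. The recovery of tourism will also boost the hotel industry and push the market in a positive direction.

To broaden travel options, accommodation platforms keep offering more distinctive, personalized experiences. Differences between regions in transport, holidays, quality, and location all affect accommodation prices, so predicting those prices from known data is a considerable challenge.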

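Task

The task is to train a model on the training data, using the hotel information provided, and to predict room prices for the listings in the test set.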

Dataset

The data consists of a training set and a test set with 15 fields; the target field is the prediction target.

Evaluation metric

Submissions are scored by mean absolute error (MAE): the lower the MAE, the better. Reference evaluation code:

from sklearn.metrics import mean_absolute_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)
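To make the metric concrete, MAE is just the mean of the absolute differences between true and predicted values. The following is a minimal dependency-free sketch of that definition; the function name `mae` is illustrative and is not part of the competition's scoring code.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Same toy values as the reference snippet: errors are 0.5, 0.5, 0 and 1.
score = mae([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print(score)  # 0.5
```

Because every error counts linearly, a single badly mispriced listing hurts the score in proportion to its error rather than its square.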

Approach

This is a typical regression competition. It is worth scaling the label before modeling and adding some feature engineering:

# Load the datasets and scale the label
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_predict, cross_validate
from sklearn.metrics import mean_absolute_error
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

train_df = pd.read_csv('./酒店住宿价格预测挑战赛公开数据/train.csv')
test_df = pd.read_csv('./酒店住宿价格预测挑战赛公开数据/test.csv')
train_df['target'] = np.log1p(train_df['target'])

# Feature engineering on the training set
train_df['last_review_isnull'] = train_df['last_review'].isnull()
train_df['last_review_year'] = pd.to_datetime(train_df['last_review']).dt.year

train_df['neighbourhood_group_mean'] = train_df['neighbourhood_group'].map(train_df.groupby(['neighbourhood_group'])['target'].mean())
train_df['neighbourhood_group_counts'] = train_df['neighbourhood_group'].map(train_df['neighbourhood_group'].value_counts())

train_df['room_type_mean'] = train_df['room_type'].map(train_df.groupby(['room_type'])['target'].mean())
train_df['room_type_counts'] = train_df['room_type'].map(train_df['room_type'].value_counts())

train_df['region_1_id_mean'] = train_df['region_1_id'].map(train_df.groupby(['region_1_id'])['target'].mean())
train_df['region_1_counts'] = train_df['region_1_id'].map(train_df['region_1_id'].value_counts())

train_df['region_2_id_mean'] = train_df['region_2_id'].map(train_df.groupby(['region_2_id'])['target'].mean())
train_df['region_2_counts'] = train_df['region_2_id'].map(train_df['region_2_id'].value_counts())

train_df['region_3_id_mean'] = train_df['region_3_id'].map(train_df.groupby(['region_3_id'])['target'].mean())
train_df['region_3_counts'] = train_df['region_3_id'].map(train_df['region_3_id'].value_counts())

train_df['availability_month'] = train_df['availability'] // 30
train_df['availability_week'] = train_df['availability'] // 7

train_df['reviews_per_month_count'] = train_df['reviews_per_month'] * train_df['calculated_host_listings_count']
train_df['room_type_calculated_host_listings_count'] = train_df['room_type'].map(train_df.groupby(['room_type'])['calculated_host_listings_count'].sum())

# Same features for the test set (group statistics come from the training set)
test_df['last_review_isnull'] = test_df['last_review'].isnull()
test_df['last_review_year'] = pd.to_datetime(test_df['last_review']).dt.year

test_df['neighbourhood_group_mean'] = test_df['neighbourhood_group'].map(train_df.groupby(['neighbourhood_group'])['target'].mean())
test_df['neighbourhood_group_counts'] = test_df['neighbourhood_group'].map(train_df['neighbourhood_group'].value_counts())

test_df['room_type_mean'] = test_df['room_type'].map(train_df.groupby(['room_type'])['target'].mean())
test_df['room_type_counts'] = test_df['room_type'].map(train_df['room_type'].value_counts())

test_df['region_1_id_mean'] = test_df['region_1_id'].map(train_df.groupby(['region_1_id'])['target'].mean())
test_df['region_1_counts'] = test_df['region_1_id'].map(train_df['region_1_id'].value_counts())

test_df['region_2_id_mean'] = test_df['region_2_id'].map(train_df.groupby(['region_2_id'])['target'].mean())
test_df['region_2_counts'] = test_df['region_2_id'].map(train_df['region_2_id'].value_counts())

test_df['region_3_id_mean'] = test_df['region_3_id'].map(train_df.groupby(['region_3_id'])['target'].mean())
test_df['region_3_counts'] = test_df['region_3_id'].map(train_df['region_3_id'].value_counts())

test_df['availability_month'] = test_df['availability'] // 30
test_df['availability_week'] = test_df['availability'] // 7

test_df['reviews_per_month_count'] = test_df['reviews_per_month'] * test_df['calculated_host_listings_count']
test_df['room_type_calculated_host_listings_count'] = test_df['room_type'].map(train_df.groupby(['room_type'])['calculated_host_listings_count'].sum())

# Train the models with cross-validation (cross_validate defaults to 5 folds)
cat_val = cross_validate(
    CatBoostRegressor(verbose=0, n_estimators=1000),
    train_df.drop(['id', 'target', 'last_review'], axis=1),
    train_df['target'],
    return_estimator=True
)

lgb_val = cross_validate(
    LGBMRegressor(verbose=0, force_row_wise=True),
    train_df.drop(['id', 'target', 'last_review'], axis=1),
    train_df['target'],
    return_estimator=True
)

xgb_val = cross_validate(
    XGBRegressor(),
    train_df.drop(['id', 'target', 'last_review'], axis=1),
    train_df['target'],
    return_estimator=True
)

# Predict on the test set by averaging the fold models
pred = np.zeros(len(test_df))
# To blend all three model families instead, loop over the line below and divide by 15:
# for clf in cat_val['estimator'] + lgb_val['estimator'] + xgb_val['estimator']:
for clf in cat_val['estimator']:
    pred += clf.predict(test_df.drop(['id', 'last_review'], axis=1))

pred /= 5  # five CatBoost fold models
pred = np.exp(pred) - 1  # undo the log1p applied to the target
pd.DataFrame({'id': range(30000, 40000), 'target': pred}).to_csv('a.csv', index=None)
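The feature engineering above repeats the same two steps for each categorical column: map the column to its training-set target mean and to its training-set frequency, for both train and test. A minimal sketch of how that could be folded into one helper follows. The name `add_group_stats` and the toy frames are hypothetical, and the helper names every count column `<col>_counts`, which differs slightly from the baseline's `region_1_counts` style.

```python
import pandas as pd


def add_group_stats(train, test, col, target="target"):
    """Add <col>_mean and <col>_counts to both frames, using train-only statistics."""
    means = train.groupby(col)[target].mean()   # per-category target mean
    counts = train[col].value_counts()          # per-category frequency
    for df in (train, test):
        df[f"{col}_mean"] = df[col].map(means)
        df[f"{col}_counts"] = df[col].map(counts)
    return train, test


# Toy data (made up for illustration); category "c" never appears in train.
train = pd.DataFrame({"room_type": ["a", "a", "b"], "target": [1.0, 3.0, 5.0]})
test = pd.DataFrame({"room_type": ["a", "b", "c"]})

train, test = add_group_stats(train, test, "room_type")
print(test)
```

Two caveats apply to this pattern as well as to the baseline. Categories that appear only in the test set map to NaN, which LightGBM, XGBoost, and CatBoost accept as missing values. And because each training row's own target contributes to its group mean, cross-validation scores on the training set are likely to look somewhat optimistic.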

Full code 👇:

/datawhalechina/competition-baseline/tree/master/competition/%E7%A7%91%E5%A4%A7%E8%AE%AF%E9%A3%9EAI%E5%BC%80%E5%8F%91%E8%80%85%E5%A4%A7%E8%B5%9B
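The label scaling used in the baseline (log1p before training, the inverse before submission) can be looked at in isolation. The sketch below uses made-up prices and a hypothetical model whose predictions are off by a constant 0.1 in log space; it only illustrates that an error measured in log space is not the MAE in price units, so predictions should be transformed back before scoring or submitting.

```python
import numpy as np

# Made-up prices with a long right tail (not taken from the competition data).
prices = np.array([0.0, 50.0, 120.0, 4000.0])

log_prices = np.log1p(prices)     # log(1 + price); safe when price is 0
restored = np.expm1(log_prices)   # exact inverse of log1p

# Hypothetical predictions: constant error of 0.1 in log space.
log_pred = log_prices + 0.1
pred = np.expm1(log_pred)

mae_log = np.mean(np.abs(log_pred - log_prices))   # error in log space
mae_price = np.mean(np.abs(pred - prices))         # error in price units

print(mae_log, mae_price)
```

A fixed error in log space corresponds to a roughly proportional error in price, so expensive listings dominate the MAE once predictions are mapped back to the original scale.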

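Join the learning program

Datawhale, iFLYTEK, and Tianchi have jointly launched an AI summer camp built around the latest iFLYTEK and Tianchi competitions. Registration for the third session closes on August 15; scan the QR code to apply.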
