Decision Tree Model Regression Visualization Analysis: Building a Regression Model from Black Friday Store Sales Data

Date: 2020-10-16 15:17:30

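I. Background

This is a learning project from Kaggle (the dataset comes from a competition hosted by Analytics Vidhya). It is a sales prediction problem for a mid-sized store: the dataset is a sample of roughly 550,000 Black Friday observations from retail stores in different cities. The store wants to better understand how customers buy different products, and to predict a customer's purchase amount by regressing it on the other variables.

The analysis is done in R, and the final model is a multiple nonlinear regression with dummy variables (nonlinear in that the response is log-transformed).

Data source: /mehdidag/black-friday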

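II. Questions

1. What does each customer's spending look like?

2. How do the different products sell?

3. For the store, which type of customer has the strongest purchasing power?

4. What kind of model predicts a customer's purchase amount well from the other variables?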

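III. Dataset

1. Variable descriptions:

User_ID: customer ID

Product_ID: product ID

Gender: customer gender

Age: customer age group

Occupation: customer occupation code (an integer from 0 to 20)

City_Category: category of the city the customer lives in (A, B or C)

Stay_In_Current_City_Years: number of years the customer has lived in the current city

Marital_Status: marital status

Product_Category_1: product category 1

Product_Category_2: product category 2

Product_Category_3: product category 3

Purchase: purchase amount, in US dollars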

2. Data processing

· Exploring the data

### load packages
library(tidyverse)
library(Hmisc)
library(corrplot)
library(ggpubr)
### load the data
a <- read_csv('BlackFriday.csv')
### descriptive statistics
describe(a)

· Descriptive summary of the data

12 Variables      537577 Observations
----------------------------------------------------------------------------------------
User_ID
       n  missing distinct     Info     Mean      Gmd
  537577        0     5891        1  1002992     1979
     .05      .10      .25      .50      .75      .90      .95
 1000325  1000659  1001495  1003031  1004417  1005402  1005724
lowest : 1000001 1000002 1000003 1000004 1000005
highest: 1006036 1006037 1006038 1006039 1006040
----------------------------------------------------------------------------------------
Product_ID
       n  missing distinct
  537577        0     3623
lowest : P00000142 P00000242 P00000342 P00000442 P00000542
highest: P0099442 P0099642 P0099742 P0099842 P0099942
----------------------------------------------------------------------------------------
Gender
       n  missing distinct
  537577        0        2
Value           F      M
Frequency  132197 405380
Proportion  0.246  0.754
----------------------------------------------------------------------------------------
Age
       n  missing distinct
  537577        0        7
Value        0-17  18-25  26-35  36-45  46-50  51-55    55+
Frequency   14707  97634 214690 107499  44526  37618  20903
Proportion  0.027  0.182  0.399  0.200  0.083  0.070  0.039
----------------------------------------------------------------------------------------
Occupation
       n  missing distinct     Info     Mean      Gmd
  537577        0       21    0.993    8.083    7.392
     .05      .10      .25      .50      .75      .90      .95
       0        0        2        7       14       17       20
lowest :  0  1  2  3  4, highest: 16 17 18 19 20
----------------------------------------------------------------------------------------
City_Category
       n  missing distinct
  537577        0        3
Value           A      B      C
Frequency  144638 226493 166446
Proportion  0.269  0.421  0.310
----------------------------------------------------------------------------------------
Stay_In_Current_City_Years
       n  missing distinct
  537577        0        5
Value           0      1      2      3     4+
Frequency   72725 189192  99459  93312  82889
Proportion  0.135  0.352  0.185  0.174  0.154
----------------------------------------------------------------------------------------
Marital_Status
       n  missing distinct     Info      Sum     Mean      Gmd
  537577        0        2    0.725   219760   0.4088   0.4834
----------------------------------------------------------------------------------------
Product_Category_1
       n  missing distinct     Info     Mean      Gmd
  537577        0       18    0.952    5.296    4.043
     .05      .10      .25      .50      .75      .90      .95
       1        1        1        5        8       11       12
Value           1      2      3      4      5      6      7      8      9     10     11
Frequency  138353  23499  19849  11567 148592  20164   3668 112132    404   5032  23960
Proportion  0.257  0.044  0.037  0.022  0.276  0.038  0.007  0.209  0.001  0.009  0.045
Value          12     13     14     15     16     17     18
Frequency    3875   5440   1500   6203   9697    567   3075
Proportion  0.007  0.010  0.003  0.012  0.018  0.001  0.006
----------------------------------------------------------------------------------------
Product_Category_2
       n  missing distinct     Info     Mean      Gmd
  370591   166986       17    0.986    9.842    5.778
     .05      .10      .25      .50      .75      .90      .95
       2        2        5        9       15       16       16
Value           2      3      4      5      6      7      8      9     10     11     12     13
Frequency   48481   2835  25225  25874  16251    615  63058   5591   2991  13945   5419  10369
Proportion  0.131  0.008  0.068  0.070  0.044  0.002  0.170  0.015  0.008  0.038  0.015  0.028
Value          14     15     16     17     18
Frequency   54158  37317  42602  13130   2730
Proportion  0.146  0.101  0.115  0.035  0.007
----------------------------------------------------------------------------------------
Product_Category_3
       n  missing distinct     Info     Mean      Gmd
  164278   373299       15    0.983    12.67     4.49
     .05      .10      .25      .50      .75      .90      .95
       5        5        9       14       16       17       17
Value           3      4      5      6      8      9     10     11     12     13     14     15
Frequency     600   1840  16380   4818  12384  11414   1698   1773   9094   5385  18121  27611
Proportion  0.004  0.011  0.100  0.029  0.075  0.069  0.010  0.011  0.055  0.033  0.110  0.168
Value          16     17     18
Frequency   32148  16449   4563
Proportion  0.196  0.100  0.028
----------------------------------------------------------------------------------------
Purchase
       n  missing distinct     Info     Mean      Gmd
  537577        0    17959        1     9334     5560
     .05      .10      .25      .50      .75      .90      .95
    2096     3608     5866     8062    12073    16337    19342
lowest :   185   186   187   188   189, highest: 23956 23958 23959 23960 23961
----------------------------------------------------------------------------------------

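This summary shows 537,577 observations and 12 variables.

Product_Category_2 and Product_Category_3 have many missing values (166,986 and 373,299 respectively) and need to be handled. Going by the dataset's metadata, an empty value in these two variables can be read as the product not belonging to a second or third category, so the missing values are replaced with 0.

The summary also already shows that male customers account for the larger share of the records (75.4%).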

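· Cleaning the data: handling missing values

qna <- function(x) ifelse(is.na(x), 0, x)   # replace NA with 0 (ifelse is already vectorized)
##### store the cleaned data in data
data <- a %>%
  mutate(nProduct_Category_2 = qna(Product_Category_2),
         nProduct_Category_3 = qna(Product_Category_3)) %>%
  select(-c(10, 11))   # drop the original Product_Category_2 and Product_Category_3 columns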
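As a cross-check on the cleaning step, here is a minimal sketch that counts the missing values per column and then zero-fills both sparse category columns in one pass. The data frame `toy` and its values are made up for illustration, and `across()` assumes dplyr 1.0 or later. Unlike the code in the text, this variant overwrites the two columns instead of creating `n`-prefixed copies.

```r
library(dplyr)

# Hypothetical miniature of the raw table (values invented for illustration)
toy <- tibble(
  Product_Category_1 = c(1, 5, 8, 1),
  Product_Category_2 = c(2, NA, 16, NA),
  Product_Category_3 = c(NA, NA, 17, 5)
)

# Number of missing values per column, to see which columns need handling
na_counts <- colSums(is.na(toy))

# Replace NA with 0 in both sparse category columns at once
cleaned <- toy %>%
  mutate(across(c(Product_Category_2, Product_Category_3), ~ coalesce(.x, 0)))

print(na_counts)
print(cleaned)
```

Whether 0 is a sensible code depends on how the columns are used later: as a category label it simply means "no additional category", but it should not be treated as a numeric quantity.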

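IV. Exploratory data analysis (EDA)

1. Breaking the questions down into variables

· Each customer's spending can be described by the customer's purchase amount, number of products purchased and occupation code (variables Purchase, User_ID and Occupation).

· Sales of the different products can be described by how many times each product was purchased and by product category.

· Which type of customer has the strongest purchasing power can be described by purchase amount together with gender, marital status, city category, years of residence and so on.

· Which model to build for prediction is decided after looking at the data types of the other variables and the correlations between them.

2. Exploring customer spending

· Purchase amount (Purchase)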

ggplot(data,aes(Purchase))+geom_histogram(fill='blue')

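· Number of products purchased per customer

num <- data.frame(n = data %>% group_by(User_ID) %>% group_size())   # purchase records per customer
arrange(num, desc(n)) %>% head(20)                                   # the 20 largest counts
>
 1 1025
 2  978
 3  898
 4  861
 5  822
 6  766
 7  752
 8  739
 9  717
10  714
(only the first 10 rows of the output are shown)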

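· Customer occupation (Occupation)

User <- data %>%
  group_by(User_ID) %>%
  summarise(User_Purchase_sum = sum(Purchase)) %>%   # total spend per customer
  left_join(data, by = 'User_ID') %>%
  distinct(User_ID, .keep_all = TRUE) %>%            # keep one row per customer
  select(-c(3, 10:13))                               # drop the product-level columns
## User now holds one row per customer ID (duplicates removed)
ggplot(User, aes(Occupation)) + geom_bar(fill = 'green')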
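The two per-customer quantities used in this part (number of purchase records and total spend) can also be computed in a single grouped summary. The sketch below uses a made-up purchase log `toy`; the column name `n_products` is an illustrative choice, not one from the original analysis.

```r
library(dplyr)

# Hypothetical purchase log: three customers, six purchase records
toy <- tibble(
  User_ID  = c(1000001, 1000001, 1000002, 1000003, 1000003, 1000003),
  Purchase = c(8370, 15200, 1422, 7969, 15227, 19215)
)

# One row per customer: number of records and total spend
per_user <- toy %>%
  group_by(User_ID) %>%
  summarise(n_products = n(),
            User_Purchase_sum = sum(Purchase)) %>%
  arrange(desc(n_products))

print(per_user)
```

Summarising first and joining customer attributes afterwards avoids carrying every product-level row through the join.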

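Taken together, the plots show the following about customer spending:

· Most purchase amounts fall in the range [5000, 10000], and the distribution of purchase amounts tends towards a normal shape.

· Most customers bought between 0 and 250 products; a small number bought more than 500.

· Most customers have an occupation code between 0 and 10.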

3. Sales of the different products

· Number of times each product was purchased

ggplot(data,aes(Product_ID))+geom_bar(fill='red')

· Product category

ggplot(data)+geom_bar(aes(Product_Category_1),fill='blue')+coord_flip()

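The plots show the following about product sales:

Most products were purchased between 0 and 500 times; a small number, presumably everyday items, were purchased more than 800 times. Within Product_Category_1, categories 5, 1 and 8 account for most of the sales (27.6%, 25.7% and 20.9% of the records).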
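With 3,623 distinct values of Product_ID on the x axis, a bar chart is hard to read precisely, so statements about purchase counts are easier to check from a table. A minimal sketch, using invented product IDs and a hypothetical column name `times_bought`:

```r
library(dplyr)

# Hypothetical purchase records (product IDs invented for illustration)
toy <- tibble(Product_ID = c('P1', 'P2', 'P1', 'P3', 'P1', 'P2'))

# Purchases per product, most purchased first
product_counts <- toy %>%
  count(Product_ID, sort = TRUE, name = 'times_bought')
print(product_counts)

# Share of products bought at most twice; on the real data a threshold
# such as 500 would correspond to the statement in the text
share_low <- mean(product_counts$times_bought <= 2)
print(share_low)
```

A histogram of `times_bought` would then show the distribution of purchase counts directly.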

4. Customer profile

· Gender (Gender)

ggplot(data) + geom_density(aes(Purchase, fill = Gender))
### total spend by gender
Sex <- data %>% group_by(Gender) %>% summarise(sum_Purchase = sum(Purchase))
lab <- c('F', 'M')
pert <- round((Sex$sum_Purchase / sum(Sex$sum_Purchase)) * 100, 2)   # percentage of total spend
lab2 <- str_c(lab, pert, '%', sep = ' ')
pie(Sex$sum_Purchase, lab2, col = rainbow(2), main = 'Share of total spending by gender')

· Age (Age)

ggplot(data, aes(Age, fill = Age)) + geom_bar()
ggplot(data, aes(Purchase, col = Age)) + geom_freqpoly()

· City category (City_Category), years of residence (Stay_In_Current_City_Years) and total purchase amount of the three city categories (ABCsum)

User %>% ggplot + geom_bar(aes(City_Category, fill = Stay_In_Current_City_Years), position = 'dodge')
User %>% group_by(City_Category) %>% group_size
> 1045 1707 3139
## number of customers in city categories A, B and C respectively
User %>% group_by(City_Category) %>% summarise(ABCsum = sum(User_Purchase_sum))
>
  City_Category     ABCsum
  <chr>              <dbl>
1 A             1295668797
2 B             2083431612
3 C             1638567969

· Marital status

ggplot(data) + geom_density(aes(Purchase, fill = factor(Marital_Status))) + facet_grid(Marital_Status ~ .)
t.test(data$Purchase ~ data$Marital_Status)
>
        Welch Two Sample t-test

data:  data$Purchase by data$Marital_Status
t = -0.094631, df = 473150, p-value = 0.9246
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -28.38174  25.76732
sample estimates:
mean in group 0 mean in group 1
       9333.325        9334.633

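The plots show:

· Gender: the density plot of purchase amounts shows women more concentrated than men in the low purchase range, while the pie chart of total spending shows men accounting for 76.79% of the total, so men dominate the store's sales.

· Age: customers aged 26-35 buy the most products and also spend the most, probably because people in this age group are working and have an income. The 36-45 group comes next. The 55+ group is among the smallest (3.9% of the records; only the 0-17 group is smaller, at 2.7%), which is plausible since people over 55 are mostly middle-aged and elderly and have less demand for these products.

· City: category C has the most customers, but category B has the largest total purchase amount. In all three city categories the number of customers by years of residence generally ranks 1 year > 2 years > 3 years > 4+ years > 0 years.

· Marital status: purchase amounts do not differ noticeably between married and unmarried customers. The t-test (p = 0.9246) likewise finds no significant difference in mean purchase amount, so marital status shows no meaningful relationship with purchase amount.

In summary, the customer profile with the strongest purchasing power is most likely a man aged 26-35 who lives in a category B city and has lived there for 1 year.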
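To make the marital-status test concrete, the sketch below computes the Welch statistic by hand and compares it with `t.test()`. The data are simulated stand-ins (the group sizes, means and standard deviation are arbitrary assumptions), not the real purchase records.

```r
# Hypothetical purchase amounts for two groups (simulated, not the real data)
set.seed(1)
purchase <- c(rnorm(200, mean = 9300, sd = 5000),
              rnorm(150, mean = 9350, sd = 5000))
married  <- rep(c(0, 1), times = c(200, 150))

x0 <- purchase[married == 0]
x1 <- purchase[married == 1]

# Welch statistic: difference in means over its unpooled standard error
se     <- sqrt(var(x0) / length(x0) + var(x1) / length(x1))
t_hand <- (mean(x0) - mean(x1)) / se

# Same test through the built-in function used in the text
res <- t.test(purchase ~ married)

print(c(by_hand = t_hand, t_test = unname(res$statistic)))
print(res$p.value)
```

A large p-value means the data give no evidence of a difference in means; it does not by itself prove that the two variables are independent.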

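V. Building the regression model

1. Preliminary modelling

Assume that a customer's total purchase amount can be used to describe the customer's spending power.

Since the other variables are all categorical, a multiple nonlinear regression model with dummy variables is built. Occupation enters as a quantitative explanatory variable, together with 1 dummy variable for Gender, 6 for Age, 2 for City_Category and 4 for Stay_In_Current_City_Years; total purchase amount is the dependent variable.

Dummy variables: they are used in a regression model when the dependent variable is affected by non-quantitative explanatory variables. (See an econometrics text for a detailed introduction.)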
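A categorical variable with k levels needs k - 1 dummy variables, with the omitted level acting as the baseline. The sketch below shows this with base R's `model.matrix()` and confirms that hand-written `ifelse()` dummies give the same columns. The data frame `toy` is invented for illustration.

```r
# Hypothetical customers (values invented for illustration)
toy <- data.frame(
  City_Category = factor(c('A', 'B', 'C', 'B', 'A')),
  Gender        = factor(c('F', 'M', 'M', 'F', 'M'))
)

# model.matrix() expands each factor into (levels - 1) 0/1 columns,
# using the first level ('A', 'F') as the baseline
X <- model.matrix(~ City_Category + Gender, data = toy)
print(colnames(X))

# The same coding written by hand with ifelse()
D_B <- ifelse(toy$City_Category == 'B', 1, 0)
D_C <- ifelse(toy$City_Category == 'C', 1, 0)
print(all(D_B == X[, 'City_CategoryB'], D_C == X[, 'City_CategoryC']))
```

Passing factors directly to `lm()` produces this coding automatically, so explicit dummy columns are mainly useful when individual levels are to be dropped from the model one by one.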

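Preparing the variables

User <- mutate(User,
  D1  = ifelse(Gender == 'M', 1, 0),   # gender dummy; assumed coding, D1 appears in the model but its definition is missing from the original code
  D2  = ifelse(City_Category == 'C', 1, 0),                # city: baseline A
  D3  = ifelse(City_Category == 'B', 1, 0),
  D4  = ifelse(Stay_In_Current_City_Years == '0', 1, 0),   # years of residence: baseline 4+
  D5  = ifelse(Stay_In_Current_City_Years == '1', 1, 0),
  D6  = ifelse(Stay_In_Current_City_Years == '2', 1, 0),
  D7  = ifelse(Stay_In_Current_City_Years == '3', 1, 0),
  D8  = ifelse(Age == '0-17', 1, 0),                       # age: baseline 55+
  D9  = ifelse(Age == '18-25', 1, 0),
  D10 = ifelse(Age == '26-35', 1, 0),
  D11 = ifelse(Age == '36-45', 1, 0),
  D12 = ifelse(Age == '46-50', 1, 0),
  D13 = ifelse(Age == '51-55', 1, 0))
### User now contains the new dummy variables.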

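Correlation coefficients between the variables

### keep the total purchase amount, Occupation and the dummy variables needed for the model
lmUser <- select(User, User_Purchase_sum, Occupation, D1:D13)
### correlations among the explanatory variables
select(lmUser, -User_Purchase_sum) %>% cor() %>% corrplot(method = 'pie', type = 'lower')

The plot shows no strong correlation between these variables.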

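Checking the normality of the purchase amount

ggplot(lmUser, aes(User_Purchase_sum)) + geom_histogram(fill = 'blue')

The total purchase amount is not normally distributed, so a natural-log (ln) transform is tried.

lmUser <- mutate(lmUser, logsum = log(User_Purchase_sum))   # log of total purchase amount
ggplot(lmUser, aes(logsum)) + geom_histogram(fill = 'red')

After the ln transform the purchase amount matches a normal distribution fairly well, so the model is built on the transformed variable.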
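A histogram gives a visual impression; sample skewness puts a number on how much the log transform helps. The sketch below uses simulated log-normal totals as a stand-in for User_Purchase_sum (the parameters are arbitrary assumptions), and the helper `skewness()` is defined here rather than taken from a package.

```r
# Simulated right-skewed totals standing in for User_Purchase_sum
# (log-normal draws; the parameters are arbitrary assumptions)
set.seed(42)
total <- rlnorm(5000, meanlog = 13, sdlog = 0.9)

# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(total)
skew_log <- skewness(log(total))
print(c(raw = skew_raw, log = skew_log))
```

A skewness near 0 after the transform is consistent with a roughly symmetric distribution; a normal Q-Q plot (`qqnorm()`) would be a natural further check.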

2. Multiple nonlinear regression model with dummy variables

lmUser <- select(lmUser, -User_Purchase_sum)   # keep logsum and the explanatory variables
lm(logsum ~ ., lmUser) %>% summary

Call:
lm(formula = logsum ~ ., data = lmUser)

Residuals:
     Min       1Q   Median       3Q      Max
-2.80962 -0.70924  0.02891  0.72833  2.59265

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.9588957  0.0664883 194.905  < 2e-16 ***
Occupation   0.0012718  0.0019327   0.658 0.510541
D1           0.2805490  0.0269691  10.403  < 2e-16 ***
D2          -0.5971450  0.0331653 -18.005  < 2e-16 ***
D3           0.1086465  0.0361726   3.004 0.002680 **
D4          -0.0006983  0.0449729  -0.016 0.987612
D5          -0.0082382  0.0365330  -0.226 0.821597
D6          -0.0092085  0.0408443  -0.225 0.821635
D7           0.0238420  0.0423550   0.563 0.573519
D8           0.1275658  0.0784192   1.627 0.103851
D9           0.2619449  0.0557564   4.698 2.69e-06 ***
D10          0.3690978  0.0521661   7.075 1.66e-12 ***
D11          0.3186347  0.0548328   5.811 6.54e-09 ***
D12          0.2737147  0.0622415   4.398 1.11e-05 ***
D13          0.2252799  0.0635916   3.543 0.000399 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9185 on 5876 degrees of freedom
Multiple R-squared:  0.146,	Adjusted R-squared:  0.144
F-statistic: 71.76 on 14 and 5876 DF,  p-value: < 2.2e-16

Occupation, D4, D5, D6, D7 and D8 are not significant, so they are dropped and the regression is refitted.

lm(logsum ~ D1 + D2 + D3 + D9 + D10 + D11 + D12 + D13, lmUser) %>% summary

Call:
lm(formula = logsum ~ D1 + D2 + D3 + D9 + D10 + D11 + D12 + D13, data = lmUser)

Residuals:
     Min       1Q   Median       3Q      Max
-2.81903 -0.70742  0.03012  0.72836  2.60163

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.01512    0.05043 258.077  < 2e-16 ***
D1           0.28194    0.02661  10.596  < 2e-16 ***
D2          -0.59687    0.03313 -18.014  < 2e-16 ***
D3           0.11014    0.03613   3.048  0.00231 **
D9           0.21178    0.04736   4.472 7.91e-06 ***
D10          0.32089    0.04328   7.413 1.40e-13 ***
D11          0.27169    0.04651   5.842 5.44e-09 ***
D12          0.22541    0.05500   4.099 4.21e-05 ***
D13          0.17638    0.05649   3.122  0.00180 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9183 on 5882 degrees of freedom
Multiple R-squared:  0.1454,	Adjusted R-squared:  0.1443
F-statistic: 125.1 on 8 and 5882 DF,  p-value: < 2.2e-16

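Every remaining variable now has a p-value well below 0.05, so all of them are significant. However, the adjusted coefficient of determination (adjusted R-squared) is only 0.1443, meaning the model explains only about 14% of the variation in the log of total purchase amount and fits poorly overall. Finally, to write the model as an equation for the purchase amount itself, the fitted function has to be back-transformed from the ln scale by exponentiating it.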
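The back-transformation can be sketched as follows. The coefficients are copied from the reduced model's summary output; the helper `predict_total()` and the example profile are hypothetical, and the meaning of D1 = 1 depends on the gender coding, which the source does not show.

```r
# Coefficients copied from the reduced model's summary output
coefs <- c(Intercept = 13.01512, D1 = 0.28194, D2 = -0.59687, D3 = 0.11014,
           D9 = 0.21178, D10 = 0.32089, D11 = 0.27169, D12 = 0.22541,
           D13 = 0.17638)

# Hypothetical helper: predicted total purchase for a set of active dummies,
# obtained by exponentiating the fitted value on the ln scale
predict_total <- function(active) {
  exp(coefs[['Intercept']] + sum(coefs[active]))
}

# Example profile: D1 = 1 (gender dummy), D3 = 1 (city B), D10 = 1 (age 26-35)
pred <- predict_total(c('D1', 'D3', 'D10'))
print(pred)

# On the log scale a dummy's coefficient b is a multiplicative effect exp(b),
# shown here as a percentage change relative to the baseline level
pct_effect <- (exp(coefs[-1]) - 1) * 100
print(round(pct_effect, 1))
```

Because the log transform is nonlinear, exponentiating a fitted log value gives a typical (median-like) total rather than the expected mean, so these predictions tend to sit below the average total.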

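Strengths and weaknesses of the model:

The strength is that introducing dummy variables handles regression with categorical explanatory variables well. The weakness is that the underlying assumption may be far from reality: total purchase amount alone cannot fully describe a customer's purchasing power.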

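VI. Summary

The EDA shows that young people who have lived in their city for only a short time have strong purchasing power. This fits what one sees in practice: young people on short stays mostly do not budget carefully, buy whatever they are missing, live in the moment and have little to worry about. Regressing the total purchase amount directly with a multiple nonlinear regression predicts poorly. A tree-based method for classification and regression, such as random forest, would be worth trying, hopefully in a follow-up.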
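As a pointer towards the tree-based direction mentioned in the summary, here is a minimal sketch of a single regression tree fitted with the rpart package, a simpler relative of random forest. Everything in `toy` is simulated, including the assumed city effect, so the output says nothing about the real data.

```r
library(rpart)

# Simulated customers standing in for the per-user table (all values invented)
set.seed(7)
n <- 500
toy <- data.frame(
  Gender        = factor(sample(c('F', 'M'), n, replace = TRUE)),
  City_Category = factor(sample(c('A', 'B', 'C'), n, replace = TRUE)),
  Age           = factor(sample(c('18-25', '26-35', '36-45'), n, replace = TRUE))
)
# Log total spend with a city effect plus noise (arbitrary assumption)
toy$logsum <- 13 - 0.6 * (toy$City_Category == 'C') + rnorm(n, sd = 0.9)

# Regression tree (method = 'anova'); factors need no dummy coding
fit  <- rpart(logsum ~ Gender + City_Category + Age, data = toy, method = 'anova')
pred <- predict(fit, newdata = toy)

# In-sample R-squared, comparable in spirit to the lm() summary
r2 <- 1 - sum((toy$logsum - pred)^2) / sum((toy$logsum - mean(toy$logsum))^2)
print(fit)
print(r2)
```

An in-sample R-squared flatters any tree, so a fair comparison with the linear model would use a held-out test set; a random forest (for example via the randomForest package) would be the next step and is not shown here.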
