100字范文 > 基于RFM模型的电商零售数据分析

基于RFM模型的电商零售数据分析

时间：2020-12-08 17:51:14

相关推荐

基于RFM模型的电商零售数据分析

数据集介绍

E-Commerce Data

Actual transactions from UK retailer

数据集地址（kaggle）

/download/W_H_M_/12337579

该数据来源于kaggle，是一家注册在英国的电子商务网站的12月份-12月份之间的交易数据

该公司主要销售独特的全天候礼品，大部分出售对象是批发商。

数据集字段介绍

数据包含541910行，8个字段，字段内容为：

InvoiceNo: 订单编号，每笔交易有6个整数，退货订单编号开头有字母’C’。

StockCode: 产品编号，由5个整数组成。

Description: 产品描述。

Quantity: 产品数量，有负号的表示退货

InvoiceDate: 订单日期和时间。

UnitPrice: 单价（英镑），单位产品的价格。

CustomerID:客户编号，每个客户编号由5位数字组成。

Country: 国家的名称，每个客户所在国家/地区的名称。

分析目的/方向

RFM模型（方向）

RFM模型根据客户活跃程度和交易金额的贡献，进行客户价值细分的一种方法。

R（Recency）：最近一次交易时间间隔。基于最近一次交易日期计算的得分，距离当前日期越近，得分越高。如5分制。反映客户交易活跃度。

F（Frequency）：客户在最近一段时间内交易次数。基于交易频率计算的得分，交易频率越高，得分越高。如5分制。反映客户交易活跃度。

M（Monetray）：客户最近一段时间内交易金额。基于交易金额计算的得分，交易金额越高，得分越高。如5分制。反映客户价值。

RFM总分值：RFM=RS∗100+FS∗10+MS∗1

分析流程

读取数据

import warningswarnings.filterwarnings('ignore') #忽略报警信息import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsimport osimport plotly as pyimport plotly.graph_objs as gopyplot = py.offline.iplotpy.offline.init_notebook_mode()os.chdir(r'C:\Users\W\Desktop')df = pd.read_csv('data.csv',encoding = 'ISO-8859-1',dtype = {'CustomerID':str})df.shape

(541909, 8)

df.info()

<class 'pandas.core.frame.DataFrame'>RangeIndex: 541909 entries, 0 to 541908Data columns (total 8 columns):InvoiceNo541909 non-null objectStockCode541909 non-null objectDescription 540455 non-null objectQuantity 541909 non-null int64InvoiceDate 541909 non-null objectUnitPrice541909 non-null float64CustomerID406829 non-null objectCountry 541909 non-null objectdtypes: float64(1), int64(1), object(6)memory usage: 33.1+ MB

df.head()

数据清洗

统计缺失率

df.apply(lambda x :sum(x.isnull())/len(x),axis=0)

InvoiceNo0.000000StockCode0.000000Description 0.002683Quantity 0.000000InvoiceDate 0.000000UnitPrice0.000000CustomerID0.249267Country 0.000000dtype: float64

删除不用数据

df.drop(['Description'],axis=1,inplace=True)df.head()

缺失值处理

缺失值一般不会删掉，而是标识

df['CustomerID'] = df['CustomerID'].fillna('U')

增加销售额字段

df['amount'] = df['Quantity']*df['UnitPrice']df.head()

日期处理

#使用字符串方法获取日期df['date'] = [x.split(' ')[0] for x in df['InvoiceDate']]# # 将订单日期由字符串转换为标准日期格式# # errors = 'coerce' 错误转换为缺失值（容错处理）# df['InvoiceDate'] = pd.to_datetime(df1['InvoiceDate'],errors = 'coerce')# # 提取日期部分# df['InvoiceDate'] = df1['InvoiceDate'].dt.date#获取时间df['time'] = [x.split(' ')[1] for x in df['InvoiceDate']]df['year'] = [x.split('/')[2] for x in df['date']]df['month'] = [x.split('/')[0] for x in df['date']]df['day'] = [x.split('/')[1] for x in df['date']]df.drop(['InvoiceDate'],axis=1,inplace=True)df['date'] = pd.to_datetime(df['date'])df[['date','time','year','month','day']]

去重

df = df.drop_duplicates()df.describe()

异常值处理

df2 = df.loc[df['UnitPrice']<=0]df2.shape[0]/df.shape[0]

0.0046809866595607106

df2['UnitPrice'].groupby(df2['UnitPrice']).count()

UnitPrice-11062.06 20.00 2510Name: UnitPrice, dtype: int64

df2['UnitPrice']

622 0.019700.019710.019720.019870.0... 536981 0.0538504 0.0538505 0.0538554 0.0538919 0.0Name: UnitPrice, Length: 2512, dtype: float64

数据分析

月退货情况

df1 = df.loc[df['Quantity']<=0]df1

tt = pd.pivot_table(df1,index='year',columns='month',values='amount',aggfunc={'amount':np.sum})tt

#提出退货数据以及异常数据-》df2df2 = df[(df['Quantity']>0) &(df['UnitPrice']>0)]df2

pp = pd.pivot_table(df2,index='year',columns='month',values='amount',aggfunc={'amount':np.sum})pp

#退货情况np.abs(tt/pp)

#退货率平均数np.abs(tt/pp).loc[''].mean()

0.09249438130255831

用户分类 RFM

#用户最近一次消费时间R_value = df2.groupby('CustomerID')['date'].max()R_value

CustomerID12346 -01-1812347 -12-0712348 -09-2512349 -11-2112350 -02-02... 18281 -06-1218282 -12-0218283 -12-0618287 -10-28U -12-09Name: date, Length: 4339, dtype: datetime64[ns]

#客户最近一次消费距参考时间间隔R_value = (df2['date'].max() - R_value).dt.daysR_value

CustomerID12346 3251234721234875123491812350 310... 18281 1801828271828331828742U0Name: date, Length: 4339, dtype: int64

#每个客户所下的订单数量F_value = df2.groupby('CustomerID')['InvoiceNo'].nunique()F_value

CustomerID12346 112347 712348 412349 112350 1... 18281 118282 2182831618287 3U 1428Name: InvoiceNo, Length: 4339, dtype: int64

#每个客户的消费总金额M_value = df2.groupby('CustomerID')['amount'].sum()M_value

CustomerID12346 7.718360e+0412347 4.310000e+0312348 1.797240e+0312349 1.757550e+0312350 3.344000e+02...18281 8.082000e+0118282 1.780500e+0218283 2.045530e+0318287 1.837280e+03U 1.754902e+06Name: amount, Length: 4339, dtype: float64

R_value.describe()

count 4339.000000mean 92.038258std 100.010502min 0.00000025% 17.00000050% 50.00000075% 141.500000max 373.000000Name: date, dtype: float64

import seaborn as snssns.set(style = 'darkgrid')plt.hist(R_value,bins=30)plt.show()

#判断异常是否严重：均值和中位数的差距是否很大？M_value.describe()

count 4.339000e+03mean2.452537e+03std2.808589e+04min3.750000e+0025%3.065050e+0250%6.685800e+0275%1.660890e+03max1.754902e+06Name: amount, dtype: float64

plt.hist(M_value,bins=30)plt.show()

#显示消费金额小于两千的用户数量（取均值，剔除异常值影响）plt.hist(M_value[M_value<2000],bins=30)plt.show()

F_value.describe()

count 4339.000000mean 4.600138std 22.943499min 1.00000025% 1.00000050% 2.00000075% 5.000000max1428.000000Name: InvoiceNo, dtype: float64

#分位数观察出异常值还是很严重F_value.quantile([0.1,0.2,0.3,0.4,0.5,0.9,1])

0.1 1.00.2 1.00.3 1.00.4 2.00.5 2.00.9 9.01.0 1428.0Name: InvoiceNo, dtype: float64

#异常值很严重plt.hist(F_value,bins=30)plt.show()

plt.hist(F_value[F_value<30],bins=30)plt.show()

#五级R_bins = [0,30,90,180,360,720]F_bins = [1,2,5,10,20,5000]M_bins = [0,500,2000,5000,10000,200000]

#0<=x<30, 30<=x<90 right=FalseR_score = pd.cut(R_value,R_bins,labels=[5,4,3,2,1],right=False)R_score

CustomerID12346 212347 512348 412349 512350 2..18281 218282 518283 518287 4U 5Name: date, Length: 4339, dtype: categoryCategories (5, int64): [5 < 4 < 3 < 2 < 1]

F_score = pd.cut(F_value,F_bins,labels=[1,2,3,4,5],right=False)F_score

CustomerID12346 112347 312348 212349 112350 1..18281 118282 218283 418287 2U 5Name: InvoiceNo, Length: 4339, dtype: categoryCategories (5, int64): [1 < 2 < 3 < 4 < 5]

M_score = pd.cut(M_value,M_bins,labels=[1,2,3,4,5],right=False)M_score

CustomerID123465123473123482123492123501... 182811182821182833182872U NaNName: amount, Length: 4339, dtype: categoryCategories (5, int64): [1 < 2 < 3 < 4 < 5]

rfm = pd.concat([R_score,F_score,M_score],axis=1)rfm.rename(columns={'date':'R_score','InvoiceNo':'F_score','amount':'M_score'},inplace=True)rfm

rfm.info()

<class 'pandas.core.frame.DataFrame'>Index: 4339 entries, 12346 to UData columns (total 3 columns):R_score 4339 non-null categoryF_score 4339 non-null categoryM_score 4336 non-null categorydtypes: category(3)memory usage: 47.2+ KB

for i in ['R_score','F_score','M_score']:rfm[i]=rfm[i].astype(float)rfm.describe()

rfm['R'] = np.where(rfm['R_score']>3.82,'高','低')rfm['F'] = np.where(rfm['F_score']>2.03,'高','低')rfm['M'] = np.where(rfm['M_score']>1.89,'高','低')rfm

rfm['value']=rfm['R'].str[:]+rfm['F'].str[:]+rfm['M'].str[:]rfm

rfm['value'] =rfm['value'].str.strip()def trans_value(x):if x=='高高高':return '高价值客户'elif x=='高低高':return '重点发展客户'elif x=='低高高':return '重点保持客户'elif x=='低低高':return '重点挽留客户'elif x=='高高低':return '一般价值客户'elif x=='高低低':return '一般发展客户'elif x=='低高低':return '一般保持客户'else:return '潜在客户' rfm['用户等级']=rfm['value'].apply(trans_value)rfm

用户等级结果

rfm['用户等级'].value_counts()

高价值客户1034重点发展客户 1017潜在客户 923一般发展客户829重点挽留客户455重点保持客户66一般价值客户10一般保持客户 5Name: 用户等级, dtype: int64

trace_basic = [go.Bar(x=rfm['用户等级'].value_counts().index,y=rfm['用户等级'].value_counts().values,marker=dict(color='orange'),opacity=0.50)]layout = go.Layout(title='用户等级情况',xaxis=dict(title='用户重要度'))figure_basic = go.Figure(data = trace_basic,layout =layout)pyplot(figure_basic)

trace_basic = [go.Pie(labels=rfm['用户等级'].value_counts().index,values=rfm['用户等级'].value_counts().values,textfont=dict(size=12,color='white'))]layout = go.Layout(title='用户等级比例')figure_basic = go.Figure(data = trace_basic,layout =layout)pyplot(figure_basic)

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。