22_多易教育: "yiee Data Operations System" User Profiling, Predicting Gender from Consumption Behavior

Time: 2023-05-31 08:26:51

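1. Gender Prediction Requirement

2. Algorithm Selection: Logistic Regression Classification

3. Feature Engineering: Feature Selection

4. Feature Engineering: Data Processing

5. Machine Learning: Model Training

6. Machine Learning: Gender Prediction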

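1. Gender Prediction Requirement

The gender a user enters at registration is quite likely to be arbitrary,

so it cannot be fully relied on as the gender attribute of a user profile.

When the user's real gender cannot be obtained directly, it has to be predicted from the user's behavioral features.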

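2. Algorithm Selection: Logistic Regression Classification

Naive Bayes can be used for this task;

logistic regression can be used as well, and it is the algorithm used in the rest of this article.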
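To make the choice of algorithm concrete, here is a minimal sketch of how a trained binary logistic regression model turns a feature vector into a class. It is not the article's Spark code: the object name `LogisticSketch`, the weights, the bias, and the 0.5 threshold are all assumptions for illustration.

```scala
object LogisticSketch {
  // The sigmoid squashes any real-valued score into the interval (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of class 1 for one feature vector, given (hypothetical) weights and bias.
  def probability(weights: Array[Double], bias: Double, features: Array[Double]): Double = {
    require(weights.length == features.length, "weights and features must have the same length")
    val score = weights.zip(features).map { case (w, x) => w * x }.sum + bias
    sigmoid(score)
  }

  // Predict class 1.0 when the probability reaches the threshold, otherwise 0.0.
  def predict(weights: Array[Double], bias: Double, features: Array[Double],
              threshold: Double = 0.5): Double =
    if (probability(weights, bias, features) >= threshold) 1.0 else 0.0
}
```

Training amounts to finding the weights and bias that fit the labeled samples; in this article that step is delegated to Spark's `LogisticRegression`.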

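3. Feature Engineering: Feature Selection

Given the following sample feature data, the task is to predict gender from consumption behavior for unlabeled data.

category1: the most purchased category in the last 30 days

category2: the second most purchased category in the last 30 days

category3: the third most purchased category in the last 30 days

brand1: the most purchased brand in the last 30 days

brand2: the second most purchased brand in the last 30 days

brand3: the third most purchased brand in the last 30 days

day30_buy_cnts: number of orders placed in the last 30 days

day30_buy_amt: total amount spent in the last 30 days

Further features could be added, for example the top 10 interest keywords of the last 30 days, and so on.

Sample data (CSV):

label,gid,category1,category2,category3,brand1,brand2,brand3,day30_buy_cnts,day30_buy_amt
0.0,1,105.0,106.0,102.0,1101.0,1108.0,1109.0,20.0,100.0
0,2,105,107,102,1101,1108,1105,25,80
0,3,106,104,102,1102,1108,1109,20,100
0,4,106,107,105,1103,1108,1105,30,90
0,5,112,107,105,2103,1108,1105,38,60
1,6,112,116,112,2101,2107,2109,10,3000
1,7,115,117,112,2103,2107,2105,9,1800
1,8,112,118,113,2102,2108,2109,10,1009
1,9,116,113,118,2103,2106,2105,5,2000
1,10,115,117,102,2101,2108,2105,8,800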

4. Feature Engineering: Data Processing

For the data above, it is enough to convert each row directly into a feature vector.
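The conversion can be sketched without Spark as follows. This is an illustration only: the names `RowToVector`, `Sample`, and `parse` are hypothetical, and the column order is assumed to be the one in the sample CSV header (label, gid, then the eight features). As in the article's own code, category and brand IDs are passed through as plain numbers.

```scala
object RowToVector {
  // One labeled sample: user id, class label, and the 8 numeric features.
  final case class Sample(gid: Int, label: Double, features: Array[Double])

  // Parse one CSV data line laid out as:
  // label,gid,category1,category2,category3,brand1,brand2,brand3,day30_buy_cnts,day30_buy_amt
  def parse(line: String): Sample = {
    val cols = line.split(",").map(_.trim)
    require(cols.length == 10, s"expected 10 columns, got ${cols.length}")
    val label = cols(0).toDouble
    val gid = cols(1).toDouble.toInt // tolerates "1" as well as "1.0"
    val features = cols.drop(2).map(_.toDouble) // the remaining 8 columns, in order
    Sample(gid, label, features)
  }
}
```

For example, `RowToVector.parse("0,2,105,107,102,1101,1108,1105,25,80")` yields a sample with `gid` 2, `label` 0.0, and an eight-element feature array that could then be handed to `Vectors.dense`.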

5. Machine Learning: Model Training

// Project-local helper that builds a SparkSession; its source is not part of
// this article, and the package path is kept exactly as given.
import mons.utils.SparkUtil
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row

/**
 * @date: /9/26
 * @site:
 * @author: hunter.d 涛哥
 * @qq: 657270652
 * @description:
 */
object GenderPre {
  def main(args: Array[String]): Unit = {
    val spark = SparkUtil.getSparkSession(this.getClass.getSimpleName)
    import spark.implicits._

    // Load the labeled training samples.
    val dfsample = spark.read.option("header", true).option("inferSchema", true).csv("rec_system/data/lgreg/lgreg_sample.csv")
    dfsample.printSchema()

    // Load the test set.
    val dftest = spark.read.option("header", true).option("inferSchema", true).csv("rec_system/data/lgreg/lgreg_test.csv")
    dftest.printSchema()

    // Turn each row into (gid, label, feature vector).
    // The pattern assumes inferSchema yields Int for gid and Double for every other column.
    val vecsample = dfsample.map({
      case Row(label: Double, gid: Int, category1: Double, category2: Double, category3: Double, brand1: Double, brand2: Double, brand3: Double, day30_buy_cnts: Double, day30_buy_amt: Double) => {
        val fts = Array(category1, category2, category3, brand1, brand2, brand3, day30_buy_cnts, day30_buy_amt)
        (gid, label, Vectors.dense(fts))
      }
    }).toDF("gid", "label", "features")
    vecsample.show(10, false)

    val vectest = dftest.map({
      case Row(label: Double, gid: Int, category1: Double, category2: Double, category3: Double, brand1: Double, brand2: Double, brand3: Double, day30_buy_cnts: Double, day30_buy_amt: Double) => {
        val fts = Array(category1, category2, category3, brand1, brand2, brand3, day30_buy_cnts, day30_buy_amt)
        (gid, label, Vectors.dense(fts))
      }
    }).toDF("gid", "label", "features")

    // Logistic regression with a regularization parameter of 0.5.
    val lg = new LogisticRegression()
      .setRegParam(0.5)
      .setLabelCol("label")
      .setFeaturesCol("features")

    // Train on the samples, then predict on the test set.
    val res = lg.fit(vecsample).transform(vectest)
    res.show(10, false)

    // MulticlassMetrics expects (prediction, label) pairs, in that order.
    val rd = res.rdd.map(row => (row.getAs[Double]("prediction"), row.getAs[Double]("label")))
    rd.take(10).foreach(println)

    /**
     * Model evaluation
     */
    val mc = new MulticlassMetrics(rd)
    // accuracy
    println(mc.accuracy)
    // precision for class 1.0
    println(mc.precision(1.0))
    // confusion matrix
    val iter = mc.confusionMatrix.rowIter
    iter.foreach(println)
    // recall for class 1.0
    println(mc.recall(1.0))

    spark.close()
  }
}

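6. Machine Learning: Gender Prediction

The prediction step fits the model on the sample vectors, applies it to the test vectors, and then evaluates the result:

val res = lg.fit(vecsample).transform(vectest)
res.show(10, false)

// MulticlassMetrics expects (prediction, label) pairs, in that order.
val rd = res.rdd.map(row => (row.getAs[Double]("prediction"), row.getAs[Double]("label")))
rd.take(10).foreach(println)

/**
 * Model evaluation
 */
val mc = new MulticlassMetrics(rd)
// accuracy
println(mc.accuracy)
// precision for class 1.0
println(mc.precision(1.0))
// confusion matrix
val iter = mc.confusionMatrix.rowIter
iter.foreach(println)
// recall for class 1.0
println(mc.recall(1.0))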
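To show what the printed evaluation numbers mean, the following Spark-free sketch computes accuracy, precision, and recall for one class from (prediction, label) pairs. The names `BinaryMetricsSketch` and `Counts` are hypothetical and the code only illustrates the definitions; it is not how `MulticlassMetrics` is implemented.

```scala
object BinaryMetricsSketch {
  // Cells of a 2x2 confusion matrix for a chosen positive class.
  final case class Counts(tp: Int, fp: Int, fn: Int, tn: Int)

  // Each pair is (prediction, label), the same order MulticlassMetrics expects.
  def count(pairs: Seq[(Double, Double)], positive: Double = 1.0): Counts =
    pairs.foldLeft(Counts(0, 0, 0, 0)) { case (c, (pred, label)) =>
      (pred == positive, label == positive) match {
        case (true, true)   => c.copy(tp = c.tp + 1) // predicted positive, truly positive
        case (true, false)  => c.copy(fp = c.fp + 1) // predicted positive, truly negative
        case (false, true)  => c.copy(fn = c.fn + 1) // predicted negative, truly positive
        case (false, false) => c.copy(tn = c.tn + 1) // predicted negative, truly negative
      }
    }

  // Share of all predictions that are correct.
  def accuracy(c: Counts): Double =
    (c.tp + c.tn).toDouble / (c.tp + c.fp + c.fn + c.tn)

  // Share of positive predictions that are correct (0.0 when there are none).
  def precision(c: Counts): Double =
    if (c.tp + c.fp == 0) 0.0 else c.tp.toDouble / (c.tp + c.fp)

  // Share of true positives that were found (0.0 when there are none).
  def recall(c: Counts): Double =
    if (c.tp + c.fn == 0) 0.0 else c.tp.toDouble / (c.tp + c.fn)
}
```

Swapping the two elements of each pair leaves accuracy unchanged but exchanges precision and recall, so the order of the pairs matters when reading those two numbers.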