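Contents
1. Gender prediction: the requirement
2. Algorithm choice: logistic regression classification
3. Feature engineering: selecting features
4. Feature engineering: data processing
5. Machine learning: model training
6. Machine learning: gender prediction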
多易教育: focused on big data training; market-leading courses and a strong start to your career
多易教育 official website
多易教育 online learning platform
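1. Gender prediction: the requirement
The gender a user fills in at registration is quite likely to be arbitrary, so it cannot be fully trusted as the gender attribute of a user profile.
When a user's real gender cannot be obtained by direct means, it has to be predicted from the user's behavioral features.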
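2. Algorithm choice: logistic regression classification
Naive Bayes can be used for this task;
logistic regression works as well, and it is the algorithm used in the rest of this walkthrough.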
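To make the choice concrete, here is a minimal sketch of how a trained logistic regression model turns a feature vector into a 0/1 prediction: a weighted sum of the features is passed through the sigmoid function and thresholded at 0.5. The weights, bias, and the two-feature input are hypothetical values chosen for illustration, not parameters learned from the sample data, and the source does not say which label (0 or 1) stands for which gender.

```scala
object SigmoidSketch {
  // Logistic (sigmoid) function: maps any real-valued score into (0, 1)
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of the positive class (label 1.0) for one feature vector
  def predictProb(weights: Array[Double], bias: Double, features: Array[Double]): Double = {
    require(weights.length == features.length, "weights and features must have the same length")
    val z = bias + weights.zip(features).map { case (w, x) => w * x }.sum
    sigmoid(z)
  }

  // Threshold the probability at 0.5 to obtain a 0.0 / 1.0 label
  def predictLabel(weights: Array[Double], bias: Double, features: Array[Double]): Double =
    if (predictProb(weights, bias, features) >= 0.5) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    // Hypothetical weights for two features: (day30_buy_amt, day30_buy_cnts)
    val weights = Array(0.002, -0.05)
    val bias = -1.0
    println(predictProb(weights, bias, Array(3000.0, 10.0)))
    println(predictLabel(weights, bias, Array(80.0, 25.0)))
  }
}
```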
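3. Feature engineering: selecting features
Given the sample feature data below, the goal is to predict gender from purchase behavior for unlabeled users.
category1: the category bought most often in the last 30 days
category2: the category bought second most often in the last 30 days
category3: the category bought third most often in the last 30 days
brand1: the brand bought most often in the last 30 days
brand2: the brand bought second most often in the last 30 days
brand3: the brand bought third most often in the last 30 days
day30_buy_cnts: number of orders placed in the last 30 days
day30_buy_amt: total amount spent in the last 30 days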
More features could be added, for example the top 10 interest keywords of the last 30 days ......

Sample data (label is the known gender, gid is the user id):

label,gid,category1,category2,category3,brand1,brand2,brand3,day30_buy_cnts,day30_buy_amt
0.0,1,105.0,106.0,102.0,1101.0,1108.0,1109.0,20.0,100.0
0,2,105,107,102,1101,1108,1105,25,80
0,3,106,104,102,1102,1108,1109,20,100
0,4,106,107,105,1103,1108,1105,30,90
0,5,112,107,105,2103,1108,1105,38,60
1,6,112,116,112,2101,2107,2109,10,3000
1,7,115,117,112,2103,2107,2105,9,1800
1,8,112,118,113,2102,2108,2109,10,1009
1,9,116,113,118,2103,2106,2105,5,2000
1,10,115,117,102,2101,2108,2105,8,800
4. Feature engineering: data processing
For the data above, it is enough to convert each row into a feature vector.
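As an illustration of what "convert each row into a feature vector" means, the sketch below parses one CSV data row into a user id, a label, and a dense array of the eight features, in the column order listed earlier. It uses only the Scala standard library; the `Sample` case class and the `parse` helper are hypothetical names introduced here and are not part of the original project, which does the same conversion inside Spark.

```scala
object RowToVectorSketch {
  // One labelled sample: user id, label, and the dense feature values
  final case class Sample(gid: Int, label: Double, features: Array[Double])

  // Parse one CSV data row (not the header). Assumed column order:
  // label,gid,category1,category2,category3,brand1,brand2,brand3,day30_buy_cnts,day30_buy_amt
  def parse(line: String): Sample = {
    val cols = line.split(",").map(_.trim)
    require(cols.length == 10, s"expected 10 columns, got ${cols.length}")
    Sample(
      gid = cols(1).toDouble.toInt,          // tolerate "1" as well as "1.0"
      label = cols(0).toDouble,
      features = cols.drop(2).map(_.toDouble) // the remaining 8 columns are the features
    )
  }

  def main(args: Array[String]): Unit = {
    val s = parse("1,6,112,116,112,2101,2107,2109,10,3000")
    println(s"gid=${s.gid} label=${s.label} features=${s.features.mkString("[", ", ", "]")}")
  }
}
```

Note that the category and brand columns are id codes, so feeding them to the model as plain numbers is a simplification that matches the "simple" conversion described above.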
5. Machine learning: model training

// NOTE: the package path of SparkUtil appears truncated in the source; it is kept as-is
import mons.utils.SparkUtil
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.{Dataset, Row}

/**
 * @date: /9/26
 * @site:
 * @author: hunter.d 涛哥
 * @qq: 657270652
 * @description:
 */
object GenderPre {
  def main(args: Array[String]): Unit = {
    val spark = SparkUtil.getSparkSession(this.getClass.getSimpleName)
    import spark.implicits._

    // Load the labelled training sample
    val dfsample = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv("rec_system/data/lgreg/lgreg_sample.csv")
    dfsample.printSchema()

    // Load the test set
    val dftest = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv("rec_system/data/lgreg/lgreg_test.csv")
    dftest.printSchema()

    // Convert each row to (gid, label, feature vector).
    // The pattern match assumes inferSchema yields Int for gid and Double for every other column.
    val vecsample = dfsample.map({
      case Row(label: Double, gid: Int, category1: Double, category2: Double, category3: Double,
               brand1: Double, brand2: Double, brand3: Double, day30_buy_cnts: Double, day30_buy_amt: Double) => {
        val fts = Array(category1, category2, category3, brand1, brand2, brand3, day30_buy_cnts, day30_buy_amt)
        (gid, label, Vectors.dense(fts))
      }
    }).toDF("gid", "label", "features")
    vecsample.show(10, false)

    val vectest = dftest.map({
      case Row(label: Double, gid: Int, category1: Double, category2: Double, category3: Double,
               brand1: Double, brand2: Double, brand3: Double, day30_buy_cnts: Double, day30_buy_amt: Double) => {
        val fts = Array(category1, category2, category3, brand1, brand2, brand3, day30_buy_cnts, day30_buy_amt)
        (gid, label, Vectors.dense(fts))
      }
    }).toDF("gid", "label", "features")

    // Configure logistic regression with regularization strength 0.5
    val lg = new LogisticRegression()
      .setRegParam(0.5)
      .setLabelCol("label")
      .setFeaturesCol("features")

    // Train on the sample, then predict on the test set
    val res = lg.fit(vecsample).transform(vectest)
    res.show(10, false)

    // MulticlassMetrics expects (prediction, label) pairs, in that order
    val rd = res.rdd.map(row => (row.getAs[Double]("prediction"), row.getAs[Double]("label")))
    rd.take(10).foreach(println)

    /**
     * Model evaluation
     */
    val mc = new MulticlassMetrics(rd)
    // Accuracy
    println(mc.accuracy)
    // Precision for label 1.0
    println(mc.precision(1.0))
    // Confusion matrix
    val iter = mc.confusionMatrix.rowIter
    iter.foreach(println)
    // Recall for label 1.0
    println(mc.recall(1.0))

    spark.close()
  }
}
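To give a feel for what fitting a logistic regression model with a regularization parameter involves, here is a minimal sketch of batch gradient descent on the logistic loss with an L2 penalty, written in plain Scala. It is an illustration only: it is not the optimizer Spark uses internally, the learning rate and iteration count are arbitrary assumptions, and the one-feature toy data is made up so that the example stays small and stable.

```scala
object LogRegTrainSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Linear score: bias plus the weighted sum of the features
  def score(w: Array[Double], b: Double, x: Array[Double]): Double =
    b + w.indices.map(j => w(j) * x(j)).sum

  // Batch gradient descent on the mean logistic loss plus (regParam / 2) * ||w||^2
  def train(xs: Array[Array[Double]], ys: Array[Double],
            regParam: Double, learningRate: Double, iterations: Int): (Array[Double], Double) = {
    require(xs.nonEmpty && xs.length == ys.length, "need one label per sample")
    val n = xs.length
    val d = xs.head.length
    val w = Array.fill(d)(0.0)
    var b = 0.0
    for (_ <- 0 until iterations) {
      val gradW = Array.fill(d)(0.0)
      var gradB = 0.0
      for (i <- 0 until n) {
        // Prediction error for sample i drives the gradient
        val err = sigmoid(score(w, b, xs(i))) - ys(i)
        for (j <- 0 until d) gradW(j) += err * xs(i)(j)
        gradB += err
      }
      // The L2 penalty shrinks the weights; the bias is not penalized
      for (j <- 0 until d) w(j) -= learningRate * (gradW(j) / n + regParam * w(j))
      b -= learningRate * gradB / n
    }
    (w, b)
  }

  def main(args: Array[String]): Unit = {
    // Made-up, linearly separable toy data with a single feature
    val xs = Array(Array(-2.0), Array(-1.0), Array(1.0), Array(2.0))
    val ys = Array(0.0, 0.0, 1.0, 1.0)
    val (w, b) = train(xs, ys, regParam = 0.01, learningRate = 0.5, iterations = 500)
    println(sigmoid(score(w, b, Array(1.5))))
  }
}
```

A larger `regParam` pulls the weights harder toward zero, which is the trade-off controlled by `setRegParam(0.5)` in the training code.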
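6. Machine learning: gender prediction
The prediction and evaluation step, taken from the program in the previous section:

// Train on the sample, then predict on the test set
val res = lg.fit(vecsample).transform(vectest)
res.show(10, false)

// MulticlassMetrics expects (prediction, label) pairs, in that order
val rd = res.rdd.map(row => (row.getAs[Double]("prediction"), row.getAs[Double]("label")))
rd.take(10).foreach(println)

/**
 * Model evaluation
 */
val mc = new MulticlassMetrics(rd)
// Accuracy
println(mc.accuracy)
// Precision for label 1.0
println(mc.precision(1.0))
// Confusion matrix
val iter = mc.confusionMatrix.rowIter
iter.foreach(println)
// Recall for label 1.0
println(mc.recall(1.0))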
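For readers who want to see what the evaluation numbers mean, the sketch below computes accuracy, precision, and recall by hand from (prediction, label) pairs, using only the Scala standard library. The pairs in `main` are made up for illustration and are not results from the model above; the function names are hypothetical and simply mirror the metric names.

```scala
object MetricsSketch {
  // Every pair is (prediction, label), in that order

  // Share of samples whose prediction equals the label
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (p, l) => p == l }.toDouble / pairs.size

  // Of all samples predicted as cls, the share that really are cls
  def precision(pairs: Seq[(Double, Double)], cls: Double): Double = {
    val predicted = pairs.count(_._1 == cls)
    val truePositives = pairs.count { case (p, l) => p == cls && l == cls }
    if (predicted == 0) 0.0 else truePositives.toDouble / predicted
  }

  // Of all samples that really are cls, the share predicted as cls
  def recall(pairs: Seq[(Double, Double)], cls: Double): Double = {
    val actual = pairs.count(_._2 == cls)
    val truePositives = pairs.count { case (p, l) => p == cls && l == cls }
    if (actual == 0) 0.0 else truePositives.toDouble / actual
  }

  def main(args: Array[String]): Unit = {
    // Made-up (prediction, label) pairs
    val pairs = Seq(
      (1.0, 1.0), (1.0, 1.0), (1.0, 0.0), (1.0, 0.0),
      (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0)
    )
    println(accuracy(pairs))
    println(precision(pairs, 1.0))
    println(recall(pairs, 1.0))
  }
}
```

Swapping the two elements of each pair swaps precision and recall, which is why the order passed to the metrics class matters.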