
PUBG Player Placement Prediction - pubg (competition), reference model

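Time: 2022-05-18 01:48:54

1 PUBG Player Placement Prediction

---- Can you predict where a PUBG player will place when the match ends?

2 Project Background

2.1 Project Overview

PlayerUnknown's Battlegrounds (PUBG), commonly known in Chinese as "吃鸡" ("chicken dinner"), is a tactical competitive shooter and sandbox game of the battle royale type. Each match has up to 100 players, who are dropped onto an island (the battlegrounds) with nothing at the start. Players must collect resources on the island and fight other players inside a steadily shrinking safe zone in order to survive to the end.

The game offers a high degree of freedom: players can parachute from a plane, drive off-road vehicles, fight in the jungle, and grab loot. Watch out for enemies lying in ambush and try to be the last one standing.

2.2 Topics Covered

Basic sklearn operations

Basic data processing

Using basic machine learning algorithms

2.3 Dataset

This project provides a large amount of anonymized PUBG match statistics. Each row contains one player's post-match statistics, and the columns are the features. The data comes from all match types: solo, duo, and squad. There is no guarantee of 100 players per match, and each group has at most 4 members.

Files:

train_V2.csv - training set

test_V2.csv - test set

A sample of the dataset is shown in the figure below: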

Field descriptions:

Id [player ID]: Player's Id.

groupId [group ID]: ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.

matchId [match ID]: ID to identify match. There are no matches that are in both the training and testing set.

assists [number of assists]: Number of enemy players this player damaged that were killed by teammates.

boosts [number of boost items used]: Number of boost items used.

damageDealt [total damage]: Total damage dealt. Note: Self inflicted damage is subtracted.

DBNOs [number of enemies knocked down]: Number of enemy players knocked.

headshotKills [number of headshot kills]: Number of enemy players killed with headshots.

heals [number of healing items used]: Number of healing items used.

killPlace [kill ranking in this match]: Ranking in match of number of enemy players killed.

killPoints [kills-based Elo rating]: Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a "None".

kills [number of kills]: Number of enemy players killed.

killStreaks [kill streak]: Max number of enemy players killed in a short amount of time.

longestKill [longest kill distance]: Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.

matchDuration [match duration]: Duration of match in seconds.

matchType [match type (group size)]: String identifying the game mode that the data comes from. The standard modes are "solo", "duo", "squad", "solo-fpp", "duo-fpp", and "squad-fpp"; other modes are from events or custom matches.

maxPlace [worst placement in the match]: Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.

numGroups [number of groups]: Number of groups we have data for in the match.

rankPoints [Elo rating]: Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API's next version, so use with caution. Value of -1 takes place of "None".

revives [number of teammates revived]: Number of times this player revived teammates.

rideDistance [distance driven]: Total distance traveled in vehicles measured in meters.

roadKills [number of vehicle kills]: Number of kills while in a vehicle.

swimDistance [distance swum]: Total distance traveled by swimming measured in meters.

teamKills [number of teammates killed]: Number of times this player killed a teammate.

vehicleDestroys [number of vehicles destroyed]: Number of vehicles destroyed.

walkDistance [distance walked]: Total distance traveled on foot measured in meters.

weaponsAcquired [number of weapons collected]: Number of weapons picked up.

winPoints [win-based Elo rating]: Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a "None".

winPlacePerc [percentile placement]: The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.

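3 Evaluation

3.1 Evaluation Method

You must build a model that predicts each player's final placement from their end-of-match statistics, on a scale from 1 (first place) to 0 (last place).

The final result is evaluated by mean absolute error (MAE), that is, the mean absolute error between the predicted winPlacePerc and the true winPlacePerc.

3.2 Introduction to MAE (Mean Absolute Error)

MAE is the average of the absolute errors.

It reflects the actual size of the prediction errors well, because positive and negative errors cannot cancel each other out.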

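$$\mathrm{MAE}(X, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h\left(x^{(i)}\right) - y^{(i)}\right|$$

Here m is the number of samples, h(x^{(i)}) is the predicted value for sample i, and y^{(i)} is its true value.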

api:

sklearn.metrics.mean_absolute_error
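As a small worked example of the formula, the sketch below computes MAE by hand in plain Python. The function name mae_score and the four placement values are made up for illustration; in the project itself the sklearn function listed above is what gets used.

```python
def mae_score(y_true, y_pred):
    """Average of |prediction - truth| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical winPlacePerc values for four players
y_true = [1.0, 0.5, 0.25, 0.0]
y_pred = [0.9, 0.6, 0.25, 0.2]

# Absolute errors are 0.1, 0.1, 0.0 and 0.2, so the mean is 0.4 / 4
mae = mae_score(y_true, y_pred)
print(round(mae, 4))  # 0.1
```

Because the target lies between 0 and 1, an MAE of 0.1 means the prediction is off by ten percentile points on average.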

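4 Implementation (data analysis + random forest)

In the following analysis we explore the dataset and detect outliers.

We then train a random forest model on it and optimize that model.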

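Import the APIs needed for the basic data processing stage:

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

4.1 Loading the Data and Inspecting Basic Information

Load the data and look at its basic information.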

    train = pd.read_csv("./data/train_V2.csv")
    train.describe()

Output (8 rows × 25 columns in the notebook, with the middle columns elided; shown here transposed, one line per displayed column):

    column           count         mean          std           min           25%           50%           75%           max
    assists          4.446966e+06  2.338149e-01  5.885731e-01  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  2.200000e+01
    boosts           4.446966e+06  1.106908e+00  1.715794e+00  0.000000e+00  0.000000e+00  0.000000e+00  2.000000e+00  3.300000e+01
    damageDealt      4.446966e+06  1.307171e+02  1.707806e+02  0.000000e+00  0.000000e+00  8.424000e+01  1.860000e+02  6.616000e+03
    DBNOs            4.446966e+06  6.578755e-01  1.145743e+00  0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00  5.300000e+01
    headshotKills    4.446966e+06  2.268196e-01  6.021553e-01  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  6.400000e+01
    heals            4.446966e+06  1.370147e+00  2.679982e+00  0.000000e+00  0.000000e+00  0.000000e+00  2.000000e+00  8.000000e+01
    killPlace        4.446966e+06  4.759935e+01  2.746294e+01  1.000000e+00  2.400000e+01  4.700000e+01  7.100000e+01  1.010000e+02
    killPoints       4.446966e+06  5.050060e+02  6.275049e+02  0.000000e+00  0.000000e+00  0.000000e+00  1.172000e+03  2.170000e+03
    kills            4.446966e+06  9.247833e-01  1.558445e+00  0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00  7.200000e+01
    killStreaks      4.446966e+06  5.439551e-01  7.109721e-01  0.000000e+00  0.000000e+00  0.000000e+00  1.000000e+00  2.000000e+01
    ...
    revives          4.446966e+06  1.646590e-01  4.721671e-01  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  3.900000e+01
    rideDistance     4.446966e+06  6.061157e+02  1.498344e+03  0.000000e+00  0.000000e+00  0.000000e+00  1.909750e-01  4.071000e+04
    roadKills        4.446966e+06  3.496091e-03  7.337297e-02  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  1.800000e+01
    swimDistance     4.446966e+06  4.509322e+00  3.050220e+01  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  3.823000e+03
    teamKills        4.446966e+06  2.386841e-02  1.673935e-01  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  1.200000e+01
    vehicleDestroys  4.446966e+06  7.918208e-03  9.261157e-02  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  5.000000e+00
    walkDistance     4.446966e+06  1.154218e+03  1.183497e+03  0.000000e+00  1.551000e+02  6.856000e+02  1.976000e+03  2.578000e+04
    weaponsAcquired  4.446966e+06  3.660488e+00  2.456544e+00  0.000000e+00  2.000000e+00  3.000000e+00  5.000000e+00  2.360000e+02
    winPoints        4.446966e+06  6.064601e+02  7.397004e+02  0.000000e+00  0.000000e+00  0.000000e+00  1.495000e+03  2.013000e+03
    winPlacePerc     4.446965e+06  4.728216e-01  3.074050e-01  0.000000e+00  2.000000e-01  4.583000e-01  7.407000e-01  1.000000e+00

    train.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4446966 entries, 0 to 4446965
    Data columns (total 29 columns):
    Id                 object
    groupId            object
    matchId            object
    assists            int64
    boosts             int64
    damageDealt        float64
    DBNOs              int64
    headshotKills      int64
    heals              int64
    killPlace          int64
    killPoints         int64
    kills              int64
    killStreaks        int64
    longestKill        float64
    matchDuration      int64
    matchType          object
    maxPlace           int64
    numGroups          int64
    rankPoints         int64
    revives            int64
    rideDistance       float64
    roadKills          int64
    swimDistance       float64
    teamKills          int64
    vehicleDestroys    int64
    walkDistance       float64
    weaponsAcquired    int64
    winPoints          int64
    winPlacePerc       float64
    dtypes: float64(6), int64(19), object(4)
    memory usage: 983.9+ MB

The dataset contains 4,446,966 rows in total:

    train.shape

Output:

    (4446966, 29)

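4.2 Basic Data Processing

4.2.1 Handling Missing Values

Looking at the target, we find one unusual sample whose winPlacePerc is NaN, that is, the target value is missing.

Since only one player is affected, we simply delete that row.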

    # Look at the missing value
    train[train['winPlacePerc'].isnull()]

Output: 1 rows × 29 columns (index 2744604, Id f70c74418bb064, groupId 12dfbede33f92b, matchId 224a123c53e008, winPlacePerc NaN)

    # Delete the row with the missing target
    train.drop(2744604, inplace=True)
    train.shape

Output:

    (4446965, 29)

4.2.2 Normalizing Feature Data

4.2.2.1 Number of Players per Match

With the missing value handled, let us look at how many players join each match. Does every match really wait for 100 players before it starts?

    # Number of players in each match
    # transform works like a one-to-many mapping: it maps the per-group count back onto every row of that group
    count = train.groupby('matchId')['matchId'].transform('count')
    count

Output:

    0          96
    1          91
    2          98
    3          91
    4          97
               ..
    4446961    94
    4446962    93
    4446963    98
    4446964    94
    4446965    98
    Name: matchId, Length: 4446965, dtype: int64

    train['playersJoined'] = count
    count.count()

Output:

    4446965

    train.head()

Output: 5 rows × 30 columns

    # Sort by number of players per match, ascending
    train["playersJoined"].sort_values().head()

Output (first rows):

    1206365    2
    2109739    2
    3956552    5
    ...
    Name: playersJoined, dtype: int64

The result shows that the smallest match had only two players!

    # Plot the number of players at the start of each match
    # seaborn's countplot draws a bar chart of the counts directly
    plt.figure(figsize=(20,10))
    sns.countplot(x=train['playersJoined'])
    plt.title('playersJoined')
    plt.grid()
    plt.show()

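The plot shows that relatively few matches start with fewer than 75 players, and that most matches start with close to 100. We therefore restrict the plot to matches with at least 75 players and draw it again.

Conjecture: feeding this per-match player count into the later processing should make the results somewhat more accurate.

    # Plot the number of players per match again, for matches with at least 75 players
    plt.figure(figsize=(20,10))
    sns.countplot(x=train[train['playersJoined']>=75]['playersJoined'])
    plt.title('playersJoined')
    plt.grid()
    plt.show()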
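The transform('count') call used for playersJoined can be easier to follow on a tiny table. The sketch below uses a made-up five-row DataFrame (the names toy and per_match and the Id values are illustrative only) to contrast a plain grouped count, which returns one value per match, with transform, which returns one value per original row.

```python
import pandas as pd

# Hypothetical toy data: five player rows across two matches
toy = pd.DataFrame({
    "Id": ["p1", "p2", "p3", "p4", "p5"],
    "matchId": ["m1", "m1", "m1", "m2", "m2"],
})

# A plain grouped count collapses the data to one row per match ...
per_match = toy.groupby("matchId")["matchId"].count()

# ... while transform('count') broadcasts the group size back to every row,
# so the result lines up with the original index and can become a new column
toy["playersJoined"] = toy.groupby("matchId")["matchId"].transform("count")

print(per_match.to_dict())            # {'m1': 3, 'm2': 2}
print(toy["playersJoined"].tolist())  # [3, 3, 3, 2, 2]
```

The second form is what makes the later per-row normalization possible, since every player row carries the size of its own match.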

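4.2.2.2 Normalizing Selected Features

Now that we have the number of players per match, we can use it to re-examine other features and normalize them.

Consider: the kill counts from a match with only 70 players and from a match with 100 players should not be compared directly.

Features worth considering include:

1. kills (number of kills)

2. damageDealt (total damage)

3. maxPlace (worst placement in the match)

4. matchDuration (match duration)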

    # Normalize selected features by the number of players in the match
    train['killsNorm'] = train['kills']*((100-train['playersJoined'])/100 + 1)
    train['damageDealtNorm'] = train['damageDealt']*((100-train['playersJoined'])/100 + 1)
    train['maxPlaceNorm'] = train['maxPlace']*((100-train['playersJoined'])/100 + 1)
    train['matchDurationNorm'] = train['matchDuration']*((100-train['playersJoined'])/100 + 1)

    # Compare the normalized features with the original ones
    to_show = ['Id', 'kills','killsNorm','damageDealt', 'damageDealtNorm', 'maxPlace', 'maxPlaceNorm', 'matchDuration', 'matchDurationNorm']
    train[to_show][0:11]

Output:

        Id              kills  killsNorm  damageDealt  damageDealtNorm  maxPlace  maxPlaceNorm  matchDuration  matchDurationNorm
    0   7f96b2f878858a  0      0.00       0.000        0.00000          28        29.12         1306           1358.24
    1   eef90569b9d03c  0      0.00       91.470       99.70230         26        28.34         1777           1936.93
    2   1eaf90ac73de72  0      0.00       68.000       69.36000         50        51.00         1318           1344.36
    3   4616d365dd2853  0      0.00       32.900       35.86100         31        33.79         1436           1565.24
    4   315c96c26c9aac  1      1.03       100.000      103.00000        97        99.91         1424           1466.72
    5   ff79c12f326506  1      1.05       100.000      105.00000        28        29.40         1395           1464.75
    6   95959be0e21ca3  0      0.00       0.000        0.00000          28        28.84         1316           1355.48
    7   311b84c6ff4390  0      0.00       8.538        8.87952          96        99.84         1967           2045.68
    8   1a68204ccf9891  0      0.00       51.600       53.14800         28        28.84         1375           1416.25
    9   e5bb5a43587253  0      0.00       37.270       38.38810         29        29.87         1930           1987.90
    10  2b574d43972813  0      0.00       28.380       28.66380         29        29.29         1811           1829.11
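To make the scaling factor concrete, the sketch below applies the same expression, (100 - playersJoined)/100 + 1, to single numbers. The helper name normalize_by_players is hypothetical; the figures 91.47 and 91 are the damageDealt and playersJoined values of the second row shown above.

```python
def normalize_by_players(value, players_joined):
    """Scale a per-match statistic up when fewer than 100 players joined."""
    factor = (100 - players_joined) / 100 + 1
    return value * factor


# damageDealt = 91.47 in a 91-player match: factor is 1.09
print(round(normalize_by_players(91.47, 91), 4))  # 99.7023

# A full 100-player match has factor 1.0 and is left unchanged
print(normalize_by_players(5, 100))  # 5.0
```

The factor grows linearly as the lobby shrinks, so a 70-player match is scaled by 1.3. This is a simple heuristic rather than a statistically derived correction.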

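4.2.3 Combining Variables

Here we combine the features heals (number of healing items used) and boosts (number of boost items used) into a new variable named healsandboosts. This is an exploratory step and the result may not turn out to be useful; if it is not, we delete the variable at the end.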

    # Create the new variable "healsandboosts"
    train['healsandboosts'] = train['heals'] + train['boosts']
    train[["heals", "boosts", "healsandboosts"]].tail()

Output:

             heals  boosts  healsandboosts
    4446961  0      0       0
    4446962  0      0       0
    4446963  0      0       0
    4446964  2      4       6
    4446965  1      2       3

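4.2.4 Handling Outliers

4.2.4.1 Outliers: Removing Players with Kills but No Movement

Handling anomalous data:

Some rows contain statistics that are highly implausible, which suggests something is wrong with those players. To keep the trained model accurate, we remove these anomalous rows.

The steps below identify players who have kills in a match but never moved during the whole match.

Players of this kind are almost certainly anomalous (cheating), so we remove them.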

    # Create a new variable: total distance the player moved
    train['totalDistance'] = train['rideDistance'] + train['walkDistance'] + train['swimDistance']

    # Create a new variable: True if the player has kills but did not move at all, otherwise False
    train['killsWithoutMoving'] = ((train['kills'] > 0) & (train['totalDistance'] == 0))
    train["killsWithoutMoving"].head()

Output:

    0    False
    1    False
    2    False
    3    False
    4    False
    Name: killsWithoutMoving, dtype: bool

    train["killsWithoutMoving"].describe()

Output:

    count     4446965
    unique          2
    top         False
    freq      4445430
    Name: killsWithoutMoving, dtype: object

    # Check whether any rows have kills but no movement
    train[train['killsWithoutMoving'] == True].shape

Output:

    (1535, 37)

    train[train['killsWithoutMoving'] == True].head()

Output: 5 rows × 37 columns

    # Delete these rows
    train.drop(train[train['killsWithoutMoving'] == True].index, inplace=True)

4.2.4.2 Outliers: Removing Abnormal Vehicle Kill Counts

    # Players with more than ten vehicle kills
    train[train['roadKills'] > 10]

Output: 4 rows × 37 columns

    # Delete these rows
    train.drop(train[train['roadKills'] > 10].index, inplace=True)
    train.shape

Output:

    (4445426, 37)

4.2.4.3 Outliers: Removing Players with More Than 30 Kills in a Match

    # First draw a bar chart of the kill counts
    plt.figure(figsize=(10,4))
    sns.countplot(data=train, x='kills').set_title('Kills')
    plt.show()
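The removals in 4.2.4.1 and 4.2.4.2 follow one pattern: build a boolean condition, inspect how many rows it matches, then drop them. The sketch below shows an alternative way to organize that pattern on a made-up four-row table, collecting the conditions in a dictionary and filtering once. The names toy, rules, flagged and tooManyRoadKills are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows; the column names follow the dataset
toy = pd.DataFrame({
    "kills":         [0, 3, 1, 12],
    "roadKills":     [0, 0, 0, 11],
    "totalDistance": [150.0, 0.0, 820.5, 4000.0],
})

# Each rule is a boolean Series; True marks a suspicious row
rules = {
    "killsWithoutMoving": (toy["kills"] > 0) & (toy["totalDistance"] == 0),
    "tooManyRoadKills": toy["roadKills"] > 10,
}

# Report how many rows each rule flags before removing anything
flagged = {name: int(mask.sum()) for name, mask in rules.items()}

# Combine the rules with OR and keep only the rows that no rule flags
suspicious = pd.concat(rules, axis=1).any(axis=1)
clean = toy[~suspicious]

print(flagged)               # {'killsWithoutMoving': 1, 'tooManyRoadKills': 1}
print(clean.index.tolist())  # [0, 2]
```

Filtering with a mask returns a new DataFrame instead of modifying the original in place, which makes it easier to compare row counts before and after each rule.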

    train[train['kills'] > 30].shape

Output:

    (95, 37)

    train[train['kills'] > 30].head()

Output: 5 rows × 37 columns

    # Delete the anomalous rows
    train.drop(train[train['kills'] > 30].index, inplace=True)

4.2.4.4 Outliers: Removing Abnormal Headshot Rates

A player whose headshot rate is extremely high is also suspicious.

    # Create the headshot rate variable
    train['headshot_rate'] = train['headshotKills'] / train['kills']
    train['headshot_rate'] = train['headshot_rate'].fillna(0)
    train["headshot_rate"].tail()

Output:

    4446961    0.0
    4446962    0.0
    4446963    0.0
    4446964    0.5
    4446965    0.0
    Name: headshot_rate, dtype: float64

    # Plot the headshot rate distribution
    plt.figure(figsize=(12,4))
    sns.distplot(train['headshot_rate'], bins=10)
    plt.show()
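The fillna(0) step in the headshot_rate computation matters because most players have zero kills, and dividing zero headshot kills by zero kills yields NaN in pandas. A minimal sketch on three made-up players (the values in toy are illustrative):

```python
import pandas as pd

# Hypothetical players: no kills, 1 of 4 headshots, 10 of 10 headshots
toy = pd.DataFrame({"kills": [0, 4, 10], "headshotKills": [0, 1, 10]})

# 0 / 0 produces NaN, so players without kills need a fill value
rate = toy["headshotKills"] / toy["kills"]
toy["headshot_rate"] = rate.fillna(0)

print(rate.isna().tolist())           # [True, False, False]
print(toy["headshot_rate"].tolist())  # [0.0, 0.25, 1.0]
```

The third player, with a rate of exactly 1.0 on ten kills, is the kind of row the next filter targets.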

    train[(train['headshot_rate'] == 1) & (train['kills'] > 9)].shape

Output:

    (24, 38)

    train[(train['headshot_rate'] == 1) & (train['kills'] > 9)].head()

Output: 5 rows × 38 columns

    train.drop(train[(train['headshot_rate'] == 1) & (train['kills'] > 9)].index, inplace=True)

4.2.4.5 Outliers: Removing Abnormal Longest Kill Distances

    # Plot the distribution
    plt.figure(figsize=(12,4))
    sns.distplot(train['longestKill'], bins=10)
    plt.show()

    # Find players whose longest kill distance is at least 1 km
    train[train['longestKill'] >= 1000].shape

Output:

    (20, 38)

    train[train['longestKill'] >= 1000]["longestKill"].head()

Output: the first five values are 1000.0, 1004.0, 1026.0, 1000.0 and 1075.0 (Name: longestKill, dtype: float64)

    train.drop(train[train['longestKill'] >= 1000].index, inplace=True)
    train.shape

Output:

    (4445287, 38)

4.2.4.6 Outliers: Removing Abnormal Travel Distances

    # Overall description of the distances
    train[['walkDistance', 'rideDistance', 'swimDistance', 'totalDistance']].describe()

Output:

           walkDistance  rideDistance  swimDistance  totalDistance
    count  4.445287e+06  4.445287e+06  4.445287e+06  4.445287e+06
    mean   1.154619e+03  6.063215e+02  4.510898e+00  1.765451e+03
    std    1.183508e+03  1.498562e+03  3.050738e+01  2.183248e+03
    min    0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
    25%    1.554000e+02  0.000000e+00  0.000000e+00  1.584000e+02
    50%    6.863000e+02  0.000000e+00  0.000000e+00  7.892500e+02
    75%    1.977000e+03  2.566000e-01  0.000000e+00  2.729000e+03
    max    2.578000e+04  4.071000e+04  3.823000e+03  4.127010e+04

a) Walking distance

    plt.figure(figsize=(12,4))
    sns.distplot(train['walkDistance'], bins=10)
    plt.show()

    train[train['walkDistance'] >= 10000].shape

Output:

    (219, 38)

    train[train['walkDistance'] >= 10000].head()

Output: 5 rows × 38 columns

    train.drop(train[train['walkDistance'] >= 10000].index, inplace=True)

b) Vehicle distance

    plt.figure(figsize=(12,4))
    sns.distplot(train['rideDistance'], bins=10)
    plt.show()

    train[train['rideDistance'] >= 20000].shape

Output:

    (150, 38)

    train[train['rideDistance'] >= 20000].head()

Output: 5 rows × 38 columns

    train.drop(train[train['rideDistance'] >= 20000].index, inplace=True)

c) Swimming distance

    plt.figure(figsize=(12,4))
    sns.distplot(train['swimDistance'], bins=10)
    plt.show()

    train[train['swimDistance'] >= 2000].shape

Output:

    (12, 38)

    train[train['swimDistance'] >= 2000][["swimDistance"]]

Output:

             swimDistance
    177973   2295.0
    274258   2148.0
    1005337  2718.0
    1195818  2668.0
    1227362  3823.0
    1889163  2484.0
    2065940  3514.0
    2327586  2387.0
    2784855  2206.0
    3359439  2338.0
    3513522  2124.0
    4132225  2382.0

    train.drop(train[train['swimDistance'] >= 2000].index, inplace=True)

4.2.4.7 Outliers: Abnormal Numbers of Weapons Collected

    plt.figure(figsize=(12,4))
    sns.distplot(train['weaponsAcquired'], bins=100)
    plt.show()

    train[train['weaponsAcquired'] >= 80].shape

Output:

    (19, 38)

    train[train['weaponsAcquired'] >= 80][['weaponsAcquired']].head()

Output:

             weaponsAcquired
    233643   128
    588387   80
    1437471  102
    1449293  95
    1592744  94

    train.drop(train[train['weaponsAcquired'] >= 80].index, inplace=True)

4.2.4.8 Outliers: Removing Abnormal Numbers of Healing Items Used

    plt.figure(figsize=(12,4))
    sns.distplot(train['heals'], bins=10)
    plt.show()

    train[train['heals'] >= 40].shape

Output:

    (135, 38)

    train[train['heals'] >= 40][["heals"]].head()

Output:

            heals
    18405   47
    54463   43
    126439  52
    259351  42
    268747  48

    train.drop(train[train['heals'] >= 40].index, inplace=True)
    train.shape

Output:

    (4444752, 38)

4.2.5 Handling Categorical Data

4.2.5.1 One-Hot Encoding the Match Type

    # There are 16 match types in total
    train['matchType'].unique()

Output:

    array(['squad-fpp', 'duo', 'solo-fpp', 'squad', 'duo-fpp', 'solo',
           'normal-squad-fpp', 'crashfpp', 'flaretpp', 'normal-solo-fpp',
           'flarefpp', 'normal-duo-fpp', 'normal-duo', 'normal-squad',
           'crashtpp', 'normal-solo'], dtype=object)

    # One-hot encode matchType
    # get_dummies appends the new indicator columns and returns a new DataFrame, so the result is assigned back to train
    train = pd.get_dummies(train, columns=['matchType'])
    train.head()

Output: 5 rows × 53 columns

    train.shape

Output:

    (4444752, 53)

    # Use a regex filter to look at the encoded columns
    matchType_encoding = train.filter(regex='matchType')
    matchType_encoding.head()

Output: 5 rows of 0/1 indicators across the 16 columns matchType_crashfpp, matchType_crashtpp, matchType_duo, matchType_duo-fpp, matchType_flarefpp, matchType_flaretpp, matchType_normal-duo, matchType_normal-duo-fpp, matchType_normal-solo, matchType_normal-solo-fpp, matchType_normal-squad, matchType_normal-squad-fpp, matchType_solo, matchType_solo-fpp, matchType_squad and matchType_squad-fpp. Rows 0, 1 and 3 are squad-fpp, row 2 is duo, and row 4 is solo-fpp.

4.2.5.2 Handling groupId and matchId

groupId and matchId are categorical data as well. However, they have an enormous number of distinct values, so one-hot encoding them would be suicidal. Instead we convert them into integer category codes, which does not get in the way of normal use.

    # Convert groupId and matchId to categorical types,
    # that is, turn hard-to-read strings into numbers
    # Convert groupId
    train["groupId"].head()

Output:

    0    4d4b580de459be
    1    684d5656442f9e
    2    6a4a42c3245a74
    3    a930a9c79cd721
    4    de04010b3458dd
    Name: groupId, dtype: object

    train['groupId'] = train['groupId'].astype('category')
    train["groupId"].head()

Output:

    0    4d4b580de459be
    1    684d5656442f9e
    2    6a4a42c3245a74
    3    a930a9c79cd721
    4    de04010b3458dd
    Name: groupId, dtype: category
    Categories (2026153, object): [00000c08b5be36, 00000d1cbbc340, 000025a09dd1d7, 000038ec4dff53, ..., fffff305a0133d, fffff32bc7eab9, fffff7edfc4050, fffff98178ef52]

    train["groupId_cat"] = train["groupId"].cat.codes
    train["groupId_cat"].head()

Output:

    0     613591
    1     827580
    2     843271
    3    1340070
    4    1757334
    Name: groupId_cat, dtype: int32

    # Convert matchId
    train['matchId'] = train['matchId'].astype('category')
    train['matchId_cat'] = train['matchId'].cat.codes

    # Drop the original columns
    train.drop(['groupId', 'matchId'], axis=1, inplace=True)

    # Look at the new columns
    train[['groupId_cat', 'matchId_cat']].head()

Output:

       groupId_cat  matchId_cat
    0  613591       30085
    1  827580       32751
    2  843271       3143
    3  1340070      45260
    4  1757334      20531

    train.head()

Output: 5 rows × 53 columns

4.2.6 Subsetting the Data

4.2.6.1 Using Part of the Data (1,000,000 Rows)

    # Randomly sample 1,000,000 rows for training
    sample = 1000000
    df_sample = train.sample(sample)
    df_sample.shape

Output:

    (1000000, 53)

4.2.7 Choosing Features and Target

    # Choose the features and the target
    df = df_sample.drop(["winPlacePerc", "Id"], axis=1)  # all columns except target
    y = df_sample['winPlacePerc']  # Only target variable
    df.head()

Output: 5 rows × 51 columns

    y.head()

Output:

    2324052    0.4074
    533207     0.6923
    325801     0.8000
    478373     0.4894
    1021200    0.0816
    Name: winPlacePerc, dtype: float64

    print(df.shape, y.shape)

Output:

    (1000000, 51) (1000000,)

4.2.8 Splitting into Training and Validation Sets

    # Helper function that splits data into a training part and a validation part
    def split_vals(a, n : int):
        # Note: "n: int" is a type hint; it says n should be an int, but this is not enforced
        return a[:n].copy(), a[n:].copy()

    val_perc = 0.12  # fraction to use for validation set
    n_valid = int(val_perc * sample)
    n_trn = len(df)-n_valid

    # Split the data
    raw_train, raw_valid = split_vals(df_sample, n_trn)
    X_train, X_valid = split_vals(df, n_trn)
    y_train, y_valid = split_vals(y, n_trn)

    # Check the dimensions
    print('Sample train shape: ', X_train.shape,
          '\nSample target shape: ', y_train.shape,
          '\nSample validation shape: ', X_valid.shape)

Output:

    Sample train shape:  (880000, 51)
    Sample target shape:  (880000,)
    Sample validation shape:  (120000, 51)

4.3 Machine Learning (Model Training) and Evaluation

    # Import the APIs needed for training and evaluation
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

4.3.1 First Training Run with a Random Forest

    # Train the model
    # n_jobs=-1 uses as many parallel jobs as there are CPU cores; a specific number sets how many cores to use
    m1 = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features='sqrt', n_jobs=-1)
    m1.fit(X_train, y_train)

Output:

    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
               max_features='sqrt', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=3, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=-1,
               oob_score=False, random_state=None, verbose=0, warm_start=False)

    y_pre = m1.predict(X_valid)
    m1.score(X_valid, y_valid)

Output:

    0.920974556316

    mean_absolute_error(y_true=y_valid, y_pred=y_pre)

Output:

    0.06134694645458625

After this first run, the model reaches a score of about 0.92 (for a regressor, score returns the coefficient of determination R², not a classification accuracy) and an MAE of about 0.06.

4.3.2 Training the Random Forest Again

Reduce the number of features to make model training more efficient.

    # Look at how important each feature is in the current model
    m1.feature_importances_

Output:

    array([2.78018429e-03, 7.09256632e-02, 1.29069372e-02, 2.66242891e-03,
           3.22533248e-03, 4.03021778e-02, 1.96128120e-01, 1.93034113e-03,
           6.83785312e-03, 7.15347972e-03, 1.41163125e-02, 9.47316429e-03,
           6.60520545e-03, 7.54587098e-03, 3.7764e-03, 2.36387992e-03,
           1.79286139e-02, 2.71369501e-05, 1.37579746e-03, 1.15328556e-04,
           2.24318554e-05, 2.84514166e-01, 4.87906687e-02, 2.25145604e-03,
           6.34793081e-03, 1.17778514e-02, 9.38100747e-03, 6.92348737e-03,
           1.26945601e-02, 3.77781917e-02, 1.57621968e-01, 0.00000000e+00,
           1.22717096e-03, 6.59880420e-05, 7.40618079e-07, 2.08886376e-04,
           4.90619346e-04, 4.92684576e-08, 2.07956404e-06, 2.76274752e-07,
           9.83807884e-05, 0.00000000e+00, 9.89167941e-06, 1.01437578e-06,
           2.77453616e-04, 2.41390349e-04, 1.06886660e-03, 1.08671409e-03,
           9.27714940e-04, 4.00129494e-03, 4.00750039e-03])

    imp_df = pd.DataFrame({"cols":df.columns, "imp":m1.feature_importances_})
    imp_df.head()

Output:

       cols           imp
    0  assists        0.002780
    1  boosts         0.070926
    2  damageDealt    0.012907
    3  DBNOs          0.002662
    4  headshotKills  0.003225

    imp_df = imp_df.sort_values("imp", ascending=False)
    imp_df.head()

Output:

        cols             imp
    21  walkDistance     0.284514
    6   killPlace        0.196128
    30  totalDistance    0.157622
    1   boosts           0.070926
    22  weaponsAcquired  0.048791

    # Plot a feature importance graph for the 20 most important features
    plot_fea = imp_df[:20].plot('cols', 'imp', figsize=(14,6), legend=False, kind = 'barh')
    plot_fea

Output:

    <matplotlib.axes._subplots.AxesSubplot at 0x1713427b8>

    # Keep only the more important features
    to_keep = imp_df[imp_df.imp>0.005].cols
    print('Significant features: ', len(to_keep))
    to_keep

Output:

    Significant features:  20
    21         walkDistance
    6             killPlace
    30        totalDistance
    1                boosts
    22      weaponsAcquired
    5                 heals
    29       healsandboosts
    16         rideDistance
    10          longestKill
    2           damageDealt
    28    matchDurationNorm
    25            killsNorm
    11        matchDuration
    26      damageDealtNorm
    13            numGroups
    9           killStreaks
    27         maxPlaceNorm
    8                 kills
    12             maxPlace
    24        playersJoined
    Name: cols, dtype: object

    # Build a new DataFrame from these more important features
    df[to_keep].head()

Output: 5 rows × 20 columns

    df.head()

Output: 5 rows × 51 columns

    # Rebuild the training and validation sets
    df_keep = df[to_keep]
    X_train, X_valid = split_vals(df_keep, n_trn)

    # Train the model
    # n_jobs=-1 uses as many parallel jobs as there are CPU cores; a specific number sets how many cores to use
    m2 = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features='sqrt', n_jobs=-1)
    m2.fit(X_train, y_train)

Output:

    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
               max_features='sqrt', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=3, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=-1,
               oob_score=False, random_state=None, verbose=0, warm_start=False)

    # Model score
    y_pre = m2.predict(X_valid)
    m2.score(X_valid, y_valid)

Output:

    0.9247615702679183

    # MAE evaluation
    mean_absolute_error(y_true=y_valid, y_pred=y_pre)

Output:

    0.05956897962889757

    # score is not called here, so this prints the bound method itself rather than a number
    print(m2.score)

Output:

    <bound method RegressorMixin.score of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
               max_features='sqrt', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=3, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=-1,
               oob_score=False, random_state=None, verbose=0, warm_start=False)>
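The importance cutoff used for the second model (imp > 0.005) can be checked on a small table. The sketch below uses four made-up importance values, loosely rounded from the kind of numbers reported above, and also computes what share of the total importance the kept features account for. The variable names threshold, kept and share are illustrative only.

```python
import pandas as pd

# Hypothetical importances for four features (not the real model output)
imp_df = pd.DataFrame({
    "cols": ["walkDistance", "killPlace", "assists", "roadKills"],
    "imp":  [0.2845, 0.1961, 0.0028, 0.0001],
})

threshold = 0.005

# Boolean mask of the features whose importance exceeds the cutoff
kept = imp_df["imp"] > threshold
to_keep = imp_df.loc[kept, "cols"].tolist()

# Share of the summed importance that the kept features account for
share = imp_df.loc[kept, "imp"].sum() / imp_df["imp"].sum()

print(to_keep)  # ['walkDistance', 'killPlace']
print(share > 0.99)  # True
```

When the dropped features carry so little of the total importance, a modest change in validation score after retraining, as in the comparison between m1 and m2, is what one would expect. The threshold itself remains a judgment call.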
