
Kaggle Project: Titanic Hands-On Practice and Summary of Related Concepts

Date: 2019-02-16 05:03:38


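Titanic
Reading the data
1. Loading the data
2. Reading CSV, Excel, and TXT files
Visual analysis of the data
Plots
Data analysis
1. Data processing: feature engineering
2. Linear regression
3. Logistic regression
4. Random forest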

Titanic


Reading the data

1. Loading the data

pandas is a widely used Python data processing package. It can read a CSV file into a DataFrame.

Detailed pandas documentation: /docs/

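import pandas

titanic = pandas.read_csv("train.csv")

# head(n) returns the first n rows; n defaults to 5
head3 = titanic.head(3)
print(head3)

# Descriptive statistics: count, mean, min, max, and so on
print(titanic.describe())

# Column dtypes and non-null counts; info() prints directly and returns None
titanic.info()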

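The loaded data has 12 columns. The Survived field records whether the passenger survived; the remaining columns hold information about the passenger:

PassengerId => passenger ID

Pclass => passenger class (1st, 2nd, or 3rd)

Name => passenger name

Sex => sex

Age => age

SibSp => number of siblings and spouses aboard

Parch => number of parents and children aboard

Ticket => ticket information

Fare => ticket price

Cabin => cabin

Embarked => port of embarkation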

2. Reading CSV, Excel, and TXT files

Excel and CSV

/p/0fd5551bac37

Reading with pandas: /happymeng/p/10481293.html

Other ways of reading files: /caiyishuai/p/9462833.html
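As a rough sketch of the pandas readers those links cover: read_csv handles comma-separated files and, given a sep argument, other delimited text files, while read_excel handles spreadsheets. The sample content below is made up for illustration, and the commented-out file name is hypothetical.

```python
import io

import pandas as pd

# Made-up sample content standing in for real files on disk.
csv_text = "PassengerId,Survived,Fare\n1,0,7.25\n2,1,71.28\n"
txt_text = "PassengerId\tSurvived\tFare\n1\t0\t7.25\n2\t1\t71.28\n"

# CSV: the default separator is a comma.
from_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-delimited text: same reader, with an explicit separator.
from_txt = pd.read_csv(io.StringIO(txt_text), sep="\t")

# Excel files go through read_excel, which needs an engine such as
# openpyxl installed; "train.xlsx" is a hypothetical file name.
# from_excel = pd.read_excel("train.xlsx", sheet_name=0)

print(from_csv)
print(from_txt)
```

Passing a file path instead of the StringIO object works the same way.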

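Visual analysis of the data

Use plots to get a first look at the data and at how each feature relates to survival.

Plots

Single features versus survival rate

1. Passenger class (Pclass) versus Survived: the survival rate within each class

2. Male-to-female ratio among survivors (pie chart)

3. Age histogram for all passengers, plus separate age distributions for survivors and non-survivors (Survived on the x-axis)

4. Number of siblings/spouses and parents/children (SibSp/Parch): same as above, or with the count on the x-axis

5. Fare

6. Port of embarkation

Relationships within the data

Age distribution per passenger class (three curves, one per class, with age on the x-axis)

Port of embarkation versus fare and passenger class

Family size versus survival rate

Combined effect of passenger class and sex on survival rate

matplotlib tutorial: /docs/sfile/matplotlib-intro/index.html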
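The numbers behind several of these plots can be computed with a groupby before drawing anything. The sketch below uses a tiny made-up sample rather than the real train.csv, so the rates it prints are illustrative only.

```python
import pandas as pd

# Tiny made-up sample; the real analysis would use the full train.csv.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate per class: the mean of a 0/1 column is the share of 1s.
rate_by_class = df.groupby("Pclass")["Survived"].mean()

# Survival rate for each (class, sex) combination.
rate_by_class_sex = df.pivot_table(
    index="Pclass", columns="Sex", values="Survived", aggfunc="mean"
)

print(rate_by_class)
print(rate_by_class_sex)

# With matplotlib installed, either table can be drawn directly, e.g.
# rate_by_class.plot(kind="bar")
```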

Data analysis

1. Data processing: feature engineering

Filling missing values

# Replace missing ages with the median age
mage = titanic["Age"].median()
titanic["Age"] = titanic["Age"].fillna(mage)
print(titanic.describe())

Replacing string values with integers

# unique() returns the distinct values of the column, in order of first appearance
print(titanic["Sex"].unique())

# Selecting a single column returns a pandas Series
print(type(titanic["Sex"]))

# .values is an attribute, not a method: titanic["Sex"].values returns every
# value as a NumPy array, while titanic["Sex"].values() raises a TypeError

# Replace the strings with integer codes: male -> 0, female -> 1.
# map() produces an integer column directly.
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})

Filling missing values and converting to integers

print(titanic["Embarked"].unique())

# A categorical column has no mean, so fill gaps with a frequently occurring value, "S"
titanic["Embarked"] = titanic["Embarked"].fillna("S")

# Replace the port codes with integers: S -> 0, C -> 1, Q -> 2
titanic["Embarked"] = titanic["Embarked"].map({"S": 0, "C": 1, "Q": 2})
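The fill value "S" is hard-coded above. One possible variation, sketched here on made-up rows, is to compute the most frequent value with mode() so the fill follows the data.

```python
import pandas as pd

# Made-up rows with the same kinds of gaps as the Titanic columns.
df = pd.DataFrame({
    "Age": [22.0, None, 38.0, 30.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: fill gaps with the median of the known values.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill gaps with the most frequent value (the mode).
most_common_port = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common_port)

# Map the strings to integer codes.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(df)
```

Any value missing from a mapping dictionary becomes NaN, so it is worth checking df.isnull().sum() after the conversion.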

2. Linear regression

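# Linear regression used as a binary classifier
# scikit-learn is a Python machine learning library
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize the model
alg = LinearRegression()

# Split the samples into 3 equal parts for 3-fold cross-validation.
# random_state only has an effect when shuffle=True, and recent scikit-learn
# versions raise an error if it is set with shuffle=False, so it is omitted.
kf = KFold(n_splits=3, shuffle=False)

predictions = []
for train, test in kf.split(titanic):
    # Feature rows for this training fold
    train_predictors = titanic[predictors].iloc[train, :]
    # Labels for the same rows; iloc on a single column takes one indexer
    train_target = titanic["Survived"].iloc[train]
    # Train the model
    alg.fit(train_predictors, train_target)
    # Predict on the held-out fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)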

Computing the accuracy

import numpy as np

# Join the per-fold arrays into a single 1-D array
predictions = np.concatenate(predictions, axis=0)

# Map the regression outputs to class labels
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0

# predictions == titanic["Survived"] is a boolean Series; True counts as 1
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print(accuracy)
# On a binary problem, guessing at random already scores about 50%

Output: 0.7833894500561167
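To make the threshold-and-compare step concrete, here is a small worked example on made-up numbers. The arrays are hypothetical and unrelated to the Titanic result above.

```python
import numpy as np

# Made-up regression outputs and true labels.
raw = np.array([0.9, 0.2, 0.55, 0.5, -0.1, 1.3])
labels = np.array([1, 0, 0, 0, 0, 1])

# Threshold at 0.5: strictly greater becomes 1, everything else becomes 0.
predicted = (raw > 0.5).astype(int)

# The comparison gives booleans; their mean is the fraction that match.
accuracy = (predicted == labels).mean()

print(predicted)
print(accuracy)
```

Five of the six predictions match here; only the 0.55 value lands on the wrong side of the threshold. Note that linear regression outputs are not confined to the range 0 to 1, which is why values such as -0.1 and 1.3 can appear.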

3. Logistic regression

# Logistic regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

alg = LogisticRegression(random_state=1)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())

Output: 0.7957351290684623

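# The scores above come from the validation folds of cross-validation;
# a final result should be evaluated on the test set
titanic_test = pandas.read_csv("test.csv")
# The remaining preprocessing steps are the same as above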
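One way to keep the test-set preprocessing identical to the training steps is to wrap them in a function. The sketch below is a hypothetical helper, not part of the original walkthrough: the name preprocess and its parameters are made up, and filling Fare is an added assumption in case the test file has gaps in that column. The medians are computed on the training data and reused for the test data.

```python
import pandas as pd


def preprocess(df, age_median, fare_median):
    """Apply the cleaning steps from this walkthrough to a Titanic-style frame.

    Hypothetical helper: the medians are passed in so the test set is
    filled with statistics computed on the training set.
    """
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)  # assumption: Fare may have gaps
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out


# Made-up stand-ins for train.csv and test.csv.
train = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "Fare": [7.0, 30.0, 80.0],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", None, "C"],
})
test = pd.DataFrame({
    "Age": [None, 50.0],
    "Fare": [None, 10.0],
    "Sex": ["female", "male"],
    "Embarked": ["Q", "S"],
})

age_median = train["Age"].median()
fare_median = train["Fare"].median()
train_clean = preprocess(train, age_median, fare_median)
test_clean = preprocess(test, age_median, fare_median)
print(test_clean)
```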

4. Random forest

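# Random forest
# Each tree is trained on rows sampled with replacement and on a random
# subset of the features (the subset size can be specified).
# Combining many decision trees helps prevent overfitting, and the forest can
# show which features influence the result most, so unhelpful ones can be dropped.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

alg = RandomForestClassifier(
    random_state=1,
    n_estimators=10,  # number of decision trees
    min_samples_split=2,
    min_samples_leaf=1,
)
# random_state is omitted because it only applies when shuffle=True
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())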

Output: 0.7856341189674523. This is not very good, so the parameters need tuning.

alg = RandomForestClassifier(
    random_state=1,
    n_estimators=100,
    min_samples_split=4,
    min_samples_leaf=2,
)
# Compute the accuracy score for all the cross-validation folds
# (much simpler than what we did before!)
kf = KFold(n_splits=3, shuffle=False)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

Output: 0.8148148148148148
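The two parameter sets above were tried by hand. A minimal sketch of doing the same comparison in a loop is shown below. It runs on synthetic data from make_classification rather than the Titanic features, and the grid values are arbitrary choices for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the Titanic features and labels.
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

kf = KFold(n_splits=3, shuffle=False)
results = {}

# Hypothetical small grid; a real search would usually cover more values.
for n_estimators, min_samples_leaf in product([10, 50], [1, 2]):
    alg = RandomForestClassifier(
        random_state=1,
        n_estimators=n_estimators,
        min_samples_leaf=min_samples_leaf,
    )
    # Mean cross-validated accuracy for this parameter combination.
    results[(n_estimators, min_samples_leaf)] = cross_val_score(alg, X, y, cv=kf).mean()

best_params = max(results, key=results.get)
print(best_params, results[best_params])
```

scikit-learn also provides GridSearchCV in sklearn.model_selection, which automates this kind of search.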
