DataScience: Building a financial scorecard model for risk control with feature engineering and logistic regression (LoR) on the GiveMeSomeCredit dataset

Date: 2020-08-16 13:31:57

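Contents

Building a financial scorecard model for risk control with feature engineering and logistic regression (LoR) on the GiveMeSomeCredit dataset
1. Load the dataset
View summary information for the dataset
2. Feature engineering: data analysis and processing
# 2.1 Missing-value analysis and handling
# 2.2 Field-by-field analysis
Analyze the label field: count the SeriousDlqin2yrs classes and their frequencies
Analyze the age field
Analyze three similar fields: NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse, NumberOfTime30-59DaysPastDueNotWorse
Analyze a single field: DebtRatio and its relationship with MonthlyIncome and SeriousDlqin2yrs
Analyze a single field: MonthlyIncome
Analyze a single field: NumberOfOpenCreditLinesAndLoans
Analyze a single field: NumberRealEstateLoansOrLines
Analyze a single field: NumberOfDependents
# 2.3 Data binning
# 2.4 Feature selection using IV (information value)
# 2.5 Compute WOE values
# 2.5.1 For the selected features, convert bins to WOE values with the WOE function
# 2.5.2 Inspect the one-to-one mapping between each bin and its WOE value
# 2.6 Split the dataset: hold out 25% as the model's validation set
# 3 Logistic regression modeling
# 3.1 Build the model
# 3.2 Model evaluation: compute AUC, plot the ROC curve, print the confusion matrix
# 4 Model inference
# 4.1 Design the scorecard rule table
# 4.1.1 Solve for the two scale parameters A and B: derive their formulas from two assumptions
# 4.1.2 Design the scorecard rule table: obtain score_card_rule from scale B, each bin's WOE encoding, and the model coefficients
# 4.2 Compute sample scorecard scores using scale A
# 4.2.1 Randomly select 12 samples (6 good, 6 bad), compute each sample's total score, and compare with the label to check the model
# 4.2.2 Compute sample scorecard scores using scale A
# 4.3 Compare test-sample scores with their labels, then design the review policy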

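Building a financial scorecard model for risk control with feature engineering and logistic regression (LoR) on the GiveMeSomeCredit dataset

1. Load the dataset

View summary information for the dataset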

   Unnamed: 0  ...  NumberOfDependents
0           1  ...                 2.0
1           2  ...                 1.0
2           3  ...                 0.0
3           4  ...                 0.0
4           5  ...                 0.0
[5 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype
---  ------                                --------------   -----
 0   Unnamed: 0                            150000 non-null  int64
 1   SeriousDlqin2yrs                      150000 non-null  int64
 2   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 3   age                                   150000 non-null  int64
 4   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64
 5   DebtRatio                             150000 non-null  float64
 6   MonthlyIncome                         120269 non-null  float64
 7   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64
 8   NumberOfTimes90DaysLate               150000 non-null  int64
 9   NumberRealEstateLoansOrLines          150000 non-null  int64
 10  NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64
 11  NumberOfDependents                    146076 non-null  float64
dtypes: float64(4), int64(8)
memory usage: 13.7 MB
None
          Unnamed: 0  ...  NumberOfDependents
count  150000.000000  ...       146076.000000
mean    75000.500000  ...            0.757222
std     43301.414527  ...            1.115086
min         1.000000  ...            0.000000
25%     37500.750000  ...            0.000000
50%     75000.500000  ...            0.000000
75%    112500.250000  ...            1.000000
max    150000.000000  ...           20.000000
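The three views above look like the output of `head()`, `info()` and `describe()`. The sketch below reproduces that sequence on a tiny in-memory CSV; the rows are made up for illustration and the `summarize` helper is a hypothetical name, not the author's code. The literal `None` in the output above is what appears when `df.info()` is wrapped in `print()`, since `info()` prints by itself and returns nothing.

```python
import io
import pandas as pd

def summarize(df: pd.DataFrame) -> None:
    """Print the first rows, the column/dtype summary and basic statistics."""
    print(df.head())
    df.info()              # info() writes to stdout itself and returns None
    print(df.describe())

# Made-up stand-in for the real training file; the empty first header cell
# is what makes pandas name the column "Unnamed: 0".
csv_text = """,SeriousDlqin2yrs,age,MonthlyIncome,NumberOfDependents
1,1,45,9120.0,2.0
2,0,40,2600.0,1.0
3,0,38,,0.0
"""
df_train = pd.read_csv(io.StringIO(csv_text))
summarize(df_train)
```

With the real data, the `io.StringIO(...)` argument would be replaced by the path to the downloaded training CSV.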

2. Feature engineering: data analysis and processing

# 2.1 Missing-value analysis and handling

[8 rows x 12 columns]
    Column                                Number_of_Null_Values  Proportion
0   Unnamed: 0                                                0    0.000000
1   SeriousDlqin2yrs                                          0    0.000000
2   RevolvingUtilizationOfUnsecuredLines                      0    0.000000
3   age                                                       0    0.000000
4   NumberOfTime30-59DaysPastDueNotWorse                      0    0.000000
5   DebtRatio                                                 0    0.000000
6   MonthlyIncome                                         29731    0.198207
7   NumberOfOpenCreditLinesAndLoans                           0    0.000000
8   NumberOfTimes90DaysLate                                   0    0.000000
9   NumberRealEstateLoansOrLines                              0    0.000000
10  NumberOfTime60-89DaysPastDueNotWorse                      0    0.000000
11  NumberOfDependents                                     3924    0.026160
Unnamed: 0                              0
SeriousDlqin2yrs                        0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
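The first table counts nulls per column and the second shows zero nulls after imputation. A minimal sketch of both steps follows. The fill strategy is an assumption: the article does not show it, although the value 5400.0 that dominates MonthlyIncome in later output is consistent with a median fill for that column. The helper names and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: number of nulls and their share of all rows."""
    n_null = df.isnull().sum()
    return pd.DataFrame({
        "Column": n_null.index,
        "Number_of_Null_Values": n_null.values,
        "Proportion": (n_null / len(df)).values,
    })

def fill_with_median(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Return a copy with nulls in `cols` replaced by each column's median (assumed strategy)."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy data for illustration only
df = pd.DataFrame({
    "MonthlyIncome": [9120.0, np.nan, 3042.0, np.nan, 5400.0],
    "NumberOfDependents": [2.0, 1.0, np.nan, 0.0, 0.0],
})
summary = null_summary(df)
filled = fill_with_median(df, ["MonthlyIncome", "NumberOfDependents"])
print(summary)
print(filled.isnull().sum())
```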

# 2.2 Field-by-field analysis

Analyze the label field: count the SeriousDlqin2yrs classes and their frequencies

Default Rate: 0.06684
count    150000.000000
mean          6.048438
std         249.755371
min           0.000000
25%           0.029867
50%           0.154181
75%           0.559046
max       50708.000000
Name: RevolvingUtilizationOfUnsecuredLines, dtype: float64

[[0, 0.06684], [1, 0.37177950868783705], [2, 0.14555256064690028], [3, 0.09931506849315068],
 [4, 0.08679245283018867], [5, 0.07874015748031496], [6, 0.07692307692307693], [7, 0.0778688524590164],
 [8, 0.07407407407407407], [9, 0.07053941908713693], [10, 0.07053941908713693], [11, 0.07053941908713693],
 [12, 0.06666666666666667], [13, 0.058823529411764705], [14, 0.058823529411764705], [15, 0.05531914893617021],
 [16, 0.05531914893617021], [17, 0.05531914893617021], [18, 0.05531914893617021], [19, 0.05555555555555555]]
Proportion of Defaulters with Total Amount of Money Owed Not Exceeding Total Credit Limit: 0.05991996127598361
Proportion of Defaulters with Total Amount of Money Owed Not Exceeding or Equal to 13 times of Total Credit Limit: 0.06685273968029273

Analyze the age field

count    150000.000000
mean         52.295207
std          14.771866
min           0.000000
25%          41.000000
50%          52.000000
75%          63.000000
max         109.000000
Name: age, dtype: float64

Analyze three similar fields: NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse, NumberOfTime30-59DaysPastDueNotWorse

0     141662
1       5243
2       1555
3        667
4        291
5        131
6         80
7         38
8         21
9         19
10         8
11         5
12         2
13         4
14         2
15         2
17         1
96         5
98       264
Name: NumberOfTimes90DaysLate, dtype: int64
0     142396
1       5731
2       1118
3        318
4        105
5         34
6         16
7          9
8          2
9          1
11         1
96         5
98       264
Name: NumberOfTime60-89DaysPastDueNotWorse, dtype: int64
0     126018
1      16033
2       4598
3       1754
4        747
5        342
6        140
7         54
8         25
9         12
10         4
11         1
12         2
13         1
96         5
98       264
Name: NumberOfTime30-59DaysPastDueNotWorse, dtype: int64
       NumberOfTimes90DaysLate  ...  NumberOfTime30-59DaysPastDueNotWorse
count               269.000000  ...                            269.000000
mean                 97.962825  ...                             97.962825
std                   0.270628  ...                              0.270628
min                  96.000000  ...                             96.000000
25%                  98.000000  ...                             98.000000
50%                  98.000000  ...                             98.000000
75%                  98.000000  ...                             98.000000
max                  98.000000  ...                             98.000000
[8 rows x 3 columns]
{'98,98,98': 263, '96,96,96': 4}
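The counts above show the same 269 rows carrying the values 96 or 98 in all three past-due counters, which looks like a sentinel code, not a real count. Below is a minimal sketch of how such rows could be isolated and tallied. The cutoff of 90 and the helper name are assumptions for illustration, and the data is a toy frame; how the article then treats these rows is not shown, so no replacement step is included.

```python
import pandas as pd

LATE_COLS = [
    "NumberOfTimes90DaysLate",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTime30-59DaysPastDueNotWorse",
]

def sentinel_mask(df: pd.DataFrame, cutoff: int = 90) -> pd.Series:
    """True for rows where all three counters exceed `cutoff` (hypothetical threshold)."""
    return (df[LATE_COLS] > cutoff).all(axis=1)

# Toy rows for illustration only
df = pd.DataFrame(
    [[0, 0, 0], [98, 98, 98], [1, 0, 2], [96, 96, 96], [98, 98, 98], [17, 0, 0]],
    columns=LATE_COLS,
)
mask = sentinel_mask(df)
# Join the three values of each flagged row into a key such as "98,98,98" and count keys
combos = (
    df.loc[mask, LATE_COLS].astype(str)
    .apply(",".join, axis=1)
    .value_counts()
    .to_dict()
)
print(df.loc[mask, LATE_COLS].describe())
print(combos)
```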

Analyze a single field: DebtRatio and its relationship with MonthlyIncome and SeriousDlqin2yrs

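# df_DR and df_DR95 are defined earlier in the original notebook and are not shown here
# (presumably the DebtRatio column and its 95th-percentile value, 2449.0 in the output that follows).
temp = df_train[(df_DR > df_DR95) & (df_train['SeriousDlqin2yrs'] == df_train['MonthlyIncome'])]
temp.to_csv('0314temp.csv')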

count    150000.000000
mean        353.005076
std        2037.818523
min           0.000000
25%           0.175074
50%           0.366508
75%           0.868254
max      329664.000000
Name: DebtRatio, dtype: float64
2449.0
           DebtRatio  MonthlyIncome  SeriousDlqin2yrs
count    7494.000000    7494.000000       7494.000000
mean     4417.958367    5126.905791          0.055111
std      7875.314649    1183.339377          0.228212
min      2450.000000       0.000000          0.000000
25%      2893.250000    5400.000000          0.000000
50%      3491.000000    5400.000000          0.000000
75%      4620.000000    5400.000000          0.000000
max    329664.000000    5400.000000          1.000000
331
5400.0    7115
0.0        347
1.0         32
Name: MonthlyIncome, dtype: int64
Number of people who owe around 2449 or more times what they own and have same values for MonthlyIncome and SeriousDlqin2yrs: 331
3489.024999999994
           DebtRatio  MonthlyIncome  SeriousDlqin2yrs
count    3750.000000     3750.00000       3750.000000
mean     5917.488000     5133.60320          0.064267
std     10925.524011     1169.58239          0.245260
min      3490.000000        0.00000          0.000000
25%      3957.250000     5400.00000          0.000000
50%      4619.000000     5400.00000          0.000000
75%      5789.500000     5400.00000          0.000000
max    329664.000000     5400.00000          1.000000
164
5400.0    3565
0.0        173
1.0         12
Name: MonthlyIncome, dtype: int64
Number of people who owe around 3490 or more times what they own and have same values for MonthlyIncome and SeriousDlqin2yrs: 164

Analyze a single field: MonthlyIncome

Analyze a single field: NumberOfOpenCreditLinesAndLoans

Analyze a single field: NumberRealEstateLoansOrLines

Analyze a single field: NumberOfDependents

# 2.3 Data binning

Every field except the label is binned.
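The binning code itself is not shown in the article. The sketch below uses `pandas.cut` with the age and count bin edges that appear later in the WOE tables, e.g. `(25.0, 40.0]` and `(9.0, inf]`. The use of `pd.cut`, the `bin_` column names and the toy data are assumptions. The RevolvingUtilizationOfUnsecuredLines edges (0.0192, 0.0832, 0.271, 0.699) look like quantile cut points, which `pd.qcut` would produce, but that is a guess.

```python
import numpy as np
import pandas as pd

# Edges read off the WOE tables in section 2.5.1; intervals are right-closed
AGE_EDGES = [-np.inf, 25, 40, 50, 60, 70, np.inf]
COUNT_EDGES = [-np.inf, 1, 2, 3, 4, 5, 6, 7, 8, 9, np.inf]

# Toy data for illustration only
df = pd.DataFrame({
    "age": [22, 25, 26, 47, 65, 80],
    "NumberOfTimes90DaysLate": [0, 1, 2, 0, 12, 98],
})
df["bin_age"] = pd.cut(df["age"], bins=AGE_EDGES)
df["bin_NumberOfTimes90DaysLate"] = pd.cut(df["NumberOfTimes90DaysLate"], bins=COUNT_EDGES)
print(df)
```

Because the intervals are closed on the right, an age of exactly 25 lands in `(-inf, 25.0]` and a count of 1 shares the lowest bin with 0.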

# 2.4 Feature selection using IV (information value)

bin_DebtRatio cal_IV: 0.0595
bin_MonthlyIncome cal_IV: 0.0562
bin_RevolvingUtilizationOfUnsecuredLines cal_IV: 1.0596
bin_NumberOfOpenCreditLinesAndLoans cal_IV: 0.048
bin_NumberRealEstateLoansOrLines cal_IV: 0.0121
bin_age cal_IV: 0.2404
bin_NumberOfDependents cal_IV: 0.0145
bin_NumberOfTime30-59DaysPastDueNotWorse cal_IV: 0.4924
bin_NumberOfTime60-89DaysPastDueNotWorse cal_IV: 0.2666
bin_NumberOfTimes90DaysLate cal_IV: 0.4916
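The article's `cal_IV` function is not shown. As a sketch under stated assumptions, the block below computes WOE per bin as ln(bad share / good share), the sign convention under which riskier bins get positive WOE as in section 2.5.1, and IV as the sum of (bad share - good share) * WOE. The function name and toy data are hypothetical. The last lines apply a cutoff of 0.1 to the IV values printed above; that cutoff is an assumption chosen only because it reproduces the five features the article keeps.

```python
import numpy as np
import pandas as pd

def woe_iv_table(bins: pd.Series, label: pd.Series) -> pd.DataFrame:
    """Per-bin good/bad counts, WOE and IV contribution (label 1 = bad).

    A bin with zero goods or zero bads would yield an infinite WOE;
    real code would need smoothing or bin merging for that case.
    """
    counts = pd.crosstab(bins, label)
    good, bad = counts[0], counts[1]
    good_dist = good / good.sum()
    bad_dist = bad / bad.sum()
    woe = np.log(bad_dist / good_dist)
    return pd.DataFrame({"good": good, "bad": bad, "woe": woe,
                         "iv": (bad_dist - good_dist) * woe})

# Toy data: bin A has 60 good / 10 bad, bin B has 40 good / 30 bad
bins = pd.Series(["A"] * 70 + ["B"] * 70)
label = pd.Series([0] * 60 + [1] * 10 + [0] * 40 + [1] * 30)
table = woe_iv_table(bins, label)
total_iv = table["iv"].sum()
print(table)
print("IV:", round(total_iv, 4))

# IV values as printed in the article; 0.1 is a hypothetical cutoff
iv_values = {
    "DebtRatio": 0.0595, "MonthlyIncome": 0.0562,
    "RevolvingUtilizationOfUnsecuredLines": 1.0596,
    "NumberOfOpenCreditLinesAndLoans": 0.048,
    "NumberRealEstateLoansOrLines": 0.0121, "age": 0.2404,
    "NumberOfDependents": 0.0145,
    "NumberOfTime30-59DaysPastDueNotWorse": 0.4924,
    "NumberOfTime60-89DaysPastDueNotWorse": 0.2666,
    "NumberOfTimes90DaysLate": 0.4916,
}
selected = [name for name, iv in iv_values.items() if iv > 0.1]
print(selected)
```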

# 2.5 Compute WOE values

# 2.5.1 For the selected features, convert bins to WOE values with the WOE function

woe_cols: ['woe_bin_age', 'woe_bin_RevolvingUtilizationOfUnsecuredLines', 'woe_bin_NumberOfTime30-59DaysPastDueNotWorse', 'woe_bin_NumberOfTime60-89DaysPastDueNotWorse', 'woe_bin_NumberOfTimes90DaysLate']

------------- age
<class 'pandas.core.frame.DataFrame'> df……final
    features  bin                woe
0   age       (40.0, 50.0]   0.228343
1   age       (25.0, 40.0]   0.469547
5   age       (70.0, inf]   -1.132145
6   age       (50.0, 60.0]  -0.084782
15  age       (60.0, 70.0]  -0.689003
19  age       (-inf, 25.0]   0.562024
------------- RevolvingUtilizationOfUnsecuredLines
<class 'pandas.core.frame.DataFrame'> df……final
    features                              bin                     woe
0   RevolvingUtilizationOfUnsecuredLines  (0.699, 50708.0]    1.242254
2   RevolvingUtilizationOfUnsecuredLines  (0.271, 0.699]      0.053164
3   RevolvingUtilizationOfUnsecuredLines  (0.0832, 0.271]    -0.866502
11  RevolvingUtilizationOfUnsecuredLines  (-0.001, 0.0192]   -1.286617
14  RevolvingUtilizationOfUnsecuredLines  (0.0192, 0.0832]   -1.447382
------------- NumberOfTime30-59DaysPastDueNotWorse
<class 'pandas.core.frame.DataFrame'> df……final
       features                              bin               woe
0      NumberOfTime30-59DaysPastDueNotWorse  (1.0, 2.0]    1.616726
1      NumberOfTime30-59DaysPastDueNotWorse  (-inf, 1.0]  -0.257826
13     NumberOfTime30-59DaysPastDueNotWorse  (2.0, 3.0]    2.027495
183    NumberOfTime30-59DaysPastDueNotWorse  (3.0, 4.0]    2.336869
191    NumberOfTime30-59DaysPastDueNotWorse  (4.0, 5.0]    2.436786
251    NumberOfTime30-59DaysPastDueNotWorse  (6.0, 7.0]    2.710383
423    NumberOfTime30-59DaysPastDueNotWorse  (9.0, inf]    2.846431
1052   NumberOfTime30-59DaysPastDueNotWorse  (5.0, 6.0]    2.750685
6909   NumberOfTime30-59DaysPastDueNotWorse  (7.0, 8.0]    1.882503
10822  NumberOfTime30-59DaysPastDueNotWorse  (8.0, 9.0]    1.943128
------------- NumberOfTime60-89DaysPastDueNotWorse
<class 'pandas.core.frame.DataFrame'> df……final
       features                              bin               woe
0      NumberOfTime60-89DaysPastDueNotWorse  (-inf, 1.0]  -0.097990
186    NumberOfTime60-89DaysPastDueNotWorse  (1.0, 2.0]    2.643431
423    NumberOfTime60-89DaysPastDueNotWorse  (4.0, 5.0]    3.115848
1146   NumberOfTime60-89DaysPastDueNotWorse  (2.0, 3.0]    2.901978
1733   NumberOfTime60-89DaysPastDueNotWorse  (9.0, inf]    2.829466
2406   NumberOfTime60-89DaysPastDueNotWorse  (3.0, 4.0]    3.121783
6664   NumberOfTime60-89DaysPastDueNotWorse  (5.0, 6.0]    3.734887
16642  NumberOfTime60-89DaysPastDueNotWorse  (6.0, 7.0]    2.859419
23964  NumberOfTime60-89DaysPastDueNotWorse  (7.0, 8.0]    2.636275
68976  NumberOfTime60-89DaysPastDueNotWorse  (8.0, 9.0]    2.636275
------------- NumberOfTimes90DaysLate
<class 'pandas.core.frame.DataFrame'> df……final
      features                 bin               woe
0     NumberOfTimes90DaysLate  (-inf, 1.0]  -0.176674
13    NumberOfTimes90DaysLate  (2.0, 3.0]    2.947611
186   NumberOfTimes90DaysLate  (1.0, 2.0]    2.632416
1298  NumberOfTimes90DaysLate  (4.0, 5.0]    3.183915
1713  NumberOfTimes90DaysLate  (3.0, 4.0]    3.344926
1733  NumberOfTimes90DaysLate  (9.0, inf]    2.821100
2910  NumberOfTimes90DaysLate  (8.0, 9.0]    3.665894
3400  NumberOfTimes90DaysLate  (5.0, 6.0]    3.041740
3929  NumberOfTimes90DaysLate  (6.0, 7.0]    4.124352
5684  NumberOfTimes90DaysLate  (7.0, 8.0]    3.552566
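To show how a binned column becomes a `woe_bin_*` column, here is a minimal sketch for the age feature. It uses the age WOE values printed above; the bin edges are read from the bin labels, and the lookup by category code is an illustrative choice, since the article's own WOE function is not shown.

```python
import numpy as np
import pandas as pd

AGE_EDGES = [-np.inf, 25, 40, 50, 60, 70, np.inf]
# WOE per bin in ascending bin order, copied from the age table above
AGE_WOE = [0.562024, 0.469547, 0.228343, -0.084782, -0.689003, -1.132145]

ages = pd.Series([26, 47, 54, 65], name="age")   # toy values for illustration
bin_age = pd.cut(ages, bins=AGE_EDGES)

# cat.codes gives each row's bin position, which indexes straight into AGE_WOE.
# A missing age would have code -1 and needs separate handling (not covered here).
woe_bin_age = pd.Series(
    np.asarray(AGE_WOE)[bin_age.cat.codes.to_numpy()],
    index=ages.index, name="woe_bin_age",
)

# The bin-to-WOE lookup table, in the same layout as the printed output
mapping = pd.DataFrame({
    "features": "age",
    "bin": bin_age.cat.categories.astype(str),
    "woe": AGE_WOE,
})
print(mapping)
print(woe_bin_age.tolist())
```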

# 2.5.2 Inspect the one-to-one mapping between each bin and its WOE value

# 2.6 Split the dataset: hold out 25% as the model's validation set

bad_rate: 0.06688333333333334
X_train.shape: (120000, 5)

# 3 Logistic regression modeling

# 3.1 Build the model

LoR_Score: 0.9368266666666667
LoRC_pred_proba [0.0121424 0.15221691 0.02248172 ... 0.0528182 0.0121424 0.0952767 ]
LoRC_coef_lists_ [0.46051155 0.76869053 0.59104431 0.36452944 0.56621256]
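The three printed values correspond to an accuracy score, predicted probabilities for the bad class, and one coefficient per WOE feature. The sketch below shows one way to obtain them with scikit-learn on synthetic data; the use of `train_test_split` with `test_size=0.25`, default `LogisticRegression` settings, and the random stand-in features are all assumptions, since the article's modeling code is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the five WOE-encoded features and the label
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 5))
true_coef = np.array([0.5, 0.8, 0.6, 0.4, 0.6])
p_bad = 1.0 / (1.0 + np.exp(-(X @ true_coef - 2.6)))
y = rng.binomial(1, p_bad)

# Hold out 25% for validation; stratify keeps the bad rate similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

LoR_Score = model.score(X_test, y_test)              # accuracy on the hold-out set
LoRC_pred_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1 (bad)
LoRC_coef_lists_ = model.coef_[0]                    # one coefficient per feature
print("LoR_Score:", LoR_Score)
print("LoRC_coef_lists_", LoRC_coef_lists_)
```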

# 3.2 Model evaluation: compute AUC, plot the ROC curve, print the confusion matrix

Auc_Score: 0.8226466762033763
[[34827   200]
 [ 2169   304]]
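The confusion matrix can be read back into the usual metrics. The short calculation below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, class 1 = bad). Under that assumption the accuracy works out to the LoR_Score printed in section 3.1, while recall on the bad class is low, which is why AUC is the more informative figure on data this imbalanced.

```python
# Cells of the printed confusion matrix, assuming rows = true class, columns = predicted class
tn, fp = 34827, 200
fn, tp = 2169, 304

total = tn + fp + fn + tp
accuracy = (tn + tp) / total     # share of all hold-out samples classified correctly
recall = tp / (tp + fn)          # share of true defaulters that were flagged
precision = tp / (tp + fp)       # share of flagged samples that really defaulted
bad_rate = (tp + fn) / total     # share of defaulters in the hold-out set

print(total)                     # 37500
print(round(accuracy, 6))        # 0.936827
print(round(recall, 4))          # 0.1229
print(round(precision, 4))       # 0.6032
print(round(bad_rate, 4))        # 0.0659
```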

# 4 Model inference

# 4.1 Design the scorecard rule table

# 4.1.1 Solve for the two scale parameters A and B: derive their formulas from two assumptions

A = 650, B = 72.13
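The two assumptions are not spelled out in the article. A common formulation writes the score as A - B * ln(odds), with odds = p / (1 - p) for the bad class, and assumes (1) a base score at some base odds and (2) a fixed number of points lost each time the odds double (PDO). That gives B = PDO / ln 2 and A = base score + B * ln(base odds). The inputs below (base score 650 at odds 1:1, PDO 50) are guesses, chosen because they reproduce the printed 650 and 72.13.

```python
import math

def scorecard_scale(base_score: float, base_odds: float, pdo: float):
    """Return (A, B) for score = A - B * ln(odds).

    base_score: score assigned at base_odds
    pdo: points by which the score drops when the odds double
    """
    B = pdo / math.log(2)
    A = base_score + B * math.log(base_odds)
    return A, B

# Hypothetical inputs that happen to match the article's output
A, B = scorecard_scale(base_score=650, base_odds=1.0, pdo=50)
print(round(A), round(B, 2))   # 650 72.13
```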

# 4.1.2 Design the scorecard rule table: obtain score_card_rule from scale B, each bin's WOE encoding, and the model coefficients
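The rule table itself is not printed in the article. As a sketch, each bin's points can be taken as -B * coefficient * WOE, so that bins with higher default risk (positive WOE) subtract points. The example below does this for the age feature only, using the first entry of LoRC_coef_lists_ and the age WOE values from section 2.5.1; pairing that coefficient with age follows the order of woe_cols, and rounding per bin is an assumption.

```python
B = 72.13                 # scale parameter from section 4.1.1
COEF_AGE = 0.46051155     # first entry of LoRC_coef_lists_, assumed to belong to woe_bin_age

# WOE per age bin, copied from section 2.5.1
AGE_WOE = {
    "(-inf, 25.0]": 0.562024,
    "(25.0, 40.0]": 0.469547,
    "(40.0, 50.0]": 0.228343,
    "(50.0, 60.0]": -0.084782,
    "(60.0, 70.0]": -0.689003,
    "(70.0, inf]": -1.132145,
}

# Points contributed by each bin; positive WOE (riskier) lowers the score
score_card_rule = {bin_: round(-B * COEF_AGE * woe) for bin_, woe in AGE_WOE.items()}
for bin_, points in score_card_rule.items():
    print(f"age {bin_:<14} {points:+d}")
```

Under these assumptions the youngest bin loses 19 points and the oldest gains 38.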

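# 4.2 Compute sample scorecard scores using scale A

# 4.2.1 Randomly select 12 samples (6 good, 6 bad), compute each sample's total score, and compare with the label to check the model

# 4.2.2 Compute sample scorecard scores using scale A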
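The scoring code is not shown, but the sample rows printed under section 4.3 allow a check. The sketch below scores a sample as round(A - B * sum(coefficient * WOE)), using the printed A, B, coefficients (assumed to follow the order of woe_cols) and WOE tables, with no intercept term. This is a reconstruction, not the author's implementation; it is offered because it reproduces the five printed scores whose feature values are visible.

```python
import bisect

A, B = 650, 72.13   # scale parameters from section 4.1.1

COUNT_EDGES = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# feature -> (coefficient, right-closed bin edges, WOE per bin in ascending bin order)
CARD = {
    "age": (0.46051155, [25, 40, 50, 60, 70],
            [0.562024, 0.469547, 0.228343, -0.084782, -0.689003, -1.132145]),
    "RevolvingUtilizationOfUnsecuredLines": (
        0.76869053, [0.0192, 0.0832, 0.271, 0.699],
        [-1.286617, -1.447382, -0.866502, 0.053164, 1.242254]),
    "NumberOfTime30-59DaysPastDueNotWorse": (
        0.59104431, COUNT_EDGES,
        [-0.257826, 1.616726, 2.027495, 2.336869, 2.436786,
         2.750685, 2.710383, 1.882503, 1.943128, 2.846431]),
    "NumberOfTime60-89DaysPastDueNotWorse": (
        0.36452944, COUNT_EDGES,
        [-0.097990, 2.643431, 2.901978, 3.121783, 3.115848,
         3.734887, 2.859419, 2.636275, 2.636275, 2.829466]),
    "NumberOfTimes90DaysLate": (
        0.56621256, COUNT_EDGES,
        [-0.176674, 2.632416, 2.947611, 3.344926, 3.183915,
         3.041740, 4.124352, 3.552566, 3.665894, 2.821100]),
}
FEATURES = list(CARD)

def sample_score(values) -> int:
    """Score one sample given its raw feature values in FEATURES order."""
    linear = 0.0
    for feature, value in zip(FEATURES, values):
        coef, edges, woes = CARD[feature]
        # bisect_left finds the right-closed bin (lo, hi] that contains value
        linear += coef * woes[bisect.bisect_left(edges, value)]
    return round(A - B * linear)

# Raw feature values of the samples printed under section 4.3, keyed by row index
samples = {
    25143: (47, 1.0, 0, 0, 1),
    67429: (54, 0.015171, 0, 0, 0),
    66689: (26, 0.252252, 0, 0, 0),
    42656: (40, 0.916335, 1, 0, 0),
    81903: (65, 0.091478, 0, 0, 0),
}
scores = {idx: sample_score(vals) for idx, vals in samples.items()}
print(scores)   # {25143: 594, 67429: 745, 66689: 703, 42656: 586, 81903: 742}
```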

# 4.3 Compare test-sample scores with their labels, then design the review policy

44377 754.0 --------- Accept directly!
age                                      47.0
RevolvingUtilizationOfUnsecuredLines      1.0
NumberOfTime30-59DaysPastDueNotWorse      0.0
NumberOfTime60-89DaysPastDueNotWorse      0.0
NumberOfTimes90DaysLate                   1.0
score                                   594.0
Name: 25143, dtype: float64
25143 594.0 --------- Manual review!
age                                      54.000000
RevolvingUtilizationOfUnsecuredLines      0.015171
NumberOfTime30-59DaysPastDueNotWorse      0.000000
NumberOfTime60-89DaysPastDueNotWorse      0.000000
NumberOfTimes90DaysLate                   0.000000
score                                   745.000000
Name: 67429, dtype: float64
67429 745.0 --------- Accept directly!
age                                      26.000000
RevolvingUtilizationOfUnsecuredLines      0.252252
NumberOfTime30-59DaysPastDueNotWorse      0.000000
NumberOfTime60-89DaysPastDueNotWorse      0.000000
NumberOfTimes90DaysLate                   0.000000
score                                   703.000000
Name: 66689, dtype: float64
66689 703.0 --------- Accept directly!
age                                      40.000000
RevolvingUtilizationOfUnsecuredLines      0.916335
NumberOfTime30-59DaysPastDueNotWorse      1.000000
NumberOfTime60-89DaysPastDueNotWorse      0.000000
NumberOfTimes90DaysLate                   0.000000
score                                   586.000000
Name: 42656, dtype: float64
42656 586.0 --------- Manual review!
age                                      65.000000
RevolvingUtilizationOfUnsecuredLines      0.091478
NumberOfTime30-59DaysPastDueNotWorse      0.000000
NumberOfTime60-89DaysPastDueNotWorse      0.000000
NumberOfTimes90DaysLate                   0.000000
score                                   742.000000
Name: 81903, dtype: float64
81903 742.0 --------- Accept directly!
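The output shows two of the review outcomes but not the cutoffs behind them. A minimal sketch of a tiered policy follows. Both thresholds (650 and 500) and the "Reject!" tier are hypothetical; they are chosen only so that the six printed samples receive the same decisions as above.

```python
def review_decision(score: float, accept_at: float = 650, review_at: float = 500) -> str:
    """Map a scorecard score to a decision; both cutoffs are hypothetical."""
    if score >= accept_at:
        return "Accept directly!"
    if score >= review_at:
        return "Manual review!"
    return "Reject!"   # tier not present in the printed output, added for completeness

# Scores of the samples printed above, keyed by row index
printed_scores = {44377: 754.0, 25143: 594.0, 67429: 745.0,
                  66689: 703.0, 42656: 586.0, 81903: 742.0}
decisions = {idx: review_decision(s) for idx, s in printed_scores.items()}
for idx, s in printed_scores.items():
    print(idx, s, "---------", decisions[idx])
```

In practice the cutoffs would be set from the score distributions of good and bad samples, which is the comparison this section describes.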
