
Data Science and Artificial Intelligence Technical Notes 19: Data Wrangling (Part 1)

Date: 2018-12-24 05:51:42


19. Data Wrangling (Part 1)

Author: Chris Albon

Translator: 飞龙

License: CC BY-NC-SA 4.0

Applying Functions by Group in Pandas

```python
import pandas as pd

# create an example dataframe
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
        'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
df
```

```python
# group df by df.Platoon, then apply a rolling-mean
# lambda function to df.Casualties
df.groupby('Platoon')['Casualties'].apply(lambda x: x.rolling(center=False, window=2).mean())
'''
0     NaN
1     2.5
2     4.5
3     6.0
4     6.0
5     5.0
6     NaN
7     3.5
8     2.5
9     4.5
10    5.5
11    NaN
12    5.5
13    5.0
14    5.0
15    5.0
dtype: float64
'''
```
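A minimal sketch of an alternative using `transform()`, which returns a result aligned with the original index, so the per-group rolling mean can be stored directly as a new column:

```python
import pandas as pd

# same data as above
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
        'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)

# transform() applies the function per group but keeps the
# original index, so the result can be assigned as a column
df['rolling_mean'] = (df.groupby('Platoon')['Casualties']
                        .transform(lambda x: x.rolling(window=2).mean()))
```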

Applying Operations to Groups in Pandas

```python
# import modules
import pandas as pd

# create a dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
```

```python
# create a groupby variable that groups preTestScores by regiment
groupby_regiment = df['preTestScore'].groupby(df['regiment'])
groupby_regiment
# <pandas.core.groupby.SeriesGroupBy object at 0x113ddb550>
```

"This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups." – PyDA

Use list() to show what the grouping looks like.

```python
list(df['preTestScore'].groupby(df['regiment']))
'''
[('Dragoons', 4     3
  5     4
  6    24
  7    31
  Name: preTestScore, dtype: int64),
 ('Nighthawks', 0     4
  1    24
  2    31
  3     2
  Name: preTestScore, dtype: int64),
 ('Scouts', 8     2
  9     3
  10    2
  11    3
  Name: preTestScore, dtype: int64)]
'''

df['preTestScore'].groupby(df['regiment']).describe()
```

```python
# mean preTestScore for each regiment
groupby_regiment.mean()
'''
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64
'''

df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
'''
regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64
'''

df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
```

```python
# group the entire dataframe by regiment and company
df.groupby(['regiment', 'company']).mean()
```

```python
# number of observations in each regiment and company
df.groupby(['regiment', 'company']).size()
'''
regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64
'''

# group the dataframe by regiment, and for each regiment,
for name, group in df.groupby('regiment'):
    # print the name of the regiment
    print(name)
    # print its data
    print(group)
'''
Dragoons
   regiment company    name  preTestScore  postTestScore
4  Dragoons     1st   Cooze             3             70
5  Dragoons     1st   Jacon             4             25
6  Dragoons     2nd  Ryaner            24             94
7  Dragoons     2nd    Sone            31             57
Nighthawks
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
Scouts
   regiment company   name  preTestScore  postTestScore
8    Scouts     1st  Sloan             2             62
9    Scouts     1st  Piger             3             70
10   Scouts     2nd  Riani             2             62
11   Scouts     2nd    Ali             3             70
'''
```

Grouping by columns:

Specifically in this case: group the columns by data type (that is, axis=1), then use list() to see what that grouping looks like.

```python
list(df.groupby(df.dtypes, axis=1))
'''
[(dtype('int64'),
      preTestScore  postTestScore
  0              4             25
  1             24             94
  2             31             57
  3              2             62
  4              3             70
  5              4             25
  6             24             94
  7             31             57
  8              2             62
  9              3             70
  10             2             62
  11             3             70),
 (dtype('O'),
        regiment company      name
  0   Nighthawks     1st    Miller
  1   Nighthawks     1st  Jacobson
  2   Nighthawks     2nd       Ali
  3   Nighthawks     2nd    Milner
  4     Dragoons     1st     Cooze
  5     Dragoons     1st     Jacon
  6     Dragoons     2nd    Ryaner
  7     Dragoons     2nd      Sone
  8       Scouts     1st     Sloan
  9       Scouts     1st     Piger
  10      Scouts     2nd     Riani
  11      Scouts     2nd       Ali)]
'''

df.groupby('regiment').mean().add_prefix('mean_')
```
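A related way to split columns by type, without groupby, is `select_dtypes`; a minimal sketch on small sample data (not the regiment frame above):

```python
import pandas as pd

# a small frame mixing string and numeric columns (sample data)
df = pd.DataFrame({'regiment': ['Nighthawks', 'Dragoons'],
                   'preTestScore': [4, 3],
                   'postTestScore': [25, 70]})

# keep only the numeric columns
numeric_df = df.select_dtypes(include='number')

# keep only the object (string) columns
object_df = df.select_dtypes(include='object')
```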

```python
# create a function to get the stats of a group
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)
df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()
```

Applying Operations to a Pandas Dataframe

```python
# import modules
import pandas as pd
import numpy as np

# note: the year values were lost when this page was extracted;
# placeholder years are used here
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```

```python
# create a lambda function that capitalizes
capitalizer = lambda x: x.upper()
```

Apply the capitalizer function to the name column.

apply() can apply a function along any axis of the dataframe.

```python
df['name'].apply(capitalizer)
'''
Cochice       JASON
Pima          MOLLY
Santa Cruz     TINA
Maricopa       JAKE
Yuma            AMY
Name: name, dtype: object
'''
```

Map the capitalizer lambda function to each element of the series name.

map() applies an operation to each element of a series.

```python
df['name'].map(capitalizer)
'''
Cochice       JASON
Pima          MOLLY
Santa Cruz     TINA
Maricopa       JAKE
Yuma            AMY
Name: name, dtype: object
'''
```

Apply the square root function to every cell in the entire dataframe.

applymap() applies a function to every element of the entire dataframe.

```python
# drop the string variable so that applymap() can run
df = df.drop('name', axis=1)

# return the square root of every cell in the dataframe
df.applymap(np.sqrt)
```
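One caveat worth noting: in recent pandas releases, applymap() has been deprecated in favor of the elementwise DataFrame.map(). A version-tolerant sketch with hypothetical sample data:

```python
import numpy as np
import pandas as pd

# hypothetical numeric sample frame
df = pd.DataFrame({'coverage': [25, 94], 'reports': [4, 24]})

# prefer DataFrame.map (pandas >= 2.1); fall back to applymap on older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap
result = elementwise(np.sqrt)
```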

Apply a function over a dataframe.

```python
# create a function called times100
def times100(x):
    # if x is a string,
    if type(x) is str:
        # return it untouched
        return x
    # but, if not, return it multiplied by 100
    elif x:
        return 100 * x
    # and leave everything else alone
    else:
        return

df.applymap(times100)
```

Assigning a New Column to a Pandas Dataframe

```python
import pandas as pd

# create an empty dataframe
df = pd.DataFrame()

# create a column
df['name'] = ['John', 'Steve', 'Sarah']

# view the dataframe
df
```

```python
# assign a new column to df called 'age' containing a list of ages
df.assign(age = [31, 32, 19])
```
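assign() returns a new dataframe and leaves the original untouched; it can also derive a column from existing ones via a callable. A small sketch (the name_length column is illustrative, not from the original):

```python
import pandas as pd

df = pd.DataFrame()
df['name'] = ['John', 'Steve', 'Sarah']

# assign() leaves df unchanged and returns a new dataframe;
# a callable can compute a column from existing ones
df2 = df.assign(age=[31, 32, 19],
                name_length=lambda d: d['name'].str.len())
```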

Breaking a List into Chunks of Size N

In this snippet we take a list and break it up into chunks of size n. This is a very common practice when working with APIs that have a maximum request size.

This nifty function was contributed by Ned Batchelder and posted on StackOverflow.

```python
# create a list of first names
first_names = ['Steve', 'Jane', 'Sara', 'Mary', 'Jack', 'Bob', 'Bily',
               'Boni', 'Chris', 'Sori', 'Will', 'Won', 'Li']

# create a function called "chunks" with two arguments, l and n
def chunks(l, n):
    # for item i in a range that is a length of l,
    for i in range(0, len(l), n):
        # create an index range for l of n items
        yield l[i:i+n]

# create a list from the results of the chunks function
list(chunks(first_names, 5))
'''
[['Steve', 'Jane', 'Sara', 'Mary', 'Jack'],
 ['Bob', 'Bily', 'Boni', 'Chris', 'Sori'],
 ['Will', 'Won', 'Li']]
'''
```
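For comparison, a sketch of the same idea using itertools.islice; unlike the slicing version, this works on any iterable (for example a generator), not only on lists:

```python
from itertools import islice

def chunks_iter(iterable, n):
    # consume the iterable n items at a time
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

first_names = ['Steve', 'Jane', 'Sara', 'Mary', 'Jack', 'Bob', 'Bily',
               'Boni', 'Chris', 'Sori', 'Will', 'Won', 'Li']
result = list(chunks_iter(first_names, 5))
```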

Breaking Up a String into Columns Using Regex in Pandas

```python
# import modules
import re
import pandas as pd

# create a dataframe with a single column of strings
# (note: the year digits of the dates were lost when this page was extracted)
data = {'raw': ['Arizona 1 -12-23 3242.0',
                'Iowa 1 -02-23 3453.7',
                'Oregon 0 -06-20 2123.0',
                'Maryland 0 -03-14 1123.6',
                'Florida 1 -01-15 2134.0',
                'Georgia 0 -07-14 2345.6']}
df = pd.DataFrame(data, columns = ['raw'])
df
```

```python
# which rows of df['raw'] contain 'xxxx-xx-xx'?
df['raw'].str.contains('....-..-..', regex=True)
'''
0    True
1    True
2    True
3    True
4    True
5    True
Name: raw, dtype: bool
'''

# in the column 'raw', extract the single digit in the strings
df['female'] = df['raw'].str.extract('(\d)', expand=True)
df['female']
'''
0    1
1    1
2    0
3    0
4    1
5    0
Name: female, dtype: object
'''

# in the column 'raw', extract xxxx-xx-xx in the strings
df['date'] = df['raw'].str.extract('(....-..-..)', expand=True)
df['date']
'''
0    -12-23
1    -02-23
2    -06-20
3    -03-14
4    -01-15
5    -07-14
Name: date, dtype: object
'''

# in the column 'raw', extract ####.# in the strings
df['score'] = df['raw'].str.extract('(\d\d\d\d\.\d)', expand=True)
df['score']
'''
0    3242.0
1    3453.7
2    2123.0
3    1123.6
4    2134.0
5    2345.6
Name: score, dtype: object
'''

# in the column 'raw', extract the word in the strings
df['state'] = df['raw'].str.extract('([A-Z]\w{0,})', expand=True)
df['state']
'''
0     Arizona
1        Iowa
2      Oregon
3    Maryland
4     Florida
5     Georgia
Name: state, dtype: object
'''

df
```
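The separate extract() calls above can also be collapsed into a single pass using named capture groups, each of which becomes its own column. A sketch on similarly shaped strings (with hypothetical, fully dated sample values):

```python
import pandas as pd

# hypothetical sample rows with the same layout as above
df = pd.DataFrame({'raw': ['Arizona 1 2014-12-23 3242.0',
                           'Iowa 1 2010-02-23 3453.7']})

# one extract() with named groups yields several columns at once
parts = df['raw'].str.extract(
    r'(?P<state>[A-Z]\w*) (?P<female>\d) (?P<date>\d{4}-\d{2}-\d{2}) (?P<score>\d+\.\d)')
```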

Finding the Columns Shared by Two Dataframes

```python
# import library
import pandas as pd

# create a dataframe
dataframe_one = pd.DataFrame()
dataframe_one['1'] = ['1', '1', '1']
dataframe_one['B'] = ['b', 'b', 'b']

# create a second dataframe
dataframe_two = pd.DataFrame()
dataframe_two['2'] = ['2', '2', '2']
dataframe_two['B'] = ['b', 'b', 'b']

# convert the columns of each dataframe into sets,
# then find the intersection of those two sets.
# this will be the set of columns shared by both dataframes.
set.intersection(set(dataframe_one), set(dataframe_two))
# {'B'}
```
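A sketch of an equivalent that skips the set conversion, since pandas Index objects support intersection directly:

```python
import pandas as pd

dataframe_one = pd.DataFrame({'1': ['1', '1', '1'], 'B': ['b', 'b', 'b']})
dataframe_two = pd.DataFrame({'2': ['2', '2', '2'], 'B': ['b', 'b', 'b']})

# Index.intersection returns the shared column labels
shared = dataframe_one.columns.intersection(dataframe_two.columns)
```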

Constructing a Dictionary from Multiple Lists

```python
# create a list of the names of the officers
officer_names = ['Sodoni Dogla', 'Chris Jefferson', 'Jessica Billars', 'Michael Mulligan', 'Steven Johnson']

# create a list of the armies of the officers
officer_armies = ['Purple Army', 'Orange Army', 'Green Army', 'Red Army', 'Blue Army']

# create a dictionary that is the zip of the two lists
dict(zip(officer_names, officer_armies))
'''
{'Chris Jefferson': 'Orange Army',
 'Jessica Billars': 'Green Army',
 'Michael Mulligan': 'Red Army',
 'Sodoni Dogla': 'Purple Army',
 'Steven Johnson': 'Blue Army'}
'''
```

Converting a CSV into Python Code to Recreate It

```python
# import the pandas package
import pandas as pd

# load the csv file as a dataframe
df_original = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')

# print out code that recreates the dataframe
print('==============================')
print('RUN THE CODE BELOW THIS LINE')
print('==============================')
print('raw_data =', df.to_dict(orient='list'))
print('df = pd.DataFrame(raw_data, columns = ' + str(list(df_original)) + ')')
'''
==============================
RUN THE CODE BELOW THIS LINE
==============================
raw_data = {'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, ...],
            'Sepal.Width': [3.5, 3.0, 3.2000000000000002, ...],
            'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, ...],
            'Petal.Width': [0.20000000000000001, 0.20000000000000001, ...],
            'Species': ['setosa', 'setosa', ..., 'virginica'],
            'Unnamed: 0': [1, 2, 3, ..., 150]}
df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])

(the full printout of all 150 rows of each list is abbreviated here)
'''

# if you intend to check the result:
# 1. paste the code generated by the cell above into this cell
#    (the verbatim raw_data literal is abbreviated here as well)
raw_data = df.to_dict(orient='list')
df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])

# view the first few rows of the dataframe recreated from the generated code
df.head()
```

# 查看使用我们的代码创建的,数据帧的前几行df_original.head()

Convert A Categorical Variable Into Dummy Variables

# Import modules
import pandas as pd

# Create dataframe
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'sex': ['male', 'female', 'male', 'female', 'female']}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'sex'])
df

# Create a set of dummy variables from the sex variable
df_sex = pd.get_dummies(df['sex'])

# Join the dummy variables to the main dataframe
df_new = pd.concat([df, df_sex], axis=1)
df_new

# Alternative for joining the new columns
df_new = df.join(df_sex)
df_new

Convert A Categorical Variable Into Dummy Variables With patsy

# Import modules
import pandas as pd
import patsy

# Create dataframe
raw_data = {'countrycode': [1, 2, 3, 2, 1]}
df = pd.DataFrame(raw_data, columns=['countrycode'])
df

# Convert the countrycode variable into three binary variables
patsy.dmatrix('C(countrycode)-1', df, return_type='dataframe')

Convert A String Categorical Variable To A Numeric Variable

# Import modules
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong', 'weak', 'normal', 'weak', 'strong']}
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
df

# Create a function that converts all values of df['score'] into numbers
def score_to_numeric(x):
    if x == 'strong':
        return 3
    if x == 'normal':
        return 2
    if x == 'weak':
        return 1

df['score_num'] = df['score'].apply(score_to_numeric)
df
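As an aside (not part of the original recipe), the same conversion can be written more compactly by mapping a dictionary onto the column with `Series.map`; a minimal sketch:

```python
import pandas as pd

# Hypothetical score column matching the recipe above
df = pd.DataFrame({'score': ['strong', 'weak', 'normal', 'weak', 'strong']})

# Map each category label directly to its numeric code;
# labels missing from the dict would become NaN
df['score_num'] = df['score'].map({'strong': 3, 'normal': 2, 'weak': 1})
```

This avoids defining a separate function and makes the label-to-number mapping visible in one place.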

Convert A Variable To A Time Series

# Import libraries
import pandas as pd

# Create a dataset with a column of date strings
raw_data = {'date': ['2014-06-01T01:21:38.004053',
                     '2014-06-02T01:21:38.004053',
                     '2014-06-03T01:21:38.004053'],
            'score': [25, 94, 57]}
df = pd.DataFrame(raw_data, columns=['date', 'score'])
df

# Convert the date column to datetime and set it as the index
df["date"] = pd.to_datetime(df["date"])
df = df.set_index(df["date"])
df

Counting In Pandas Dataframes

# Import libraries
import pandas as pd

year = pd.Series([1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884,
                  1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894])
guardCorps = pd.Series([0,2,2,1,0,0,1,1,0,3,0,2,1,0,0,1,0,1,0,1])
corps1 = pd.Series([0,0,0,2,0,3,0,2,0,0,0,1,1,1,0,2,0,3,1,0])
corps2 = pd.Series([0,0,0,2,0,2,0,0,1,1,0,0,2,1,1,0,0,2,0,0])
corps3 = pd.Series([0,0,0,1,1,1,2,0,2,0,0,0,1,0,1,2,1,0,0,0])
corps4 = pd.Series([0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,0,1,1,0,0])
corps5 = pd.Series([0,0,0,0,2,1,0,0,1,0,0,1,0,1,1,1,1,1,1,0])
corps6 = pd.Series([0,0,1,0,2,0,0,1,2,0,1,1,3,1,1,1,0,3,0,0])
corps7 = pd.Series([1,0,1,0,0,0,1,0,1,1,0,0,2,0,0,2,1,0,2,0])
corps8 = pd.Series([1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,1])
corps9 = pd.Series([0,0,0,0,0,2,1,1,1,0,2,1,1,0,1,2,0,1,0,0])
corps10 = pd.Series([0,0,1,1,0,1,0,2,0,2,0,0,0,0,2,1,3,0,1,1])
corps11 = pd.Series([0,0,0,0,2,4,0,1,3,0,1,1,1,1,2,1,3,1,3,1])
corps14 = pd.Series([1,1,2,1,1,3,0,4,0,1,0,3,2,1,0,2,1,1,0,0])
corps15 = pd.Series([0,1,0,0,0,0,0,1,0,1,1,0,0,0,2,2,0,0,0,0])

variables = dict(guardCorps=guardCorps, corps1=corps1, corps2=corps2,
                 corps3=corps3, corps4=corps4, corps5=corps5, corps6=corps6,
                 corps7=corps7, corps8=corps8, corps9=corps9, corps10=corps10,
                 corps11=corps11, corps14=corps14, corps15=corps15)

horsekick = pd.DataFrame(variables, columns=['guardCorps', 'corps1', 'corps2',
                                             'corps3', 'corps4', 'corps5',
                                             'corps6', 'corps7', 'corps8',
                                             'corps9', 'corps10', 'corps11',
                                             'corps14', 'corps15'])
horsekick.index = [1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884,
                   1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894]
horsekick

# Count the number of times each death count appears in each corps
result = horsekick.apply(pd.value_counts).fillna(0)
result

| | guardCorps | corps1 | corps2 | corps3 | corps4 | corps5 | corps6 | corps7 | corps8 | corps9 | corps10 | corps11 | corps14 | corps15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9.0 | 11.0 | 12.0 | 11.0 | 12.0 | 10.0 | 9.0 | 11.0 | 13.0 | 10.0 | 10.0 | 6 | 6 | 14.0 |
| 1 | 7.0 | 4.0 | 4.0 | 6.0 | 8.0 | 9.0 | 7.0 | 6.0 | 7.0 | 7.0 | 6.0 | 8 | 8 | 4.0 |
| 2 | 3.0 | 3.0 | 4.0 | 3.0 | 0.0 | 1.0 | 2.0 | 3.0 | 0.0 | 3.0 | 3.0 | 2 | 3 | 2.0 |
| 3 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 2 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.0 |

# Count how many times each number of monthly deaths appears in guardCorps
pd.value_counts(horsekick['guardCorps'].values, sort=False)
'''
0    9
1    7
2    3
3    1
dtype: int64
'''

# List the unique values in guardCorps
horsekick['guardCorps'].unique()
# array([0, 2, 1, 3])

Creating Pipelines In Pandas

Pandas' pipe feature lets you string Python functions together to build a data-processing pipeline.

import pandas as pd

# Create an empty dataframe
df = pd.DataFrame()

# Create columns
df['name'] = ['John', 'Steve', 'Sarah']
df['gender'] = ['Male', 'Male', 'Female']
df['age'] = [31, 32, 19]

# View the dataframe
df

# Create a function that groups the data by a column
# and returns the mean of each group
def mean_age_by_group(dataframe, col):
    return dataframe.groupby(col).mean()

# Create a function that uppercases all column headers
# and returns the dataframe
def uppercase_column_name(dataframe):
    dataframe.columns = dataframe.columns.str.upper()
    return dataframe

# Create a pipeline that applies the mean_age_by_group function
# and then applies the uppercase_column_name function
(df.pipe(mean_age_by_group, col='gender')
   .pipe(uppercase_column_name))

Creating A Pandas Column With A For Loop

import pandas as pd
import numpy as np

raw_data = {'student_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze',
                             'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger',
                             'Riani', 'Ali'],
            'test_score': [76, 88, 84, 67, 53, 96, 64, 91, 77, 73, 52, np.NaN]}
df = pd.DataFrame(raw_data, columns=['student_name', 'test_score'])

# Create a list to store the grades
grades = []

# For each row in the test_score column,
# append the matching letter grade
for row in df['test_score']:
    if row > 95:
        grades.append('A')
    elif row > 90:
        grades.append('A-')
    elif row > 85:
        grades.append('B')
    elif row > 80:
        grades.append('B-')
    elif row > 75:
        grades.append('C')
    elif row > 70:
        grades.append('C-')
    elif row > 65:
        grades.append('D')
    elif row > 60:
        grades.append('D-')
    else:
        # Otherwise append a failing grade
        grades.append('Failed')

# Create a column from the list
df['grades'] = grades

# View the new dataframe
df
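The same binning can be done without an explicit loop using `pd.cut`. This is an illustrative sketch, not part of the original recipe; note one behavioral difference: the loop above labels a missing score 'Failed', while `pd.cut` leaves NaN as NaN.

```python
import numpy as np
import pandas as pd

scores = pd.Series([76, 88, 84, 67, 53, 96, 64, 91, 77, 73, 52, np.nan])

# Bin edges mirror the loop's thresholds; each bin is (low, high],
# which matches the strict '>' comparisons above
grades = pd.cut(scores,
                bins=[-np.inf, 60, 65, 70, 75, 80, 85, 90, 95, np.inf],
                labels=['Failed', 'D-', 'D', 'C-', 'C', 'B-', 'B', 'A-', 'A'])
```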

Creating Counts Of Items

from collections import Counter

# Create a counter of the fruits eaten today
fruit_eaten = Counter(['Apple', 'Apple', 'Apple', 'Banana', 'Pear', 'Pineapple'])

# View the counter
fruit_eaten
# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 1})

# Update the count for 'Pineapple' (because you just ate a pineapple)
fruit_eaten.update(['Pineapple'])

# View the counter
fruit_eaten
# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 2})

# View the three most common items
fruit_eaten.most_common(3)
# [('Apple', 3), ('Pineapple', 2), ('Banana', 1)]

Create A Column Based On A Conditional

# Import required modules
import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore', 'postTestScore'])
df

# Create a new column called df.elderly whose value is
# 'yes' if df.age is 50 or greater, and 'no' otherwise
df['elderly'] = np.where(df['age'] >= 50, 'yes', 'no')

# View the dataframe
df
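For conditionals with more than two outcomes, `np.where` can be generalized with `np.select`; in this sketch (the `age_group` labels are made up for illustration) conditions are checked in order and the first match wins:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [42, 52, 36, 24, 73]})

# Conditions are evaluated in order; the first True wins,
# and rows matching nothing get the default label
conditions = [df['age'] >= 65, df['age'] >= 50]
choices = ['retired', 'elderly']
df['age_group'] = np.select(conditions, choices, default='adult')
```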

Create A List From Dictionary Keys And Values

# Create a dictionary
dict = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'fireReports': [4, 24, 31, 2, 3]}

# Create a list of keys
list(dict.keys())
# ['fireReports', 'year', 'county']

# Create a list of values
list(dict.values())
'''
[[4, 24, 31, 2, 3],
 [2012, 2012, 2013, 2014, 2014],
 ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']]
'''
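Relatedly, `dict.items()` returns key–value pairs directly, which is often more useful than separate key and value lists; a minimal sketch with made-up data:

```python
# A small dictionary (illustrative values)
fire_data = {'county': ['Cochice', 'Pima'], 'fireReports': [4, 24]}

# Create a list of (key, value) pairs
pairs = list(fire_data.items())
```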

Crosstabs In Pandas

# Import library
import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks',
                         'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons',
                         'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['infantry', 'infantry', 'cavalry', 'cavalry',
                        'infantry', 'infantry', 'cavalry', 'cavalry',
                        'infantry', 'infantry', 'cavalry', 'cavalry'],
            'experience': ['veteran', 'rookie', 'veteran', 'rookie',
                           'veteran', 'rookie', 'veteran', 'rookie',
                           'veteran', 'rookie', 'veteran', 'rookie'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
                     'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'experience',
                                     'name', 'preTestScore', 'postTestScore'])
df

Create a crosstab of regiment by company, counting the number of observations in each.

pd.crosstab(df.regiment, df.company, margins=True)

# Create a crosstab of company and experience for each regiment
pd.crosstab([df.company, df.experience], df.regiment, margins=True)

Dropping Duplicates

# Import modules
import pandas as pd

raw_data = {'first_name': ['Jason', 'Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 42, 1111111, 36, 24, 73],
            'preTestScore': [4, 4, 4, 31, 2, 3],
            'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age',
                                     'preTestScore', 'postTestScore'])
df

# Identify which observations are duplicates
df.duplicated()
'''
0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool
'''

df.drop_duplicates()

# Drop duplicates in the first_name column,
# but keep the last observation in each duplicate set
df.drop_duplicates(['first_name'], keep='last')

Descriptive Statistics For Pandas Dataframes

# Import modules
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore', 'postTestScore'])
df

5 rows × 4 columns

# Sum of all ages
df['age'].sum()
# 227

df['preTestScore'].mean()
# 12.800000000000001

df['preTestScore'].cumsum()
'''
0     4
1    28
2    59
3    61
4    64
Name: preTestScore, dtype: int64
'''

df['preTestScore'].describe()
'''
count     5.000000
mean     12.800000
std      13.663821
min       2.000000
25%       3.000000
50%       4.000000
75%      24.000000
max      31.000000
Name: preTestScore, dtype: float64
'''

df['preTestScore'].count()
# 5

df['preTestScore'].min()
# 2

df['preTestScore'].max()
# 31

df['preTestScore'].median()
# 4.0

df['preTestScore'].var()
# 186.69999999999999

df['preTestScore'].std()
# 13.663820841916802

df['preTestScore'].skew()
# 0.74334524573267591

df['preTestScore'].kurt()
# -2.4673543738411525

# Correlation matrix
df.corr()

3 rows × 3 columns

# Covariance matrix
df.cov()

3 rows × 3 columns

Dropping Rows And Columns

# Import modules
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

# Drop observations (rows)
df.drop(['Cochice', 'Pima'])

# Drop a variable (column)
# Note: axis=1 means we are referring to a column, not a row
df.drop('reports', axis=1)

Drop a row if it contains a certain value (in this case, 'Tina').

Specifically: keep the rows where the value of the name column is not equal to 'Tina'.

df[df.name != 'Tina']

Drop a row by row number (in this case, row 3).

Note that pandas uses zero-based numbering, so 0 is the first row, 1 is the second row, and so on.

df.drop(df.index[2])

This can be extended to a range of rows.

df.drop(df.index[[2,3]])

Or you can drop relative to the end of the dataframe.

df.drop(df.index[-2])

You can also select a range relative to the start or end.

df[:3]  # Keep the first three rows

df[:-3]  # Drop the last three rows

Enumerating A List

# Create a list of strings
data = ['One', 'Two', 'Three', 'Four', 'Five']

# For each item in enumerate(data)
for item in enumerate(data):
    # Print the whole enumerated element
    print(item)
    # Print only the value (without the index)
    print(item[1])

'''
(0, 'One')
One
(1, 'Two')
Two
(2, 'Three')
Three
(3, 'Four')
Four
(4, 'Five')
Five
'''

Expand Cells Containing Lists Into Their Own Variables In Pandas

# Import pandas
import pandas as pd

# Create a dataset
raw_data = {'score': [1, 2, 3],
            'tags': [['apple', 'pear', 'guava'],
                     ['truck', 'car', 'plane'],
                     ['cat', 'dog', 'mouse']]}
df = pd.DataFrame(raw_data, columns=['score', 'tags'])

# View the dataset
df

# Expand df.tags into its own dataframe
tags = df['tags'].apply(pd.Series)

# Rename each variable to tag_<n>
tags = tags.rename(columns=lambda x: 'tag_' + str(x))

# View the tags dataframe
tags

# Join the tags dataframe back to the original dataframe
pd.concat([df[:], tags[:]], axis=1)

Filtering Pandas Dataframes

# Import modules
import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

# View a column
df['name']
'''
Cochice       Jason
Pima          Molly
Santa Cruz     Tina
Maricopa       Jake
Yuma            Amy
Name: name, dtype: object
'''

df[['name', 'reports']]

# View the first two rows
df[:2]

# View rows where coverage is greater than 50
df[df['coverage'] > 50]

# View rows where coverage is greater than 50 and reports is less than 4
df[(df['coverage'] > 50) & (df['reports'] < 4)]

Finding The Largest Value In A Column Of A Dataframe

# Import modules
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create dataframe
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age',
                                     'preTestScore', 'postTestScore'])
df

# Get the index of the row with the largest value in the preTestScore column
df['preTestScore'].idxmax()
# 2

Finding Unique Values In A Dataframe

import pandas as pd
import numpy as np

raw_data = {'regiment': ['51st', '29th', '2nd', '19th', '12th', '101st',
                         '90th', '30th', '193th', '1st', '94th', '91th'],
            'trucks': ['MAZ-7310', np.nan, 'MAZ-7310', 'MAZ-7310', 'Tatra 810',
                       'Tatra 810', 'Tatra 810', 'Tatra 810', 'ZIS-150',
                       'Tatra 810', 'ZIS-150', 'ZIS-150'],
            'tanks': ['Merkava Mark 4', 'Merkava Mark 4', 'Merkava Mark 4',
                      'Leopard 2A6M', 'Leopard 2A6M', 'Leopard 2A6M',
                      'Arjun MBT', 'Leopard 2A6M', 'Arjun MBT', 'Arjun MBT',
                      'Arjun MBT', 'Arjun MBT'],
            'aircraft': ['none', 'none', 'none', 'Harbin Z-9', 'Harbin Z-9',
                         'none', 'Harbin Z-9', 'SH-60B Seahawk',
                         'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk',
                         'SH-60B Seahawk']}
df = pd.DataFrame(raw_data, columns=['regiment', 'trucks', 'tanks', 'aircraft'])

# View the first few rows
df.head()

# Create a list of unique values by converting the pandas column to a set
list(set(df.trucks))
# [nan, 'Tatra 810', 'MAZ-7310', 'ZIS-150']

# Create a list of the unique values in df.trucks
list(df['trucks'].unique())
# ['MAZ-7310', nan, 'Tatra 810', 'ZIS-150']

Geocoding And Reverse Geocoding

When working with geographic data, geocoding (converting a physical address or location into latitude and longitude) and reverse geocoding (converting latitude and longitude into a physical address or location) are common tasks.

Python offers a number of packages that make this remarkably easy. In the tutorial below, I use pygeocoder, a wrapper for Google's geo API, to both geocode and reverse geocode.

First we want to load the packages we will use in the script. Specifically, I am loading pygeocoder for its geo functions, pandas for its dataframe structure, and numpy for its missing-value (np.nan) functions.

# Load packages
from pygeocoder import Geocoder
import pandas as pd
import numpy as np

Geographic data comes in many forms; in this case we have a Python dictionary of five latitude–longitude strings, each coordinate pair in a comma-separated string.

# Create a dictionary of raw data
data = {'Site 1': '31.336968, -109.560959',
        'Site 2': '31.347745, -108.229963',
        'Site 3': '32.277621, -107.734724',
        'Site 4': '31.655494, -106.420484',
        'Site 5': '30.295053, -104.014528'}

While technically unnecessary, because I originally came from R, I am a big fan of dataframes, so let's turn the dictionary of simulated data into a dataframe.

# Convert the dictionary into a pandas dataframe
df = pd.DataFrame.from_dict(data, orient='index')

# View the dataframe
df

You can now see that we have a dataframe with five rows, each row containing a latitude–longitude string. Before we can work with the data, we need to 1) separate each string into latitude and longitude, and 2) convert them into floats. The code below does just that.

# Create two lists for the loop
lat = []
lon = []

# For each row in the variable
for row in df[0]:
    try:
        # Split the row on the comma, convert to float,
        # and append everything before the comma to lat
        lat.append(float(row.split(',')[0]))
        # Split the row on the comma, convert to float,
        # and append everything after the comma to lon
        lon.append(float(row.split(',')[1]))
    # But if there is an error,
    except:
        # append a missing value to lat
        lat.append(np.NaN)
        # and a missing value to lon
        lon.append(np.NaN)

# Create two new columns from lat and lon
df['latitude'] = lat
df['longitude'] = lon
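As a side note (not part of the original tutorial), the loop above can also be written with pandas' vectorized string methods; this sketch splits the column and converts both halves in a few lines, with malformed rows becoming NaN:

```python
import pandas as pd

# Two illustrative coordinate strings in column 0, as in the tutorial
df = pd.DataFrame({0: ['31.336968, -109.560959', '31.347745, -108.229963']})

# Split each string on the comma into two columns,
# then coerce both halves to floats (bad rows become NaN)
parts = df[0].str.split(',', expand=True)
df['latitude'] = pd.to_numeric(parts[0], errors='coerce')
df['longitude'] = pd.to_numeric(parts[1], errors='coerce')
```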

Let's take a look at what we have now.

# View the dataframe
df

Awesome. This is exactly what we want to see: one column of latitude floats and one column of longitude floats.

To reverse geocode, we feed a specific latitude–longitude pair (here, the first row, index 0) into pygeocoder's reverse_geocode function.

# Convert the longitude and latitude into a location
results = Geocoder.reverse_geocode(df['latitude'][0], df['longitude'][0])

Now we can start pulling out the data we want.

# Print the latitude and longitude
results.coordinates
# (31.3372728, -109.5609559)

# Print the city
results.city
# 'Douglas'

# Print the country
results.country
# 'United States'

# Print the street address (if available)
results.street_address

# Print the administrative area
results.administrative_area_level_1
# 'Arizona'

For geocoding, we need to feed a string containing an address or location (such as a city) into the geocoding function. However, not every string is formatted in a way that Google's geo API can understand. We can check whether an address is valid with the .geocode().valid_address attribute.

# Verify that the address is valid (i.e. in Google's system)
Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607").valid_address
# True

Because the output was True, we now know this is a valid address, and so we can print the latitude and longitude coordinates.

# Print the latitude and longitude
results.coordinates
# (31.3372728, -109.5609559)

But more interestingly, once an address has been processed by the Google geo API, we can parse it and easily separate the street number, street name, and so on.

# Find the latitude and longitude of a specific address
result = Geocoder.geocode("7250 South Tucson Boulevard, Tucson, AZ 85756")

# Print the street number
result.street_number
# '7250'

# Print the street name
result.route
# 'South Tucson Boulevard'

And there you have it. Python makes the whole process easy, taking only a few minutes of analysis. Good luck!

Geolocating Cities And Countries

This tutorial creates a function that takes a city and a country and tries to return their latitude and longitude. When the city is not available (which is often the case), the latitude and longitude of the center of the country are returned instead.

from geopy.geocoders import Nominatim
import numpy as np

# Note: newer versions of geopy require a user_agent string
geolocator = Nominatim(user_agent="geolocate_example")

def geolocate(city=None, country=None):
    '''
    Takes a city and a country, or just a country. Returns the
    latitude/longitude coordinates of the city if available;
    otherwise returns the coordinates of the center of the country.
    '''
    # If the city exists,
    if city is not None:
        try:
            # geolocate the city and country
            loc = geolocator.geocode(str(city + ',' + country))
            # and return the latitude and longitude
            return (loc.latitude, loc.longitude)
        except:
            # otherwise return a missing value
            return np.nan
    # If the city doesn't exist,
    else:
        try:
            # geolocate the center of the country
            loc = geolocator.geocode(country)
            # and return the latitude and longitude
            return (loc.latitude, loc.longitude)
        except:
            # otherwise return a missing value
            return np.nan

# Geolocate a city and country
geolocate(city='Austin', country='USA')
# (30.2711286, -97.7436995)

# Geolocate just a country
geolocate(country='USA')
# (39.7837304, -100.4458824)

Group Time Series Data With Pandas

# Import required modules
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['german_army'] = np.random.randint(low=20000, high=30000, size=100)
df['allied_army'] = np.random.randint(low=20000, high=40000, size=100)
df.index = pd.date_range('1/1/2014', periods=100, freq='H')
df.head()

Truncate the dataframe.

df.truncate(before='1/2/2014', after='1/3/2014')

# Offset the dataframe's index by 4 months and 5 days
df.index = df.index + pd.DateOffset(months=4, days=5)
df.head()

# Lag the variables one hour
df.shift(1).head()

# Lead the variables one hour
df.shift(-1).tail()

# Aggregate into days by summing the hourly observations
df.resample('D').sum()

# Aggregate into days by averaging the hourly observations
df.resample('D').mean()

# Aggregate into days by taking the minimum of the hourly observations
df.resample('D').min()

# Aggregate into days by taking the median of the hourly observations
df.resample('D').median()

# Aggregate into days by taking the first hourly observation
df.resample('D').first()

# Aggregate into days by taking the last hourly observation
df.resample('D').last()

# Aggregate into days by taking the first, last, highest, and lowest hourly observations
df.resample('D').ohlc()
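Several of the daily summaries above can be computed in one pass with `.agg`; a small sketch under the same assumption of an hourly datetime index (the data here is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'german_army': np.arange(48)},
                  index=pd.date_range('1/1/2014', periods=48, freq='H'))

# Aggregate each calendar day with several functions at once;
# the result has one column per (variable, function) pair
daily = df.resample('D').agg(['sum', 'mean', 'min', 'max'])
```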

Group Data By Time

On March 13, 2016, pandas version 0.18.0 was released and the resampling functionality was significantly changed. This tutorial follows v0.18.0 and will not work for earlier versions of pandas.

First let's load the modules we care about.

# Import required modules
import pandas as pd
import datetime
import numpy as np

Next, let's create some sample data that we can group by time. In this example I am creating a dataframe of two columns and 365 rows. One column is a date and the second column is a numeric value.

# Create a datetime variable for today
base = datetime.datetime.today()

# Create a list of 365 datetime values, one per day
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]

# Create a list of 365 numeric values
score_list = list(np.random.randint(low=1, high=1000, size=365))

# Create an empty dataframe
df = pd.DataFrame()

# Create a column from the datetime variable
df['datetime'] = date_list

# Convert the column to datetime type
df['datetime'] = pd.to_datetime(df['datetime'])

# Set the datetime column as the index
df.index = df['datetime']

# Create a column for the numeric score variable
df['score'] = score_list

# Let's take a look at the data
df.head()

In pandas, the most common way to group by time is with the .resample() function. In v0.18.0 this function is two-stage: df.resample('M') creates an object to which we can apply other functions (mean, count, sum, etc.).

# Group the data by month and take the mean of each group (i.e. each month)
df.resample('M').mean()

# Group the data by month and take the sum of each group (i.e. each month)
df.resample('M').sum()

There are many options for grouping. You can learn more about them in pandas' time series documentation, but for your convenience I have also listed them below.
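Some of the most common frequency aliases accepted by .resample() can be sketched as follows (see the pandas time-series documentation for the full table):

```python
import pandas as pd

# A few common offset aliases:
#   'A' year end, 'Q' quarter end, 'M' month end, 'W' weekly,
#   'D' calendar day, 'H' hourly, 'T' minutely, 'S' secondly
series = pd.Series(range(4), index=pd.date_range('1/1/2014', periods=4, freq='H'))

# For example, resample four hourly values into a single daily bin
daily = series.resample('D').sum()
```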

Group Data By Hour

# Import libraries
import pandas as pd
import numpy as np

# Create a time series of 2000 elements,
# one every five minutes, starting on 1/1/2000
time = pd.date_range('1/1/2000', periods=2000, freq='5min')

# Create a pandas series with random values between 0 and 100,
# using time as the index
series = pd.Series(np.random.randint(100, size=2000), index=time)

# View the first few rows
series[0:10]
'''
2000-01-01 00:00:00    40
2000-01-01 00:05:00    13
2000-01-01 00:10:00    99
2000-01-01 00:15:00    72
2000-01-01 00:20:00     4
2000-01-01 00:25:00    36
2000-01-01 00:30:00    24
2000-01-01 00:35:00     0
2000-01-01 00:40:00    83
2000-01-01 00:45:00    44
Freq: 5T, dtype: int64
'''

# Group the data by the hour value of the index, then aggregate by the mean
series.groupby(series.index.hour).mean()
'''
0     50.380952
1     49.380952
2     49.904762
3     53.273810
4     47.178571
5     46.095238
6     49.047619
7     44.297619
8     53.119048
9     48.261905
10    45.166667
11    54.214286
12    50.714286
13    56.130952
14    50.916667
15    42.428571
16    46.880952
17    56.892857
18    54.071429
19    47.607143
20    50.940476
21    50.511905
22    44.550000
23    50.250000
dtype: float64
'''

Grouping Rows In Pandas

# Import modules
import pandas as pd

# Example dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks',
                         'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons',
                         'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd',
                        '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
                     'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'name',
                                     'preTestScore', 'postTestScore'])
df

# Create a grouping object. In other words, create an object that
# represents that particular grouping. Here, we group the
# pre-test scores by regiment.
regiment_preScore = df['preTestScore'].groupby(df['regiment'])

# Display the mean pre-test score of each regiment
regiment_preScore.mean()
'''
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64
'''

Hierarchical Data In Pandas

# Import modules
import pandas as pd

# Create dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks',
                         'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons',
                         'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd',
                        '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
                     'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'name',
                                     'preTestScore', 'postTestScore'])
df

# Set the hierarchical index but leave the columns in place
df = df.set_index(['regiment', 'company'], drop=False)
df

# Set the hierarchical index to be by regiment, then company
df = df.set_index(['regiment', 'company'])
df

# View the index
df.index
'''
MultiIndex(levels=[['Dragoons', 'Nighthawks', 'Scouts'], ['1st', '2nd']],
           labels=[[1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2],
                   [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]],
           names=['regiment', 'company'])
'''

# Swap the levels in the index
df.swaplevel('regiment', 'company')

# Sum the data by regiment
df.sum(level='regiment')
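One caveat: `df.sum(level=...)` was deprecated and later removed in pandas 2.0; the equivalent groupby form, sketched here on a small hypothetical MultiIndex frame, is:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Dragoons', '1st'), ('Dragoons', '2nd'), ('Scouts', '1st')],
    names=['regiment', 'company'])
df = pd.DataFrame({'preTestScore': [3, 24, 2]}, index=idx)

# Sum within each value of the 'regiment' index level
sums = df.groupby(level='regiment').sum()
```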
