
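Data Science and Artificial Intelligence Technical Notes, Part 19: Data Wrangling (Part 2)

Date: 2019-05-04 16:11:53

19. Data Wrangling (Part 2)

Author: Chris Albon
Translator: 飞龙
License: CC BY-NC-SA 4.0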

Joining and merging DataFrames

# Import modules
import pandas as pd
from IPython.display import display
from IPython.display import Image
# Create the first DataFrame
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
df_a

# Create the second DataFrame
raw_data = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
df_b = pd.DataFrame(raw_data, columns=['subject_id', 'first_name', 'last_name'])
df_b

# Create the third DataFrame
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data, columns=['subject_id', 'test_id'])
df_n

# Concatenate the two DataFrames along rows
df_new = pd.concat([df_a, df_b])
df_new

# Concatenate the two DataFrames along columns
pd.concat([df_a, df_b], axis=1)

# Merge the two DataFrames on subject_id
pd.merge(df_new, df_n, on='subject_id')

# Merge the two DataFrames on subject_id, naming the key column for the left and right DataFrames separately
pd.merge(df_new, df_n, left_on='subject_id', right_on='subject_id')

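Merge with an outer join.

"A full outer join produces the set of all records in Table A and Table B, with matching records from both sides where available. If there is no match, the missing side will contain null." – [source](http://blog ./a-visual-explanation-of-sql-joins/)

pd.merge(df_a, df_b, on='subject_id', how='outer')

Merge with an inner join.

"An inner join produces only the set of records that match in both Table A and Table B." – source

pd.merge(df_a, df_b, on='subject_id', how='inner')

# Merge with a right join
pd.merge(df_a, df_b, on='subject_id', how='right')

Merge with a left join.

"A left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null." – source

pd.merge(df_a, df_b, on='subject_id', how='left')

# Add suffixes to duplicated column names when merging
pd.merge(df_a, df_b, on='subject_id', how='left', suffixes=('_left', '_right'))

# Merge based on the index
pd.merge(df_a, df_b, right_index=True, left_index=True)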
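
When comparing join types it can help to see which side each row came from. The sketch below is not part of the original notes; it uses two small, hypothetical frames (`left` and `right`, cut-down versions of `df_a` and `df_b`) and the `indicator=True` option of `pd.merge`, which adds a `_merge` column.

```python
import pandas as pd

# Hypothetical miniature versions of df_a and df_b, for illustration only
left = pd.DataFrame({'subject_id': ['3', '4', '5'],
                     'first_name': ['Allen', 'Alice', 'Ayoung']})
right = pd.DataFrame({'subject_id': ['4', '5', '6'],
                      'first_name': ['Billy', 'Brian', 'Bran']})

# Outer join keeps every subject_id from both sides;
# indicator=True records the origin of each row in a '_merge' column
merged = pd.merge(left, right, on='subject_id', how='outer',
                  suffixes=('_left', '_right'), indicator=True)
print(merged)
```

Rows marked `left_only` or `right_only` are the ones an inner join would drop.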

List the unique values in a pandas column

Special thanks to Bob Haffner for pointing out a better way to do this.

# Import modules
import pandas as pd
# Set the maximum number of rows displayed
pd.set_option('display.max_rows', 1000)
# Set the maximum number of columns displayed
pd.set_option('display.max_columns', 50)
# Create an example DataFrame
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [, , , , ],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

# List the unique values of df['name']
df.name.unique()
# array(['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], dtype=object)

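Load a JSON file

# Load the library
import pandas as pd
# Create the URL of the JSON file (this could also be a file path)
url = '/chrisalbon/simulated_datasets/master/data.json'
# Load the JSON file into a DataFrame
df = pd.read_json(url, orient='columns')
# View the first ten rows
df.head(10)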

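Load an Excel file

# Load the library
import pandas as pd
# Create the URL of the Excel file (this could also be a file path)
url = '/chrisalbon/simulated_datasets/master/data.xlsx'
# Load the first sheet of the Excel file into a DataFrame
# (the keyword is sheet_name; the older spelling sheetname is no longer accepted)
df = pd.read_excel(url, sheet_name=0, header=1)
# View the first ten rows
df.head(10)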

Load an Excel sheet as a DataFrame

# Import modules
import pandas as pd
# Load the Excel file and assign it to xls_file
xls_file = pd.ExcelFile('../data/example.xls')
xls_file
# <pandas.io.excel.ExcelFile at 0x111912be0>
# View the names of the sheets in the spreadsheet
xls_file.sheet_names
# ['Sheet1']
# Load Sheet1 of the xls file as a DataFrame
df = xls_file.parse('Sheet1')
df

Load a CSV

# Import modules
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, ".", "."],
            'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

# Save the DataFrame as a csv in the working directory, then load it back
df.to_csv('pandas_dataframe_importing_csv/example.csv')
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv')
df

# Load a csv with no header row
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', header=None)
df

# Specify the column names while loading the csv
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df

# Load the csv with the index column set to UID
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col='UID', names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df

# Load the csv with the index set to first name and last name
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', index_col=['First Name', 'Last Name'], names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
df

# Treat '.' as a missing value while loading the csv
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=['.'])
pd.isnull(df)

# Load the csv, treating '.' and 'NA' as missing in the 'Last Name' column and '.' as missing in the 'Pre-Test Score' column
sentinels = {'Last Name': ['.', 'NA'], 'Pre-Test Score': ['.']}
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels)
df

# Skip the first 3 rows while loading the csv
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', na_values=sentinels, skiprows=3)
df

# Load the csv, interpreting ',' inside numeric strings as a thousands separator
df = pd.read_csv('pandas_dataframe_importing_csv/example.csv', thousands=',')
df
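
The `na_values` and `thousands` options can be tried without writing a file to disk. The following is a minimal sketch, assuming a small in-memory CSV (`csv_text` is made up for illustration and is not the file used above).

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk
csv_text = (
    'first_name,preTestScore,postTestScore\n'
    'Jason,4,"25,000"\n'
    'Tina,.,57\n'
)

# '.' is read as missing, and ',' inside numbers is read as a thousands separator
df_csv = pd.read_csv(io.StringIO(csv_text), na_values=['.'], thousands=',')
print(df_csv)
print(df_csv.dtypes)
```

Because the missing score becomes NaN, `preTestScore` is loaded as a float column, while `postTestScore` stays an integer column.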

Long to wide format

# Import modules
import pandas as pd
raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
df

Make "wide" data.

Now we create a "wide" DataFrame in which each row is a patient number, each column is an observation number, and each cell holds the score.

df.pivot(index='patient', columns='obs', values='score')
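
For the reverse direction, wide back to long, one possible approach is `melt`. This sketch is an addition to the notes: it rebuilds the same patient data (without the `treatment` column), pivots it, and then melts it back. The variable names `long_df`, `wide`, and `back` are illustrative.

```python
import pandas as pd

long_df = pd.DataFrame({'patient': [1, 1, 1, 2, 2],
                        'obs': [1, 2, 3, 1, 2],
                        'score': [6252, 24243, 2345, 2342, 23525]})

# Long to wide: one row per patient, one column per observation number
wide = long_df.pivot(index='patient', columns='obs', values='score')

# Wide to long again: patient 2 has no third observation,
# so that cell is NaN in the wide table and is dropped here
back = (wide.reset_index()
            .melt(id_vars='patient', var_name='obs', value_name='score')
            .dropna())
print(wide)
print(back)
```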

Lowercase the column names in a DataFrame

# Import modules
import pandas as pd
# Set the maximum number of rows displayed
pd.set_option('display.max_rows', 1000)
# Set the maximum number of columns displayed
pd.set_option('display.max_columns', 50)
# Create an example DataFrame
data = {'NAME': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'YEAR': [, , , , ],
        'REPORTS': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

# Lowercase the column names
# Apply the lowercasing string method to all column names
df.columns = df.columns.str.lower()
df

Create a new column using a function

# Import modules
import pandas as pd
# Example DataFrame
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

# Create a function that takes two inputs, pre and post
def pre_post_difference(pre, post):
    # Return the difference between the two
    return post - pre
# Create a variable that is the output of the function
df['score_change'] = pre_post_difference(df['preTestScore'], df['postTestScore'])
# View the DataFrame
df

# Create a function that takes one input, x
def score_multipler_2x_and_3x(x):
    # Return two things, 2x and 3x
    return x*2, x*3
# Create two new variables from the two outputs of the function
df['post_score_x2'], df['post_score_x3'] = zip(*df['postTestScore'].map(score_multipler_2x_and_3x))
df

Map external values to DataFrame values

# Import modules
import pandas as pd
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'city'])
df

# Create a dictionary of values
city_to_state = {'San Francisco': 'California',
                 'Baltimore': 'Maryland',
                 'Miami': 'Florida',
                 'Douglas': 'Arizona',
                 'Boston': 'Massachusetts'}
# Look up each city in the dictionary to create the state column
df['state'] = df['city'].map(city_to_state)
df

Missing data in DataFrames

# Import modules
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
            'age': [42, np.nan, 36, 24, 73],
            'sex': ['m', np.nan, 'f', 'm', 'f'],
            'preTestScore': [4, np.nan, np.nan, 2, 3],
            'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df

# Drop rows that contain any missing value
df_no_missing = df.dropna()
df_no_missing

# Drop rows in which every cell is NA
df_cleaned = df.dropna(how='all')
df_cleaned

# Create a new column filled with missing values
df['location'] = np.nan
df

# Drop columns that contain only missing values
df.dropna(axis=1, how='all')

# Drop rows that contain fewer than five non-missing observations
# This is really useful for time series
df.dropna(thresh=5)

# Fill missing data with zeros
df.fillna(0)

# Fill the missing values in preTestScore with the mean of preTestScore
# Assigning the result back to the column saves the change in df
df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())
df

# Fill the missing values in postTestScore with the mean postTestScore of each sex
df["postTestScore"] = df["postTestScore"].fillna(df.groupby("sex")["postTestScore"].transform("mean"))
df

# Select the rows where age is not NaN and sex is not NaN
df[df['age'].notnull() & df['sex'].notnull()]

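Moving averages in pandas

# Import modules
import pandas as pd
# Create data
data = {'score': [1,1,1,2,2,2,3,3,3]}
# Create the DataFrame
df = pd.DataFrame(data)
# View the DataFrame
df

# Calculate the moving average. That is, take the first two values and average them,
# then drop the first value and add the third, and so on.
df.rolling(window=2).mean()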
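
With `window=2` the first row has no complete window, so its moving average is NaN. The sketch below, which is an addition rather than part of the original notes, contrasts that default with `min_periods=1`, which lets pandas average whatever values are available. The names `strict` and `lenient` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'score': [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# Default: a full window of 2 values is required, so the first result is NaN
strict = df['score'].rolling(window=2).mean()

# min_periods=1: a partial window is allowed, so the first result is the value itself
lenient = df['score'].rolling(window=2, min_periods=1).mean()

print(pd.DataFrame({'strict': strict, 'lenient': lenient}))
```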

Normalize a column

# Import the required modules
import pandas as pd
from sklearn import preprocessing
# Set charts to display inline
%matplotlib inline
# Create an example DataFrame with one unnormalized column
data = {'score': [234,24,14,27,-74,46,73,-18,59,160]}
df = pd.DataFrame(data)
df

# View the unnormalized data
df['score'].plot(kind='bar')
# <matplotlib.axes._subplots.AxesSubplot at 0x11b9c88d0>

# Create x, where x is the score column's values as floats
x = df[['score']].values.astype(float)
# Create a min-max scaler object
min_max_scaler = preprocessing.MinMaxScaler()
# Fit the scaler to the data and transform the data
x_scaled = min_max_scaler.fit_transform(x)
# Put the normalized values into a DataFrame
df_normalized = pd.DataFrame(x_scaled)
# View the DataFrame
df_normalized

# Plot the DataFrame
df_normalized.plot(kind='bar')
# <matplotlib.axes._subplots.AxesSubplot at 0x11ba31c50>
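
Min-max scaling maps the smallest value to 0 and the largest to 1 using (x - min) / (max - min). As a sketch of what the scaler computes with its default settings, the same arithmetic can be written directly in pandas. The column name `score_scaled` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'score': [234, 24, 14, 27, -74, 46, 73, -18, 59, 160]})

# Min-max normalization by hand: (x - min) / (max - min)
score = df['score'].astype(float)
df['score_scaled'] = (score - score.min()) / (score.max() - score.min())
print(df)
```

Here the minimum is -74 and the maximum is 234, so every score is shifted by 74 and divided by 308.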

Pivot tables in pandas

# Import modules
import pandas as pd
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd'],
            'TestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]}
df = pd.DataFrame(raw_data, columns=['regiment', 'company', 'TestScore'])
df

# Create a pivot table of group means, by regiment and company
pd.pivot_table(df, index=['regiment','company'], aggfunc='mean')

# Create a pivot table of group counts, by regiment and company
df.pivot_table(index=['regiment','company'], aggfunc='count')

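Quickly change a string column in pandas

I often need or want to change the case of every item in a column of strings (for example, BRAZIL to Brazil). There are many ways to do this, but I have settled on this one as the easiest and quickest.

# Import pandas
import pandas as pd
# Create a Series of names
first_names = pd.Series(['Steve Murrey', 'Jane Fonda', 'Sara McGully', 'Mary Jane'])
# Print the column
first_names
'''
0    Steve Murrey
1      Jane Fonda
2    Sara McGully
3       Mary Jane
dtype: object
'''
# Print the column in lowercase
first_names.str.lower()
'''
0    steve murrey
1      jane fonda
2    sara mcgully
3       mary jane
dtype: object
'''
# Print the column in uppercase
first_names.str.upper()
'''
0    STEVE MURREY
1      JANE FONDA
2    SARA MCGULLY
3       MARY JANE
dtype: object
'''
# Print the column in title case
first_names.str.title()
'''
0    Steve Murrey
1      Jane Fonda
2    Sara Mcgully
3       Mary Jane
dtype: object
'''
# Print the column split on spaces
first_names.str.split(" ")
'''
0     [Steve, Murrey]
1       [Jane, Fonda]
2     [Sara, McGully]
3        [Mary, Jane]
dtype: object
'''
# Print the column with only the first letter capitalized
first_names.str.capitalize()
'''
0    Steve murrey
1      Jane fonda
2    Sara mcgully
3       Mary jane
dtype: object
'''

You get the idea. pandas provides many more string methods beyond these.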

Randomly sample a DataFrame

# Import modules
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

# Select a random subset of size 2 without replacement
df.take(np.random.permutation(len(df))[:2])
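
pandas also has a built-in `DataFrame.sample` method that does the same job. The following sketch is an alternative to the permutation approach, not the author's method; `random_state=0` is an arbitrary seed chosen here so the draw is repeatable.

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'age': [42, 52, 36, 24, 73]})

# Draw 2 rows without replacement (the default); the seed makes the draw repeatable
subset = df.sample(n=2, random_state=0)

# frac=1 returns every row in shuffled order
shuffled = df.sample(frac=1, random_state=0)

print(subset)
print(shuffled)
```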

Rank the rows of a DataFrame

# Import modules
import pandas as pd
# Create the DataFrame
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [, , , , ],
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

5 rows × 4 columns

# Create a new column that is the rank of the coverage values in ascending order
df['coverageRanked'] = df['coverage'].rank(ascending=True)
df

5 rows × 5 columns

Regular expression basics

# Import the regex package
import re
import sys
text = 'The quick brown fox jumped over the lazy black bear.'
three_letter_word = r'\w{3}'
pattern_re = re.compile(three_letter_word); pattern_re
# re.compile(r'\w{3}', re.UNICODE)
re_search = re.search('..own', text)
if re_search:
    # Print the search result
    print(re_search.group())
# brown

re.match

re.match() only matches at the beginning of a string (or the whole string). For anything else, use re.search.

Match all three letter words in text

# Try to match '..own' at the very start of the text
re_match = re.match('..own', text)
if re_match:
    # Print the match
    print(re_match.group())
else:
    # Otherwise print this
    print('No matches')
# No matches
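
The difference between matching at the start and searching anywhere can be seen side by side. This is a small illustrative sketch using the same sentence; the variable names are made up, and `re.fullmatch` is included only for comparison.

```python
import re

text = 'The quick brown fox jumped over the lazy black bear.'

at_start = re.match(r'The', text)        # succeeds: the pattern sits at position 0
not_at_start = re.match(r'..own', text)  # None: 'brown' is not at the start
anywhere = re.search(r'..own', text)     # succeeds: search scans the whole string
whole = re.fullmatch(r'..own', 'brown')  # succeeds: the entire string must match

print(at_start.group(), not_at_start, anywhere.group(), whole.group())
```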

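re.split

# Split the string using 'e' as the delimiter.
re_split = re.split('e', text); re_split
# ['Th', ' quick brown fox jump', 'd ov', 'r th', ' lazy black b', 'ar.']

re.sub

Replace occurrences of a regex pattern with something else. The count of 3 is the maximum number of replacements to make.

# Replace the first three instances of 'e' with 'E', then print the result
re_sub = re.sub('e', 'E', text, count=3); print(re_sub)
# ThE quick brown fox jumpEd ovEr the lazy black bear.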

Regular expression examples

# Import regex
import re
# Create some data
text = 'A flock of 120 quick brown foxes jumped over 30 lazy brown, bears.'
re.findall('^A', text)
# ['A']
re.findall('bears.$', text)
# ['bears.']
re.findall('f..es', text)
# ['foxes']
# Find all vowels
re.findall('[aeiou]', text)
# ['o', 'o', 'u', 'i', 'o', 'o', 'e', 'u', 'e', 'o', 'e', 'a', 'o', 'e', 'a']
# Find all characters that are not lowercase vowels
re.findall('[^aeiou]', text)
'''
['A', ' ', 'f', 'l', 'c', 'k', ' ', 'f', ' ', '1', '2', '0', ' ', 'q', 'c', 'k', ' ',
 'b', 'r', 'w', 'n', ' ', 'f', 'x', 's', ' ', 'j', 'm', 'p', 'd', ' ', 'v', 'r', ' ',
 '3', '0', ' ', 'l', 'z', 'y', ' ', 'b', 'r', 'w', 'n', ',', ' ', 'b', 'r', 's', '.']
'''
re.findall('a|A', text)
# ['A', 'a', 'a']
# Find any instance of 'foxes'
re.findall('(foxes)', text)
# ['foxes']
# Find all runs of five word characters
re.findall(r'\w\w\w\w\w', text)
# ['flock', 'quick', 'brown', 'foxes', 'jumpe', 'brown', 'bears']
re.findall(r'\W\W', text)
# [', ']
re.findall(r'\s', text)
# [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
re.findall(r'\S\S', text)
'''
['fl', 'oc', 'of', '12', 'qu', 'ic', 'br', 'ow', 'fo', 'xe', 'ju', 'mp',
 'ed', 'ov', 'er', '30', 'la', 'zy', 'br', 'ow', 'n,', 'be', 'ar', 's.']
'''
re.findall(r'\d\d\d', text)
# ['120']
re.findall(r'\D\D\D\D\D', text)
'''
['A flo', 'ck of', ' quic', 'k bro', 'wn fo', 'xes j', 'umped', ' over', ' lazy', ' brow', 'n, be']
'''
re.findall(r'\AA', text)
# ['A']
re.findall(r'bears.\Z', text)
# ['bears.']
# Without a raw string, '\b' is a backspace character rather than a word boundary, so nothing matches
re.findall('\b[foxes]', text)
# []
re.findall('\n', text)
# []
re.findall('[Ff]oxes', 'foxes Foxes Doxes')
# ['foxes', 'Foxes']
re.findall('[a-z]', 'foxes Foxes')
# ['f', 'o', 'x', 'e', 's', 'o', 'x', 'e', 's']
re.findall('[A-Z]', 'foxes Foxes')
# ['F']
re.findall('[a-zA-Z0-9]', 'foxes Foxes')
# ['f', 'o', 'x', 'e', 's', 'F', 'o', 'x', 'e', 's']
re.findall('[^aeiou]', 'foxes Foxes')
# ['f', 'x', 's', ' ', 'F', 'x', 's']
re.findall('[^0-9]', 'foxes Foxes')
# ['f', 'o', 'x', 'e', 's', ' ', 'F', 'o', 'x', 'e', 's']
re.findall('foxes?', 'foxes Foxes')
# ['foxes']
re.findall('ox*', 'foxes Foxes')
# ['ox', 'ox']
re.findall('ox+', 'foxes Foxes')
# ['ox', 'ox']
re.findall(r'\d{3}', text)
# ['120']
re.findall(r'\d{2,}', text)
# ['120', '30']
re.findall(r'\d{2,3}', text)
# ['120', '30']
re.findall('^A', text)
# ['A']
re.findall('bears.$', text)
# ['bears.']
re.findall(r'\AA', text)
# ['A']
re.findall(r'bears.\Z', text)
# ['bears.']
re.findall('bears(?=.)', text)
# ['bears']
re.findall('foxes(?!!)', 'foxes foxes!')
# ['foxes']
re.findall('foxes|foxes!', 'foxes foxes!')
# ['foxes', 'foxes']
re.findall('fox(es!)', 'foxes foxes!')
# ['es!']
re.findall('foxes(!)', 'foxes foxes!')
# ['!']

Reindex Series and DataFrames

# Import modules
import pandas as pd
import numpy as np
# Create a Series of brush fire risk in southern Arizona
brushFireRisk = pd.Series([34, 23, 12, 23], index=['Bisbee', 'Douglas', 'Sierra Vista', 'Tombstone'])
brushFireRisk
'''
Bisbee          34
Douglas         23
Sierra Vista    12
Tombstone       23
dtype: int64
'''
# Reindex the Series and create a new Series variable
brushFireRiskReindexed = brushFireRisk.reindex(['Tombstone', 'Douglas', 'Bisbee', 'Sierra Vista', 'Barley', 'Tucson'])
brushFireRiskReindexed
'''
Tombstone       23.0
Douglas         23.0
Bisbee          34.0
Sierra Vista    12.0
Barley           NaN
Tucson           NaN
dtype: float64
'''
# Reindex the Series and fill in 0 at any missing index
brushFireRiskReindexed = brushFireRisk.reindex(['Tombstone', 'Douglas', 'Bisbee', 'Sierra Vista', 'Barley', 'Tucson'], fill_value=0)
brushFireRiskReindexed
'''
Tombstone       23
Douglas         23
Bisbee          34
Sierra Vista    12
Barley           0
Tucson           0
dtype: int64
'''
# Create a DataFrame
data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [, , , , ],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)
df

# Change the order of the rows (the index)
df.reindex([4, 3, 2, 1, 0])

# Change the order of the columns
columnsTitles = ['year', 'reports', 'county']
df.reindex(columns=columnsTitles)

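Rename column headers

From rgalbo on StackOverflow.

# Import the required modules
import pandas as pd
# Create a dictionary whose values are lists
raw_data = {'0': ['first_name', 'Molly', 'Tina', 'Jake', 'Amy'],
            '1': ['last_name', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            '2': ['age', 52, 36, 24, 73],
            '3': ['preTestScore', 24, 31, 2, 3]}
# Create the DataFrame
df = pd.DataFrame(raw_data)
# View the DataFrame
df

# Create a new variable called header from the first row of the dataset
header = df.iloc[0]
'''
0      first_name
1       last_name
2             age
3    preTestScore
Name: 0, dtype: object
'''
# Replace the DataFrame with a new one that does not contain the first row
df = df[1:]
# Rename the DataFrame's columns using the header variable
df.rename(columns=header)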

Rename multiple DataFrame column names

# Import modules
import pandas as pd
# Set the maximum number of rows displayed
pd.set_option('display.max_rows', 1000)
# Set the maximum number of columns displayed
pd.set_option('display.max_columns', 50)
# Create an example DataFrame
data = {'Commander': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'Date': [', 02, 08', ', 02, 08', ', 02, 08', ', 02, 08', ', 02, 08'],
        'Score': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

# Rename all the columns at once
df.columns = ['Leader', 'Time', 'Score']
df

# Rename a single column with a mapping from old name to new name
df.rename(columns={'Leader': 'Commander'}, inplace=True)
df

Replace values

# Import modules
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [-999, -999, -999, 2, 1],
            'postTestScore': [2, 2, -999, 2, -999]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

# Replace every -999 with NaN
df.replace(-999, np.nan)

Save a DataFrame as a CSV

# Import modules
import pandas as pd
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

Save the DataFrame named df as a csv.

df.to_csv('example.csv')

Search a column for a value

# Import modules
import pandas as pd
raw_data = {'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 42, 36, 24, 73],
            'preTestScore': [4, 4, 31, 2, 3],
            'postTestScore': [25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

# Find where a condition holds in a column
# Show preTestScore where postTestScore is greater than 50; other rows become NaN
df['preTestScore'].where(df['postTestScore'] > 50)
'''
0     NaN
1     NaN
2    31.0
3     2.0
4     3.0
Name: preTestScore, dtype: float64
'''

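Select rows and columns that contain particular values

# Grab rows based on a column's values
value_list = ['Tina', 'Molly', 'Jason']
df[df.name.isin(value_list)]

# Grab rows where the column's value is not in the list
df[~df.name.isin(value_list)]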

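Select rows that have a particular value

import pandas as pd
# Create an example DataFrame
data = {'name': ['Jason', 'Molly'],
        'country': [['Syria', 'Lebanon'], ['Spain', 'Morocco']]}
df = pd.DataFrame(data)
df

# Keep the rows whose country list contains 'Syria'
df[df['country'].map(lambda country: 'Syria' in country)]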

Select rows using multiple filters

import pandas as pd
# Create an example DataFrame
data = {'name': ['A', 'B', 'C', 'D', 'E'],
        'score': [1,2,3,4,5]}
df = pd.DataFrame(data)
df

# Select the rows of the DataFrame where df.score is greater than 1 and less than 5
df[(df['score'] > 1) & (df['score'] < 5)]

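Select DataFrame rows based on conditions

# Import modules
import pandas as pd
import numpy as np
# Create the DataFrame
raw_data = {'first_name': ['Jason', 'Molly', np.nan, np.nan, np.nan],
            'nationality': ['USA', 'USA', 'France', 'UK', 'UK'],
            'age': [42, 52, 36, 24, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'nationality', 'age'])
df

# Method 1: using boolean variables
# Create a variable that is True if nationality is USA
american = df['nationality'] == "USA"
# Create a variable that is True if age is greater than 50
elderly = df['age'] > 50
# Select all cases where nationality is USA and age is greater than 50
df[american & elderly]

# Method 2: using variable attributes
# Select all cases where the first name is not missing and nationality is USA
df[df['first_name'].notnull() & (df['nationality'] == "USA")]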

Simple DataFrame examples

# Import modules
import pandas as pd
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

# Create a second DataFrame
raw_data_2 = {'first_name': ['Sarah', 'Gueniva', 'Know', 'Sara', 'Cat'],
              'last_name': ['Mornig', 'Jaker', 'Alom', 'Ormon', 'Koozer'],
              'age': [53, 26, 72, 73, 24],
              'preTestScore': [13, 52, 72, 26, 26],
              'postTestScore': [82, 52, 56, 234, 254]}
df_2 = pd.DataFrame(raw_data_2, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df_2

# Create a third DataFrame
raw_data_3 = {'first_name': ['Sarah', 'Gueniva', 'Know', 'Sara', 'Cat'],
              'last_name': ['Mornig', 'Jaker', 'Alom', 'Ormon', 'Koozer'],
              'postTestScore_2': [82, 52, 56, 234, 254]}
df_3 = pd.DataFrame(raw_data_3, columns=['first_name', 'last_name', 'postTestScore_2'])
df_3

Sort the rows of a DataFrame

# Import modules
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [, , , , ],
        'reports': [1, 2, 1, 2, 3],
        'coverage': [2, 2, 3, 3, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

# Sort the DataFrame's rows by reports, in descending order
df.sort_values(by='reports', ascending=False)

# Sort the DataFrame's rows by coverage and then by reports, in ascending order
df.sort_values(by=['coverage', 'reports'])

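Split a latitude/longitude coordinate variable into separate variables

import pandas as pd
import numpy as np
raw_data = {'geo': ['40.0024, -105.4102', '40.0068, -105.266', '39.9318, -105.2813', np.nan]}
df = pd.DataFrame(raw_data, columns=['geo'])
df

# Create two lists for the loop results to be placed in
lat = []
lon = []
# For each row in the variable
for row in df['geo']:
    # Try to
    try:
        # Split the row by comma, convert to float,
        # and append everything before the comma to lat
        lat.append(float(row.split(',')[0]))
        # Split the row by comma, convert to float,
        # and append everything after the comma to lon
        lon.append(float(row.split(',')[1]))
    # But if you get an error (a missing value has no split method)
    except (AttributeError, ValueError):
        # Append a missing value to lat
        lat.append(np.nan)
        # Append a missing value to lon
        lon.append(np.nan)
# Create two new columns from lat and lon
df['latitude'] = lat
df['longitude'] = lon
df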
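
The same split can be done without an explicit loop. The sketch below is an alternative to the loop above rather than the author's approach: it uses the pandas string accessor, `str.split(',', expand=True)`, which returns one column per piece and leaves the missing row as NaN. The name `parts` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'geo': ['40.0024, -105.4102', '40.0068, -105.266',
                           '39.9318, -105.2813', np.nan]})

# Split each string on the comma into two columns; the NaN row stays missing
parts = df['geo'].str.split(',', expand=True)

# Strip surrounding whitespace and convert both pieces to floats
df['latitude'] = parts[0].str.strip().astype(float)
df['longitude'] = parts[1].str.strip().astype(float)
print(df)
```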

Data pipelines

# Create some raw data
raw_data = [1,2,3,4,5,6,7,8,9,10]
# Define a generator that yields input+6
def add_6(numbers):
    for x in numbers:
        output = x+6
        yield output
# Define a generator that yields input-2
def subtract_2(numbers):
    for x in numbers:
        output = x-2
        yield output
# Define a generator that yields input*100
def multiply_by_100(numbers):
    for x in numbers:
        output = x*100
        yield output
# Step 1 of the pipeline
step1 = add_6(raw_data)
# Step 2 of the pipeline
step2 = subtract_2(step1)
# Step 3 of the pipeline
pipeline = multiply_by_100(step2)
# Result for the first element of the raw data
next(pipeline)
# 500
# Result for the second element of the raw data
next(pipeline)
# 600
# Process all the remaining data
for value in pipeline:
    print(value)
'''
700
800
900
1000
1100
1200
1300
1400
'''
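
When a pipeline has many stages, wiring them by hand gets repetitive. One possible way to chain them is with `functools.reduce`; `build_pipeline` below is a hypothetical helper written for this sketch, and the three stage functions are condensed versions of the generators above.

```python
from functools import reduce

def add_6(numbers):
    for x in numbers:
        yield x + 6

def subtract_2(numbers):
    for x in numbers:
        yield x - 2

def multiply_by_100(numbers):
    for x in numbers:
        yield x * 100

def build_pipeline(source, stages):
    """Feed `source` through each generator function in `stages`, in order."""
    return reduce(lambda stream, stage: stage(stream), stages, source)

raw_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pipeline = build_pipeline(raw_data, [add_6, subtract_2, multiply_by_100])

# Nothing is computed until the pipeline is consumed
results = list(pipeline)
print(results)
```

Because every stage is a generator, values still flow through one at a time and no intermediate list is built.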

String munging in DataFrames

# Import modules
import pandas as pd
import numpy as np
import re as re
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
            'email': ['jas203@gmail.com', 'momomolly@gmail.com', np.nan, 'battler@milner.com', 'Ames1234@yahoo.com'],
            'preTestScore': [4, 24, 31, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'email', 'preTestScore', 'postTestScore'])
df

# Which strings in the email column contain 'gmail'
df['email'].str.contains('gmail')
'''
0     True
1     True
2      NaN
3    False
4    False
Name: email, dtype: object
'''
# Find the user, domain, and top-level domain of each address
pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
df['email'].str.findall(pattern, flags=re.IGNORECASE)
'''
0       [(jas203, gmail, com)]
1    [(momomolly, gmail, com)]
2                          NaN
3     [(battler, milner, com)]
4     [(Ames1234, yahoo, com)]
Name: email, dtype: object
'''
matches = df['email'].str.match(pattern, flags=re.IGNORECASE)
matches
'''
/Users/chrisralbon/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
  if __name__ == '__main__':
0       (jas203, gmail, com)
1    (momomolly, gmail, com)
2                        NaN
3     (battler, milner, com)
4     (Ames1234, yahoo, com)
Name: email, dtype: object
'''
# Select the second element (the domain) of each match
matches.str[1]
'''
0     gmail
1     gmail
2       NaN
3    milner
4     yahoo
Name: email, dtype: object
'''
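
The FutureWarning in the output above says that `str.match` will return booleans in later pandas versions, so pulling capture groups out of it may not work on a current install. A sketch of one alternative is `str.extract`, which returns one column per capture group. The addresses below are reconstructed from the tuples shown in the output above, and the column names `user`, `domain`, and `tld` are chosen here for illustration.

```python
import re
import numpy as np
import pandas as pd

emails = pd.Series(['jas203@gmail.com', 'momomolly@gmail.com', np.nan,
                    'battler@milner.com', 'Ames1234@yahoo.com'])

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; rows without a match become NaN
parts = emails.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['user', 'domain', 'tld']
print(parts)
```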

Using list comprehensions with pandas

# Import modules
import pandas as pd
# Set the maximum number of rows displayed
pd.set_option('display.max_rows', 1000)
# Set the maximum number of columns displayed
pd.set_option('display.max_columns', 50)
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [, , , , ],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

The list comprehension written as a loop.

# Create a variable
next_year = []
# For each row in df.year
for row in df['year']:
    # Add 1 to the row and append it to next_year
    next_year.append(row + 1)
# Create df.next_year
df['next_year'] = next_year
# View the DataFrame
df

Written as a list comprehension.

# For each row in df.year, subtract 1 from the row
df['previous_year'] = [row-1 for row in df['year']]
df
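
For simple arithmetic like this, pandas can also apply the operation to the whole column at once. The sketch below compares the two forms; the year values are hypothetical placeholders chosen only so the example runs.

```python
import pandas as pd

# Hypothetical year values, for illustration only
df = pd.DataFrame({'year': [2012, 2012, 2013, 2014, 2014]})

# List comprehension: one Python-level operation per row
df['next_year_listcomp'] = [row + 1 for row in df['year']]

# Vectorized: the addition is applied to the entire column in one step
df['next_year_vectorized'] = df['year'] + 1
print(df)
```

Both columns hold the same values; the list comprehension is more flexible when the per-row logic is not a plain arithmetic expression.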

Visualize a DataFrame with Seaborn

import pandas as pd
%matplotlib inline
import random
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame()
df['x'] = random.sample(range(1, 100), 25)
df['y'] = random.sample(range(1, 100), 25)
df.head()

# Scatter plot
sns.lmplot(x='x', y='y', data=df, fit_reg=False)
# <seaborn.axisgrid.FacetGrid at 0x114563b00>

# Density plot
sns.kdeplot(df.y)
# <matplotlib.axes._subplots.AxesSubplot at 0x113ea2ef0>

# Two-dimensional density plot
sns.kdeplot(df.y, df.x)
# <matplotlib.axes._subplots.AxesSubplot at 0x113d7fef0>

# Distribution plot
sns.distplot(df.x)
# <matplotlib.axes._subplots.AxesSubplot at 0x114294160>

# Histogram
plt.hist(df.x, alpha=.3)
sns.rugplot(df.x);

# Box plot
sns.boxplot([df.y, df.x])
# <matplotlib.axes._subplots.AxesSubplot at 0x1142b8b38>

# Violin plot
sns.violinplot([df.y, df.x])
# <matplotlib.axes._subplots.AxesSubplot at 0x114444a58>

# Heatmap
sns.heatmap([df.y, df.x], annot=True, fmt="d")
# <matplotlib.axes._subplots.AxesSubplot at 0x114530c88>

# Cluster map
sns.clustermap(df)
# <seaborn.matrix.ClusterGrid at 0x116f313c8>

Pandas data structures

# Import modules
import pandas as pd

Series 101

A Series is a one-dimensional array (like a vector in R).

# Create a Series of the number of floodingReports
floodingReports = pd.Series([5, 6, 2, 9, 12])
floodingReports
'''
0     5
1     6
2     2
3     9
4    12
dtype: int64
'''

Note that the first column of numbers (0 to 4) is the index.

# Set the county names as the index of the floodingReports Series
floodingReports = pd.Series([5, 6, 2, 9, 12], index=['Cochise County', 'Pima County', 'Santa Cruz County', 'Maricopa County', 'Yuma County'])
floodingReports
'''
Cochise County        5
Pima County           6
Santa Cruz County     2
Maricopa County       9
Yuma County          12
dtype: int64
'''
# Select one value by its index label
floodingReports['Cochise County']
# 5
# Select the values greater than 6
floodingReports[floodingReports > 6]
'''
Maricopa County     9
Yuma County        12
dtype: int64
'''

Create a pandas Series from a dictionary.

Note: when you do this, the dictionary's keys become the Series index.

# Create a dictionary
fireReports_dict = {'Cochise County': 12, 'Pima County': 342, 'Santa Cruz County': 13, 'Maricopa County': 42, 'Yuma County': 52}
# Convert the dictionary into a pd.Series, then view it
fireReports = pd.Series(fireReports_dict); fireReports
'''
Cochise County        12
Pima County          342
Santa Cruz County     13
Maricopa County       42
Yuma County           52
dtype: int64
'''
# Replace the index labels, in the same order as the dictionary's keys
fireReports.index = ["Cochice", "Pima", "Santa Cruz", "Maricopa", "Yuma"]
fireReports
'''
Cochice        12
Pima          342
Santa Cruz     13
Maricopa       42
Yuma           52
dtype: int64
'''

DataFrame 101

A DataFrame is like a data frame in R.

# Create a DataFrame from a dictionary of equal-length lists or NumPy arrays
data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [, , , , ],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)
df

# Set the order of the columns using the columns argument
dfColumnOrdered = pd.DataFrame(data, columns=['county', 'year', 'reports'])
dfColumnOrdered

# Add a column
dfColumnOrdered['newsCoverage'] = pd.Series([42.3, 92.1, 12.2, 39.3, 30.2])
dfColumnOrdered

# Delete a column
del dfColumnOrdered['newsCoverage']
dfColumnOrdered

# Transpose the DataFrame
dfColumnOrdered.T

Pandas time series basics

# Import modules
from datetime import datetime
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as pyplot
data = {'date': ['-05-01 18:47:05.069722', '-05-01 18:47:05.119994', '-05-02 18:47:05.178768', '-05-02 18:47:05.230071', '-05-02 18:47:05.230071', '-05-02 18:47:05.280592', '-05-03 18:47:05.332662', '-05-03 18:47:05.385109', '-05-04 18:47:05.436523', '-05-04 18:47:05.486877'],
        'battle_deaths': [34, 25, 26, 15, 15, 14, 26, 25, 62, 41]}
df = pd.DataFrame(data, columns=['date', 'battle_deaths'])
print(df)
'''
                     date  battle_deaths
0  -05-01 18:47:05.069722             34
1  -05-01 18:47:05.119994             25
2  -05-02 18:47:05.178768             26
3  -05-02 18:47:05.230071             15
4  -05-02 18:47:05.230071             15
5  -05-02 18:47:05.280592             14
6  -05-03 18:47:05.332662             26
7  -05-03 18:47:05.385109             25
8  -05-04 18:47:05.436523             62
9  -05-04 18:47:05.486877             41
'''
# Convert the date column to datetimes
df['date'] = pd.to_datetime(df['date'])
# Use the dates as the index and drop the original column
df.index = df['date']
del df['date']
df

# View all observations in the year
# (.loc is used because selecting rows with a bare date string, df['...'], is no longer supported in recent pandas)
df.loc['']

# View all observations in May of the year
df.loc['-05']

# View all observations from .5.3 onward
df[datetime(, 5, 3):]

Observations between May 3rd and May 4th

# View all observations from .5.3 to .5.4
df['5/3/':'5/4/']

# Truncate the observations after .5.2
df.truncate(after='5/3/')

# Observations in .5
df.loc['5-']

# Count the number of observations per timestamp
df.groupby(level=0).count()

# Mean of battle_deaths per day
df.resample('D').mean()

# Total of battle_deaths per day
df.resample('D').sum()

# Plot the total number of deaths per day
df.resample('D').sum().plot()
# <matplotlib.axes._subplots.AxesSubplot at 0x11187a940>
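
Since the years are missing from the dates above, here is a small self-contained sketch of daily resampling and partial-date selection. The timestamps, including the year 2014, are hypothetical and chosen only so the code runs; the names `ts`, `daily_sum`, `daily_mean`, and `may_second` are illustrative.

```python
import pandas as pd

# Hypothetical timestamps; the year is an assumption made for this example
ts = pd.DataFrame(
    {'battle_deaths': [34, 25, 26, 15]},
    index=pd.to_datetime(['2014-05-01 18:47:05', '2014-05-01 19:00:00',
                          '2014-05-02 08:00:00', '2014-05-04 12:00:00']))

# Resampling to daily frequency creates one row per calendar day,
# including days with no observations
daily_sum = ts.resample('D').sum()
daily_mean = ts.resample('D').mean()

# Partial-string selection with .loc: every observation on one day
may_second = ts.loc['2014-05-02']

print(daily_sum)
print(daily_mean)
print(may_second)
```

The empty day (May 3rd) shows up as 0 in the daily sum and as NaN in the daily mean.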
