Merging Datasets
Merging on Index
import pandas as pd
import numpy as np
- Join methods:
  - merging a column against an index
  - merging an index against an index
  - merging a MultiIndex against multiple columns
Two ways to produce an index
- Building a DataFrame generates an index automatically: pass the pandas DataFrame constructor a dict of column names to value lists, and an integer index is generated in order
left1 = pd.DataFrame({'key': ['a', 'b', 'a','a', 'b', 'c'], 'value': range(6)})
left1
|   | key | value |
|---|-----|-------|
| 0 | a   | 0     |
| 1 | b   | 1     |
| 2 | a   | 2     |
| 3 | a   | 3     |
| 4 | b   | 4     |
| 5 | c   | 5     |
- The index can also be specified explicitly via the index parameter when creating a DataFrame object
right1 = pd.DataFrame({'group_val':[3.5, 7]}, index=['a', 'b'])
right1
Join modes
- To merge two tables on a mix of columns and indexes, use the merge() function. Its first two arguments are the left table (a DataFrame) and the right table; left_on names the left table's column(s) to join on, and right_index=True says to join on the right table's index (a plain True/False, since the index needs no further naming). There are likewise right_on and left_index parameters; left_on and left_index cannot be used together (nor can right_on and right_index).
- Inner join: merge()'s default mode; keeps only keys present in both tables (the intersection)
- Outer join: keeps keys from either table (the union); where a key exists in one table but not the other, the missing table's columns are filled with NaN
Inner join
merge1 = pd.merge(left1, right1, left_on='key', right_index=True)
merge1
|   | key | value | group_val |
|---|-----|-------|-----------|
| 0 | a   | 0     | 3.5       |
| 2 | a   | 2     | 3.5       |
| 3 | a   | 3     | 3.5       |
| 1 | b   | 1     | 7.0       |
| 4 | b   | 4     | 7.0       |
Outer join
- Outer join: the union of the two tables; pass how='outer' to merge() to enable it
merge2 = pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
merge2
|   | key | value | group_val |
|---|-----|-------|-----------|
| 0 | a   | 0     | 3.5       |
| 2 | a   | 2     | 3.5       |
| 3 | a   | 3     | 3.5       |
| 1 | b   | 1     | 7.0       |
| 4 | b   | 4     | 7.0       |
| 5 | c   | 5     | NaN       |
Merging on a MultiIndex
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'key2': [2000, 2001, 2002, 2001, 2002],
'data': np.arange(5)})
lefth
|   | key1   | key2 | data |
|---|--------|------|------|
| 0 | Ohio   | 2000 | 0    |
| 1 | Ohio   | 2001 | 1    |
| 2 | Ohio   | 2002 | 2    |
| 3 | Nevada | 2001 | 3    |
| 4 | Nevada | 2002 | 4    |
righth = pd.DataFrame(np.arange(6*2).reshape(6, 2), index = [['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
[2001, 2000, 2000, 2000, 2001, 2002]],
columns=['event1', 'event2'])
righth
|        |      | event1 | event2 |
|--------|------|--------|--------|
| Nevada | 2001 | 0      | 1      |
|        | 2000 | 2      | 3      |
| Ohio   | 2000 | 4      | 5      |
|        | 2000 | 6      | 7      |
|        | 2001 | 8      | 9      |
|        | 2002 | 10     | 11     |
merge3 = pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
merge3
|   | key1   | key2 | data | event1 | event2 |
|---|--------|------|------|--------|--------|
| 0 | Ohio   | 2000 | 0    | 4      | 5      |
| 0 | Ohio   | 2000 | 0    | 6      | 7      |
| 1 | Ohio   | 2001 | 1    | 8      | 9      |
| 2 | Ohio   | 2002 | 2    | 10     | 11     |
| 3 | Nevada | 2001 | 3    | 0      | 1      |
- Note on the result: key combinations present in only one table are dropped by the default inner join, so Nevada/2002 (left table only) and Nevada/2000 (right table only) do not appear in the merged table
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
columns=['Ohio', 'Nevada'])
left2
|   | Ohio | Nevada |
|---|------|--------|
| a | 1.0  | 2.0    |
| c | 3.0  | 4.0    |
| e | 5.0  | 6.0    |
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]], index=['b', 'c', 'd', 'e'],
columns=['Missouri', 'Alabama'])
right2
|   | Missouri | Alabama |
|---|----------|---------|
| b | 7.0      | 8.0     |
| c | 9.0      | 10.0    |
| d | 11.0     | 12.0    |
| e | 13.0     | 14.0    |
merge4 = pd.merge(left2, right2, left_index=True, right_index=True)
merge4
|   | Ohio | Nevada | Missouri | Alabama |
|---|------|--------|----------|---------|
| c | 3.0  | 4.0    | 9.0      | 10.0    |
| e | 5.0  | 6.0    | 13.0     | 14.0    |
Joining with the join method
- Merging with join:
  - There is a primary table, the left one, so the default is a left join: all rows of the left table are kept, right-table rows are attached where the indexes match, and missing right-table values are filled with NaN
  - The join mode can likewise be changed via the how parameter
merge5 = left2.join(right2)
merge5
|   | Ohio | Nevada | Missouri | Alabama |
|---|------|--------|----------|---------|
| a | 1.0  | 2.0    | NaN      | NaN     |
| c | 3.0  | 4.0    | 9.0      | 10.0    |
| e | 5.0  | 6.0    | 13.0     | 14.0    |
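The bullet above notes that join also accepts a how parameter; a minimal sketch (my own example, not from the original notes) of the outer variant on the same two tables:

```python
import pandas as pd

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                      index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])

# how='outer' keeps every index label from both tables;
# 'b' and 'd' appear with NaN in the left table's columns
outer = left2.join(right2, how='outer')
print(outer)
```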
merge6 = lefth.join(righth, on=['key1', 'key2'])
merge6
|   | key1   | key2 | data | event1 | event2 |
|---|--------|------|------|--------|--------|
| 0 | Ohio   | 2000 | 0    | 4.0    | 5.0    |
| 0 | Ohio   | 2000 | 0    | 6.0    | 7.0    |
| 1 | Ohio   | 2001 | 1    | 8.0    | 9.0    |
| 2 | Ohio   | 2002 | 2    | 10.0   | 11.0   |
| 3 | Nevada | 2001 | 3    | 0.0    | 1.0    |
| 4 | Nevada | 2002 | 4    | NaN    | NaN    |
Concatenating along an axis
- Concatenation: stacking multiple similarly shaped tables into a single one (adding rows or columns)
arr = np.arange(12).reshape(3, 4)
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
- Use the concatenate() function to join two arrays horizontally or vertically
- The axis parameter controls the direction of the concatenation
arr1 = np.concatenate([arr, arr], axis=1)  # horizontal (column-wise) concatenation
arr1
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
arr2 = np.concatenate([arr, arr], axis=0)  # vertical (row-wise) concatenation
arr2
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
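The numpy examples above have a direct pandas analog, pd.concat, which additionally aligns on index labels; a small illustrative sketch (the data is my own, not from the original notes):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

# axis=0 (the default) stacks the values, like np.concatenate on arrays
stacked = pd.concat([s1, s2])

# axis=1 aligns on index labels and yields a DataFrame;
# labels missing from one input become NaN
side_by_side = pd.concat([s1, s2], axis=1)
print(side_by_side)
```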
Combining overlapping data
- Create two one-dimensional Series with the same index and partially overlapping data
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=['f', 'e', 'd', 'c', 'b', 'a'])
a
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
b = pd.Series(np.arange(len(a), dtype=np.float64), index=['f', 'e', 'd', 'c', 'b', 'a'])
b.iloc[-1] = np.nan  # iloc[-1] addresses the last element of the Series
b
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
- Combine the two datasets, taking a as the baseline: where a has a value, keep it; where a is NaN, fill the value in from b
c = np.where(pd.isnull(a), b, a)  # take b's value where a is null, otherwise a's
c
array([0. , 2.5, 2. , 3.5, 4.5, nan])
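pandas also ships a built-in for this exact patch-the-holes pattern, Series.combine_first, which unlike np.where preserves the index labels; a sketch reusing the same data:

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b.iloc[-1] = np.nan

# keep a's values; fill a's NaN slots from b (index labels are preserved)
c = a.combine_first(b)
print(c)
```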
Data Transformation
Removing duplicate rows
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})
data
|   | k1  | k2 |
|---|-----|----|
| 0 | one | 1  |
| 1 | one | 1  |
| 2 | one | 2  |
| 3 | two | 3  |
| 4 | two | 3  |
| 5 | two | 4  |
| 6 | two | 4  |
- The DataFrame method duplicated() checks whether each row duplicates an earlier one (True if so) and returns a boolean Series
dup = data.duplicated()
dup
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
- The drop_duplicates() method removes duplicate rows. The comparison can be restricted to one or more named columns (rows need not match in every column); by default the row with the smallest index is kept, and keep='last' keeps the row with the largest index instead
nodup = data.drop_duplicates(keep='last')
nodup
|   | k1  | k2 |
|---|-----|----|
| 1 | one | 1  |
| 2 | one | 2  |
| 4 | two | 3  |
| 6 | two | 4  |
data['v1'] = range(7)
data
|   | k1  | k2 | v1 |
|---|-----|----|----|
| 0 | one | 1  | 0  |
| 1 | one | 1  | 1  |
| 2 | one | 2  | 2  |
| 3 | two | 3  | 3  |
| 4 | two | 3  | 4  |
| 5 | two | 4  | 5  |
| 6 | two | 4  | 6  |
nodup2 = data.drop_duplicates(['k1', 'k2'], keep='last')
nodup2
|   | k1  | k2 | v1 |
|---|-----|----|----|
| 1 | one | 1  | 1  |
| 2 | one | 2  | 2  |
| 4 | two | 3  | 4  |
| 6 | two | 4  | 6  |
Replacing values
data = pd.Series([1., -999, 2., -1000, 3.])
data
0 1.0
1 -999.0
2 2.0
3 -1000.0
4 3.0
dtype: float64
data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 -1000.0
4 3.0
dtype: float64
- Several values can be replaced at once, e.g. replacing both -999 and -1000 with NaN
data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 3.0
dtype: float64
- Different values can also be mapped to different replacements, e.g. -999 to NaN and -1000 to 0
data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 0.0
4 3.0
dtype: float64
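The two-list form above can also be written as a dict, which keeps each value next to its replacement; a small sketch (my own variant):

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -1000., 3.])

# a dict states each sentinel and its replacement side by side
cleaned = data.replace({-999: np.nan, -1000: 0})
print(cleaned)
```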
Detecting outliers
- Create a DataFrame of normally distributed data; its describe() method returns summary statistics
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
|       | 0           | 1           | 2           | 3           |
|-------|-------------|-------------|-------------|-------------|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean  | 0.013166    | -0.000481   | -0.030189   | -0.009518   |
| std   | 0.976413    | 1.016713    | 0.997048    | 1.008174    |
| min   | -3.016676   | -3.162513   | -3.282324   | -3.535175   |
| 25%   | -0.649836   | -0.688303   | -0.703782   | -0.698088   |
| 50%   | 0.004212    | 0.011474    | 0.009120    | -0.001872   |
| 75%   | 0.695183    | 0.679623    | 0.639148    | 0.649286    |
| max   | 4.407439    | 4.139005    | 2.763371    | 3.721676    |
data[(np.abs(data) > 3).any(axis=1)]  # rows containing at least one value beyond ±3
|     | 0         | 1         | 2         | 3         |
|-----|-----------|-----------|-----------|-----------|
| 38  | -0.795664 | -0.330065 | -3.016717 | 2.120561  |
| 50  | -0.629000 | 0.860827  | -0.125433 | -3.535175 |
| 79  | -0.544495 | -1.288971 | -3.029938 | 0.279044  |
| 113 | -3.016676 | 0.869205  | -1.010988 | -0.573853 |
| 254 | 4.407439  | 1.526187  | -1.255663 | -0.069301 |
| 276 | -0.646509 | -2.390895 | 1.456119  | -3.337455 |
| 308 | 1.139149  | 4.139005  | -1.131373 | 0.575723  |
| 493 | 3.189728  | 0.442730  | -1.561426 | 0.402213  |
| 509 | 0.267529  | -3.162513 | 1.107551  | -1.028598 |
| 535 | -0.679389 | 3.022345  | -0.983833 | 0.965855  |
| 758 | -0.182049 | 0.552248  | 1.448889  | 3.721676  |
| 841 | 1.724263  | 1.075039  | -3.282324 | 1.537926  |
| 895 | -1.313186 | 3.063947  | 0.154505  | 1.813358  |
| 931 | 1.180028  | 3.300766  | -0.809880 | -1.084533 |
| 933 | -0.037565 | 0.175728  | -3.255756 | 1.189677  |
data[np.abs(data) > 3] = np.sign(data) * 3  # cap values beyond ±3 at ±3, preserving sign
data.describe()
|       | 0           | 1           | 2           | 3           |
|-------|-------------|-------------|-------------|-------------|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean  | 0.011585    | -0.001845   | -0.029604   | -0.009367   |
| std   | 0.970417    | 1.011010    | 0.995229    | 1.002956    |
| min   | -3.000000   | -3.000000   | -3.000000   | -3.000000   |
| 25%   | -0.649836   | -0.688303   | -0.703782   | -0.698088   |
| 50%   | 0.004212    | 0.011474    | 0.009120    | -0.001872   |
| 75%   | 0.695183    | 0.679623    | 0.639148    | 0.649286    |
| max   | 3.000000    | 3.000000    | 2.763371    | 3.000000    |
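The sign-based capping above can also be expressed with the clip method; a sketch (not from the original notes) on freshly generated, seeded data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator, so the run is reproducible
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# clip(-3, 3) caps every value outside [-3, 3] at the nearest bound,
# matching the effect of data[np.abs(data) > 3] = np.sign(data) * 3
capped = data.clip(-3, 3)
```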
Permutation and random sampling
df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))
df
|   | 0  | 1  | 2  | 3  |
|---|----|----|----|----|
| 0 | 0  | 1  | 2  | 3  |
| 1 | 4  | 5  | 6  | 7  |
| 2 | 8  | 9  | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |
| 4 | 16 | 17 | 18 | 19 |
- Generate a random permutation of the 5 row positions, to be used for shuffling the frame
sampler = np.random.permutation(5)
sampler
array([3, 2, 0, 4, 1])
- Reordering the rows by sampler shuffles the frame, since sampler is a random permutation, giving a random-sampling effect; the take() method reorders a frame's rows according to a given sequence
df.take(sampler)
|   | 0  | 1  | 2  | 3  |
|---|----|----|----|----|
| 3 | 12 | 13 | 14 | 15 |
| 2 | 8  | 9  | 10 | 11 |
| 0 | 0  | 1  | 2  | 3  |
| 4 | 16 | 17 | 18 | 19 |
| 1 | 4  | 5  | 6  | 7  |
df.take(np.random.permutation(len(df))[:3])  # shuffle, then keep the first 3 rows: sampling without replacement
|   | 0  | 1  | 2  | 3  |
|---|----|----|----|----|
| 0 | 0  | 1  | 2  | 3  |
| 3 | 12 | 13 | 14 | 15 |
| 4 | 16 | 17 | 18 | 19 |
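For the same row-sampling task, pandas also provides DataFrame.sample; a brief sketch (my own addition, random_state chosen arbitrarily):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

# draw 3 rows without replacement; random_state fixes the draw
subset = df.sample(n=3, random_state=42)
```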