Pandas: Advanced Processing

1. Handling Missing Values

1.1 Ways to Handle Missing Values

1.1.1 General Strategy

1. Drop the samples (rows) that contain missing values (NaN)

2. Replace / impute them

1.1.2 Handling NaN

1. Check whether the data contains NaN

  • pd.isnull(df): returns True where a value is missing
  • pd.notnull(df): returns True where a value is not missing
# np.any returns True: the data contains missing values
np.any(movie.isnull())
True
# np.all returns False: not every value is non-null, i.e. missing values exist
np.all(movie.notnull())
False

movie.isnull().any()
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) True
Metascore True
dtype: bool

pd.notnull(movie).all()
Rank True
Title True
Genre True
Description True
Director True
Actors True
Year True
Runtime (Minutes) True
Rating True
Votes True
Revenue (Millions) False
Metascore False
dtype: bool

2. Dealing with NaN

Drop samples containing missing values

  • df.dropna()
    • axis=0: drop rows that contain NaN (the default)
    • inplace=True: drop the rows from the original DataFrame in place; inplace=False (the default): leave the original unchanged and return a new DataFrame with those rows removed (see the sketch below)
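A minimal sketch of dropna on a toy DataFrame (the frame below is an assumption for illustration, not the movie data used later):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
new_df = df.dropna()     # rows containing any NaN are removed; df itself is unchanged
df.dropna(inplace=True)  # now df itself is modified in place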

Replace missing values

  • df.fillna(value, inplace=)
    • value: the value to fill in
    • inplace=True: fill the missing values in the original DataFrame in place; inplace=False (the default): leave the original unchanged and return a new, filled DataFrame
movie["Revenue (Millions)"].fillna(movie["Revenue (Millions)"].mean(),inplace=True)
movie["Metascore"].fillna(movie["Metascore"].mean(),inplace=True)
pd.isnull(movie).any()
Rank False
Title False
Genre False
Description False
Director False
Actors False
Year False
Runtime (Minutes) False
Rating False
Votes False
Revenue (Millions) False
Metascore False
dtype: bool

1.1.3 Missing values not encoded as NaN but marked with a placeholder

1. Replace the "?" marker with np.nan

  • df.replace(to_replace="?", value=np.nan)

2. Handle the resulting np.nan

  • Same treatment as ordinary NaN values (dropna or fillna, as above); see the sketch below
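A minimal sketch of the marker-replacement workflow (the column name "age" and the toy values are assumptions for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": ["18", "?", "30"]})
df = df.replace(to_replace="?", value=np.nan)     # turn the "?" marker into NaN
df["age"] = df["age"].astype(float)               # the column can now be numeric
df["age"].fillna(df["age"].mean(), inplace=True)  # then impute as with any NaN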

2. Data Discretization

2.1 Definition and Purpose

Definition: discretizing a continuous attribute means partitioning its value range into a number of discrete intervals, then representing the attribute values that fall into each sub-interval with a distinct symbol or integer label.

Purpose: discretization simplifies the data structure by reducing the number of distinct values a continuous attribute can take. It is a common preprocessing tool in data mining.

2.2 Implementation

2.2.1 Binning

1. Automatic binning

  • sr = pd.qcut(data, q): the data and the number of bins; qcut splits on quantiles, so each bin holds roughly the same number of samples
data = pd.Series([165,174,160,180,159,163,192,184],index=['No1:165','No1:174','No1:160','No1:180','No1:159','No1:163','No1:192','No1:184'])
No1:165 165
No1:174 174
No1:160 160
No1:180 180
No1:159 159
No1:163 163
No1:192 192
No1:184 184
dtype: int64
# Bin into 3 equal-frequency groups
sr = pd.qcut(data,3)
# Count the samples in each bin
sr.value_counts()

(178.0, 192.0] 3
(158.999, 163.667] 3
(163.667, 178.0] 2
dtype: int64
# Convert to one-hot encoding
pd.get_dummies(sr,prefix="height")
height_(158.999, 163.667] height_(163.667, 178.0] height_(178.0, 192.0]
No1:165 0 1 0
No1:174 0 1 0
No1:160 1 0 0
No1:180 0 0 1
No1:159 1 0 0
No1:163 1 0 0
No1:192 0 0 1
No1:184 0 0 1

2. Custom binning

  • sr = pd.cut(data, bins): the data and a list of bin edges
# Custom bin edges
bins = [150,165,180,195]
sr = pd.cut(data,bins)
No1:165 (150, 165]
No1:174 (165, 180]
No1:160 (150, 165]
No1:180 (165, 180]
No1:159 (150, 165]
No1:163 (150, 165]
No1:192 (180, 195]
No1:184 (180, 195]
dtype: category
Categories (3, interval[int64]): [(150, 165] < (165, 180] < (180, 195]]
pd.get_dummies(sr,prefix="height")
height_(150, 165] height_(165, 180] height_(180, 195]
No1:165 1 0 0
No1:174 0 1 0
No1:160 1 0 0
No1:180 0 1 0
No1:159 1 0 0
No1:163 1 0 0
No1:192 0 0 1
No1:184 0 0 1

3. Convert the binned result to one-hot encoding

  • pd.get_dummies(sr, prefix=): prefix is prepended to each generated column name

3. Merging

3.1 Concatenation along an axis

pd.concat([data1, data2], axis=1)

  • axis=0: stack vertically (append rows)
  • axis=1: join horizontally (append columns side by side); see the sketch below
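A minimal sketch of both directions (the toy frames are assumptions for illustration):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3, 4]})
pd.concat([df1, df2], axis=1)  # side by side: columns a and b, 2 rows
pd.concat([df1, df1], axis=0)  # stacked: column a only, 4 rows (index repeats)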

3.2 Key-based joins

pd.merge(left, right, how="inner", on=[key columns])

  • how=: left (left join), right (right join), outer (full outer join), inner (inner join, the default)
left = pd.DataFrame({"key1":["K0","K0","K1","K2"],
                     "key2":["K0","K1","K0","K1"],
                     "A":["A0","A1","A2","A3"],
                     "B":["B0","B1","B2","B3"]})
right = pd.DataFrame({"key1":["K0","K1","K1","K2"],
                      "key2":["K0","K0","K0","K0"],
                      "C":["C0","C1","C2","C3"],
                      "D":["D0","D1","D2","D3"]})
left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
# Inner join: keep only key pairs present in both frames
pd.merge(left,right,how="inner",on=["key1","key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
# Left join: keep every row of left
pd.merge(left,right,how="left",on=["key1","key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
# Right join: keep every row of right
pd.merge(left,right,how="right",on=["key1","key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
# Outer join: keep rows from both sides
pd.merge(left,right,how="outer",on=["key1","key2"])
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3

4. Crosstabs and Pivot Tables

Used to find and explore the relationship between two variables.

4.1 Using crosstab

A crosstab counts, for one column, the number of occurrences within each group of another column (revealing the relationship between the two columns).

pd.crosstab(value1, value2)

date.weekday
Int64Index([1, 0, 4, 3, 2, 1, 0, 4, 3, 2,
...
4, 3, 2, 1, 0, 4, 3, 2, 1, 0],
dtype='int64', length=643)
stock["week"]= date.weekday
# Daily movement column: 1 if p_change > 0 (up), else 0
stock["pona"] = np.where(stock["p_change"]> 0,1,0)
# Crosstab of the weekday column against the up/down column
data = pd.crosstab(stock["week"],stock["pona"])
# Counts of down (0) / up (1) days per weekday
pona 0 1
week
0 63 62
1 55 76
2 61 71
3 63 65
4 59 68
# Convert counts to row-wise proportions (each row sums to 1)
data.div(data.sum(axis=1),axis=0)
pona 0 1
week
0 0.504000 0.496000
1 0.419847 0.580153
2 0.462121 0.537879
3 0.492188 0.507812
4 0.464567 0.535433

4.2 Pivot tables: pivot_table

DataFrame.pivot_table([ ], index=[ ])

This yields the proportions directly: pivot_table aggregates with the mean by default, and the mean of the 0/1 "pona" column is exactly the fraction of up days.

stock.pivot_table(["pona"],index=["week"])
pona
week
0 0.496000
1 0.580153
2 0.537879
3 0.507812
4 0.535433

5. Grouping and Aggregation

5.1 The DataFrame method

DataFrame.groupby(key, as_index=False)

key: the column(s) to group by; multiple keys are allowed

Note: a groupby object only shows results once an aggregation (max, mean, etc.) has been applied.

col = pd.DataFrame({'color':['white','red','green','red','green'],
                    'object':['pen','pencil','pencil','ashtray','pen'],
                    'price1':['5.56','4.20','1.30','0.56','2.75'],   # note: prices are strings here,
                    'price2':['4.75','4.12','1.60','0.75','3.15']})  # hence dtype: object below

color object price1 price2
0 white pen 5.56 4.75
1 red pencil 4.20 4.12
2 green pencil 1.30 1.60
3 red ashtray 0.56 0.75
4 green pen 2.75 3.15
# Group by color and take the max of price1
# Grouping via the DataFrame method
col.groupby(by="color")["price1"].max()
color
green 2.75
red 4.20
white 5.56
Name: price1, dtype: object

5.2 The Series method

Series.groupby(key, as_index=False)

# Grouping via the Series method
col["price1"].groupby(col["color"]).max()
color
green 2.75
red 4.20
white 5.56
Name: price1, dtype: object
