avatar

Pandas

一、简介

封装了Numpy和Matplotlib,具有便捷的数据处理能力,读取文件方便,专门用于数据挖掘。

二、三大数据结构

2.1 DataFrame

2.1.1 结构

既有行索引,又有列索引的二维数组

1. 生成DataFrame

pd.DataFrame(ndarrary)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import numpy as np
import pandas as pd
#生成二维数组
stock_change = np.random.normal(0,1,(10,5))
#创建DataFrame
pd.DataFrame(stock_change)
0 1 2 3 4
0 -0.268469 1.730915 -1.009817 -0.724267 1.070360
1 -1.774932 -1.477198 1.293078 0.111610 0.422240
2 -0.339734 -0.327221 2.087621 -1.105055 0.372288
3 -1.511658 -0.192350 1.115869 -0.668625 0.223991
4 1.234608 -0.030666 -1.296268 0.676021 0.229242
5 -1.829124 -1.455717 1.119276 -0.493858 0.429053
6 1.849348 -0.155816 0.195674 -1.608004 -0.456923
7 -0.833554 -0.301686 0.189869 0.664212 -0.364294
8 -0.524520 0.328688 1.453038 -0.950180 1.066928
9 -0.014796 0.046543 0.194047 -0.753442 -1.235741

2. 添加行索引

pd.DataFrame(ndarrary,index= )

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#1.生成行索引
stock = ["股票{}".format(i) for i in range(10)]
['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9']

#2.添加行索引
pd.DataFrame(stock_change,index=stock)

0 1 2 3 4
股票0 -0.268469 1.730915 -1.009817 -0.724267 1.070360
股票1 -1.774932 -1.477198 1.293078 0.111610 0.422240
股票2 -0.339734 -0.327221 2.087621 -1.105055 0.372288
股票3 -1.511658 -0.192350 1.115869 -0.668625 0.223991
股票4 1.234608 -0.030666 -1.296268 0.676021 0.229242
股票5 -1.829124 -1.455717 1.119276 -0.493858 0.429053
股票6 1.849348 -0.155816 0.195674 -1.608004 -0.456923
股票7 -0.833554 -0.301686 0.189869 0.664212 -0.364294
股票8 -0.524520 0.328688 1.453038 -0.950180 1.066928
股票9 -0.014796 0.046543 0.194047 -0.753442 -1.235741

3. 添加列索引

pd.DataFrame(ndarrary,columns= )

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#生成列索引
date = pd.date_range(start="20210101",periods=5,freq="B")
DatetimeIndex(['2021-01-01', '2021-01-04', '2021-01-05', '2021-01-06',
'2021-01-07'],dtype='datetime64[ns]', freq='B')
#添加列索引
pd.DataFrame(stock_change,index=stock,columns=date)
2021-01-01 2021-01-04 2021-01-05 2021-01-06 2021-01-07
股票0 -0.268469 1.730915 -1.009817 -0.724267 1.070360
股票1 -1.774932 -1.477198 1.293078 0.111610 0.422240
股票2 -0.339734 -0.327221 2.087621 -1.105055 0.372288
股票3 -1.511658 -0.192350 1.115869 -0.668625 0.223991
股票4 1.234608 -0.030666 -1.296268 0.676021 0.229242
股票5 -1.829124 -1.455717 1.119276 -0.493858 0.429053
股票6 1.849348 -0.155816 0.195674 -1.608004 -0.456923
股票7 -0.833554 -0.301686 0.189869 0.664212 -0.364294
股票8 -0.524520 0.328688 1.453038 -0.950180 1.066928
股票9 -0.014796 0.046543 0.194047 -0.753442 -1.235741

2.1.2 属性

1. 常用属性

shape 矩阵形状

index 行索引列表

columns 列索引列表

values 显示ndarrary,不显示行列索引

T 行列转置

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
data.shape
(10, 5)

data.index
Index(['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9'], dtype='object')

data.columns
DatetimeIndex(['2021-01-01', '2021-01-04', '2021-01-05', '2021-01-06',
'2021-01-07'],
dtype='datetime64[ns]', freq='B')

data.values
array([[-0.26846949, 1.73091461, -1.00981674, -0.72426739, 1.0703596 ],
[-1.77493189, -1.4771981 , 1.29307836, 0.11161021, 0.42224003],
[-0.3397336 , -0.32722064, 2.08762105, -1.10505512, 0.37228785],
[-1.51165829, -0.19235048, 1.11586897, -0.66862519, 0.22399066],
[ 1.23460754, -0.03066557, -1.29626776, 0.67602135, 0.22924198],
[-1.8291236 , -1.45571725, 1.11927613, -0.49385831, 0.42905348],
[ 1.8493478 , -0.15581612, 0.19567402, -1.60800405, -0.45692313],
[-0.83355434, -0.30168627, 0.18986934, 0.66421236, -0.36429427],
[-0.52451968, 0.32868791, 1.45303844, -0.95018047, 1.06692751],
[-0.01479553, 0.04654277, 0.19404728, -0.75344158, -1.23574094]])

data.T

股票0 股票1 股票2 股票3 股票4 股票5 股票6 股票7 股票8 股票9
2021-01-01 -0.268469 -1.774932 -0.339734 -1.511658 1.234608 -1.829124 1.849348 -0.833554 -0.524520 -0.014796
2021-01-04 1.730915 -1.477198 -0.327221 -0.192350 -0.030666 -1.455717 -0.155816 -0.301686 0.328688 0.046543
2021-01-05 -1.009817 1.293078 2.087621 1.115869 -1.296268 1.119276 0.195674 0.189869 1.453038 0.194047
2021-01-06 -0.724267 0.111610 -1.105055 -0.668625 0.676021 -0.493858 -1.608004 0.664212 -0.950180 -0.753442
2021-01-07 1.070360 0.422240 0.372288 0.223991 0.229242 0.429053 -0.456923 -0.364294 1.066928 -1.235741

2. 方法

head( ) 显示开头几行数据,默认为5,可以指定

tail( ) 显示尾部几行数据,默认为5,可以指定

1
2
3
4
5
6
7
8
9
10
11
12
#显示开头三行
data.head(3)
2021-01-01 2021-01-04 2021-01-05 2021-01-06 2021-01-07
股票0 -0.268469 1.730915 -1.009817 -0.724267 1.070360
股票1 -1.774932 -1.477198 1.293078 0.111610 0.422240
股票2 -0.339734 -0.327221 2.087621 -1.105055 0.372288

#显示尾部两行
data.tail(2)
2021-01-01 2021-01-04 2021-01-05 2021-01-06 2021-01-07
股票8 -0.524520 0.328688 1.453038 -0.950180 1.066928
股票9 -0.014796 0.046543 0.194047 -0.753442 -1.235741

2.1.3 索引的设置

1. 修改行列索引

不能单个修改,只能全部重新生成

1
2
3
4
5
6
7
8
9
10
11
12
13
stock_code = ["股票_"+ str(i) for i in range(10)]
data.index = stock_code
2021-01-01 2021-01-04 2021-01-05 2021-01-06 2021-01-07
股票_0 -0.268469 1.730915 -1.009817 -0.724267 1.070360
股票_1 -1.774932 -1.477198 1.293078 0.111610 0.422240
股票_2 -0.339734 -0.327221 2.087621 -1.105055 0.372288
股票_3 -1.511658 -0.192350 1.115869 -0.668625 0.223991
股票_4 1.234608 -0.030666 -1.296268 0.676021 0.229242
股票_5 -1.829124 -1.455717 1.119276 -0.493858 0.429053
股票_6 1.849348 -0.155816 0.195674 -1.608004 -0.456923
股票_7 -0.833554 -0.301686 0.189869 0.664212 -0.364294
股票_8 -0.524520 0.328688 1.453038 -0.950180 1.066928
股票_9 -0.014796 0.046543 0.194047 -0.753442 -1.235741

2. 重设索引

rest_index(drop=False)

  • 设置新的下标索引
  • drop:默认为False,不删除原来的索引,如果为True,则会删除原来的索引值
1
2
3
4
5
6
7
8
data.reset_index()
index 2021-01-01 2021-01-04 2021-01-05 2021-01-06 2021-01-07
0 股票_0 -0.268469 1.730915 -1.009817 -0.724267 1.070360
1 股票_1 -1.774932 -1.477198 1.293078 0.111610 0.422240
2 股票_2 -0.339734 -0.327221 2.087621 -1.105055 0.372288
3 股票_3 -1.511658 -0.192350 1.115869 -0.668625 0.223991
4 股票_4 1.234608 -0.030666 -1.296268 0.676021 0.229242
5 股票_5 -1.829124 -1.455717 1.119276 -0.493858 0.429053

3.设置新索引

set_index(keys,drop=True)

  • keys:列索引名或者列索引名列表
  • drop:默认为True,当作新的索引,删除原来的列
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
df = pd.DataFrame({ 'month' : [1,4,7,10],'year':[2012,2014,2013,2014],'sale':[55,40,84,31]})
month year sale
0 1 2012 55
1 4 2014 40
2 7 2013 84
3 10 2014 31

# 将月份设置为新的索引
df.set_index("month")
year sale
month
1 2012 55
4 2014 40
7 2013 84
10 2014 31

df.set_index("month",False)
month year sale
month
1 1 2012 55
4 4 2014 40
7 7 2013 84
10 10 2014 31

#将year和month设置为索引
new_df = df.set_index(['year','month'])
#注:通过刚才的设置,这样 DataFrame就变成了一个具有 MultiIndex的 DataFrame。
new_df.index.names
MultiIndex([(2012, 1),
(2014, 4),
(2013, 7),
(2014, 10)],
names=['year', 'month'])

2.1.4 MultiIndex和Panel

1.Multilndex

用来储存三维数组

属性:

  • names:levels名称
  • levels:每个level的元组值
1
2
3
4
5
new_df.index.names
FrozenList(['year', 'month'])

new_df.index.levels
FrozenList([[2012, 2013, 2014], [1, 4, 7, 10]])

2.Panel

和Multilndex相同,但0.20.0弃用

2.2 Series

带索引的一维数组

2.2.1 创建

1.指定内容,索引为默认

1
2
3
4
5
6
pd.Series(np.arange(3,9,2))

0 3
1 5
2 7
dtype: int32

2.指定索引

1
2
3
4
5
6
pd.Series(np.arange(3,9,2),index=["a","b","c"])

a 3
b 5
c 7
dtype: int32

3.通过字典创建

1
2
3
4
5
6
7
pd.Series({'red':100,'bule':200,'green':500,'yellow':1000})

red 100
bule 200
green 500
yellow 1000
dtype: int64

2.2.2 属性

index 行索引列表

values 显示ndarrary,不显示行索引

三、基本数据操作

3.1 索引操作

3.1.1 直接使用索引操作

1
2
3
4
5
6
7
8
9
10
11
12
data.head()

open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
2018-02-23 22.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32
2018-02-22 22.25 22.76 22.28 22.02 36105.01 0.36 1.64 0.90
2018-02-14 21.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58

#必须先列后行
data["open"]["2018-02-06"]
22.8

3.1.2 按行列名进行索引

ndarrary.loc

1
2
3
4
data.loc["2018-02-06"]["open"]
22.8
data.loc["2018-02-06","open"]
22.8

3.1.3 按数组位置进行索引

ndarrary.iloc

1
2
data.iloc[1,0]
22.8

3.1.4 组合索引

ndarrary.ix (已经失效)

1
2
3
4
5
6
7
8
data.loc[data.index[:4],['open','close','high','low']]
data iloc[0:4, data.columns.get_indexer(['open','close','high','low'])]

open close high low
2018-02-27 23.53 24.16 25.88 23.53
2018-02-26 22.80 23.53 23.78 22.80
2018-02-23 22.88 22.82 23.37 22.71
2018-02-22 22.25 22.28 22.76 22.02

3.2 赋值操作

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
data.open = 100
open high close low volume price_change p_change turnover
2018-02-27 100 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 100 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
2018-02-23 100 23.37 22.82 22.71 52914.01 0.54 2.42 1.32
2018-02-22 100 22.76 22.28 22.02 36105.01 0.36 1.64 0.90
2018-02-14 100 21.99 21.92 21.48 23331.04 0.44 2.05 0.58

data.iloc[1,0] = 222
open high close low volume price_change p_change turnover
2018-02-27 100 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 222 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
2018-02-23 100 23.37 22.82 22.71 52914.01 0.54 2.42 1.32
2018-02-22 100 22.76 22.28 22.02 36105.01 0.36 1.64 0.90
2018-02-14 100 21.99 21.92 21.48 23331.04 0.44 2.05 0.58

3.3 排序

3.3.1 对内容进行排序

1.DataFrame

data.sort_values(key=,ascending=)

  • 单个键或者多个键,进行排序,默认升序
  • ascending=False:降序
  • ascending=True:升序
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#按high列数据,升序排列
data.sort_values(by="high")
open high close low volume price_change p_change turnover
2015-03-02 12.25 12.67 12.52 12.20 96291.73 0.32 2.62 3.30
2015-03-04 12.80 12.92 12.90 12.61 67075.44 0.20 1.57 2.30
2015-03-03 12.52 13.06 12.70 12.52 139071.61 0.18 1.44 4.76
2015-09-07 12.92 13.38 12.77 12.63 52490.04 0.37 2.98 1.80
2015-03-05 12.88 13.45 13.16 12.87 93180.39 0.26 2.02 3.19

#high数据相同情况下,按price_change大小排序
data.sort_values(["high","price_change"],ascending = False)
open high close low volume price_change p_change turnover
2015-06-10 34.10 36.35 33.85 32.23 269033.12 0.51 1.53 9.21
2015-06-12 34.69 35.98 35.21 34.01 159825.88 0.82 2.38 5.47
2017-10-31 32.62 35.22 34.44 32.20 361660.88 2.38 7.42 9.05
2015-06-15 34.99 34.99 31.69 31.69 199369.53 -3.52 -10.00 6.82
2015-06-11 33.17 34.98 34.39 32.51 173075.73 0.54 1.59 5.92

2. Series

1
2
3
4
5
6
7
sr.sort_values(ascending=False)
2015-06-09 3.03
2017-10-26 2.68
2015-05-21 2.57
2017-10-31 2.38
2017-06-22 2.36
Name: price_change, dtype: float64

3.3.2 对索引排序

1.DataFrame

data.sort_index( )

1
2
3
4
5
6
7
data.sort_index()
open high close low volume price_change p_change turnover
2015-03-02 12.25 12.67 12.52 12.20 96291.73 0.32 2.62 3.30
2015-03-03 12.52 13.06 12.70 12.52 139071.61 0.18 1.44 4.76
2015-03-04 12.80 12.92 12.90 12.61 67075.44 0.20 1.57 2.30
2015-03-05 12.88 13.45 13.16 12.87 93180.39 0.26 2.02 3.19
2015-03-06 13.17 14.48 14.28 13.13 179831.72 1.12 8.51 6.16

2.Series

sr.sort_index()

1
2
3
4
5
6
7
8
9
10
11
12
13
sr.sort_index()
2015-03-02 0.32
2015-03-03 0.18
2015-03-04 0.20
2015-03-05 0.26
2015-03-06 1.12
...
2018-02-14 0.44
2018-02-22 0.36
2018-02-23 0.54
2018-02-26 0.69
2018-02-27 0.63
Name: price_change, Length: 643, dtype: float64

四、DataFrame运算

4.1 算术运算

1.add( )

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
data
open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
2018-02-23 22.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32
2018-02-22 22.25 22.76 22.28 22.02 36105.01 0.36 1.64 0.90
2018-02-14 21.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58

#两种写法都可以
data.open + 3
data.open.add(3)

2018-02-27 26.53
2018-02-26 25.80
2018-02-23 25.88
2018-02-22 25.25
2018-02-14 24.49
Name: open, Length: 643, dtype: float64

data + 10
open high close low volume price_change p_change turnover
2018-02-27 33.53 35.88 34.16 33.53 95588.03 10.63 12.68 12.39
2018-02-26 32.80 33.78 33.53 32.80 60995.11 10.69 13.02 11.53
2018-02-23 32.88 33.37 32.82 32.71 52924.01 10.54 12.42 11.32
2018-02-22 32.25 32.76 32.28 32.02 36115.01 10.36 11.64 10.90
2018-02-14 31.49 31.99 31.92 31.48 23341.04 10.44 12.05 10.58

2.sub( )

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
data.sub(10)
open high close low volume price_change p_change turnover
2018-02-27 13.53 15.88 14.16 13.53 95568.03 -9.37 -7.32 -7.61
2018-02-26 12.80 13.78 13.53 12.80 60975.11 -9.31 -6.98 -8.47
2018-02-23 12.88 13.37 12.82 12.71 52904.01 -9.46 -7.58 -8.68
2018-02-22 12.25 12.76 12.28 12.02 36095.01 -9.64 -8.36 -9.10
2018-02-14 11.49 11.99 11.92 11.48 23321.04 -9.56 -7.95 -9.42

#close列数据减去open列数据
data["close"].sub(data.open)

2018-02-27 0.63
2018-02-26 0.73
2018-02-23 -0.06
2018-02-22 0.03
2018-02-14 0.43
Length: 643, dtype: float64

4.2 逻辑运算

4.2.1 逻辑运算符

<、>、 |、 &

1
2
3
4
5
6
7
8
data["p_change"] > 2

2018-02-27 True
2018-02-26 True
2018-02-23 True
2018-02-22 False
2018-02-14 True
Name: p_change, Length: 643, dtype: bool

布尔索引

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#显示p_change列数值大于2的数据
data[data.p_change>2]
open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
2018-02-23 22.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32
2018-02-14 21.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58
2018-02-12 20.70 21.40 21.19 20.63 32445.39 0.82 4.03 0.81

#显示p_change>2并且low>15的数据
data[(data.p_change>2) & (data.low>15)]
open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
2018-02-23 22.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32
2018-02-14 21.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58
2018-02-12 20.70 21.40 21.19 20.63 32445.39 0.82 4.03 0.81

4.2.2 逻辑运算函数

1.query( )

指令语句执行函数,直接指定列名即可

1
2
3
4
5
6
7
data.query("p_change>2 & low>15")
open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53
2018-02-23 22.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32
2018-02-14 21.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58
2018-02-12 20.70 21.40 21.19 20.63 32445.39 0.82 4.03 0.81

2.isin( )

筛选数值范围之内的数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#筛选turnover列数值介于4.19和2.39之间的数据
data.turnover.isin([4.19,2.39])
2018-02-27 True
2018-02-26 False
2018-02-23 False
2018-02-22 False
2018-02-14 False
Name: turnover, Length: 643, dtype: bool

data[data.turnover.isin([4.19,2.39])]
open high close low volume price_change p_change turnover
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39
2017-07-25 23.07 24.20 23.70 22.64 167489.48 0.67 2.91 4.19
2016-09-28 19.88 20.98 20.86 19.71 95580.75 0.98 4.93 2.39
2015-04-07 16.54 17.98 17.54 16.50 122471.85 0.88 5.28 4.19

4.3 统计运算

4.3.1 常用统计运算符

min、 max、 mean、 median、 var、 std

describe 一次性获得所有类型

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
data.describe()
open high close low volume price_change p_change turnover
count 643.000000 643.000000 643.000000 643.000000 643.000000 643.000000 643.000000 643.000000
mean 21.272706 21.900513 21.336267 20.771835 99905.519114 0.018802 0.190280 2.936190
std 3.930973 4.077578 3.942806 3.791968 73879.119354 0.898476 4.079698 2.079375
min 12.250000 12.670000 12.360000 12.200000 1158.120000 -3.520000 -10.030000 0.040000
25% 19.000000 19.500000 19.045000 18.525000 48533.210000 -0.390000 -1.850000 1.360000
50% 21.440000 21.970000 21.450000 20.980000 83175.930000 0.050000 0.260000 2.500000
75% 23.400000 24.065000 23.415000 22.850000 127580.055000 0.455000 2.305000 3.915000
max 34.990000 36.350000 35.210000 34.010000 501915.410000 3.030000 10.030000 12.560000

#按列获得最大值(默认)
data.max(0)
open 34.99
high 36.35
close 35.21
low 34.01
volume 501915.41
price_change 3.03
p_change 10.03
turnover 12.56

#按行获得最大值
data.max(1)
2018-02-27 95578.03
2018-02-26 60985.11
2018-02-23 52914.01
2018-02-22 36105.01
2018-02-14 23331.04

4.3.2 获取位置

idxmax( )、idxmin( )…

1
2
3
4
5
6
7
8
9
10
data.idxmax()
open 2015-06-15
high 2015-06-10
close 2015-06-12
low 2015-06-12
volume 2017-10-26
price_change 2015-06-09
p_change 2015-08-28
turnover 2017-10-26
dtype: object

4.3.3 累计统计函数

cumsum 计算前1/2/3/.n个数的和
cummax 计算前1/2/3.n个数的最大值
cummin 计算前1/2/3/.n个数的最小值
cumprod 计算前1/2/3/.n个数的积

1
2
3
4
5
6
7
8
9
10
11
12
13
data.p_change
2018-82-27 2.68
2018-92-26 3.82
2018-2-23 2.42
2018-02-22 1.64
2018-82-14 2.85

data.p_change.cumsum()
2018-92-27 2.68
2018-2-26 5.7
2018-92-23 8.12
2018-92-22 9.76
2018-92-14 11.81

4.4 自定义运算

apply(func,axis=0)

  • func:自定义函数
  • axis:0默认为列,1为行
1
2
3
4
5
6
7
8
9
10
11
data.apply(lambda x: x.max() - x.min())

open 22.74
high 23.68
close 22.85
low 21.81
volume 500757.29
price_change 6.55
p_change 20.06
turnover 12.52
dtype: float64

五、画图

5.1 Dataframe.plot( )

二维数组直接用plot( ),会自动生成统计图

1
2
#指定volume为x轴,turnover为y轴,类型为散点图
data.plot(x="volume",y="turnover",kind="scatter")

5.2 Series.plot( )

同Dataframe

六、文件读取与存储

6.1 CSV文件

1.读取

pd.read_csv(path)

  • usecols= 获取想要的列
  • names= 添加列索引
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
pd.read_csv("stock_day.csv",usecols=["high","low","open","close"])

open high close low
2018-02-27 23.53 25.88 24.16 23.53
2018-02-26 22.80 23.78 23.53 22.80
2018-02-23 22.88 23.37 22.82 22.71
2018-02-22 22.25 22.76 22.28 22.02
2018-02-14 21.49 21.99 21.92 21.48

#读取文件不含列索引,读取时添加列索引
data = pd.read_csv("stock_day2.csv",names=['open', 'high', 'close', 'low', 'volume', 'price_change', 'p_change','ma5', 'ma10', 'ma20', 'v_ma5', 'v_ma10', 'v_ma20', 'turnover'])
open high close low volume price_change p_change ma5
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 22.942
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 22.406
2018-02-23 22.88 23.37 22.82 22.71 52914.01 0.54 2.42 21.938
2018-02-22 22.25 22.76 22.28 22.02 36105.01 0.36 1.64 21.446
2018-02-14 21.49 21.99 21.92 21.48 23331.04 0.44 2.05 21.366

2.写入

DataFrame.to_csv(path)

  • columns=[ ] 需要写入文件的列
  • index=False 不保存行索引
  • header=False 不保存列索引
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
#保存open列的数据
data[:10].to_csv("mytest.csv",columns=["open"])
Unnamed: 0 open
0 2018-02-27 23.53
1 2018-02-26 22.80
2 2018-02-23 22.88
3 2018-02-22 22.25
4 2018-02-14 21.49
5 2018-02-13 21.40
6 2018-02-12 20.70
7 2018-02-09 21.20
8 2018-02-08 21.79
9 2018-02-07 22.69

#设置index为False
data[:10].to_csv("mytest.csv",columns=["open"],index=False)
open
0 23.53
1 22.80
2 22.88
3 22.25
4 21.49
5 21.40
6 20.70
7 21.20
8 21.79
9 22.69

6.2 HDF5文件

hdf5存储三维数据的文件

key1 dataframe1 二维数据

key2 dataframe2 二维数据

key相当于存进大文件的小文件的文件名,在读取时必须指定key

1.读取

pd.read_hdf(path,key=)

1
2
3
#读取h5文件中,key值为close的数组
pd.read_hdf("mytest.h5",key="close")
pd.read_hdf("mytest.h5",key="open")

2.写入

pd.to_hdf(path,key=)

1
2
3
4
#将day_close数组写入h5文件中,key指定为close
day_close.to_hdf("mytest.h5",key="close")
#将day_open数组写入h5文件中,key指定为open
day_open.to_hdf("mytest.h5",key="open")

6.3 JSON文件

1.读取

pd.read_json(path)

  • orient=”records”
  • lines=True 按行读取,默认为False
1
2
3
4
5
6
7
8
sa = pd.read_json("Sarcasm_Headlines_Dataset.json",orient="records",lines=True)

article_link headline is_sarcastic
0 https://www.huffingtonpost.com/entry/versace-b... former versace store 0
1 https://www.huffingtonpost.com/entry/roseanne-... the 'roseanne' revival 0
2 https://local.theonion.com/mom-starting-to-fea... mom starting to fear 1
3 https://politics.theonion.com/boehner-just-wan... boehner just wants w 1
4 https://www.huffingtonpost.com/entry/jk-rowlin... j.k. rowling wishes 0

2.写入

pd.to_json( path)

1
2
#将数据按行写入json文件中
sa.to_json("mytest.json",orient="records",lines=True)

评论