This article continues from Annotated Version 1. The previous article focused on how to create pandas objects; this one focuses on how to view the objects you have created.
【Viewing Data】
- See the top & bottom rows of the frame
>>> import numpy as np
>>> import pandas as pd
>>> long_series = pd.Series(np.random.randn(1000))
>>> long_series
0     0.526507
1    -0.085210
2     1.292113
3    -1.948114
4    -1.386582
5    -2.596821
6     0.268965
7    -0.635905
8    -1.839953
9    -1.240820
10    0.122215
...
The Series above holds 1000 values. To look at just the first or last few rows, use the head() and tail() methods. Both return 5 rows by default, and you can pass the number of rows you want:
>>> long_series.head()
0    0.526507
1   -0.085210
2    1.292113
3   -1.948114
4   -1.386582
dtype: float64
>>> long_series.tail(6)    # lst: take the last 6 values
994   -1.300574
995    0.659815
996   -0.340045
997    0.685664
998   -0.972145
999    0.410191
dtype: float64
- Display the index, columns, and underlying NumPy data
Getting these in pandas is simple: just access the corresponding attributes, as follows:
>>> dates = pd.date_range('20170101', periods=6)   # the dates index carried over from the previous article
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
>>> df.index      # get the row index
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df.columns    # get the column index
Index([u'A', u'B', u'C', u'D'], dtype='object')
>>> df.values     # get the underlying values
array([[ 0.90624543,  1.81592368,  0.12335647, -1.79857091],
       [-0.45964616,  0.52009988,  0.51113763,  0.1839755 ],
       [ 0.46332631, -0.97048662, -1.12078016, -0.61448135],
       [ 1.50546445, -1.74331294,  1.02090281, -1.04904748],
       [-0.70936561,  1.37802983,  1.87495471, -1.01754786],
       [ 1.11355431, -0.95196258, -1.2668023 , -0.58657136]])
- Quick basic statistics on the data
>>> df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.469930  0.008049  0.190462 -0.813707
std    0.886775  1.439019  1.222903  0.656284
min   -0.709366 -1.743313 -1.266802 -1.798571
25%   -0.228903 -0.965856 -0.809746 -1.041173
50%    0.684786 -0.215931  0.317247 -0.816015
75%    1.061727  1.163547  0.893462 -0.593549
max    1.505464  1.815924  1.874955  0.183975
Note that these statistics are computed column by column (i.e., each column is treated as one dimension).
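As a quick check of this column-wise behavior, here is a minimal sketch reusing the df above: df.mean() aggregates each column, while passing axis=1 aggregates each row instead (the numbers shown are derived from the values above):
>>> df.mean()           # one value per column (the default, axis=0)
A    0.469930
B    0.008049
C    0.190462
D   -0.813707
dtype: float64
>>> df.mean(axis=1)     # one value per row
2017-01-01    0.261739
2017-01-02    0.188892
2017-01-03   -0.560605
2017-01-04   -0.066498
2017-01-05    0.381518
2017-01-06   -0.422945
Freq: D, dtype: float64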
- Transposing the data (swapping rows and columns)
>>> df.T
   2017-01-01  2017-01-02  2017-01-03  2017-01-04  2017-01-05  2017-01-06
A    0.906245   -0.459646    0.463326    1.505464   -0.709366    1.113554
B    1.815924    0.520100   -0.970487   -1.743313    1.378030   -0.951963
C    0.123356    0.511138   -1.120780    1.020903    1.874955   -1.266802
D   -1.798571    0.183975   -0.614481   -1.049047   -1.017548   -0.586571
- Sorting by an axis
>>> df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2017-01-01 -1.798571  0.123356  1.815924  0.906245
2017-01-02  0.183975  0.511138  0.520100 -0.459646
2017-01-03 -0.614481 -1.120780 -0.970487  0.463326
2017-01-04 -1.049047  1.020903 -1.743313  1.505464
2017-01-05 -1.017548  1.874955  1.378030 -0.709366
2017-01-06 -0.586571 -1.266802 -0.951963  1.113554
- Sorting by values (lst: earlier versions used sort(columns=...); that method is deprecated, and the official API now uses sort_values)
>>> df.sort_values(by='B')
                   A         B         C         D
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-01  0.906245  1.815924  0.123356 -1.798571
【Selecting Data】
Note: while standard Python/NumPy expressions for selecting data are intuitive and work fine for interactive use, we recommend the optimized pandas access methods .at, .iat, .loc, .iloc and .ix for production code (.ix is deprecated in newer pandas versions); see the official pandas indexing documentation for details.
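As a quick illustration of what this note means in practice, here is a minimal sketch reusing the df and dates created in the previous section: standard chained indexing and the optimized .at accessor return the same scalar, but the accessor is the recommended, faster form:
>>> df['A'][dates[0]]      # standard chained indexing: readable, but slower and discouraged
0.90624542800545049
>>> df.at[dates[0], 'A']   # optimized scalar accessor: the recommended way
0.90624542800545049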
- Getting
Get a single column (this yields a Series, equivalent to df.A):
>>> df['A']
2017-01-01    0.906245
2017-01-02   -0.459646
2017-01-03    0.463326
2017-01-04    1.505464
2017-01-05   -0.709366
2017-01-06    1.113554
Freq: D, Name: A, dtype: float64
Get data by slicing rows with []:
>>> df[1:3]
                   A         B         C         D
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
Get data by label (here, the row for 2017-01-01):
>>> df.loc[dates[0]]
A    0.906245
B    1.815924
C    0.123356
D   -1.798571
Name: 2017-01-01 00:00:00, dtype: float64
Select on multiple axes by label:
>>> df.loc[:, ['A', 'C']]
                   A         C
2017-01-01  0.906245  0.123356
2017-01-02 -0.459646  0.511138
2017-01-03  0.463326 -1.120780
2017-01-04  1.505464  1.020903
2017-01-05 -0.709366  1.874955
2017-01-06  1.113554 -1.266802
Label slicing (both endpoints are included):
>>> df.loc['20170101':'20170103', ['A', 'B']]
                   A         B
2017-01-01  0.906245  1.815924
2017-01-02 -0.459646  0.520100
2017-01-03  0.463326 -0.970487
Reduce the dimensions of the returned object:
>>> df.loc['20170103', ['A', 'B']]
A    0.463326
B   -0.970487
Name: 2017-01-03 00:00:00, dtype: float64
Get a single scalar value:
>>> df.loc[dates[0], 'A']
0.90624542800545049
Fast access to a scalar (same result as above; .at only accepts single labels, which is what makes it faster than .loc):
>>> df.at[dates[0], 'A']
0.90624542800545049
Most of the selections above used .loc, which selects by label: looking back at the code, what we varied was mainly the first argument, i.e. the row label. The examples below use .iloc instead, which selects by integer position (row and column numbers).
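To see the difference side by side, here is a small sketch reusing df and dates from above; both calls select the same cell, one by label and one by integer position:
>>> df.loc[dates[0], 'A']   # .loc: row label dates[0] (2017-01-01), column label 'A'
0.90624542800545049
>>> df.iloc[0, 0]           # .iloc: row 0, column 0 -- the same cell, addressed by position
0.90624542800545049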
- Selection by position
Select a row via the position of the passed integer:
>>> df
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
>>> df.iloc[3]
A    1.505464
B   -1.743313
C    1.020903
D   -1.049047
Name: 2017-01-04 00:00:00, dtype: float64
Integer slicing:
>>> df.iloc[3:5, 0:2]    # note: like Python slicing, the start is included and the end is excluded
                   A         B
2017-01-04  1.505464 -1.743313
2017-01-05 -0.709366  1.378030
Select by lists of integer positions:
>>> df.iloc[[1, 2, 4], [0, 2]]
                   A         C
2017-01-02 -0.459646  0.511138
2017-01-03  0.463326 -1.120780
2017-01-05 -0.709366  1.874955
Slicing rows or columns explicitly:
>>> df.iloc[1:3, :]
                   A         B         C         D
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
>>> df.iloc[:, 1:3]
                   B         C
2017-01-01  1.815924  0.123356
2017-01-02  0.520100  0.511138
2017-01-03 -0.970487 -1.120780
2017-01-04 -1.743313  1.020903
2017-01-05  1.378030  1.874955
2017-01-06 -0.951963 -1.266802
Get a single value explicitly:
>>> df.iloc[1, 1]
0.52009988180243594
>>> df.iat[1, 1]
0.52009988180243594
- Boolean indexing (select data based on the result of a condition)
Use a single column's values to select data:
>>> df
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-02 -0.459646  0.520100  0.511138  0.183975
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-05 -0.709366  1.378030  1.874955 -1.017548
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
>>> df[df.A > 0]
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356 -1.798571
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481
2017-01-04  1.505464 -1.743313  1.020903 -1.049047
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571
Selecting values from a DataFrame where a boolean condition is met (entries that do not meet the condition become NaN):
>>> df[df > 0]
                   A         B         C         D
2017-01-01  0.906245  1.815924  0.123356       NaN
2017-01-02       NaN  0.520100  0.511138  0.183975
2017-01-03  0.463326       NaN       NaN       NaN
2017-01-04  1.505464       NaN  1.020903       NaN
2017-01-05       NaN  1.378030  1.874955       NaN
2017-01-06  1.113554       NaN       NaN       NaN
Filter data with isin():
>>> df2 = df.copy()
>>> df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
>>> df2
                   A         B         C         D      E
2017-01-01  0.906245  1.815924  0.123356 -1.798571    one
2017-01-02 -0.459646  0.520100  0.511138  0.183975    one
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481    two
2017-01-04  1.505464 -1.743313  1.020903 -1.049047  three
2017-01-05 -0.709366  1.378030  1.874955 -1.017548   four
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571  three
>>> df2[df2['E'].isin(['two', 'four'])]
                   A         B         C         D     E
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481   two
2017-01-05 -0.709366  1.378030  1.874955 -1.017548  four
lst: The official example here is a bit involved. Series.isin should return a boolean Series indicating, element by element, whether each value of the source object is contained in the passed values (DataFrame.isin behaves similarly). The official definition of isin is illustrated below:
>>> df3 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
>>> df3
   A  B
0  1  a
1  2  b
2  3  c
>>> df3.isin([1, 3])
       A      B
0   True  False
1  False  False
2   True  False
In the official example, however, a DataFrame is returned. The reason is that df2['E'].isin(['two', 'four']) first produces a boolean Series marking the rows that contain 'two' or 'four', and passing that Series back into df2[...] then returns only the rows where the result is True.
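To make those two steps explicit, here is a sketch reusing the df2 defined above; it first builds the boolean mask and then applies it:
>>> mask = df2['E'].isin(['two', 'four'])   # step 1: a boolean Series, one entry per row
>>> mask
2017-01-01    False
2017-01-02    False
2017-01-03     True
2017-01-04    False
2017-01-05     True
2017-01-06    False
Freq: D, Name: E, dtype: bool
>>> df2[mask]                               # step 2: keep only the rows where the mask is True
                   A         B         C         D     E
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481   two
2017-01-05 -0.709366  1.378030  1.874955 -1.017548  four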
- Setting data
Add a new column whose values are aligned by index:
>>> s3 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20170101', periods=6))
>>> s3
2017-01-01    1
2017-01-02    2
2017-01-03    3
2017-01-04    4
2017-01-05    5
2017-01-06    6
Freq: D, dtype: int64
>>> df['F'] = s3
>>> df
                   A         B         C         D   E  F
2017-01-01  0.906245  1.815924  0.123356 -1.798571 NaN  1
2017-01-02 -0.459646  0.520100  0.511138  0.183975 NaN  2
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481 NaN  3
2017-01-04  1.505464 -1.743313  1.020903 -1.049047 NaN  4
2017-01-05 -0.709366  1.378030  1.874955 -1.017548 NaN  5
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571 NaN  6
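Note that this kind of assignment aligns on the index: dates missing from the assigned Series simply become NaN in the new column. A minimal sketch, using a copy of df and a hypothetical column 'G' so the main example is left untouched:
>>> demo = df.copy()
>>> demo['G'] = pd.Series([10, 20, 30], index=pd.date_range('20170104', periods=3))
>>> demo['G']               # only the three matching dates receive values
2017-01-01     NaN
2017-01-02     NaN
2017-01-03     NaN
2017-01-04    10.0
2017-01-05    20.0
2017-01-06    30.0
Freq: D, Name: G, dtype: float64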
Set a value by label:
>>> df.at[dates[0], 'A'] = 1.5
>>> df.at[dates[0], 'A']
1.5
Set a value by position:
>>> df.iat[0, 1] = 2.5
>>> df.iat[0, 1]
2.5
>>> df
                   A         B         C         D   E  F
2017-01-01  1.500000  2.500000  0.123356 -1.798571 NaN  1
2017-01-02 -0.459646  0.520100  0.511138  0.183975 NaN  2
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481 NaN  3
2017-01-04  1.505464 -1.743313  1.020903 -1.049047 NaN  4
2017-01-05 -0.709366  1.378030  1.874955 -1.017548 NaN  5
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571 NaN  6
Set a column by assigning a NumPy array:
>>> df.loc[:, 'E'] = np.array([5] * len(df))
>>> df
                   A         B         C         D  E  F
2017-01-01  1.500000  2.500000  0.123356 -1.798571  5  1
2017-01-02 -0.459646  0.520100  0.511138  0.183975  5  2
2017-01-03  0.463326 -0.970487 -1.120780 -0.614481  5  3
2017-01-04  1.505464 -1.743313  1.020903 -1.049047  5  4
2017-01-05 -0.709366  1.378030  1.874955 -1.017548  5  5
2017-01-06  1.113554 -0.951963 -1.266802 -0.586571  5  6
Set values with a boolean (where-style) condition:
>>> df4 = df.copy()
>>> df4[df4 < 0] = 3.6
>>> df4
                   A        B         C         D  E  F
2017-01-01  1.500000  2.50000  0.123356  3.600000  5  1
2017-01-02  3.600000  0.52010  0.511138  0.183975  5  2
2017-01-03  0.463326  3.60000  3.600000  3.600000  5  3
2017-01-04  1.505464  3.60000  1.020903  3.600000  5  4
2017-01-05  3.600000  1.37803  1.874955  3.600000  5  5
2017-01-06  1.113554  3.60000  3.600000  3.600000  5  6
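For reference, the same operation can also be written in a single call with DataFrame.mask(), which replaces values wherever the condition is True. A sketch (df5 is a hypothetical name) that should produce the same result as df4 above:
>>> df5 = df.copy()
>>> df5 = df5.mask(df5 < 0, 3.6)   # wherever a value is negative, use 3.6 instead
Conversely, DataFrame.where() keeps the values where the condition is True and replaces the rest.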