Reshaping and Pivot Tables

Original: http://pandas.pydata.org/pandas-docs/stable/reshaping.html

Translator: 飞龙 UsyiyiCN

Proofreader: (vacant)

Reshaping by pivoting DataFrame objects

Data is often stored in so-called "stacked" or "record" format in CSV files or databases:

In [1]: df
Out[1]: 
         date variable     value
0  2000-01-03        A  0.469112
1  2000-01-04        A -0.282863
2  2000-01-05        A -1.509059
3  2000-01-03        B -1.135632
4  2000-01-04        B  1.212112
5  2000-01-05        B -0.173215
6  2000-01-03        C  0.119209
7  2000-01-04        C -1.044236
8  2000-01-05        C -0.861849
9  2000-01-03        D -2.104569
10 2000-01-04        D -0.494929
11 2000-01-05        D  1.071804

For the curious, here is how the DataFrame above was created:

import numpy as np
import pandas as pd
import pandas.util.testing as tm; tm.N = 3

def unpivot(frame):
    N, K = frame.shape
    data = {'value': frame.values.ravel('F'),
            'variable': np.asarray(frame.columns).repeat(N),
            'date': np.tile(np.asarray(frame.index), K)}
    return pd.DataFrame(data, columns=['date', 'variable', 'value'])

df = unpivot(tm.makeTimeDataFrame())

To select out everything for variable A we could do:

In [2]: df[df['variable'] == 'A']
Out[2]: 
        date variable     value
0 2000-01-03        A  0.469112
1 2000-01-04        A -0.282863
2 2000-01-05        A -1.509059

But suppose we wish to do time series operations with the variables. A better representation would be one where the columns are the unique variables and an index of dates identifies individual observations. To reshape the data into this form, use the pivot function:

In [3]: df.pivot(index='date', columns='variable', values='value')
Out[3]: 
variable           A         B         C         D
date                                              
2000-01-03  0.469112 -1.135632  0.119209 -2.104569
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804
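
The same round trip can be sketched on a tiny frame. The values and dates below are hypothetical, chosen only to make the reshaping visible:

```python
import pandas as pd

# A minimal long-format ("stacked") frame, hypothetical values only.
long_df = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04", "2000-01-03", "2000-01-04"],
    "variable": ["A", "A", "B", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# pivot: dates become the index, variables become the columns.
wide = long_df.pivot(index="date", columns="variable", values="value")
```

Each (date, variable) pair in the long frame maps to exactly one cell of the wide frame; pivot raises an error if a pair is duplicated.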

If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting "pivoted" DataFrame will have hierarchical columns whose topmost level indicates the respective value column:

In [4]: df['value2'] = df['value'] * 2

In [5]: pivoted = df.pivot(index='date', columns='variable')

In [6]: pivoted
Out[6]: 
               value                                  value2            \
variable           A         B         C         D         A         B   
date                                                                     
2000-01-03  0.469112 -1.135632  0.119209 -2.104569  0.938225 -2.271265   
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929 -0.565727  2.424224   
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804 -3.018117 -0.346429   

                                
variable           C         D  
date                            
2000-01-03  0.238417 -4.209138  
2000-01-04 -2.088472 -0.989859  
2000-01-05 -1.723698  2.143608  

You can of course then select subsets from the pivoted DataFrame:

In [7]: pivoted['value2']
Out[7]: 
variable           A         B         C         D
date                                              
2000-01-03  0.938225 -2.271265  0.238417 -4.209138
2000-01-04 -0.565727  2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698  2.143608

Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.

Reshaping by stacking and unstacking

Closely related to the pivot function are the stack and unstack functions currently available on Series and DataFrame. These functions are designed to work together with MultiIndex objects (see the section on hierarchical indexing). Here is essentially what these functions do:

  • stack: "pivot" a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new innermost level of row labels.
  • unstack: inverse operation from stack: "pivot" a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new innermost level of column labels.

The clearest way to explain is by example. Let's take a prior example data set from the hierarchical indexing section:

In [8]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ...:                      'foo', 'foo', 'qux', 'qux'],
   ...:                     ['one', 'two', 'one', 'two',
   ...:                      'one', 'two', 'one', 'two']]))
   ...: 

In [9]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [10]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [11]: df2 = df[:4]

In [12]: df2
Out[12]: 
                     A         B
first second                    
bar   one     0.721555 -0.706771
      two    -1.039575  0.271860
baz   one    -0.424972  0.567020
      two     0.276232 -1.087401

The stack function "compresses" a level in the DataFrame's columns to produce either:

  • A Series, in the case of a simple column index
  • A DataFrame, in the case of a MultiIndex in the columns

If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns:

In [13]: stacked = df2.stack()

In [14]: stacked
Out[14]: 
first  second   
bar    one     A    0.721555
               B   -0.706771
       two     A   -1.039575
               B    0.271860
baz    one     A   -0.424972
               B    0.567020
       two     A    0.276232
               B   -1.087401
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:

In [15]: stacked.unstack()
Out[15]: 
                     A         B
first second                    
bar   one     0.721555 -0.706771
      two    -1.039575  0.271860
baz   one    -0.424972  0.567020
      two     0.276232 -1.087401

In [16]: stacked.unstack(1)
Out[16]: 
second        one       two
first                      
bar   A  0.721555 -1.039575
      B -0.706771  0.271860
baz   A -0.424972  0.276232
      B  0.567020 -1.087401

In [17]: stacked.unstack(0)
Out[17]: 
first          bar       baz
second                      
one    A  0.721555 -0.424972
       B -0.706771  0.567020
two    A -1.039575  0.276232
       B  0.271860 -1.087401

If the indexes have names, you can use the level names instead of specifying the level numbers:

In [18]: stacked.unstack('second')
Out[18]: 
second        one       two
first                      
bar   A  0.721555 -1.039575
      B -0.706771  0.271860
baz   A -0.424972  0.276232
      B  0.567020 -1.087401
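
The default round trip (stack the columns, then unstack the last level) can be sketched on a tiny two-row frame; the labels and values below are hypothetical:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two")], names=["first", "second"]
)
df2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=idx, columns=["A", "B"])

stacked = df2.stack()           # Series with a three-level MultiIndex
round_trip = stacked.unstack()  # unstacking the last level restores the frame
```

Each column label ("A", "B") moves into the innermost row level on stack, and moves back out on unstack.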

Notice that the stack and unstack methods implicitly sort the index levels involved. Hence a call to stack and then unstack, or vice versa, will result in a sorted copy of the original DataFrame or Series:

In [19]: index = pd.MultiIndex.from_product([[2,1], ['a', 'b']])

In [20]: df = pd.DataFrame(np.random.randn(4), index=index, columns=['A'])

In [21]: df
Out[21]: 
            A
2 a -0.370647
  b -1.157892
1 a -1.344312
  b  0.844885

In [22]: all(df.unstack().stack() == df.sort_index())
Out[22]: True

while the above code will raise a TypeError if the call to sort_index is removed.
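
The implicit sorting can be sketched directly; with the deliberately unsorted outer level below (hypothetical values), the round trip comes back in sorted order:

```python
import pandas as pd

# Outer level intentionally out of order: 2 before 1.
index = pd.MultiIndex.from_product([[2, 1], ["a", "b"]])
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=index)

# unstack followed by stack returns a sorted copy, not the original order.
round_trip = df.unstack().stack()
```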

Multiple Levels

You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is as if each level in the list were processed individually.

In [23]: columns = pd.MultiIndex.from_tuples([
   ....:         ('A', 'cat', 'long'), ('B', 'cat', 'long'),
   ....:         ('A', 'dog', 'short'), ('B', 'dog', 'short')
   ....:     ],
   ....:     names=['exp', 'animal', 'hair_length']
   ....: )
   ....: 

In [24]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)

In [25]: df
Out[25]: 
exp                 A         B         A         B
animal            cat       cat       dog       dog
hair_length      long      long     short     short
0            1.075770 -0.109050  1.643563 -1.469388
1            0.357021 -0.674600 -1.776904 -0.968914
2           -1.294524  0.413738  0.276662 -0.472035
3           -0.013960 -0.362543 -0.006154 -0.923061

In [26]: df.stack(level=['animal', 'hair_length'])
Out[26]: 
exp                          A         B
  animal hair_length                    
0 cat    long         1.075770 -0.109050
  dog    short        1.643563 -1.469388
1 cat    long         0.357021 -0.674600
  dog    short       -1.776904 -0.968914
2 cat    long        -1.294524  0.413738
  dog    short        0.276662 -0.472035
3 cat    long        -0.013960 -0.362543
  dog    short       -0.006154 -0.923061

The list of levels can contain either level names or level numbers (but not a mixture of the two).

# df.stack(level=['animal', 'hair_length'])
# from above is equivalent to:
In [27]: df.stack(level=[1, 2])
Out[27]: 
exp                          A         B
  animal hair_length                    
0 cat    long         1.075770 -0.109050
  dog    short        1.643563 -1.469388
1 cat    long         0.357021 -0.674600
  dog    short       -1.776904 -0.968914
2 cat    long        -1.294524  0.413738
  dog    short        0.276662 -0.472035
3 cat    long        -0.013960 -0.362543
  dog    short       -0.006154 -0.923061
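
Stacking a list of column levels can be sketched on a one-row frame (hypothetical labels and values), where both levels end up in the row index in the order given:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("A", "cat"), ("A", "dog"), ("B", "cat"), ("B", "dog")],
    names=["exp", "animal"],
)
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]], columns=columns)

# Passing a list stacks both levels, as if each were stacked in turn.
stacked = df.stack(["exp", "animal"])
```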

Missing Data

These functions are intelligent about handling missing data and do not expect each subgroup within the hierarchical index to have the same set of labels. They also can handle the index being unsorted (but you can make it sorted by calling sort_index, of course). Here is a more complex example:

In [28]: columns = pd.MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
   ....:                                      ('B', 'cat'), ('A', 'dog')],
   ....:                                     names=['exp', 'animal'])
   ....: 

In [29]: index = pd.MultiIndex.from_product([('bar', 'baz', 'foo', 'qux'),
   ....:                                     ('one', 'two')],
   ....:                                    names=['first', 'second'])
   ....: 

In [30]: df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)

In [31]: df2 = df.iloc[[0, 1, 2, 4, 5, 7]]

In [32]: df2
Out[32]: 
exp                  A         B                   A
animal             cat       dog       cat       dog
first second                                        
bar   one     0.895717  0.805244 -1.206412  2.565646
      two     1.431256  1.340309 -1.170299 -0.226169
baz   one     0.410835  0.813850  0.132003 -0.827317
foo   one    -1.413681  1.607920  1.024180  0.569605
      two     0.875906 -2.211372  0.974466 -2.006747
qux   two    -1.226825  0.769804 -1.281247 -0.727707

As mentioned above, stack can be called with a level argument to select which level in the columns to stack:

In [33]: df2.stack('exp')
Out[33]: 
animal                 cat       dog
first second exp                    
bar   one    A    0.895717  2.565646
             B   -1.206412  0.805244
      two    A    1.431256 -0.226169
             B   -1.170299  1.340309
baz   one    A    0.410835 -0.827317
             B    0.132003  0.813850
foo   one    A   -1.413681  0.569605
             B    1.024180  1.607920
      two    A    0.875906 -2.006747
             B    0.974466 -2.211372
qux   two    A   -1.226825 -0.727707
             B   -1.281247  0.769804

In [34]: df2.stack('animal')
Out[34]: 
exp                         A         B
first second animal                    
bar   one    cat     0.895717 -1.206412
             dog     2.565646  0.805244
      two    cat     1.431256 -1.170299
             dog    -0.226169  1.340309
baz   one    cat     0.410835  0.132003
             dog    -0.827317  0.813850
foo   one    cat    -1.413681  1.024180
             dog     0.569605  1.607920
      two    cat     0.875906  0.974466
             dog    -2.006747 -2.211372
qux   two    cat    -1.226825 -1.281247
             dog    -0.727707  0.769804

Unstacking can result in missing values if subgroups do not have the same set of labels. By default, missing values will be replaced with the default fill value for that data type, NaN for float, NaT for datetimelike, etc. For integer types, by default data will be converted to float and missing values will be set to NaN.

In [35]: df3 = df.iloc[[0, 1, 4, 7], [1, 2]]

In [36]: df3
Out[36]: 
exp                  B          
animal             dog       cat
first second                    
bar   one     0.805244 -1.206412
      two     1.340309 -1.170299
foo   one     1.607920  1.024180
qux   two     0.769804 -1.281247

In [37]: df3.unstack()
Out[37]: 
exp            B                              
animal       dog                 cat          
second       one       two       one       two
first                                         
bar     0.805244  1.340309 -1.206412 -1.170299
foo     1.607920       NaN  1.024180       NaN
qux          NaN  0.769804       NaN -1.281247

Alternatively, unstack takes an optional fill_value argument, for specifying the value of missing data.

In [38]: df3.unstack(fill_value=-1e9)
Out[38]: 
exp                B                                          
animal           dog                         cat              
second           one           two           one           two
first                                                         
bar     8.052440e-01  1.340309e+00 -1.206412e+00 -1.170299e+00
foo     1.607920e+00 -1.000000e+09  1.024180e+00 -1.000000e+09
qux    -1.000000e+09  7.698036e-01 -1.000000e+09 -1.281247e+00
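
Both behaviors can be sketched on a three-row Series (hypothetical labels and values) whose subgroups are uneven, so one cell is missing after unstacking:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("foo", "one")],
    names=["first", "second"],
)
s = pd.Series([1.0, 2.0, 3.0], index=idx)

with_nan = s.unstack()              # the missing ('foo', 'two') cell becomes NaN
filled = s.unstack(fill_value=0.0)  # or takes a chosen fill value instead
```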

With a MultiIndex

Unstacking when the columns are a MultiIndex is also careful about doing the right thing:

In [39]: df[:3].unstack(0)
Out[39]: 
exp            A                   B                                      A  \
animal       cat                 dog                cat                 dog   
first        bar       baz       bar      baz       bar       baz       bar   
second                                                                        
one     0.895717  0.410835  0.805244  0.81385 -1.206412  0.132003  2.565646   
two     1.431256       NaN  1.340309      NaN -1.170299       NaN -0.226169   

exp               
animal            
first        baz  
second            
one    -0.827317  
two          NaN  

In [40]: df2.unstack(1)
Out[40]: 
exp            A                   B                                       A  \
animal       cat                 dog                 cat                 dog   
second       one       two       one       two       one       two       one   
first                                                                          
bar     0.895717  1.431256  0.805244  1.340309 -1.206412 -1.170299  2.565646   
baz     0.410835       NaN  0.813850       NaN  0.132003       NaN -0.827317   
foo    -1.413681  0.875906  1.607920 -2.211372  1.024180  0.974466  0.569605   
qux          NaN -1.226825       NaN  0.769804       NaN -1.281247       NaN   

exp               
animal            
second       two  
first             
bar    -0.226169  
baz          NaN  
foo    -2.006747  
qux    -0.727707  

Reshaping by Melt

The melt() function is useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are "unpivoted" to the row axis, leaving just two non-identifier columns, "variable" and "value". The names of those columns can be customized by supplying the var_name and value_name parameters.

For example,

In [41]: cheese = pd.DataFrame({'first' : ['John', 'Mary'],
   ....:                        'last' : ['Doe', 'Bo'],
   ....:                        'height' : [5.5, 6.0],
   ....:                        'weight' : [130, 150]})
   ....: 

In [42]: cheese
Out[42]: 
  first  height last  weight
0  John     5.5  Doe     130
1  Mary     6.0   Bo     150

In [43]: pd.melt(cheese, id_vars=['first', 'last'])
Out[43]: 
  first last variable  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

In [44]: pd.melt(cheese, id_vars=['first', 'last'], var_name='quantity')
Out[44]: 
  first last quantity  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0
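
A self-contained melt sketch, with both output column names customized (the data and names below are hypothetical):

```python
import pandas as pd

wide = pd.DataFrame({
    "name": ["John", "Mary"],  # identifier column
    "height": [5.5, 6.0],      # measured variables to unpivot
    "weight": [130, 150],
})

tidy = pd.melt(wide, id_vars=["name"], var_name="measure", value_name="reading")
```

Each measured column contributes one row per identifier, so two identifiers and two measures yield four rows.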

Another way to transform is to use the wide_to_long panel data convenience function.

In [45]: dft = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
   ....:                     "A1980" : {0 : "d", 1 : "e", 2 : "f"},
   ....:                     "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
   ....:                     "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
   ....:                     "X"     : dict(zip(range(3), np.random.randn(3)))
   ....:                    })
   ....: 

In [46]: dft["id"] = dft.index

In [47]: dft
Out[47]: 
  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2 -0.121306   0
1     b     e    1.2    1.3 -0.097883   1
2     c     f    0.7    0.1  0.695775   2

In [48]: pd.wide_to_long(dft, ["A", "B"], i="id", j="year")
Out[48]: 
                X  A    B
id year                  
0  1970 -0.121306  a  2.5
1  1970 -0.097883  b  1.2
2  1970  0.695775  c  0.7
0  1980 -0.121306  d  3.2
1  1980 -0.097883  e  1.3
2  1980  0.695775  f  0.1
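
Stripped to its essentials (hypothetical data), wide_to_long splits each column name into a stub and a suffix, and the suffix becomes a new index level:

```python
import pandas as pd

wide = pd.DataFrame({
    "A1970": ["a", "b"],
    "A1980": ["d", "e"],
    "id": [0, 1],
})

# 'A' is the stub; the trailing digits become the new 'year' index level.
tall = pd.wide_to_long(wide, stubnames=["A"], i="id", j="year")
```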

Combining with stats and GroupBy

Combining pivot / stack / unstack with GroupBy and the basic Series and DataFrame statistical functions can produce some very expressive and fast data manipulations.

In [49]: df
Out[49]: 
exp                  A         B                   A
animal             cat       dog       cat       dog
first second                                        
bar   one     0.895717  0.805244 -1.206412  2.565646
      two     1.431256  1.340309 -1.170299 -0.226169
baz   one     0.410835  0.813850  0.132003 -0.827317
      two    -0.076467 -1.187678  1.130127 -1.436737
foo   one    -1.413681  1.607920  1.024180  0.569605
      two     0.875906 -2.211372  0.974466 -2.006747
qux   one    -0.410001 -0.078638  0.545952 -1.219217
      two    -1.226825  0.769804 -1.281247 -0.727707

In [50]: df.stack().mean(1).unstack()
Out[50]: 
animal             cat       dog
first second                    
bar   one    -0.155347  1.685445
      two     0.130479  0.557070
baz   one     0.271419 -0.006733
      two     0.526830 -1.312207
foo   one    -0.194750  1.088763
      two     0.925186 -2.109060
qux   one     0.067976 -0.648927
      two    -1.254036  0.021048

# same result, another way
In [51]: df.groupby(level=1, axis=1).mean()
Out[51]: 
animal             cat       dog
first second                    
bar   one    -0.155347  1.685445
      two     0.130479  0.557070
baz   one     0.271419 -0.006733
      two     0.526830 -1.312207
foo   one    -0.194750  1.088763
      two     0.925186 -2.109060
qux   one     0.067976 -0.648927
      two    -1.254036  0.021048

In [52]: df.stack().groupby(level=1).mean()
Out[52]: 
exp            A         B
second                    
one     0.071448  0.455513
two    -0.424186 -0.204486

In [53]: df.mean().unstack(0)
Out[53]: 
exp            A         B
animal                    
cat     0.060843  0.018596
dog    -0.413580  0.232430
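
The stack-then-aggregate pattern can be sketched on a small frame with hypothetical values. Note this uses stack plus an index-level groupby rather than the axis=1 groupby shown in the transcript, since the result is the same:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("A", "cat"), ("A", "dog"), ("B", "cat"), ("B", "dog")],
    names=["exp", "animal"],
)
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0]], columns=columns)

# Stack 'animal' into the rows, then average over it per original row.
means = df.stack("animal").groupby(level=0).mean()
```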

Pivot tables

The function pandas.pivot_table can be used to create spreadsheet-style pivot tables. See the cookbook for some advanced strategies.

It takes a number of arguments: data (a DataFrame), values (the column or columns to aggregate), index and columns (the keys to group by), and aggfunc (the aggregation function to apply).

Consider a data set like this:

In [54]: import datetime

In [55]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 6,
   ....:                    'B': ['A', 'B', 'C'] * 8,
   ....:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4,
   ....:                    'D': np.random.randn(24),
   ....:                    'E': np.random.randn(24),
   ....:                    'F': [datetime.datetime(2013, i, 1) for i in range(1, 13)] +
   ....:                         [datetime.datetime(2013, i, 15) for i in range(1, 13)]})
   ....: 

In [56]: df
Out[56]: 
        A  B    C         D         E          F
0     one  A  foo  0.341734 -0.317441 2013-01-01
1     one  B  foo  0.959726 -1.236269 2013-02-01
2     two  C  foo -1.110336  0.896171 2013-03-01
3   three  A  bar -0.619976 -0.487602 2013-04-01
4     one  B  bar  0.149748 -0.082240 2013-05-01
5     one  C  bar -0.732339 -2.182937 2013-06-01
6     two  A  foo  0.687738  0.380396 2013-07-01
..    ... ..  ...       ...       ...        ...
17    one  C  bar -0.345352  0.206053 2013-06-15
18    two  A  foo  1.314232 -0.251905 2013-07-15
19  three  B  foo  0.690579 -2.213588 2013-08-15
20    one  C  foo  0.995761  1.063327 2013-09-15
21    one  A  bar  2.396780  1.266143 2013-10-15
22    two  B  bar  0.014871  0.299368 2013-11-15
23  three  C  bar  3.357427 -0.863838 2013-12-15

[24 rows x 6 columns]

We can produce pivot tables from this data very easily:

In [57]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[57]: 
C             bar       foo
A     B                    
one   A  1.120915 -0.514058
      B -0.338421  0.002759
      C -0.538846  0.699535
three A -1.181568       NaN
      B       NaN  0.433512
      C  0.588783       NaN
two   A       NaN  1.000985
      B  0.158248       NaN
      C       NaN  0.176180

In [58]: pd.pivot_table(df, values='D', index=['B'], columns=['A', 'C'], aggfunc=np.sum)
Out[58]: 
A       one               three                 two          
C       bar       foo       bar       foo       bar       foo
B                                                            
A  2.241830 -1.028115 -2.363137       NaN       NaN  2.001971
B -0.676843  0.005518       NaN  0.867024  0.316495       NaN
C -1.077692  1.399070  1.177566       NaN       NaN  0.352360

In [59]: pd.pivot_table(df, values=['D','E'], index=['B'], columns=['A', 'C'], aggfunc=np.sum)
Out[59]: 
          D                                                           E  \
A       one               three                 two                 one   
C       bar       foo       bar       foo       bar       foo       bar   
B                                                                         
A  2.241830 -1.028115 -2.363137       NaN       NaN  2.001971  2.786113   
B -0.676843  0.005518       NaN  0.867024  0.316495       NaN  1.368280   
C -1.077692  1.399070  1.177566       NaN       NaN  0.352360 -1.976883   

                                                     
A               three                 two            
C       foo       bar       foo       bar       foo  
B                                                    
A -0.043211  1.922577       NaN       NaN  0.128491  
B -1.103384       NaN -2.128743 -0.194294       NaN  
C  1.495717 -0.263660       NaN       NaN  0.872482  
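
A minimal pivot_table sketch on hypothetical data, grouping one key into the rows and one into the columns:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["bar", "foo", "bar", "foo"],
    "D": [1.0, 2.0, 3.0, 4.0],
})

# Mean of D for every (A, C) combination.
table = pd.pivot_table(df, values="D", index="A", columns="C", aggfunc="mean")
```

Unlike pivot, pivot_table tolerates duplicate (index, columns) pairs: it aggregates them with aggfunc.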

The result object is a DataFrame having potentially hierarchical indexes on the rows and columns. If the values column name is not given, the pivot table will include all of the data that can be aggregated in an additional level of hierarchy in the columns:

In [60]: pd.pivot_table(df, index=['A', 'B'], columns=['C'])
Out[60]: 
                D                   E          
C             bar       foo       bar       foo
A     B                                        
one   A  1.120915 -0.514058  1.393057 -0.021605
      B -0.338421  0.002759  0.684140 -0.551692
      C -0.538846  0.699535 -0.988442  0.747859
three A -1.181568       NaN  0.961289       NaN
      B       NaN  0.433512       NaN -1.064372
      C  0.588783       NaN -0.131830       NaN
two   A       NaN  1.000985       NaN  0.064245
      B  0.158248       NaN -0.097147       NaN
      C       NaN  0.176180       NaN  0.436241

Also, you can use Grouper for the index and columns keywords. For detail of Grouper, see Grouping with a Grouper specification.

In [61]: pd.pivot_table(df, values='D', index=pd.Grouper(freq='M', key='F'), columns='C')
Out[61]: 
C                bar       foo
F                             
2013-01-31       NaN -0.514058
2013-02-28       NaN  0.002759
2013-03-31       NaN  0.176180
2013-04-30 -1.181568       NaN
2013-05-31 -0.338421       NaN
2013-06-30 -0.538846       NaN
2013-07-31       NaN  1.000985
2013-08-31       NaN  0.433512
2013-09-30       NaN  0.699535
2013-10-31  1.120915       NaN
2013-11-30  0.158248       NaN
2013-12-31  0.588783       NaN

You can render a nice output of the table omitting the missing values by calling to_string if you wish:

In [62]: table = pd.pivot_table(df, index=['A', 'B'], columns=['C'])

In [63]: print(table.to_string(na_rep=''))
                D                   E          
C             bar       foo       bar       foo
A     B                                        
one   A  1.120915 -0.514058  1.393057 -0.021605
      B -0.338421  0.002759  0.684140 -0.551692
      C -0.538846  0.699535 -0.988442  0.747859
three A -1.181568            0.961289          
      B            0.433512           -1.064372
      C  0.588783           -0.131830          
two   A            1.000985            0.064245
      B  0.158248           -0.097147          
      C            0.176180            0.436241

Note that pivot_table is also available as an instance method on DataFrame.

Adding margins

If you pass margins=True to pivot_table, special All columns and rows will be added with partial group aggregates across the categories on the rows and columns:

In [64]: df.pivot_table(index=['A', 'B'], columns='C', margins=True, aggfunc=np.std)
Out[64]: 
                D                             E                    
C             bar       foo       All       bar       foo       All
A     B                                                            
one   A  1.804346  1.210272  1.569879  0.179483  0.418374  0.858005
      B  0.690376  1.353355  0.898998  1.083825  0.968138  1.101401
      C  0.273641  0.418926  0.771139  1.689271  0.446140  1.422136
three A  0.794212       NaN  0.794212  2.049040       NaN  2.049040
      B       NaN  0.363548  0.363548       NaN  1.625237  1.625237
      C  3.915454       NaN  3.915454  1.035215       NaN  1.035215
two   A       NaN  0.442998  0.442998       NaN  0.447104  0.447104
      B  0.202765       NaN  0.202765  0.560757       NaN  0.560757
      C       NaN  1.819408  1.819408       NaN  0.650439  0.650439
All      1.556686  0.952552  1.246608  1.250924  0.899904  1.059389
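
With a sum aggregation the margins are easy to verify by hand; a sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "C": ["bar", "foo", "bar", "foo"],
    "D": [1.0, 2.0, 3.0, 4.0],
})

# The 'All' row/column hold the partial sums; the corner is the grand total.
table = pd.pivot_table(df, values="D", index="A", columns="C",
                       aggfunc="sum", margins=True)
```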

Cross tabulations

Use the crosstab function to compute a cross-tabulation of two (or more) factors. By default crosstab computes a frequency table of the factors unless an array of values and an aggregation function are passed.

It takes a number of arguments.

Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are specified.

For example:

In [65]: foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two'

In [66]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)

In [67]: b = np.array([one, one, two, one, two, one], dtype=object)

In [68]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)

In [69]: pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
Out[69]: 
b    one        two      
c   dull shiny dull shiny
a                        
bar    1     0    0     1
foo    2     1    1     0

If crosstab receives only two Series, it will provide a frequency table.

In [70]: df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4],
   ....:                    'C': [1, 1, np.nan, 1, 1]})
   ....: 

In [71]: df
Out[71]: 
   A  B    C
0  1  3  1.0
1  2  3  1.0
2  2  4  NaN
3  2  4  1.0
4  2  4  1.0

In [72]: pd.crosstab(df.A, df.B)
Out[72]: 
B  3  4
A      
1  1  0
2  1  3
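
The name-attribute behavior mentioned above can be sketched with two named Series (hypothetical data); the names become the axis labels of the frequency table:

```python
import pandas as pd

a = pd.Series(["x", "x", "y"], name="a")
b = pd.Series(["u", "v", "u"], name="b")

counts = pd.crosstab(a, b)  # the Series names label the row and column axes
```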

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

In [73]: foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])

In [74]: bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

In [75]: pd.crosstab(foo, bar)
Out[75]: 
col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0

Normalization

New in version 0.18.1.

Frequency tables can also be normalized to show percentages rather than counts using the normalize argument:

In [76]: pd.crosstab(df.A, df.B, normalize=True)
Out[76]: 
B    3    4
A          
1  0.2  0.0
2  0.2  0.6

normalize can also normalize values within each row or within each column:

In [77]: pd.crosstab(df.A, df.B, normalize='columns')
Out[77]: 
B    3    4
A          
1  0.5  0.0
2  0.5  1.0
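
Row-wise normalization works the same way with normalize='index'; a sketch on the same shape of data, where every row of the result sums to 1:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 2], "B": [3, 3, 4, 4]})

# Each row of the frequency table is divided by its row total.
by_row = pd.crosstab(df.A, df.B, normalize="index")
```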

crosstab can also be passed a third Series and an aggregation function (aggfunc) that will be applied to the values of the third Series within each group defined by the first two Series:

In [78]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum)
Out[78]: 
B    3    4
A          
1  1.0  NaN
2  1.0  2.0

Adding Margins

Finally, one can also add margins or normalize this output.

In [79]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum, normalize=True,
   ....:             margins=True)
   ....: 
Out[79]: 
B       3    4   All
A                   
1    0.25  0.0  0.25
2    0.25  0.5  0.75
All  0.50  0.5  1.00

Tiling

The cut function computes groupings for the values of the input array and is often used to transform continuous variables to discrete or categorical variables:

In [80]: ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])

In [81]: pd.cut(ages, bins=3)
Out[81]: 
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60], (43.333, 60]]
Categories (3, object): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60]]

If the bins keyword is an integer, then equal-width bins are formed. Alternatively we can specify custom bin-edges:

In [82]: pd.cut(ages, bins=[0, 18, 35, 70])
Out[82]: 
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
Categories (3, object): [(0, 18] < (18, 35] < (35, 70]]
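
Custom edges can also be paired with the labels argument to name the resulting categories (the label names below are hypothetical):

```python
import numpy as np
import pandas as pd

ages = np.array([10, 20, 40])

# Custom edges plus labels give named categories instead of interval labels.
groups = pd.cut(ages, bins=[0, 18, 35, 70], labels=["child", "adult", "senior"])
```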

Computing indicator / dummy variables

To convert a categorical variable into a "dummy" or "indicator" DataFrame, for example a column in a DataFrame (a Series) which has k distinct values, you can derive a DataFrame containing k columns of 1s and 0s:

In [83]: df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})

In [84]: pd.get_dummies(df['key'])
Out[84]: 
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

Sometimes it's useful to prefix the column names, for example when merging the result with the original DataFrame:

In [85]: dummies = pd.get_dummies(df['key'], prefix='key')

In [86]: dummies
Out[86]: 
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0

In [87]: df[['data1']].join(dummies)
Out[87]: 
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
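
The prefix-and-join pattern is self-contained in a few lines (hypothetical data; note that recent pandas versions return boolean rather than integer dummy columns):

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "b"], "data1": [0, 1, 2]})

# Prefixed indicator columns joined back onto the non-encoded data.
dummies = pd.get_dummies(df["key"], prefix="key")
combined = df[["data1"]].join(dummies)
```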

This function is often used along with discretization functions like cut:

In [88]: values = np.random.randn(10)

In [89]: values
Out[89]: 
array([ 0.4082, -1.0481, -0.0257, -0.9884,  0.0941,  1.2627,  1.29  ,
        0.0824, -0.0558,  0.5366])

In [90]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [91]: pd.get_dummies(pd.cut(values, bins))
Out[91]: 
   (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
0         0           0           1           0         0
1         0           0           0           0         0
2         0           0           0           0         0
3         0           0           0           0         0
4         1           0           0           0         0
5         0           0           0           0         0
6         0           0           0           0         0
7         1           0           0           0         0
8         0           0           0           0         0
9         0           0           1           0         0

See also Series.str.get_dummies.

New in version 0.15.0.

get_dummies() also accepts a DataFrame. By default all categorical variables (categorical in the statistical sense, those with object or categorical dtype) are encoded as dummy variables.

In [92]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ....:                    'C': [1, 2, 3]})
   ....: 

In [93]: pd.get_dummies(df)
Out[93]: 
   C  A_a  A_b  B_b  B_c
0  1    1    0    0    1
1  2    0    1    0    1
2  3    1    0    1    0

All non-object columns are included untouched in the output.

You can control the columns that are encoded with the columns keyword.

In [94]: pd.get_dummies(df, columns=['A'])
Out[94]: 
   B  C  A_a  A_b
0  c  1    1    0
1  c  2    0    1
2  b  3    1    0

Notice that the B column is still included in the output, it just hasn't been encoded. You can drop B before calling get_dummies if you don't want to include it in the output.

As with the Series version, you can pass values for prefix and prefix_sep. By default the column name is used as the prefix, and '_' as the prefix separator. You can specify prefix and prefix_sep in 3 ways:

In [95]: simple = pd.get_dummies(df, prefix='new_prefix')

In [96]: simple
Out[96]: 
   C  new_prefix_a  new_prefix_b  new_prefix_b  new_prefix_c
0  1             1             0             0             1
1  2             0             1             0             1
2  3             1             0             1             0

In [97]: from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])

In [98]: from_list
Out[98]: 
   C  from_A_a  from_A_b  from_B_b  from_B_c
0  1         1         0         0         1
1  2         0         1         0         1
2  3         1         0         1         0

In [99]: from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})

In [100]: from_dict
Out[100]: 
   C  from_A_a  from_A_b  from_B_b  from_B_c
0  1         1         0         0         1
1  2         0         1         0         1
2  3         1         0         1         0

New in version 0.18.0.

Sometimes it will be useful to only keep k-1 levels of a categorical variable to avoid collinearity when feeding the result to statistical models. You can switch to this mode by turning on drop_first.

In [101]: s = pd.Series(list('abcaa'))

In [102]: pd.get_dummies(s)
Out[102]: 
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  1  0  0

In [103]: pd.get_dummies(s, drop_first=True)
Out[103]: 
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0

When a column contains only one level, it will be omitted in the result.

In [104]: df = pd.DataFrame({'A':list('aaaaa'),'B':list('ababc')})

In [105]: pd.get_dummies(df)
Out[105]: 
   A_a  B_a  B_b  B_c
0    1    1    0    0
1    1    0    1    0
2    1    1    0    0
3    1    0    1    0
4    1    0    0    1

In [106]: pd.get_dummies(df, drop_first=True)
Out[106]: 
   B_b  B_c
0    0    0
1    1    0
2    0    0
3    1    0
4    0    1
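
The k versus k-1 encoding can be compared side by side on a hypothetical Series; with drop_first the first level becomes the all-zeros baseline:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"])

full = pd.get_dummies(s)                      # k columns, one per level
reduced = pd.get_dummies(s, drop_first=True)  # k-1 columns; 'a' is the baseline
```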

Factorizing values

To encode 1-d values as an enumerated type use factorize:

In [107]: x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])

In [108]: x
Out[108]: 
0       A
1       A
2     NaN
3       B
4    3.14
5     inf
dtype: object

In [109]: labels, uniques = pd.factorize(x)

In [110]: labels
Out[110]: array([ 0,  0, -1,  1,  2,  3])

In [111]: uniques
Out[111]: Index([u'A', u'B', 3.14, inf], dtype='object')
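
On clean string data (hypothetical values) the encoding is easy to read off: codes index into uniques, and uniques appear in order of first occurrence when sort is not requested:

```python
import pandas as pd

# codes[i] is the position of the i-th value within uniques.
codes, uniques = pd.factorize(pd.Series(["b", "b", "a", "c", "b"]))
```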

Note that factorize is similar to numpy.unique, but differs in its handling of NaN:

Note

The following numpy.unique will fail under Python 3 with a TypeError because of an ordering bug. See also here.

In [112]: pd.factorize(x, sort=True)
Out[112]: 
(array([ 2,  2, -1,  3,  0,  1]),
 Index([3.14, inf, u'A', u'B'], dtype='object'))

In [113]: np.unique(x, return_inverse=True)[::-1]
Out[113]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))

Note

If you just want to handle one column as a categorical variable (like R's factor), you can use df["cat_col"] = pd.Categorical(df["col"]) or df["cat_col"] = df["col"].astype("category"). For full docs on Categorical, see the Categorical introduction and the API documentation. This feature was introduced in version 0.15.