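Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.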
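As a minimal sketch of the point above, the following applies two `.str` methods to a small Series containing a missing value. The sample data is made up for illustration; only `str.lower` and `str.len` are exercised.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value.
s = pd.Series(['A', 'B', 'Aaba', np.nan, 'dog'])

# Element-wise equivalents of str.lower and len; the missing value
# is skipped and stays missing in the result.
lowered = s.str.lower()
lengths = s.str.len()

print(lowered)
print(lengths)
```

The missing entry passes through both calls untouched, so no explicit NA check is needed before calling the method.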

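String methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace.

These string methods can then be used to clean up the columns as needed: strip leading and trailing whitespace, lowercase all names, and replace any remaining whitespace with underscores.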
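The cleanup described above can be sketched as follows. The column names and cell values are invented for the example; the chain relies only on `strip`, `lower` and `replace` being available on `df.columns.str`.

```python
import pandas as pd

# Hypothetical frame whose column names carry stray whitespace.
df = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]],
                  columns=[' Column A ', ' Column B '])

# df.columns is an Index, so the .str accessor works on it directly.
# Each step returns a new Index, which allows chaining.
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

print(df.columns.tolist())  # ['column_a', 'column_b']
```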

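Note

If you have a Series where many elements are repeated (i.e. the number of unique elements in the Series is much smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for a Series of type category, the string operations are done on the .categories and not on each element of the Series.

Please note that a Series of type category with string .categories has some limitations in comparison to a Series of type string (e.g. you cannot add strings to each other: s + " " + s will not work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series.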
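A small sketch of the note above, using made-up data with only two distinct values. It checks that the `.str` result is the same for both representations and that string concatenation fails on the categorical Series; it does not measure timing, which depends on the data and the pandas version.

```python
import pandas as pd

# Illustrative data: 4000 rows but only two distinct strings.
s = pd.Series(['apple', 'banana', 'apple', 'banana'] * 1000)
cat = s.astype('category')

upper_plain = s.str.upper()
upper_cat = cat.str.upper()   # same values as upper_plain

print(len(s), len(cat.cat.categories))
print(upper_cat.tolist() == upper_plain.tolist())

# Concatenation is one of the limitations mentioned above.
try:
    cat + ' ' + cat
    concat_failed = False
except TypeError:
    concat_failed = True
print(concat_failed)
```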

Splitting and Replacing Strings

Methods like split return a Series of lists:

In [15]: s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])

In [16]: s2.str.split('_')
Out[16]: 
0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

Elements in the split lists can be accessed using get or [] notation:

In [17]: s2.str.split('_').str.get(1)
Out[17]: 
0      b
1      d
2    NaN
3      g
dtype: object

In [18]: s2.str.split('_').str[1]
Out[18]: 
0      b
1      d
2    NaN
3      g
dtype: object

It is easy to expand this to return a DataFrame using expand.

In [19]: s2.str.split('_', expand=True)
Out[19]: 
     0     1     2
0    a     b     c
1    c     d     e
2  NaN  None  None
3    f     g     h

It is also possible to limit the number of splits:

In [20]: s2.str.split('_', expand=True, n=1)
Out[20]: 
     0     1
0    a   b_c
1    c   d_e
2  NaN  None
3    f   g_h

rsplit is similar to split except that it works in the reverse direction, i.e. from the end of the string to the beginning of the string:

In [21]: s2.str.rsplit('_', expand=True, n=1)
Out[21]: 
     0     1
0  a_b     c
1  c_d     e
2  NaN  None
3  f_g     h

Methods like replace and findall take regular expressions, too:

In [22]: s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
   ....:                '', np.nan, 'CABA', 'dog', 'cat'])
   ....: 

In [23]: s3
Out[23]: 
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6     NaN
7    CABA
8     dog
9     cat
dtype: object

In [24]: s3.str.replace('^.a|dog', 'XX-XX ', case=False)
Out[24]: 
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object

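Some caution must be taken to keep regular expressions in mind! For example, the following code will cause trouble because of the regular expression meaning of $: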

# Consider the following badly formatted financial data
In [25]: dollars = pd.Series(['12', '-$10', '$10,000'])

# This does what you'd naively expect:
In [26]: dollars.str.replace('$', '')
Out[26]: 
0        12
1       -10
2    10,000
dtype: object

# But this doesn't:
In [27]: dollars.str.replace('-$', '-')
Out[27]: 
0         12
1       -$10
2    $10,000
dtype: object

# We need to escape the special character (for >1 len patterns)
In [28]: dollars.str.replace(r'-\$', '-')
Out[28]: 
0         12
1        -10
2    $10,000
dtype: object
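Two ways to sidestep manual escaping are sketched below. Both are assumptions that go beyond the version described here: the `regex` keyword of `str.replace` exists only in newer pandas releases, and `re.escape` comes from the standard library rather than pandas.

```python
import re
import pandas as pd

dollars = pd.Series(['12', '-$10', '$10,000'])

# re.escape builds a pattern that matches the literal characters '-$'.
escaped = dollars.str.replace(re.escape('-$'), '-', regex=True)

# With regex=False the pattern is treated as a plain substring.
literal = dollars.str.replace('-$', '-', regex=False)

print(escaped.tolist())  # ['12', '-10', '$10,000']
print(literal.tolist())  # ['12', '-10', '$10,000']
```

Passing `regex` explicitly also makes the intent clear to readers, since the default has differed between pandas versions.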

Indexing with `.str`

You can use [] notation to index directly by position. If you index past the end of the string, the result will be NaN.

In [29]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
   ....:                'CABA', 'dog', 'cat'])
   ....: 

In [30]: s.str[0]
Out[30]: 
0      A
1      B
2      C
3      A
4      B
5    NaN
6      C
7      d
8      c
dtype: object

In [31]: s.str[1]
Out[31]: 
0    NaN
1    NaN
2    NaN
3      a
4      a
5    NaN
6      A
7      o
8      a
dtype: object
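The same [] notation also accepts slices and negative positions. The sketch below reuses the Series from the example above; the variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan,
               'CABA', 'dog', 'cat'])

first_two = s.str[:2]    # slice: up to the first two characters
last_char = s.str[-1]    # negative position: last character
second = s.str.get(1)    # same as s.str[1]; missing when the string is too short

print(first_two.tolist())
print(last_char.tolist())
print(second.isna().tolist())
```

Slicing never runs past the end of a string, so `first_two` is missing only where the input itself is missing, while `second` is also missing for the one-character strings.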

Extracting Substrings

Extract first match in each subject (extract)

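New in version 0.13.0.

Warning

In version 0.18.0, extract gained the expand argument. When expand=False, it returns a Series, Index, or DataFrame, depending on the subject and the regular expression pattern (the same behavior as before 0.18.0). When expand=True, it always returns a DataFrame, which is more consistent and less confusing from the user's perspective.

The extract method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.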

In [32]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'([ab])(\d)', expand=False)
Out[32]: 
     0    1
0    a    1
1    b    2
2  NaN  NaN

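Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be "converted" into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without needing get() to access tuples or re.match objects. The dtype of the result is always object, even if no match is found and the result contains only NaN.

Named groups like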

In [33]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'(?P<letter>[ab])(?P<digit>\d)', expand=False)
Out[33]: 
  letter digit
0      a     1
1      b     2
2    NaN   NaN

and optional groups like

In [34]: pd.Series(['a1', 'b2', '3']).str.extract(r'([ab])?(\d)', expand=False)
Out[34]: 
     0  1
0    a  1
1    b  2
2  NaN  3

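can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise, capture group numbers will be used.

Extracting a regular expression with one group returns a DataFrame with one column if expand=True.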

In [35]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=True)
Out[35]: 
     0
0    1
1    2
2  NaN

It returns a Series if expand=False.

In [36]: pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=False)
Out[36]: 
0      1
1      2
2    NaN
dtype: object

Calling extract on an Index with a regex that has exactly one capture group returns a DataFrame with one column if expand=True.

In [37]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"])

In [38]: s
Out[38]: 
A11    a1
B22    b2
C33    c3
dtype: object

In [39]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[39]: 
  letter
0      A
1      B
2      C

It returns an Index if expand=False.

In [40]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[40]: Index([u'A', u'B', u'C'], dtype='object', name=u'letter')

Calling extract on an Index with a regex that has more than one capture group returns a DataFrame if expand=True.

In [41]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[41]: 
  letter   1
0      A  11
1      B  22
2      C  33

It raises ValueError if expand=False.

>>> s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
ValueError: only one regex group is supported with Index

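The table below summarizes the behavior of extract(expand=False) (input subject in the first column, number of groups in the regex in the first row):

	1 group	>1 group
Index	Index	ValueError
Series	Series	DataFrame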
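The four cells of the table can be checked with a short sketch like the one below. The subjects and patterns are illustrative; the variable names are not part of pandas.

```python
import pandas as pd

ser = pd.Series(['a1', 'b2', 'c3'])
idx = pd.Index(['A11', 'B22', 'C33'])

one_group = r'([a-zA-Z])'
two_groups = r'([a-zA-Z])(\d+)'

ser_one = ser.str.extract(one_group, expand=False)    # Series
ser_two = ser.str.extract(two_groups, expand=False)   # DataFrame
idx_one = idx.str.extract(one_group, expand=False)    # Index

# More than one group on an Index with expand=False is rejected.
try:
    idx.str.extract(two_groups, expand=False)
    raised = False
except ValueError:
    raised = True

print(type(ser_one).__name__, type(ser_two).__name__,
      type(idx_one).__name__, raised)
```

Passing expand=True in all four cases would return a DataFrame each time, which is why it is the more predictable choice.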

Extract all matches in each subject (extractall)

New in version 0.18.0.

Unlike extract (which returns only the first match),

In [42]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])

In [43]: s
Out[43]: 
A    a1a2
B      b1
C      c1
dtype: object

In [44]: two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'

In [45]: s.str.extract(two_groups, expand=True)
Out[45]: 
  letter digit
A      a     1
B      b     1
C      c     1

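the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order of the match within the subject.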

In [46]: s.str.extractall(two_groups)
Out[46]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

When each subject string in the Series has exactly one match,

In [47]: s = pd.Series(['a3', 'b3', 'c2'])

In [48]: s
Out[48]: 
0    a3
1    b3
2    c2
dtype: object

then extractall(pat).xs(0, level='match') gives the same result as extract(pat).

In [49]: extract_result = s.str.extract(two_groups, expand=True)

In [50]: extract_result
Out[50]: 
  letter digit
0      a     3
1      b     3
2      c     2

In [51]: extractall_result = s.str.extractall(two_groups)

In [52]: extractall_result
Out[52]: 
        letter digit
  match             
0 0          a     3
1 0          b     3
2 0          c     2

In [53]: extractall_result.xs(0, level="match")
Out[53]: 
  letter digit
0      a     3
1      b     3
2      c     2

Index also supports .str.extractall. It returns a DataFrame with the same result as Series.str.extractall on a Series with a default index (starting from 0).

New in version 0.19.0.

In [54]: pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[54]: 
        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

In [55]: pd.Series(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[55]: 
        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1
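Because the result carries the subject label in the outer index level, ordinary index operations apply to it. The sketch below counts matches per subject and recovers the first match of each; the grouping step is an illustration, not something the text above prescribes.

```python
import pandas as pd

s = pd.Series(['a1a2', 'b1', 'c1'], index=['A', 'B', 'C'])
two_groups = r'(?P<letter>[a-z])(?P<digit>[0-9])'

matches = s.str.extractall(two_groups)

# Number of matches found in each subject string.
counts = matches.groupby(level=0).size()

# Selecting match number 0 gives the first match per subject.
first = matches.xs(0, level='match')

print(counts.tolist())        # [2, 1, 1]
print(first.values.tolist())  # [['a', '1'], ['b', '1'], ['c', '1']]
```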

Testing for Strings that Match or Contain a Pattern

You can check whether elements contain a pattern:

In [56]: pattern = r'[a-z][0-9]'

In [57]: pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
Out[57]: 
0    False
1    False
2    False
3    False
4    False
dtype: bool

or match a pattern:

In [58]: pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern, as_indexer=True)
Out[58]: 
0    False
1    False
2    False
3    False
4    False
dtype: bool

The distinction between match and contains is strictness: match relies on strict re.match, while contains relies on re.search.
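The difference shows up when the pattern occurs somewhere other than the start of the string. The sketch below adds two made-up subjects, 'a1' and 'xb2', to the earlier data. It assumes a pandas version in which match returns booleans by default, so as_indexer is not passed.

```python
import pandas as pd

pattern = r'[a-z][0-9]'
s = pd.Series(['1', '2', '3a', '3b', '03c', 'a1', 'xb2'])

# contains uses re.search: the pattern may occur anywhere.
contained = s.str.contains(pattern)

# match uses re.match: the pattern must occur at the start.
matched = s.str.match(pattern)

print(contained.tolist())
print(matched.tolist())
```

Only 'a1' satisfies both tests; 'xb2' contains the pattern ('b2') but does not start with it.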

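Warning

In previous versions, match was used for extracting groups, returning a not-so-convenient Series of tuples. The new method extract (described in the previous section) is now preferred.

This old, deprecated behavior of match is still the default. As demonstrated above, use the new behavior by setting as_indexer=True. In this mode, match is analogous to contains, returning a boolean Series. The new behavior will become the default in a future release.

Methods like match, contains, startswith, and endswith take an extra na argument so that missing values can be treated as True or False: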

In [59]: s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [60]: s4.str.contains('A', na=False)
Out[60]: 
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: bool
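One common reason to set na is boolean filtering, where the mask has to contain only True or False. A minimal sketch, reusing s4 from above:

```python
import numpy as np
import pandas as pd

s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# na=False turns the missing entry into False, so the mask is purely boolean.
mask = s4.str.contains('A', na=False)
selected = s4[mask]

print(selected.tolist())  # ['A', 'Aaba', 'CABA']
```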

Creating Indicator Variables

You can extract dummy variables from string columns. For example, if they are separated by a '|':

In [61]: s = pd.Series(['a', 'a|b', np.nan, 'a|c'])

In [62]: s.str.get_dummies(sep='|')
Out[62]: 
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1

A string Index also supports get_dummies, which returns a MultiIndex.

New in version 0.18.1.

In [63]: idx = pd.Index(['a', 'a|b', np.nan, 'a|c'])

In [64]: idx.str.get_dummies(sep='|')
Out[64]: 
MultiIndex(levels=[[0, 1], [0, 1], [0, 1]],
           labels=[[1, 1, 0, 1], [0, 1, 0, 0], [0, 0, 0, 1]],
           names=[u'a', u'b', u'c'])

See also get_dummies().
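In practice the indicator columns are often attached back to the frame they came from. The sketch below does this with DataFrame.join; the frame and its column names ('name', 'tags') are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a '|'-separated tag column.
df = pd.DataFrame({'name': ['w', 'x', 'y', 'z'],
                   'tags': ['a', 'a|b', np.nan, 'a|c']})

dummies = df['tags'].str.get_dummies(sep='|')

# The indicator frame shares df's index, so join aligns row by row.
result = df.join(dummies)

print(result.columns.tolist())  # ['name', 'tags', 'a', 'b', 'c']
```

The row with a missing tag ends up with zeros in every indicator column.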

Method Summary

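Method	Description
`cat()`	Concatenate strings
`split()`	Split strings on delimiter
`rsplit()`	Split strings on delimiter, working from the end of the string
`get()`	Index into each element (retrieve the i-th element)
`join()`	Join strings in each element of the Series with the passed separator
`get_dummies()`	Split strings on the delimiter, returning a DataFrame of dummy variables
`contains()`	Return a boolean array indicating whether each string contains the pattern/regex
`replace()`	Replace occurrences of the pattern/regex with some other string
`repeat()`	Duplicate values (`s.str.repeat(3)` is equivalent to `x * 3`)
`pad()`	Add whitespace to the left, right, or both sides of strings
`center()`	Equivalent to `str.center`
`ljust()`	Equivalent to `str.ljust`
`rjust()`	Equivalent to `str.rjust`
`zfill()`	Equivalent to `str.zfill`
`wrap()`	Split long strings into lines with length less than a given width
`slice()`	Slice each string in the Series
`slice_replace()`	Replace the slice in each string with the passed value
`count()`	Count occurrences of the pattern
`startswith()`	Equivalent to `str.startswith(pat)` for each element
`endswith()`	Equivalent to `str.endswith(pat)` for each element
`findall()`	Compute a list of all occurrences of the pattern/regex for each string
`match()`	Call `re.match` on each element, returning matched groups as a list
`extract()`	Call `re.search` on each element, returning a DataFrame with one row for each element and one column for each regex capture group
`extractall()`	Call `re.findall` on each element, returning a DataFrame with one row for each match and one column for each regex capture group
`len()`	Compute string lengths
`strip()`	Equivalent to `str.strip`
`rstrip()`	Equivalent to `str.rstrip`
`lstrip()`	Equivalent to `str.lstrip`
`partition()`	Equivalent to `str.partition`
`rpartition()`	Equivalent to `str.rpartition`
`lower()`	Equivalent to `str.lower`
`upper()`	Equivalent to `str.upper`
`find()`	Equivalent to `str.find`
`rfind()`	Equivalent to `str.rfind`
`index()`	Equivalent to `str.index`
`rindex()`	Equivalent to `str.rindex`
`capitalize()`	Equivalent to `str.capitalize`
`swapcase()`	Equivalent to `str.swapcase`
`normalize()`	Return the Unicode normal form. Equivalent to `unicodedata.normalize`
`translate()`	Equivalent to `str.translate`
`isalnum()`	Equivalent to `str.isalnum`
`isalpha()`	Equivalent to `str.isalpha`
`isdigit()`	Equivalent to `str.isdigit`
`isspace()`	Equivalent to `str.isspace`
`islower()`	Equivalent to `str.islower`
`isupper()`	Equivalent to `str.isupper`
`istitle()`	Equivalent to `str.istitle`
`isnumeric()`	Equivalent to `str.isnumeric`
`isdecimal()`	Equivalent to `str.isdecimal`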
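A few of the methods in the table that are not demonstrated elsewhere on this page can be tried with a sketch like the following. The input values are invented for the example.

```python
import pandas as pd

s = pd.Series(['7', '42', '365'])

zero_filled = s.str.zfill(5)                      # like str.zfill(5)
padded = s.str.pad(5, side='left', fillchar='*')  # pad on the left with '*'
repeated = s.str.repeat(2)                        # like x * 2 for each element
joined = s.str.cat(sep='-')                       # one string from all elements
digit_counts = s.str.count('[0-9]')               # regex matches per element

print(zero_filled.tolist())   # ['00007', '00042', '00365']
print(padded.tolist())        # ['****7', '***42', '**365']
print(repeated.tolist())      # ['77', '4242', '365365']
print(joined)                 # 7-42-365
print(digit_counts.tolist())  # [1, 2, 3]
```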