What’s new in 1.3.0 (July 2, 2021)

These are the changes in pandas 1.3.0. See Release notes for a full changelog including other versions of pandas.

Warning

When reading new Excel 2007+ (.xlsx) files, the default argument engine=None to read_excel() will now result in using the openpyxl engine in all cases when the option io.excel.xlsx.reader is set to "auto". Previously, some cases would use the xlrd engine instead. See What’s new 1.2.0 for background on this change.
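
For example, to request the engine explicitly rather than rely on the default resolution (a minimal sketch; "data.xlsx" is a hypothetical file name):

import pandas as pd

# Explicitly select openpyxl for an .xlsx file; with engine=None (the
# default) and io.excel.xlsx.reader set to "auto", openpyxl is now chosen
# in all cases anyway.
df = pd.read_excel("data.xlsx", engine="openpyxl")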

Enhancements

Custom HTTP(s) headers when reading csv or json files

When reading from a remote URL that is not handled by fsspec (e.g. HTTP and HTTPS) the dictionary passed to storage_options will be used to create the headers included in the request. This can be used to control the User-Agent header or send other custom headers (GH 36688). For example:

In [1]: headers = {"User-Agent": "pandas"}

In [2]: df = pd.read_csv(
   ...:     "https://download.bls.gov/pub/time.series/cu/cu.item",
   ...:     sep="\t",
   ...:     storage_options=headers
   ...: )
   ...: 

Read and write XML documents

We added I/O support to read and render shallow versions of XML documents with read_xml() and DataFrame.to_xml(). With lxml as the parser, both XPath 1.0 and XSLT 1.0 are available (GH 27554). For example:

In [1]: xml = """<?xml version='1.0' encoding='utf-8'?>
   ...: <data>
   ...:  <row>
   ...:     <shape>square</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides>4.0</sides>
   ...:  </row>
   ...:  <row>
   ...:     <shape>circle</shape>
   ...:     <degrees>360</degrees>
   ...:     <sides/>
   ...:  </row>
   ...:  <row>
   ...:     <shape>triangle</shape>
   ...:     <degrees>180</degrees>
   ...:     <sides>3.0</sides>
   ...:  </row>
   ...:  </data>"""

In [2]: df = pd.read_xml(xml)
In [3]: df
Out[3]:
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0

In [4]: df.to_xml()
Out[4]:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>

For more, see Writing XML in the user guide on IO tools.

Styler enhancements

We provided some focused development on Styler. See also the Styler documentation which has been revised and improved (GH 39720, GH 39317, GH 40493).
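
As a small illustration of chaining Styler methods (a sketch, not from the release notes; the styling choices are arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 3), columns=["A", "B", "C"])

# Chain a few Styler methods and render the result to an HTML string.
html = (
    df.style
    .highlight_max(color="lightgreen")  # highlight each column's maximum
    .set_caption("Random data")
    .render()
)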

DataFrame constructor honors copy=False with dict

When passing a dictionary to DataFrame with copy=False, a copy will no longer be made (GH 32960).

In [3]: arr = np.array([1, 2, 3])

In [4]: df = pd.DataFrame({"A": arr, "B": arr.copy()}, copy=False)

In [5]: df
Out[5]: 
   A  B
0  1  1
1  2  2
2  3  3

df["A"] remains a view on arr:

In [6]: arr[0] = 0

In [7]: assert df.iloc[0, 0] == 0

The default behavior when not passing copy will remain unchanged, i.e. a copy will be made.

PyArrow backed string data type

We’ve enhanced the StringDtype, an extension type dedicated to string data. (GH 39908)

It is now possible to specify a storage keyword option to StringDtype. Use pandas options or specify the dtype using dtype='string[pyarrow]' to allow the StringArray to be backed by a PyArrow array instead of a NumPy array of Python objects.

The PyArrow backed StringArray requires pyarrow 1.0.0 or greater to be installed.

Warning

string[pyarrow] is currently considered experimental. The implementation and parts of the API may change without warning.

In [8]: pd.Series(['abc', None, 'def'], dtype=pd.StringDtype(storage="pyarrow"))
Out[8]: 
0     abc
1    <NA>
2     def
dtype: string

You can use the alias "string[pyarrow]" as well.

In [9]: s = pd.Series(['abc', None, 'def'], dtype="string[pyarrow]")

In [10]: s
Out[10]: 
0     abc
1    <NA>
2     def
dtype: string

You can also create a PyArrow backed string array using pandas options.

In [11]: with pd.option_context("string_storage", "pyarrow"):
   ....:     s = pd.Series(['abc', None, 'def'], dtype="string")
   ....: 

In [12]: s
Out[12]: 
0     abc
1    <NA>
2     def
dtype: string

The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype.

In [13]: s.str.upper()
Out[13]: 
0     ABC
1    <NA>
2     DEF
dtype: string

In [14]: s.str.split('b', expand=True).dtypes
Out[14]: 
0    string
1    string
dtype: object

String accessor methods returning integers will return a value with Int64Dtype:

In [15]: s.str.count("a")
Out[15]: 
0       1
1    <NA>
2       0
dtype: Int64

Centered datetime-like rolling windows

When performing rolling calculations on DataFrame and Series objects with a datetime-like index, a centered datetime-like window can now be used (GH 38780). For example:

In [16]: df = pd.DataFrame(
   ....:     {"A": [0, 1, 2, 3, 4]}, index=pd.date_range("2020", periods=5, freq="1D")
   ....: )
   ....: 

In [17]: df
Out[17]: 
            A
2020-01-01  0
2020-01-02  1
2020-01-03  2
2020-01-04  3
2020-01-05  4

In [18]: df.rolling("2D", center=True).mean()
Out[18]: 
              A
2020-01-01  0.5
2020-01-02  1.5
2020-01-03  2.5
2020-01-04  3.5
2020-01-05  4.0

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

Categorical.unique now always maintains same dtype as original

Previously, when calling Categorical.unique() with categorical data, unused categories in the new array would be removed, making the dtype of the new array different from that of the original (GH 18291)

As an example of this, given:

In [19]: dtype = pd.CategoricalDtype(['bad', 'neutral', 'good'], ordered=True)

In [20]: cat = pd.Categorical(['good', 'good', 'bad', 'bad'], dtype=dtype)

In [21]: original = pd.Series(cat)

In [22]: unique = original.unique()

Previous behavior:

In [1]: unique
['good', 'bad']
Categories (2, object): ['bad' < 'good']
In [2]: original.dtype == unique.dtype
False

New behavior:

In [23]: unique
Out[23]: 
['good', 'bad']
Categories (3, object): ['bad' < 'neutral' < 'good']

In [24]: original.dtype == unique.dtype
Out[24]: True

Preserve dtypes in DataFrame.combine_first()

DataFrame.combine_first() will now preserve dtypes (GH 7509)

In [25]: df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}, index=[0, 1, 2])

In [26]: df1
Out[26]: 
   A  B
0  1  1
1  2  2
2  3  3

In [27]: df2 = pd.DataFrame({"B": [4, 5, 6], "C": [1, 2, 3]}, index=[2, 3, 4])

In [28]: df2
Out[28]: 
   B  C
2  4  1
3  5  2
4  6  3

In [29]: combined = df1.combine_first(df2)

Previous behavior:

In [1]: combined.dtypes
Out[1]:
A    float64
B    float64
C    float64
dtype: object

New behavior:

In [30]: combined.dtypes
Out[30]: 
A    float64
B      int64
C    float64
dtype: object

Groupby methods agg and transform no longer change return dtype for callables

Previously the methods DataFrameGroupBy.aggregate(), SeriesGroupBy.aggregate(), DataFrameGroupBy.transform(), and SeriesGroupBy.transform() might cast the result dtype when the argument func is callable, possibly leading to undesirable results (GH 21240). The cast would occur if the result is numeric and casting back to the input dtype does not change any values as measured by np.allclose. Now no such casting occurs.

In [31]: df = pd.DataFrame({'key': [1, 1], 'a': [True, False], 'b': [True, True]})

In [32]: df
Out[32]: 
   key      a     b
0    1   True  True
1    1  False  True

Previous behavior:

In [5]: df.groupby('key').agg(lambda x: x.sum())
Out[5]:
        a  b
key
1    True  2

New behavior:

In [33]: df.groupby('key').agg(lambda x: x.sum())
Out[33]: 
     a  b
key      
1    1  2

float result for GroupBy.mean(), GroupBy.median(), and GroupBy.var()

Previously, these methods could result in different dtypes depending on the input values. Now, these methods will always return a float dtype. (GH 41137)

In [34]: df = pd.DataFrame({'a': [True], 'b': [1], 'c': [1.0]})

Previous behavior:

In [5]: df.groupby(df.index).mean()
Out[5]:
        a  b    c
0    True  1  1.0

New behavior:

In [35]: df.groupby(df.index).mean()
Out[35]: 
     a    b    c
0  1.0  1.0  1.0

Try operating inplace when setting values with loc and iloc

When setting an entire column using loc or iloc, pandas will try to insert the values into the existing data rather than create an entirely new array.

In [36]: df = pd.DataFrame(range(3), columns=["A"], dtype="float64")

In [37]: values = df.values

In [38]: new = np.array([5, 6, 7], dtype="int64")

In [39]: df.loc[[0, 1, 2], "A"] = new

In both the new and old behavior, the data in values is overwritten, but in the old behavior the dtype of df["A"] changed to int64.

Previous behavior:

In [1]: df.dtypes
Out[1]:
A    int64
dtype: object
In [2]: np.shares_memory(df["A"].values, new)
Out[2]: False
In [3]: np.shares_memory(df["A"].values, values)
Out[3]: False

In pandas 1.3.0, df continues to share data with values.

New behavior:

In [40]: df.dtypes
Out[40]: 
A    float64
dtype: object

In [41]: np.shares_memory(df["A"], new)
Out[41]: False

In [42]: np.shares_memory(df["A"], values)
Out[42]: True

Never operate inplace when setting frame[keys] = values

When setting multiple columns using frame[keys] = values, new arrays will replace the pre-existing arrays for these keys, which will not be overwritten (GH 39510). As a result, the columns will retain the dtype(s) of values, never casting to the dtypes of the existing arrays.

In [43]: df = pd.DataFrame(range(3), columns=["A"], dtype="float64")

In [44]: df[["A"]] = 5

In the old behavior, 5 was cast to float64 and inserted into the existing array backing df:

Previous behavior:

In [1]: df.dtypes
Out[1]:
A    float64
dtype: object

In the new behavior, we get a new array, and retain an integer-dtyped 5:

New behavior:

In [45]: df.dtypes
Out[45]: 
A    int64
dtype: object

Consistent casting with setting into Boolean Series

Setting non-boolean values into a Series with dtype=bool now consistently casts to dtype=object (GH 38709)

In [46]: orig = pd.Series([True, False])

In [47]: ser = orig.copy()

In [48]: ser.iloc[1] = np.nan

In [49]: ser2 = orig.copy()

In [50]: ser2.iloc[1] = 2.0

Previous behavior:

In [1]: ser
Out[1]:
0    1.0
1    NaN
dtype: float64

In [2]: ser2
Out[2]:
0    True
1     2.0
dtype: object

New behavior:

In [51]: ser
Out[51]: 
0    True
1     NaN
dtype: object

In [52]: ser2
Out[52]: 
0    True
1     2.0
dtype: object

GroupBy.rolling no longer returns grouped-by column in values

The group-by column will now be dropped from the result of a groupby.rolling operation (GH 32262)

In [53]: df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})

In [54]: df
Out[54]: 
   A  B
0  1  0
1  1  1
2  2  2
3  3  3

Previous behavior:

In [1]: df.groupby("A").rolling(2).sum()
Out[1]:
       A    B
A
1 0  NaN  NaN
  1  2.0  1.0
2 2  NaN  NaN
3 3  NaN  NaN

New behavior:

In [55]: df.groupby("A").rolling(2).sum()
Out[55]: 
       B
A       
1 0  NaN
  1  1.0
2 2  NaN
3 3  NaN

Removed artificial truncation in rolling variance and standard deviation

Rolling.std() and Rolling.var() will no longer artificially truncate results that are less than ~1e-8 and ~1e-15 respectively to zero (GH 37051, GH 40448, GH 39872).

However, floating point artifacts may now exist in the results when rolling over larger values.

In [56]: s = pd.Series([7, 5, 5, 5])

In [57]: s.rolling(3).var()
Out[57]: 
0         NaN
1         NaN
2    1.333333
3    0.000000
dtype: float64

GroupBy.rolling with MultiIndex no longer drops levels in the result

GroupBy.rolling() will no longer drop levels of a DataFrame with a MultiIndex in the result. This can lead to a perceived duplication of levels in the resulting MultiIndex, but this change restores the behavior that was present in version 1.1.3 (GH 38787, GH 38523).

In [58]: index = pd.MultiIndex.from_tuples([('idx1', 'idx2')], names=['label1', 'label2'])

In [59]: df = pd.DataFrame({'a': [1], 'b': [2]}, index=index)

In [60]: df
Out[60]: 
               a  b
label1 label2      
idx1   idx2    1  2

Previous behavior:

In [1]: df.groupby('label1').rolling(1).sum()
Out[1]:
          a    b
label1
idx1    1.0  2.0

New behavior:

In [61]: df.groupby('label1').rolling(1).sum()
Out[61]: 
                        a    b
label1 label1 label2          
idx1   idx1   idx2    1.0  2.0

Backwards incompatible API changes

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package          Minimum Version  Required  Changed
numpy            1.17.3           X         X
pytz             2017.3           X
python-dateutil  2.7.3            X
bottleneck       1.2.1
numexpr          2.7.0                      X
pytest (dev)     6.0                        X
mypy (dev)       0.812                      X
setuptools       38.6.0                     X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package         Minimum Version  Changed
beautifulsoup4  4.6.0
fastparquet     0.4.0            X
fsspec          0.7.4
gcsfs           0.6.0
lxml            4.3.0
matplotlib      2.2.3
numba           0.46.0
openpyxl        3.0.0            X
pyarrow         0.17.0           X
pymysql         0.8.1            X
pytables        3.5.1
s3fs            0.4.0
scipy           1.2.0
sqlalchemy      1.3.0            X
tabulate        0.8.7            X
xarray          0.12.0
xlrd            1.2.0
xlsxwriter      1.0.2
xlwt            1.3.0
pandas-gbq      0.12.0

See Dependencies and Optional dependencies for more.
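
To compare an installed optional dependency against these minimums, something like the following can be used (a small sketch, not part of the release notes):

import importlib

def optional_version(name):
    """Return the installed version of an optional dependency, or None."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

print(optional_version("openpyxl"))  # e.g. "3.0.0", or None if not installed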

Other API changes

  • Partially initialized CategoricalDtype objects (i.e. those with categories=None) will no longer compare as equal to fully initialized dtype objects (GH 38516)

  • Accessing _constructor_expanddim on a DataFrame and _constructor_sliced on a Series now raise an AttributeError. Previously a NotImplementedError was raised (GH 38782)

  • Added new engine and **engine_kwargs parameters to DataFrame.to_sql() to support other future “SQL engines”. Currently we still only use SQLAlchemy under the hood, but more engines are planned to be supported such as turbodbc (GH 36893)

  • Removed redundant freq from PeriodIndex string representation (GH 41653)

  • ExtensionDtype.construct_array_type() is now a required method instead of an optional one for ExtensionDtype subclasses (GH 24860)

  • Calling hash on non-hashable pandas objects will now raise TypeError with the built-in error message (e.g. unhashable type: 'Series'). Previously it would raise a custom message such as 'Series' objects are mutable, thus they cannot be hashed. Furthermore, isinstance(<Series>, collections.abc.Hashable) will now return False (GH 40013); see the sketch after this list

  • Styler.from_custom_template() now has two new arguments for template names, and removed the old name, due to template inheritance having been introduced for better parsing (GH 42053). Subclassing modifications to Styler attributes are also needed.
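
A quick sketch of the new hashing behavior described above:

import collections.abc

import pandas as pd

s = pd.Series([1, 2, 3])

try:
    hash(s)
except TypeError as err:
    print(err)  # unhashable type: 'Series'

print(isinstance(s, collections.abc.Hashable))  # False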

Build

  • Documentation in .pptx and .pdf formats is no longer included in wheels or source distributions (GH 30741)

Deprecations

Deprecated dropping nuisance columns in DataFrame reductions and DataFrameGroupBy operations

When calling a reduction (e.g. .min, .max, .sum) on a DataFrame with numeric_only=None (the default), columns where the reduction raises a TypeError are silently ignored and dropped from the result.

This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.

For example:

In [62]: df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})

In [63]: df
Out[63]: 
   A          B
0  1 2016-01-01
1  2 2016-01-02
2  3 2016-01-03
3  4 2016-01-04

Old behavior:

In [3]: df.prod()
Out[3]:
A    24
dtype: int64

Future behavior:

In [4]: df.prod()
...
TypeError: 'DatetimeArray' does not implement reduction 'prod'

In [5]: df[["A"]].prod()
Out[5]:
A    24
dtype: int64

Similarly, when applying a function to DataFrameGroupBy, columns on which the function raises TypeError are currently silently ignored and dropped from the result.

This behavior is deprecated. In a future version, the TypeError will be raised, and users will need to select only valid columns before calling the function.

For example:

In [64]: df = pd.DataFrame({"A": [1, 2, 3, 4], "B": pd.date_range("2016-01-01", periods=4)})

In [65]: gb = df.groupby([1, 1, 2, 2])

Old behavior:

In [4]: gb.prod(numeric_only=False)
Out[4]:
A
1   2
2  12

Future behavior:

In [5]: gb.prod(numeric_only=False)
...
TypeError: datetime64 type does not support prod operations

In [6]: gb[["A"]].prod(numeric_only=False)
Out[6]:
    A
1   2
2  12

Other Deprecations

Performance improvements

Bug fixes

Categorical

  • Bug in CategoricalIndex incorrectly failing to raise TypeError when scalar data is passed (GH 38614)

  • Bug in CategoricalIndex.reindex failing when the passed Index was not categorical but its values were all labels in the category (GH 28690)

  • Bug where constructing a Categorical from an object-dtype array of date objects did not round-trip correctly with astype (GH 38552)

  • Bug in constructing a DataFrame from an ndarray and a CategoricalDtype (GH 38857)

  • Bug in setting categorical values into an object-dtype column in a DataFrame (GH 39136)

  • Bug in DataFrame.reindex() was raising an IndexError when the new index contained duplicates and the old index was a CategoricalIndex (GH 38906)

  • Bug in Categorical.fillna() with a tuple-like category raising NotImplementedError instead of ValueError when filling with a non-category tuple (GH 41914)

Datetimelike

Timedelta

  • Bug in constructing Timedelta from np.timedelta64 objects with non-nanosecond units that are out of bounds for timedelta64[ns] (GH 38965)

  • Bug in constructing a TimedeltaIndex incorrectly accepting np.datetime64("NaT") objects (GH 39462)

  • Bug in constructing Timedelta from an input string with only symbols and no digits failed to raise an error (GH 39710)

  • Bug in TimedeltaIndex and to_timedelta() failing to raise when passed non-nanosecond timedelta64 arrays that overflow when converting to timedelta64[ns] (GH 40008)

Timezones

  • Bug in different tzinfo objects representing UTC not being treated as equivalent (GH 39216)

  • Bug in dateutil.tz.gettz("UTC") not being recognized as equivalent to other UTC-representing tzinfos (GH 39276)

Numeric

Conversion

  • Bug in Series.to_dict() with orient='records': it now returns Python native types (GH 25969)

  • Bug in Series.view() and Index.view() when converting between datetime-like (datetime64[ns], datetime64[ns, tz], timedelta64, period) dtypes (GH 39788)

  • Bug in creating a DataFrame from an empty np.recarray not retaining the original dtypes (GH 40121)

  • Bug in DataFrame failing to raise a TypeError when constructing from a frozenset (GH 40163)

  • Bug in Index construction silently ignoring a passed dtype when the data cannot be cast to that dtype (GH 21311)

  • Bug in StringArray.astype() falling back to NumPy and raising when converting to dtype='categorical' (GH 40450)

  • Bug in factorize() where, when given an array with a numeric NumPy dtype lower than int64, uint64 and float64, the unique values did not keep their original dtype (GH 41132); see the sketch after this list

  • Bug in DataFrame construction with a dictionary containing an array-like with ExtensionDtype and copy=True failing to make a copy (GH 38939)

  • Bug in qcut() raising error when taking Float64DType as input (GH 40730)

  • Bug in DataFrame and Series construction with datetime64[ns] data and dtype=object resulting in datetime objects instead of Timestamp objects (GH 41599)

  • Bug in DataFrame and Series construction with timedelta64[ns] data and dtype=object resulting in np.timedelta64 objects instead of Timedelta objects (GH 41599)

  • Bug in DataFrame construction when given a two-dimensional object-dtype np.ndarray of Period or Interval objects failing to cast to PeriodDtype or IntervalDtype, respectively (GH 41812)

  • Bug in constructing a Series from a list and a PandasDtype (GH 39357)

  • Bug in creating a Series from a range object that does not fit in the bounds of int64 dtype (GH 30173)

  • Bug in creating a Series from a dict with all-tuple keys and an Index that requires reindexing (GH 41707)

  • Bug in infer_dtype() not recognizing Series, Index, or array with a Period dtype (GH 23553)

  • Bug in infer_dtype() raising an error for general ExtensionArray objects. It will now return "unknown-array" instead of raising (GH 37367)

  • Bug in DataFrame.convert_dtypes() incorrectly raised a ValueError when called on an empty DataFrame (GH 40393)
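
A minimal sketch of the factorize() fix flagged above (arbitrary int8 input):

import numpy as np
import pandas as pd

values = np.array([1, 2, 1], dtype="int8")
codes, uniques = pd.factorize(values)

print(codes)          # [0 1 0]
print(uniques.dtype)  # int8, the unique values keep their original dtype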

Strings

Interval

  • Bug in IntervalIndex.intersection() and IntervalIndex.symmetric_difference() always returning object-dtype when operating with CategoricalIndex (GH 38653, GH 38741)

  • Bug in IntervalIndex.intersection() returning duplicates when at least one of the Index objects have duplicates which are present in the other (GH 38743)

  • IntervalIndex.union(), IntervalIndex.intersection(), IntervalIndex.difference(), and IntervalIndex.symmetric_difference() now cast to the appropriate dtype instead of raising a TypeError when operating with another IntervalIndex with incompatible dtype (GH 39267)

  • PeriodIndex.union(), PeriodIndex.intersection(), PeriodIndex.symmetric_difference(), PeriodIndex.difference() now cast to object dtype instead of raising IncompatibleFrequency when operating with another PeriodIndex with incompatible dtype (GH 39306)

  • Bug in IntervalIndex.is_monotonic(), IntervalIndex.get_loc(), IntervalIndex.get_indexer_for(), and IntervalIndex.__contains__() when NA values are present (GH 41831)

Indexing

  • Bug in Index.union() and MultiIndex.union() dropping duplicate Index values when Index was not monotonic or sort was set to False (GH 36289, GH 31326, GH 40862)

  • Bug in CategoricalIndex.get_indexer() failing to raise InvalidIndexError when non-unique (GH 38372)

  • Bug in IntervalIndex.get_indexer() when target has CategoricalDtype and both the index and the target contain NA values (GH 41934)

  • Bug in Series.loc() raising a ValueError when input was filtered with a Boolean list and values to set were a list with lower dimension (GH 20438)

  • Bug in inserting many new columns into a DataFrame causing incorrect subsequent indexing behavior (GH 38380)

  • Bug in DataFrame.__setitem__() raising a ValueError when setting multiple values to duplicate columns (GH 15695)

  • Bug in DataFrame.loc(), Series.loc(), DataFrame.__getitem__() and Series.__getitem__() returning incorrect elements for non-monotonic DatetimeIndex for string slices (GH 33146)

  • Bug in DataFrame.reindex() and Series.reindex() with timezone aware indexes raising a TypeError for method="ffill" and method="bfill" and specified tolerance (GH 38566)

  • Bug in DataFrame.reindex() with datetime64[ns] or timedelta64[ns] incorrectly casting to integers when the fill_value requires casting to object dtype (GH 39755)

  • Bug in DataFrame.__setitem__() raising a ValueError when setting on an empty DataFrame using specified columns and a nonempty DataFrame value (GH 38831)

  • Bug in DataFrame.loc.__setitem__() raising a ValueError when operating on a unique column when the DataFrame has duplicate columns (GH 38521)

  • Bug in DataFrame.iloc.__setitem__() and DataFrame.loc.__setitem__() with mixed dtypes when setting with a dictionary value (GH 38335)

  • Bug in Series.loc.__setitem__() and DataFrame.loc.__setitem__() raising KeyError when provided a Boolean generator (GH 39614)

  • Bug in Series.iloc() and DataFrame.iloc() raising a KeyError when provided a generator (GH 39614)

  • Bug in DataFrame.__setitem__() not raising a ValueError when the right hand side is a DataFrame with wrong number of columns (GH 38604)

  • Bug in Series.__setitem__() raising a ValueError when setting a Series with a scalar indexer (GH 38303)

  • Bug in DataFrame.loc() dropping levels of a MultiIndex when the DataFrame used as input has only one row (GH 10521)

  • Bug in DataFrame.__getitem__() and Series.__getitem__() always raising KeyError when slicing with existing strings where the Index has milliseconds (GH 33589)

  • Bug in setting timedelta64 or datetime64 values into numeric Series failing to cast to object dtype (GH 39086, GH 39619)

  • Bug in setting Interval values into a Series or DataFrame with mismatched IntervalDtype incorrectly casting the new values to the existing dtype (GH 39120)

  • Bug in setting datetime64 values into a Series with integer-dtype incorrectly casting the datetime64 values to integers (GH 39266)

  • Bug in setting np.datetime64("NaT") into a Series with Datetime64TZDtype incorrectly treating the timezone-naive value as timezone-aware (GH 39769)

  • Bug in Index.get_loc() not raising KeyError when key=NaN and method is specified but NaN is not in the Index (GH 39382)

  • Bug in DatetimeIndex.insert() when inserting np.datetime64("NaT") into a timezone-aware index incorrectly treating the timezone-naive value as timezone-aware (GH 39769)

  • Bug in Index.insert() incorrectly raising when setting a new column that cannot be held in the existing frame.columns, and in Series.reset_index() and DataFrame.reset_index(), instead of casting to a compatible dtype (GH 39068)

  • Bug in RangeIndex.append() where a single object of length 1 was concatenated incorrectly (GH 39401)

  • Bug in RangeIndex.astype() where when converting to CategoricalIndex, the categories became a Int64Index instead of a RangeIndex (GH 41263)

  • Bug in setting numpy.timedelta64 values into an object-dtype Series using a Boolean indexer (GH 39488)

  • Bug in setting numeric values into a boolean-dtype Series using at or iat failing to cast to object dtype (GH 39582)

  • Bug in DataFrame.__setitem__() and DataFrame.iloc.__setitem__() raising ValueError when trying to index with a row-slice and setting a list as values (GH 40440)

  • Bug in DataFrame.loc() not raising KeyError when the key was not found in MultiIndex and the levels were not fully specified (GH 41170)

  • Bug in DataFrame.loc.__setitem__() when setting-with-expansion incorrectly raising when the index in the expanding axis contained duplicates (GH 40096)

  • Bug in DataFrame.loc.__getitem__() with MultiIndex casting to float when at least one index column has float dtype and we retrieve a scalar (GH 41369)

  • Bug in DataFrame.loc() incorrectly matching non-Boolean index elements (GH 20432)

  • Bug in indexing with np.nan on a Series or DataFrame with a CategoricalIndex incorrectly raising KeyError when np.nan keys are present (GH 41933)

  • Bug in Series.__delitem__() with ExtensionDtype incorrectly casting to ndarray (GH 40386)

  • Bug in DataFrame.at() with a CategoricalIndex returning incorrect results when passed integer keys (GH 41846)

  • Bug in DataFrame.loc() returning a MultiIndex in the wrong order if an indexer has duplicates (GH 40978)

  • Bug in DataFrame.__setitem__() raising a TypeError when using a str subclass as the column name with a DatetimeIndex (GH 37366)

  • Bug in PeriodIndex.get_loc() failing to raise a KeyError when given a Period with a mismatched freq (GH 41670)

  • Bug in .loc.__getitem__ with a UInt64Index and negative-integer keys raising OverflowError instead of KeyError in some cases, wrapping around to positive integers in others (GH 41777)

  • Bug in Index.get_indexer() failing to raise ValueError in some cases with invalid method, limit, or tolerance arguments (GH 41918)

  • Bug when slicing a Series or DataFrame with a TimedeltaIndex when passing an invalid string raising ValueError instead of a TypeError (GH 41821)

  • Bug in Index constructor sometimes silently ignoring a specified dtype (GH 38879)

  • Index.where() behavior now mirrors Index.putmask() behavior, i.e. index.where(mask, other) matches index.putmask(~mask, other) (GH 39412)
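
A quick sketch of the Index.where() / Index.putmask() equivalence above (arbitrary sample values):

import numpy as np
import pandas as pd

idx = pd.Index([1, 2, 3, 4])
mask = np.array([True, False, True, False])

# where keeps values where mask is True; putmask replaces where its mask is True
a = idx.where(mask, -1)
b = idx.putmask(~mask, -1)

assert a.equals(b)  # both are Int64Index([1, -1, 3, -1])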

Missing

MultiIndex

  • Bug in DataFrame.drop() raising a TypeError when the MultiIndex is non-unique and level is not provided (GH 36293)

  • Bug in MultiIndex.intersection() duplicating NaN in the result (GH 38623)

  • Bug in MultiIndex.equals() incorrectly returning True when the MultiIndex contained NaN even when they are differently ordered (GH 38439)

  • Bug in MultiIndex.intersection() always returning an empty result when intersecting with CategoricalIndex (GH 38653)

  • Bug in MultiIndex.difference() incorrectly raising TypeError when indexes contain non-sortable entries (GH 41915)

  • Bug in MultiIndex.reindex() raising a ValueError when used on an empty MultiIndex and indexing only a specific level (GH 41170)

  • Bug in MultiIndex.reindex() raising TypeError when reindexing against a flat Index (GH 41707)

I/O

Period

  • Comparisons of Period objects or Index, Series, or DataFrame with mismatched PeriodDtype now behave like other mismatched-type comparisons, returning False for equals, True for not-equal, and raising TypeError for inequality checks (GH 39274)
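
A small sketch of the new comparison semantics (the mismatched frequencies are arbitrary):

import pandas as pd

p1 = pd.Period("2021-01", freq="M")
p2 = pd.Period("2021-01-01", freq="D")

print(p1 == p2)  # False: mismatched PeriodDtype compares unequal
print(p1 != p2)  # True

try:
    p1 < p2      # ordered comparisons raise instead
except TypeError as err:
    print(err)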

Plotting

  • Bug in plotting.scatter_matrix() raising when 2d ax argument passed (GH 16253)

  • Prevent warnings when Matplotlib’s constrained_layout is enabled (GH 25261)

  • Bug in DataFrame.plot() was showing the wrong colors in the legend if the function was called repeatedly and some calls used yerr while others didn’t (GH 39522)

  • Bug in DataFrame.plot() was showing the wrong colors in the legend if the function was called repeatedly and some calls used secondary_y and others use legend=False (GH 40044)

  • Bug in DataFrame.plot.box() when dark_background theme was selected, caps or min/max markers for the plot were not visible (GH 40769)

Groupby/resample/rolling

Reshaping

Sparse

  • Bug in DataFrame.sparse.to_coo() raising a KeyError with columns that are a numeric Index without a 0 (GH 18414)

  • Bug in SparseArray.astype() with copy=False producing incorrect results when going from integer dtype to floating dtype (GH 34456)

  • Bug in SparseArray.max() and SparseArray.min(), which would always return an empty result (GH 40921)
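
A minimal sketch of the SparseArray min/max fix above (arbitrary sample data):

import pandas as pd

arr = pd.arrays.SparseArray([0, 1, 2])

print(arr.max())  # 2
print(arr.min())  # 0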

ExtensionArray

Styler

  • Bug in Styler where the subset argument in methods raised an error for some valid MultiIndex slices (GH 33562)

  • Styler rendered HTML output has seen minor alterations to support w3 good code standards (GH 39626)

  • Bug in Styler where rendered HTML was missing a column class identifier for certain header cells (GH 39716)

  • Bug in Styler.background_gradient() where text-color was not determined correctly (GH 39888)

  • Bug in Styler.set_table_styles() where multiple elements in CSS-selectors of the table_styles argument were not correctly added (GH 34061); see the sketch after this list

  • Bug in Styler where copying from Jupyter dropped the top left cell and misaligned headers (GH 12147)

  • Bug in Styler.where() where kwargs were not passed to the applicable callable (GH 40845)

  • Bug in Styler causing CSS to duplicate on multiple renders (GH 39395, GH 40334)
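
A short sketch of the Styler.set_table_styles() fix referenced above (the CSS is arbitrary):

import pandas as pd

df = pd.DataFrame({"A": [1, 2]})

# A comma-separated selector ("th, td") now has all of its elements styled.
styled = df.style.set_table_styles(
    [{"selector": "th, td", "props": [("border", "1px solid black")]}]
)
html = styled.render()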

Other

Contributors

For a full list of contributors, see https://github.com/pandas-dev/pandas/graphs/contributors