How do I create plots in pandas?

../../_images/04_plot_overview.svg
In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt

Data used for this tutorial:
  • This tutorial uses air quality data about \(NO_2\), made available by OpenAQ and retrieved with the py-openaq package. The air_quality_no2.csv data set provides \(NO_2\) values for the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp and London, respectively.

    In [3]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)
    
    In [4]: air_quality.head()
    

    Note

    The index_col and parse_dates parameters of the read_csv function are used to define the first (0th) column as the index of the resulting DataFrame and to convert the dates in that column to Timestamp objects, respectively.

  • I want a quick visual check of the data.

    In [5]: air_quality.plot()
    
    In [6]: plt.show()
    
    ../../_images/04_airqual_quick.png

    With a DataFrame, pandas by default creates one line plot for each column that contains numeric data.

  • I want to plot only the column of the data table with the data from Paris.

    In [7]: air_quality["station_paris"].plot()
    
    In [8]: plt.show()
    
    ../../_images/04_airqual_paris.png

    To plot a specific column, combine the selection method from the subset data tutorial with the plot() method. The plot() method therefore works on both Series and DataFrame objects.

  • I want to visually compare the \(NO_2\) values measured in London versus Paris.

    In [9]: air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
    
    In [10]: plt.show()
    
    ../../_images/04_airqual_scatter.png
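
The steps above can also be tried without the air_quality_no2.csv file. The sketch below is not the tutorial's data set: the numbers are invented, only the column names follow the ones used above, and the Agg backend is an assumption so that the code runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in for data/air_quality_no2.csv; the numbers are not real measurements.
csv_text = """datetime,station_antwerp,station_paris,station_london
2019-05-07 02:00:00,40.0,22.0,23.0
2019-05-07 03:00:00,50.5,25.0,19.0
2019-05-07 04:00:00,45.0,27.5,18.0
2019-05-07 05:00:00,42.0,50.0,16.0
"""

# index_col=0 makes the first column the index; parse_dates=True converts it to timestamps.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# DataFrame.plot(): one line per numeric column.
ax_all = demo.plot()

# Series.plot(): a single line for the selected column.
plt.figure()  # start a fresh figure so the Series does not draw on the previous axes
ax_paris = demo["station_paris"].plot()

# Scatter plot of one column against another.
ax_scatter = demo.plot.scatter(x="station_london", y="station_paris", alpha=0.5)

print(type(demo.index).__name__)
print(len(ax_all.get_lines()), len(ax_paris.get_lines()))
plt.close("all")
```

Each call returns a Matplotlib Axes, so the number of drawn lines can be inspected directly instead of looking at the picture.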

Apart from the default line plot when using the plot function, a number of alternatives are available to plot data. Let’s use some standard Python to get an overview of the available plot methods:

In [11]: [
   ....:     method_name
   ....:     for method_name in dir(air_quality.plot)
   ....:     if not method_name.startswith("_")
   ....: ]
   ....: 

Note

In many development environments, as well as in IPython and Jupyter Notebook, you can press the TAB key to get an overview of the available methods, for example air_quality.plot. + TAB.

One of the options is DataFrame.plot.box(), which creates a boxplot. The box method can be applied to the air quality example data:

In [12]: air_quality.plot.box()

In [13]: plt.show()

../../_images/04_airqual_boxplot.png

To user guide

For an introduction to plots other than the default line plot, see the user guide section about supported plot styles.
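
As a minimal sketch of the box method on a table that does not depend on the CSV file, the example below uses a small DataFrame with invented values (the column names are borrowed from the tutorial, the Agg backend is an assumption for running without a display). It shows that one box is drawn per column.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented values, used only to have something to plot.
demo = pd.DataFrame(
    {
        "station_antwerp": [40.0, 50.5, 45.0, 42.0, 38.0],
        "station_paris": [22.0, 25.0, 27.5, 50.0, 61.0],
        "station_london": [23.0, 19.0, 18.0, 16.0, 20.0],
    }
)

# One box per column; the return value is a Matplotlib Axes.
ax_box = demo.plot.box()

# The x axis gets one tick position per column.
n_boxes = len(ax_box.get_xticks())
print(n_boxes)
plt.close("all")
```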

  • I want each of the columns in a separate subplot.

    In [14]: axs = air_quality.plot.area(figsize=(12, 4), subplots=True)
    
    In [15]: plt.show()
    
    ../../_images/04_airqual_area_subplot.png

    Separate subplots for each of the data columns are supported by the subplots argument of the plot functions. The built-in options available in each of the pandas plot functions are worth reviewing.
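
To illustrate what subplots=True returns, here is a short sketch on invented data (not the tutorial's measurements; the Agg backend is an assumption for running without a display). With several columns, the call hands back an array holding one Matplotlib Axes per column, which can then be customized individually.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented, strictly positive values on a daily index.
demo = pd.DataFrame(
    {
        "station_antwerp": [40.0, 50.5, 45.0, 42.0],
        "station_paris": [22.0, 25.0, 27.5, 50.0],
        "station_london": [23.0, 19.0, 18.0, 16.0],
    },
    index=pd.date_range("2019-05-07", periods=4, freq="D"),
)

# subplots=True draws every column in its own subplot.
axs = demo.plot.area(figsize=(12, 4), subplots=True)

# Label each subplot with the name of the column it shows.
for ax, column in zip(axs, demo.columns):
    ax.set_ylabel(column)

print(len(axs))
plt.close("all")
```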

To user guide

Some more formatting options are explained in the user guide section on plot formatting.

  • I want to further customize, extend or save the resulting plot.

    In [16]: fig, axs = plt.subplots(figsize=(12, 4))
    
    In [17]: air_quality.plot.area(ax=axs)
    
    In [18]: axs.set_ylabel("NO$_2$ concentration")
    Out[18]: Text(0, 0.5, 'NO$_2$ concentration')
    
    In [19]: fig.savefig("no2_concentrations.png")
    
    In [20]: plt.show()
    
    ../../_images/04_airqual_customized.png

Each of the plot objects created by pandas is a Matplotlib object. As Matplotlib provides plenty of options to customize plots, making the link between pandas and Matplotlib explicit makes all the power of Matplotlib available for the plot. This strategy is applied in the previous example:

fig, axs = plt.subplots(figsize=(12, 4))        # Create an empty Matplotlib Figure and Axes
air_quality.plot.area(ax=axs)                   # Use pandas to put the area plot on the prepared Figure/Axes
axs.set_ylabel("NO$_2$ concentration")          # Do any Matplotlib customization you like
fig.savefig("no2_concentrations.png")           # Save the Figure/Axes using the existing Matplotlib method.
plt.show()                                      # Display the plot
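
The same pattern can be checked on a table that does not require the CSV file. The following is a sketch under stated assumptions: the values are invented, the output file name demo_area.png and the temporary directory are hypothetical choices, and the Agg backend is selected so the code runs without a display. It confirms that pandas draws on the Axes that was passed in and returns that same Matplotlib object.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # assumption: no display available; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.axes import Axes

# Invented values, used only to have something to plot.
demo = pd.DataFrame(
    {
        "station_paris": [22.0, 25.0, 27.5, 50.0],
        "station_london": [23.0, 19.0, 18.0, 16.0],
    }
)

fig, ax = plt.subplots(figsize=(12, 4))  # empty Matplotlib Figure and Axes
returned = demo.plot.area(ax=ax)         # pandas draws on the prepared Axes
ax.set_ylabel("NO$_2$ concentration")    # plain Matplotlib customization

# Save to a temporary directory (hypothetical location and file name).
out_path = os.path.join(tempfile.mkdtemp(), "demo_area.png")
fig.savefig(out_path)

print(isinstance(returned, Axes), returned is ax)
plt.close(fig)
```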

REMEMBER

  • The .plot.* methods are applicable to both Series and DataFrames.

  • By default, each column is plotted as a separate element (line, boxplot, ...).

  • Any plot created by pandas is a Matplotlib object.

To user guide

A full overview of plotting in pandas is provided in the visualization pages.