How do I create plots in pandas?
In [1]: import pandas as pd
In [2]: import matplotlib.pyplot as plt
- Air quality data
For this tutorial, air quality data about \(NO_2\) is used, made available by OpenAQ and retrieved using the py-openaq package. The air_quality_no2.csv data set provides \(NO_2\) values for the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp and London, respectively.
In [3]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)
In [4]: air_quality.head()
Note
The index_col and parse_dates parameters of the read_csv function define the first (0th) column as the index of the resulting DataFrame and convert the dates in that column to Timestamp objects, respectively.
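To see what these two parameters do without the tutorial's data file, here is a minimal sketch that reads a tiny inline CSV. The rows and their values are invented for illustration; only the column names station_paris and station_london follow the tutorial, and the datetime header name is an assumption.

```python
import io

import pandas as pd

# Hypothetical sample rows; the values are invented for illustration.
csv_text = """datetime,station_paris,station_london
2019-05-07 02:00:00,10.0,20.0
2019-05-07 03:00:00,11.5,21.0
2019-05-07 04:00:00,12.0,
"""

# index_col=0 makes the first column the index;
# parse_dates=True asks pandas to parse that index as dates.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(sample.index).__name__)
print(sample.index[0])
```

The index of sample is built from the first column, and its entries are Timestamp objects rather than plain strings.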
I want a quick visual check of the data.
In [5]: air_quality.plot()
In [6]: plt.show()
With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data.
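The default behavior can be checked on a small synthetic table. This is only a sketch: the demo DataFrame, its values, and the extra comment column are made up, and the Agg backend is selected so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd

# Synthetic stand-in for the air quality table (values are invented).
demo = pd.DataFrame(
    {
        "station_paris": [25.0, 27.5, 50.0, 62.0],
        "station_london": [19.0, 18.0, 16.0, 26.0],
        "comment": ["ok", "ok", "check", "ok"],  # non-numeric column
    },
    index=pd.date_range("2019-05-07", periods=4, freq="D"),
)

ax = demo.plot()

# Count the lines pandas drew on the Axes.
print(len(ax.get_lines()))
```

The non-numeric comment column is left out of the plot, so only the numeric columns end up as lines.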
I want to plot only the column of the data table with the data from Paris.
In [7]: air_quality["station_paris"].plot()
In [8]: plt.show()
To plot a specific column, use the selection method of the subset data tutorial in combination with the plot() method. Hence, the plot() method works on both Series and DataFrame.
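As a quick sketch of the Series case, the following selects one column from a synthetic table (invented values, not the tutorial's data) and plots it on an explicitly created Axes.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for the air quality table (values are invented).
demo = pd.DataFrame(
    {
        "station_paris": [25.0, 27.5, 50.0, 62.0],
        "station_london": [19.0, 18.0, 16.0, 26.0],
    },
    index=pd.date_range("2019-05-07", periods=4, freq="D"),
)

paris = demo["station_paris"]  # selecting a single column yields a Series

fig, ax = plt.subplots()
paris.plot(ax=ax)  # the same plot() method, now called on a Series

print(type(paris).__name__, len(ax.get_lines()))
```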
I want to visually compare the \(NO_2\) values measured in London versus Paris.
In [9]: air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
In [10]: plt.show()
Apart from the default line
plot when using the plot
function, a
number of alternatives are available to plot data. Let’s use some
standard Python to get an overview of the available plot methods:
In [11]: [
....: method_name
....: for method_name in dir(air_quality.plot)
....: if not method_name.startswith("_")
....: ]
....:
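The same list comprehension can be tried on any DataFrame, since the plot accessor does not depend on the data. A minimal sketch with a made-up two-column table:

```python
import pandas as pd

# Any DataFrame works here; the values are invented for illustration.
demo = pd.DataFrame(
    {"station_paris": [25.0, 27.5], "station_london": [19.0, 16.0]}
)

# Collect the public attribute names of the plot accessor.
plot_methods = [
    method_name
    for method_name in dir(demo.plot)
    if not method_name.startswith("_")
]

print(plot_methods)
```

The printed names include the methods used in this tutorial, such as area, box and scatter.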
Note
In many development environments, as well as in IPython and Jupyter Notebook, press the TAB key to get an overview of the available methods, for example air_quality.plot. + TAB.
One of the options is DataFrame.plot.box(), which refers to a boxplot. The box method is applicable to the air quality example data:
In [12]: air_quality.plot.box()
In [13]: plt.show()
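For readers without the data file, a boxplot can be sketched on synthetic values as follows. The demo table is hypothetical, and the Agg backend is an assumption made so that the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd
from matplotlib.axes import Axes

# Synthetic stand-in for the air quality table (values are invented).
demo = pd.DataFrame(
    {
        "station_paris": [25.0, 27.5, 50.0, 62.0, 30.0],
        "station_london": [19.0, 18.0, 16.0, 26.0, 22.0],
    }
)

# One box is drawn per numeric column.
ax = demo.plot.box()

print(isinstance(ax, Axes))
```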

For an introduction to plots other than the default line plot, see the user guide section about supported plot styles.
I want each of the columns in a separate subplot.
In [14]: axs = air_quality.plot.area(figsize=(12, 4), subplots=True)
In [15]: plt.show()
Separate subplots for each of the data columns are supported by the subplots argument of the plot functions. The built-in options available in each of the pandas plot functions are worth reviewing.
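To illustrate what subplots=True returns, here is a small sketch on synthetic data (invented values). With this argument the call gives back a collection of Axes, one per column, instead of a single Axes.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd

# Synthetic stand-in for the air quality table (values are invented).
demo = pd.DataFrame(
    {
        "station_paris": [25.0, 27.5, 50.0, 62.0],
        "station_london": [19.0, 18.0, 16.0, 26.0],
    },
    index=pd.date_range("2019-05-07", periods=4, freq="D"),
)

# subplots=True puts every column on its own Axes.
axs = demo.plot.area(figsize=(12, 4), subplots=True)

print(len(axs))
```

Each element of axs can then be customized separately, for example with axs[0].set_ylabel(...).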
Some more formatting options are explained in the user guide section on plot formatting.
I want to further customize, extend or save the resulting plot.
In [16]: fig, axs = plt.subplots(figsize=(12, 4))
In [17]: air_quality.plot.area(ax=axs)
In [18]: axs.set_ylabel("NO$_2$ concentration")
Out[18]: Text(0, 0.5, 'NO$_2$ concentration')
In [19]: fig.savefig("no2_concentrations.png")
In [20]: plt.show()
Each of the plot objects created by pandas is a Matplotlib object. Because Matplotlib provides plenty of options to customize plots, making the link between pandas and Matplotlib explicit makes all the power of Matplotlib available for the plot. This strategy is applied in the previous example:
fig, axs = plt.subplots(figsize=(12, 4)) # Create an empty Matplotlib Figure and Axes
air_quality.plot.area(ax=axs) # Use pandas to put the area plot on the prepared Figure/Axes
axs.set_ylabel("NO$_2$ concentration") # Do any Matplotlib customization you like
fig.savefig("no2_concentrations.png") # Save the Figure/Axes using the existing Matplotlib method.
plt.show() # Display the plot
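The same strategy can be tried without the data file. The sketch below uses a synthetic table (invented values) and, as an assumption made for convenience, saves the figure to an in-memory buffer instead of a file on disk.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for the air quality table (values are invented).
demo = pd.DataFrame(
    {
        "station_paris": [25.0, 27.5, 50.0, 62.0],
        "station_london": [19.0, 18.0, 16.0, 26.0],
    },
    index=pd.date_range("2019-05-07", periods=4, freq="D"),
)

fig, ax = plt.subplots(figsize=(12, 4))  # prepare the Figure and Axes
returned = demo.plot.area(ax=ax)         # pandas draws on the given Axes
ax.set_ylabel("NO$_2$ concentration")    # plain Matplotlib customization

# Save to an in-memory buffer; a file name works the same way.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()

print(returned is ax, len(png_bytes) > 0)
```

Because pandas draws on the Axes passed in through ax, every later Matplotlib call on that Axes or its Figure affects the pandas plot.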
REMEMBER
The .plot.* methods are applicable on both Series and DataFrames.
By default, each of the columns is plotted as a different element (line, boxplot, …).
Any plot created by pandas is a Matplotlib object.
A full overview of plotting in pandas is provided in the visualization pages.