In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt

Data used for this tutorial:
  • This tutorial uses air quality data on \(NO_2\) and particulate matter smaller than 2.5 micrometers, made available by OpenAQ and downloaded using the py-openaq package. The air_quality_no2_long.csv data set provides \(NO_2\) values for the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp and London, respectively.

    To raw data
    In [3]: air_quality = pd.read_csv("data/air_quality_no2_long.csv")
    
    In [4]: air_quality = air_quality.rename(columns={"date.utc": "datetime"})
    
    In [5]: air_quality.head()
    
    In [6]: air_quality.city.unique()

How to handle time series data with ease?

Using pandas datetime properties

  • I want to work with the dates in the column datetime as datetime objects instead of plain text

    In [7]: air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
    
    In [8]: air_quality["datetime"]

    Initially, the values in datetime are character strings and do not support any datetime operations (e.g. extracting the year or the day of the week). By applying the to_datetime function, pandas interprets the strings and converts them to datetime (i.e. datetime64[ns, UTC]) objects. In pandas, these datetime objects are called pandas.Timestamp and are similar to datetime.datetime from the standard library.
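Because the CSV file is not available here, the conversion can be tried on a few hand-written rows. The following is a minimal sketch, assuming a small hypothetical DataFrame named sample whose datetime column mimics the UTC strings of the tutorial data; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows (not the real data set): timestamps stored as text.
sample = pd.DataFrame(
    {
        "datetime": [
            "2019-05-07 01:00:00+00:00",
            "2019-05-07 02:00:00+00:00",
            "2019-05-07 03:00:00+00:00",
        ],
        "value": [25.0, 27.5, 50.5],
    }
)

# Before the conversion the column holds plain strings.
is_datetime_before = pd.api.types.is_datetime64_any_dtype(sample["datetime"])

# pd.to_datetime parses the strings into timezone-aware datetime values.
sample["datetime"] = pd.to_datetime(sample["datetime"])

is_datetime_after = pd.api.types.is_datetime64_any_dtype(sample["datetime"])
first = sample["datetime"].iloc[0]  # a single element is a pandas.Timestamp

print(sample["datetime"].dtype)
print(type(first))
```

The exact dtype string printed may differ between pandas versions (for example in its time resolution), so the sketch checks the kind of dtype rather than its literal name.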

Note

As many data sets contain datetime information in one of the columns, pandas input functions like pandas.read_csv() and pandas.read_json() can convert these columns to dates while reading the data, using the parse_dates parameter with a list of the columns to read as Timestamp:

pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])
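To try parse_dates without the data file, the same call can be run on an in-memory CSV. This is a sketch under stated assumptions: csv_text is a hypothetical two-row excerpt in which the date column is already named datetime, and io.StringIO stands in for the file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = """city,location,datetime,value
Paris,FR04014,2019-06-21 00:00:00+00:00,20.0
Paris,FR04014,2019-06-20 23:00:00+00:00,21.8
"""

# parse_dates lists the columns that should be read as datetimes.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])

print(df.dtypes)
```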

Why are these pandas.Timestamp objects useful? Let’s illustrate the added value with some example cases.

What is the start and end date of the time series data set we are working with?

In [9]: air_quality["datetime"].min(), air_quality["datetime"].max()

Using pandas.Timestamp for datetimes lets us calculate with dates and compare them. Hence, we can use this to get the length of our time series:

In [10]: air_quality["datetime"].max() - air_quality["datetime"].min()

The result is a pandas.Timedelta object, which is similar to datetime.timedelta from the standard Python library and defines a time duration.
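The subtraction can be illustrated on a handful of timestamps. In this minimal sketch the three dates in stamps are hypothetical stand-ins for the datetime column, chosen only to show that the difference between two Timestamp objects is a Timedelta.

```python
import pandas as pd

# Hypothetical timestamps standing in for the "datetime" column.
stamps = pd.Series(
    pd.to_datetime(
        ["2019-05-07 01:00:00", "2019-05-20 12:00:00", "2019-06-21 00:00:00"]
    )
)

start, end = stamps.min(), stamps.max()

# Subtracting two Timestamp objects yields a Timedelta.
length = end - start

print(start < end)  # Timestamps can be compared directly
print(length)
```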

To user guide

The various time concepts supported by pandas are explained in the user guide section on time related concepts.

  • I want to add a new column to the DataFrame containing only the month of the measurement

    In [11]: air_quality["month"] = air_quality["datetime"].dt.month
    
    In [12]: air_quality.head()

    Using Timestamp objects for dates gives access to many time-related properties in pandas: for example the month, but also year, quarter,… All of these properties are accessible through the dt accessor.
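A few of these properties can be collected side by side. The sketch below assumes two hypothetical timestamps; it also uses dt.isocalendar().week for the week number, which is an addition not shown in the tutorial text.

```python
import pandas as pd

# Two hypothetical measurement times.
stamps = pd.Series(pd.to_datetime(["2019-05-07 02:00:00", "2019-06-21 00:00:00"]))

# Each property is read through the dt accessor of the Series.
parts = pd.DataFrame(
    {
        "year": stamps.dt.year,
        "month": stamps.dt.month,
        "quarter": stamps.dt.quarter,
        "iso_week": stamps.dt.isocalendar().week,  # ISO 8601 week number
    }
)

print(parts)
```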

To user guide

An overview of the existing date properties is given in the time and date components overview table. More details about using the dt accessor to return datetime-like properties are given in a dedicated section on the dt accessor.

  • What is the average \(NO_2\) concentration for each day of the week for each of the measurement locations?

    In [13]: air_quality.groupby(
       ....:     [air_quality["datetime"].dt.weekday, "location"])["value"].mean()
       ....: 

    Remember the split-apply-combine pattern provided by groupby from the tutorial on statistics calculation? Here, we want to calculate a given statistic (e.g. mean \(NO_2\)) for each weekday and for each measurement location. To group on weekdays, we use the datetime property weekday (with Monday=0 and Sunday=6) of pandas Timestamp, which is also accessible through the dt accessor. Grouping on both location and weekday splits the calculation of the mean over each of these combinations.

    Danger

    As we are working with a very short time series in these examples, the analysis does not provide a long-term representative result!
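The weekday grouping can be checked by hand on a tiny table. This is a sketch with made-up values: the five rows in sample are hypothetical and only reuse the column names and station codes from the tutorial.

```python
import pandas as pd

# Hypothetical measurements: 2019-05-06 and 2019-05-13 are Mondays,
# 2019-05-07 is a Tuesday.
sample = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2019-05-06 10:00",
                "2019-05-13 10:00",
                "2019-05-07 10:00",
                "2019-05-06 10:00",
                "2019-05-07 10:00",
            ]
        ),
        "location": ["FR04014", "FR04014", "FR04014", "BETR801", "BETR801"],
        "value": [20.0, 30.0, 40.0, 10.0, 50.0],
    }
)

# Group on the weekday (Monday=0) and on the location, then average.
weekday_mean = sample.groupby(
    [sample["datetime"].dt.weekday, "location"]
)["value"].mean()

print(weekday_mean)
```

The result is indexed by (weekday, location) pairs, so the Monday mean for FR04014 is the average of its two Monday rows.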

  • Plot the typical \(NO_2\) pattern during the day of our time series of all stations together. In other words, what is the average value for each hour of the day?

    In [14]: fig, axs = plt.subplots(figsize=(12, 4))
    
    In [15]: air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
       ....:     kind='bar', rot=0, ax=axs
       ....: )
       ....: 
    
    In [16]: plt.xlabel("Hour of the day");  # custom x label using Matplotlib
    
    In [17]: plt.ylabel("$NO_2 (µg/m^3)$");
    
    ../../_images/09_bar_chart.png

    Similar to the previous case, we want to calculate a given statistic (e.g. mean \(NO_2\)) for each hour of the day and we can use the split-apply-combine approach again. For this case, we use the datetime property hour of pandas Timestamp, which is also accessible by the dt accessor.
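The hourly aggregation itself, without the plotting step, can be sketched on synthetic data. The two days of hourly values below are hypothetical and constructed so that the expected mean per hour is easy to verify.

```python
import pandas as pd

# Two hypothetical days of hourly timestamps (2019-05-20 and 2019-05-21).
times = pd.date_range("2019-05-20", periods=48, freq=pd.Timedelta(hours=1))

# Made-up values: the hour of the day, shifted by 10 on the second day.
values = [t.hour + (10 if t.day == 21 else 0) for t in times]
sample = pd.DataFrame({"datetime": times, "value": values})

# One group per hour of the day (0..23), averaged over both days.
hourly_mean = sample.groupby(sample["datetime"].dt.hour)["value"].mean()

print(hourly_mean.head())
```

Calling .plot(kind="bar") on hourly_mean would then give a bar chart with one bar per hour, as in the tutorial.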

Datetime as index

In the tutorial on reshaping, pivot() was introduced to reshape the data table with each of the measurement locations as a separate column:

In [18]: no_2 = air_quality.pivot(index="datetime", columns="location", values="value")

In [19]: no_2.head()

Note

By pivoting the data, the datetime information became the index of the table. In general, setting a column as an index can be achieved by the set_index function.
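Both routes to a datetime index can be compared on a small long-format table. This is a minimal sketch; the frame named long and its four rows are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per (datetime, location) pair.
long = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2019-05-07 01:00",
                "2019-05-07 01:00",
                "2019-05-07 02:00",
                "2019-05-07 02:00",
            ]
        ),
        "location": ["BETR801", "FR04014", "BETR801", "FR04014"],
        "value": [50.5, 25.0, 45.0, 27.7],
    }
)

# pivot: datetime becomes the index, each location becomes a column.
wide = long.pivot(index="datetime", columns="location", values="value")

# set_index: datetime becomes the index, the table stays in long format.
indexed = long.set_index("datetime")

print(type(wide.index).__name__, wide.shape)
print(type(indexed.index).__name__, indexed.shape)
```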

Working with a datetime index (i.e. DatetimeIndex) provides powerful functionalities. For example, we do not need the dt accessor to get the time series properties, but have these properties available on the index directly:

In [20]: no_2.index.year, no_2.index.weekday

Other advantages are convenient subsetting of time periods and an adapted time scale on plots. Let’s apply this to our data.

  • Create a plot of the \(NO_2\) values in the different stations from the 20th of May until the end of the 21st of May

    In [21]: no_2["2019-05-20":"2019-05-21"].plot();
    
    ../../_images/09_time_section.png

    By providing a string that parses to a datetime, a specific subset of the data can be selected on a DatetimeIndex.
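The string-based selection can be verified on a synthetic index. In this sketch, no_2_demo is a hypothetical table with four days of hourly values; it uses .loc for the slice, which for a DatetimeIndex selects the same rows as the bracket form in the tutorial.

```python
import pandas as pd

# Four hypothetical days of hourly data: 2019-05-19 up to 2019-05-22.
idx = pd.date_range("2019-05-19", periods=96, freq=pd.Timedelta(hours=1))
no_2_demo = pd.DataFrame({"FR04014": range(96)}, index=idx)

# Date strings select whole days; both end points are included.
subset = no_2_demo.loc["2019-05-20":"2019-05-21"]

print(subset.index.min(), subset.index.max(), len(subset))
```

Note that the end label covers the entire day of 21 May, so the last selected row is the one at 23:00.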

To user guide

More information on the DatetimeIndex and the slicing by using strings is provided in the section on time series indexing.

Resample a time series to another frequency

  • Aggregate the current hourly time series values to the monthly maximum value in each of the stations.

    In [22]: monthly_max = no_2.resample("M").max()
    
    In [23]: monthly_max

    A very powerful feature of time series data with a datetime index is the ability to resample() the time series to another frequency (e.g., converting secondly data into 5-minutely data).

The resample() method is similar to a groupby operation:

  • it provides a time-based grouping, by using a string (e.g. M, 5H,…) that defines the target frequency

  • it requires an aggregation function such as mean, max,…
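The similarity to groupby can be made concrete by computing the same daily maximum in both ways. This is a sketch on hypothetical data; it uses the daily frequency "D" rather than a monthly one to keep the example small, and series.index.normalize() (which sets each timestamp to midnight) as the grouping key for comparison.

```python
import pandas as pd

# Two hypothetical days of hourly values: 0..23 on day one, 24..47 on day two.
idx = pd.date_range("2019-05-20", periods=48, freq=pd.Timedelta(hours=1))
series = pd.Series(range(48), index=idx, dtype="float64")

# Time-based grouping with resample: one bin per calendar day.
daily_max = series.resample("D").max()

# The same grouping written as an explicit groupby on the date of each row.
grouped_max = series.groupby(series.index.normalize()).max()

print(daily_max)
print(grouped_max)
```

One difference worth knowing: resample produces a regular index at the target frequency and therefore also emits bins for periods without data, whereas groupby only returns the groups that occur.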

To user guide

An overview of the aliases used to define time series frequencies is given in the offset aliases overview table.

When defined, the frequency of the time series is provided by the freq attribute:

In [24]: monthly_max.index.freq

  • Make a plot of the daily mean \(NO_2\) value in each of the stations.

    In [25]: no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));
    
    ../../_images/09_resample_mean.png
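The daily mean and the resulting freq attribute can be inspected without plotting. The following sketch assumes a hypothetical two-station table named demo with constant values per day, so the expected means are obvious.

```python
import pandas as pd

# Two hypothetical days of hourly data for two stations.
idx = pd.date_range("2019-05-20", periods=48, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame(
    {
        "FR04014": [10.0] * 24 + [20.0] * 24,
        "BETR801": [30.0] * 24 + [50.0] * 24,
    },
    index=idx,
)

# Aggregate the hourly values to one mean value per day and station.
daily_mean = demo.resample("D").mean()

print(daily_mean)
print(daily_mean.index.freq)  # the index now carries a daily frequency
```

Appending .plot(style="-o") to daily_mean would draw one line per station, as in the tutorial figure.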

To user guide

More details on the power of time series resampling are provided in the user guide section on resampling.

REMEMBER

  • Valid date strings can be converted to datetime objects using the to_datetime function or as part of read functions.

  • Datetime objects in pandas support calculations, logical operations and convenient date-related properties using the dt accessor.

  • A DatetimeIndex contains these date-related properties and supports convenient slicing.

  • Resample is a powerful method to change the frequency of a time series.

To user guide

A full overview of time series is given in the pages on time series and date functionality.