In [1]: import pandas as pd
Data used for this tutorial:
  • This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:

    • PassengerId: Id of every passenger.

    • Survived: Indication whether passenger survived. 0 for yes and 1 for no.

    • Pclass: One out of the 3 ticket classes: Class 1, Class 2 and Class 3.

    • Name: Name of passenger.

    • Sex: Gender of passenger.

    • Age: Age of passenger in years.

    • SibSp: Number of siblings or spouses aboard.

    • Parch: Number of parents or children aboard.

    • Ticket: Ticket number of passenger.

    • Fare: Indicating the fare.

    • Cabin: Cabin number of passenger.

    • Embarked: Port of embarkation.

    To raw data
    In [2]: titanic = pd.read_csv("data/titanic.csv")
    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-2-4f6129a55af7> in <module>
    ----> 1 titanic = pd.read_csv("data/titanic.csv")
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        209                 else:
        210                     kwargs[new_arg_name] = new_arg_value
    --> 211             return func(*args, **kwargs)
        212 
        213         return cast(F, wrapper)
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        329                     stacklevel=find_stack_level(),
        330                 )
    --> 331             return func(*args, **kwargs)
        332 
        333         # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
        948     kwds.update(kwds_defaults)
        949 
    --> 950     return _read(filepath_or_buffer, kwds)
        951 
        952 
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
        603 
        604     # Create the parser.
    --> 605     parser = TextFileReader(filepath_or_buffer, **kwds)
        606 
        607     if chunksize or iterator:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
       1440 
       1441         self.handles: IOHandles | None = None
    -> 1442         self._engine = self._make_engine(f, self.engine)
       1443 
       1444     def close(self) -> None:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine)
       1733                 if "b" not in mode:
       1734                     mode += "b"
    -> 1735             self.handles = get_handle(
       1736                 f,
       1737                 mode,
    
    /usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
        854         if ioargs.encoding and "b" not in ioargs.mode:
        855             # Encoding
    --> 856             handle = open(
        857                 handle,
        858                 ioargs.mode,
    
    FileNotFoundError: [Errno 2] No such file or directory: 'data/titanic.csv'
    
    In [3]: titanic.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-3-cdd760b19866> in <module>
    ----> 1 titanic.head()
    
    NameError: name 'titanic' is not defined
    
  • This tutorial uses air quality data about \(NO_2\) and Particulate matter less than 2.5 micrometers, made available by OpenAQ and using the py-openaq package. The air_quality_long.csv data set provides \(NO_2\) and \(PM_{25}\) values for the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and London.

    The air-quality data set has the following columns:

    • city: city where the sensor is used, either Paris, Antwerp or London

    • country: country where the sensor is used, either FR, BE or GB

    • location: the id of the sensor, either FR04014, BETR801 or London Westminster

    • parameter: the parameter measured by the sensor, either \(NO_2\) or Particulate matter

    • value: the measured value

    • unit: the unit of the measured parameter, in this case ‘µg/m³’

    and the index of the DataFrame is datetime, the datetime of the measurement.

    Note

    The air-quality data is provided in a so-called long format data representation with each observation on a separate row and each variable a separate column of the data table. The long/narrow format is also known as the tidy data format.

    To raw data
    In [4]: air_quality = pd.read_csv(
       ...:     "data/air_quality_long.csv", index_col="date.utc", parse_dates=True
       ...: )
       ...: 
    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-4-bc722d3a7198> in <module>
    ----> 1 air_quality = pd.read_csv(
          2     "data/air_quality_long.csv", index_col="date.utc", parse_dates=True
          3 )
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        209                 else:
        210                     kwargs[new_arg_name] = new_arg_value
    --> 211             return func(*args, **kwargs)
        212 
        213         return cast(F, wrapper)
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        329                     stacklevel=find_stack_level(),
        330                 )
    --> 331             return func(*args, **kwargs)
        332 
        333         # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
        948     kwds.update(kwds_defaults)
        949 
    --> 950     return _read(filepath_or_buffer, kwds)
        951 
        952 
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
        603 
        604     # Create the parser.
    --> 605     parser = TextFileReader(filepath_or_buffer, **kwds)
        606 
        607     if chunksize or iterator:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
       1440 
       1441         self.handles: IOHandles | None = None
    -> 1442         self._engine = self._make_engine(f, self.engine)
       1443 
       1444     def close(self) -> None:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine)
       1733                 if "b" not in mode:
       1734                     mode += "b"
    -> 1735             self.handles = get_handle(
       1736                 f,
       1737                 mode,
    
    /usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
        854         if ioargs.encoding and "b" not in ioargs.mode:
        855             # Encoding
    --> 856             handle = open(
        857                 handle,
        858                 ioargs.mode,
    
    FileNotFoundError: [Errno 2] No such file or directory: 'data/air_quality_long.csv'
    
    In [5]: air_quality.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-5-7c0df1c960a9> in <module>
    ----> 1 air_quality.head()
    
    NameError: name 'air_quality' is not defined
    

How to reshape the layout of tables?

Sort table rows

  • I want to sort the Titanic data according to the age of the passengers.

    In [6]: titanic.sort_values(by="Age").head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-6-c2673722b86a> in <module>
    ----> 1 titanic.sort_values(by="Age").head()
    
    NameError: name 'titanic' is not defined
    
  • I want to sort the Titanic data according to the cabin class and age in descending order.

    In [7]: titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-7-fdaaa45b1384> in <module>
    ----> 1 titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
    
    NameError: name 'titanic' is not defined
    

    With DataFrame.sort_values(), the rows in the table are sorted according to the defined column(s). The index will follow the row order.

To user guide

More details about sorting of tables is provided in the user guide section on sorting data.

Long to wide table format

Let’s use a small subset of the air quality data set. We focus on \(NO_2\) data and only use the first two measurements of each location (i.e. the head of each group). The subset of data will be called no2_subset.

# filter for no2 data only
In [8]: no2 = air_quality[air_quality["parameter"] == "no2"]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-bb412ad6329f> in <module>
----> 1 no2 = air_quality[air_quality["parameter"] == "no2"]

NameError: name 'air_quality' is not defined
# use 2 measurements (head) for each location (groupby)
In [9]: no2_subset = no2.sort_index().groupby(["location"]).head(2)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-53529ecc23fc> in <module>
----> 1 no2_subset = no2.sort_index().groupby(["location"]).head(2)

NameError: name 'no2' is not defined

In [10]: no2_subset
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-4e012e63e0bd> in <module>
----> 1 no2_subset

NameError: name 'no2_subset' is not defined
../../_images/07_pivot.svg
  • I want the values for the three stations as separate columns next to each other.

    In [11]: no2_subset.pivot(columns="location", values="value")
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-11-c3aa9fc3ed0a> in <module>
    ----> 1 no2_subset.pivot(columns="location", values="value")
    
    NameError: name 'no2_subset' is not defined
    

    The pivot() function is purely reshaping of the data: a single value for each index/column combination is required.

As pandas supports plotting of multiple columns (see plotting tutorial) out of the box, the conversion from long to wide table format enables the plotting of the different time series at the same time:

In [12]: no2.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-8eb78b98175c> in <module>
----> 1 no2.head()

NameError: name 'no2' is not defined
In [13]: no2.pivot(columns="location", values="value").plot()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-4ff9ddb433f2> in <module>
----> 1 no2.pivot(columns="location", values="value").plot()

NameError: name 'no2' is not defined
../../_images/7_reshape_columns.png

Note

When the index parameter is not defined, the existing index (row labels) is used.

To user guide

For more information about pivot(), see the user guide section on pivoting DataFrame objects.

Pivot table

../../_images/07_pivot_table.svg
  • I want the mean concentrations for \(NO_2\) and \(PM_{2.5}\) in each of the stations in table form.

    In [14]: air_quality.pivot_table(
       ....:     values="value", index="location", columns="parameter", aggfunc="mean"
       ....: )
       ....: 
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-14-e7d9419ba870> in <module>
    ----> 1 air_quality.pivot_table(
          2     values="value", index="location", columns="parameter", aggfunc="mean"
          3 )
    
    NameError: name 'air_quality' is not defined
    

    In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated (in this specific case, the values on different time steps), pivot_table() can be used, providing an aggregation function (e.g. mean) on how to combine these values.

Pivot table is a well known concept in spreadsheet software. When interested in the row/column margins (subtotals) for each variable, set the margins parameter to True:

In [15]: air_quality.pivot_table(
   ....:     values="value",
   ....:     index="location",
   ....:     columns="parameter",
   ....:     aggfunc="mean",
   ....:     margins=True,
   ....: )
   ....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-15-175242b3ce5d> in <module>
----> 1 air_quality.pivot_table(
      2     values="value",
      3     index="location",
      4     columns="parameter",
      5     aggfunc="mean",

NameError: name 'air_quality' is not defined
To user guide

For more information about pivot_table(), see the user guide section on pivot tables.

Note

In case you are wondering, pivot_table() is indeed directly linked to groupby(). The same result can be derived by grouping on both parameter and location:

air_quality.groupby(["parameter", "location"]).mean()
To user guide

Have a look at groupby() in combination with unstack() at the user guide section on combining stats and groupby.

Wide to long format

Starting again from the wide format table created in the previous section, we add a new index to the DataFrame with reset_index().

In [16]: no2_pivoted = no2.pivot(columns="location", values="value").reset_index()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-16-9d92a9d23db5> in <module>
----> 1 no2_pivoted = no2.pivot(columns="location", values="value").reset_index()

NameError: name 'no2' is not defined

In [17]: no2_pivoted.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-17-03cc355fb699> in <module>
----> 1 no2_pivoted.head()

NameError: name 'no2_pivoted' is not defined
../../_images/07_melt.svg
  • I want to collect all air quality \(NO_2\) measurements in a single column (long format).

    In [18]: no_2 = no2_pivoted.melt(id_vars="date.utc")
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-18-012608c36864> in <module>
    ----> 1 no_2 = no2_pivoted.melt(id_vars="date.utc")
    
    NameError: name 'no2_pivoted' is not defined
    
    In [19]: no_2.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-19-f64c1f33f2e3> in <module>
    ----> 1 no_2.head()
    
    NameError: name 'no_2' is not defined
    

    The pandas.melt() method on a DataFrame converts the data table from wide format to long format. The column headers become the variable names in a newly created column.

The solution is the short version on how to apply pandas.melt(). The method will melt all columns NOT mentioned in id_vars together into two columns: A column with the column header names and a column with the values itself. The latter column gets by default the name value.

The parameters passed to pandas.melt() can be defined in more detail:

In [20]: no_2 = no2_pivoted.melt(
   ....:     id_vars="date.utc",
   ....:     value_vars=["BETR801", "FR04014", "London Westminster"],
   ....:     value_name="NO_2",
   ....:     var_name="id_location",
   ....: )
   ....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-20-784884a11e5b> in <module>
----> 1 no_2 = no2_pivoted.melt(
      2     id_vars="date.utc",
      3     value_vars=["BETR801", "FR04014", "London Westminster"],
      4     value_name="NO_2",
      5     var_name="id_location",

NameError: name 'no2_pivoted' is not defined

In [21]: no_2.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-21-f64c1f33f2e3> in <module>
----> 1 no_2.head()

NameError: name 'no_2' is not defined

The additional parameters have the following effects:

  • value_vars defines which columns to melt together

  • value_name provides a custom column name for the values column instead of the default column name value

  • var_name provides a custom column name for the column collecting the column header names. Otherwise it takes the index name or a default variable

Hence, the arguments value_name and var_name are just user-defined names for the two generated columns. The columns to melt are defined by id_vars and value_vars.

To user guide

Conversion from wide to long format with pandas.melt() is explained in the user guide section on reshaping by melt.

REMEMBER

  • Sorting by one or more columns is supported by sort_values.

  • The pivot function is purely restructuring of the data, pivot_table supports aggregations.

  • The reverse of pivot (long to wide format) is melt (wide to long format).

To user guide

A full overview is available in the user guide on the pages about reshaping and pivoting.