In [1]: import pandas as pd
Data used for this tutorial:
  • This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:

    • PassengerId: Id of every passenger.

    • Survived: Indication whether passenger survived. 0 for yes and 1 for no.

    • Pclass: One out of the 3 ticket classes: Class 1, Class 2 and Class 3.

    • Name: Name of passenger.

    • Sex: Gender of passenger.

    • Age: Age of passenger in years.

    • SibSp: Number of siblings or spouses aboard.

    • Parch: Number of parents or children aboard.

    • Ticket: Ticket number of passenger.

    • Fare: Indicating the fare.

    • Cabin: Cabin number of passenger.

    • Embarked: Port of embarkation.

    To raw data
    In [2]: titanic = pd.read_csv("data/titanic.csv")
    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-2-4f6129a55af7> in <module>
    ----> 1 titanic = pd.read_csv("data/titanic.csv")
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        209                 else:
        210                     kwargs[new_arg_name] = new_arg_value
    --> 211             return func(*args, **kwargs)
        212 
        213         return cast(F, wrapper)
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        329                     stacklevel=find_stack_level(),
        330                 )
    --> 331             return func(*args, **kwargs)
        332 
        333         # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
        948     kwds.update(kwds_defaults)
        949 
    --> 950     return _read(filepath_or_buffer, kwds)
        951 
        952 
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
        603 
        604     # Create the parser.
    --> 605     parser = TextFileReader(filepath_or_buffer, **kwds)
        606 
        607     if chunksize or iterator:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
       1440 
       1441         self.handles: IOHandles | None = None
    -> 1442         self._engine = self._make_engine(f, self.engine)
       1443 
       1444     def close(self) -> None:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine)
       1733                 if "b" not in mode:
       1734                     mode += "b"
    -> 1735             self.handles = get_handle(
       1736                 f,
       1737                 mode,
    
    /usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
        854         if ioargs.encoding and "b" not in ioargs.mode:
        855             # Encoding
    --> 856             handle = open(
        857                 handle,
        858                 ioargs.mode,
    
    FileNotFoundError: [Errno 2] No such file or directory: 'data/titanic.csv'
    
    In [3]: titanic.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-3-cdd760b19866> in <module>
    ----> 1 titanic.head()
    
    NameError: name 'titanic' is not defined
    

How to calculate summary statistics?

Aggregating statistics

../../_images/06_aggregate.svg
  • What is the average age of the Titanic passengers?

    In [4]: titanic["Age"].mean()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-4-ce36a65d1dd1> in <module>
    ----> 1 titanic["Age"].mean()
    
    NameError: name 'titanic' is not defined
    

Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default.

../../_images/06_reduction.svg
  • What is the median age and ticket fare price of the Titanic passengers?

    In [5]: titanic[["Age", "Fare"]].median()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-5-67d6e373f850> in <module>
    ----> 1 titanic[["Age", "Fare"]].median()
    
    NameError: name 'titanic' is not defined
    

    The statistic applied to multiple columns of a DataFrame (the selection of two columns returns a DataFrame, see the subset data tutorial) is calculated for each numeric column.

The aggregating statistic can be calculated for multiple columns at the same time. Remember the describe function from the first tutorial?

In [6]: titanic[["Age", "Fare"]].describe()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-eb33412cdf83> in <module>
----> 1 titanic[["Age", "Fare"]].describe()

NameError: name 'titanic' is not defined

Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the DataFrame.agg() method:

In [7]: titanic.agg(
   ...:     {
   ...:         "Age": ["min", "max", "median", "skew"],
   ...:         "Fare": ["min", "max", "median", "mean"],
   ...:     }
   ...: )
   ...: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-563775a98555> in <module>
----> 1 titanic.agg(
      2     {
      3         "Age": ["min", "max", "median", "skew"],
      4         "Fare": ["min", "max", "median", "mean"],
      5     }

NameError: name 'titanic' is not defined
To user guide

Details about descriptive statistics are provided in the user guide section on descriptive statistics.

Aggregating statistics grouped by category

../../_images/06_groupby.svg
  • What is the average age for male versus female Titanic passengers?

    In [8]: titanic[["Sex", "Age"]].groupby("Sex").mean()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-8-5cdc21d53954> in <module>
    ----> 1 titanic[["Sex", "Age"]].groupby("Sex").mean()
    
    NameError: name 'titanic' is not defined
    

    As our interest is the average age for each gender, a subselection on these two columns is made first: titanic[["Sex", "Age"]]. Next, the groupby() method is applied on the Sex column to make a group per category. The average age for each gender is calculated and returned.

Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the Sex column) is a common pattern. The groupby method is used to support this type of operations. This fits in the more general split-apply-combine pattern:

  • Split the data into groups

  • Apply a function to each group independently

  • Combine the results into a data structure

The apply and combine steps are typically done together in pandas.

In the previous example, we explicitly selected the 2 columns first. If not, the mean method is applied to each column containing numerical columns by passing numeric_only=True:

In [9]: titanic.groupby("Sex").mean(numeric_only=True)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-36b140e0c45f> in <module>
----> 1 titanic.groupby("Sex").mean(numeric_only=True)

NameError: name 'titanic' is not defined

It does not make much sense to get the average value of the Pclass. If we are only interested in the average age for each gender, the selection of columns (rectangular brackets [] as usual) is supported on the grouped data as well:

In [10]: titanic.groupby("Sex")["Age"].mean()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-b010d6c1d816> in <module>
----> 1 titanic.groupby("Sex")["Age"].mean()

NameError: name 'titanic' is not defined
../../_images/06_groupby_select_detail.svg

Note

The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a Categorical data type to handle this type of data. More information is provided in the user guide Categorical data section.

  • What is the mean ticket fare price for each of the sex and cabin class combinations?

    In [11]: titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-11-808583af65f4> in <module>
    ----> 1 titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
    
    NameError: name 'titanic' is not defined
    

    Grouping can be done by multiple columns at the same time. Provide the column names as a list to the groupby() method.

To user guide

A full description on the split-apply-combine approach is provided in the user guide section on groupby operations.

Count number of records by category

../../_images/06_valuecounts.svg
  • What is the number of passengers in each of the cabin classes?

    In [12]: titanic["Pclass"].value_counts()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-12-a58c1746b252> in <module>
    ----> 1 titanic["Pclass"].value_counts()
    
    NameError: name 'titanic' is not defined
    

    The value_counts() method counts the number of records for each category in a column.

The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:

In [13]: titanic.groupby("Pclass")["Pclass"].count()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-93ba2addf674> in <module>
----> 1 titanic.groupby("Pclass")["Pclass"].count()

NameError: name 'titanic' is not defined

Note

Both size and count can be used in combination with groupby. Whereas size includes NaN values and just provides the number of rows (size of the table), count excludes the missing values. In the value_counts method, use the dropna argument to include or exclude the NaN values.

To user guide

The user guide has a dedicated section on value_counts , see the page on discretization.

REMEMBER

  • Aggregation statistics can be calculated on entire columns or rows.

  • groupby provides the power of the split-apply-combine pattern.

  • value_counts is a convenient shortcut to count the number of entries in each category of a variable.

To user guide

A full description on the split-apply-combine approach is provided in the user guide pages about groupby operations.