In [1]: import pandas as pd
- Titanic data
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
PassengerId: Id of every passenger.
Survived: Indication whether passenger survived.
0
for yes and1
for no.Pclass: One out of the 3 ticket classes: Class
1
, Class2
and Class3
.Name: Name of passenger.
Sex: Gender of passenger.
Age: Age of passenger in years.
SibSp: Number of siblings or spouses aboard.
Parch: Number of parents or children aboard.
Ticket: Ticket number of passenger.
Fare: Indicating the fare.
Cabin: Cabin number of passenger.
Embarked: Port of embarkation.
In [2]: titanic = pd.read_csv("data/titanic.csv") --------------------------------------------------------------------------- FileNotFoundError Traceback (most recent call last) <ipython-input-2-4f6129a55af7> in <module> ----> 1 titanic = pd.read_csv("data/titanic.csv") /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs) 209 else: 210 kwargs[new_arg_name] = new_arg_value --> 211 return func(*args, **kwargs) 212 213 return cast(F, wrapper) /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs) 329 stacklevel=find_stack_level(), 330 ) --> 331 return func(*args, **kwargs) 332 333 # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options) 948 kwds.update(kwds_defaults) 949 --> 950 return _read(filepath_or_buffer, kwds) 951 952 /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds) 603 604 # Create the parser. --> 605 parser = TextFileReader(filepath_or_buffer, **kwds) 606 607 if chunksize or iterator: /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds) 1440 1441 self.handles: IOHandles | None = None -> 1442 self._engine = self._make_engine(f, self.engine) 1443 1444 def close(self) -> None: /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine) 1733 if "b" not in mode: 1734 mode += "b" -> 1735 self.handles = get_handle( 1736 f, 1737 mode, /usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) 854 if ioargs.encoding and "b" not in ioargs.mode: 855 # Encoding --> 856 handle = open( 857 handle, 858 ioargs.mode, FileNotFoundError: [Errno 2] No such file or directory: 'data/titanic.csv' In [3]: titanic.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-3-cdd760b19866> in <module> ----> 1 titanic.head() NameError: name 'titanic' is not defined
How to calculate summary statistics?¶
Aggregating statistics¶
What is the average age of the Titanic passengers?
In [4]: titanic["Age"].mean() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-4-ce36a65d1dd1> in <module> ----> 1 titanic["Age"].mean() NameError: name 'titanic' is not defined
Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default.
What is the median age and ticket fare price of the Titanic passengers?
In [5]: titanic[["Age", "Fare"]].median() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-5-67d6e373f850> in <module> ----> 1 titanic[["Age", "Fare"]].median() NameError: name 'titanic' is not defined
The statistic applied to multiple columns of a
DataFrame
(the selection of two columns returns aDataFrame
, see the subset data tutorial) is calculated for each numeric column.
The aggregating statistic can be calculated for multiple columns at the
same time. Remember the describe
function from the first tutorial?
In [6]: titanic[["Age", "Fare"]].describe()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-6-eb33412cdf83> in <module>
----> 1 titanic[["Age", "Fare"]].describe()
NameError: name 'titanic' is not defined
Instead of the predefined statistics, specific combinations of
aggregating statistics for given columns can be defined using the
DataFrame.agg()
method:
In [7]: titanic.agg(
...: {
...: "Age": ["min", "max", "median", "skew"],
...: "Fare": ["min", "max", "median", "mean"],
...: }
...: )
...:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-563775a98555> in <module>
----> 1 titanic.agg(
2 {
3 "Age": ["min", "max", "median", "skew"],
4 "Fare": ["min", "max", "median", "mean"],
5 }
NameError: name 'titanic' is not defined
Details about descriptive statistics are provided in the user guide section on descriptive statistics.
Aggregating statistics grouped by category¶
What is the average age for male versus female Titanic passengers?
In [8]: titanic[["Sex", "Age"]].groupby("Sex").mean() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-8-5cdc21d53954> in <module> ----> 1 titanic[["Sex", "Age"]].groupby("Sex").mean() NameError: name 'titanic' is not defined
As our interest is the average age for each gender, a subselection on these two columns is made first:
titanic[["Sex", "Age"]]
. Next, thegroupby()
method is applied on theSex
column to make a group per category. The average age for each gender is calculated and returned.
Calculating a given statistic (e.g. mean
age) for each category in
a column (e.g. male/female in the Sex
column) is a common pattern.
The groupby
method is used to support this type of operations. This
fits in the more general split-apply-combine
pattern:
Split the data into groups
Apply a function to each group independently
Combine the results into a data structure
The apply and combine steps are typically done together in pandas.
In the previous example, we explicitly selected the 2 columns first. If
not, the mean
method is applied to each column containing numerical
columns by passing numeric_only=True
:
In [9]: titanic.groupby("Sex").mean(numeric_only=True)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-9-36b140e0c45f> in <module>
----> 1 titanic.groupby("Sex").mean(numeric_only=True)
NameError: name 'titanic' is not defined
It does not make much sense to get the average value of the Pclass
.
If we are only interested in the average age for each gender, the
selection of columns (rectangular brackets []
as usual) is supported
on the grouped data as well:
In [10]: titanic.groupby("Sex")["Age"].mean()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-10-b010d6c1d816> in <module>
----> 1 titanic.groupby("Sex")["Age"].mean()
NameError: name 'titanic' is not defined
Note
The Pclass
column contains numerical data but actually
represents 3 categories (or factors) with respectively the labels ‘1’,
‘2’ and ‘3’. Calculating statistics on these does not make much sense.
Therefore, pandas provides a Categorical
data type to handle this
type of data. More information is provided in the user guide
Categorical data section.
What is the mean ticket fare price for each of the sex and cabin class combinations?
In [11]: titanic.groupby(["Sex", "Pclass"])["Fare"].mean() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-11-808583af65f4> in <module> ----> 1 titanic.groupby(["Sex", "Pclass"])["Fare"].mean() NameError: name 'titanic' is not defined
Grouping can be done by multiple columns at the same time. Provide the column names as a list to the
groupby()
method.
A full description on the split-apply-combine approach is provided in the user guide section on groupby operations.
Count number of records by category¶
What is the number of passengers in each of the cabin classes?
In [12]: titanic["Pclass"].value_counts() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-12-a58c1746b252> in <module> ----> 1 titanic["Pclass"].value_counts() NameError: name 'titanic' is not defined
The
value_counts()
method counts the number of records for each category in a column.
The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:
In [13]: titanic.groupby("Pclass")["Pclass"].count()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-13-93ba2addf674> in <module>
----> 1 titanic.groupby("Pclass")["Pclass"].count()
NameError: name 'titanic' is not defined
Note
Both size
and count
can be used in combination with
groupby
. Whereas size
includes NaN
values and just provides
the number of rows (size of the table), count
excludes the missing
values. In the value_counts
method, use the dropna
argument to
include or exclude the NaN
values.
The user guide has a dedicated section on value_counts
, see the page on discretization.
REMEMBER
Aggregation statistics can be calculated on entire columns or rows.
groupby
provides the power of the split-apply-combine pattern.value_counts
is a convenient shortcut to count the number of entries in each category of a variable.
A full description on the split-apply-combine approach is provided in the user guide pages about groupby operations.