In [1]: import pandas as pd
- Titanic data
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
PassengerId: Id of every passenger.
Survived: Indication whether passenger survived.
0
for yes and1
for no.Pclass: One out of the 3 ticket classes: Class
1
, Class2
and Class3
.Name: Name of passenger.
Sex: Gender of passenger.
Age: Age of passenger in years.
SibSp: Number of siblings or spouses aboard.
Parch: Number of parents or children aboard.
Ticket: Ticket number of passenger.
Fare: Indicating the fare.
Cabin: Cabin number of passenger.
Embarked: Port of embarkation.
In [2]: titanic = pd.read_csv("data/titanic.csv") --------------------------------------------------------------------------- FileNotFoundError Traceback (most recent call last) <ipython-input-2-4f6129a55af7> in <module> ----> 1 titanic = pd.read_csv("data/titanic.csv") /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs) 209 else: 210 kwargs[new_arg_name] = new_arg_value --> 211 return func(*args, **kwargs) 212 213 return cast(F, wrapper) /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs) 329 stacklevel=find_stack_level(), 330 ) --> 331 return func(*args, **kwargs) 332 333 # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options) 948 kwds.update(kwds_defaults) 949 --> 950 return _read(filepath_or_buffer, kwds) 951 952 /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds) 603 604 # Create the parser. --> 605 parser = TextFileReader(filepath_or_buffer, **kwds) 606 607 if chunksize or iterator: /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds) 1440 1441 self.handles: IOHandles | None = None -> 1442 self._engine = self._make_engine(f, self.engine) 1443 1444 def close(self) -> None: /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine) 1733 if "b" not in mode: 1734 mode += "b" -> 1735 self.handles = get_handle( 1736 f, 1737 mode, /usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) 854 if ioargs.encoding and "b" not in ioargs.mode: 855 # Encoding --> 856 handle = open( 857 handle, 858 ioargs.mode, FileNotFoundError: [Errno 2] No such file or directory: 'data/titanic.csv' In [3]: titanic.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-3-cdd760b19866> in <module> ----> 1 titanic.head() NameError: name 'titanic' is not defined
How do I select a subset of a DataFrame
?¶
How do I select specific columns from a DataFrame
?¶
I’m interested in the age of the Titanic passengers.
In [4]: ages = titanic["Age"] --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-4-7327ba6281af> in <module> ----> 1 ages = titanic["Age"] NameError: name 'titanic' is not defined In [5]: ages.head() Out[5]: 0 22 1 35 2 58 Name: Age, dtype: int64
To select a single column, use square brackets
[]
with the column name of the column of interest.
Each column in a DataFrame
is a Series
. As a single column is
selected, the returned object is a pandas Series
. We can verify this
by checking the type of the output:
In [6]: type(titanic["Age"])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-6-7bd04334af9f> in <module>
----> 1 type(titanic["Age"])
NameError: name 'titanic' is not defined
And have a look at the shape
of the output:
In [7]: titanic["Age"].shape
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-59f8b0af4ac3> in <module>
----> 1 titanic["Age"].shape
NameError: name 'titanic' is not defined
DataFrame.shape
is an attribute (remember tutorial on reading and writing, do not use parentheses for attributes) of a
pandas Series
and DataFrame
containing the number of rows and
columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only
the number of rows is returned.
I’m interested in the age and sex of the Titanic passengers.
In [8]: age_sex = titanic[["Age", "Sex"]] --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-8-2742720edbbb> in <module> ----> 1 age_sex = titanic[["Age", "Sex"]] NameError: name 'titanic' is not defined In [9]: age_sex.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-9-08f2d4755ae8> in <module> ----> 1 age_sex.head() NameError: name 'age_sex' is not defined
To select multiple columns, use a list of column names within the selection brackets
[]
.
Note
The inner square brackets define a
Python list with column names, whereas
the outer brackets are used to select the data from a pandas
DataFrame
as seen in the previous example.
The returned data type is a pandas DataFrame:
In [10]: type(titanic[["Age", "Sex"]])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-10-99019f1ef6e5> in <module>
----> 1 type(titanic[["Age", "Sex"]])
NameError: name 'titanic' is not defined
In [11]: titanic[["Age", "Sex"]].shape
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-11-4e95699a34a8> in <module>
----> 1 titanic[["Age", "Sex"]].shape
NameError: name 'titanic' is not defined
The selection returned a DataFrame
with 891 rows and 2 columns. Remember, a
DataFrame
is 2-dimensional with both a row and column dimension.
For basic information on indexing, see the user guide section on indexing and selecting data.
How do I filter specific rows from a DataFrame
?¶
I’m interested in the passengers older than 35 years.
In [12]: above_35 = titanic[titanic["Age"] > 35] --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-12-e8345e3521a1> in <module> ----> 1 above_35 = titanic[titanic["Age"] > 35] NameError: name 'titanic' is not defined In [13]: above_35.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-13-c3d6cb62760e> in <module> ----> 1 above_35.head() NameError: name 'above_35' is not defined
To select rows based on a conditional expression, use a condition inside the selection brackets
[]
.
The condition inside the selection
brackets titanic["Age"] > 35
checks for which rows the Age
column has a value larger than 35:
In [14]: titanic["Age"] > 35
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-14-379a1c62df62> in <module>
----> 1 titanic["Age"] > 35
NameError: name 'titanic' is not defined
The output of the conditional expression (>
, but also ==
,
!=
, <
, <=
,… would work) is actually a pandas Series
of
boolean values (either True
or False
) with the same number of
rows as the original DataFrame
. Such a Series
of boolean values
can be used to filter the DataFrame
by putting it in between the
selection brackets []
. Only rows for which the value is True
will be selected.
We know from before that the original Titanic DataFrame
consists of
891 rows. Let’s have a look at the number of rows which satisfy the
condition by checking the shape
attribute of the resulting
DataFrame
above_35
:
In [15]: above_35.shape
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-15-a81b794388ef> in <module>
----> 1 above_35.shape
NameError: name 'above_35' is not defined
I’m interested in the Titanic passengers from cabin class 2 and 3.
In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])] --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-16-a4f7864eab2d> in <module> ----> 1 class_23 = titanic[titanic["Pclass"].isin([2, 3])] NameError: name 'titanic' is not defined In [17]: class_23.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-17-834772dd73a0> in <module> ----> 1 class_23.head() NameError: name 'class_23' is not defined
Similar to the conditional expression, the
isin()
conditional function returns aTrue
for each row the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets[]
. In this case, the condition inside the selection bracketstitanic["Pclass"].isin([2, 3])
checks for which rows thePclass
column is either 2 or 3.
The above is equivalent to filtering by rows for which the class is
either 2 or 3 and combining the two statements with an |
(or)
operator:
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-18-50268a2f0b24> in <module>
----> 1 class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
NameError: name 'titanic' is not defined
In [19]: class_23.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-19-834772dd73a0> in <module>
----> 1 class_23.head()
NameError: name 'class_23' is not defined
Note
When combining multiple conditional statements, each condition
must be surrounded by parentheses ()
. Moreover, you can not use
or
/and
but need to use the or
operator |
and the and
operator &
.
See the dedicated section in the user guide about boolean indexing or about the isin function.
I want to work with passenger data for which the age is known.
In [20]: age_no_na = titanic[titanic["Age"].notna()] --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-20-f18aabab7b5c> in <module> ----> 1 age_no_na = titanic[titanic["Age"].notna()] NameError: name 'titanic' is not defined In [21]: age_no_na.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-21-66adfc9ca9ef> in <module> ----> 1 age_no_na.head() NameError: name 'age_no_na' is not defined
The
notna()
conditional function returns aTrue
for each row the values are not aNull
value. As such, this can be combined with the selection brackets[]
to filter the data table.
You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:
In [22]: age_no_na.shape
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-22-d783a0586ffe> in <module>
----> 1 age_no_na.shape
NameError: name 'age_no_na' is not defined
For more dedicated functions on missing values, see the user guide section about handling missing data.
How do I select specific rows and columns from a DataFrame
?¶
I’m interested in the names of the passengers older than 35 years.
In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"] --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-23-08b63324ca66> in <module> ----> 1 adult_names = titanic.loc[titanic["Age"] > 35, "Name"] NameError: name 'titanic' is not defined In [24]: adult_names.head() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-24-31c0f97d35d3> in <module> ----> 1 adult_names.head() NameError: name 'adult_names' is not defined
In this case, a subset of both rows and columns is made in one go and just using selection brackets
[]
is not sufficient anymore. Theloc
/iloc
operators are required in front of the selection brackets[]
. When usingloc
/iloc
, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using the column names, row labels or a condition expression, use
the loc
operator in front of the selection brackets []
. For both
the part before and after the comma, you can use a single label, a list
of labels, a slice of labels, a conditional expression or a colon. Using
a colon specifies you want to select all rows or columns.
I’m interested in rows 10 till 25 and columns 3 to 5.
In [25]: titanic.iloc[9:25, 2:5] --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-25-0d7a88b72b2b> in <module> ----> 1 titanic.iloc[9:25, 2:5] NameError: name 'titanic' is not defined
Again, a subset of both rows and columns is made in one go and just using selection brackets
[]
is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use theiloc
operator in front of the selection brackets[]
.
When selecting specific rows and/or columns with loc
or iloc
,
new values can be assigned to the selected data. For example, to assign
the name anonymous
to the first 3 elements of the third column:
In [26]: titanic.iloc[0:3, 3] = "anonymous"
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-26-f84b581a8d45> in <module>
----> 1 titanic.iloc[0:3, 3] = "anonymous"
NameError: name 'titanic' is not defined
In [27]: titanic.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-27-cdd760b19866> in <module>
----> 1 titanic.head()
NameError: name 'titanic' is not defined
See the user guide section on different choices for indexing to get more insight in the usage of loc
and iloc
.
REMEMBER
When selecting subsets of data, square brackets
[]
are used.Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
Select specific rows and/or columns using
loc
when using the row and column names.Select specific rows and/or columns using
iloc
when using the positions in the table.You can assign new values to a selection based on
loc
/iloc
.
A full overview of indexing is provided in the user guide pages on indexing and selecting data.