In [1]: import pandas as pd
Data used for this tutorial:
  • This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:

    • PassengerId: Id of every passenger.

    • Survived: Indication whether passenger survived. 0 for yes and 1 for no.

    • Pclass: One out of the 3 ticket classes: Class 1, Class 2 and Class 3.

    • Name: Name of passenger.

    • Sex: Gender of passenger.

    • Age: Age of passenger in years.

    • SibSp: Number of siblings or spouses aboard.

    • Parch: Number of parents or children aboard.

    • Ticket: Ticket number of passenger.

    • Fare: Indicating the fare.

    • Cabin: Cabin number of passenger.

    • Embarked: Port of embarkation.

    To raw data
    In [2]: titanic = pd.read_csv("data/titanic.csv")
    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-2-4f6129a55af7> in <module>
    ----> 1 titanic = pd.read_csv("data/titanic.csv")
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        209                 else:
        210                     kwargs[new_arg_name] = new_arg_value
    --> 211             return func(*args, **kwargs)
        212 
        213         return cast(F, wrapper)
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        329                     stacklevel=find_stack_level(),
        330                 )
    --> 331             return func(*args, **kwargs)
        332 
        333         # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
        948     kwds.update(kwds_defaults)
        949 
    --> 950     return _read(filepath_or_buffer, kwds)
        951 
        952 
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
        603 
        604     # Create the parser.
    --> 605     parser = TextFileReader(filepath_or_buffer, **kwds)
        606 
        607     if chunksize or iterator:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
       1440 
       1441         self.handles: IOHandles | None = None
    -> 1442         self._engine = self._make_engine(f, self.engine)
       1443 
       1444     def close(self) -> None:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine)
       1733                 if "b" not in mode:
       1734                     mode += "b"
    -> 1735             self.handles = get_handle(
       1736                 f,
       1737                 mode,
    
    /usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
        854         if ioargs.encoding and "b" not in ioargs.mode:
        855             # Encoding
    --> 856             handle = open(
        857                 handle,
        858                 ioargs.mode,
    
    FileNotFoundError: [Errno 2] No such file or directory: 'data/titanic.csv'
    
    In [3]: titanic.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-3-cdd760b19866> in <module>
    ----> 1 titanic.head()
    
    NameError: name 'titanic' is not defined
    

How do I select a subset of a DataFrame?

How do I select specific columns from a DataFrame?

../../_images/03_subset_columns.svg
  • I’m interested in the age of the Titanic passengers.

    In [4]: ages = titanic["Age"]
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-4-7327ba6281af> in <module>
    ----> 1 ages = titanic["Age"]
    
    NameError: name 'titanic' is not defined
    
    In [5]: ages.head()
    Out[5]: 
    0    22
    1    35
    2    58
    Name: Age, dtype: int64
    

    To select a single column, use square brackets [] with the column name of the column of interest.

Each column in a DataFrame is a Series. As a single column is selected, the returned object is a pandas Series. We can verify this by checking the type of the output:

In [6]: type(titanic["Age"])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-7bd04334af9f> in <module>
----> 1 type(titanic["Age"])

NameError: name 'titanic' is not defined

And have a look at the shape of the output:

In [7]: titanic["Age"].shape
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-59f8b0af4ac3> in <module>
----> 1 titanic["Age"].shape

NameError: name 'titanic' is not defined

DataFrame.shape is an attribute (remember tutorial on reading and writing, do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned.

  • I’m interested in the age and sex of the Titanic passengers.

    In [8]: age_sex = titanic[["Age", "Sex"]]
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-8-2742720edbbb> in <module>
    ----> 1 age_sex = titanic[["Age", "Sex"]]
    
    NameError: name 'titanic' is not defined
    
    In [9]: age_sex.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-9-08f2d4755ae8> in <module>
    ----> 1 age_sex.head()
    
    NameError: name 'age_sex' is not defined
    

    To select multiple columns, use a list of column names within the selection brackets [].

Note

The inner square brackets define a Python list with column names, whereas the outer brackets are used to select the data from a pandas DataFrame as seen in the previous example.

The returned data type is a pandas DataFrame:

In [10]: type(titanic[["Age", "Sex"]])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-99019f1ef6e5> in <module>
----> 1 type(titanic[["Age", "Sex"]])

NameError: name 'titanic' is not defined
In [11]: titanic[["Age", "Sex"]].shape
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-4e95699a34a8> in <module>
----> 1 titanic[["Age", "Sex"]].shape

NameError: name 'titanic' is not defined

The selection returned a DataFrame with 891 rows and 2 columns. Remember, a DataFrame is 2-dimensional with both a row and column dimension.

To user guide

For basic information on indexing, see the user guide section on indexing and selecting data.

How do I filter specific rows from a DataFrame?

../../_images/03_subset_rows.svg
  • I’m interested in the passengers older than 35 years.

    In [12]: above_35 = titanic[titanic["Age"] > 35]
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-12-e8345e3521a1> in <module>
    ----> 1 above_35 = titanic[titanic["Age"] > 35]
    
    NameError: name 'titanic' is not defined
    
    In [13]: above_35.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-13-c3d6cb62760e> in <module>
    ----> 1 above_35.head()
    
    NameError: name 'above_35' is not defined
    

    To select rows based on a conditional expression, use a condition inside the selection brackets [].

The condition inside the selection brackets titanic["Age"] > 35 checks for which rows the Age column has a value larger than 35:

In [14]: titanic["Age"] > 35
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-14-379a1c62df62> in <module>
----> 1 titanic["Age"] > 35

NameError: name 'titanic' is not defined

The output of the conditional expression (>, but also ==, !=, <, <=,… would work) is actually a pandas Series of boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.

We know from before that the original Titanic DataFrame consists of 891 rows. Let’s have a look at the number of rows which satisfy the condition by checking the shape attribute of the resulting DataFrame above_35:

In [15]: above_35.shape
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-15-a81b794388ef> in <module>
----> 1 above_35.shape

NameError: name 'above_35' is not defined
  • I’m interested in the Titanic passengers from cabin class 2 and 3.

    In [16]: class_23 = titanic[titanic["Pclass"].isin([2, 3])]
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-16-a4f7864eab2d> in <module>
    ----> 1 class_23 = titanic[titanic["Pclass"].isin([2, 3])]
    
    NameError: name 'titanic' is not defined
    
    In [17]: class_23.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-17-834772dd73a0> in <module>
    ----> 1 class_23.head()
    
    NameError: name 'class_23' is not defined
    

    Similar to the conditional expression, the isin() conditional function returns a True for each row the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []. In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3]) checks for which rows the Pclass column is either 2 or 3.

The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an | (or) operator:

In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-18-50268a2f0b24> in <module>
----> 1 class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

NameError: name 'titanic' is not defined

In [19]: class_23.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-19-834772dd73a0> in <module>
----> 1 class_23.head()

NameError: name 'class_23' is not defined

Note

When combining multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you can not use or/and but need to use the or operator | and the and operator &.

To user guide

See the dedicated section in the user guide about boolean indexing or about the isin function.

  • I want to work with passenger data for which the age is known.

    In [20]: age_no_na = titanic[titanic["Age"].notna()]
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-20-f18aabab7b5c> in <module>
    ----> 1 age_no_na = titanic[titanic["Age"].notna()]
    
    NameError: name 'titanic' is not defined
    
    In [21]: age_no_na.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-21-66adfc9ca9ef> in <module>
    ----> 1 age_no_na.head()
    
    NameError: name 'age_no_na' is not defined
    

    The notna() conditional function returns a True for each row the values are not a Null value. As such, this can be combined with the selection brackets [] to filter the data table.

You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:

In [22]: age_no_na.shape
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-22-d783a0586ffe> in <module>
----> 1 age_no_na.shape

NameError: name 'age_no_na' is not defined
To user guide

For more dedicated functions on missing values, see the user guide section about handling missing data.

How do I select specific rows and columns from a DataFrame?

../../_images/03_subset_columns_rows.svg
  • I’m interested in the names of the passengers older than 35 years.

    In [23]: adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-23-08b63324ca66> in <module>
    ----> 1 adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
    
    NameError: name 'titanic' is not defined
    
    In [24]: adult_names.head()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-24-31c0f97d35d3> in <module>
    ----> 1 adult_names.head()
    
    NameError: name 'adult_names' is not defined
    

    In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.

When using the column names, row labels or a condition expression, use the loc operator in front of the selection brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.

  • I’m interested in rows 10 till 25 and columns 3 to 5.

    In [25]: titanic.iloc[9:25, 2:5]
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-25-0d7a88b72b2b> in <module>
    ----> 1 titanic.iloc[9:25, 2:5]
    
    NameError: name 'titanic' is not defined
    

    Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets [].

When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data. For example, to assign the name anonymous to the first 3 elements of the third column:

In [26]: titanic.iloc[0:3, 3] = "anonymous"
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-26-f84b581a8d45> in <module>
----> 1 titanic.iloc[0:3, 3] = "anonymous"

NameError: name 'titanic' is not defined

In [27]: titanic.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-27-cdd760b19866> in <module>
----> 1 titanic.head()

NameError: name 'titanic' is not defined
To user guide

See the user guide section on different choices for indexing to get more insight in the usage of loc and iloc.

REMEMBER

  • When selecting subsets of data, square brackets [] are used.

  • Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.

  • Select specific rows and/or columns using loc when using the row and column names.

  • Select specific rows and/or columns using iloc when using the positions in the table.

  • You can assign new values to a selection based on loc/iloc.

To user guide

A full overview of indexing is provided in the user guide pages on indexing and selecting data.