In [1]: import pandas as pd
Data used for this tutorial:
  • This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:

    • PassengerId: Id of every passenger.

    • Survived: Indicates whether the passenger survived: 0 for no and 1 for yes.

    • Pclass: One out of the 3 ticket classes: Class 1, Class 2 and Class 3.

    • Name: Name of passenger.

    • Sex: Gender of passenger.

    • Age: Age of passenger in years.

    • SibSp: Number of siblings or spouses aboard.

    • Parch: Number of parents or children aboard.

    • Ticket: Ticket number of passenger.

    • Fare: Fare paid for the ticket.

    • Cabin: Cabin number of passenger.

    • Embarked: Port of embarkation.

    In [2]: titanic = pd.read_csv("data/titanic.csv")
    
    In [3]: titanic.head()

How to manipulate textual data?

  • Make all name characters lowercase.

    In [4]: titanic["Name"].str.lower()

    To make each of the strings in the Name column lowercase, select the Name column (see the tutorial on selection of data), add the str accessor and apply the lower method. As such, each of the strings is converted element-wise.

Just as datetime objects have a dt accessor (see the time series tutorial), a number of specialized string methods are available through the str accessor. These methods generally have the same names as the equivalent built-in string methods for single elements, but they are applied element-wise (remember element-wise calculations?) to each of the values in the column.
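The correspondence between the built-in string methods and the str accessor can be sketched as follows. The two names are a small hand-made sample in the same format as the Name column, not the real data file, and the variable names are purely illustrative.

```python
import pandas as pd

# Hand-made sample in the "Surname, Title. Given names" format (an assumption,
# not loaded from titanic.csv).
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# The built-in str.lower() handles one string at a time, so a loop is needed.
lowered_builtin = [name.lower() for name in names]

# The str accessor applies the same operation to every element of the Series.
lowered_pandas = names.str.lower()

print(lowered_pandas.tolist())
```

Both approaches give the same strings; the accessor version returns a Series that keeps the original index.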

  • Create a new column Surname that contains the surname of the passengers by extracting the part before the comma.

    In [5]: titanic["Name"].str.split(",")

    Using the Series.str.split() method, each of the values is returned as a list of 2 elements. The first element is the part before the comma and the second element is the part after the comma.

    In [6]: titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
    In [7]: titanic["Surname"]

    As we are only interested in the first part representing the surname (element 0), we can again use the str accessor and apply Series.str.get() to extract the relevant part. Indeed, these string methods can be chained to combine multiple operations in one expression!
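A minimal sketch of the split-then-get chain, run on a hand-made two-row frame rather than the real file. The frame `sample` and the extra `Rest` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Hand-made stand-in for the Name column (assumption: not the real data set).
sample = pd.DataFrame(
    {"Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]}
)

# Step 1: split on the comma; each value becomes a list of two strings.
parts = sample["Name"].str.split(",")

# Step 2: take element 0 of each list to get the surname.
sample["Surname"] = parts.str.get(0)

# The part after the comma keeps its leading space, so one more chained
# string method (strip) cleans it up.
sample["Rest"] = sample["Name"].str.split(",").str.get(1).str.strip()

print(sample[["Surname", "Rest"]])
```

Each step returns a Series, which is why a further str call can follow directly.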

To user guide

More information on extracting parts of strings is available in the user guide section on splitting and replacing strings.

  • Extract the passenger data about the countesses on board the Titanic.

    In [8]: titanic["Name"].str.contains("Countess")
    In [9]: titanic[titanic["Name"].str.contains("Countess")]

    (Interested in her story? See Wikipedia!)

    The string method Series.str.contains() checks for each of the values in the column Name if the string contains the word Countess and returns for each of the values True (Countess is part of the name) or False (Countess is not part of the name). This output can be used to subselect the data using conditional (boolean) indexing introduced in the subsetting of data tutorial. As there was only one countess on the Titanic, we get one row as a result.
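The two-step pattern (build a boolean mask, then index with it) might look like this on a small hand-made frame. The rows below are an illustrative sample, not the real file, and `mask` and `countesses` are hypothetical names.

```python
import pandas as pd

# Hand-made sample rows (assumption: a stand-in for the Titanic data).
sample = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
            "Heikkinen, Miss. Laina",
        ],
        "Pclass": [3, 1, 3],
    }
)

# One True/False value per row: does the name contain "Countess"?
mask = sample["Name"].str.contains("Countess")

# Boolean indexing keeps only the rows where the mask is True.
countesses = sample[mask]

print(countesses)
```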

Note

More powerful extractions on strings are supported, as the Series.str.contains() and Series.str.extract() methods accept regular expressions, but these are out of scope for this tutorial.
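For readers who want a first impression anyway, here is a brief sketch of both methods with a regular expression. The names are a hand-made sample and the patterns are illustrative assumptions about the "Surname, Title. Given names" layout, not something prescribed by the tutorial.

```python
import pandas as pd

# Hand-made sample (assumption: not loaded from the real file).
names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ]
)

# contains() treats the pattern as a regular expression by default:
# "Mrs?\." matches "Mr." or "Mrs." but not "Miss.".
is_mr_or_mrs = names.str.contains(r"Mrs?\.")

# extract() returns the text captured by the group: everything between
# the comma and the first period. expand=False yields a Series.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(titles.tolist())
```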

To user guide

More information on extracting parts of strings is available in the user guide section on string matching and extracting.

  • Which passenger of the Titanic has the longest name?

    In [10]: titanic["Name"].str.len()

    To get the longest name we first have to get the lengths of each of the names in the Name column. By using pandas string methods, the Series.str.len() function is applied to each of the names individually (element-wise).

    In [11]: titanic["Name"].str.len().idxmax()

    Next, we need to get the corresponding location, preferably the index label, in the table for which the name length is the largest. The idxmax() method does exactly that. It is not a string method and is applied to integers, so no str is used.

    In [12]: titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]

    Based on the index label of the row (307) and the column name (Name), we can do a selection using the loc operator, introduced in the tutorial on subsetting.
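The three steps above (lengths, label of the maximum, label-based selection) can be sketched on a small hand-made frame. The index labels 10, 20 and 30 are chosen arbitrarily to show that idxmax() returns a label rather than a position; all names here are illustrative.

```python
import pandas as pd

# Hand-made sample with non-default index labels (assumption: not the real data).
sample = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Heikkinen, Miss. Laina",
        ]
    },
    index=[10, 20, 30],
)

# Step 1: length of every name (a Series of integers).
lengths = sample["Name"].str.len()

# Step 2: index label of the row with the largest length.
longest_label = lengths.idxmax()

# Step 3: select that row and the Name column by label.
longest_name = sample.loc[longest_label, "Name"]

print(longest_label, longest_name)
```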

  • In the “Sex” column, replace values of “male” by “M” and values of “female” by “F”.

    In [13]: titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})
    In [14]: titanic["Sex_short"]

    Although replace() is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. It requires a dictionary to define the mapping {from: to}.
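A small sketch of the dictionary-based replace() on a hand-made Series. The value "unknown" is an invented entry, added only to show that values missing from the mapping are left unchanged.

```python
import pandas as pd

# Hand-made values; "unknown" is hypothetical and not part of the Titanic data.
sex = pd.Series(["male", "female", "female", "unknown"])

# Each whole value is looked up in the mapping {from: to}.
short = sex.replace({"male": "M", "female": "F"})

print(short.tolist())
```

Because replace() matches complete values rather than substrings, "female" is never affected by the "male" entry.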

Warning

There is also a Series.str.replace() method available to replace a specific set of characters within each string. However, with a mapping of multiple values, this would become:

titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")

This is cumbersome and easily leads to mistakes. Just think (or try out yourself) what would happen if those two statements were applied in the opposite order…
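For those who would rather see it than try it, here is a short sketch comparing both orders on a hand-made two-value Series. The variable names are illustrative.

```python
import pandas as pd

sex = pd.Series(["male", "female"])

# Order from the text: the longer string is replaced first.
right_order = sex.str.replace("female", "F").str.replace("male", "M")

# Opposite order: "male" is also a substring of "female", so the first
# call turns "female" into "feM" and the second call finds nothing to replace.
wrong_order = sex.str.replace("male", "M").str.replace("female", "F")

print(right_order.tolist())
print(wrong_order.tolist())
```

The substring match is what makes the order matter here, whereas the dictionary-based replace() compares whole values.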

REMEMBER

  • String methods are available using the str accessor.

  • String methods work element-wise and can be used for conditional indexing.

  • The replace method is a convenient method to convert values according to a given dictionary.

To user guide

A full overview is provided in the user guide pages on working with text data.