In [1]: import pandas as pd
- Titanic data
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
PassengerId: Id of every passenger.
Survived: Indication whether passenger survived. 0 for no and 1 for yes.
Pclass: One out of the 3 ticket classes: Class 1, Class 2 and Class 3.
Name: Name of passenger.
Sex: Gender of passenger.
Age: Age of passenger in years.
SibSp: Number of siblings or spouses aboard.
Parch: Number of parents or children aboard.
Ticket: Ticket number of passenger.
Fare: Fare paid for the ticket.
Cabin: Cabin number of passenger.
Embarked: Port of embarkation.
In [2]: titanic = pd.read_csv("data/titanic.csv")
In [3]: titanic.head()
How to manipulate textual data?
Make all name characters lowercase.
In [4]: titanic["Name"].str.lower()
To make each of the strings in the Name column lowercase, select the Name column (see the tutorial on selection of data), add the str accessor and apply the lower method. As such, each of the strings is converted element-wise.
Similar to datetime objects in the time series tutorial having a dt accessor, a number of specialized string methods are available when using the str accessor. These methods generally have the same names as the equivalent built-in string methods for single elements, but are applied element-wise (remember element-wise calculations?) to each of the values of the column.
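The correspondence between the str accessor and Python's built-in string methods can be sketched as follows. The two names are a small hand-typed sample in the style of the Name column, not the full data set, and the variable names are only illustrative.

```python
import pandas as pd

# Hand-typed sample in the "Surname, Title. Given names" style of the Name column.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Vectorized: the str accessor applies lower() to every element of the Series.
lowered = names.str.lower()

# The same result, written out element by element with the built-in str.lower().
lowered_builtin = pd.Series([name.lower() for name in names])

print(lowered.tolist())
print(lowered_builtin.tolist())
```

Both approaches give the same strings; the accessor version avoids the explicit loop and keeps the original index.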
Create a new column Surname that contains the surname of the passengers by extracting the part before the comma.
In [5]: titanic["Name"].str.split(",")
Using the Series.str.split() method, each of the values is returned as a list of 2 elements. The first element is the part before the comma and the second element is the part after the comma.
In [6]: titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
In [7]: titanic["Surname"]
As we are only interested in the first part representing the surname (element 0), we can again use the str accessor and apply Series.str.get() to extract the relevant part. Indeed, these string functions can be chained to combine multiple functions at once!
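A minimal sketch of the split-then-get chain on a hand-typed sample (two names in the style of the Name column; the variable names are illustrative). It also shows that indexing with str[0] is an alternative spelling of str.get(0).

```python
import pandas as pd

# Hand-typed sample, assumed to contain exactly one comma per name.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Step 1: split on the comma; every element becomes a list of parts.
parts = names.str.split(",")

# Step 2: take element 0 of each list, i.e. the part before the comma.
surname = parts.str.get(0)

# The same chain written in one expression, using indexing instead of get().
surname_alt = names.str.split(",").str[0]

print(surname.tolist())
```

Note that the part after the comma keeps its leading space; Series.str.strip() could be chained on if that part were needed.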
More information on extracting parts of strings is available in the user guide section on splitting and replacing strings.
Extract the passenger data about the countesses on board of the Titanic.
In [8]: titanic["Name"].str.contains("Countess")
In [9]: titanic[titanic["Name"].str.contains("Countess")]
(Interested in her story? See Wikipedia!)
The string method Series.str.contains() checks for each of the values in the column Name if the string contains the word Countess and returns for each of the values True (Countess is part of the name) or False (Countess is not part of the name). This output can be used to subselect the data using conditional (boolean) indexing introduced in the subsetting of data tutorial. As there was only one countess on the Titanic, we get one row as a result.
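The two steps (build a boolean mask, then use it to filter rows) can be sketched on a small hand-typed table. The three rows and the name `sample` are illustrative stand-ins for the full data set.

```python
import pandas as pd

# Hand-typed three-row sample standing in for the full table.
sample = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
            "Heikkinen, Miss. Laina",
        ],
        "Pclass": [3, 1, 3],
    }
)

# Boolean mask: True where the name contains the substring "Countess".
is_countess = sample["Name"].str.contains("Countess")

# Conditional indexing: keep only the rows where the mask is True.
countesses = sample[is_countess]

print(is_countess.tolist())
print(countesses)
```

If the column contained missing values, the mask would contain missing values as well; passing `na=False` to Series.str.contains() treats such rows as non-matches.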
Note
More powerful extractions on strings are supported, as the Series.str.contains() and Series.str.extract() methods accept regular expressions, but these are out of scope of this tutorial.
More information on extracting parts of strings is available in the user guide section on string matching and extracting.
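For readers who want a first impression of the regular-expression support mentioned in the note, here is a minimal sketch. The pattern is an assumption written for this hand-typed sample (it captures the word directly before the first period after the comma, skipping an optional "the"); it is not taken from the tutorial and may not cover every name in the full data set.

```python
import pandas as pd

# Hand-typed sample in the style of the Name column.
names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    ]
)

# Illustrative pattern: comma, optional whitespace, optional "the ",
# then one captured word followed by a literal period.
title_pattern = r",\s*(?:the\s+)?(\w+)\."

# expand=False returns a Series when the pattern has a single capture group.
title = names.str.extract(title_pattern, expand=False)

# contains() also accepts a regular expression; here "Miss." or "Countess.".
is_miss_or_countess = names.str.contains(r"Miss\.|Countess\.")

print(title.tolist())
print(is_miss_or_countess.tolist())
```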
Which passenger of the Titanic has the longest name?
In [10]: titanic["Name"].str.len()
To get the longest name we first have to get the lengths of each of the names in the Name column. By using pandas string methods, the Series.str.len() function is applied to each of the names individually (element-wise).
In [11]: titanic["Name"].str.len().idxmax()
Next, we need to get the corresponding location, preferably the index label, in the table for which the name length is the largest. The idxmax() method does exactly that. It is not a string method and is applied to integers, so no str is used.
In [12]: titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
Based on the index name of the row (307) and the column (Name), we can do a selection using the loc operator, introduced in the tutorial on subsetting.
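The three steps (lengths, label of the maximum, label-based selection) can be sketched on a hand-typed sample. The index labels 10, 20 and 30 are arbitrary and chosen only to make visible that idxmax() returns an index label rather than a position.

```python
import pandas as pd

# Hand-typed sample with deliberately non-default index labels.
sample = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Heikkinen, Miss. Laina",
            "Allen, Mr. William Henry",
        ]
    },
    index=[10, 20, 30],
)

# Step 1: string method, gives the number of characters per name.
lengths = sample["Name"].str.len()

# Step 2: ordinary Series method on integers, gives the label of the largest value.
longest_label = lengths.idxmax()

# Step 3: label-based selection of that row and the Name column.
longest_name = sample.loc[longest_label, "Name"]

print(longest_label, longest_name)
```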
In the “Sex” column, replace values of “male” by “M” and values of “female” by “F”.
In [13]: titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})
In [14]: titanic["Sex_short"]
Whereas replace() is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. It requires a dictionary to define the mapping {from : to}.
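A minimal sketch of the dictionary-based replacement on a hand-typed Series. The value "unknown" is hypothetical and does not occur in the Sex column; it is included only to show what happens to values that are missing from the mapping, and to contrast replace() with Series.map().

```python
import pandas as pd

# Hand-typed sample; "unknown" is a hypothetical value used for illustration.
sex = pd.Series(["male", "female", "unknown"])

mapping = {"male": "M", "female": "F"}

# replace() matches whole values and leaves values outside the mapping unchanged.
sex_short = sex.replace(mapping)

# map() also applies the dictionary, but turns unmapped values into missing values.
sex_mapped = sex.map(mapping)

print(sex_short.tolist())
print(sex_mapped.isna().tolist())
```

Which behaviour is preferable depends on whether unmapped values should pass through silently or be flagged as missing.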
Warning
There is also a Series.str.replace() method available to replace a specific set of characters within each string. However, when having a mapping of multiple values, this would become:
titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")
This would become cumbersome and easily lead to mistakes. Just think (or try out yourself) what would happen if those two statements are applied in the opposite order…
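The effect of the ordering can be checked on a two-element hand-typed Series (variable names are illustrative). Because "male" is a substring of "female", replacing "male" first corrupts the "female" values before the second statement can match them.

```python
import pandas as pd

sex = pd.Series(["male", "female"])

# Order used above: "female" first, then "male".
in_order = sex.str.replace("female", "F").str.replace("male", "M")

# Opposite order: "female" becomes "feM" in the first step,
# so the second replacement no longer finds "female".
reversed_order = sex.str.replace("male", "M").str.replace("female", "F")

print(in_order.tolist())        # ['M', 'F']
print(reversed_order.tolist())  # ['M', 'feM']
```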
REMEMBER
String methods are available using the str accessor.
String methods work element-wise and can be used for conditional indexing.
The replace method is a convenient method to convert values according to a given dictionary.
A full overview is provided in the user guide pages on working with text data.