In [1]: import pandas as pd
- Titanic data
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
PassengerId: Id of every passenger.
Survived: Indication whether passenger survived.
0
for yes and1
for no.Pclass: One out of the 3 ticket classes: Class
1
, Class2
and Class3
.Name: Name of passenger.
Sex: Gender of passenger.
Age: Age of passenger in years.
SibSp: Number of siblings or spouses aboard.
Parch: Number of parents or children aboard.
Ticket: Ticket number of passenger.
Fare: Indicating the fare.
Cabin: Cabin number of passenger.
Embarked: Port of embarkation.
How do I read and write tabular data?¶
I want to analyze the Titanic passenger data, available as a CSV file.
In [2]: titanic = pd.read_csv("data/titanic.csv") --------------------------------------------------------------------------- FileNotFoundError Traceback (most recent call last) <ipython-input-2-4f6129a55af7> in <module> ----> 1 titanic = pd.read_csv("data/titanic.csv") /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs) 209 else: 210 kwargs[new_arg_name] = new_arg_value --> 211 return func(*args, **kwargs) 212 213 return cast(F, wrapper) /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs) 329 stacklevel=find_stack_level(), 330 ) --> 331 return func(*args, **kwargs) 332 333 # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options) 948 kwds.update(kwds_defaults) 949 --> 950 return _read(filepath_or_buffer, kwds) 951 952 /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds) 603 604 # Create the parser. --> 605 parser = TextFileReader(filepath_or_buffer, **kwds) 606 607 if chunksize or iterator: /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds) 1440 1441 self.handles: IOHandles | None = None -> 1442 self._engine = self._make_engine(f, self.engine) 1443 1444 def close(self) -> None: /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine) 1733 if "b" not in mode: 1734 mode += "b" -> 1735 self.handles = get_handle( 1736 f, 1737 mode, /usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) 854 if ioargs.encoding and "b" not in ioargs.mode: 855 # Encoding --> 856 handle = open( 857 handle, 858 ioargs.mode, FileNotFoundError: [Errno 2] No such file or directory: 'data/titanic.csv'
pandas provides the
read_csv()
function to read data stored as a csv file into a pandasDataFrame
. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefixread_*
.
Make sure to always have a check on the data after reading in the
data. When displaying a DataFrame
, the first and last 5 rows will be
shown by default:
In [3]: titanic
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-3-5be31ed7efa5> in <module>
----> 1 titanic
NameError: name 'titanic' is not defined
I want to see the first 8 rows of a pandas DataFrame.
In [4]: titanic.head(8) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-4-5547fc3161fd> in <module> ----> 1 titanic.head(8) NameError: name 'titanic' is not defined
To see the first N rows of a
DataFrame
, use thehead()
method with the required number of rows (in this case 8) as argument.
Note
Interested in the last N rows instead? pandas also provides a
tail()
method. For example, titanic.tail(10)
will return the last
10 rows of the DataFrame.
A check on how pandas interpreted each of the column data types can be
done by requesting the pandas dtypes
attribute:
In [5]: titanic.dtypes
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-2ca0d3b62096> in <module>
----> 1 titanic.dtypes
NameError: name 'titanic' is not defined
For each of the columns, the used data type is enlisted. The data types
in this DataFrame
are integers (int64
), floats (float64
) and
strings (object
).
Note
When asking for the dtypes
, no brackets are used!
dtypes
is an attribute of a DataFrame
and Series
. Attributes
of DataFrame
or Series
do not need brackets. Attributes
represent a characteristic of a DataFrame
/Series
, whereas a
method (which requires brackets) do something with the
DataFrame
/Series
as introduced in the first tutorial.
My colleague requested the Titanic data as a spreadsheet.
In [6]: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False) --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-6-ef3e997cc15f> in <module> ----> 1 titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False) NameError: name 'titanic' is not defined
Whereas
read_*
functions are used to read data to pandas, theto_*
methods are used to store data. Theto_excel()
method stores the data as an excel file. In the example here, thesheet_name
is named passengers instead of the default Sheet1. By settingindex=False
the row index labels are not saved in the spreadsheet.
The equivalent read function read_excel()
will reload the data to a
DataFrame
:
In [7]: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-7-081526be361d> in <module>
----> 1 titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
/usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
209 else:
210 kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)
212
213 return cast(F, wrapper)
/usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)
332
333 # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no
/usr/lib/python3/dist-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, decimal, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
480 if not isinstance(io, ExcelFile):
481 should_close = True
--> 482 io = ExcelFile(io, storage_options=storage_options, engine=engine)
483 elif engine and engine != io.engine:
484 raise ValueError(
/usr/lib/python3/dist-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
1650 ext = "xls"
1651 else:
-> 1652 ext = inspect_excel_format(
1653 content_or_path=path_or_buffer, storage_options=storage_options
1654 )
/usr/lib/python3/dist-packages/pandas/io/excel/_base.py in inspect_excel_format(content_or_path, storage_options)
1523 content_or_path = BytesIO(content_or_path)
1524
-> 1525 with get_handle(
1526 content_or_path, "rb", storage_options=storage_options, is_text=False
1527 ) as handle:
/usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
863 else:
864 # Binary mode
--> 865 handle = open(handle, ioargs.mode)
866 handles.append(handle)
867
FileNotFoundError: [Errno 2] No such file or directory: 'titanic.xlsx'
In [8]: titanic.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-8-cdd760b19866> in <module>
----> 1 titanic.head()
NameError: name 'titanic' is not defined
I’m interested in a technical summary of a
DataFrame
In [9]: titanic.info() --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-9-d013d6c494b6> in <module> ----> 1 titanic.info() NameError: name 'titanic' is not defined
The method
info()
provides technical information about aDataFrame
, so let’s explain the output in more detail:It is indeed a
DataFrame
.There are 891 entries, i.e. 891 rows.
Each row has a row label (aka the
index
) with values ranging from 0 to 890.The table has 12 columns. Most columns have a value for each of the rows (all 891 values are
non-null
). Some columns do have missing values and less than 891non-null
values.The columns
Name
,Sex
,Cabin
andEmbarked
consists of textual data (strings, akaobject
). The other columns are numerical data with some of them whole numbers (akainteger
) and others are real numbers (akafloat
).The kind of data (characters, integers,…) in the different columns are summarized by listing the
dtypes
.The approximate amount of RAM used to hold the DataFrame is provided as well.
REMEMBER
Getting data in to pandas from many different file formats or data sources is supported by
read_*
functions.Exporting data out of pandas is provided by different
to_*
methods.The
head
/tail
/info
methods and thedtypes
attribute are convenient for a first check.
For a complete overview of the input and output possibilities from and to pandas, see the user guide section about reader and writer functions.