In [1]: import pandas as pd
Data used for this tutorial:
  • This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:

    • PassengerId: Id of every passenger.

    • Survived: Indication whether passenger survived. 0 for yes and 1 for no.

    • Pclass: One out of the 3 ticket classes: Class 1, Class 2 and Class 3.

    • Name: Name of passenger.

    • Sex: Gender of passenger.

    • Age: Age of passenger in years.

    • SibSp: Number of siblings or spouses aboard.

    • Parch: Number of parents or children aboard.

    • Ticket: Ticket number of passenger.

    • Fare: Indicating the fare.

    • Cabin: Cabin number of passenger.

    • Embarked: Port of embarkation.

    To raw data

How do I read and write tabular data?

../../_images/02_io_readwrite.svg
  • I want to analyze the Titanic passenger data, available as a CSV file.

    In [2]: titanic = pd.read_csv("data/titanic.csv")
    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-2-4f6129a55af7> in <module>
    ----> 1 titanic = pd.read_csv("data/titanic.csv")
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        209                 else:
        210                     kwargs[new_arg_name] = new_arg_value
    --> 211             return func(*args, **kwargs)
        212 
        213         return cast(F, wrapper)
    
    /usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
        329                     stacklevel=find_stack_level(),
        330                 )
    --> 331             return func(*args, **kwargs)
        332 
        333         # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
        948     kwds.update(kwds_defaults)
        949 
    --> 950     return _read(filepath_or_buffer, kwds)
        951 
        952 
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
        603 
        604     # Create the parser.
    --> 605     parser = TextFileReader(filepath_or_buffer, **kwds)
        606 
        607     if chunksize or iterator:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
       1440 
       1441         self.handles: IOHandles | None = None
    -> 1442         self._engine = self._make_engine(f, self.engine)
       1443 
       1444     def close(self) -> None:
    
    /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine)
       1733                 if "b" not in mode:
       1734                     mode += "b"
    -> 1735             self.handles = get_handle(
       1736                 f,
       1737                 mode,
    
    /usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
        854         if ioargs.encoding and "b" not in ioargs.mode:
        855             # Encoding
    --> 856             handle = open(
        857                 handle,
        858                 ioargs.mode,
    
    FileNotFoundError: [Errno 2] No such file or directory: 'data/titanic.csv'
    

    pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, …), each of them with the prefix read_*.

Make sure to always have a check on the data after reading in the data. When displaying a DataFrame, the first and last 5 rows will be shown by default:

In [3]: titanic
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-5be31ed7efa5> in <module>
----> 1 titanic

NameError: name 'titanic' is not defined
  • I want to see the first 8 rows of a pandas DataFrame.

    In [4]: titanic.head(8)
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-4-5547fc3161fd> in <module>
    ----> 1 titanic.head(8)
    
    NameError: name 'titanic' is not defined
    

    To see the first N rows of a DataFrame, use the head() method with the required number of rows (in this case 8) as argument.

Note

Interested in the last N rows instead? pandas also provides a tail() method. For example, titanic.tail(10) will return the last 10 rows of the DataFrame.

A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes attribute:

In [5]: titanic.dtypes
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-2ca0d3b62096> in <module>
----> 1 titanic.dtypes

NameError: name 'titanic' is not defined

For each of the columns, the used data type is enlisted. The data types in this DataFrame are integers (int64), floats (float64) and strings (object).

Note

When asking for the dtypes, no brackets are used! dtypes is an attribute of a DataFrame and Series. Attributes of DataFrame or Series do not need brackets. Attributes represent a characteristic of a DataFrame/Series, whereas a method (which requires brackets) do something with the DataFrame/Series as introduced in the first tutorial.

  • My colleague requested the Titanic data as a spreadsheet.

    In [6]: titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-6-ef3e997cc15f> in <module>
    ----> 1 titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)
    
    NameError: name 'titanic' is not defined
    

    Whereas read_* functions are used to read data to pandas, the to_* methods are used to store data. The to_excel() method stores the data as an excel file. In the example here, the sheet_name is named passengers instead of the default Sheet1. By setting index=False the row index labels are not saved in the spreadsheet.

The equivalent read function read_excel() will reload the data to a DataFrame:

In [7]: titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-7-081526be361d> in <module>
----> 1 titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")

/usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    209                 else:
    210                     kwargs[new_arg_name] = new_arg_value
--> 211             return func(*args, **kwargs)
    212 
    213         return cast(F, wrapper)

/usr/lib/python3/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    329                     stacklevel=find_stack_level(),
    330                 )
--> 331             return func(*args, **kwargs)
    332 
    333         # error: "Callable[[VarArg(Any), KwArg(Any)], Any]" has no

/usr/lib/python3/dist-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, decimal, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
    480     if not isinstance(io, ExcelFile):
    481         should_close = True
--> 482         io = ExcelFile(io, storage_options=storage_options, engine=engine)
    483     elif engine and engine != io.engine:
    484         raise ValueError(

/usr/lib/python3/dist-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
   1650                 ext = "xls"
   1651             else:
-> 1652                 ext = inspect_excel_format(
   1653                     content_or_path=path_or_buffer, storage_options=storage_options
   1654                 )

/usr/lib/python3/dist-packages/pandas/io/excel/_base.py in inspect_excel_format(content_or_path, storage_options)
   1523         content_or_path = BytesIO(content_or_path)
   1524 
-> 1525     with get_handle(
   1526         content_or_path, "rb", storage_options=storage_options, is_text=False
   1527     ) as handle:

/usr/lib/python3/dist-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    863         else:
    864             # Binary mode
--> 865             handle = open(handle, ioargs.mode)
    866         handles.append(handle)
    867 

FileNotFoundError: [Errno 2] No such file or directory: 'titanic.xlsx'
In [8]: titanic.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-cdd760b19866> in <module>
----> 1 titanic.head()

NameError: name 'titanic' is not defined
  • I’m interested in a technical summary of a DataFrame

    In [9]: titanic.info()
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-9-d013d6c494b6> in <module>
    ----> 1 titanic.info()
    
    NameError: name 'titanic' is not defined
    

    The method info() provides technical information about a DataFrame, so let’s explain the output in more detail:

    • It is indeed a DataFrame.

    • There are 891 entries, i.e. 891 rows.

    • Each row has a row label (aka the index) with values ranging from 0 to 890.

    • The table has 12 columns. Most columns have a value for each of the rows (all 891 values are non-null). Some columns do have missing values and less than 891 non-null values.

    • The columns Name, Sex, Cabin and Embarked consists of textual data (strings, aka object). The other columns are numerical data with some of them whole numbers (aka integer) and others are real numbers (aka float).

    • The kind of data (characters, integers,…) in the different columns are summarized by listing the dtypes.

    • The approximate amount of RAM used to hold the DataFrame is provided as well.

REMEMBER

  • Getting data in to pandas from many different file formats or data sources is supported by read_* functions.

  • Exporting data out of pandas is provided by different to_*methods.

  • The head/tail/info methods and the dtypes attribute are convenient for a first check.

To user guide

For a complete overview of the input and output possibilities from and to pandas, see the user guide section about reader and writer functions.