In [1]: import pandas as pd
- Air quality data
For this tutorial, air quality data about \(NO_2\) is used, made available by OpenAQ and retrieved using the py-openaq package. The data set provides \(NO_2\) values for the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp and London, respectively.
Raw data: air_quality_no2.csv
In [2]: air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)
In [3]: air_quality.head()
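If the CSV file is not available locally, the same read_csv call can be tried on a small in-memory sample. The sketch below assumes a layout with a datetime column followed by the three station columns used later in this tutorial (station_antwerp, station_paris, station_london). The values are invented for illustration and are not real measurements.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the assumed layout of air_quality_no2.csv.
# The numbers are made up for illustration only.
csv_text = """datetime,station_antwerp,station_paris,station_london
2019-05-07 02:00:00,10.0,20.0,30.0
2019-05-07 03:00:00,40.0,25.0,19.0
2019-05-07 04:00:00,,27.5,19.0
"""

# index_col=0 uses the first column as the row index;
# parse_dates=True asks pandas to parse that index as datetimes.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(sample.index.dtype)
print(sample.head())
```

The empty field in the last row is read as a missing value (NaN), which is how gaps in a measurement series typically show up after loading.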
How to create new columns derived from existing columns?
I want to express the \(NO_2\) concentration of the station in London in mg/m\(^3\).
(If we assume temperature of 25 degrees Celsius and pressure of 1013 hPa, the conversion factor is 1.882)
In [4]: air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
In [5]: air_quality.head()
To create a new column, use the [] brackets with the new column name on the left side of the assignment.
Note
The calculation of the values is done element-wise. This means all values in the given column are multiplied by the value 1.882 at once. You do not need to use a loop to iterate over each of the rows!
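As a minimal sketch of the element-wise behaviour, the following uses a tiny DataFrame with made-up values (not the tutorial data) and compares the vectorized assignment with an explicit loop that produces the same numbers.

```python
import pandas as pd

# Made-up values, used only to illustrate the mechanics.
df = pd.DataFrame({"station_london": [10.0, 20.0, 30.0]})

# Vectorized: the whole column is multiplied in a single operation
# and the result is stored under a new column name.
df["london_mg_per_cubic"] = df["station_london"] * 1.882

# Equivalent explicit loop, shown only for comparison.
looped = [value * 1.882 for value in df["station_london"]]

print(df)
```

The loop gives the same values, but the vectorized form is shorter and is the idiomatic way to derive a column in pandas.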
I want to check the ratio of the values in Paris versus Antwerp and save the result in a new column.
In [6]: air_quality["ratio_paris_antwerp"] = (
   ...:     air_quality["station_paris"] / air_quality["station_antwerp"]
   ...: )
   ...:
In [7]: air_quality.head()
The calculation is again element-wise, so the / is applied to the values in each row.
Other mathematical operators (+, -, *, /, …) and logical operators (<, >, ==, …) also work element-wise. The latter were already used in the subset data tutorial to filter rows of a table using a conditional expression.
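To make this concrete, here is a small sketch with invented values showing arithmetic and comparison operators applied row by row. The comparison returns a boolean Series, which can then be used to filter rows.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame(
    {
        "station_paris": [20.0, 25.0, 30.0],
        "station_antwerp": [10.0, 50.0, 30.0],
    }
)

# Arithmetic operators combine the two columns row by row.
ratio = df["station_paris"] / df["station_antwerp"]
difference = df["station_paris"] - df["station_antwerp"]

# Comparison operators return a boolean Series, one value per row.
paris_higher = df["station_paris"] > df["station_antwerp"]

# The boolean Series selects the rows where the condition holds.
print(df[paris_higher])
```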
If you need more advanced logic, you can use arbitrary Python code via apply().
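A possible use of apply() is sketched below on a single column (Series.apply). The classify helper and its threshold of 40.0 are hypothetical, chosen only to show branching logic that a plain operator cannot express; the data values are invented.

```python
import pandas as pd

# Invented values, including one missing measurement.
df = pd.DataFrame({"station_london": [15.0, 45.0, float("nan")]})


def classify(value, threshold=40.0):
    """Hypothetical helper: label one measurement (threshold is arbitrary)."""
    if pd.isna(value):
        return "missing"
    return "high" if value > threshold else "low"


# Series.apply calls the function once for every value in the column.
df["london_level"] = df["station_london"].apply(classify)

print(df)
```

Because apply() runs Python code per value, it is usually slower than the built-in element-wise operators, so it is best kept for logic that those operators cannot handle.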
I want to rename the data columns to the corresponding station identifiers used by OpenAQ.
In [8]: air_quality_renamed = air_quality.rename(
   ...:     columns={
   ...:         "station_antwerp": "BETR801",
   ...:         "station_paris": "FR04014",
   ...:         "station_london": "London Westminster",
   ...:     }
   ...: )
   ...:
In [9]: air_quality_renamed.head()
The rename() function can be used for both row labels and column labels. Provide a dictionary whose keys are the current names and whose values are the new names to update the corresponding labels.
The mapping is not restricted to fixed names; it can also be a mapping function. For example, the column names can be converted to lowercase letters by passing a function:
In [10]: air_quality_renamed = air_quality_renamed.rename(columns=str.lower)
In [11]: air_quality_renamed.head()
Details about column or row label renaming are provided in the user guide section on renaming labels.
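As an illustrative sketch, the snippet below renames row labels with a dictionary and column labels with a function in a single rename() call. The DataFrame, its index labels and its values are invented for the example.

```python
import pandas as pd

# Invented data with simple string row labels.
df = pd.DataFrame(
    {"station_paris": [20.0, 25.0], "station_london": [30.0, 19.0]},
    index=["first", "second"],
)

renamed = df.rename(
    index={"first": "row_1"},  # dictionary: only the listed labels change
    columns=str.upper,         # function: applied to every column label
)

print(renamed)
```

rename() returns a new DataFrame by default, so df keeps its original labels unless the result is assigned back to it.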
REMEMBER
- Create a new column by assigning the output to the DataFrame with a new column name in between the [].
- Operations are element-wise, no need to loop over rows.
- Use rename with a dictionary or function to rename row labels or column names.
The user guide contains a separate section on column addition and deletion.