When using Pandas in production environments for high-performance tabular data manipulation, we want to get things right. In particular, we don't want to get screwed by silent and potentially deadly mechanisms like implicit type conversion.
A practical example
To illustrate the potential issues and suggested solutions, we'll walk through an example where we load and manipulate a simple dataframe with three columns and a few rows. The meaning of the data here is not relevant; we'll just focus on what happens to the data's types as we perform some operations on it.
First, let's import the needed packages and print some information about the versions we are using:
import numpy as np
import pandas as pd
import io
from platform import python_version
print('Python version:', python_version())
print('Pandas version:', pd.__version__)
print('NumPy version:', np.__version__)
and let's define a helper function that we'll use to print a per-column summary of memory usage and type:
def get_memory_and_type_summary(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        'memory_usage_bytes': df.memory_usage(deep=True, index=False),
        'dtype': df.dtypes
    })
This whole walkthrough builds on a simple idea: be explicit about your dataframe's schema:
SCHEMA = {
    'a': np.float32,
    'b': np.int32,
    'c': "string"
}
pd.read_csv has our back here, since it lets us specify a dtype argument with a column-to-dtype mapping, a.k.a. the schema:
csv_str = """
1.0,1,1
2.0,2,2
3.0,3,3
"""
df = pd.read_csv(io.StringIO(csv_str), names=list('abc'), dtype=SCHEMA)
df
As expected, the resulting columns adhere to the schema:
df.dtypes
The pd.DataFrame constructor doesn't accept a schema, just a single dtype. In this case you have to perform two steps:
- instantiate the dataframe with df = pd.DataFrame(...),
- cast it to the desired schema using df.astype(schema, errors='raise').
data = [
    [1.0, 1, "1"],
    [2.0, 2, "2"],
    [3.0, 3, "3"]
]
another_df = pd.DataFrame(data=data, columns=list('abc'))
get_memory_and_type_summary(another_df)
Note that Pandas has done a fairly good job at inferring the types here. The issue is that the numerical types have higher precision than needed (see SCHEMA above) and the text column (c) is not explicitly string. So let's fix that:
another_df = another_df.astype(SCHEMA, errors='raise')
get_memory_and_type_summary(another_df)
Note: as in read_csv, BigQuery's to_dataframe also supports schema definition via its dtype argument. See the docs.
So far we have just loaded data; let's do some operations on it!
Interlude: srsly, what's the big deal?
If you still don't see the full utility of controlling the types of your dataframe, maybe Ian Ozsvald's presentation at EuroPython'20 will convince you.
TL;DR by carefully selecting your dtypes you can potentially improve your memory usage by several orders of magnitude.
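To make this concrete, here is a small illustrative sketch (the column names and values below are invented, not taken from the talk): the same data stored with default-ish dtypes versus deliberately chosen ones, compared with the helper we defined earlier.
n = 100_000
defaults = pd.DataFrame({
    'flag': pd.Series([0, 1] * (n // 2), dtype=np.int64),                 # wide integer
    'city': pd.Series(['Amsterdam', 'Berlin'] * (n // 2), dtype=object),  # plain Python strings
})
tuned = pd.DataFrame({
    'flag': defaults['flag'].astype(np.int8),      # 1 byte per value instead of 8
    'city': defaults['city'].astype('category'),   # repeated labels stored once
})
print(get_memory_and_type_summary(defaults))
print(get_memory_and_type_summary(tuned))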
Appending rows
You guessed right: adding new data to the dataframe can (and will) mess up your schema if not done carefully.
In order to add/append rows to an existing dataframe, Pandas' docs suggest using append. Let's assume that we receive a new_record from elsewhere in our system, and we have to update the table:
new_record = [4.0, None, 4]
As the documentation suggests, we'll wrap the new object in a pd.Series with the appropriate indices (mapping to the destination columns):
df_1 = df.append(pd.Series(new_record, index=list('abc')), ignore_index=True)
df_1
It works, but we just busted our schema:
get_memory_and_type_summary(df_1)
Our b column (originally int32) received a None value, which Pandas converted to np.nan, making the column become float64. The dtypes for a and c have also changed (float32 to float64, and string to object, respectively). What a mess!
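The upcasting mechanism is easy to reproduce in isolation; here is a quick illustrative check (not from the original walkthrough):
# None among integers becomes np.nan, which forces a float64 column
print(pd.Series([1, 2, None]).dtype)      # float64
# without an explicit dtype, text data falls back to object rather than 'string'
print(pd.Series(["1", "2", None]).dtype)  # object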
A solution: enforce the schema on the new records
How? A good way would be to build an auxiliary dataframe with the new entries, and apply the schema to it before appending.
df_to_append = pd.DataFrame(data=[new_record,], columns=list('abc'))
df_to_append
df_to_append.dtypes
try:
    df_to_append = df_to_append.astype(SCHEMA, errors='raise')
except Exception as exc:
    print("Type conversion failed:", exc)
Oops! Our new entry contains a null b value, and neither None nor np.nan can be converted to int32, so this raises an exception.
And this is a good thing, because at this stage either:
- the incoming field shouldn't be null, and we just applied validation
- the schema is actually wrong: the b column should be nullable.
In the former case, we're done: the record is not valid, we notify the system (for instance by raising an error), and that's all.
In the latter, it's our schema that we should modify, making our b column nullable. But np.nan is indeed a float, so it seems like a dead end.
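(If you want to convince yourself that np.nan really is just a float:)
print(type(np.nan))              # <class 'float'>
print(isinstance(np.nan, float)) # True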
Pandas' Nullable Integer Type
There is built-in support for nullable integer types in Pandas through so-called extension types. Here, we'll use pd.Int32Dtype:
CORRECT_SCHEMA = {
    'a': np.float32,
    'b': pd.Int32Dtype(),
    'c': 'string'
}
df = pd.read_csv(
    io.StringIO(csv_str),
    names=list('abc'),
    dtype=CORRECT_SCHEMA
)
df
dtypes_before = df.dtypes
dtypes_before
Note the capital I in Int32.
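As a quick illustration of the difference (not part of the original example), the capitalized extension dtype can hold a missing value, while the plain NumPy int32 cannot:
# nullable extension dtype: missing values are represented as pd.NA
print(pd.Series([1, None], dtype='Int32'))
# with dtype='int32' the same construction would fail, since NumPy integers
# have no way to represent a missing value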
Now, let's finish what we started:
- Process the new entry as a new dataframe with an enforced schema:
df_to_append = pd.DataFrame(data=[new_record,], columns=list('abc'))
df_to_append = df_to_append.astype(CORRECT_SCHEMA, errors='raise')
- Append the dataframes:
df = df.append(df_to_append, ignore_index=True)
df
Note that the missing value is now not np.nan but the new pd.NA.
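You can check it directly (the appended row sits at positional index 3 in this example):
print(df.loc[3, 'b'])           # <NA>
print(df.loc[3, 'b'] is pd.NA)  # True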
df.dtypes == dtypes_before
As we see, the schema after the append operation is still correct!
You could nicely wrap it as follows:
from typing import Sequence, Any, Mapping
def append_records_to_df(df: pd.DataFrame, records: Sequence[Any], schema: Mapping[str, Any]) -> pd.DataFrame:
    records_df = pd.DataFrame(records, columns=df.columns).astype(schema, errors='raise')
    return df.append(records_df, ignore_index=True)
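For example, appending one more (hypothetical) record with a missing b value, using the CORRECT_SCHEMA defined above, would look like this; the result goes into a new variable so df stays untouched:
extra_record = [5.0, None, "5"]  # hypothetical new entry
df_extended = append_records_to_df(df, records=[extra_record], schema=CORRECT_SCHEMA)
df_extended.dtypes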
Appending new columns
The rationale here is similar: we want to be explicit about the schema declaration. This case is a bit simpler, since each new column has a single type. Using our previous dataframe example:
df
It would suffice to parse the new data as a pd.Series with a predefined dtype:
df['new_column_1'] = pd.Series([1, 2, 3, 4], dtype=np.float16)
df
df.dtypes
For constant value assignments, I find it safer to use a construct like the following:
df['constant'] = pd.Series([True, ] * len(df), dtype=np.bool_)
df
df.dtypes
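If you need to add several columns at once, an illustrative option (not from the original post) is to build them as typed Series, bundle them into a dataframe, and concatenate:
# hypothetical new columns, each with an explicit dtype
new_columns = pd.DataFrame({
    'd': pd.Series([10, 20, 30, 40], dtype=np.int16),
    'e': pd.Series(['w', 'x', 'y', 'z'], dtype='string'),
})
df_wide = pd.concat([df, new_columns], axis=1)
df_wide.dtypes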
Conclusion
- Know your dataframe schema (column dtypes) beforehand. Among other things, this will help you verify the integrity of your dataframe.
- Many operations, from initial loading to appending, can mess up your dataframe's schema.
- Use pd.DataFrame.append to add new rows. Parse your new rows/entries as dataframes and enforce the target schema before appending.
- Use pd.Series with a predefined dtype for new columns (or aggregate several of these into a new dataframe if needed).
- Remember that Pandas now supports nullable types beyond float, e.g. nullable integers.
You can download the notebook used in this post here.