When using Pandas in production environments for high-performance tabular data manipulation, we want to get things right. In particular, we don't want to get screwed by silent and potentially deadly mechanisms like implicit type conversion.
A practical example
To illustrate the potential issues and suggested solutions, we'll walk through an example where we load and manipulate a simple dataframe with three columns and a few rows. The meaning of the data here is not relevant; we'll just focus on what happens to the data's types as we perform some operations on it.
First, let's import the needed packages and print some information about the versions we are using:
import numpy as np
import pandas as pd
import io
from platform import python_version
print('Python version:', python_version())
print('Pandas version:', pd.__version__)
print('NumPy version:', np.__version__)
and let's define a helper function that we'll use to print a per-column summary of memory usage and type:
def get_memory_and_type_summary(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        'memory_usage_bytes': df.memory_usage(deep=True, index=False),
        'dtype': df.dtypes
    })
This whole walkthrough builds on a simple idea: be explicit about your dataframe's schema:
SCHEMA = {
    'a': np.float32,
    'b': np.int32,
    'c': "string"
}
pd.read_csv has our back here, since it lets us specify a dtype argument with a column-to-dtype mapping, a.k.a. the schema:
csv_str = """
1.0,1,1
2.0,2,2
3.0,3,3
"""
df = pd.read_csv(io.StringIO(csv_str), names=list('abc'), dtype=SCHEMA)
df
As expected, the resulting columns adhere to the schema:
df.dtypes
The pd.DataFrame constructor doesn't accept a schema, just a single dtype. In this case you have to perform two steps:
- instantiate the dataframe with df = pd.DataFrame(...),
- cast it to the desired schema using df.astype(schema, errors='raise').
data = [
    [1.0, 1, "1"],
    [2.0, 2, "2"],
    [3.0, 3, "3"]
]
another_df = pd.DataFrame(data=data, columns=list('abc'))
get_memory_and_type_summary(another_df)
Note that Pandas has done a fairly good job at inferring the types here. The issue is that the numerical types have higher precision than needed (see SCHEMA above) and the text column (c) is not explicitly string. So let's fix that:
another_df = another_df.astype(SCHEMA, errors='raise')
get_memory_and_type_summary(another_df)
Note: as in read_csv, BigQuery's to_dataframe also supports schema definition via its dtype argument. See the docs.
So far we have just loaded data; let's do some operations on it!
Interlude: srsly, what's the big deal?
If you still don't see the full utility of controlling the types of your dataframe, maybe Ian Ozsvald's presentation at EuroPython'20 will convince you.
TL;DR by carefully selecting your dtypes you can potentially improve your memory usage by several orders of magnitude.
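To make this concrete, here is a small illustrative sketch (the column names and values below are invented, not taken from the talk): the same data stored with default-ish dtypes versus deliberately chosen ones, compared with the helper we defined earlier.
n = 100_000
defaults = pd.DataFrame({
    'flag': pd.Series([0, 1] * (n // 2), dtype=np.int64),                 # wide integer
    'city': pd.Series(['Amsterdam', 'Berlin'] * (n // 2), dtype=object),  # plain Python strings
})
tuned = pd.DataFrame({
    'flag': defaults['flag'].astype(np.int8),      # 1 byte per value instead of 8
    'city': defaults['city'].astype('category'),   # repeated labels stored once
})
print(get_memory_and_type_summary(defaults))
print(get_memory_and_type_summary(tuned))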
Appending rows
You guessed right: adding new data to the dataframe can (and will) mess up your schema if not done carefully.
In order to add/append rows to an existing dataframe, Pandas' docs suggest using append. Let's assume that we receive a new_record from elsewhere in our system, and we have to update the table:
new_record = [4.0, None, 4]
As the documentation suggests, we'll wrap the new object in a pd.Series with the appropriate indices (mapping to the destination columns):
df_1 = df.append(pd.Series(new_record, index=list('abc')), ignore_index=True)
df_1
It works, but we just busted our schema:
get_memory_and_type_summary(df_1)
Our b column (originally int32) received a None value, which Pandas converted to np.nan, making the column become float64. The dtypes for a and c have also changed (float32 to float64, and string to object, respectively). What a mess!
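The upcasting mechanism is easy to reproduce in isolation; here is a quick illustrative check (not from the original walkthrough):
# None among integers becomes np.nan, which forces a float64 column
print(pd.Series([1, 2, None]).dtype)      # float64
# without an explicit dtype, text data falls back to object rather than 'string'
print(pd.Series(["1", "2", None]).dtype)  # object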
A solution: enforce the schema on the new records
How? A good way would be to build an auxiliary dataframe with the new entries, and apply the schema to it before appending.
df_to_append = pd.DataFrame(data=[new_record,], columns=list('abc'))
df_to_append
df_to_append.dtypes
try:
    df_to_append = df_to_append.astype(SCHEMA, errors='raise')
except Exception as exc:
    print("Type conversion failed:", exc)
Oops! Our new entry contains a null b value, and neither None nor np.nan can be converted to int32, so this raises an exception.
And this is a good thing, because at this stage either:
- the incoming field shouldn't be null, and we just applied validation
- the schema is actually wrong: the b column should be nullable.
In the former case, we're done: the record is not valid, we notify the system (for instance by raising an error), and that's all.
In the latter, it's our schema that we should modify, making our b column nullable. But np.nan is indeed a float, so it seems like a dead end.
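(If you want to convince yourself that np.nan really is just a float:)
print(type(np.nan))              # <class 'float'>
print(isinstance(np.nan, float)) # True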
Pandas' Nullable Integer Type
There is built-in support for nullable integer types in Pandas through so-called extension types. Here, we'll use pd.Int32Dtype:
CORRECT_SCHEMA = {
    'a': np.float32,
    'b': pd.Int32Dtype(),
    'c': 'string'
}
df = pd.read_csv(
    io.StringIO(csv_str),
    names=list('abc'),
    dtype=CORRECT_SCHEMA
)
df
dtypes_before = df.dtypes
dtypes_before
Note the capital I in Int32.
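As a quick illustration of the difference (not part of the original example), the capitalized extension dtype can hold a missing value, while the plain NumPy int32 cannot:
# nullable extension dtype: missing values are represented as pd.NA
print(pd.Series([1, None], dtype='Int32'))
# with dtype='int32' the same construction would fail, since NumPy integers
# have no way to represent a missing value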
Now, let's finish what we started:
- Process the new entry as a new dataframe with an enforced schema:
df_to_append = pd.DataFrame(data=[new_record,], columns=list('abc'))
df_to_append = df_to_append.astype(CORRECT_SCHEMA, errors='raise')
- Append the dataframes:
df = df.append(df_to_append, ignore_index=True)
df
Note that the missing value is now not np.nan but the new pd.NA.
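You can check it directly (the appended row sits at positional index 3 in this example):
print(df.loc[3, 'b'])           # <NA>
print(df.loc[3, 'b'] is pd.NA)  # True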
df.dtypes == dtypes_before
As we see, the schema after the append operation is still correct!
You could nicely wrap it as follows:
from typing import Sequence, Any, Mapping
def append_records_to_df(df: pd.DataFrame, records: Sequence[Any], schema: Mapping[str, Any]) -> pd.DataFrame:
    records_df = pd.DataFrame(records, columns=df.columns).astype(schema, errors='raise')
    return df.append(records_df, ignore_index=True)
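For example, appending one more (hypothetical) record with a missing b value, using the CORRECT_SCHEMA defined above, would look like this; the result goes into a new variable so df stays untouched:
extra_record = [5.0, None, "5"]  # hypothetical new entry
df_extended = append_records_to_df(df, records=[extra_record], schema=CORRECT_SCHEMA)
df_extended.dtypes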
Appending new columns
The rationale here is similar: we want to be explicit about the schema declaration. This case is a bit simpler, since each new column has a single type. Using our previous dataframe example:
df
It would suffice to parse the new data as a pd.Series with a predefined dtype:
df['new_column_1'] = pd.Series([1, 2, 3, 4], dtype=np.float16)
df
df.dtypes
For constant value assignments, I find it safer to use a construct like the following:
df['constant'] = pd.Series([True, ] * len(df), dtype=np.bool_)
df
df.dtypes
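If you need to add several columns at once, an illustrative option (not from the original post) is to build them as typed Series, bundle them into a dataframe, and concatenate:
# hypothetical new columns, each with an explicit dtype
new_columns = pd.DataFrame({
    'd': pd.Series([10, 20, 30, 40], dtype=np.int16),
    'e': pd.Series(['w', 'x', 'y', 'z'], dtype='string'),
})
df_wide = pd.concat([df, new_columns], axis=1)
df_wide.dtypes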
Conclusion
- Know your dataframe schema (column dtypes) beforehand. Among other things, this will help you verify the integrity of your dataframe.
- Many operations, from initial loading to appending, can mess up your dataframe's schema.
- Use pd.DataFrame.append to add new rows. Parse your new rows/entries as dataframes and enforce the target schema before appending.
- Use pd.Series with a predefined dtype for new columns (or aggregate several of these into a new dataframe if needed).
- Remember that Pandas now supports nullable types beyond float, e.g. nullable integers.
You can download the notebook used in this post here.