Working with Parquet Files in Pandas

Apache Parquet has become one of the de facto standards in modern data architecture. This open-source, columnar data format serves as the backbone of many high-powered analytics and machine learning pipelines and is supported by many of the world's most sophisticated platforms and services. AWS, Azure, and Google Cloud all offer built-in support for Parquet, and big data tools like Hadoop, Spark, Hive, and Databricks support it natively, allowing seamless data processing and analytics. Parquet is also foundational in data lakehouse formats like Delta Lake, Iceberg, and Hudi, where its features are further enhanced.

Parquet is efficient and has broad industry support. In this post, I will walk through a few simple techniques for working with Parquet and leveraging its special features using Pandas.

Python Dependencies

If you plan to follow along with the demos, only a couple of libraries are required. The versions I am using are:

  • pandas v2.2.2
  • pyarrow v16.1.0

When calling Parquet-specific methods from Pandas, you must have either the pyarrow or the fastparquet library installed, as Pandas relies on these libraries to handle the Parquet file format. Apache Arrow, integrated through pyarrow, provides a columnar memory format for efficient data processing and interoperability.
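If you want to confirm which versions you have installed, a quick check from Python looks like this:

import pandas as pd
import pyarrow

# Print the installed library versions
print(pd.__version__)
print(pyarrow.__version__)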

With setup out of the way, let’s get started.

Reading and Writing Parquet Files

Reading and writing Parquet files is managed through a pair of Pandas methods: pandas.DataFrame.to_parquet and pandas.read_parquet. To start, we will establish sample data and create a Pandas dataframe.

import pandas as pd

# Create sample data frame
data = {
    'product': ['a', 'b', 'c', 'd'],
    'price': [10, 20, 30, 40]
}
df = pd.DataFrame(data)
df

Once we have our dataframe, writing a Parquet file is effortless with the to_parquet() method.

df.to_parquet(path='../data-dest/sample.parquet', engine='pyarrow')

Reading a Parquet file back into a dataframe is also trivial.

df = pd.read_parquet('../data-dest/sample.parquet')

Parquet In-Memory

It is possible to work with Parquet data entirely in memory. When the path argument is omitted, to_parquet() returns the serialized data as bytes. To read that data back into a dataframe, wrap the bytes in io.BytesIO() as shown in the following example.

import pandas as pd
import io

# Create sample data frame
data = {
    'product': ['a', 'b', 'c', 'd'],
    'price': [10, 20, 30, 40]
}
df = pd.DataFrame(data)

buffer = df.to_parquet()

buffer_df = pd.read_parquet(io.BytesIO(buffer))
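As a quick sanity check, the buffer is plain bytes, and the round trip through Parquet should preserve the data exactly:

print(type(buffer))          # <class 'bytes'>
print(buffer_df.equals(df))  # expected: True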

Unique Features

Parquet is a powerful data format for big data analytics and data science. It has several features that set it far apart from legacy formats like CSV. A few of Parquet’s unique features will be covered in the remainder of this article.

  • Columnar Storage – Parquet files store values in columnar format. This layout improves query and IO performance for analytics workloads. Column projection means fewer data pages are fetched from storage, which is typically the slowest component in our systems.
  • Row Group Metadata – Parquet files include metadata for each row group, such as min and max values for each column. This metadata allows for efficient querying and filtering of data. When you run a query, the Parquet reader can quickly skip over row groups that do not match the query criteria, significantly reducing the amount of data read and processed. (A quick way to inspect this metadata with PyArrow is sketched just after this list.)
  • Predicate Pushdown – Filters can be applied early in the query execution process by analyzing metadata, minimizing the amount of data read from storage.
  • Compression Efficiency – Due to the columnar format, Parquet achieves higher compression ratios than row-based storage counterparts. It can use encoding schemes that take advantage of repeated patterns within each column, and compression algorithms can exploit this redundancy within the same data type.
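To see some of this structure for yourself, PyArrow exposes the file-level metadata directly. Here is a minimal sketch against the sample file written earlier:

import pyarrow.parquet as pq

# Inspect the schema and file-level metadata of the sample file
parquet_file = pq.ParquetFile('../data-dest/sample.parquet')

print(parquet_file.schema_arrow)   # column names and Arrow types
print(parquet_file.metadata)       # row count, row groups, format version

# Compression codec used for the first column chunk of the first row group
print(parquet_file.metadata.row_group(0).column(0).compression)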

Slightly Bigger Data

We will need a more complex data set to illustrate the true power of the Parquet format. I've pulled weather data from the National Weather Service to use in the next several examples. This data set contains a week's worth of ~hourly temperature readings from the Orlando International Airport.

The raw dataframe includes a date column, the temperature reading in Celsius, and a station_url column.
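If you would like to pull comparable data yourself, a rough sketch along the following lines should work. The station ID (KMCO) and the response field names are assumptions based on the public api.weather.gov observations endpoint, so adjust them as needed:

import pandas as pd
import requests

# Hypothetical pull from the NWS observations endpoint for Orlando International (KMCO)
url = 'https://api.weather.gov/stations/KMCO/observations'
response = requests.get(url, headers={'User-Agent': 'parquet-demo'})
features = response.json()['features']

# Build the raw dataframe from the observation properties
df = pd.DataFrame({
    'date': [f['properties']['timestamp'] for f in features],
    'temperature': [f['properties']['temperature']['value'] for f in features],
    'station_url': [f['properties']['station'] for f in features],
})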

Data Cleanup

We will run data cleanup and transformations to make the data more usable for the examples.

# Convert datetime datatype
df['date'] = pd.to_datetime(df['date'])


# Separate columns for Celsius and Fahrenheit temperatures
df.rename(columns={'temperature': 'c_temperature'}, inplace=True)
df['f_temperature'] = (df['c_temperature'] * 9 / 5) + 32


# Split out a new column for station_id
df['station_id'] = df['station_url'].str.rsplit('/', n=1, expand=True)[1]


# Create date parts for hive partitioning
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

As we can infer, it has not been pleasant in the South.

df[['f_temperature']].max()

# f_temperature 97.808

The dataframe now carries the renamed c_temperature column along with the new f_temperature, station_id, year, month, and day columns.

Column Projection

Filtering out unnecessary columns early in the process can greatly enhance query performance. The example below shows the subtle difference between loading the entire file and pushing the column selection down to the Parquet reader.

The columns argument passes the column selection down to pyarrow so that only the requested subset of columns is read into memory.

file_path = '../data-dest/weather_proj.parquet'

df.to_parquet(
    file_path, 
    engine='pyarrow', 
    index=False
)

# Pandas - read entire file, then filter columns
df1 = pd.read_parquet(file_path, engine='pyarrow')[['date', 'f_temperature']]

# PyArrow - read only necessary columns
df1 = pd.read_parquet(file_path, engine='pyarrow', columns=['date', 'f_temperature'])

Row Groups

To properly show off Parquet row groups, the dataframe should first be sorted by our f_temperature field. The sorted dataframe is then written with row_group_size=100, which produces 8 row groups.

When reading the file back, the filters argument passes the predicate down to pyarrow, which applies the filter based on the row group statistics.

file_path = '../data-dest/weather_rg.parquet'

df2 = df.sort_values(by=["f_temperature"], ascending=False)

df2.to_parquet(
    file_path, 
    engine='pyarrow', 
    index=False, 
    row_group_size=100, 
)

temperature_filter = 90

# Pandas - read entire file, then filter condition
df2 = pd.read_parquet(file_path, engine='pyarrow').query("f_temperature > " + str(temperature_filter))

# PyArrow - read only data rows that match condition
df2 = pd.read_parquet(file_path, engine='pyarrow', filters=[("f_temperature", ">", temperature_filter)])

Now let’s examine the innards of the Parquet file to better understand row group structure. To read the metadata, we will leverage PyArrow directly in the following example:

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("../data-dest/weather_rg.parquet")

# Return the column index of the target column from the file schema
col = parquet_file.schema_arrow.get_field_index('f_temperature')

# Loop through row groups and display metadata
for i in range(parquet_file.metadata.num_row_groups):
    min_value = parquet_file.metadata.row_group(i).column(col).statistics.min
    max_value = parquet_file.metadata.row_group(i).column(col).statistics.max
    print(f"row group: {i}, min: {min_value}, max: {max_value}")
row group: 0, min: 90.392, max: 97.808
row group: 1, min: 87.008, max: 90.392
row group: 2, min: 82.508, max: 87.008
row group: 3, min: 78.908, max: 82.508
row group: 4, min: 77.594, max: 78.908
row group: 5, min: 76.406, max: 77.594
row group: 6, min: 75.794, max: 76.406
row group: 7, min: 74.300, max: 75.794

From the row group metadata, we can see that our filter of > 90 requires reading only row groups 0 and 1, allowing the reader to skip the rest.

Partitions

Parquet can take advantage of Hive-style partitioning, which can provide the biggest performance gains when implemented correctly.

import datetime as dt

file_path = '../data-dest/weather_part'

df.to_parquet(
    file_path, 
    engine='pyarrow', 
    index=False,
    partition_cols=["year", "month", "day"], 
    existing_data_behavior='delete_matching'
)

The integer columns year, month, and day that we derived from the original date column are used in the partition_cols argument to split the Parquet files. Once we execute that command, the directory structure looks like the following:

data-dest\weather_part\
└── year=2024\
    ├── month=6\
    │   ├── day=27\
    │   │   └── ad2704d90d864255b8754dba2c07d516-0.parquet
    │   ├── day=28\
    │   │   └── ad2704d90d864255b8754dba2c07d516-0.parquet
    │   ├── day=29\
    │   │   └── ad2704d90d864255b8754dba2c07d516-0.parquet
    │   └── day=30\
    │       └── ad2704d90d864255b8754dba2c07d516-0.parquet
    └── month=7\
        ├── day=1\
        │   └── ad2704d90d864255b8754dba2c07d516-0.parquet
        ├── day=2\
        │   └── ad2704d90d864255b8754dba2c07d516-0.parquet
        ├── day=3\
        │   └── ad2704d90d864255b8754dba2c07d516-0.parquet
        └── day=4\
            └── ad2704d90d864255b8754dba2c07d516-0.parquet

To take advantage of predicate pushdown, we will again use the filters argument, this time passing the year, month, and day integers for the date we are interested in.

# Define the filter date
filter_date = dt.datetime(2024, 7, 1)

# PyArrow - read only files that meet the filter criteria
df3 = pd.read_parquet(file_path, engine='pyarrow', filters=[
    ("year", "=", filter_date.year),
    ("month", "=", filter_date.month),
    ("day", "=", filter_date.day)
])

An alternative way to filter partitions is to adjust the file path. For example, to load only the July files:

file_path = '../data-dest/weather_part/year=2024/month=7'
df4 = pd.read_parquet(file_path)

min_date = df4['date'].min()
max_date = df4['date'].max()

print(f"Start: {min_date.year}-{min_date.month}-{min_date.day}")
print(f"End: {max_date.year}-{max_date.month}-{max_date.day}")
Start: 2024-7-1
End: 2024-7-4

Compression

Parquet’s design offers significant advantages in terms of data compression, which is crucial for optimizing storage and improving query performance. The columnar storage format of Parquet allows for more efficient compression than row-based formats like CSV.

In this example, the compression='snappy' argument specifies that the Snappy compression algorithm should be used. Snappy is a fast and efficient compression algorithm, suitable for a variety of data types.

file_path = '../data-dest/weather_tiny.snappy.parquet'

df.to_parquet(
    file_path, 
    engine='pyarrow', 
    index=False,
    compression='snappy'
)
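To get a feel for the trade-offs, one simple sketch is to write the same dataframe with a few different codecs and compare the resulting file sizes on disk; the exact ratios will vary with your data:

import os

# Write the same dataframe with several codecs and compare file sizes
for codec in [None, 'snappy', 'gzip', 'zstd']:
    path = f"../data-dest/weather_{codec or 'uncompressed'}.parquet"
    df.to_parquet(path, engine='pyarrow', index=False, compression=codec)
    print(f"{codec or 'uncompressed'}: {os.path.getsize(path):,} bytes")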

Conclusion

Parquet’s efficiency and advanced features make it an essential tool in the modern data engineer’s toolkit. Through the examples provided, we have explored how to leverage Parquet’s capabilities using Pandas and PyArrow for reading, writing, and optimizing data handling.

For more in-depth exploration, visit the references below.

Happy data processing!

References