Get the number of rows for a parquet file

We had been using pandas to get the number of rows in a Parquet file:

import pandas as pd
df = pd.read_parquet("my.parquet")
print(df.shape[0])

This is easy, but it takes a lot of time and memory when the Parquet file is very large, because the whole file is decompressed and loaded into a DataFrame. For example, reading a 10GB Parquet file may take more than 100GB of memory.

If we only need the number of rows, not the data itself, PyArrow is a better solution:

import pyarrow.parquet as pq
table = pq.read_table("my.parquet", columns=[])
print(table.num_rows)

This method takes only a couple of seconds and uses about 2GB of memory for the same Parquet file.

Robin on Linux
