We were using pandas to get the number of rows in a Parquet file:
    import pandas as pd

    df = pd.read_parquet("my.parquet")
    print(df.shape)  # (rows, columns)
This is easy, but it takes a lot of time and memory when the Parquet file is very large, because the whole file is loaded into a DataFrame. For example, reading a 10GB Parquet file may take more than 100GB of memory.
If we only need the number of rows and not the data itself, PyArrow is a better solution:
    import pyarrow.parquet as pq

    # An empty column list avoids loading any column data.
    table = pq.read_table("my.parquet", columns=[])
    print(table.num_rows)
This method takes only a couple of seconds and about 2GB of memory for the same Parquet file.