Get the schema of a parquet file

Previously, I just used this snippet to get all the column names of a Parquet file:

import pandas as pd
df = pd.read_parquet("hello.parquet")
print(list(df.columns))

But if the Parquet file is large (it doesn't have to be huge; 1GB is enough), reading it this way causes an out-of-memory (OOM) error on my small VM (about 4GB RAM).

Actually, all I want is the column names, not the whole data. Since the Parquet format is well structured, there must be some way to read only the schema instead of all the data.

And, here it is:

import pyarrow.parquet as pq
# read_schema reads only the file's metadata, not the row data
schema = pq.read_schema("hello.parquet", memory_map=True)
print(schema.names)

Robin on Linux
