Previously, I just used this snippet to get all the column names of a parquet file:
import pandas as pd

df = pd.read_parquet("hello.parquet")
print(list(df.columns))
But if the parquet file is large (not even very large, for example 1GB), this causes an OOM on my small VM (about 4GB RAM), because read_parquet loads the whole file into memory.
Actually, all I want is the column names, not the data. Since parquet is a well-defined format that stores its schema in the file metadata, there must be some way to read only the schema instead of all the data.
And here it is:
import pyarrow.parquet as pq

schema = pq.read_schema("hello.parquet", memory_map=True)
print(list(schema.names))