To save memory in my program that uses Pandas, I changed the type of some columns from string to category, following the reference.
df[["os_type", "cpu_type", "chip_brand"]] = df[["os_type", "cpu_type", "chip_brand"]].astype("category")
It saved at least half the memory in my case. But when I used pyarrow to write the dataframe to Parquet
df.to_parquet("my.parquet")
it reported an error:
Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647
This is a bug in older versions of pyarrow that was fixed in September 2019. I upgraded from pyarrow-0.12.1 to pyarrow-0.17.1, and that made the error go away.
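If you are not sure which version you are running, a quick check looks like this (a minimal sketch; the upgrade command assumes you install packages with pip):
import pyarrow
print(pyarrow.__version__)   # e.g. "0.12.1" before the upgrade
# then, from the shell: pip install --upgrade pyarrow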
But the story doesn’t end there.
With pyarrow-0.12.1, the snippet below returns an object of type <pyarrow.lib.Column>:
import pyarrow.parquet as pq

table = pq.read_table(path)   # path points to the Parquet file written above
table.column(0)               # first column of the table
and this object also carries the column name as an attribute.
But with pyarrow-0.17.1, the same code returns an object of type <pyarrow.lib.ChunkedArray>, which does not carry the column name.
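If your code still needs the column names after the upgrade, you can read them from the table or its schema instead of from the returned array. A minimal sketch, assuming the table from the snippet above:
# Column names live on the Table/schema, not on the ChunkedArray
print(table.column_names)      # list of all column names
print(table.schema[0].name)    # name of the first column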
This difference can break existing code (it actually broke our program). Beware of this: after you upgrade pyarrow (or any other Python library), run your tests to make sure all the legacy code still works properly.
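As an illustration, a tiny regression test along these lines would have caught the change for us. This is only a sketch, assuming pytest and its tmp_path fixture; the column names are hypothetical.
import pandas as pd
import pyarrow.parquet as pq

def test_parquet_column_names(tmp_path):
    # Write a small dataframe and make sure the column names survive the round trip
    df = pd.DataFrame({"os_type": ["linux"], "cpu_type": ["x86"]})
    out = str(tmp_path / "my.parquet")
    df.to_parquet(out, index=False)   # index=False keeps the index out of the file
    table = pq.read_table(out)
    assert table.column_names == ["os_type", "cpu_type"]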