To save memory in my program that uses Pandas, I changed the dtype of some columns from string to category, following the reference.

df[["os_type", "cpu_type", "chip_brand"]] = \
	df[["os_type", "cpu_type", "chip_brand"]].astype("category")
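
As a rough check of the saving, you can compare the per-column footprint before and after the conversion. A minimal sketch with made-up data standing in for my real dataframe; memory_usage(deep=True) counts the Python string objects themselves, not just the pointers:

import pandas as pd

# Hypothetical data standing in for the real dataframe.
df = pd.DataFrame({
    "os_type": ["linux", "windows", "linux"] * 100_000,
    "cpu_type": ["x86", "arm", "x86"] * 100_000,
    "chip_brand": ["intel", "amd", "apple"] * 100_000,
})

cols = ["os_type", "cpu_type", "chip_brand"]
print(df[cols].memory_usage(deep=True).sum())                      # object/string columns
print(df[cols].astype("category").memory_usage(deep=True).sum())   # category columns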

This saved at least half of the memory in my case. But when I used pyarrow to store the DataFrame to Parquet

df.to_parquet("my.parquet")

it reported an error:

Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647

It's a bug in old versions of pyarrow that was fixed in September 2019. (The limit in the message comes from Arrow's binary/string arrays using 32-bit offsets, so a single array can hold at most about 2 GiB of data.) I upgraded my pyarrow 0.12.1 to pyarrow 0.17.1 and that fixed the error.
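
After the upgrade, it is worth confirming which version the process actually imports (a trivial check; 0.17.1 is just the version from my case):

import pyarrow

# The broken behaviour was in 0.12.1; the upgraded environment reports 0.17.1.
print(pyarrow.__version__)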

But the story doesn't end there.

With pyarrow 0.12.1, the snippet below returns an object of type <pyarrow.lib.Column>

import pyarrow.parquet as pq
table = pq.read_table(path)
table.column(0)

and this object also carries a name attribute holding the column name.

But with pyarrow 0.17.1, the same code returns an object of type <pyarrow.lib.ChunkedArray>, which no longer carries the column name.
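
If the code only needs the column name, the newer API still exposes it through the table's schema. A minimal sketch against pyarrow 0.17.1, reusing the path from the snippet above:

import pyarrow.parquet as pq

table = pq.read_table(path)
# table.column(0) is now a ChunkedArray, so take the name from the schema instead.
first_name = table.schema.names[0]   # or: table.column_names[0]
first_data = table.column(0)         # pyarrow.lib.ChunkedArray
print(first_name, first_data.type)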

This difference breaks some code (it actually broke our program). Beware: after you upgrade pyarrow (or any other Python library), run your tests to make sure all the legacy code still works properly.
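
For what it's worth, even a tiny regression test would have caught this for us. A sketch of such a test (the column name and file layout are made up for illustration, using pytest's tmp_path fixture):

import pandas as pd
import pyarrow.parquet as pq

def test_parquet_roundtrip_keeps_column_names(tmp_path):
    # Write a small frame with a category column, the same pattern as in production.
    df = pd.DataFrame({"os_type": pd.Series(["linux", "windows"], dtype="category")})
    target = tmp_path / "sample.parquet"
    df.to_parquet(target)

    table = pq.read_table(target)
    # This is the spot that changed between pyarrow 0.12.1 and 0.17.1:
    # the column name now comes from the schema, not from table.column(0).
    assert table.schema.names == ["os_type"]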