Recently learned tips about NumPy and Pandas

Precision

After running this snippet:

It prints out:

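Why do np.float32 and np.float64 produce the same output? The answer is that NumPy rounds array elements to its print precision (8 digits by default) when displaying them, so the print options need to be set first.
Let’s set the option before printing: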

The result becomes:

which looks much more reasonable.
Furthermore, why does it print out ‘0.1122334455667789’, which has only 16 digits instead of 18? Because float64 supports only about 15 to 16 significant decimal digits, as this reference says.
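The original snippets are not reproduced here, so the following is only a minimal sketch of the same experiment. The sample value and the variable names are assumptions chosen for illustration. It uses np.printoptions to raise the display precision temporarily and np.finfo to show roughly how many decimal digits each dtype can hold.

```python
import numpy as np

# Assumed sample value with 18 decimal digits (not taken from the original snippet).
value = 0.112233445566778899

a32 = np.array([value], dtype=np.float32)
a64 = np.array([value], dtype=np.float64)

# NumPy's print precision setting; arrays are rounded to this many
# fractional digits when displayed.
default_precision = np.get_printoptions()["precision"]

# Raise the display precision only inside this block.
with np.printoptions(precision=18):
    text32 = str(a32)
    text64 = str(a64)

print(default_precision)
print(text32)
print(text64)

# Approximate number of decimal digits each dtype can represent reliably.
digits32 = np.finfo(np.float32).precision
digits64 = np.finfo(np.float64).precision
print(digits32, digits64)
```

Using the context manager instead of np.set_printoptions keeps the change local, so later prints in the same session still use the default precision.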

Hidden metadata

There are two parquet files which look different when compared with ‘cksum’. But after we export them as CSV files:

The two output CSV files are exactly the same.
Then what was different in the two parquet files? Does a parquet file have some hidden metadata in it?
As a matter of fact, a parquet file written by Pandas stores the ‘index’ of the DataFrame, while the exported CSV files do not contain it. If we drop the index before writing out the parquet file:

These two parquet files become identical.
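The effect of the index can be illustrated in memory, without writing any files. This is only a sketch under assumptions: the two DataFrames and their values are hypothetical, and the CSV export is assumed to use index=False, since to_csv writes the index by default.

```python
import pandas as pd

# Hypothetical data: same column values, different index labels.
df_a = pd.DataFrame({"value": [1.5, 2.5, 3.5]})  # default index 0, 1, 2
df_b = pd.DataFrame({"value": [1.5, 2.5, 3.5]}, index=[10, 20, 30])

# CSV text written without the index is identical for both frames.
csv_a = df_a.to_csv(index=False)
csv_b = df_b.to_csv(index=False)
same_csv = csv_a == csv_b

# The frames themselves differ, because the index is part of the data.
same_frames = df_a.equals(df_b)

# After dropping the index, the frames match.
same_after_reset = df_a.reset_index(drop=True).equals(df_b.reset_index(drop=True))

print(same_csv, same_frames, same_after_reset)
```

With a parquet engine such as pyarrow installed, df.to_parquet(path, index=False) is the Pandas call that leaves the index out of the file.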
