My previous code read the whole file just to get the one column I need:
```python
import pandas as pd

df = pd.read_csv("data.csv")["card_id"]
```
In the test environment, this program used more than 10 GB of memory because of the large data file.
To reduce memory usage, I switched to the usecols parameter:
```python
import pandas as pd

df = pd.read_csv("data.csv", usecols=["card_id"])
```
With that change, the program used less than 1 GB of memory.
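If you want to check the footprint yourself, here is a quick sketch (the 10 GB and 1 GB figures above came from the test environment, not from this call):

```python
import pandas as pd

df = pd.read_csv("data.csv", usecols=["card_id"])

# deep=True counts the actual payload of object (string) columns,
# not just pointer sizes, so the total reflects real memory use.
print(df.memory_usage(deep=True).sum(), "bytes")
```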
The only catch is that the way you select columns differs between readers. read_csv() uses usecols, read_parquet() uses a columns parameter instead, and with read_sql() you can simply list the columns in the SELECT statement. For Parquet the selection is especially cheap: the format is columnar, so the engine reads only the requested columns from disk rather than loading everything first.
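For example, a minimal sketch of the Parquet case (data.parquet and the amount column are placeholders I made up for illustration; card_id mirrors the CSV example above):

```python
import pandas as pd

# Write a tiny Parquet file so the snippet is self-contained.
pd.DataFrame({
    "card_id": [1, 2, 3],
    "amount": [9.99, 5.00, 12.50],
}).to_parquet("data.parquet")

# columns= plays the same role here as usecols= in read_csv:
# only card_id is read from disk, not the whole table.
df = pd.read_parquet("data.parquet", columns=["card_id"])
print(df.head())
```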