1. pd.merge() may change the names of the original columns:
import pandas as pd

df1 = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
df2 = pd.DataFrame(data={"name": ["lion", "heart"], "age": [50, 60]})
merged = pd.merge(df1, df2, how="outer", on="name")
print(merged)
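The output will not have a column named age. Instead it has two new columns named age_x and age_y, because pandas appends the suffixes _x and _y (the defaults of the suffixes parameter) to non-key columns that exist in both tables. So when you merge two tables with many columns, be aware that the column names may change.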
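As a small sketch of how to keep the renamed columns readable, the suffixes parameter of pd.merge() can be set explicitly. The suffix strings "_left" and "_right" below are arbitrary choices for illustration, not something from the original example.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["robin", "hood"], "age": [40, 30]})
df2 = pd.DataFrame({"name": ["lion", "heart"], "age": [50, 60]})

# Default suffixes: the overlapping non-key column "age" is split in two.
default_merged = pd.merge(df1, df2, how="outer", on="name")

# Explicit suffixes (illustrative names) make each column's origin obvious.
named_merged = pd.merge(
    df1, df2, how="outer", on="name", suffixes=("_left", "_right")
)

print(list(default_merged.columns))  # ['name', 'age_x', 'age_y']
print(list(named_merged.columns))    # ['name', 'age_left', 'age_right']
```

Checking the column list right after a merge is a cheap way to catch unexpected renames before later code looks up a column that no longer exists.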
2. Use iterrows() to traverse the rows of a dataframe:
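import pandas as pd
from multiprocessing import Pool

def process(row):
    # iterrows() yields (index, Series) tuples, so row[1] holds the row's data
    print(row[1])

if __name__ == "__main__":
    # The guard is needed on platforms that start workers by spawning
    df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
    with Pool(6) as pool:
        pool.map(process, df.iterrows())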
If we call pool.map(process, df) directly, it traverses the column names of the dataframe instead of its rows, because iterating over a DataFrame yields its column labels.
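The difference between the two kinds of iteration can be checked without multiprocessing. The following is a minimal sketch; the variable names column_labels and rows are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["robin", "hood"], "age": [40, 30]})

# Iterating over the DataFrame itself yields its column labels.
column_labels = list(df)

# iterrows() yields one (index, Series) pair per row.
rows = [(index, row["name"], row["age"]) for index, row in df.iterrows()]

print(column_labels)  # ['name', 'age']
print(rows)
```

Each item produced by iterrows() is a tuple, which is why the worker function in the example above reads the row data from row[1].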
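3. How to append a pd.Series to a pd.DataFrame. According to this article, the easiest way is the following. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this code needs an older pandas version: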
import pandas as pd

df = pd.DataFrame(data={"name": ["robin", "hood"], "age": [40, 30]})
series = pd.Series(["water", 50], index=["name", "age"])
print(df.append(series, ignore_index=True))
The result is:
    name  age
0  robin   40
1   hood   30
2  water   50
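Alternatively, we can give the pd.Series a name and drop ignore_index. The appended data is the same, but the name of the Series becomes the index label of the new row. Without either a name or ignore_index=True, append raises a TypeError.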
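If the pd.Series is created without an index, it gets the default integer labels 0 and 1. These do not match the name and age columns, so the values land in two new columns and the result becomes: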
    name   age      0     1
0  robin  40.0    NaN   NaN
1   hood  30.0    NaN   NaN
2    NaN   NaN  water  50.0
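On pandas 2.0 and later, where DataFrame.append is no longer available, one possible substitute is pd.concat. The sketch below is an assumption about how to reproduce the behaviour described above, not the method from the article: it turns the Series into a one-row DataFrame with to_frame().T before concatenating.

```python
import pandas as pd

df = pd.DataFrame({"name": ["robin", "hood"], "age": [40, 30]})

# A Series whose index matches the column names aligns with the columns.
series = pd.Series(["water", 50], index=["name", "age"])
new_row = series.to_frame().T  # one-row DataFrame with columns name, age
result = pd.concat([df, new_row], ignore_index=True)
print(result)

# A Series without an explicit index is labeled 0 and 1, so its values
# end up in two new columns instead of under name and age.
unlabeled = pd.Series(["water", 50])
misaligned = pd.concat([df, unlabeled.to_frame().T], ignore_index=True)
print(list(misaligned.columns))  # ['name', 'age', 0, 1]
```

One thing to watch with this approach: the transposed one-row frame has object dtype, so the age column of the result may no longer be an integer column and might need an explicit astype afterwards.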