Understanding Pandas Merge: Joins & Best Practices

How does the pandas merge function work for different types of joins?

How can I perform an INNER, LEFT, RIGHT, or FULL OUTER join using pandas merge?

  • How do I handle missing values (adding or removing NaNs) after merging?
  • Can I merge DataFrames based on the index?
  • What’s the best way to merge multiple DataFrames?
  • How do I perform a cross join in pandas?
  • When should I use merge, join, concat, or update?

Many questions arise about pandas merge, but the information is often scattered. This guide aims to consolidate key insights and best practices for merging DataFrames efficiently.

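The merge() function combines two DataFrames on one or more key columns, and the how argument decides which keys survive. An inner join (the default) keeps only keys found in both frames, a left join keeps every key from the left frame, a right join keeps every key from the right frame, and a full outer join keeps the keys from both.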

Here’s how you can perform different types of joins:

import pandas as pd  

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value1': ['A', 'B', 'C']})  
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Value2': ['X', 'Y', 'Z']})  

# INNER JOIN
inner = pd.merge(df1, df2, on="ID", how="inner")

# LEFT JOIN
left = pd.merge(df1, df2, on="ID", how="left")

# RIGHT JOIN
right = pd.merge(df1, df2, on="ID", how="right")

# FULL OUTER JOIN
full_outer = pd.merge(df1, df2, on="ID", how="outer")

:point_right: Use pandas merge() with how="inner", "left", "right", or "outer" to control the join behavior.
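
To see which rows each join type keeps, here is a small sketch that compares the surviving ID values. It repeats the sample frames so it runs on its own. The indicator=True argument is an addition that the examples above don't use; it adds a _merge column recording which side each row came from.

```python
import pandas as pd

# Same sample frames as above, repeated so the snippet is self-contained
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value1': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Value2': ['X', 'Y', 'Z']})

# Collect the IDs that survive each join type
join_ids = {
    how: pd.merge(df1, df2, on="ID", how=how)["ID"].tolist()
    for how in ("inner", "left", "right", "outer")
}
print(join_ids)

# indicator=True adds a "_merge" column: "left_only", "right_only", or "both"
audit = pd.merge(df1, df2, on="ID", how="outer", indicator=True)
print(audit[["ID", "_merge"]])
```

With this data, only IDs 2 and 3 appear in both frames, so the inner join keeps two rows while the outer join keeps all four IDs.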

After a left, right, or outer merge, any row whose key has no match on the other side gets NaN in the columns that come from that side. You can either fill those gaps with a default or drop the incomplete rows.

Here’s how you can manage missing values:

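# An outer join leaves NaN wherever a key is missing from one side
merged_df = pd.merge(df1, df2, on="ID", how="outer")

# Option 1: fill NaN values with a default string
filled_df = merged_df.fillna("Unknown")

# Option 2: drop rows that contain any missing value
complete_df = merged_df.dropna()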

:point_right: Use .fillna() to replace NaN values or .dropna() to remove incomplete rows after a pandas merge.
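
If you only want to touch certain columns, both methods can be narrowed down. The sketch below assumes the sample df1 and df2 from the join examples. The names filled and has_value2 are just illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value1': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Value2': ['X', 'Y', 'Z']})

# Outer join: ID 1 has no Value2, ID 4 has no Value1
merged_df = pd.merge(df1, df2, on="ID", how="outer")

# Fill per column by passing a dict of column name -> default value
filled = merged_df.fillna({"Value1": "Unknown", "Value2": "Unknown"})

# Drop a row only when a specific column is missing
has_value2 = merged_df.dropna(subset=["Value2"])

print(filled)
print(has_value2)
```

Both calls return new DataFrames, so merged_df itself still contains the NaN values afterwards.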

Sometimes the join key lives in the index rather than in a regular column. In that case, you can merge on the index like this:

# Move ID into the index; without inplace=True, df1 and df2 keep their ID column
df1_indexed = df1.set_index("ID")
df2_indexed = df2.set_index("ID")

merged_index = df1_indexed.merge(df2_indexed, left_index=True, right_index=True, how="outer")

:point_right: Use left_index=True, right_index=True in pandas merge() to join DataFrames based on index.
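
Two related variations are worth a quick sketch: DataFrame.join, which aligns on the index by default, and the mixed case where the key is a column on one side and the index on the other. The frames are the sample df1 and df2 from the first example. The names left_indexed, right_indexed, joined, and mixed are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value1': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Value2': ['X', 'Y', 'Z']})

# set_index returns new frames here, so df1 and df2 stay unchanged
left_indexed = df1.set_index("ID")
right_indexed = df2.set_index("ID")

# DataFrame.join aligns on the index; its default is how="left"
joined = left_indexed.join(right_indexed, how="outer")

# Mixed case: the key is a column on the left and the index on the right
mixed = pd.merge(df1, right_indexed, left_on="ID", right_index=True, how="inner")

print(joined)
print(mixed)
```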

Need a cross join instead? Since pandas 1.2, merge() accepts how="cross" directly. On older versions, you can simulate it by adding a dummy key column:

# assign() adds the temporary key to copies, so df1 and df2 are not modified
cross_join = pd.merge(df1.assign(key=1), df2.assign(key=1), on="key").drop(columns="key")

:point_right: Use how="cross" in pandas merge() on pandas 1.2 or later; on older versions, simulate a cross join by adding a dummy key column before merging.
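
For reference, here is a minimal sketch of the built-in route, assuming pandas 1.2 or newer where merge() accepts how="cross". It reuses the sample frames from the first example.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Value1': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Value2': ['X', 'Y', 'Z']})

# Built-in cross join: every row of df1 is paired with every row of df2.
# No "on" argument is passed, since a cross join has no key to match.
cross = pd.merge(df1, df2, how="cross")

# Column names shared by both frames get the default suffixes _x and _y
print(cross.columns.tolist())

# The result has len(df1) * len(df2) rows
print(len(cross))
```

Because the row count is the product of the two frame lengths, cross joins grow quickly, so it is worth checking the sizes before running one on large frames.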