How to Python Read XLSX File and Parse Data?
I want to read a .xlsx file using the Pandas library in Python and later port the data to a PostgreSQL table.
Here’s what I have so far:
import pandas as pd
data = pd.ExcelFile("*File Name*")
I know the step is successful, but I want to understand how to parse the Excel file and map its data to the data
variable. I believe data
is a DataFrame object. How can I parse this DataFrame and extract each row one by one?
If you’re new to this, one straightforward way to work with your data is by iterating row by row using iterrows()
. Here’s how it looks:
import pandas as pd
# Read the Excel file into a DataFrame
data = pd.read_excel("*File Name*")
# Iterate over rows using iterrows()
for index, row in data.iterrows():
print(row) # Access each row as a pandas Series
This method gives you each row as a Series
, which is handy when you’re working with smaller datasets. However, keep in mind that iterrows()
can be slower for larger files.
Building on Tim’s solution, if performance becomes a concern (especially for larger datasets), I recommend using itertuples()
. It’s a more efficient way to iterate through rows.
import pandas as pd
# Read the Excel file into a DataFrame
data = pd.read_excel("*File Name*")
# Iterate over rows using itertuples() (faster)
for row in data.itertuples(index=False): # index=False prevents including index in the row
print(row) # Access each row as a named tuple
Unlike iterrows()
, this gives you a named tuple for each row, making it both faster and cleaner to work with. If you’re dealing with larger datasets in your python read xlsx
operations, this approach is ideal.
Taking Emma’s point further, another option for processing data is to convert the DataFrame into a list of lists. This approach simplifies accessing rows for certain use cases:
import pandas as pd
# Read the Excel file into a DataFrame
data = pd.read_excel("*File Name*")
# Convert the DataFrame to a list of lists (each row is a list)
data_list = data.values.tolist()
# Print each row
for row in data_list:
print(row) # Each row is a list of values
Using values.tolist()
, you effectively remove the overhead of Pandas operations when iterating. This is great if your goal is to feed this data directly into another system, like your PostgreSQL table, during a python read xlsx
workflow.