How can I parse a text file in Python and filter data based on conditions?

mehta_tvara · December 19, 2024, 6:30pm

How can I parse a text file in Python** and perform the following tasks? I have a text file (.txt) that looks like this:

Date, Day, Sect, 1, 2, 3
1, Sun, 1-1, 123, 345, 678
2, Mon, 2-2, 234, 585, 282
3, Tue, 2-2, 231, 232, 686

Here are my goals:

Read the text file by line as a separate element in a list and split each line by commas, while removing unnecessary newline characters (\n).
Set the first row (Date, Day, Sect, 1, 2, 3) as the dictionary keys and the subsequent rows as their corresponding values in a dictionary. However, the code I have currently:

file = open('abc.txt', mode='r', encoding='utf-8-sig')
lines = file.readlines()
file.close()
my_dict = {}
my_list = []
for line in lines:
    line = line.split(',')
    line = [i.strip() for i in line]

Has two issues:

The first row should also be set as a dictionary.
When I add the dictionary to the list with my_list.append(my_dict), only the last row is saved, not all the rows.

Create a list that includes the dictionary as elements.
After creating the list of dictionaries, I want to subset the elements based on specific conditions. For example, I want to select the elements where the Sect is 2-2. The expected result would look like this:

[{'Date': '2', 'Day': 'Mon', 'Sect': '2-2', '1': '234', '2': '585', '3': '282'},
 {'Date': '3', 'Day': 'Tue', 'Sect': '2-2', '1': '231', '2': '232', '3': '686'}]

Can someone help me achieve this using parse text file Python?

Ambikayache · December 19, 2024, 6:41pm

I’ve been working with text data for a while, and one of the most straightforward ways to parse a file is using the csv module. It’s built into Python, lightweight, and perfect for handling structured text files like CSVs.

Here’s how you can do it:

import csv

my_list = []
with open('abc.txt', mode='r', encoding='utf-8-sig') as file:
    reader = csv.reader(file)
    headers = next(reader)  # Extract the first row as headers
    for row in reader:
        my_dict = dict(zip(headers, row))  # Combine headers with row values
        my_list.append(my_dict)

# Filter rows where Sect is '2-2'
filtered_list = [d for d in my_list if d['Sect'] == '2-2']
print(filtered_list)

How it works: The csv.reader() reads the file line by line. The first row is treated as headers, and each subsequent row is combined with those headers using zip(). The result is a list of dictionaries that you can easily filter. Simple and effective!

devan-skeem · December 19, 2024, 6:42pm

Building on what Ambika mentioned, if you want full control over parsing without relying on the csv module, you can manually handle the text file line by line. It gives you more flexibility when the file isn’t in a standard CSV format.

Here’s how you can do it:

my_list = []
with open('abc.txt', mode='r', encoding='utf-8-sig') as file:
    lines = file.readlines()
    headers = [header.strip() for header in lines[0].split(',')]  # Extract headers
    for line in lines[1:]:
        values = [value.strip() for value in line.split(',')]
        my_dict = dict(zip(headers, values))  # Combine headers with values
        my_list.append(my_dict)

# Filter rows where Sect is '2-2'
filtered_list = [d for d in my_list if d['Sect'] == '2-2']
print(filtered_list)

How it works: The first line is split to extract headers, and each subsequent line is processed similarly. This approach is useful if your text file has irregular delimiters or spacing that requires extra cleanup. It’s a bit more manual, but the results match perfectly!

ishrth_fathima · January 2, 2025, 11:53am

Let me take this a step further. If you’re dealing with large datasets or need advanced operations, the pandas library is your best friend. It’s fast, efficient, and handles CSV-like data effortlessly.

Here’s how you can do it:

import pandas as pd

# Read the CSV data into a pandas DataFrame
df = pd.read_csv('abc.txt', encoding='utf-8-sig')

# Convert the DataFrame to a list of dictionaries
my_list = df.to_dict(orient='records')

# Filter rows where Sect is '2-2'
filtered_list = [d for d in my_list if d['Sect'] == '2-2']
print(filtered_list)

How it works: pandas.read_csv() reads the file into a DataFrame, which is a highly optimized structure for handling tabular data. Converting it to a list of dictionaries using to_dict() gives you the same flexibility as earlier approaches, but with the added power of pandas for more complex operations. It’s my go-to for parsing and filtering!