How to parse text files with multiple delimiters in Python?

anusha_gg · January 9, 2025, 6:30pm

I am trying to parse a series of text files and save them as CSV files using Python (2.7.3). All text files have a 4-line header that needs to be stripped out. The data lines have various delimiters, including " (quote), - (dash), : (colon), and blank spaces. I found it difficult to handle these different delimiters in C++, so I decided to try Python as it seemed easier.

I wrote some code to test parsing a single line of data, and it works fine. However, I couldn’t make it work for the entire file. I was using the replace method on a text string for a single line, but my current implementation reads the text file as a list, and the replace method is not available for list objects.

I am new to Python and got stuck here. Can anyone help me resolve this?

Thanks!

Code for Parsing:

# function for parsing the data
def data_parser(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

# open input/output files
inputfile = open('test.dat')
outputfile = open('test.csv', 'w')

my_text = inputfile.readlines()[4:]  # reads the whole text file, skipping the first 4 lines

# sample text string, just for demonstration to show how the data looks
# my_text = '"2012-06-23 03:09:13.23",4323584,-1.911224,-0.4657288,-0.1166382,-0.24823,0.256485,"NAN",-0.3489428,-0.130449,-0.2440527,-0.2942413,0.04944348,0.4337797,-1.105218,-1.201882,-0.5962594,-0.586636'

# dictionary definition to handle the date block delimited by dashes and ensure negative numbers are not affected
reps = {'"NAN"': 'NAN', '"': '', '0-': '0,', '1-': '1,', '2-': '2,', '3-': '3,', '4-': '4,', '5-': '5,', '6-': '6,', '7-': '7,', '8-': '8,', '9-': '9,', ' ': ',', ':': ','}

txt = data_parser(my_text, reps)
outputfile.writelines(txt)

inputfile.close()
outputfile.close()

shilpa.chandel · January 9, 2025, 6:32pm

Alright, I’ve worked with Python file parsing a lot. The problem is that readlines() returns a list of strings, while data_parser expects a single string, so the fix is to call it once per line. A simple for loop does that:

for line in my_text:
    outputfile.write(data_parser(line, reps))  # data_parser returns one string, so write() is enough

That loop still relies on the list built by readlines(), which holds the whole file in memory. To read line by line instead, iterate over the file object directly:

inputfile = open('test.dat')
outputfile = open('test.csv', 'w')

# Dictionary for parsing delimiters
reps = {'"NAN"': 'NAN', '"': '', '0-': '0,', '1-': '1,', '2-': '2,', '3-': '3,', '4-': '4,', '5-': '5,', '6-': '6,', '7-': '7,', '8-': '8,', '9-': '9,', ' ': ',', ':': ','}

# Skip the first four lines
for i in range(4): 
    next(inputfile) 

# Parse each line
for line in inputfile:
    outputfile.write(data_parser(line, reps))  # data_parser returns one string, so write() is enough

inputfile.close()
outputfile.close()

This way only one line is held in memory at a time, so you can handle large files without worrying about memory usage. Pretty handy, right?
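
For reference, here is a minimal self-contained sketch of the line-by-line approach that can be run without the original files. It assumes Python 3 (so items() replaces iteritems()), uses an in-memory io.StringIO object as a stand-in for test.dat, and the four header lines are made up; the data line is a shortened version of the sample in the question. The convert helper is a hypothetical name, not something from the original code.

```python
import io

# Same replacement table as in the question.
reps = {'"NAN"': 'NAN', '"': '', '0-': '0,', '1-': '1,', '2-': '2,',
        '3-': '3,', '4-': '4,', '5-': '5,', '6-': '6,', '7-': '7,',
        '8-': '8,', '9-': '9,', ' ': ',', ':': ','}


def data_parser(text, dic):
    # Apply every replacement in the table to a single line of text.
    for old, new in dic.items():
        text = text.replace(old, new)
    return text


def convert(infile, outfile, header_lines=4):
    # Skip the header, then parse and write one line at a time.
    for _ in range(header_lines):
        next(infile)
    for line in infile:
        outfile.write(data_parser(line, reps))


# Stand-in for test.dat: four made-up header lines plus one data line.
sample = io.StringIO(
    'header 1\nheader 2\nheader 3\nheader 4\n'
    '"2012-06-23 03:09:13.23",4323584,-1.911224,"NAN",-0.3489428\n'
)
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

Because the table only replaces a dash that follows a digit, the dashes inside the date are split while the leading minus signs on the numbers are left alone.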

raimavaswani · January 12, 2025, 8:35am

Nice start, @anusha_gg! If you’re working specifically with CSV files, I’ve found Python’s built-in csv module to be super useful for parsing. You can read and write line-by-line just like before, but using csv.reader and csv.writer will make it cleaner and handle delimiters more smoothly.

Here’s how you can use it:

import csv

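# On Python 2 the csv module expects files opened in binary mode;
# on Python 3, open them in text mode with newline='' instead
inputfile = open('test.dat', 'rb')
outputfile = open('test.csv', 'wb')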

reader = csv.reader(inputfile)
writer = csv.writer(outputfile)

# Skip the first four lines
for i in range(4): 
    next(reader) 

# Parse and write the rows
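for row in reader:
    parsed_row = []
    for cell in row:
        # data_parser can turn one cell (the timestamp) into several
        # comma-separated values, so split them into separate fields;
        # otherwise csv.writer would quote the whole cell again
        parsed_row.extend(data_parser(cell, reps).split(','))
    writer.writerow(parsed_row)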

inputfile.close()
outputfile.close()

The csv module makes things easier when working with structured data. Just remember that csv.reader only splits on commas and strips the surrounding quotes (so "NAN" arrives as NAN); it won't split the timestamp on dashes, spaces, or colons, so you'll still want your custom data_parser to handle those tricky cases!
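
Here is a rough, runnable sketch of the csv-based idea, under a few assumptions: Python 3, an in-memory io.StringIO object instead of test.dat, made-up header lines, and a shortened data line from the question. Since csv.reader already removes the quotes, the replacement table here only needs the digit-dash, space, and colon entries. The convert helper is a hypothetical name.

```python
import csv
import io

# Digit-dash pairs split the date without touching negative numbers.
reps = {str(d) + '-': str(d) + ',' for d in range(10)}
reps.update({' ': ',', ':': ','})


def data_parser(text, dic):
    # Apply every replacement in the table to one cell.
    for old, new in dic.items():
        text = text.replace(old, new)
    return text


def convert(infile, outfile, header_lines=4):
    reader = csv.reader(infile)
    # lineterminator='\n' keeps the output free of '\r\n' line endings.
    writer = csv.writer(outfile, lineterminator='\n')
    # Skip the header rows.
    for _ in range(header_lines):
        next(reader)
    for row in reader:
        fields = []
        for cell in row:
            # One parsed cell may expand into several fields.
            fields.extend(data_parser(cell, reps).split(','))
        writer.writerow(fields)


# Stand-in for test.dat: four made-up header lines plus one data line.
sample = io.StringIO(
    'header 1\nheader 2\nheader 3\nheader 4\n'
    '"2012-06-23 03:09:13.23",4323584,-1.911224,"NAN",-0.3489428\n'
)
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

Splitting each parsed cell before calling writerow matters: if the timestamp were passed on as one string containing commas, csv.writer would wrap it in quotes and it would stay a single column.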

Rashmihasija · January 13, 2025, 8:36am

Great addition, @raimavaswani! If you’re aiming for cleaner code, you might want to look into using Python’s with statement for automatic file handling. It ensures your files are always properly closed, even if something goes wrong during the process. Here’s an enhanced version of what we’ve discussed:

with open('test.dat') as inputfile, open('test.csv', 'w') as outputfile:
    # Skip the first four lines
    for i in range(4): 
        next(inputfile) 

    # Parse the remaining lines
    for line in inputfile:
        outputfile.write(data_parser(line, reps))  # data_parser returns one string, so write() is enough

This simplifies things because you don’t need to close the files manually at the end. Plus, it keeps the code neat and tidy while still handling the parsing job with ease!
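
To try the with-statement version end to end, something like the sketch below should work. It assumes Python 3, writes a throwaway test.dat with made-up header lines into a temporary directory, and swaps the next() loop for itertools.islice, which skips the header lazily and simply yields nothing if the file has fewer than four lines. The convert_file helper is a hypothetical name.

```python
import os
import tempfile
from itertools import islice

# Same replacement table as in the question.
reps = {'"NAN"': 'NAN', '"': '', '0-': '0,', '1-': '1,', '2-': '2,',
        '3-': '3,', '4-': '4,', '5-': '5,', '6-': '6,', '7-': '7,',
        '8-': '8,', '9-': '9,', ' ': ',', ':': ','}


def data_parser(text, dic):
    # Apply every replacement in the table to a single line of text.
    for old, new in dic.items():
        text = text.replace(old, new)
    return text


def convert_file(in_path, out_path, header_lines=4):
    # Both files are closed automatically, even if parsing raises.
    with open(in_path) as inputfile, open(out_path, 'w') as outputfile:
        # islice drops the first header_lines lines of the file.
        for line in islice(inputfile, header_lines, None):
            outputfile.write(data_parser(line, reps))


# Demo on a throwaway input file with made-up header lines.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'test.dat')
out_path = os.path.join(workdir, 'test.csv')
with open(in_path, 'w') as f:
    f.write('h1\nh2\nh3\nh4\n'
            '"2012-06-23 03:09:13.23",4323584,"NAN",-0.130449\n')

convert_file(in_path, out_path)
with open(out_path) as f:
    print(f.read())
```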