How to properly handle UTF-8 encoding for special characters in Python?

How can I make sure UTF-8 encoding is handled correctly in Python when dealing with strings containing special characters (like é or è) in my scripts?

I am building a string in Python and saving it to a file; it contains a lot of data, including directory names and filenames. I want to keep everything in UTF-8 since I will save it in MySQL later. My MySQL database is also set to UTF-8, but I am running into problems with some characters (like é or è), even though the string displays correctly in the file.

Here’s my script:

#!/usr/bin/python
# -*- coding: utf-8 -*-
def createIndex():
    import codecs
    toUtf8 = codecs.getencoder('UTF8')
    # lot of operations & building indexSTR the string that matters
    findex = open('config/index/music_vibration_' + date + '.index', 'a')
    findex.write(codecs.BOM_UTF8)
    findex.write(toUtf8(indexSTR))  # This throws an error!

When I run this script, I encounter the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)

I noticed that after creating this file I can read it and write its contents into MySQL, but the encoding comes out wrong. My MySQL database is set to utf8 (as confirmed by the query SHOW VARIABLES LIKE 'char%', which only returns utf8 or binary).

Here’s the MySQL code I’m using:

#!/usr/bin/python
# -*- coding: utf-8 -*-
def saveIndex(index, date):
    import MySQLdb as mdb
    import codecs

    sql = mdb.connect('localhost', 'admin', '*******', 'music_vibration')
    sql.charset = "utf8"
    findex = open('config/index/' + index, 'r')
    lines = findex.readlines()
    for line in lines:
        if line.find('#artiste') != -1:
            artiste = line.split('[:::]')
            artiste = artiste[1].replace('\n', '')

            c = sql.cursor()
            c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="' + artiste + '"')
            nbr = c.fetchone()
            if nbr[0] == 0:
                c = sql.cursor()
                iArt += 1
                c.execute('INSERT INTO artistes(nom, status, path) VALUES("' + artiste + '", 99, "' + artiste + '/")'.encode('utf8'))

Even though the artiste string is correctly displayed in the file, it is being written incorrectly into the MySQL database. What might be causing this issue?

From my experience, a simple and effective way to handle UTF-8 is to pass encoding='utf-8' to open() when working with files (this is the built-in open() in Python 3; on Python 2 you can get the same behaviour from io.open()). There’s no need to manually encode or decode strings with codecs like we used to: Python does the heavy lifting for you. Here’s an example:

#!/usr/bin/python
# -*- coding: utf-8 -*-
def createIndex(date, indexSTR):
    # date and indexSTR are built elsewhere in the script; the file is opened
    # in text mode with UTF-8 encoding, so Python encodes on every write.
    with open('config/index/music_vibration_' + date + '.index', 'a', encoding='utf-8') as findex:
        findex.write(indexSTR)

This way, you’re letting Python handle the UTF-8 encoding automatically when writing to the file. It cuts down on complexity and avoids encoding errors, which is always a win!
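If you want to double-check that the accents actually survive, a quick round-trip test helps. This is just a minimal sketch with a throwaway file name (utf8_check.txt is not part of the original script):

# Write a string containing accented characters, then read it back.
sample = 'Café Noé: é and è intact'
with open('utf8_check.txt', 'w', encoding='utf-8') as f:
    f.write(sample)
with open('utf8_check.txt', 'r', encoding='utf-8') as f:
    assert f.read() == sample  # the accents survive the round trip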

That’s a solid approach, @richaaroy. One thing I’ve learned over the years is that if you’re working with a MySQL database, it’s just as crucial to ensure your MySQL connection is set to UTF-8. Otherwise, you’re likely to run into problems when storing or retrieving special characters. I set the charset explicitly when connecting to the database like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-
def saveIndex(index, date):
    import MySQLdb as mdb

    # Establish the MySQL connection and set the connection charset to utf8
    sql = mdb.connect('localhost', 'admin', '*******', 'music_vibration')
    sql.set_character_set('utf8')  # Ensure UTF-8 is used on the connection

    with open('config/index/' + index, 'r', encoding='utf-8') as findex:
        lines = findex.readlines()

    c = sql.cursor()
    for line in lines:
        if '#artiste' in line:
            artiste = line.split('[:::]')[1].replace('\n', '')
            c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
            nbr = c.fetchone()
            if nbr[0] == 0:
                c.execute('INSERT INTO artistes(nom, status, path) VALUES(%s, 99, %s)', (artiste, artiste + '/'))
    sql.commit()  # Persist the INSERTs; MySQLdb does not autocommit by default

Handling UTF-8 explicitly in both your file operations and your database connection helps prevent encoding issues. And don’t forget to use parameterized queries to protect against SQL injection. This approach is cleaner and ensures everything plays nicely together.
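If you prefer, MySQLdb also lets you pass the charset directly to connect() instead of calling set_character_set() afterwards, together with use_unicode so text columns come back as str. A minimal sketch, reusing the placeholder credentials from this thread:

import MySQLdb as mdb

# Configure the session encoding up front at connect time.
sql = mdb.connect(
    host='localhost',
    user='admin',
    passwd='*******',
    db='music_vibration',
    charset='utf8',     # send and receive data as UTF-8
    use_unicode=True,   # return text columns as str instead of bytes
)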

Great points, @mark-mazay! Just to build on that: sometimes you still need to encode strings manually, for instance when data passes through a layer of your application that expects bytes rather than text. In those cases, you can call encode('utf-8') explicitly before performing any database operations. Here’s an updated example:

#!/usr/bin/python
# -*- coding: utf-8 -*-
def saveIndex(index, date):
    import MySQLdb as mdb

    sql = mdb.connect('localhost', 'admin', '*******', 'music_vibration')
    sql.set_character_set('utf8')  # Configure the connection charset

    with open('config/index/' + index, 'r', encoding='utf-8') as findex:
        lines = findex.readlines()

    c = sql.cursor()
    for line in lines:
        if '#artiste' in line:
            artiste = line.split('[:::]')[1].replace('\n', '')
            encoded_artiste = artiste.encode('utf-8')  # Explicitly encode the string as UTF-8 bytes

            c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (encoded_artiste,))
            nbr = c.fetchone()
            if nbr[0] == 0:
                c.execute('INSERT INTO artistes(nom, status, path) VALUES(%s, 99, %s)',
                          (encoded_artiste, encoded_artiste + b'/'))  # bytes + bytes, not bytes + str
    sql.commit()  # Persist the INSERTs

By calling encode('utf-8') explicitly like this, you know exactly which bytes are handed to the database, which helps when other layers of your application don’t handle encoding as smoothly as Python does.
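For what it’s worth, here’s a tiny sketch (using an example artist name) of what encode('utf-8') actually produces for an accented character, and how decode('utf-8') reverses it:

artiste = 'Mylène Farmer'          # example value
encoded = artiste.encode('utf-8')  # bytes: b'Myl\xc3\xa8ne Farmer' (è becomes the two bytes 0xC3 0xA8)
print(encoded)
print(encoded.decode('utf-8') == artiste)  # True: decoding reverses the encoding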