I’m dealing with email headers (like message-id) via the milter protocol in Python 3, and I need to save this data without altering it, even if it’s in an unknown encoding. Using error handlers like ignore or replace makes the headers RFC-compliant, but that reduces the effectiveness of antispam scoring, which depends on the raw input.
In Python 2, my milter application handled this just fine. But in Python 3, when I try to write the raw string to a file, I run into UnicodeEncodeError if the input contains bytes outside UTF-8 (e.g., from ISO8859-2).
Is there a way in Python 3 to write raw data with unknown encoding to disk without raising an exception or altering the content?
Any advice would be appreciated!
I’ve worked quite a bit with mail filters and antispam pipelines, and dealing with unknown encoding is a headache I’ve hit many times. If you’re handling raw email headers and want to preserve every byte exactly as received, the safest way is to skip encodings entirely and write the data in binary mode.
Instead of treating your header as a string, keep it as a bytes object and write it like this:
with open("headers.raw", "ab") as f:
f.write(header_bytes)
This sidesteps encoding entirely, so bytes that aren't valid UTF-8 can't raise an error. The key is making sure your data hasn't been decoded to a str prematurely. Writing raw bytes like this is ideal for antispam pipelines, where byte-for-byte integrity matters more than human readability.
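If your milter library hands you a str instead of raw bytes, a small helper can normalize either case back to bytes before writing. This is only a minimal sketch, assuming any str was decoded with surrogateescape; to_raw_bytes and header_value are illustrative names, not part of any milter API:

# Example value: a decoded header where the byte 0xE9 became the lone surrogate \udce9
header_value = "Message-ID: <\udce9test@example.com>"

def to_raw_bytes(value):
    # If it's already bytes, nothing to do. If it's a str, we assume it was
    # decoded with errors="surrogateescape", so encoding the same way restores
    # the original bytes. (If it was decoded with ignore or replace, the
    # original bytes are already gone and can't be recovered here.)
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8", errors="surrogateescape")

with open("headers.raw", "ab") as f:
    f.write(to_raw_bytes(header_value) + b"\n")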
I’ve run into similar problems doing forensic debugging of mail systems over the years, especially when dealing with unknown encoding in email streams. Mark’s binary-write approach is spot-on for preserving raw data, but sometimes I also need it stored in a format that’s safe and human-readable.
What worked well for me was storing the raw data base64-encoded. It’s a bit of a detour, but it guarantees that you can write the data out safely without hitting encoding errors, and it’s easy to decode later:
import base64

# base64 output is plain ASCII, so a text-mode file can't raise encoding errors
with open("raw_headers.txt", "a") as f:
    f.write(base64.b64encode(header_bytes).decode("ascii") + "\n")
This way, even if the original headers have unknown encoding, you’re storing the exact bytes without corruption. Plus, it’s readable enough for logs or archives.
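Reading the archive back is just the reverse; a quick sketch, assuming one base64 line per header as written above:

import base64

with open("raw_headers.txt", "r") as f:
    for line in f:
        original_bytes = base64.b64decode(line.strip())  # exactly the bytes that were logged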
I’ve dealt a lot with cross-gateway email systems where unknown encoding can trip things up, especially in Python 3. I like both Mark’s and Charity’s solutions, but there’s also a neat trick for when your data is already a decoded string and you still want to write it out without crashing on unexpected characters.
The surrogateescape error handler can save the day. When decoding, it maps undecodable bytes to special Unicode code points (lone surrogates), and when encoding it turns those surrogates back into the original bytes, so you can round-trip without loss. The one catch is that the string must have been decoded with surrogateescape in the first place; if it was decoded with ignore or replace, the original bytes are already gone. Here’s how I’d handle it:
with open("output.txt", "w", encoding="utf-8", errors="surrogateescape") as f:
f.write(header_string)
This way, even with unknown encoding in your input, you avoid UnicodeEncodeError and don’t lose any data. I’ve found this approach super reliable when dealing with unpredictable encodings in mail flows.
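To make the round trip concrete, here’s a small self-contained sketch (the non-UTF-8 sample bytes are just an illustration):

raw = b"Subject: \xe9\xf5 test"  # contains bytes that aren't valid UTF-8

# Decoding with surrogateescape turns each undecodable byte into a lone surrogate
text = raw.decode("utf-8", errors="surrogateescape")

# Writing with the same error handler encodes those surrogates back
with open("output.txt", "w", encoding="utf-8", errors="surrogateescape") as f:
    f.write(text)

# Reading the file in binary mode gives back the original bytes unchanged
with open("output.txt", "rb") as f:
    assert f.read() == raw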