What Python module can convert PDF to text?

Is there a Python module for converting PDF files into plain text? I tried a script that uses pypdf (found on ActiveState), but the output had no spaces between words, which made it unreadable.

Are there better alternatives for accurately extracting text from PDFs in Python?

PyMuPDF is a Python binding for MuPDF, a lightweight PDF, XPS, and EPUB viewer. Its text extraction is fast and, for most PDFs, keeps the spaces between words intact. Example:

import fitz  # PyMuPDF

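def pdf_to_text(pdf_file):
    # Open with a context manager so the document is closed afterwards
    with fitz.open(pdf_file) as doc:
        # get_text() returns the plain text of a single page
        return ''.join(page.get_text() for page in doc)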

print(pdf_to_text('sample.pdf'))

Advantages:

  • Generally accurate; get_text() also has modes such as "blocks", "words", and "dict" that return positions when you need the layout.
  • Supports extracting text, images, and metadata.
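
If the extracted text still runs words together, one option is to rebuild each line from PyMuPDF's word-level output. The sketch below assumes the tuple layout returned by page.get_text("words"), which is (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper names words_to_lines and pdf_to_spaced_text are made up for illustration.

```python
from itertools import groupby


def words_to_lines(words):
    """Join word tuples into lines, with one space between words.

    Each item is assumed to look like PyMuPDF's "words" output:
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    # Sort by block, line, then word index so words appear in line order
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    lines = []
    # Words sharing (block_no, line_no) belong to the same line
    for _, group in groupby(ordered, key=lambda w: (w[5], w[6])):
        lines.append(' '.join(w[4] for w in group))
    return '\n'.join(lines)


def pdf_to_spaced_text(pdf_file):
    import fitz  # PyMuPDF, imported here so words_to_lines works without it

    with fitz.open(pdf_file) as doc:
        return '\n'.join(words_to_lines(page.get_text('words')) for page in doc)
```

Calling pdf_to_spaced_text('sample.pdf') should give one line of output per text line in the PDF, with single spaces between words.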

pdfplumber (built on pdfminer.six) extracts both text and tables from PDFs. It is a good fit when the document contains tabular data, because it exposes each page's characters, lines, and rectangles along with their positions.

Example:

import pdfplumber

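def pdf_to_text(pdf_file):
    with pdfplumber.open(pdf_file) as pdf:
        # extract_text() may return None for a page with no text layer,
        # so fall back to an empty string before joining
        return '\n'.join(page.extract_text() or '' for page in pdf.pages)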

print(pdf_to_text('sample.pdf'))

Advantages:

  • Handles complex layouts and tables better than PyPDF2.
  • Returns tables as lists of rows through page.extract_tables(), in addition to plain text.
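
For the tabular case, a rough sketch of turning each table into CSV text might look like the following. It relies on page.extract_tables(), which returns a list of tables, each a list of rows whose cells are strings or None. The function names rows_to_csv and pdf_tables_to_csv are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize one table (a list of rows) to CSV text; None cells become ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for row in rows:
        writer.writerow(['' if cell is None else cell for cell in row])
    return buffer.getvalue()


def pdf_tables_to_csv(pdf_file):
    import pdfplumber  # imported lazily; only this function needs it

    chunks = []
    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            # One CSV string per table found on the page
            for table in page.extract_tables():
                chunks.append(rows_to_csv(table))
    return chunks
```

pdf_tables_to_csv('sample.pdf') would then return a list with one CSV string per detected table. Table detection depends on how the PDF draws its ruling lines, so the result is worth checking by eye.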

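PyPDF2 is another popular library that can extract text from PDFs. While not as robust as PyMuPDF or pdfplumber, it can work well for simpler PDFs. Note that PyPDF2 is no longer developed under that name: the project continues as pypdf, which provides the same PdfReader interface.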

Example:

import PyPDF2

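def pdf_to_text(pdf_file):
    # PdfReader needs the file opened in binary mode
    with open(pdf_file, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Collect each page's text and join once, one page per line break
        return '\n'.join(page.extract_text() for page in reader.pages)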

print(pdf_to_text('sample.pdf'))

Advantages:

  • Good for simple, text-based PDFs.
  • Easy to use and widely adopted.
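
Since no single library wins on every PDF, one way to combine them is to try several extractors in order and keep the first non-empty result. This is only a sketch: first_nonempty_text is a hypothetical helper, and each extractor is assumed to be a function that takes a file path and returns a string, like the pdf_to_text examples above.

```python
def first_nonempty_text(pdf_file, extractors):
    """Return (name, text) from the first extractor that yields text.

    `extractors` is a list of (name, function) pairs; each function takes
    a file path and returns a string. Extractors that raise (for example
    ImportError when a library is not installed) are skipped.
    """
    for name, extract in extractors:
        try:
            text = extract(pdf_file)
        except Exception:
            continue  # this backend failed; try the next one
        if text and text.strip():
            return name, text
    return None, ''
```

The returned name tells you which backend produced the text, which helps when comparing output quality across libraries.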