What Python module can convert PDF to text?

anusha_gg · December 30, 2024, 6:30pm

Is there a Python module for converting PDF files into plain text? I tried a script using pypdf (found on ActiveState), but the output text had no spaces between words, making it unreadable.

Are there better alternatives for accurately extracting text from PDFs in Python?

shilpa.chandel · December 30, 2024, 6:30pm

PyMuPDF is a Python binding for MuPDF, a lightweight PDF, XPS, and EPUB viewer. It provides a highly accurate and reliable way to extract text from PDFs. Example:

import fitz  # PyMuPDF

def pdf_to_text(pdf_file):
    # Open the document (closed automatically on exit) and
    # concatenate the plain text of every page.
    with fitz.open(pdf_file) as doc:
        return ''.join(page.get_text() for page in doc)

print(pdf_to_text('sample.pdf'))

Advantages:

Highly accurate, and retains the layout.
Supports extracting text, images, and metadata.
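
If the plain-text output still runs words together, one thing worth trying is PyMuPDF's word-level output, page.get_text("words"), which returns one tuple per word: (x0, y0, x1, y1, word, block_no, line_no, word_no). The following is only a rough sketch: words_to_lines and pdf_to_spaced_text are names made up for this example, and grouping by block and line number follows the order stored in the file, which may not match the visual reading order on every PDF.

```python
from collections import defaultdict


def words_to_lines(words):
    """Join PyMuPDF word tuples into lines of space-separated text.

    Each tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    lines = defaultdict(list)
    for x0, y0, x1, y1, word, block_no, line_no, word_no in words:
        # Group words by the (block, line) they belong to.
        lines[(block_no, line_no)].append((word_no, word))
    out = []
    for key in sorted(lines):
        # Within a line, order words by word_no and put one space between them.
        out.append(' '.join(w for _, w in sorted(lines[key])))
    return '\n'.join(out)


def pdf_to_spaced_text(pdf_file):
    # Imported here so words_to_lines can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    pages = []
    with fitz.open(pdf_file) as doc:
        for page in doc:
            pages.append(words_to_lines(page.get_text("words")))
    return '\n'.join(pages)

# Usage (assumes a file named sample.pdf exists):
# print(pdf_to_spaced_text('sample.pdf'))
```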

Rashmihasija · December 30, 2024, 6:31pm

pdfplumber allows you to extract both text and tables from PDFs. It provides a more structured and clean approach to extracting data, especially if the document contains tabular data.

Example:

import pdfplumber

def pdf_to_text(pdf_file):
    with pdfplumber.open(pdf_file) as pdf:
        text = ''
        for page in pdf.pages:
            # Fall back to '' in case a page yields no text (None).
            text += page.extract_text() or ''
    return text

print(pdf_to_text('sample.pdf'))

Advantages:

Handles complex layouts and tables well.
Provides high-quality text extraction.
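
Since tables are the main reason to pick pdfplumber, here is a minimal sketch of pulling them out as well. It uses page.extract_tables(), which returns each table as a list of rows, with every row a list of cell values (a cell can be None when it is empty). The helper names table_to_tsv and pdf_tables_to_tsv are invented for this example, and tab-separated output is just one convenient choice.

```python
def table_to_tsv(table):
    """Render one table (a list of rows of cell values) as tab-separated text."""
    lines = []
    for row in table:
        # Empty cells come back as None; line breaks inside a cell are flattened.
        cells = ['' if cell is None else str(cell).replace('\n', ' ') for cell in row]
        lines.append('\t'.join(cells))
    return '\n'.join(lines)


def pdf_tables_to_tsv(pdf_file):
    # Imported here so table_to_tsv can be used without pdfplumber installed.
    import pdfplumber

    results = []
    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                results.append(table_to_tsv(table))
    return results

# Usage (assumes a file named sample.pdf exists):
# for tsv in pdf_tables_to_tsv('sample.pdf'):
#     print(tsv, end='\n\n')
```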

netra.agarwal · December 30, 2024, 6:34pm

PyPDF2 is another popular library that can extract text from PDFs (its development has since moved back to the pypdf package). While not as robust as PyMuPDF or pdfplumber, it can work well for simpler PDFs.

Example:

import PyPDF2

def pdf_to_text(pdf_file):
    with open(pdf_file, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in reader.pages:
            text += page.extract_text()
    return text

print(pdf_to_text('sample.pdf'))

Advantages:

Good for simple, text-based PDFs.
Easy to use and widely adopted.
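
A quick way to check whether a file counts as a "simple, text-based" PDF is to see which pages return no text at all. Pages like that are often scanned images, which typically need OCR rather than a text extractor. This is only a sketch: blank_pages and find_blank_pages are hypothetical helper names, and an empty result does not guarantee the extracted text is well spaced.

```python
def blank_pages(page_texts):
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or '').strip()
    ]


def find_blank_pages(pdf_file):
    # Imported here so blank_pages can be used without PyPDF2 installed.
    import PyPDF2

    with open(pdf_file, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        texts = [page.extract_text() for page in reader.pages]
    return blank_pages(texts)

# Usage (assumes a file named sample.pdf exists):
# print(find_blank_pages('sample.pdf'))
```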