Say goodbye to copy and paste: converting PDF to text with Python

created at 07-13-2021

Introduction

For many people, converting a PDF into editable text is a real need, but there is no easy way to do it. In the project described in this article, Lucas Soares, a senior machine learning engineer at K1 Digital, used OCR (Optical Character Recognition) to transcribe pdf slides automatically, with good results.

Traditional lectures are usually accompanied by a set of pdf slides. If you want to take notes on such lectures, you typically need to copy and paste a lot of content from the pdf.

Soares has been experimenting with transcribing pdf slides through OCR so that their content can be manipulated directly in markdown files. This automates the process and avoids manual copying and pasting.


Why not use a traditional pdf-to-text tool?

Soares found that traditional tools often introduce more problems than they solve, and that fixing those problems takes time. He tried traditional Python packages but ran into several issues (for example, having to parse the final output with complex regular expression patterns), so he decided to try object detection and OCR instead.
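To give a sense of the regular-expression clean-up mentioned above, here is a minimal sketch using only the standard library. The sample string is made up for illustration and is not the output of any particular tool; the function name clean_extracted_text and the specific patterns are assumptions, and real extractor output usually needs patterns tuned per document.

```python
import re

# Made-up sample of what a text extractor might return for one slide.
raw = "Reinforce-\nment   Learning\n\n\n\u2022 Agent and\nEnvironment\n3\n"

def clean_extracted_text(text):
    """Apply a few typical clean-up passes to raw extracted text."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"-\n(?=\w)", "", text)
    # Drop lines that contain only a number (for example a page number).
    text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse consecutive blank lines.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()

print(clean_extracted_text(raw))
```

Each pattern fixes one symptom, and the list tends to grow with every new document layout, which is the maintenance burden that motivated trying OCR.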

The basic process can be divided into the following steps:

  1. Convert the pdf to images;
  2. Detect and recognize text in images;
  3. Show sample output.

Transcribing pdf to text with deep-learning-based OCR

Convert pdf to image

The pdf slides Soares used come from David Silver's reinforcement learning course (see the pdf slide address below). The "pdf2image" package converts each slide to a png image.


Examples of pdf slides. Address: https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf

from pathlib import Path

from pdf2image import convert_from_path

pdf_path = "path/to/file/intro_RL_Lecture1.pdf"
# The OCR step later reads its input images from ./input_images
input_dir = Path("./input_images")
input_dir.mkdir(exist_ok=True)

# convert_from_path returns one PIL image per page (it requires poppler)
images = convert_from_path(pdf_path)
for i, image in enumerate(images):
    image.save(input_dir / f"image{i}.png", "PNG")

After processing, every pdf slide has been saved as a png image.

Detect and recognize text in images

To detect and recognize text in the png images, Soares uses the text detector from the ocr.pytorch library. Follow the instructions in that repository to download the models and save them in the checkpoints folder.

The code is as follows:

# adapted from this source: https://github.com/courao/ocr.pytorch
# Jupyter-only magics: drop these two lines when running as a plain script.
%load_ext autoreload
%autoreload 2
import os
import shutil
import pathlib
from glob import glob

import numpy as np
from PIL import Image
from ocr import ocr  # ocr.py from the ocr.pytorch repository

def single_pic_proc(image_file):
    """Run text detection and recognition on a single image file."""
    image = np.array(Image.open(image_file).convert('RGB'))
    result, image_framed = ocr(image)
    return result, image_framed

image_files = glob('./input_images/*.*')
result_dir = './output_images_with_boxes/'

# If the output folder exists we will remove it and redo it.
if os.path.exists(result_dir):
    shutil.rmtree(result_dir)
os.mkdir(result_dir)

for image_file in sorted(image_files):
    result, image_framed = single_pic_proc(image_file)  # detecting and recognizing the text
    filename = pathlib.Path(image_file).name
    stem = pathlib.Path(image_file).stem
    output_file = os.path.join(result_dir, filename)
    txt_file = os.path.join(result_dir, stem + '.txt')
    # Save the image with the detected text boxes drawn on it.
    Image.fromarray(image_framed).save(output_file)
    # Write one recognized text line per detected box.
    with open(txt_file, 'w') as txt_f:
        for key in result:
            txt_f.write(result[key][1] + '\n')

The script sets the input and output folders, loops over all input images (the converted pdf slides), runs the detection and recognition models from the OCR module on each one through the single_pic_proc() function, and saves both the annotated image and the recognized text to the output folder.
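One detail worth noting about the loop: sorted() orders file names lexicographically, so image10.png comes before image2.png. This does not affect the saved results, since each output is named after its input, but if slide order matters, a natural-sort key is one option. The helper natural_key below is a hypothetical name introduced for this sketch and is not part of ocr.pytorch.

```python
import re

def natural_key(path):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    # re.split with a capturing group keeps the digit runs in the result.
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", path)]

files = [
    "./input_images/image10.png",
    "./input_images/image2.png",
    "./input_images/image0.png",
]
ordered = sorted(files, key=natural_key)
print(ordered)
```

Passing key=natural_key to the sorted() call in the loop would process the slides in numeric order.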

Detection uses a PyTorch CTPN model and recognition uses a PyTorch CRNN model, both of which are included in the OCR module.
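As background on the recognition step: CRNN recognizers are commonly trained with CTC, where the network emits one label per time step, including a special blank label, and decoding collapses repeated labels and then drops the blanks. The sketch below illustrates that greedy decoding idea in plain Python. It is not the code used in ocr.pytorch; the function name, the alphabet, and the label sequence are assumptions made up for illustration.

```python
def ctc_greedy_decode(label_ids, alphabet, blank=0):
    """Collapse repeated labels, then drop blanks (best-path CTC decoding)."""
    chars = []
    previous = blank
    for label in label_ids:
        # Emit a character only when the label changes and is not the blank.
        if label != blank and label != previous:
            chars.append(alphabet[label - 1])  # label 1 maps to alphabet[0]
        previous = label
    return "".join(chars)

# Hypothetical per-time-step argmax labels; 0 is the blank symbol.
alphabet = "abcdefghijklmnopqrstuvwxyz"
steps = [0, 18, 18, 0, 12, 12, 12, 0, 0]
decoded = ctc_greedy_decode(steps, alphabet)
print(decoded)
```

The blank label is what allows genuine double letters: the sequence [1, 0, 1] decodes to two characters, while [1, 1] decodes to one.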

Sample output

The following code displays one of the output images:

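import pathlib

import cv2 as cv

output_dir = pathlib.Path("./output_images_with_boxes")

# To display a random file from the output folder instead (requires numpy as np):
# image = cv.imread(str(np.random.choice(list(output_dir.iterdir()),1)[0]))
image = cv.imread(f"{output_dir}/image7.png")
# (width, height) of the displayed image; scale these values to resize it.
size_reshaped = (int(image.shape[1]), int(image.shape[0]))
image = cv.resize(image, size_reshaped)
cv.imshow("image", image)
cv.waitKey(0)  # wait for a key press before closing the window
cv.destroyAllWindows()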

In the author's side-by-side comparison, the original pdf slide is on the left and the transcribed output is on the right. The transcription is reported to be highly accurate.

The recognized text for the same slide can be printed as follows:

filename = f"{output_dir}/image7.txt"
with open(filename, "r") as text:
    for line in text.readlines():
        print(line.strip("\n"))
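Since the stated goal is to work with slide content in markdown, the per-slide text files can be gathered into a single markdown document. The sketch below is one possible way to do that with the standard library. It assumes the imageN.txt naming used above; the function names slide_number and build_markdown, the heading layout, and the bullet formatting are all choices made for this illustration rather than part of the original project.

```python
import re
from pathlib import Path

def slide_number(path):
    """Return the integer in a name like image7.txt, or -1 if there is none."""
    match = re.search(r"(\d+)", path.stem)
    return int(match.group(1)) if match else -1

def build_markdown(text_dir, title="Lecture notes"):
    """Concatenate per-slide .txt files into one markdown string."""
    parts = [f"# {title}", ""]
    # Sort numerically so that slide 2 comes before slide 10.
    for txt_file in sorted(Path(text_dir).glob("*.txt"), key=slide_number):
        raw_lines = txt_file.read_text(encoding="utf-8").splitlines()
        lines = [ln.strip() for ln in raw_lines if ln.strip()]  # skip empty lines
        parts.append(f"## Slide {slide_number(txt_file)}")
        parts.append("")
        parts.extend(f"- {ln}" for ln in lines)
        parts.append("")
    return "\n".join(parts)
```

Calling build_markdown("./output_images_with_boxes") returns the notes as a string, which can then be written out with Path("notes.md").write_text(...). Each recognized line becomes a bullet under its slide heading, ready for manual editing.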

With the approach above, you end up with a flexible tool for transcribing many kinds of documents, from handwritten notes to text that appears in photos. Having your own OCR tool for processing text can be preferable to relying on external software to transcribe documents.

edited at: 07-13-2021