OCR PDF with Tesseract

October 29, 2018

To use Tesseract OCR on a PDF, first convert the PDF to TIFF. The same command works for single-page and multipage PDFs:

magick -density 300 in.pdf -depth 1 -strip -background white -alpha off out.tiff

This binary (black or white only) TIFF file is about 1 MB per page. For large or complicated PDFs, consider converting groups of pages. Pages are 0-indexed, so to convert, say, pages 4-7 of the PDF:

magick -density 300 in.pdf[3-6] -depth 1 -strip -background white -alpha off out.tiff
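For a long PDF, the page groups can be generated in a loop. This is a minimal sketch, assuming the page count is already known and passed in by hand. The helper names `page_spec` and `convert_in_groups` and the `out_001.tiff` naming scheme are hypothetical, not part of ImageMagick.

```shell
# page_spec FILE FIRST LAST
# Turn a 1-indexed, inclusive page range into ImageMagick's
# 0-indexed selector, e.g. pages 4-7 of in.pdf -> in.pdf[3-6].
page_spec() {
    printf '%s[%d-%d]\n' "$1" "$(( $2 - 1 ))" "$(( $3 - 1 ))"
}

# convert_in_groups FILE NPAGES GROUPSIZE
# Convert FILE in groups of GROUPSIZE pages, writing
# out_001.tiff, out_002.tiff, ... (hypothetical naming scheme).
convert_in_groups() {
    file=$1
    npages=$2
    size=$3
    first=1
    group=1
    while [ "$first" -le "$npages" ]; do
        last=$(( first + size - 1 ))
        # The final group may be shorter than GROUPSIZE.
        if [ "$last" -gt "$npages" ]; then
            last=$npages
        fi
        magick -density 300 "$(page_spec "$file" "$first" "$last")" \
            -depth 1 -strip -background white -alpha off \
            "$(printf 'out_%03d.tiff' "$group")"
        first=$(( last + 1 ))
        group=$(( group + 1 ))
    done
}
```

For example, `convert_in_groups in.pdf 10 4` would convert pages 1-4, 5-8 and 9-10 into three TIFF files.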

While at least 300 DPI is recommended, increasing the resolution can sometimes make Tesseract perform worse, particularly for poor-quality text. In such cases, it may be better to filter or otherwise process the input images further before passing them to Tesseract.
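One possible pre-processing step is to convert the scan to grayscale, deskew it, and threshold it before OCR. This is a rough sketch, assuming ImageMagick 7. The function name `preprocess` is hypothetical, and the 40% and 50% values are only starting points that will likely need tuning per document.

```shell
# preprocess IN OUT
# Grayscale, straighten, and binarize a scanned page.
# The 40% deskew threshold and 50% binarization threshold are
# illustrative values, not settings taken from the original post.
preprocess() {
    magick -density 300 "$1" -colorspace Gray -deskew 40% -threshold 50% "$2"
}
```

Usage would look like `preprocess scan.png clean.tiff`, with the result passed to Tesseract in place of the raw scan.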

Run OCR. By default Tesseract writes plain text, here to out.txt; it can also output PDF or other formats. Be aware that not all documentation and tips on the web address the machine learning models present in Tesseract 4.x.

tesseract out.tiff out
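To get a searchable PDF instead of plain text, the `pdf` output config can be appended after the output base name. This is a small sketch; the wrapper name `ocr_pdf` is hypothetical and exists only to make the argument order explicit.

```shell
# ocr_pdf IN OUTBASE
# Run Tesseract on IN and write a searchable PDF to OUTBASE.pdf.
ocr_pdf() {
    tesseract "$1" "$2" pdf
}
```

For example, `ocr_pdf out.tiff out` would produce out.pdf.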

Tesseract processing can be controlled in numerous ways, including the recognition language, the page segmentation mode, and the OCR engine mode.
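As an illustration of such controls, the sketch below sets the language, page segmentation mode, and engine mode on the command line. It assumes Tesseract 4.x; the wrapper name `ocr_tuned` and its defaults (English, `--psm 6`) are hypothetical choices, not recommendations from the original post.

```shell
# ocr_tuned IN OUTBASE [LANG] [PSM]
# -l      recognition language (default here: eng)
# --psm   page segmentation mode (6 = assume a single uniform block of text)
# --oem 1 use the neural network (LSTM) engine only
ocr_tuned() {
    tesseract "$1" "$2" -l "${3:-eng}" --psm "${4:-6}" --oem 1
}
```

For example, `ocr_tuned out.tiff out deu 4` would run German recognition with page segmentation mode 4, provided the `deu` language data is installed.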

improving tesseract input