Scientific Computing

OCR PDF with Tesseract

To run Tesseract OCR on a PDF, first convert the PDF to TIFF. This command works for both single-page and multi-page PDFs:

magick -density 300 in.pdf -depth 1 -strip -background white -alpha off out.tiff

This bilevel (black and white only) TIFF file is about 1 MB per page. For large or complicated PDFs, consider converting groups of pages. Pages are 0-indexed, so to convert, say, pages 4-7 of the PDF:

magick -density 300 in.pdf[3-6] -depth 1 -strip -background white -alpha off out.tiff
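
The 0-indexed page selector is easy to get off by one. A minimal Python sketch, assuming the caller counts pages from 1 as a PDF viewer does; the function name magick_page_range_cmd is hypothetical, and the flags are copied from the command above:

```python
def magick_page_range_cmd(pdf, first_page, last_page, out="out.tiff", density=300):
    """Build the ImageMagick argument list for 1-indexed, inclusive PDF pages."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("need 1 <= first_page <= last_page")
    # ImageMagick counts pages from 0, so shift both ends down by one
    selector = f"{pdf}[{first_page - 1}-{last_page - 1}]"
    return ["magick", "-density", str(density), selector,
            "-depth", "1", "-strip", "-background", "white", "-alpha", "off", out]


cmd = magick_page_range_cmd("in.pdf", 4, 7)
print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where magick is on PATH.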

While at least 300 DPI is recommended, increasing the resolution sometimes makes Tesseract perform worse, particularly for poor-quality text. In such cases, it may be better to filter or otherwise preprocess the input images before passing them to Tesseract.

Run OCR with the command below, which writes the recognized text to out.txt. Tesseract can also output PDF and other formats. Be aware that not all documentation and tips on the web address the machine learning (LSTM) models introduced in Tesseract 4.x.

tesseract out.tiff out

Tesseract processing can be controlled in numerous ways.

  • improving tesseract input

Fix ImageMagick 6 not authorized reading PDF

ImageMagick uses policy.xml to set read/write permissions by file format. When read permissions are disabled for a format such as PDF, ImageMagick operations might fail with errors like:

convert-im6.q16: not authorized

convert-im6.q16: DistributedPixelCache ‘127.0.0.1’

Fix: find the policy.xml location, which is printed at the top of the output of:

magick -list policy

For example, on Linux it might be at /etc/ImageMagick-*/policy.xml. On ImageMagick 6, which typically has no magick command, use convert -list policy instead.

Edit this policy.xml to have a line like:

<policy domain="coder" rights="read" pattern="PDF" />
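
To check what a given policy.xml currently says about PDF before editing it, a small Python sketch can list the relevant entries. The function name pdf_coder_rights is hypothetical. The handling of brace lists such as {PS,PDF} assumes the glob-style patterns some distributions ship, and the sketch only reports the rights attributes without trying to reproduce how ImageMagick resolves multiple matching entries:

```python
import xml.etree.ElementTree as ET


def pdf_coder_rights(policy_xml_text):
    """Return the rights attribute of each coder policy whose pattern names PDF."""
    root = ET.fromstring(policy_xml_text)
    rights = []
    for policy in root.iter("policy"):
        if policy.get("domain") != "coder":
            continue
        # a pattern may be a single name or a brace list such as {PS,PDF,XPS}
        names = policy.get("pattern", "").strip("{}").split(",")
        if "PDF" in (n.strip().upper() for n in names):
            rights.append(policy.get("rights", ""))
    return rights


sample = """<policymap>
  <policy domain="coder" rights="none" pattern="PS" />
  <policy domain="coder" rights="read" pattern="PDF" />
</policymap>"""
print(pdf_coder_rights(sample))
```

For a real file, pass the contents of the policy.xml found with the list command above.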

Markdown relative links in Readme / docs

Markdown, a de facto documentation syntax, has many variants. Relative links are widely supported, including by GitHub and GitLab. The syntax is simply:

[TODO list](./TODO.md)

The relative links then continue to work even when the repository is cloned, forked, or renamed.
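
Relative links do break when the target file itself is moved or deleted. A minimal Python sketch for catching that, assuming plain inline [text](target) links; the function name broken_relative_links and the regular expression are illustrative, not a full Markdown parser:

```python
import re
from pathlib import Path

# inline Markdown links: [text](target); image links ![alt](src) also match
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_relative_links(markdown_file):
    """Return relative link targets in markdown_file that do not exist on disk."""
    md = Path(markdown_file)
    broken = []
    for target in LINK.findall(md.read_text(encoding="utf-8")):
        # skip URLs with a scheme (https:, mailto:), in-page anchors, site-absolute paths
        if re.match(r"[a-zA-Z][a-zA-Z0-9+.-]*:", target) or target.startswith(("#", "/")):
            continue
        path = target.split("#", 1)[0]  # drop any #section fragment
        if not (md.parent / path).exists():
            broken.append(target)
    return broken
```

Calling broken_relative_links("README.md") from the repository root returns an empty list when every relative target exists.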

Spyder / Jupyter plots in separate window

By default, the IPython console in Spyder IDE shows Matplotlib plots as non-interactive images inline in the console, like a notebook. To open interactive figures in separate windows instead, set:

Tools → Preferences → IPython Console → Graphics → Graphics Backend → Backend: “automatic”

Interactive figures are more useful in general to probe the figure data and zoom/pan the figure, unlike the static PNGs in the inline notebook.

Jupyter notebooks can also have interactive plots. Instead of static inline notebook plots with

%matplotlib inline

for inline interactive plots in Jupyter:

%matplotlib notebook

Example:

%matplotlib notebook

from matplotlib.pyplot import figure

ax = figure().gca()
ax.plot(range(5))


Select compiler versions update-alternatives

Switch between compiler versions, e.g. g++-6 and g++-7, with simple commands. Note: we suggest NOT using sudo, but instead making the links under ~/.local/bin, which should already be in your PATH (if not, add it as in the one-time setup below).

(one-time setup) Set up the shell to prefer ~/.local/bin over system-wide /usr/bin. This is generally beneficial in any case.

mkdir -p ~/.local/bin

Add to ~/.profile:

export PATH="$HOME/.local/bin:$PATH"

(one-time setup) enable switching

update-alternatives --install $HOME/.local/bin/g++ g++ /usr/bin/g++-7 20
update-alternatives --install $HOME/.local/bin/g++ g++ /usr/bin/g++-6 10

and so on for gcc and gfortran

At any time, switch compiler versions:

update-alternatives --config g++
update-alternatives --config gcc
update-alternatives --config gfortran

update-alternatives works with virtually any program including Python.

Compiler version priority order: the last number given to update-alternatives --install is the priority. In “automatic” update-alternatives mode, the alternative with the highest priority number is used.
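
The automatic-mode rule amounts to picking the maximum priority. A tiny Python illustration, using the two priorities from the install commands above; the function name automatic_choice is hypothetical, and this models only the selection rule, not update-alternatives itself:

```python
def automatic_choice(alternatives):
    """Return the path with the highest priority, as automatic mode would pick."""
    # alternatives maps each candidate path to its integer priority
    return max(alternatives, key=alternatives.get)


candidates = {"/usr/bin/g++-7": 20, "/usr/bin/g++-6": 10}
print(automatic_choice(candidates))  # /usr/bin/g++-7
```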

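Troubleshooting: if you accidentally reversed the order of the link and target, or used sudo with links in /usr/bin, you may need to reinstall the compiler. If update-alternatives fails with a permission error when run without sudo, that is because its default state directories are system-owned; its --altdir and --admindir options can point it at user-writable directories instead.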

expanduser home directory tilde Fortran

Like Python, Fortran does not expand the tilde ~ in commands like open(). As in Python and some Matlab functions, an expanduser() procedure is needed. We provide this functionality in fortran-filesystem expanduser().

The shell typically expands ~ itself. The problem arises when a program reads a path starting with ~ from, say, a config file and then tries to open that filename as-is.
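
For comparison, here is the same situation in Python, where the standard library provides the expansion. This is a sketch only; the function name expand_config_path and the example path are hypothetical:

```python
from pathlib import Path


def expand_config_path(raw):
    """Expand a leading ~ in a path string read from a config file."""
    # strip the whitespace or newline that often comes along with config values
    return Path(raw.strip()).expanduser()


path = expand_config_path("~/data/input.txt\n")
print(path)
# open(path) works if the file exists, whereas open("~/data/input.txt")
# looks for a directory literally named "~" in the current directory
```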

Find Linux package dependencies

Numerous Python packages use PyTest extensively. When apt installs a package that depends on python-astropy, python-pytest is installed as well. On stable Linux distros like Ubuntu, the system version of a package is typically several minor versions behind. To find which installed packages are responsible for a package being installed, use a command like:

apt-cache rdepends --installed python-pytest

which reveals that

python-py
python-astropy

are requiring this old pytest version.
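
To use this list in a script, the command output can be parsed in Python. The sketch below assumes the usual apt-cache rdepends layout (the package name, a "Reverse Depends:" line, then one indented package per line, with "|" marking alternatives); the sample text is illustrative rather than captured output, and parse_rdepends is a hypothetical name:

```python
def parse_rdepends(output):
    """Extract package names listed under 'Reverse Depends:' in apt-cache output."""
    packages = []
    in_list = False
    for line in output.splitlines():
        if line.strip() == "Reverse Depends:":
            in_list = True
        elif in_list and line.strip():
            # a leading | marks an alternative dependency; drop the marker
            name = line.strip().lstrip("|")
            if name not in packages:
                packages.append(name)
    return packages


sample = """python-pytest
Reverse Depends:
  python-py
  python-astropy
"""
print(parse_rdepends(sample))
```

On a Debian or Ubuntu system, the real text could come from subprocess.run(["apt-cache", "rdepends", "--installed", "python-pytest"], capture_output=True, text=True).stdout.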

Disable Gnome Keyring SSH Agent

The Gnome Keyring SSH Agent on Ubuntu remembers SSH private key passwords until you log out. Anyone who knows the Ubuntu user password therefore also has access to any SSH private keys loaded since the last login.

Disabling the agent also fixes this error when trying to use ssh or sshfs:

sign_and_send_pubkey: signing failed: agent refused operation

Permanently disable the Gnome Keyring SSH Agent by including this line in /etc/xdg/autostart/gnome-keyring-ssh.desktop:

X-GNOME-Autostart-enabled=false

Reboot and test that private key passwords aren’t being remembered.
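
The edit can also be scripted. The Python sketch below only transforms the text of a desktop entry. It assumes the file contains a single [Desktop Entry] group, so appending the key at the end is valid, and it leaves reading and writing the file under /etc (which needs root) to the caller. The function name disable_autostart is hypothetical:

```python
KEY = "X-GNOME-Autostart-enabled"


def disable_autostart(desktop_text):
    """Return desktop-entry text with X-GNOME-Autostart-enabled set to false."""
    lines = desktop_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # match the key exactly, whatever value it currently has
        if line.split("=", 1)[0].strip() == KEY:
            lines[i] = f"{KEY}=false"
            replaced = True
    if not replaced:
        lines.append(f"{KEY}=false")
    return "\n".join(lines) + "\n"


print(disable_autostart("[Desktop Entry]\nName=SSH Key Agent\n"))
```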


Alternative method to disable the Gnome Keyring SSH Agent: edit /etc/xdg/autostart/gnome-keyring-ssh.desktop to include the line:

NoDisplay=false

This makes the entry visible under Startup Applications, where you can uncheck SSH Key Agent. Reboot and test that private key passwords aren’t being remembered.


Related: configure SSH agent to remember SSH keys

Installing GPSTk in Anaconda Python

GPSTk Python examples require installing GPSTk for Python. GPSTk is a complicated program that is more difficult to install than typical Python programs.

Install

Install minimal prerequisites

apt install g++ make cmake swig
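
Before starting the build, it can help to confirm that these tools are actually on PATH. A small Python sketch, where the function name missing_tools is hypothetical and the tool list mirrors the apt command above:

```python
import shutil


def missing_tools(tools=("g++", "make", "cmake", "swig")):
    """Return the tools from the list that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


print(missing_tools())  # an empty list means all prerequisites were found
```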

Download GPSTk source

git clone --depth 1 https://github.com/SGL-UT/GPSTk
cd GPSTk

Build & install

./build.sh -ue

Test

Get the example code

git clone https://github.com/scivision/gpstk-examples-python

GPSTk Example 1