Scientific Computing

Numpy can't read .zip files

ZIP, GZ and similar files can be a quick-and-dirty way to compress individual data files for retrieval from remote sensors. In particular, the GeoRinex program has extensive capabilities for transparently reading (without first extracting to an uncompressed file) .zip, .z, .gz, etc. compressed text files, which benefit greatly from storage space savings. It was surprising to find that transparently processing similarly compressed binary data is not trivial, particularly with numpy.fromfile. Numpy has unresolved bugs with numpy.fromfile that preclude easy use with inline reading via zipfile.ZipFile or tarfile. Specifically, the file objects returned by zipfile and tarfile do not provide .fileno, and numpy.fromfile() relies on .fileno among other attributes.

numpy.frombuffer is not generally suitable for this application either, because it does not advance the buffer position. We are not saying there’s no way around this situation, but we chose a more generally beneficial path.
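One possible workaround, sketched here for illustration and not the path we chose: read the whole archive member into memory with ZipFile.read and pass the bytes to numpy.frombuffer, giving an explicit byte offset to step through the data. The member name data.bin, the float32 dtype and the in-memory archive are assumptions made so the example is self-contained.

```python
import io
import zipfile

import numpy as np

# Build a small ZIP archive in memory so the example is self-contained.
# The member name "data.bin" and the float32 dtype are illustrative assumptions.
original = np.arange(10, dtype=np.float32)
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as z:
    z.writestr("data.bin", original.tobytes())

# Read the member's bytes without extracting anything to disk.
with zipfile.ZipFile(archive) as z:
    raw = z.read("data.bin")

# frombuffer does not track a read position, so the byte offset is passed explicitly.
first = np.frombuffer(raw, dtype=np.float32, count=4)
rest = np.frombuffer(raw, dtype=np.float32, offset=first.nbytes)

print(first)
print(rest)
```

This loads the entire decompressed member into memory, so it only suits files that comfortably fit in RAM.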

Use HDF5

When raw data files need to be compressed and then later analyzed, we use HDF5. Even when the original program writing the raw binary data cannot be modified, a simple post-processing Python script with h5py reads the raw data and converts it to losslessly compressed HDF5 on the sensor. Then, when the data is analyzed, out-of-core processing can be used, or at least the whole file doesn’t have to be read to retrieve data from an arbitrary location in the HDF5 file. This allows getting nearly all of the size and speed advantages of HDF5 without modifying the original program.
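A minimal sketch of such a conversion script follows. It assumes the raw file holds int16 samples. The file names, the dataset name "raw", the chunk size and the gzip filter are hypothetical choices for illustration, not the settings of any particular sensor.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

# Hypothetical file names in a temporary directory.
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "sensor.bin"
h5_file = workdir / "sensor.h5"

# Stand-in for the file written by the original, unmodifiable program.
(np.arange(100_000) % 1000).astype(np.int16).tofile(raw_file)

# Convert: read the raw binary and store it with lossless gzip compression.
data = np.fromfile(raw_file, dtype=np.int16)
with h5py.File(h5_file, "w") as f:
    f.create_dataset("raw", data=data, chunks=(10_000,),
                     compression="gzip", shuffle=True)

# Later analysis: slicing reads only the chunks needed, not the whole file.
with h5py.File(h5_file, "r") as f:
    segment = f["raw"][50_000:50_010]

print(segment)
print(raw_file.stat().st_size, "->", h5_file.stat().st_size, "bytes")
```

The chunk size controls how much data is decompressed per read, so it is worth matching it roughly to the typical slice requested during analysis.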

Delete empty zero-sized files

To delete a large number of arbitrarily named files that are empty (zero bytes), use GNU Findutils. On macOS, Homebrew findutils provides the command as “gfind” in place of “find”.

Verify the file list to be deleted:

find ~/foo -type f -empty | sort

where ~/foo is the directory in which to delete the files, and sort is used because find generally lists files in no particular order. If satisfied, actually delete the empty files with:

find ~/foo -type f -empty -delete
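Where GNU find is not available (for example on Windows), roughly the same list-then-delete procedure can be sketched with the Python standard library. The helper name empty_files and the temporary directory tree are hypothetical and serve only to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def empty_files(root: Path) -> list[Path]:
    """Return regular (non-symlink) files under root with size zero, sorted by path."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and not p.is_symlink() and p.stat().st_size == 0
    )


# Build a throwaway directory tree to demonstrate on.
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.dat").touch()
(root / "sub" / "b.dat").touch()
(root / "keep.dat").write_bytes(b"data")

# First verify the list of files that would be deleted.
found = empty_files(root)
print([p.relative_to(root).as_posix() for p in found])

# If satisfied, delete them.
for p in found:
    p.unlink()
```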

Python using NaN or None as sentinel

Comparing to None instead of NaN is:

  • 4 to 50 times faster in CPython
  • more than 1000 times faster in PyPy3 with Numpy, same speed with math

Benchmarks using Intel Coffee Lake CPU with:

ipython

Python 3.7.3 IPython 7.7.0 Numpy 1.16.4

or PyPy3 with IPython

pypy3 -m IPython

Python 3.6.1 (PyPy3 7.1.1)

Numpy is well known to be slower at scalar operations than pure Python. But many data science and STEM applications using arrays are vastly faster and more convenient with Numpy than with pure Python methods.

from numpy import isnan
%timeit isnan(0.)
  • CPython: 428 ns ± 1.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Python NaN

from math import isnan
%timeit isnan(0.)
  • CPython: 45.7 ns ± 0.209 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  • PyPy3: 0.988 ns ± 0.00506 ns per loop (mean ± std. dev. of 7 runs, 1000000000 loops each)

Python None

%timeit 0. is not None
  • CPython: 17 ns ± 0.328 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
  • PyPy3: 0.987 ns ± 0.0041 ns per loop (mean ± std. dev. of 7 runs, 1000000000 loops each)
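Outside IPython, a comparable measurement can be sketched with the standard library timeit module. The loop count N is an arbitrary assumption, and absolute numbers will differ by machine and interpreter. A variable is tested rather than a literal, which avoids the SyntaxWarning that recent CPython versions typically emit for an "is" comparison against a literal.

```python
import math
import timeit

N = 1_000_000  # number of repetitions, chosen arbitrarily

# NaN sentinel check using math.isnan on a float variable.
t_nan = timeit.timeit("isnan(x)", setup="x = 0.0",
                      globals={"isnan": math.isnan}, number=N)

# None sentinel check using an identity comparison.
t_none = timeit.timeit("x is not None", setup="x = 0.0", number=N)

print(f"math.isnan:  {t_nan / N * 1e9:.1f} ns per call")
print(f"is not None: {t_none / N * 1e9:.1f} ns per call")
```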

Numba

using python-performance

python NoneVsNan.py
--> Numba NaN sentinel: 1.00e-07
--> Numba None sentinel: 1.00e-07
--> CPython NaN sentinel: 2.00e-07
--> Numpy NaN sentinel: 6.00e-07
--> CPython None sentinel: 1.00e-07

Force upgrade Windows

Over the years and major Windows releases, we have many times had to force upgrade Windows. This is especially so on development machines that see a lot of programs installed in weird locations, external hard drives used, etc.

In general, the approach to force upgrade Windows version is:

  1. Make an external backup of files. This could be to a cloud service like Google Drive or OneDrive as well as unpluggable storage like a USB drive. We usually don’t back up the entire PC, but just manually drag over folders containing needed info, as anything left on the PC may well be lost in this procedure.
  2. Obtain a USB 3 flash drive and necessary adapters (e.g. USB-C to USB 3) for your PC. USB 2 flash drives will be painfully slow. At this time, 8 GB or larger is required.
  3. Download and run the Windows Media Creation Tool. Be sure the USB 3 drive is plugged in before running, and create a bootable flash drive using the tool.
  4. To help ensure you only have to do this once, and after ensuring you have backed up any data, consider the most thorough install option: “choose what to keep” → Nothing. That erases all files to help ensure no bad configuration is left over. You don’t want to have to keep repeating the upgrade.

Screenshots are not included because, while the particulars change over the years, the process has been much the same since the Windows 9x or even DOS days. In general, in-place OS upgrades are a gamble that doesn’t always work, while clean reinstalls virtually always work. This is also the case for Linux, including Ubuntu.

Convert animated GIF to PNG stack

Convert animated GIF to PNG stack using ImageMagick by:

magick in.gif out_%04d.png

where the width 04 is governed by the number of images in the GIF: a width of 4 accommodates up to 10000 images (0000 through 9999).
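ImageMagick output filename patterns use C printf-style integer formatting such as %04d. Python's % operator follows the same convention, so the effect of the width can be checked quickly. The frame count n_frames below is a hypothetical value.

```python
# C-style %04d pads the frame index with zeros to at least four digits.
pattern = "out_%04d.png"
names = [pattern % i for i in (0, 7, 9999)]
print(names)  # ['out_0000.png', 'out_0007.png', 'out_9999.png']

# Smallest width that fits every index from 0 to n_frames - 1.
n_frames = 250  # hypothetical number of frames in the GIF
width = len(str(n_frames - 1))
print(f"out_%0{width}d.png")  # out_%03d.png
```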

GIFs are not a great format for science image data, because the palette is compressed to 8-bit (256 colors). For plotting reduced data, GIFs can be fine.

Fix Spyder IDE not visible

Spyder IDE is a complex but usually stable Python program. A problem symptom is Spyder not getting past the splash logo or not even showing the splash logo.

To totally reset Spyder (erasing all user preferences for Spyder), type in Terminal / Command Prompt:

spyder --reset

Normally, that fixes Spyder. To diagnose further, start Spyder from Terminal instead of the OS Start menu, as the console output might give some hints.

CUDA, cuDNN and NCCL for Anaconda Python

GPU CUDA, cuDNN and NCCL functionality is accessed in a Numpy-like way from CuPy. CuPy also allows use of the GPU in a more low-level fashion.

Before starting GPU work in any programming language, be aware of these general caveats:

  • I/O heavy workloads may make realizing GPU benefits more difficult
  • Consumer GPUs (GeForce) can be > 10x slower than workstation class (Tesla, Quadro)

CUDA requires a discrete Nvidia GPU. Check for existence of an Nvidia GPU by:

  • Linux: a blank response means an Nvidia GPU is not detected.

    lspci | grep -i nvidia
  • Windows: Look under the “render” tab to see if an Nvidia GPU exists.

    dxdiag

Determine the Compute Capability of the GPU and install the correct CUDA Toolkit. CuPy is installed distinctly depending on the CUDA Toolkit version installed on your computer. Reboot.

CuPy syntax is very similar to Numpy. There is a large set of CuPy functions relevant to many engineering and scientific computing tasks.

import cupy

dev = cupy.cuda.Device()
print('Compute Capability', dev.compute_capability)
print('GPU Memory', dev.mem_info)

The first line printed should look like:

Compute Capability 75

If you get an error like:

cupy.cuda.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version

This means the installed CUDA Toolkit expects a newer Nvidia driver. The Nvidia driver can be updated via the standard Nvidia update program that was installed from the factory. “Table 1” of the CUDA Toolkit release notes lists the driver versions required by each CUDA Toolkit version.

Alternatives to CuPy include numba.cuda, which is a lower-level C-like CUDA interface from Python. CUDA for Julia is provided by JuliaGPU. Anaconda Accelerate was discontinued.

Code cells in Python IDE

A code cell in popular Python IDEs including PyCharm and Spyder is created by a line starting with # %%. This “code cell” is analogous to IPython code cells and Matlab code sections.

For example:

import math

# %% user data
x = 3
y = 4
# %% main loop
for i in range(5):
    x += y

The code cells allow running sections of code in an IDE without the need to constantly set/unset breakpoints in the IDE. They also catch the eye of developers to delineate logical blocks of code in the algorithm.

We encourage the use of code cell syntax, even if you don’t use them in the IDE directly, as the IDE will highlight sections of code to visibly delineate these separate parts of the algorithm.

Git SSH with GitLab self-managed instances

GitLab Community Edition is open source. Anyone may host their own self-managed GitLab instance instead of using gitlab.com, and Git over SSH works with such instances. For this example, we use Kitware’s CMake GitLab instance.

First, create an account on the self-managed GitLab instance and fork the desired repo. The fork can then be cloned like:

git clone https://gitlab.kitware.com/username/cmake

To git push using SSH, type:

git config --global url."ssh://gitlab.kitware.com/".pushInsteadOf https://gitlab.kitware.com/

Generate an SSH key–don’t reuse SSH keys between sites.

ssh-keygen -t ed25519 -f ~/.ssh/kitware

Go to the GitLab SSH Key page and add the contents of ~/.ssh/kitware.pub

Add to ~/.ssh/config:

Host gitlab.kitware.com
  User git
  IdentityFile ~/.ssh/kitware

Now check out a new branch, make your changes according to project guidelines and submit a merge request.