Scientific Computing

Strip Jupyter notebook outputs from Git

Jupyter notebook outputs can be large (plots, images, etc.), bloating Git repo history and slowing Git operations as the history grows. Outputs can also reveal personal information: usernames, the Python executable path, directory layout, and data values.

Strip all Jupyter outputs from Git tracking with a client-side Git pre-commit hook. A pre-commit hook is preferred over Git filters because filters can interfere with other programs such as CMake ExternalProject.

Tell Git where to find user-wide Git hooks:

git config --global core.hooksPath ~/.git/hooks

Edit the file ~/.git/hooks/pre-commit to contain:
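One possible sketch of such a hook (not necessarily the author's exact script) is below. It assumes Jupyter is installed so that "jupyter nbconvert" is on PATH; it clears outputs in place from staged notebooks, then re-stages them.

```shell
#!/bin/sh
# Clear outputs from staged Jupyter notebooks before each commit.
# Note: this simple loop mishandles filenames containing spaces.
for nb in $(git diff --cached --name-only --diff-filter=ACM -- '*.ipynb'); do
    jupyter nbconvert --clear-output --inplace "$nb"
    git add "$nb"
done
```

Make the hook executable: chmod +x ~/.git/hooks/pre-commit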

Watch shell command repeat

The procps watch command repeatedly runs a command on Unix-like systems such as Linux and macOS. Typically the watched command is quick, e.g. checking temperature or file status. An alternative is a small standalone C implementation of watch.

On macOS “watch” is available via Homebrew. Most Linux distributions have “watch” available by default.

How much time an HPC batch job took

HPC batch systems generally track the resources used by users and batch jobs to help ensure fair use of system resources, even if users aren't actually charged money for specific jobs. On Grid Engine systems, the qacct command queries the batch accounting logs by job number, username, etc.

For example:

qacct -d 7 -o $(whoami) -j

gives the last 7 days of the current user's jobs. The "ru_wallclock" field is the job's wall-clock run time in seconds.
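The wall-clock seconds can be made human-readable with the Python standard library, for example:

```python
from datetime import timedelta

# hypothetical ru_wallclock value, in seconds
print(timedelta(seconds=5025))
# → 1:23:45
```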

Related: accounting log format

Cache directory vs. temporary directory

The system temporary directory has long been used as a scratch pad in examples. Over time, security limitations (virus scanners) and performance issues (abrupt clearing of the system temporary directory) have led major programs to use user temporary or cache directories instead of the system temporary directory.

The XDG Base Directory specification is a standard for the user cache directory. When the environment variable "XDG_CACHE_HOME" is not set, typical defaults for the user cache directory are:

  • Windows %LOCALAPPDATA%
  • macOS ${HOME}/Library/Caches
  • Linux ${HOME}/.cache
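The defaults above can be resolved with only the standard library; a sketch is below (the helper name user_cache_dir is our own, and libraries like platformdirs handle more edge cases):

```python
import os
import sys
from pathlib import Path


def user_cache_dir() -> Path:
    """Per-user cache directory, per XDG and common OS defaults."""
    if xdg := os.environ.get("XDG_CACHE_HOME"):
        return Path(xdg)
    if sys.platform == "win32":
        return Path(os.environ.get("LOCALAPPDATA", Path.home() / "AppData/Local"))
    if sys.platform == "darwin":
        return Path.home() / "Library/Caches"
    return Path.home() / ".cache"


print(user_cache_dir())
```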

Matplotlib cycle plot colors endlessly

To let a for loop draw an arbitrary number of overlaid plots on a single axes, we may wish to endlessly cycle colors using itertools. Since colors repeat, this technique only stays distinguishable up to a few dozen plots depending on the Matplotlib color palette, but it can be better than the loop simply stopping once the palette is exhausted.

import itertools

import matplotlib.pyplot as plt
import matplotlib.colors as mplcolors

color_cycle = itertools.cycle(mplcolors.TABLEAU_COLORS)

xy = [(x, x**1.2) for x in range(20)]
# toy data

ax = plt.figure().gca()

for (x, y), color in zip(xy, color_cycle):
    ax.scatter(x, y, color=color)

plt.show()

Python temporary working directory copy

If an external program needs a subdirectory in which to create and load multiple files, Python tempfile.TemporaryDirectory() creates a temporary working directory that is removed automatically. If the call to the external program succeeds, shutil.copytree recursively copies the results out of the temporary directory.

from pathlib import Path
import tempfile
import subprocess
import shutil
import uuid

file = Path(f"~/{uuid.uuid4().hex}.txt").expanduser()
# toy file
file.write_text("Hello World!")

# ignore_cleanup_errors requires Python >= 3.10
with tempfile.TemporaryDirectory(ignore_cleanup_errors=True) as f:
    shutil.copy(file, f)
    # "cat" stands in for the external program; assumes a Unix-like system
    subprocess.check_call(["cat", Path(f) / file.name])

    new_dir = file.parent / file.stem
    print(f"\n{file}  solved, copying results to {new_dir}")
    shutil.copytree(f, new_dir)

Git clone private repo SSH

For public Git repos, cloning via HTTPS and pushing via SSH is fast and generally effective for security. For private Git repos, clone over SSH as well. Optionally, ssh-agent avoids repeatedly typing the SSH key passphrase.

Clone with Git over SSH by replacing "https://" in the repo URL with "ssh://git@" (most Git hosts, including GitHub, use the "git" SSH username). A trailing ".git" on the URL is optional. For example:

git clone ssh://git@github.invalid/username/private-repo-name

Related: GitHub Oauth access private repo

Dynamic libraries and CMake

On Unix-like platforms, the CMake variable CMAKE_DL_LIBS is populated for linking with target_link_libraries(), providing functions like "dlopen" and "dladdr". For some libdl functions (e.g. dladdr) it's necessary to also define "_GNU_SOURCE", like:

add_library(mylib SHARED mylib.c)
target_link_libraries(mylib PRIVATE ${CMAKE_DL_LIBS})
target_compile_definitions(mylib PRIVATE _GNU_SOURCE)

On Windows, dynamic libraries are accessed via different mechanisms (e.g. LoadLibrary / GetProcAddress). With MSYS2, libdl is available.

Run path (Rpath)

On Unix-like systems, the run path (Rpath) is the search path for libraries used by a binary at runtime. Windows has no separate Rpath; the PATH environment variable is used, so necessary .dll files must be on PATH when running a binary. On Unix-like systems life can be easier, since the Rpath is compiled into the binary. Optionally, using $ORIGIN in the Rpath allows relocating binary packages.

For CMake, set the following unconditionally; no "if()" statements are needed:

include(GNUInstallDirs)

set(CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS true)
# must be before all targets

Note that we keep the CMake defaults for the following variables (we do NOT set them), to avoid problems on HPC systems where library modules may be loaded dynamically. Instead, we allow the end user to set these variables in the top-level project:

CMAKE_INSTALL_NAME_DIR
CMAKE_INSTALL_RPATH
CMAKE_INSTALL_RPATH_USE_LINK_PATH

Exporting symbols for MSVC-based compilers is necessary to generate a “example.lib” corresponding to the “example.dll”.

To have the installed binaries work correctly, it's necessary to set the install prefix at configure time. That is, the configure-build-install sequence for a shared-library CMake project is like:

cmake -Bbuild -DBUILD_SHARED_LIBS=on --install-prefix=/opt/my_program

cmake --build build

cmake --install build

In general the rpath in a binary can be checked like:

  • Linux: readelf -d /path/to/binary | head -n 25
  • macOS: otool -l /path/to/binary | tail

ssize_t for Visual Studio

The POSIX C type ssize_t is available on Unix-like systems in <sys/types.h>. Windows Visual Studio BaseTsd.h has SSIZE_T.

However, ssize_t is POSIX, not standard C. A signed size type "ssize_t" can be defined by using "ptrdiff_t" in C and C++. Using ptrdiff_t instead of ssize_t is the practice of major projects such as Emacs.

The C and C++ standards guarantee that size_t is at least 16 bits wide (SIZE_MAX ≥ 65535).

The C and C++ standards guarantee that ptrdiff_t is at least 17 bits wide (PTRDIFF_MAX ≥ 65535, PTRDIFF_MIN ≤ -65535).

This example shows how to use ssize_t across computing platforms.


Related: C++ size_type property vs size_t

xarray to_netcdf() file compression

As with HDF5 and h5py, xarray to_netcdf() can losslessly compress Datasets and DataArrays when writing netCDF files, but compression is off by default. The compression options must be set on each data variable to take effect. We typically compress only variables of rank 2 or higher.

Notes:

  • Specify format="NETCDF4", engine="netcdf4" to allow a broader range of data types.
  • If "chunksizes" is not set, the data variable will not compress. We arbitrarily make the chunk sizes half of each dimension; this can be tuned for particular data.
  • "fletcher32" enables a checksum that can detect data corruption.
  • Attributes set via a data variable's ".attrs" are also written to the netCDF file, which is useful for noting physical units, for example.

from pathlib import Path
import xarray


def write_netcdf(ds: xarray.Dataset, out_file: Path) -> None:
    enc = {}

    for k in ds.data_vars:
        if ds[k].ndim < 2:
            continue

        enc[k] = {
            "zlib": True,
            "complevel": 3,
            "fletcher32": True,
            "chunksizes": tuple(max(1, x // 2) for x in ds[k].shape),
        }

    ds.to_netcdf(out_file, format="NETCDF4", engine="netcdf4", encoding=enc)