Scientific Computing

Xvfb makes fake X11 for CI

Continuous integration for programs that plot or otherwise need a display can be tricky, since in many cases the CI runner doesn’t have an X11 display server. A workaround is to generate plots under the X virtual framebuffer (Xvfb), a dummy X11 display server. This maintains code coverage and allows dumping plots to disk for further checks.

GitHub Actions: in “.github/workflows/ci.yml”, assuming the project uses pytest, the xvfb-action runs the test command under Xvfb:

- name: Run headless test
  uses: GabrielBB/xvfb-action@v1
  with:
    run: pytest
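Inside the test suite, plotting code may still need to know whether any display (real or Xvfb-provided) exists. A minimal sketch, assuming a POSIX environment where Xvfb sets the DISPLAY variable (the function name is hypothetical):

```python
import os

def use_headless_backend() -> bool:
    """Return True when no X11 display (real or Xvfb) is available.

    Plotting code can then select a non-interactive backend, e.g.
    Matplotlib's "Agg", to render plots to disk instead of to a window.
    """
    return os.environ.get("DISPLAY", "") == ""
```

Under the xvfb-action above, DISPLAY is set, so tests take the normal GUI code path.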

Related: Detect CI via environment variable

Save WSJT-X raw audio for data analysis

Upload raw .wav WSJT-X data to the HamSci Zenodo data archive to help future data analysis. The location of the WSJT-X raw data is found via the WSJT-X menu: File → Open Log Directory. The raw data save location is typically:

  • Windows: $Env:LocalAppData/WSJT-X/save
  • Linux: ~/.local/share/WSJT-X/save
  • macOS: ~/Library/Application Support/WSJT-X/save

To save the raw data, from the WSJT-X menu: Save → Save All. One .wav file is saved per two-minute cycle. This setting is persistent.

Archive raw WSPR data for easier upload to the HamSci Zenodo archive:

Create a Zenodo account, then upload the WSPR data. Upon clicking “Publish”, the data is assigned a DOI and becomes citable.

Tips:

  • Avoid using a virtual machine for WSJT-X due to issues with broken/choppy audio.
  • WSJT-X collects about 1.7 GByte/day, depending on how often you transmit (no recording occurs while you transmit).
  • The raw audio data rate is: 12000 samples/sec × 16 bits/sample ÷ 8 bits/byte × 86400 sec/day × 0.8 RX duty cycle ≈ 1.7 GByte/day. That is 2.88 MByte per two-minute WSPR RX cycle.
  • Since this is 6 kHz of spectrum, you can widen your receiver filters (particularly with an SDR or other advanced receiver) to also pass JT65, FT8, or other useful transmissions that fall within the 12 kS/s sampling bandwidth, making the data even more valuable.
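The data-rate arithmetic in the bullets above can be checked with a few lines of Python:

```python
# WSJT-X raw audio: 16-bit PCM at 12000 samples/sec
bytes_per_sec = 12000 * 16 // 8            # 24000 bytes/sec
cycle_bytes = bytes_per_sec * 120          # one 2-minute WSPR RX cycle
daily_bytes = bytes_per_sec * 86400 * 0.8  # 80% receive duty cycle

print(cycle_bytes / 1e6)   # 2.88 MByte per cycle
print(daily_bytes / 1e9)   # about 1.66 GByte, i.e. roughly 1.7 GByte/day
```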

The raw data .wav files are uncompressed PCM audio. “tar” is used to make one archive file instead of thousands of sound files per day. The files are full of noise, which by definition compresses poorly, so an uncompressed archive is appropriate.
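A sketch of this archiving step using Python’s standard tarfile module (the function and paths are hypothetical, not part of WSJT-X):

```python
import tarfile
from pathlib import Path

def archive_wav(save_dir: Path, out_tar: Path) -> int:
    """Bundle a directory of .wav files into one uncompressed tar archive."""
    count = 0
    # mode "w" writes an uncompressed tar: noise-filled audio gains
    # little from compression, so skip it
    with tarfile.open(out_tar, "w") as tar:
        for wav in sorted(save_dir.glob("*.wav")):
            tar.add(wav, arcname=wav.name)
            count += 1
    return count
```

For example, on Linux one might call `archive_wav(Path.home() / ".local/share/WSJT-X/save", Path("wspr.tar"))`.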


Related: Load raw WSPR data for analysis

Git pull: don't merge or rebase by default

Git 2.27 has git pull behavior that we feel is beneficial: it warns on pull unless the user specifies a default behavior. Specify a “safe” default behavior for git pull so that linear Git history is maintained unless git pull options are given manually. Git services such as GitHub allow enforcing linear history.

git config --global pull.ff only

If a Git remote cannot be fast-forwarded, the user can then explicitly git rebase or git merge.

Reference: Git: rebase vs. merge

CMake CTest cost data

CMake’s CTest assigns a dynamic COST to each test, updated each time the test is run. Kitware considers the test cost data to be undocumented behavior, so it’s not part of the CMake COST docs.

The computed test cost data is stored under ${CMAKE_BINARY_DIR}/Testing/Temporary/CTestCostData.txt. This file stores one row per test:

  TestName NumberOfTestRuns Cost
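A sketch of reading this file from Python, assuming the row layout above (current CMake versions also append a “---” separator followed by failed-test names, which is skipped here):

```python
def read_ctest_cost(text: str) -> dict:
    """Parse CTestCostData.txt content: "TestName NumberOfTestRuns Cost" rows."""
    costs = {}
    for line in text.splitlines():
        if line.startswith("---"):
            break  # remainder of the file lists failed test names
        parts = line.split()
        if len(parts) == 3:
            name, runs, cost = parts
            costs[name] = (int(runs), float(cost))
    return costs
```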

Compare HDF5 data values

The h5diff tool has limitations for comparing HDF5 data files: it can currently compare with either an absolute tolerance or a relative tolerance, but not both. This mutually exclusive comparison fails for much floating point data. A more suitable comparison for floating point data, similar to NumPy’s, is:

is_close = abs(actual - desired) <= max(rtol * max(abs(actual), abs(desired)), atol)

where:

  • rtol: relative tolerance, perhaps 1e-5
  • atol: absolute tolerance, perhaps 1e-8 but not zero
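This combined comparison is a short function in Python, with the default tolerances suggested above:

```python
def is_close(actual: float, desired: float,
             rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Combined relative/absolute floating point comparison."""
    return abs(actual - desired) <= max(
        rtol * max(abs(actual), abs(desired)), atol)
```

The nonzero atol is what lets values near zero compare as close, where a pure relative tolerance would fail.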

We use h5py to read / write HDF5 files from Python.

Floating point comparisons for Python in CI can be done with pytest.approx.

Detect project primary code languages

GitHub, GitLab and similar repository services deal with hundreds of coding languages. Accurate detection of the coding languages in a project is useful for discovering repositories of interest to users and for security scanning, among other purposes. Scientific computing developers are generally interested in a narrow subset of programming languages, and HPC developers in an even narrower subset. We recognize the “long tail” of advanced research using specialized languages or even languages of their own. However, most contemporary HPC and scientific computing work revolves around a handful of programming languages.

To rapidly detect coding languages at each “git push”, GitHub developed the open-source Ruby-based Linguist, which GitLab also uses. We developed a Python interface to Linguist that requires the end user to install Ruby and Linguist. However, Linguist is not readily usable from native Windows (including MSYS2), because some of Linguist’s dependencies have Unix-specific code despite being written in Ruby. The same issue can arise in general in Python if the developers aren’t using multi-OS CI. GitHub recognized the accuracy shortcomings of Linguist (cited as 84% on average) and developed the 99%-accurate closed-source OctoLingua, which handles the 50 most popular code languages on GitHub. Little has been heard about OctoLingua since July 2019.

We provide an initial implementation of a tool, code-sleuth, that actively introspects projects using a variety of heuristics and direct actions. A key design factor of code-sleuth is introspecting languages with specific techniques, such as invoking CMake or Meson to determine the languages the project developers intended. The goal is not to detect every language in a project, but to detect a project’s primary languages. We also wish to resolve the required language standards, including for Python, C++, C, and Fortran. This detection lets a user know, in automated fashion, which compiler or environment is needed.

Boost install on Windows

The Boost library brings useful features to C++ that are not yet in STL. For example, the C++17 filesystem library was in Boost for several years before becoming part of the C++ standard.

The Boost install generally requires several hundred megabytes. macOS Homebrew and Linux users can install Boost from a package manager.

On Windows, installing Boost by building from source is a lengthy procedure. Most developers using GCC or Clang on Windows can instead simply install Boost from MSYS2.

Software executable dry run

Developers covering multiple platforms and architectures can benefit from including a self-contained dry run. We define a software dry run as a fast, self-contained run of the executable, exercising most or all of the program using actual input files. The dry run concept is used by popular programs that rely on several components and connections, including rsync.

A dry run self-check can be used from Python or any other script calling the executable, to ensure the binary is compatible with the current platform environment. The dry run helps mitigate confusing error messages by checking that the executable runs on the platform before starting a large program run.

The dry run can catch platform-specific issues like:

  • incompatible executable format (running an executable built for another platform)
  • executable built for an incompatible architecture (using a CPU feature not available on this CPU)
  • shared / dynamic library run-time search path issues

The dry run does not output any files besides temporary files. For example, in a simulation, the dry run might run one complete time step. To test file I/O, optionally write temporary file(s) using the same file format. An advanced dry run might read in those temporary files and do a basic sanity check.

A dry run is distinct from an integration test. A dry run of the program just checks that the platform environment is OK to run with this binary: it simply checks that the code executes without crashing, without the deep checks of program output that an integration test would make. Consider making the dry run return code 0 for compatibility with CMake and other high-level build systems.

Implement a check of the dry run with CMake:

cmake_minimum_required(VERSION 3.14)
project(Demo LANGUAGES C)
enable_testing()

add_executable(main main.c)

add_test(NAME check COMMAND main -dryrun <other command line flags>)
set_property(TEST check PROPERTY PASS_REGULAR_EXPRESSION "OK: myprogram")

PASS_REGULAR_EXPRESSION verifies the special dry run text you put in the executable code. The dry run test normally returns code zero, but PASS_REGULAR_EXPRESSION ignores the executable return code.
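The executable side of this contract might look like the following Python sketch: the "-dryrun" flag and "OK: myprogram" sentinel match the CMake test above, while the simulation step is hypothetical.

```python
import sys

def simulate(steps: int) -> float:
    """Hypothetical simulation: run the given number of time steps."""
    x = 0.0
    for _ in range(steps):
        x += 1.0  # stand-in for one complete time step
    return x

def main(argv) -> int:
    if "-dryrun" in argv:
        simulate(1)             # exercise one complete time step
        print("OK: myprogram")  # sentinel for PASS_REGULAR_EXPRESSION
        return 0
    simulate(1000)              # full-size run
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```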

Python f2py install problem workaround

f2py works with legacy Fortran 77 code, but generally does not work with modern Fortran code. Projects should carefully consider alternatives to f2py, such as a command-line + file interface with Python.

If experiencing compiler errors when using f2py, a last-resort workaround is building on another computer, with the same operating system and compiler ABI, on which the install works, then copying the build products over.

On the “donor” working computer:

python -m build

This creates mypkg/dist/mypkg-x.y.z-cp3x-cp3xm-win_amd64.whl (similar on other operating systems). The wheel can only be used on the matching Python 3.x (per the filename) and the same CPU architecture.

python -m pip install -e .

This creates mypkg/src/mypkg/fortranmodule.cp3x-win_amd64.pyd

On the “recipient” computer, both of those files are copied from the “donor” computer. The *.pyd file is placed or soft-linked into the Python current working directory. The *.whl file is installed once by:

python -m pip install mypkg-x.y.z-cp3x-cp3xm-win_amd64.whl