Scientific Computing

Hash of empty file

Checking the hash (checksum) of a downloaded file can help indicate whether the file has been corrupted or tampered with. A hash collision attack intentionally crafts a harmful file that has the same hash as the expected file. The weaker the hash function, the more feasible such collisions are. Practical hash collisions have been demonstrated for MD5 and SHA-1.

SHA-256 is a popular SHA-2 hash function for which generating collisions is far more difficult.

Example empty file hash

We use the CMake command-line tool mode (cmake -E) as a platform-independent way to compute hashes. The resulting hashes are the same regardless of the tool used.

In general, the hash length is fixed for a given hash function. The input file size does not affect the hash length.

First create an empty file:

cmake -E touch empty-file

SHA-512 hash of an empty file:

cmake -E sha512sum empty-file
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  empty-file

SHA-256 hash of an empty file:

cmake -E sha256sum empty-file
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  empty-file

SHA-1 hash of an empty file:

cmake -E sha1sum empty-file
da39a3ee5e6b4b0d3255bfef95601890afd80709  empty-file

MD5 hash of an empty file:

cmake -E md5sum empty-file
d41d8cd98f00b204e9800998ecf8427e  empty-file

Clang -Wunsafe-buffer-usage tips

The Clang C++ flag -Wunsafe-buffer-usage enables a heuristic that can catch potentially unsafe buffer access. However, this flag is known to emit warnings that are unavoidable, such as when accessing elements of argv beyond argv[0], even via encapsulation such as std::span.

One way to use this flag is to have a human (or suitably trained AI) occasionally review the warnings. For example, in CMake:

option(warn_dev "Enable warnings that may have false positives" OFF)

if(warn_dev)
  add_compile_options("$<$<COMPILE_LANG_AND_ID:CXX,AppleClang,Clang,IntelLLVM>:-Wunsafe-buffer-usage>")
endif()

argv general issues

General issues with argv are discussed in C++ proposal P3474R0 std::arguments. An LLVM issue proposed an interim solution roughly like the following, but at the time of writing, this still emits a warning with -Wunsafe-buffer-usage.

#include <cstddef>
#include <string>

#if __has_include(<span>)
#  include <span>
#endif
#if defined(__cpp_lib_span)
#  if __cpp_lib_span >= 202311L
// std::span::at() is available (C++26 library feature)
#    define HAVE_SPAN_AT
#  endif
#endif

int main(int argc, char* argv[]) {

#ifdef HAVE_SPAN_AT
  const std::span<char*> ARGS(argv, static_cast<std::size_t>(argc));
#endif

  int n = 1000;

  if(argc > 1) {
    n = std::stoi(
#ifdef HAVE_SPAN_AT
      ARGS.at(1)
#else
      argv[1]
#endif
    );
  }

  return 0;
}

MSVC __cplusplus macro

The __cplusplus macro indicates the version of the C++ standard that the compiler claims to implement given the current compiler flags. Some features from later language standards, like __has_include, are available even when an earlier standard is selected, which is a great convenience. C++ projects regularly use the __cplusplus macro to conditionally compile code based on the C++ standard version implemented by the compiler in use. This allows adding new optional features while still working with older compilers that do not support them.

Surprisingly, Visual Studio MSVC defines __cplusplus as 199711L by default, which is the C++98 standard. Visual Studio 2017 15.7 added the flag /Zc:__cplusplus to define __cplusplus as the correct value like other compilers.

The Intel oneAPI 2023.1 release adds the MSVC flag /Zc:__cplusplus by default. To see the note, scroll down to the text “oneAPI 2023.1, Compiler Release 2023.1 New in this release” and click the down caret.

Added /Zc:__cplusplus as a default option during host compilation with MSVC.

In CMake, add this flag as needed by checking the MSVC compiler version.

if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC" AND CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 19.14)
  # MSVC has __cplusplus = 199711L by default, which is C++98!
  # oneAPI since 2023.1 sets __cplusplus to the true value with MSVC by auto-setting this flag.
  add_compile_options("$<$<COMPILE_LANGUAGE:CXX>:/Zc:__cplusplus>")
endif()

Meson build system adds this flag automatically.

Matlab system stdin pipe

Matlab or GNU Octave can call external programs and handle arbitrarily large and complex inputs and outputs via the stdin, stderr, and stdout command line pipes. The matlab-stdlib function subprocess_run does this for Matlab or GNU Octave across operating systems. It uses a Java interface via the Matlab external language interface.

stdlib.subprocess_run() overcomes limitations of the built-in system() function and works like Python subprocess. “stdout” and “stderr” are returned from stdlib.subprocess_run() separately, and “stdin” can be passed as a string.

stdlib.subprocess_run() can be faster than using temporary files. stdlib.subprocess_run() helps avoid filesystem clashes when running many external processes in parallel or asynchronously.

Across programming languages, calling an external program and streaming data through a file-based or pipe-based API avoids the need to write additional code that directly interfaces memory with Fortran or C/C++.


Reference: Python or Java pass stdin from Matlab to executable.

HDF5 / NetCDF4 in GNU Octave

Open data file formats such as HDF5 and NetCDF4 are an excellent way to share and store archival data across computing platforms and software languages. Numerical software such as Matlab, GNU Octave, Python, and many more support these data file formats.

The syntax in the code examples below is exactly the same for Matlab and GNU Octave. Omit the pkg load and pkg install statements in Matlab.

HDF5

HDF5 files in GNU Octave are accessed via hdf5oct in similar fashion to Matlab.

From Octave prompt, install the package:

pkg install -forge hdf5oct

Octave program that writes an array to an HDF5 file “example.h5” dataset “/m”:

pkg load hdf5oct

fn = 'example.h5';

h5create (fn, '/m', [3 3]);
h5write (fn, '/m', magic (3));

Observe the file “example.h5” has been created. If the HDF5 command line tools are installed, the contents can be printed from system Terminal:

h5ls -v example.h5

In Octave or Matlab, the HDF5 file can be read to an array:

x = h5read (fn, '/m')
8   1   6
3   5   7
4   9   2

NetCDF4

NetCDF4 files in GNU Octave are accessed via the Octave netcdf package. Install the package from the Octave prompt:

pkg install -forge netcdf

Write an array to a NetCDF4 file “example.nc” dataset “m”:

pkg load netcdf

fn = 'example.nc';

nccreate (fn, 'm', "Dimensions", {"x", 3, "y", 3});
% must include dimensions or a scalar dataset will be created

ncwrite (fn, 'm', magic (3));

Read the NetCDF4 file “example.nc” to an array:

x = ncread (fn, 'm')

Reference:

  • oct-hdf5 package: Octave low-level access to HDF5 files.

Eliminate old C-style casts in C++

The C++ named casts such as static_cast, dynamic_cast, and reinterpret_cast are preferred in C++ Core Guideline ES.49 over ambiguous old C-style casts. C++ named casts can help provide type safety by making the intention of the cast explicit / readable. C++ compilers can detect and warn about improper or unsafe casts when using named casts. C-style cast mistakes are more difficult to detect by humans or automated tools.

static_cast is used for conversions between compatible types, such as converting an int to a float or a pointer to a base class to a pointer to a derived class. Another common static_cast use case is interfacing with C functions such as Windows API functions that require specific types less common in pure C++ code.

int a = 10;
float b = static_cast<float>(a);

reinterpret_cast is used for low-level reinterpreting of bit patterns. It casts a type to a completely different type. This cast is not type safe and should be used with caution to avoid undefined behavior. reinterpret_cast is commonly used in low-level programming, such as interfacing with hardware or converting between pointers and integers.

int a = 10;
char* b = reinterpret_cast<char*>(&a);

dynamic_cast is used for safe downcasting of pointers or references to classes in a class hierarchy. It performs a runtime check using RTTI to help ensure that the cast is valid. dynamic_cast is used when you need to convert a pointer or reference to a base class to a pointer or reference to a derived class. static_cast is more common and faster than dynamic_cast, but dynamic_cast is safer when downcasting in a class hierarchy.

Detecting Old-Style Casts with GCC or Clang

To ensure that old C-style casts are not used in a codebase, consider the -Wold-style-cast flag with GCC or Clang. This flag generates warnings for any old-style casts found in the code.

In CMake, this flag is applied like:

add_compile_options("$<$<COMPILE_LANG_AND_ID:CXX,AppleClang,Clang,GNU>:-Wold-style-cast>")

If the CMake variable CMAKE_COMPILE_WARNING_AS_ERROR is set to true, the old-style cast warnings (and other compile warnings) are treated as errors.

Matlab / Octave integer representation

For proper integer representation in Matlab / Octave, use an explicit integer type to avoid the default of storing numbers as “double”.

x = int64(2^63);

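Note that 2^63 is evaluated in double precision before the conversion, so int64(2^63) saturates to the largest int64 value, 9223372036854775807. Operations involving an explicitly-typed variable will retain that type, assuming implicit casting due to other variables or operations doesn’t occur. Precise string representation of “x” can be done using int2str(), sprintf(), or string():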

xc = int2str(x);

xf = sprintf('%d', x);

xs = string(x);

sprintf() gives more control over the string output format, while string() or int2str() are more concise.

ATSC 1.0 MPEG-4 older TVs no video

In 2008 the ATSC ratified the MPEG-4 TV broadcast standard. Numerous ATSC 1.0 TVs were sold before this standard was ratified, and still operate today. TV manufacturers continued to make some non-MPEG-4 TVs for a decade after the standard was ratified. As a practical matter to avoid abandoning viewers with older receivers, ATSC 1.0 broadcasts remain on while implementing ATSC 3.0 broadcasts. This lighthousing of ATSC 1.0 broadcasts leads broadcasters to use MPEG-4 encoding for ATSC 1.0 broadcasts.

MPEG-2 is the legacy encoding standard for ATSC 1.0 broadcasts, which any old DTV can receive. A typical ATSC 1.0 MPEG-2 broadcast channel layout was one 1080i channel and several 480i channels, or one or two 720p channels with even more 480i channels. ATSC broadcast channel layout is a tradeoff between the number of subchannels and the bandwidth per subchannel. This database lists the channels available in a given area. Click “Technical Data” to see the resolution and encoding of each channel.

As ATSC 3.0 broadcasts roll out, the number of ATSC 1.0 channels will decrease. A mitigation for broadcasters is to switch to MPEG-4 encoding for the ATSC 1.0 broadcasts, which is more efficient than MPEG-2 and allows packing more channels into the same transmitter bandwidth. This leaves older TVs and receivers with audio-only on MPEG-4 channels. This MPEG-4 list is missing some broadcasters. Note that some ATSC broadcasts have audio-only subchannels.

A solution for the end user lacking an MPEG-4-capable TV is to buy an ATSC receiver box that supports MPEG-4. These can be obtained for less than $50. ATSC 3.0 receivers are available for less than $100 for those who want access to ATSC 3.0 broadcasts, which even some new TVs cannot receive.

Enthusiasts make their “band scan” data available for TV and FM radio typically using a Raspberry Pi to enjoy and share the hobby of broadcast DXing.

Satellite radio outside North America

SDARS satellite radio broadcasting of music or video to mobile receivers has largely been a North American phenomenon. While there are North American specific satellite TV networks like DirecTV and Dish Network, satellite TV has long been a global phenomenon in certain markets.

Automobiles typically have Bluetooth audio, which may one day be used with 5G broadcast instead of individual mobile data streams. This may stymie the growth of SDARS on other continents. Despite a global need for wide-area and rural broadcast radio coverage, SDARS is only widespread in North America. SiriusXM made a massive investment in a “long game” of receiver availability just as mobile internet streaming became widely feasible. Markets on other continents may be too fragmented despite their large populations: there may not be enough intercity user mobility to make the subscriber base big enough, and the cities may be dense enough to also need hundreds of terrestrial repeaters.

Notes: