Scientific Computing

Pros/cons of LogMeIn, TeamViewer, GoToMyPC

We develop and deploy data collection systems at remote, inaccessible sites located around the world, so we need highly reliable methods of remote control. This is accomplished in part by Intel vPro enabled computers, which allow remote power down, reboot, and even reinstallation of the operating system from the vPro internal HTTP web server.

Remote PC control checklist:

  1. Intel vPro motherboard
  2. Certificates to control vPro (don’t rely on passwords for full PC control!)
  3. Clonezilla DVD in DVD drive
  4. Clonezilla HDD image on Blu-ray in drive or USB HDD / flash drive
  5. Hardware firewall, so that vPro ports are not exposed to the outside world.

Commercial remote desktop: SSH port forwarding with RDP is one approach, but what about those who want to use LogMeIn or the like?

Pros:

  • Commercial remote desktop services such as LogMeIn are typically more secure on a Windows PC than just leaving port 3389 open to the internet.
  • LogMeIn has convenient apps for smartphones and from a web browser

Cons:

The downsides of LogMeIn-type commercial services have philosophical and practical aspects.

  • Commercial services typically use proprietary (non-open-source) technologies for the central server and/or for securing the connection. Open source choices may use much the same technology, but their code is open to security reviewers worldwide.
  • The convenience of commercial services (a centralized server making the connections) is seen by some as a weakness: the provider could have unknown hackers as employees, could shut down its server, could raise prices, etc.

Free alternatives:

  • SSH → RDP: run an OpenSSH server (for example Cygwin OpenSSH) on the PC and forward RDP port 3389 over SSH
  • phone remote desktop app or HTML5-based Guacamole
  • access PCs with a “single click” from a phone or laptop, without having a 3rd party server involved, without plugins (see Guacamole).
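
The SSH → RDP route above comes down to a single ssh -L invocation. Here is a minimal sketch that only builds that command. The host name, user name and local port 3390 are placeholders chosen for illustration, not values from our setup.

```python
import shlex


def rdp_tunnel_command(host, user, local_port=3390):
    """Build an OpenSSH command that forwards local_port to RDP (3389) on the remote PC."""
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{local_port}:localhost:3389",  # local port -> port 3389 on the remote PC
        f"{user}@{host}",
    ]


# hypothetical host and user, for illustration only
cmd = rdp_tunnel_command("remote-pc.example", "me")
print(shlex.join(cmd))
```

With the tunnel running, the RDP client connects to localhost:3390 instead of to the remote PC directly, so port 3389 does not need to be open to the internet.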

Make gem install not install docs by default

The most time-consuming part of installing some gems is the documentation, which most users never read since they search the Web instead.

Make gem install not install docs by default by adding this line to ~/.gemrc:

gem: --no-document

Find text string in file

Python script findtext.py looks for specific text inside any file smaller than a maximum size (to avoid searching large binary files).

Recursive find and edit: to quickly edit every file found by a recursive search for mytext, consider a command like:

  • Linux: gedit $(findtext mytext) or nano $(findtext mytext)
  • macOS: nano $(findtext mytext)
  • Windows Subsystem for Linux: nano $(findtext mytext)
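
For readers without findtext.py at hand, the idea can be sketched in a few lines of Python. This is not the actual findtext.py. The function name find_text and the 1 MB default limit are assumptions made for illustration.

```python
from pathlib import Path


def find_text(root, text, max_bytes=1_000_000):
    """Yield files under root, no larger than max_bytes, whose contents include text."""
    for path in Path(root).rglob("*"):
        try:
            if not path.is_file() or path.stat().st_size > max_bytes:
                continue  # skip directories and files over the size limit
            # errors="ignore" drops undecodable bytes instead of raising
            if text in path.read_text(encoding="utf-8", errors="ignore"):
                yield path
        except OSError:
            continue  # unreadable file or broken symlink
```

Calling list(find_text(".", "mytext")) returns the matching paths under the current directory, which can then be passed to an editor.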

Benefits of conda vs. pip

conda and pip are not merely two different ways to install Python packages. Conda can also install compilers such as gfortran. Here are a few areas where conda or pip each have the advantage.

This article defines “cross-platform” as working on Linux, macOS and Windows.

Ease of install: Python wheels greatly ease end-user install of libraries requiring compilation without the end-user needing a compiler. For example, high-performance Fortran, C and/or C++ code can be imported as a Python module, compiled beforehand and downloaded automatically. However, major packages like SciPy released cross-platform wheels only in late 2017 (SciPy 1.0.0). This means until 2017, easily installable, pre-compiled SciPy was not universal–some users would have to have Fortran, C and C++ compilers installed. For a large subset of Python users, compiling software libraries is not intuitive and end users disliked waiting 10 minutes for SciPy to compile itself.

A core design reason behind conda is excellent conflict resolution, so I often type conda install when I want to install something complicated like Spyder.

Easy virtual environments

The first-class conflict resolution of conda is matched by excellent virtual environment management.

conda env list

lists all the environments installed. This allows you to safely try out complicated programs like Mayavi with lots of prerequisite packages. Instead of ripping out the latest libraries you have, create Python environments with

conda create

High performance MKL Python libraries:

Figure: Python Intel MKL FFT benchmark.

pip install scipy

downloads and immediately makes available precompiled Fortran, C, C++ libraries within SciPy. Python wheels do not obviate Conda’s usefulness! One of the key advantages of using conda-installed packages is the free high-performance Anaconda MKL libraries, freely available since February 2016 for:

  • NumPy
  • SciPy
  • Scikit-learn

Although some specialized users may still want to compile Python libraries with Intel MKL, most will simply do as we recommend:

conda install numpy scipy scikit-learn

Python dynamic update in-place Terminal text

Cross-platform, dynamically updating text is enabled in Python print() with end='\r', like:

print('dynamic text', end='\r')

The dynamically updating text will immediately display.

Don’t allow the printed line length to exceed the Terminal/Command Prompt width. This method breaks if the line wraps.

Retrieve the maximum line width with Python os.get_terminal_size:

import os

width, height = os.get_terminal_size()

or get terminal width from the command line:

python -c "import os; print(os.get_terminal_size()[0])"

Typically the terminal/command prompt is 80 or 100 characters wide.

The advantage of this method is that previous information is not scrolled off the screen. One common use for this method is terminal text progress indicators. Using special characters, pseudo-graphical dynamic terminal displays are also possible, or use Python curses.

Here’s an example:

from time import sleep

N=12

for i in range(N):
   sleep(0.5)
   print(f"{i/N*100:.1f} %", end="\r")
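
To keep a status line from wrapping, the text can be cut to the terminal width before printing. The sketch below is one way to do that. The helper names fit_line and show are ours, and leaving the last column free is a cautious assumption, since some terminals wrap when the final column is written.

```python
import shutil


def fit_line(text, width):
    """Truncate or pad text to exactly width - 1 columns so it never wraps."""
    usable = max(width - 1, 1)  # leave the last column free
    # padding with spaces also erases leftovers from a longer previous line
    return text[:usable].ljust(usable)


def show(text):
    # shutil.get_terminal_size falls back to 80x24 when no terminal is attached
    width = shutil.get_terminal_size(fallback=(80, 24)).columns
    print(fit_line(text, width), end="\r", flush=True)


for i in range(5):
    show(f"step {i + 1} of 5 " + "." * 200)  # deliberately too long
print()  # move to the next line when done
```

Unlike os.get_terminal_size, shutil.get_terminal_size does not raise an error when output is redirected to a file or pipe, which makes it the safer choice in scripts.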

One cannot create an actual new terminal session window from curses. The “new window” made by curses.newwin() lives inside the “screen”, which is the existing Terminal.

Understanding Pandas read_csv read_excel errors

In data science we often deal with messy, heterogeneous data and file types too. Python Pandas is a very powerful data science tool. A simple but not infrequent mistake is using the wrong Pandas function to read data, that is, using read_excel to read CSV data or read_csv to read Excel spreadsheet data.

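Note: older Pandas versions cannot read ODS OpenDocument files, so those using LibreOffice/OpenOffice with such versions should convert ODS data to XLSX first. Recent Pandas versions can read ODS via read_excel with engine="odf" when the odfpy package is installed.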

Pandas wrong function format errors

The intended Pandas reader usage is:

  • read_csv() for .csv and .tsv files
  • read_excel() for .xls and .xlsx files

read_excel → .csv

This leads to errors including:

xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b''

read_csv → .xlsx

This leads to errors including:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 1: invalid start byte
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
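
When a file extension cannot be trusted, the first bytes of the file give a hint about which reader to use. The sketch below is our own illustration rather than a Pandas feature, and the function name guess_reader is hypothetical. It relies on .xlsx files being ZIP archives and legacy .xls files being OLE2 compound files.

```python
XLSX_MAGIC = b"PK\x03\x04"  # .xlsx is a ZIP archive
XLS_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls is an OLE2 compound file


def guess_reader(path):
    """Return the name of the Pandas reader that likely suits the file contents."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(XLSX_MAGIC) or head.startswith(XLS_MAGIC):
        return "read_excel"
    return "read_csv"  # anything else is treated as delimited text
```

With Pandas imported, getattr(pandas, guess_reader(fn))(fn) then calls the matching reader. Other ZIP-based formats also start with the same bytes as .xlsx, so this is a hint rather than a guarantee.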

Pandas prereqs

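To help avoid excessive prerequisites, Pandas makes the Excel reader packages optional; they are only needed once read_excel is used. xlrd reads legacy .xls files (xlrd ≥ 2.0 no longer reads .xlsx) and openpyxl reads .xlsx files, so simply do:

pip install pandas xlrd openpyxl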

SCP/SSHFS recursive copy with exclusions

SCP does not have an option to exclude files while copying remote files over SSH. This is a problem when you have Git-managed code you’ve modified on say:

  • offline computer
  • HPC, and don’t want to put your Git host credentials on the HPC

If you just use scp -r, you’ll also overwrite the .git directory, which can destroy work done on this or other branches. We want to copy just the code files, NOT the .git tree.

Solution

Install SSHFS on the laptop. SSHFS uses SSH and FUSE to make a virtual directory while the true files reside on the remote computer.

  • Linux / Windows Subsystem for Linux: apt install sshfs
  • macOS: brew install sshfs

Mount the remote home directory at a local location, say ~/X:

mkdir ~/X    # one-time

sshfs -o follow_symlinks myserver: ~/X

Now the home directory on myserver is connected to ~/X locally.

Use cp as usual to copy a directory while excluding .git

shopt -s extglob   # enables fancy globbing

cp -rv ~/X/myprog/!(.git|bin) .

Note that the bin/ directory of the repository is excluded as well. The HPC probably runs a different Linux distro and the compilation is optimized for a different CPU, so the HPC binaries wouldn’t generally be useful elsewhere.
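
The same copy can be done from Python once the SSHFS mount is in place. This is a minimal sketch using the standard library. The function name copy_code is ours, and the pattern matching applies at every directory level, so nested directories named .git or bin are skipped too.

```python
import shutil


def copy_code(src, dst):
    """Recursively copy src to dst, skipping anything named .git or bin."""
    # dst must not already exist; shutil.copytree creates it
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(".git", "bin"))
```

For the example above, copy_code(os.path.expanduser("~/X/myprog"), "myprog") would copy the project into a new local directory myprog, assuming os is imported.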

print vs write Fortran statements

The Fortran 2003 standard constitutes a strong foundation of “modern Fortran”. Modern Fortran (Fortran ≥ 2003) is so different in capabilities and coding style from Fortran 77 as to be a distinct, highly backward compatible language. Almost all of Fortran 95 was incorporated into Fortran 2003, except for a few obscure, little-used and confusing features that were deprecated and already unsupported by some popular compilers.

Writing to console effectively: write(*,*) grew out of non-standard use of Fortran 66’s write statement that was introduced for device-independent sequential I/O. Although write(*,*) became part of Fortran 77 for printing to console standard output, the Fortran 77 print command is more concise and, more importantly, visually distinct. That is, where the full versatility of the write command is not needed, print should be used, so that the cases where write is needed stand out.

Assembly language comparison: with modern compilers, print *,'hi' and write(*,*) 'hi' produce IDENTICAL assembly, as they should. In general, disassemble Fortran executables with:

gfortran myprog.f90

objdump --disassemble a.out > myprog.s

Fortran 2003 finally settled the five-decade-old ambiguity over console I/O with the intrinsic iso_fortran_env module, which is often invoked at the top of a Fortran module like:

module mymod

use, intrinsic:: iso_fortran_env, only: stdout=>output_unit, stdin=>input_unit, stderr=>error_unit

The => operators are here for renaming (they have other meanings for other Fortran statements). It’s not necessary to rename, but it’s convenient for the popularly used names for these console facilities.

Recommendation: routine console printing:

print *, 'Hello text'

For advanced console printing, whether to output errors, use non-advancing text, or toggle between log files and printing to console, use write(stdout,*) or the like.


Example: print to stdout console if output filename not specified

use, intrinsic:: iso_fortran_env, only: stdout=>output_unit

implicit none (type, external)

character(:), allocatable :: fn
integer :: i, u, L

! status is nonzero (and L is 0) when no first argument was given
call get_command_argument(1, length=L, status=i)
if (i == 0 .and. L > 0) then
  allocate(character(L) :: fn)
  call get_command_argument(1, fn)
  print '(a)', 'writing to ' // fn
  open(newunit=u, file=fn, form='formatted')
else
  u = stdout   ! no filename specified: print to console
endif

i = 3 ! test data

write(u,*) i, i**2, i**3

if (u /= stdout) close(u)   ! closing stdout can disable text console output, and writes to file `fort.6` in gfortran

print *,'goodbye'

! end program implies closing all file units, but here we close in case you'd use in subprogram (procedure), where the file reference would persist.
end program


Fortran 2003+ module procedure polymorphism

Polymorphism is a part of generic programming enabled by Fortran 2003. Typically one should encapsulate procedures in modules, even when the whole program is contained in a single file.

Example: addtwo() automatically selects the correct type thanks to the interface block.

module funcs
use, intrinsic:: iso_fortran_env, only: sp=>real32, dp=>real64
implicit none (type, external)
!! takes effect for all procedures within module

interface addtwo
  procedure addtwo_s, addtwo_d, addtwo_i
end interface addtwo

contains

elemental real(sp) function addtwo_s(x) result(y)
real(sp), intent(in) :: x

y = x + 2
end function addtwo_s


elemental real(dp) function addtwo_d(x) result(y)
real(dp), intent(in) :: x

y = x + 2
end function addtwo_d


elemental integer function addtwo_i(x) result(y)
integer, intent(in) :: x

y = x + 2
end function addtwo_i

end module funcs


program test2

use funcs
implicit none (type, external)

real(sp) :: twos = 2._sp
real(dp) :: twod = 2._dp
integer :: twoi = 2

print *, addtwo(twos), addtwo(twod), addtwo(twoi)

end program