Scientific Computing

Understanding Pandas read_csv read_excel errors

In data science we often deal with messy, heterogeneous data and file types. Pandas is a powerful Python data science tool. A simple but not infrequent mistake is using the wrong Pandas function to read data: using read_excel to read CSV data, or read_csv to read Excel spreadsheet data.

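Note: older Pandas releases cannot read ODS OpenDocument files, so LibreOffice/OpenOffice users should convert ODS data to XLSX first. Recent Pandas releases can read ODS via read_excel(..., engine="odf") when the optional odfpy package is installed.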

Pandas wrong function format errors

The intended Pandas reader usage is:

  • read_csv() for .csv and .tsv files
  • read_excel() for .xls and .xlsx files

read_excel → .csv

This leads to errors including:

xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b''

read_csv → .xlsx

This leads to errors including:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 1: invalid start byte
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
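
One way to guard against both mix-ups is to look at the first bytes of a file before choosing a reader. The sketch below is an illustration only and is not part of Pandas: the helper names sniff_format and pick_reader are hypothetical. It assumes that .xlsx files are ZIP archives and that legacy .xls files are OLE2 compound documents, so a ZIP signature is only a hint (other ZIP-based formats share it).

```python
# File signatures ("magic bytes") found at the start of each container format
ZIP_MAGIC = b"PK\x03\x04"                         # .xlsx is a ZIP archive
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls is an OLE2 file


def sniff_format(path) -> str:
    """Return 'xlsx', 'xls', or 'text' based on the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    if head.startswith(OLE2_MAGIC):
        return "xls"
    return "text"  # assume CSV/TSV or another delimited text file


def pick_reader(path) -> str:
    """Name the Pandas function matching the detected format, ignoring the extension."""
    return "read_csv" if sniff_format(path) == "text" else "read_excel"
```

With the result in hand, call pandas.read_excel for 'xlsx' or 'xls' and pandas.read_csv otherwise, regardless of what the file extension claims.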

Pandas prereqs

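To avoid excessive prerequisites, Pandas keeps the Excel reader packages optional until read_excel is used. Current Pandas uses openpyxl for .xlsx and xlrd for legacy .xls (xlrd ≥ 2.0 reads only .xls), so simply do:

pip install pandas openpyxl xlrd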

SCP/SSHFS recursive copy with exclusions

SCP does not have an option to exclude files while copying remote files over SSH. This is a problem when you have Git-managed code that you’ve modified on, say:

  • an offline computer
  • an HPC, where you don’t want to put your Git host credentials

If you just use scp -r, you’ll also overwrite the .git directory, which can destroy work done on this or other branches. We want to copy just the code files, NOT the .git tree.

Solution

Install SSHFS on the laptop. SSHFS uses SSH and FUSE to make a virtual directory while the true files reside on the remote computer.

  • Linux / Windows Subsystem for Linux: apt install sshfs
  • macOS: brew install sshfs

Mount a local location, say ~/X

mkdir ~/X    # one-time

sshfs -o follow_symlinks myserver: ~/X

Now the home directory on myserver is connected to ~/X locally.

Use cp as usual to copy a directory while excluding .git

shopt -s extglob   # enables fancy globbing

cp -rv ~/X/myprog/!(.git|bin) .

Note that bin/ in the repository is excluded as well. The HPC probably runs a different Linux distro and the compilation is optimized for a different CPU, so the HPC binaries wouldn’t generally be useful elsewhere.
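
The same selective copy can be sketched in Python once the SSHFS mount is in place. This is an illustrative alternative, not the method used above: the function name copy_code is hypothetical, and it uses shutil.copytree with shutil.ignore_patterns. Unlike the shell glob, the ignore patterns apply at every directory level, so nested bin/ directories are skipped too.

```python
import shutil
from pathlib import Path


def copy_code(src, dst, excluded=(".git", "bin")):
    """Recursively copy the contents of src into dst, skipping names in `excluded`."""
    ignore = shutil.ignore_patterns(*excluded)
    # dirs_exist_ok requires Python >= 3.8 and allows dst to exist already
    return Path(shutil.copytree(src, dst, ignore=ignore, dirs_exist_ok=True))


# Example call, assuming the mount point ~/X and repository name myprog from the text:
# copy_code(Path.home() / "X" / "myprog", ".")
```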

print vs write Fortran statements

The Fortran 2003 standard constitutes a strong foundation of “modern Fortran”. Modern Fortran (Fortran ≥ 2003) is so different in capabilities and coding style from Fortran 77 as to be a distinct, highly backward compatible language. Almost all of Fortran 95 was incorporated into Fortran 2003, except for a few obscure, little-used and confusing features that were deprecated and already unsupported by some popular compilers.

Writing to console effectively: write(*,*) grew out of non-standard use of Fortran 66’s write statement, which was introduced for device-independent sequential I/O. Although write(*,*) became part of Fortran 77 for printing to console standard output, the Fortran 77 print statement is more concise and, more importantly, visually distinct. That is, where the full versatility of write is not needed, print should be used so that the cases where write is needed stand out.

Assembly language comparison: print *,'hi' and write(*,*) 'hi' compile to IDENTICAL assembly in modern compilers, as they should. In general, disassemble Fortran executables with:

gfortran myprog.f90

objdump --disassemble a.out > myprog.s

Fortran 2003 finally settled the five-decade-old ambiguity over console I/O with the intrinsic iso_fortran_env module, which is often invoked at the top of a Fortran module like:

module mymod

use, intrinsic:: iso_fortran_env, only: stdout=>output_unit, stdin=>input_unit, stderr=>error_unit

The => operators are here for renaming (they have other meanings for other Fortran statements). It’s not necessary to rename, but it’s convenient for the popularly used names for these console facilities.

Recommendation: routine console printing:

print *, 'Hello text'

For advanced console printing, whether to output errors, use non-advancing text, or toggle between log files and printing to console, use write(stdout,*) or the like.


Example: print to stdout console if output filename not specified

use, intrinsic:: iso_fortran_env, only: stdout=>output_unit

implicit none (type, external)

character(:), allocatable :: fn
integer :: i, u, L

call get_command_argument(1, length=L, status=i)
if (i == 0 .and. L > 0) then
  !! filename given: write to that file
  allocate(character(L) :: fn)
  call get_command_argument(1, fn)
  print '(a)', 'writing to ' // fn
  open(newunit=u, file=fn, form='formatted')
else
  !! no filename given: fall back to the console
  u = stdout
endif

i = 3 ! test data

write(u,*) i, i**2, i**3

if (u /= stdout) close(u)   ! closing stdout can disable text console output, and writes to file `fort.6` in gfortran

print *,'goodbye'

! end program implies closing all file units, but here we close in case you'd use in subprogram (procedure), where the file reference would persist.
end program

Non-advancing stdout/stdin (for interactive Fortran prompts)

Fortran 2003+ module procedure polymorphism

Polymorphism is a part of generic programming enabled by Fortran 2003. Typically one should encapsulate procedures in modules, even when the whole program is contained in a single file.

Example: addtwo() automatically selects the correct type thanks to the interface block.

module funcs
use, intrinsic:: iso_fortran_env, only: sp=>real32, dp=>real64
implicit none (type, external)
!! takes effect for all procedures within module

interface addtwo
  procedure addtwo_s, addtwo_d, addtwo_i
end interface addtwo

contains

elemental real(sp) function addtwo_s(x) result(y)
real(sp), intent(in) :: x

y = x + 2
end function addtwo_s


elemental real(dp) function addtwo_d(x) result(y)
real(dp), intent(in) :: x

y = x + 2
end function addtwo_d


elemental integer function addtwo_i(x) result(y)
integer, intent(in) :: x

y = x + 2
end function addtwo_i

end module funcs


program test2

use funcs
implicit none (type, external)

real(sp) :: twos = 2._sp
real(dp) :: twod = 2._dp
integer :: twoi = 2

print *, addtwo(twos), addtwo(twod), addtwo(twoi)

end program

flake8 PEP8 quick start

PEP8 code style benefits code readability. “flake8” checks for PEP8 compliance and catches some syntax errors in unexecuted code. flake8 is typically part of continuous integration.

pip install flake8

In the top directory of the particular Python package, type:

flake8

flake8 is configured via the per-project top-level file “.flake8”:

[flake8]
max-line-length = 132
exclude = .git,__pycache__,doc/,docs/,build/,dist/,archive/

If lines that violate PEP8 must remain as-is, individual lines can be exempted from PEP8 with a noqa pragma like:

3==1+2  # noqa: E225

Why upgrade to Python 3.7

Python 3.7 was released in June 2018. It speeds up common operations and adds user-visible changes in the following categories.

The boilerplate copy-paste required for Python classes can seem inelegant. Python 3.7 data classes eliminate the boilerplate code in initializing classes. The @dataclass decorator enables this template.

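from dataclasses import dataclass

@dataclass
class Rover:
    '''Class for robotic rover.'''
    name: str
    uid: int
    temperature: float
    battery_charge: float = 0.  # fields with defaults must follow fields without

    def check_battery_voltage(self) -> float:
        # aioread and port35 are placeholders for a hardware interface not shown here
        return self.aioread(port35) / 256 * 4.1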
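
To see what the decorator generates, here is a minimal runnable sketch. The class Sample is a hypothetical stand-in chosen so that no hardware access is needed. It shows the generated __init__, __repr__ and __eq__, plus the dataclasses.asdict helper.

```python
from dataclasses import asdict, dataclass


@dataclass
class Sample:
    """Hypothetical example class, used only to show the generated methods."""
    name: str
    uid: int
    battery_charge: float = 0.0


a = Sample("rover1", 7)        # generated __init__ fills in the default
b = Sample("rover1", 7, 0.0)

print(a)           # Sample(name='rover1', uid=7, battery_charge=0.0)
print(a == b)      # True, since the generated __eq__ compares field by field
print(asdict(a))   # {'name': 'rover1', 'uid': 7, 'battery_charge': 0.0}
```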

Python 3.7 introduced the built-in breakpoint() function, which breaks into the debugger.

x=1
y=0

breakpoint()

z = x/y
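
breakpoint() does not hard-code pdb: it calls sys.breakpointhook(), which can be replaced. The sketch below swaps in a hypothetical logging hook (the name log_instead_of_debug is an assumption for illustration) so that the example runs without stopping in a debugger.

```python
import sys

calls = []


def log_instead_of_debug(*args, **kwargs):
    """Stand-in hook: record the call rather than opening pdb."""
    calls.append("breakpoint reached")


sys.breakpointhook = log_instead_of_debug    # breakpoint() now calls this hook
breakpoint()
sys.breakpointhook = sys.__breakpointhook__  # restore the default behavior

print(calls)  # ['breakpoint reached']
```

With the default hook, setting the environment variable PYTHONBREAKPOINT=0 turns every breakpoint() call into a no-op, which is handy when stray breakpoints reach automated runs.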

It’s very common to have more than one version of Python installed. Likewise, multiple versions of the same library may be installed, overriding other versions. For example, system Numpy may be overridden with a pip installed Numpy.

Python ≥ 3.7 gives the absolute path and filename from which the ImportError was generated.

from numpy import blah

Python < 3.7:

ImportError: cannot import name 'blah'

Python ≥ 3.7:

ImportError: cannot import name 'blah' from 'numpy' (c:/Python37/Lib/site-packages/numpy/__init__.py)
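
The same information is available programmatically on the exception object. A small sketch, using the standard-library json package in place of Numpy so that it runs anywhere:

```python
try:
    from json import blah  # 'blah' does not exist in the json package
except ImportError as exc:
    message = str(exc)
    module_name = exc.name  # name of the module being imported from
    module_path = exc.path  # file that module was loaded from

print(message)
print(module_name, module_path)
```

Printing exc.path in an error handler is a quick way to find out which of several installed copies of a library was actually picked up.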

The popular and efficient argparse module can now handle intermixed positional and optional arguments, just like the shell.

from argparse import ArgumentParser
p = ArgumentParser()
p.add_argument('xmlfn')
p.add_argument('--plottype')
p.add_argument('indices',nargs='*',type=int)
p = p.parse_intermixed_args()   # instead of p.parse_args()

print(p)

Running:

python myprogram.py my.xml 26 --plottype inv 2 3

prints:

Namespace(indices=[26, 2, 3], plottype='inv', xmlfn='my.xml')

whereas if you had used p.parse_args() you would have gotten

error: unrecognized arguments: 2 3

Note: optparse was deprecated in 2011 and is no longer maintained.


Python ≥ 3.7 can do, even when the modules involved import each other circularly,

import a.b as c

whereas in that circular case Python ≤ 3.6 needed

from a import b as c

The discussion makes the details clear for those who are really interested in Python import behavior.

Python ≥ 3.7 disassembler dis.dis() can reach more deeply inside Python code, adding a depth parameter useful for recursive functions, and elements including:

  • list comprehension: x2 = [x**2 for x in X] (greedy eval)
  • generator expressions: x2 = (x**2 for x in X) (lazy eval)
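
A minimal sketch of the depth parameter follows. The function names squares and disassemble are hypothetical. The example uses a generator expression because it compiles to its own nested code object, which is what the recursion descends into.

```python
import dis
import io


def squares(X):
    # the generator expression compiles to its own nested code object
    return sum(x**2 for x in X)


def disassemble(func, depth=None) -> str:
    """Return dis.dis() output as a string; depth=None means no recursion limit."""
    buf = io.StringIO()
    dis.dis(func, file=buf, depth=depth)
    return buf.getvalue()


shallow = disassemble(squares, depth=0)  # only the body of squares()
deep = disassemble(squares)              # also descends into the generator expression

print(len(shallow.splitlines()), len(deep.splitlines()))
```

The nested listing in deep is introduced by a "Disassembly of <code object ...>" header line that is absent from the depth=0 output.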

Case-insensitive regex sped up by as much as 20x.

Python 3.7 added constants that allow controlling subprocess priority on Windows. This allows keeping the main Python program at one execution priority while launching subprocesses at another priority. subprocess.CREATE_NO_WINDOW enables starting subprocesses without opening a new console window. The confusingly named but important universal_newlines boolean parameter gained the clearer alias text. When text=True, stdin/stdout/stderr carry text streams instead of bytes streams.
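
A short sketch of the text parameter, run against the current interpreter so that it works on any platform. It also uses capture_output, another subprocess.run convenience added in Python 3.7.

```python
import subprocess
import sys

cmd = [sys.executable, "-c", "print('hi')"]

as_bytes = subprocess.run(cmd, capture_output=True)            # default: bytes
as_text = subprocess.run(cmd, capture_output=True, text=True)  # text=True: str

print(type(as_bytes.stdout).__name__, type(as_text.stdout).__name__)
print(as_text.stdout.strip())
```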

Require minimum Python version

pyproject.toml allows fine-grained control of supported Python versions. The minimum Python project version is set like:

[project]
requires-python = ">=3.10"

The required Python versions can be specified as a range of versions or as a specific list of versions.
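
requires-python is checked by pip at install time, so a script run straight from a source checkout is not covered by it. As a complement, a runtime guard can be sketched as below. The helper name meets_minimum is hypothetical, and the (3, 10) default simply mirrors the example above.

```python
import sys


def meets_minimum(minimum=(3, 10), current=None) -> bool:
    """True if `current` (default: the running interpreter) is at least `minimum`."""
    current = sys.version_info if current is None else current
    # compare only as many version components as `minimum` specifies
    return tuple(current[: len(minimum)]) >= tuple(minimum)


if not meets_minimum((3, 0)):
    raise SystemExit("this program needs a newer Python")
```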

Check console script with Pytest

Pytest is the de facto standard for Python unit testing and continuous integration. To be complete in testing, one should test the interactive console scripts that for many Python programs are the main method of use.

Console script testing can be added through Pytest Console Scripts addon, but I usually simply use subprocess.check_call directly like Pytest Console Scripts addon does.

Note that “sys.executable” is the recommended way to securely get the Python executable path, to ensure testing with the same Python interpreter.

import pytest
import subprocess
import sys

def test_find():
    subprocess.check_call([sys.executable, '-m', 'mypkg'])

This is for a package configured with __main__.py or __init__.py such that in normal use, the user types:

python -m mypkg

to run the Python program.
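
When the test should also look at what the script prints, subprocess.run with captured text output is a small step up from check_call. The sketch below is an assumption-laden illustration: run_module is a hypothetical helper, and the standard-library platform module stands in for mypkg so that the example runs as-is.

```python
import subprocess
import sys


def run_module(name, *args):
    """Run `python -m name args...` with the current interpreter, capturing text output."""
    return subprocess.run(
        [sys.executable, "-m", name, *args],
        capture_output=True, text=True, timeout=60,
    )


def test_module_runs():
    # "platform" is a standard-library stand-in for the hypothetical "mypkg"
    result = run_module("platform")
    assert result.returncode == 0
    assert result.stdout.strip()  # the module printed something
```

In a real test, replace "platform" with the package name and assert on a string the program is expected to print.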

Fix Matlab network license authorization

Matlab should generally be installed NOT using sudo. After upgrading the operating system, or if you installed Matlab on a laptop while it was on a docking station and then run it off the docking station, Matlab may complain about a changed host ID.

If Matlab is already installed, but won’t open the desktop due to a licensing error, reactivate Matlab:

$(dirname $(realpath $(which matlab)))/activate_matlab.sh


Get the host ID (MAC address) by:

ip a

Look for the WiFi link/ether hexadecimal value. If connected to the internet via WiFi, you can confirm the correct device by comparing the value for inet or inet6 vs. https://ident.me


Install to the home directory and do NOT use sudo. Make a directory for Matlab installs:

mkdir ~/.local/matlab

Start the Matlab install NOT as root or sudo

./install

Install to a directory like “~/.local/matlab/”. Activate via Internet and sign in to select the license key.

Quick start RTL2832 USB SDR receiver on Linux

GQRX is popular for RTL-SDR receivers on Linux:

adduser $(whoami) plugdev

apt install gqrx-sdr rtl-sdr librtlsdr-dev

You can also download the latest release of GQRX.

In GNU Radio Companion, look for the RTL-SDR Source block.

Test RTL2832 PLL Frequency range:

rtl_test -t

Output should be like:

E4000 tuner

Found 1 device(s): 0: ezcap USB 2.0 DVB-T/DAB/FM dongle
Using device 0: ezcap USB 2.0 DVB-T/DAB/FM dongle
Found Elonics E4000 tuner
Supported gain values (18): -1.0 1.5 4.0 6.5 9.0 11.5 14.0 16.5 19.0 21.5 24.0 29.0 34.0 42.0 43.0 45.0 47.0 49.0
Benchmarking E4000 PLL...
E4K PLL not locked for 53000000 Hz!
E4K PLL not locked for 2217000000 Hz!
E4K PLL not locked for 1109000000 Hz!
E4K PLL not locked for 1248000000 Hz!
E4K range: 54 to 2216 MHz
E4K L-band gap: 1109 to 1248 MHz

R820T tuner

Found 1 device(s):  0:  Realtek, RTL2838UHIDIR, SN: 00000001

Using device 0: Generic RTL2832U OEM
Detached kernel driver
Found Rafael Micro R820T tuner
Supported gain values (29): 0.0 0.9 1.4 2.7 3.7 7.7 8.7 12.5 14.4 15.7 16.6 19.7 20.7 22.9 25.4 28.0 29.7 32.8 33.8 36.4 37.2 38.6 40.2 42.1 43.4 43.9 44.5 48.0 49.6
[R82XX] PLL not locked!
Sampling at 2048000 S/s.
No E4000 tuner found, aborting.
Reattached kernel driver

Record the entire passband ~ 2 MHz bandwidth, not just the demodulated audio. Example command:

rtl_sdr ${TMPDIR}/cap.bin -s 1.8e6 -f 90.1e6

Press Ctrl+C to stop recording after several seconds so that your hard drive doesn’t fill up. You can read the cap.bin file in MATLAB, Python or GNU Radio.
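
As a sketch of the Python route: rtl_sdr writes unsigned 8-bit samples with I and Q interleaved, so each pair of bytes is one complex sample. The helper name read_iq is hypothetical, and the scaling to roughly [-1, 1] is one common convention rather than a requirement.

```python
def read_iq(path, max_samples=None):
    """Read an rtl_sdr capture into a list of complex samples in roughly [-1, 1]."""
    with open(path, "rb") as f:
        raw = f.read() if max_samples is None else f.read(2 * max_samples)
    raw = raw[: len(raw) - len(raw) % 2]  # drop a trailing unpaired byte
    # even-indexed bytes are I, odd-indexed bytes are Q; 127.5 is the midpoint of 0..255
    return [
        complex((i - 127.5) / 127.5, (q - 127.5) / 127.5)
        for i, q in zip(raw[0::2], raw[1::2])
    ]
```

A few seconds at 1.8 MS/s is already millions of samples, so for real captures numpy.fromfile(path, dtype=numpy.uint8) followed by the same scaling is much faster than this pure-Python loop.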


Troubleshooting:

  • is RTL-SDR recognized? Before and after inserting the RTL-SDR receiver into the USB port of your Linux PC, type:

    lsusb

    The output after inserting should show a Realtek device.

  • try a different, non-USB 3 port (USB 2).

  • librtlsdr0 provides file /lib/udev/rules.d/60-librtlsdr0.rules that allows the RTL-SDR stick to be recognized upon USB plugin.

  • dmesg should show dozens of messages with RTL2832 when the USB receiver is plugged in


Other popular programs for the RTL-SDR:

  • MATLAB RTL-SDR support has several examples and a free eBook. Matlab also supports USRP and PLUTO SDR hardware among others.

  • GNU Radio (start with GNU Radio Companion graphical SDR IDE)

    apt install gnuradio

  • pyrtlsdr: a pure Python wrapper for librtlsdr that is less bulky than GNU Radio.

  • SDR#

  • CubicSDR