Detect project primary code languages

GitHub, GitLab and similar repository services deal with hundreds of coding languages. Accurate detection of coding languages in a project is useful for discovery of repositories that are of interest to users and for security scanning, among other purposes. Scientific computing developers are generally interested in a narrow subset of programming languages. HPC developers are generally interested in an even narrower subset of programming languages. We recognize the “long tail” of advanced research using specialized languages or even their own language. However, most contemporary HPC and scientific computing work revolves around a handful of programming languages.

To rapidly detect coding languages at each “git push”, GitHub developed the open-source Ruby-based Linguist. GitLab also uses Linguist. We developed a Python interface to Linguist that requires the end user to install Ruby and Linguist. However, Linguist is not readily usable from native Windows (including MSYS2) because some of Linguist’s dependencies have Unix-specific code, despite being written in Ruby. The same issues can happen in general in Python if the developers aren’t using multi-OS CI. GitHub recognized the accuracy shortcomings of Linguist (cited as 84% on average) and developed the 99% accurate closed-source OctoLingua OctoLingua deals with the 50 most popular code languages on GitHub. Little has been heard since July 2019 about OctoLingua.

We provide initial implementation of a tool code-sleuth that actively introspects projects, using a variety of heuristics and direct action. A key design factor of code-sleuth is to introspect languages using specific techniques such as invoking CMake or Meson to introspect the project developers intended languages. The goal is not to detect every language in a project, but instead to detect the primary languages of a project. Also, we desire to resolve the language standards required, including: Python, C++, C, Fortran. This detection will allow a user to know what compiler or environment is needed in automated fashion.