My scientific computing toolbox

Sat 19 January 2013

There are so many amazing tools out there, and finding them can be quite difficult. I normally ask Cap, since he keeps abreast of this stuff far more than I do. But not everyone has a Cap.

Nat Friedman, founder of Ximian and then Xamarin (and fellow East Campus alum) once wrote a great blog post, "Instant Company" (Google cache) about all the best-of-breed tools he was using in his new (at the time) startup. This post was invaluable in figuring out which tools we should be using for various startup-related tasks. I'm hoping this post will serve as a similar framework for people getting started with scientific computing and machine learning, and explain some of my design decisions along the way.

Scientific Computing and Software Engineering

Two of the hats I wear are "scientist writing experimental throw-away code to do data analytics" and "engineer productizing the results of scientific experiments". Machine learning is particularly punishing here, because we often work with very large datasets and do strange things to them -- it's hard to map the world down to a set of linear algebra routines that BLAS has already optimized.

I guess, as an electrophysiologist and machine learning person with a focus on signal processing, I'm used to working with large dense time-series data. That's why Matlab is the common toolset in these fields. The messy, heterogeneous data present in social sciences and in a lot of computational biology is very foreign to my tool flow (Yarden, this is why I'm not a big pandas user).

Version control and other religions

Keeping your source code in version control is every bit as important as writing in a lab notebook. If you're a scientist who writes code and you're not using some sort of version control, you are a sloppy scientist. You write your results down in a lab notebook, but you don't keep a changelog of your code? How do you go back and look at your mistakes? How do you test whether a change you introduced last week broke everything?

Fortunately, version control and associated hosting are a lot easier these days with git and the hosting service github. It's easy to share code with collaborators, easy to browse others' code, and easy to make the code open-source when you publish (you do release all your research software, right?).

The text editor wars have come and gone, but I love emacs, and "Emacs for Mac OS X" is my preferred build. Magit is an amazing git mode for emacs that makes git really easy to use -- I never touch the command line anymore. If you're not an emacs user, I have close friends who swear by Tower.

Scientific Computing

Like many who started out doing DSP at a young age, I was trained on Matlab. But as a computer scientist, the Matlab language always felt awful, and the license-management bullshit drove me crazy.

Fortunately, both Cap and the brilliant hacker-scientist Fabian Kloosterman introduced me to Python in 2003. Python is a dynamically-typed, interpreted language that is now my go-to language for basically everything. Everything I do at least gets prototyped in Python.

Python works for scientific computing thanks to several amazing libraries: numpy for fast dense arrays and linear algebra, scipy for the numerical routines built on top of them, matplotlib for publication-quality plotting, and IPython for an interactive shell that makes exploratory work pleasant.
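To give a taste of why this stack makes dense time-series work so pleasant, here's a minimal sketch: a noisy-free test signal and a dominant-frequency estimate in a few lines of numpy (the sample rate and frequency are just illustrative numbers, not anything from a real experiment).

```python
import numpy as np

# One second of a pure 50 Hz sinusoid sampled at 1 kHz.
fs = 1000.0                       # sample rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)   # 1000 sample times
x = np.sin(2 * np.pi * 50 * t)    # the signal itself

# Estimate the dominant frequency via the real FFT.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
peak = freqs[np.argmax(spectrum)]
print(peak)  # 50.0
```

The equivalent Matlab is about the same length, but here it lives inside a general-purpose language with real data structures, modules, and testing tools.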

Want it all on OS X? Chris Fonnesbeck maintains the SciPy superpack, a giant pre-built bundle of interoperability-tested scientific Python tools, including some I didn't mention here. He updates it often, too.

Projects to Watch

The scientific Python community is a hotbed of innovation, especially the folks at the newly-formed Continuum Analytics, who are advancing the state of the art.

Production code

I deploy a lot of Python in production -- especially on the web. But when I need to write super-optimized inference algorithms, meet hard real-time constraints, or get it all running on an embedded system, I reach for C++.

I don't love CMake as a build system; I hate it. I just hate it less than autotools, scons, and the rest.

Distributed computation

Python's multiprocessing module is pretty convenient, and ruffus takes care of a lot of the grunt work of exploiting many cores for simple tasks. But for more complicated or more granular tasks, I highly recommend PiCloud. PiCloud makes it trivial to map your Python computation over thousands of cores, and it has been very useful for my advanced machine learning prototyping, which is almost entirely compute-bound. My latest UCSF project has used it extensively. And the guys behind it are great, and continually push the envelope.

Other aspects of science

The other must-have tool in my toolbox is the citation manager Mendeley, which keeps my archive of papers in great order and syncs them between all my devices, across Linux, OS X, and my iPad. Their new iOS client (now in beta) is amazing! Any time I can read a paper on the train is a win.

Is there anything I'm missing? Let me know!