My scientific computing toolbox
Sat 19 January 2013
There are so many amazing tools out there, and finding them can be quite difficult. I normally ask Cap, since he keeps abreast of this stuff far more than I do. But not everyone has a Cap.
Nat Friedman, founder of Ximian and then Xamarin (and fellow East Campus alum) once wrote a great blog post, "Instant Company" (Google cache) about all the best-of-breed tools he was using in his new (at the time) startup. This post was invaluable in figuring out which tools we should be using for various startup-related tasks. I'm hoping this post will serve as a similar framework for people getting started with scientific computing and machine learning, and explain some of my design decisions along the way.
Scientific Computing and Software Engineering
Two of the hats I wear are "scientist writing experimental throw-away code to do data analytics" and "engineer productizing the results of scientific experiments". Machine learning is particularly punishing here, because we often work with very large datasets and do strange things to them -- it's hard to map the world down to a set of linear algebra routines that BLAS has already optimized.
I guess, as an electrophysiologist and machine learning person with a focus on signal processing, I'm used to working with large dense time-series data. That's why Matlab is the common toolset in these fields. The messy, heterogeneous data present in social sciences and in a lot of computational biology is very foreign to my tool flow (Yarden, this is why I'm not a big pandas user).
Version control and other religions
Keeping your source code in version control is every bit as important as writing in a lab notebook. If you're a scientist who writes code and you're not using some sort of version control, you are a sloppy scientist. You write your results down in a lab notebook, but you don't keep a changelog of your code? How do you go back and look at your mistakes? How do you test whether a change you introduced last week broke everything?
Fortunately, version control and associated hosting are a lot easier these days with git and the hosting service GitHub. It's easy to share code with collaborators, easy to browse others' code, and easy to make the code open source when you publish (you do release all your research software, right?).
The text editor wars have come and gone, but I love Emacs, and "Emacs for Mac OS X" is my preferred build. Magit is an amazing git mode for Emacs that makes git really easy to use -- I never touch the command line anymore. If you're not an Emacs user, I have close friends who swear by Tower.
Scientific Computing
Like many who started out young and doing DSP, I was trained on Matlab. But as a computer scientist, the Matlab language always felt awful, and the license-management bullshit drove me crazy.
Fortunately, both Cap and the brilliant hacker-scientist Fabian Kloosterman introduced me to Python in 2003. Python is a dynamically-typed, interpreted language that is now my go-to language for basically everything. Everything I do at least gets prototyped in Python.
Python works for scientific computing thanks to several amazing libraries:
- numpy -- everything Matlab should be and more. numpy lets you do dense vector/matrix operations very quickly, with the vector operations executed by optimized-under-the-hood C code (a short sketch follows this list).
- SciPy -- numpy is just the basic array object; scipy provides every library you could want on top of that, including advanced linear algebra, signal processing, image processing, stats, etc.
- matplotlib -- Matplotlib has come a very long way since it was a loose wrapper around Agg for rendering. Check out the gallery for a quick overview of what you can do. People keep trying to make better-than-matplotlib tools, but nothing is mature enough yet for publication figures.
- Cython -- sometimes you need to write pure Python code (say, when your algorithm isn't easily vectorized), and that can be slow. Cython to the rescue: Cython makes it very easy to write chunks of Python-like code that are automatically compiled down to C. I find myself reaching for Cython before C++ these days.
- IPython / IPython notebooks -- IPython was always my favorite interactive Python interpreter (think of it like the Matlab dev environment), but with the recent addition of IPython notebooks, it's even better. IPython notebooks work a lot like Mathematica notebooks, saving the history of computation and results and displaying plots in-line -- all via the web! An amazing resource for prototyping and interactive exploration (of the sort we do a lot as scientists).
- ruffus -- scientific pipeline management. Think of this as an advanced version of "make" for scientific code. Ruffus makes it easy to set up pipelines and then run them, caching the intermediate results. Also, you can tell ruffus multiprocess=N and it will use all your cores (sketched below the list).
- nose -- Nose is by far my favorite unit-test framework for Python. Nose makes test discovery very easy: you can just have a bunch of files called "test_foo.py" in your project, and nose will recursively walk through your directories, execute every function named "test_*" inside them, and aggregate the results. It's so much easier than hand-stitching together a bunch of test runners (example below).
- line_profiler -- I find the state of profiling in Python to be mostly abysmal, but line_profiler saves me: you just annotate your function of interest with @profile, and it shows you the runtime of each line in that function (example below).
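To make the numpy point concrete, here's a minimal sketch (the toy RMS computation and the array size are just for illustration) of the difference between looping in Python and letting numpy's C internals do the work:

```python
import numpy as np

x = np.random.randn(100000)  # a largish dense vector

# Pure-Python loop: the interpreter touches every element one at a time.
rms_loop = (sum(v * v for v in x) / len(x)) ** 0.5

# Vectorized numpy: the multiply, mean, and sqrt all run in optimized C.
rms_vec = np.sqrt(np.mean(x * x))

assert abs(rms_loop - rms_vec) < 1e-6
```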
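The ruffus sketch I promised above -- the two stages and the file names are invented for illustration, so treat this as a rough shape rather than a recipe:

```python
from ruffus import transform, suffix, pipeline_run

# Hypothetical raw recordings; in a real project these already exist on disk.
raw_files = ["session1.raw", "session2.raw"]

@transform(raw_files, suffix(".raw"), ".filtered")
def bandpass_filter(input_file, output_file):
    # Stand-in for a real filtering step.
    data = open(input_file).read()
    open(output_file, "w").write(data)

@transform(bandpass_filter, suffix(".filtered"), ".spikes")
def detect_spikes(input_file, output_file):
    # Stand-in for a real spike-detection step.
    data = open(input_file).read()
    open(output_file, "w").write(str(len(data)))

# Only reruns stages whose inputs have changed; multiprocess=4 uses four cores.
pipeline_run([detect_spikes], multiprocess=4)
```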
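And the nose example: a hypothetical test_smoothing.py (the file name and the boxcar-smoothing function are made up) that nosetests will discover and run on its own:

```python
# test_smoothing.py -- nose finds files and functions named test_*
import numpy as np

def moving_average(x, width):
    """Toy function under test: simple boxcar smoothing."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode='valid')

def test_constant_signal_is_unchanged():
    smoothed = moving_average(np.ones(100), 5)
    assert np.allclose(smoothed, 1.0)

def test_output_length():
    x = np.arange(50, dtype=float)
    assert len(moving_average(x, 5)) == 50 - 5 + 1
```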
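Finally, the line_profiler workflow looks roughly like this (the function is a throwaway example, and the exact kernprof invocation has shifted a bit between versions):

```python
# Run with something like:  kernprof.py -l -v my_analysis.py
# kernprof injects `profile` into builtins, so the script only runs under it.

@profile
def summarize(samples):
    total = 0.0
    for s in samples:          # per-line timings are reported for each of these
        total += s ** 0.5
    return total / len(samples)

if __name__ == "__main__":
    summarize(range(1, 1000000))
```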
Want it all on OS X? Chris Fonnesbeck maintains the SciPy Superpack, a giant pre-built bundle of tested, interworking scientific Python tools, including some I didn't mention here. He updates it often, too.
Projects to Watch
The scientific Python community is a hotbed of innovation, especially the folks at the newly-formed Continuum Analytics, who are advancing the state of the art.
- Blaze: an attempt at the next generation of numpy-like functionality, supporting much larger datasets. "Blaze is designed to handle out-of-core computations on large datasets that exceed the system memory capacity, as well as on distributed and streaming data. Blaze is able to operate on datasets transparently as if they behaved like in-memory NumPy arrays."
- numba: auto-JIT simple Python expressions into optimized LLVM code. By the amazing Travis. This can spare you from writing Cython in some cases and give you really serious speedups in numerics-intensive code. Here's a great example of what's possible! (A small sketch follows this list.)
- Theano: an attempt to let you specify dense matrix/vector/array operations by building up a call graph, which Theano can then efficiently compile and execute on a GPU. It also performs some automatic differentiation, for those of you who really need a Hessian or two. (Sketch after this list.)
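Here's the numba sketch mentioned above: an explicit-loop pairwise-distance function (deliberately unvectorized) that the JIT compiles down for you. Note that the decorator spelling has moved around between early releases (autojit vs. jit), so check the docs for the version you're on.

```python
import numpy as np
from numba import jit

@jit
def pairwise_dist(X):
    # Naive O(n^2 d) Euclidean distances: painful in pure Python,
    # fast once numba compiles the loops.
    n, d = X.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                s += diff * diff
            out[i, j] = s ** 0.5
    return out

X = np.random.randn(500, 3)
D = pairwise_dist(X)
```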
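And the Theano sketch: define an expression symbolically, ask for its gradient, and compile the whole graph (to the CPU here; the GPU path needs the right device flags). The quadratic form is just a toy.

```python
import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')
A = T.dmatrix('A')
y = 0.5 * T.dot(x, T.dot(A, x))   # scalar quadratic form, built symbolically

grad_y = T.grad(y, x)             # symbolic gradient, no finite differences

f = theano.function([x, A], [y, grad_y])
print(f(np.array([1.0, 2.0, 3.0]), np.eye(3)))
```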
Production code
I deploy a lot of Python in production -- especially on the web. But for when I need to write super-optimized inference algorithms, meet hard real-time constraints, or get it all running on an embedded system, I reach for C++.
- Why C++ (vs. C): Modern C++ (C++11) and libraries like the Boost suite provide an amazing, batteries-included experience that lets you get up and running quickly. No void pointers, nice clean namespaces, and objects when you want them.
- Why C++ (vs. Java): A lot of the computation I do is memory-bound, and the per-object overhead in Java can be a real killer. Plus, I like to call out to other native code a lot (including some asm/intrinsics), and the JNI overhead is massive.
- Why C++ and Python: Seriously, use the amazing boost::python to wrap parts of your C++ system in Python, and then dynamically assemble them at runtime. It is by far the best scheme I have found for debugging and testing your highly-optimized code from a tolerable scripting language. Bonus: for your scientific code, you can do all the testing / debugging in Python/numpy.
I don't love CMake as a build system -- I hate it. I just hate it less than autotools, scons, and the rest.
Distributed computation
Python's multiprocessing module is pretty convenient, and ruffus takes care of a lot of the grunt work of exploiting many cores for simple tasks. But for more complicated or more granular tasks, I highly recommend PiCloud. PiCloud makes it trivial to map your Python computation over thousands of cores, and it is very useful for all of my advanced machine learning prototyping, especially since that work is compute-bound. My latest UCSF project has used it extensively. The guys behind it are great, and they continually push the envelope.
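A rough sketch of what that looks like with PiCloud's cloud client library -- the hyperparameter-sweep function is invented, and you should check their docs for the current call signatures:

```python
import cloud  # PiCloud's client library

def fit_and_score(hyperparam):
    # Hypothetical compute-bound job: fit a model for one hyperparameter
    # setting and return its held-out score.
    return hyperparam ** 0.5  # ...expensive model fitting would go here...

# Fan the sweep out over PiCloud's workers instead of local cores.
job_ids = cloud.map(fit_and_score, range(1, 1001))
scores = cloud.result(job_ids)   # blocks until all the jobs finish
```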
Other aspects of science
The other must-have tool in my toolbox is the citation manager Mendeley, which keeps my archive of papers in great order and syncs them between all my devices, across Linux, OS X, and my iPad. Their new iOS client (now in beta) is amazing! Any time I can read a paper on the train is a win.
Is there anything I'm missing? Let me know!