A Streamlined Workflow for Scientific Reproducibility and Efficiency

As an undergraduate researcher, I was often unnecessarily bogged down by my poor workflow.  I consistently wrote "one-off" scripts that sometimes turned into important components of my research, and I was frequently frustrated by how difficult my own code was to use.  My workflow improved a little when I got to grad school, but it wasn't until early this semester that I finally sat down and designed a workflow I could be happy with.  After collaborating with another student on a project and having a really hard time integrating with his code, I decided I needed a good set of practices, both for my own sanity and for the sake of future collaborators.  While there is an upfront cost to using good programming principles, the long-term benefits usually far outweigh it.  In this blog post, I'll summarize the key pieces of my new workflow.  If you struggle with any of the problems I mentioned, I hope you'll find the content of this post valuable and useful.

Here are some prerequisites:
  • Master the command line in Linux
  • Learn a text-based editor such as Vim
  • Use workspaces in Linux to switch back and forth between different tasks
  • Be familiar with Python
And here are some desiderata for our workflow:
  • Need to be able to change parameter settings without modifying the script
  • Need to be able to understand the script even after not looking at it for weeks or months
    • Other people should also be able to run and understand your script
  • Should produce comprehensive results in an easy-to-read, easy-to-manipulate format
  • Should be easy to parallelize on a cluster
So here are some concrete things that have substantially improved my workflow, and will hopefully help yours as well:
  • Generalize your code using argparse
  • Create templates for running experiments, plotting, parameter searching, etc.
  • Use virtual environments
  • Document code well using an official standard
  • Use IPython for exploratory work
  • Learn tools for reliable debugging
  • Use matplotlib style sheets
  • Test robustly using Hypothesis
  • Parallelize your code
Argparse

Argparse is a Python library that lets you pass parameters to your script from the command line.  It's easy to use and well worth the initial time investment to learn.  Every experimental parameter should take a default value so that you can run the code without specifying anything.  Consistently using argparse has been the single biggest contributor to my improved workflow efficiency.
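A minimal sketch of the pattern (the script and parameter names here are made up):

import argparse

parser = argparse.ArgumentParser(description="Run a toy experiment.")
# every experimental parameter gets a default, so the script
# runs even when nothing is specified on the command line
parser.add_argument("--learning-rate", type=float, default=0.01,
                    help="step size for the optimizer")
parser.add_argument("--num-trials", type=int, default=10,
                    help="number of independent trials")
parser.add_argument("--output", default="results.csv",
                    help="file to write results to")
args = parser.parse_args()

print(args.learning_rate, args.num_trials, args.output)

Running python experiment.py --num-trials 100 overrides just that one setting, and argparse gives you python experiment.py -h for free.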

Templates

Every time I create a new script, I'm tempted to skip the initial 10-15 minutes of writing the argparse and other boilerplate code.  So I created a handful of templates which I use as starting points instead of a blank Python file.  These templates contain the boilerplate code that doesn't change much between scripts, the imports I regularly use such as numpy and pandas, and one properly documented dummy function, which encourages me to document my code well without having to look up the correct comment format.  I have multiple templates: a simple one that contains just the things described above, one specifically for plotting, and one for running experiments.  Every time I create a new script, I copy one of these templates.  I made these templates available on GitHub.
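My actual templates are on GitHub, but a stripped-down sketch of the simple one gives the idea (dummy_function is just a placeholder):

"""Template for a new script: common imports, argparse boilerplate,
and one documented dummy function to copy from."""
import argparse

import numpy as np
import pandas as pd  # imported because most of my scripts end up using it

def dummy_function(x):
    """Return the input unchanged (a placeholder showing the doc format)."""
    return x

def get_args():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--seed", type=int, default=0, help="random seed")
    return parser.parse_args()

if __name__ == "__main__":
    args = get_args()
    np.random.seed(args.seed)
    # actual script logic goes here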

Virtual Environments

Some of my projects use Python 3 while others use Python 2, and it can be hard to keep track of it all.  One project might also have dependencies that conflict with another's.  This is a pain to deal with, but virtualenv provides a very easy solution: it lets you maintain multiple Python environments so that every project can have its own, and it's very easy to use.
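The basic usage looks like this (env is just a directory name; on modern Python, python3 -m venv env does the same job):

# create an isolated environment in ./env
virtualenv env

# activate it; pip installs now stay inside env/
source env/bin/activate
pip install numpy pandas

# switch back to the system python when done
deactivate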

Documentation

One excuse I used to give for not properly documenting my code was that I didn't know the standard format a comment should be in.  That's a pretty lame excuse, so I invested an hour into learning how to properly document code in Python, and incorporated that knowledge into my template scripts so I wouldn't have to go back and re-learn it the next time I wanted to document something correctly.
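NumPy-style docstrings are one widely used standard; here's what a documented function looks like in that format (running_mean is just an example):

import numpy as np

def running_mean(values, window):
    """Compute the running mean of a 1-D sequence.

    Parameters
    ----------
    values : array_like
        Input sequence of numbers.
    window : int
        Size of the averaging window; must be positive.

    Returns
    -------
    numpy.ndarray
        The len(values) - window + 1 windowed means.
    """
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")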

IPython

IPython and IPython notebooks are great for exploratory work in Python.  When I'm developing a Python script, I like to keep one terminal tab dedicated to IPython so I can test small pieces of code as I go.  Since Python is a dynamically typed language, a lot of problems go unnoticed until runtime, and that can make development slow.  That's why I like to verify in IPython that a specific function call will work exactly as I expect.  I like to use IPython notebooks for experimental work because I can quickly and easily iterate on an idea within the notebook environment.  Once the idea becomes more refined, I usually turn it into a Python script that I can run from the command line.
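For example, before relying on a numpy call in a script, I can sanity-check it interactively; a typical session looks like:

In [1]: import numpy as np

In [2]: np.argsort([3, 1, 2])   # does it return indices or sorted values?
Out[2]: array([1, 2, 0])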

Debugging

Debugging code is much easier when you know the right tools.  I used to rely entirely on print statements to debug my code.  That worked fairly well when I was a Java programmer and the majority of my bugs were caught at compile time, but once I switched my primary language to Python it became a pretty inefficient process.  Now I like to use pdb.set_trace() and IPython.embed() to stop execution of my program at a particular point and understand what's going wrong there.  Knowing how to profile your program to identify the bottlenecks in a long-running job is also useful.
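The pattern looks like this (average is a made-up buggy function):

import pdb

def average(values):
    total = sum(values)
    # execution pauses here: inspect `total` and `values` at the
    # (Pdb) prompt, then type `c` to continue or `q` to quit.
    # IPython.embed() works similarly but gives a full IPython shell.
    pdb.set_trace()
    return total / len(values)

average([])  # ZeroDivisionError -- step in and see why

For profiling, python -m cProfile -s cumtime myscript.py is a decent first pass: it sorts functions by cumulative time, so the bottlenecks float to the top.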

Matplotlib

Plotting is a very important component of every scientific workflow.  Matplotlib is the go-to library for plotting in Python, and one thing I was unaware of until recently is style sheets, which allow you to change the look and feel of your plot with a single line of code (or, alternatively, something that can be passed in via the command line with argparse!).  There is a lot of boilerplate code that goes into making pretty plots, which is why plotting is one of the templates I frequently use.
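Here's a minimal sketch of a style sheet in action ("ggplot" is one of the styles that ships with matplotlib; plt.style.available lists the rest):

import matplotlib.pyplot as plt
import numpy as np

plt.style.use("ggplot")  # the one line that restyles everything

x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), label="cos(x)")
plt.legend()
plt.savefig("styled_plot.png")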

Hypothesis

Most of us don't enjoy unit testing, but there is an alternative known as "property-based testing" which is actually much more robust than ordinary unit testing and a lot more fun to write.  The idea was popularized within the Haskell community by a library called QuickCheck.  In Python, there is a library called Hypothesis for property-based testing.  It's well worth a few hours of your time to go through the Hypothesis documentation thoroughly and learn how you can incorporate it into your workflow.
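The idea is that instead of hand-picking inputs, you state a property that should hold for all inputs and let Hypothesis generate the test cases; a small sketch:

from hypothesis import given
import hypothesis.strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # property: sorting an already-sorted list changes nothing
    once = sorted(xs)
    assert sorted(once) == once

Run it with pytest and Hypothesis will throw hundreds of generated lists at the property, shrinking any failing case to a minimal counterexample.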

Parallelization

While I usually do most of my development on my laptop, sometimes I want to launch a long-running experiment or try a bunch of different parameter settings on a specific script.  It would take too long to run all these experiments on my laptop, so I frequently run my scripts on my university's cluster.  It takes some time to learn the resource manager (Slurm, in my case), but the turnaround time for getting results can be much faster, so in my opinion it's well worth the initial time investment.
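Submission scripts are specific to each cluster, but the underlying fan-out pattern can be sketched locally with the standard library's multiprocessing (run_experiment is a stand-in for a real experiment):

from multiprocessing import Pool

def run_experiment(learning_rate):
    # stand-in for a real experiment; returns a fake score
    return learning_rate, learning_rate ** 2

if __name__ == "__main__":
    settings = [0.001, 0.01, 0.1, 1.0]
    with Pool() as pool:
        for lr, score in pool.map(run_experiment, settings):
            print("lr={}: score={}".format(lr, score))

On a cluster, each parameter setting becomes its own job instead, with the setting passed in via argparse.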


All of the things I described have an upfront cost to learn, but they pay off in the long run.  Hopefully some of the things I talked about in this post will help you improve your own workflow.  If you are interested in using my templates and modifying them for your own workflow, they are available on my GitHub page.  Finally, if you have any ideas on how to improve my workflow even more, please feel free to share them in the comments below!
