A Streamlined Workflow for Scientific Reproducibility and Efficiency

As an undergraduate researcher, I was often bogged down by a poor workflow. I routinely wrote "one-off" scripts that sometimes turned into important components of my research, and I was frequently frustrated by how hard they were to use. My workflow improved a little when I got to grad school, but it wasn't until early this semester that I finally decided to design one I could be happy with. After collaborating with another student on a project and having a really hard time integrating with his code, I decided I needed a good set of practices, both for my own sanity and for the sake of future collaborators. While good programming principles carry an upfront cost, the long-term benefits usually far outweigh it. In this blog post, I'll summarize some key pieces of my new workflow. If you are struggling with some of the problems I mentioned, I hope you will find this post useful.

Here are some prerequisites:
  • Master the command line in Linux
  • Learn a text-based editor such as vim
  • Use workspaces in Linux to switch back and forth between different tasks
  • Be familiar with Python
And here are some desiderata for our workflow:
  • Need to be able to change parameter settings without modifying the script
  • Need to be able to understand a script even after not looking at it for weeks or months
    • Other people should also be able to run and understand your script
  • Should produce comprehensive results in a format that is easy to read and manipulate
  • Should be easy to parallelize on a cluster
So here are some concrete things that have substantially improved my workflow, and will hopefully help yours as well:
  • Generalize your code using argparse
  • Create templates for running experiments, plotting, parameter searching, etc.
  • Use virtual environments
  • Document code well using an official standard
  • Use IPython for exploratory work
  • Learn tools for reliable debugging
  • Use matplotlib style sheets
  • Test robustly using hypothesis
  • Parallelize your code

Argparse

Argparse is a Python library that lets you pass parameters to your script on the command line. It's pretty easy to use, and well worth the initial time investment to learn what you can do with it. Every experimental parameter should have a default value so that you can run the code without specifying anything. Consistently using argparse has been the biggest contributor to my increased workflow efficiency.
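To make that concrete, here is a minimal sketch of what such a script could look like. The parameter names (--trials, --epsilon, --seed, --output) are made up for illustration and are not taken from my actual templates; the point is only that every argument has a default.

```python
import argparse


def build_parser():
    """Build the command-line parser; all parameters here are hypothetical."""
    parser = argparse.ArgumentParser(description="Run one experiment (sketch).")
    # Every argument has a default, so the script runs with no arguments at all.
    parser.add_argument("--trials", type=int, default=10,
                        help="number of repetitions")
    parser.add_argument("--epsilon", type=float, default=1.0,
                        help="a made-up tuning parameter")
    parser.add_argument("--seed", type=int, default=0,
                        help="random seed, for reproducibility")
    parser.add_argument("--output", type=str, default="results.csv",
                        help="where results would be written")
    return parser


def main(argv=None):
    # argv=None makes argparse read sys.argv[1:]; passing a list helps testing.
    args = build_parser().parse_args(argv)
    print(f"trials={args.trials} epsilon={args.epsilon} "
          f"seed={args.seed} output={args.output}")
    return args


if __name__ == "__main__":
    main()
```

Assuming the file were saved as experiment.py, running "python experiment.py" would use the defaults, while "python experiment.py --epsilon 0.5 --trials 100" would override two of them without touching the script.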

Templates

Every time I create a new script, I'm tempted to skip the initial 10-15 minutes it takes to write the argparse and other boilerplate code. So I created a handful of templates that I use as a starting point instead of a blank Python file. These templates contain the boilerplate code that doesn't change much between scripts, along with all the imports I regularly use, such as numpy and pandas. Each one also contains a properly documented dummy function, which encourages me to document my code well without having to look up the correct comment format. I have multiple templates: a simple one that contains the things just described, one specifically for plotting, and one for running experiments. Every time I create a new script, I copy one of these templates. I have made these templates available on GitHub.

Virtual Environments

Some of my projects use Python 3 while others use Python 2, and sometimes it's hard to keep track of which is which. One project might also have different or conflicting dependencies from another. This is a pain to deal with, but virtualenv provides a very easy solution. It lets you maintain multiple Python environments so that every project can have its own, and it's very easy to use.

Documentation

One excuse I used to give for not properly documenting my code was that I didn't know the standard format a comment should follow. That is a pretty lame excuse, so I decided to invest an hour in learning how to properly document code in Python, and I incorporated that knowledge into my template scripts so I wouldn't have to re-learn it the next time I wanted to document something correctly.
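As an illustration, a documented dummy function might look something like the sketch below. It assumes the NumPy docstring convention, which is only one of several common standards, and the function itself (running_mean) is a made-up example rather than the one in my templates.

```python
def running_mean(values, window=3):
    """Compute the running mean of a sequence.

    Parameters
    ----------
    values : list of float
        Input data.
    window : int, optional
        Number of consecutive elements to average (default 3).

    Returns
    -------
    list of float
        One mean per full window; empty if ``len(values) < window``.

    Raises
    ------
    ValueError
        If ``window`` is not positive.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    # Slide a fixed-size window over the data and average each slice.
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

Having one function like this in a template means the section names and layout are always a copy-paste away.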

IPython

IPython and IPython notebooks are great for doing exploratory things in Python. When I'm developing a Python script, I like to keep one tab in my terminal dedicated to IPython so that I can quickly test a small piece of code. Since Python is a dynamically typed language, a lot of problems can go unnoticed until runtime, and that can slow development down. That's why I like to verify (using IPython) that a specific function call works exactly as I expect. I like to use IPython notebooks for experimental work because I can quickly and easily iterate on an idea within the notebook environment. Once the idea becomes more refined, I usually turn it into a Python script that I can run from the command line.

Debugging

Debugging code is much easier when you know the right tools to use. I used to rely completely on print statements to debug my code. This worked fairly well for me when I was a Java programmer and the majority of my bugs were caught at compile time, but once I switched my primary language to Python, it became a pretty inefficient process. Now I like to use pdb.set_trace() and IPython.embed() to stop execution of my program at a particular point and understand what's going wrong there. It is also useful to know how to profile your program to identify the weak link(s) in a long-running job.
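For the profiling part, one option is the standard library's cProfile module. The sketch below is only an illustration: slow_sum_of_squares is a made-up stand-in for a long-running function, and profile_call is a hypothetical helper name.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """A deliberately naive loop standing in for real, slow work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    stream = io.StringIO()
    # Sort by cumulative time and keep only the top five entries.
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    # To pause inside a function instead, add this line where you want to stop:
    #     import pdb; pdb.set_trace()
    value, report = profile_call(slow_sum_of_squares, 100000)
    print(report)
```

The report lists each function with its call count and cumulative time, which is usually enough to spot where a long-running script spends most of its time.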

Matplotlib

Plotting is a very important component of every scientific workflow. Matplotlib is the go-to library for plotting in Python, and one thing I was unaware of until recently is style sheets, which let you change the look and feel of your plot with only one line of code (or alternatively, with a style name passed in on the command line via argparse!). There is a lot of boilerplate code that goes into making pretty plots, which is why the plotting template is one I frequently use.
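Here is a rough sketch of the style-sheet idea. It assumes matplotlib is installed and uses the built-in "ggplot" style; the function name make_plot and the data are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def make_plot(style="ggplot", output=None):
    """Draw a toy curve using the named matplotlib style sheet."""
    plt.style.use(style)  # the one line that changes the look and feel
    fig, ax = plt.subplots()
    xs = list(range(10))
    ax.plot(xs, [x ** 2 for x in xs], label="x squared")  # made-up data
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    if output is not None:
        fig.savefig(output)  # only write a file when a path is given
    return fig
```

Because the style is just a string, it could come straight from an argparse option. The list plt.style.available shows which style names are installed on your machine.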

Hypothesis

Most of us don't enjoy unit testing, but there is an alternative form of testing known as "property-based testing", which I find much more robust than unit testing and a lot more fun to code. Instead of checking a handful of hand-picked inputs, you state a property that should hold for all inputs and let the library generate test cases for you. The idea was popularized within the Haskell community by a module called QuickCheck. In Python, there is a library called Hypothesis for property-based testing. It's well worth a few hours of your time to go through the Hypothesis documentation thoroughly and learn how you can incorporate it into your workflow.
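To give a flavor of what a property looks like, here is a small sketch, assuming the hypothesis package is installed. The run-length encoder is a toy example invented for this post; the property being tested is that decoding an encoded string always gives back the original.

```python
from hypothesis import given, strategies as st


def rle_encode(text):
    """Run-length encode a string into a list of (character, count) pairs."""
    pairs = []
    for ch in text:
        if pairs and pairs[-1][0] == ch:
            pairs[-1] = (ch, pairs[-1][1] + 1)  # extend the current run
        else:
            pairs.append((ch, 1))  # start a new run
    return pairs


def rle_decode(pairs):
    """Invert rle_encode."""
    return "".join(ch * count for ch, count in pairs)


@given(st.text())
def test_roundtrip(text):
    # Property: decode(encode(x)) == x for every string x Hypothesis generates.
    assert rle_decode(rle_encode(text)) == text


if __name__ == "__main__":
    test_roundtrip()  # calling the decorated function runs many random cases
```

A test runner such as pytest would pick up test_roundtrip automatically; calling it directly, as in the last line, also works.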

Parallelization

While I usually do most of my development on my laptop, sometimes I want to launch a long-running experiment or try a bunch of different parameter settings for a specific script. It would take too long to run all of these experiments on my laptop, so I frequently run my scripts on my university's cluster. It takes some time to learn how to use the resource manager (Slurm in my case), but the turnaround time for getting results can be much shorter, so in my opinion it's well worth the initial time investment.
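One pattern that fits this setup is a Slurm job array, where each task reads the SLURM_ARRAY_TASK_ID environment variable and picks its own parameter setting. The sketch below is an assumption about how this could be wired up, not a description of my cluster's configuration; the grid values and the function name settings_for_task are invented.

```python
import itertools
import os

# Hypothetical parameter grid: 3 epsilon values x 2 seeds = 6 combinations.
GRID = {"epsilon": [0.1, 1.0, 10.0], "seed": [0, 1]}


def settings_for_task(task_id, grid=GRID):
    """Map an array task index to one combination of parameter values."""
    keys = sorted(grid)  # fixed key order so the mapping is reproducible
    combos = list(itertools.product(*(grid[key] for key in keys)))
    return dict(zip(keys, combos[task_id]))


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID for each task of a job array;
    # fall back to 0 so the script still runs on a laptop.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(settings_for_task(task_id))
```

With a grid of six combinations, submitting the job with "sbatch --array=0-5" would start one task per setting. The rest of the submission script depends on how your cluster is set up.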


All of the things I described have an upfront cost to learn, but they will pay off in the long run. Hopefully some of the things I talked about in this post will help you improve your workflow. If you are interested in using my templates and modifying them for your own workflow, I am making them available on my GitHub page. Finally, if you have any ideas on how to improve my workflow even more, please feel free to share them in the comments below!
