A Streamlined Workflow for Scientific Reproducibility and Efficiency
As an undergraduate researcher, I was often bogged down by my poor workflow. I consistently wrote "one-off" scripts that sometimes turned into important components of my research, and I was frequently frustrated by how difficult they were to use. My workflow improved a little when I got to grad school, but it wasn't until early this semester that I decided to come up with a workflow I could be happy with. After collaborating with another student on a project and having a really hard time integrating with his code, I decided I needed a good set of practices for my own sanity and for the sake of future collaborators. While good programming principles carry an upfront cost, the long-term benefits usually far outweigh it. In this blog post, I'll summarize some key pieces of my new workflow. If you are struggling with some of the problems I mentioned, I hope you will find this post useful.
Here are some prerequisites:
- Master the command line in Linux
- Learn a text-based editor such as vim
- Use workspaces in Linux to switch back and forth between different tasks
- Be familiar with Python
Here are the goals I had for the workflow:
- You need to be able to change parameter settings without modifying the script
- You need to be able to understand a script even after not looking at it for weeks or months
- Other people should also be able to run and understand your script
- A script should produce comprehensive results in a format that is easy to read and manipulate
- A script should be easy to parallelize on a cluster
Here are the key pieces of the workflow, each described below:
- Generalize your code using argparse
- Create templates for running experiments, plotting, parameter searching, etc.
- Use virtual environments
- Document code well using an official standard
- Use IPython for exploratory work
- Learn tools for reliable debugging
- Use matplotlib style sheets
- Test robustly using Hypothesis
- Parallelize code
Argparse
Argparse is a Python library that lets you pass parameters to your script at the command line. It's pretty easy to use, and well worth the initial time investment to learn what you can do with it. Every experimental parameter should have a default value so that you can run the code without specifying anything. Consistently using argparse has been the biggest contributor to my increased workflow efficiency.
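As a rough illustration of the "every parameter has a default" idea, here is a minimal sketch. The parameter names (`--learning-rate`, `--trials`, `--seed`, `--output`) and the `build_parser`/`main` helpers are made up for this example and are not taken from the templates mentioned in this post.

```python
import argparse


def build_parser():
    """Build a parser in which every experimental parameter has a default."""
    parser = argparse.ArgumentParser(
        description="Run one experiment (illustrative parameters only)."
    )
    # Hypothetical parameters; replace them with whatever your experiment needs.
    parser.add_argument("--learning-rate", type=float, default=0.1,
                        help="step size used by the experiment")
    parser.add_argument("--trials", type=int, default=10,
                        help="number of repetitions")
    parser.add_argument("--seed", type=int, default=0,
                        help="random seed, for reproducibility")
    parser.add_argument("--output", default="results.csv",
                        help="where to write the results")
    return parser


def main(argv=None):
    """Parse arguments (sys.argv[1:] when argv is None) and return them as a dict."""
    args = build_parser().parse_args(argv)
    # A real script would run the experiment here; this sketch only returns
    # the settings so the parsing behaviour is easy to inspect.
    return vars(args)
```

Calling `main([])` uses the defaults, while `main(["--trials", "3"])` overrides a single setting and leaves the rest alone. In a script, you would typically call `main()` under the usual `if __name__ == "__main__":` guard.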
Templates
Every time I create a new script, I'm tempted to skip the initial 10-15 minutes it takes to write the argparse and other boilerplate code. So I created a handful of templates that I use as a starting point instead of a blank Python file. These templates contain the boilerplate code that doesn't change much between scripts. They also contain all the imports that I regularly use, such as numpy and pandas, and one properly documented dummy function, which encourages me to document my code well without having to look up the correct comment format. I have multiple templates: a simple one that contains the things just described, one specifically for plotting, and one for running experiments. Every time I create a new script, I copy one of these templates. I made these templates available on GitHub.
Virtual Environments
Some of my projects use Python 3 while others use Python 2, and sometimes it's hard to keep track of which is which. One project might also have different or conflicting dependencies from another. This is a pain to deal with, but virtualenv provides a very easy solution. It lets you maintain multiple Python environments, so every project can have its own, and it's very easy to use.
Documentation
One excuse I used to give for not properly documenting my code was that I didn't know the correct standard format for a comment. While this is a pretty lame excuse, I decided to invest an hour in learning how to properly document code in Python, and I incorporated that knowledge into my template scripts so I wouldn't have to re-learn it the next time I wanted to document something correctly.
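For a sense of what a documented dummy function can look like, here is a small sketch. This post doesn't say which standard is meant, so the NumPy docstring convention used below is an assumption, and `moving_average` is a made-up example function.

```python
def moving_average(values, window):
    """Compute the simple moving average of a sequence.

    Parameters
    ----------
    values : sequence of float
        Input data points, in order.
    window : int
        Number of consecutive points averaged together. Must satisfy
        ``1 <= window <= len(values)``.

    Returns
    -------
    list of float
        The ``len(values) - window + 1`` window means.

    Raises
    ------
    ValueError
        If ``window`` is outside the valid range.
    """
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    # Slide a fixed-size window over the data and average each slice.
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Keeping one function like this in a template means the section names and layout are always one copy-paste away.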
IPython
IPython and IPython notebooks are great for doing exploratory things in Python. When I'm developing a Python script, I like to keep one tab in my terminal dedicated to IPython, so that I can quickly test a small piece of code. Since Python is a dynamically typed language, a lot of problems can go unnoticed until runtime, and that can slow development down. That's why I like to verify (using IPython) that a specific function call works exactly as I expect. I like to use IPython notebooks for experimental work because I can quickly and easily iterate on an idea within the notebook environment. Once the idea becomes more refined, I usually turn it into a Python script that I can run from the command line.
Debugging
Debugging code is much easier when you know the right tools to use. I used to rely completely on print statements to debug my code. This worked fairly well when I was a Java programmer and the majority of my bugs were caught at compile time, but once I switched my primary language to Python, it became a pretty inefficient process. Now I like to use pdb.set_trace() and IPython.embed() to stop execution of my program at a particular point and understand what's going wrong there. Knowing how to profile your program to identify the weak link(s) in a long-running program is also useful.
Matplotlib
Plotting is a very important component of every scientific workflow. Matplotlib is the go-to library for plotting in Python, and one thing I was unaware of until recently is style sheets, which let you change the look and feel of your plot with a single line of code (or, alternatively, with a value passed in on the command line via argparse!). There is a lot of boilerplate code that goes into making pretty plots, which is why this is one of the templates I frequently use.
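Here is a minimal sketch of a style sheet in use. The `make_plot` helper and the toy data are invented for illustration; `ggplot` is one of the style sheets that ships with matplotlib, and the style name could just as easily come from an argparse option.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt


def make_plot(style="ggplot", out_path=None):
    """Plot a toy curve using the given matplotlib style sheet."""
    # plt.style.use(style) is the one-line, global version; style.context
    # limits the style to the figure created inside this block.
    with plt.style.context(style):
        fig, ax = plt.subplots()
        xs = [0, 1, 2, 3, 4]
        ax.plot(xs, [x ** 2 for x in xs], label="x squared")
        ax.set_xlabel("x")
        ax.set_ylabel("y")
        ax.legend()
        if out_path is not None:
            fig.savefig(out_path)
    return fig
```

The names of the installed style sheets are listed in `plt.style.available`, which is handy as the `choices` of a command-line option.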
Hypothesis
Most of us don't enjoy unit testing, but there is an alternative form of testing known as "property-based testing", which can be much more robust than unit testing and is a lot more fun to code. The idea was popularized within the Haskell community by a library called QuickCheck. In Python, there is a library called Hypothesis for property-based testing. It's well worth a few hours of your time to go through the Hypothesis documentation thoroughly and learn how you can incorporate it into your workflow.
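To give a flavor of what a property-based test looks like, here is a small sketch. The `normalize` function is a hypothetical helper invented for this example. Instead of checking a few hand-picked inputs, the test states properties that should hold for any valid input and lets Hypothesis generate the inputs.

```python
from hypothesis import given, strategies as st


def normalize(values):
    """Scale positive values so that they sum to 1 (hypothetical helper)."""
    total = sum(values)
    return [v / total for v in values]


# Hypothesis generates many lists of positive floats and checks each one.
@given(st.lists(st.floats(min_value=0.1, max_value=1e6), min_size=1, max_size=50))
def test_normalize_sums_to_one(values):
    result = normalize(values)
    # Property 1: the output has the same length as the input.
    assert len(result) == len(values)
    # Property 2: the output sums to 1, up to floating-point error.
    assert abs(sum(result) - 1.0) < 1e-9
    # Property 3: every entry is a valid proportion.
    assert all(0.0 <= r <= 1.0 for r in result)
```

A test runner such as pytest will pick up `test_normalize_sums_to_one` like any other test, and calling the function directly also runs the generated cases.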
Parallelization
While I usually do most of my development on my laptop, sometimes I want to launch a long-running experiment or try a bunch of different parameter settings for a specific script. Running all of these experiments on my laptop would take too long, so I frequently run my scripts on my university cluster. It takes some time to learn how to use the cluster's resource manager (Slurm in my case), but results come back much faster, so in my opinion it's well worth the initial time investment.
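One common way to run a parameter sweep on Slurm is a job array, where each task reads its index from the `SLURM_ARRAY_TASK_ID` environment variable. The sketch below maps that index to one parameter combination. The grid values and helper names are made up for illustration and are not taken from the scripts described in this post.

```python
import itertools
import os

# Hypothetical parameter grid: 3 learning rates x 4 seeds = 12 combinations.
GRID = {"lr": [0.01, 0.1, 1.0], "seed": [0, 1, 2, 3]}


def num_tasks(grid=GRID):
    """Return the number of parameter combinations in the grid."""
    count = 1
    for choices in grid.values():
        count *= len(choices)
    return count


def config_for_task(task_id, grid=GRID):
    """Map an array index to one parameter combination, in a fixed order."""
    keys = sorted(grid)  # sort so that the mapping is deterministic
    combos = list(itertools.product(*(grid[k] for k in keys)))
    return dict(zip(keys, combos[task_id]))


def current_task_id(default=0):
    """Read the task index set by Slurm, falling back to a default locally."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))
```

With 12 combinations, the job would be submitted with something like `sbatch --array=0-11` on a batch script that runs the Python file, and each task would call `config_for_task(current_task_id())` to pick up its own settings.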
All of the things I described have an upfront cost to learn, but they will pay off in the long run. Hopefully some of the things I talked about in this post will help you improve your workflow. If you are interested in using my templates and modifying them for your own workflow, I am making them available on my GitHub page. Finally, if you have any ideas on how to improve my workflow even more, please feel free to share them in the comments below!