Keeping a paper trail: data management skills for reproducible science

Some of my recent experiences have really hammered home the importance of performing science in as open and reproducible a manner as possible. The good news is that I actually think my experiences have provided a bit of a watershed moment for the field of animal behavior and behavioral ecology in general. It’s just become so apparent that employing open and reproducible practices is critical for maintaining our own, our colleagues’, and the general public’s trust in the data we collect.

The Animal Behavior Society recently held their annual meeting, virtually. Part of this meeting included a symposium (co-organized by Esteban Fernández-Juricic and Ambika Kamath) on “Research Reproducibility”, in which I was invited to present a talk on a topic of my choice. I decided to focus on data management practices, as our data are really the lifeblood of research – trusting them forms the very foundation of the knowledge we produce.

The goal of such practices is to ensure that you, or someone else, could trace every step of your data’s path from the day you collected it to the day you publish it. I always thought that I was pretty good about keeping a paper trail for my data, but wow, it’s so clear now that I still have so much room for improvement! Here, in my talk, I outline the (very) basics of best data management practices.

I’m elaborating on some of my points below because, well, a 6-minute talk is not nearly long enough!

1 – Maintain hard copies of your data collection. These will often be videos, photos, or paper data sheets. At the very least, also keep a hard-copy lab notebook where you list when, where, and how you collected the data. Make sure to include any changes you make to protocols and why. Describe anything unexpected that happens or problems that arise.

2 – Data entry. If moving from hard-copy data sheets to a digital format, then data entry is when lots of transcription errors will almost assuredly occur. YOU NEED TO DOUBLE CHECK EVERYTHING. Enter the data as carefully as you can, and then a day later go back through IT ALL. But go through in a different way than you entered it – start at the end of the data, or check columns on the far right first. It’s too easy to get your eyes into a groove where they quickly slide over the numbers without really seeing them.
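If you do that second pass with R handy, a few quick summaries can also catch obvious transcription errors (impossible values, duplicated IDs, stray category codes) before you start the line-by-line check. Here is a minimal sketch of what I mean; the file name and the columns (id, sex, mass_g) are just hypothetical stand-ins for your own data.

```r
# A minimal sketch of post-entry sanity checks (hypothetical file & column names)
library(readr)

entered <- read_csv("ProjectX_data_entry.csv")

str(entered)                          # are the column types what you expect?
summary(entered)                      # any impossible values (e.g., negative masses)?
sum(duplicated(entered$id))           # was any individual entered twice?
table(entered$sex, useNA = "ifany")   # unexpected codes or stray NAs in a category?
range(entered$mass_g, na.rm = TRUE)   # values outside a plausible range?
```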

3 – Once you think the data are “clean”, lock them in place. For me, this means I will create an .xlsx file and label it something like “Project X_data_raw_DO NOT ALTER.xlsx” – as the name says, I basically should never touch this file ever again. Importantly, make sure that this file also includes all the meta-data. I usually just save this on a second sheet in the excel file. The meta-data should give a brief synopsis of what this data set is, how it was collected and by whom, who entered it, and then list every single variable and define what it is. Years (or even months) later it may not be super clear what “propMoving50” is supposed to be a measure of, or whether the 0’s in your “sex” category are females or males.

I then save a second copy called “Project X_data_analysis.csv”, which is the file that I will start with for my analysis (remember, .csv files only save one sheet, so your meta-data will not be included in this file – hence why you need the “DO NOT ALTER” version above). Additionally, I recommend creating a .pdf file of your excel file. It would be such a pain to have to re-enter all the data from this format, I know, but this is the ultimate safeguard to ensure that you always have an incorruptible record of how your data looked when you first entered them. You could even print this out in case some digital apocalypse deletes all your back-ups, which seems unlikely, but it is 2020….
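One extra trick, if you want to make the “DO NOT ALTER” rule harder to break by accident, is to flag those raw files as read-only once they are final. The two lines below are just a sketch of that idea using base R’s file permissions (on Windows you can do the same through the file’s Properties dialog).

```r
# A sketch: mark the raw files as read-only so they can't be overwritten by accident
Sys.chmod("Project X_data_raw_DO NOT ALTER.xlsx", mode = "0444")
Sys.chmod("Project X_data_raw_DO NOT ALTER.pdf",  mode = "0444")
```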

4a – Cleaning & analysis in R. There are lots of different programs to analyze data. For the most open and reproducible methods, you need some record of exactly what you did. Here, using some sort of scripting statistical software will be super helpful. I use R. I also now try to do all my data cleaning/formatting in R, as it’s more reproducible. This way, if you have my “Project X_data_analysis.csv” dataset and my R code, you should be able to load them up and see all the cleaning/formatting I do (I usually use things in the tidyverse for this), and then all the statistical tests I perform. The code itself is obviously critical, but just as important are the annotations you make. Annotate your code! Explain what you are doing, and why, in your code. Trust me when I say that if you need to go back to your code after 6 months, a year, or 6 years, you will very much have forgotten what you were doing. Your annotations are a gift to your future self. Be kind to your future self!
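To make that concrete, here is a small, entirely hypothetical cleaning chunk in the style I mean: it starts from the analysis copy, does the formatting with tidyverse verbs, and the comments record the “why” of each step (the column names and the excluded trial are invented for illustration).

```r
library(tidyverse)

# Always start from the analysis copy, never the raw "DO NOT ALTER" file
behav <- read_csv("Project X_data_analysis.csv")

behav_clean <- behav %>%
  # sex was entered as 0/1; recode to labels so the 0's are never ambiguous
  mutate(sex = factor(sex, levels = c(0, 1), labels = c("female", "male"))) %>%
  # trial 12 was cut short by an equipment failure (see lab notebook), so exclude it
  filter(trial_id != 12) %>%
  # analyses use the proportion of time spent moving, not the raw seconds
  mutate(prop_moving = time_moving_s / trial_duration_s)
```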

You will probably have several analyses you want to do, and you may end up changing your code lots of different times in all different ways. Keeping track of all these changes in your code can be tough. Here GitHub is a true life-saver. GitHub will keep track of all the versions of your code and let you see when any changes were made and by whom (especially important if you’re working within a collaboration). I will be honest and say that I have not yet fully incorporated GitHub into my analysis pipeline, but this is my new year’s resolution. If you want to also learn GitHub, let me know and we can make a pact that by the next ABS meeting in 2021, we’ll both have learned it! There are boatloads of good tutorials online.
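(If you work in RStudio, the usethis package can handle most of the setup from inside R; the two calls below are just a sketch of that route and assume you already have Git installed and your GitHub credentials configured. The tutorials also walk you through the command-line route.)

```r
# A sketch of putting an existing R project under version control
# (assumes Git is installed and GitHub credentials are already set up)
library(usethis)

use_git()     # initialise a local Git repository for the current project
use_github()  # create the matching repository on GitHub and push to it
```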

SHORT ASIDE – there are so many potential pitfalls during analysis that can also lead to problems with reproducibility. Things like p-hacking or HARKing (Hypothesizing After Results are Known) are rife in science, and in our field too. Maybe one day I’ll write a post summarizing my thoughts on this, but for now, all I can say is that in the same way you want to be transparent and reproducible in how you handle your data, you need to be transparent in the analyses you do. Be very honest in your papers about which results you tested for from a priori hypotheses (that you should have written down in your lab notebook before you even started collecting data) and which results you discovered in your data during the analyses (and are exploratory in nature).

4b – Cleaning & analysis NOT in R. If you are not working in R, or some other scripting language that lets you keep a record of all the formatting/cleaning you do, then you are probably working in excel. This can get tricky because of course you will have no permanent record of exactly how you sorted things or moved things around. So I think the best thing you can do is version control. That is, never overwrite your excel files. Always ‘save as’ a new file. So if you worked on your “Project X_data_analysis.csv” today (July 31, 2020) you would now save this file as “Project X_data_analysis_200731.xlsx”. Adding the date in the format YYMMDD to the end of the file name will keep all your versions in order, so even though you may create lots of versions of the file, they should be easy to keep track of. It might be a good idea to then create a new folder, something like “old data”, that you can dump these old versions into, so you have them if you need them but they’re not cluttering things up.
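(Even if the cleaning itself happens in Excel, a one-liner in R, or any language, can build the dated name for you so the YYMMDD format stays consistent. This is purely a convenience, not part of the analysis.)

```r
# Build a YYMMDD-stamped file name for today's saved version
paste0("Project X_data_analysis_", format(Sys.Date(), "%y%m%d"), ".xlsx")
# e.g., "Project X_data_analysis_200731.xlsx" if run on July 31, 2020
```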

5 – Finalizing results & figures. After your analysis is complete, you should finalize your code so that it reproduces exactly the results you will put in your paper. At this point I always produce an R Markdown file. So I clean up my scripts and then re-run everything to make sure that I am reproducing the results I expected. I always find mistakes/errors/inconsistencies at this point. R Markdown is great as it joins together each piece of code with the output it produces, so you can see it all in one place. You could also just use your final R script (which should be easy to find if you use GitHub!), though this would require a reviewer to actually run the code themselves to reproduce your results, which is sometimes a little tedious.
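For anyone who hasn’t tried R Markdown yet, a minimal file is roughly the skeleton below: a short header, then chunks of your final code, so the numbers and figures in the rendered report are guaranteed to match the code that produced them (the title, chunk names, and model here are hypothetical).

````
---
title: "Project X: analyses and figures"
output: html_document
---

```{r setup, message = FALSE}
library(tidyverse)
behav <- read_csv("Project X_data_analysis.csv")
# ...cleaning/formatting steps from the script go here...
```

```{r activity-model}
# A priori test: does activity differ between the sexes?
m1 <- lm(prop_moving ~ sex, data = behav)
summary(m1)
```
````

Knitting this file (e.g., with rmarkdown::render()) produces the .html report that eventually gets deposited.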

Also, when I know that I will eventually be depositing this R Markdown file online, this knowledge just makes me more careful. Knowing that your work will become open for everyone to see is a little (lot) bit scary and I think that’s a good thing. It gives us the extra motivation we need to check and double-check everything we do and to make sure that our choices along the way are well justified.

6 – Deposit everything. I’m sure you’ll have absolutely no trouble getting your paper published, and once you do, deposit 1) your data, 2) your code, and 3) your R Markdown file. This is something that is still relatively new for me (it was my new year’s resolution last year to start depositing all my code), but I really think this is one of the most critical steps. I upload my R Markdown file as a supplementary file to the manuscript and then deposit the code and data in a public repository. Depositing your data & code in a repository like Dryad or Figshare means that other people can check your work, but, likely more important, that they can learn from you! How annoying is it when you read a paper that did *exactly* the analyses you want to do, but they don’t include their code, so you’re stuck sifting through Stats Exchange for weeks trying to fix your syntax problems? Ugh. We don’t all need to reinvent the wheel. Sharing code is good scientifically, as we can learn from each other, but potentially also good personally in that folks can now cite you not just for your results, but also for your code.

If you remember, I started with the “Project X_data_analysis.csv” data file and then did lots of cleaning and formatting in R. So now I will export my final data file, “Project X_data_results.csv”, so that I have this to keep. This is often the data file that I deposit online (I will just need to comment out the parts of the code where I performed the data cleaning/formatting, so that folks can see what I did but don’t have to run it themselves).
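The export itself is just one line at the end of the cleaning section of the script, here assuming the cleaned data frame is the (hypothetical) behav_clean object from the earlier sketch.

```r
# Save the cleaned/formatted data exactly as they enter the analyses;
# this is the version that gets deposited alongside the code
write_csv(behav_clean, "Project X_data_results.csv")
```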

So that’s it! At the end of all this, at a bare minimum, you should have the following files:

“Project X_data_raw_DO NOT ALTER.xlsx”
“Project X_data_raw_DO NOT ALTER.pdf”
“Project X_data_analysis.csv”
“Project X_data_results.csv”
“Project X_code_working.R”
“Project X_deposited results.html” (the rendered output of your R Markdown file)

Open Science Framework. Finally, all of these practices are basically things you do yourself. Another option is to use the Open Science Framework (OSF). This website lets you create an account (for free!) and then basically upload everything for your project there. I haven’t yet adopted this framework for my own work, but I wouldn’t be surprised if I did at some point, especially as I start managing more projects through my students. Something to consider!
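(If you do go this route, the osfr package lets you talk to OSF directly from R. The calls below are a rough sketch rather than a full walkthrough; they assume you have already created an OSF account and a personal access token.)

```r
# A rough sketch of pushing project files to OSF with the osfr package
# (assumes an OSF account and personal access token already exist)
library(osfr)

osf_auth(token = "YOUR-OSF-TOKEN")   # authenticate once per session
project <- osf_create_project(title = "Project X")
osf_upload(project, path = c("Project X_data_results.csv",
                             "Project X_code_working.R",
                             "Project X_deposited results.html"))
```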