Keeping a paper trail: data management skills for reproducible science

Some of my recent experiences have really hammered home the importance of performing science in as open and reproducible a manner as possible. The good news is that I actually think my experiences have provided a bit of a watershed moment for the field of animal behavior and behavioral ecology in general. It’s just become so apparent that employing open and reproducible practices is critical for maintaining our own, our colleagues’, and the general public’s trust in the data we collect.

The Animal Behavior Society recently held their annual meeting, virtually. Part of this meeting included a symposium (co-organized by Esteban Fernandes-Juricic and Ambika Kamath) on “Research Reproducibility”, in which I was invited to present a talk on a topic of my choice. I decided to focus on data management practices as our data are really the life-blood of research – trusting them forms the very foundation for the knowledge we produce.

The goal of such practices is that you, or someone else, could trace every step of your data’s life from the day you collected it until the day you publish it. I always thought that I was pretty good about keeping a paper trail for my data, but wow, it’s so clear now that I still have so much room for improvement! Here, in my talk, I outline the (very) basics of best data management practices.

I’m elaborating on some of my points below because, well, a 6-minute talk is not nearly long enough! These are my current best practices for how I handle my data, but the great news is that new techniques and tools are being developed all the time to help automate, streamline and improve this process so I expect this may change. But generally, I think if you follow these guidelines (and save the minimum files I suggest below) you can consider yourself a good data parent!

1 – Maintain hard copies of data collection.

This will often be videos or photos or paper data sheets. At the very least, also include a hardcopy lab notebook where you list when, where and how you collected the data. Make sure to include any changes you make to protocols and why. Describe anything unexpected that happens or problems that arise. Have a calendar or schedule of when/how data collection happened; trust me, you absolutely will forget these details at some point. Take pictures of your set-ups; these are not only good proof that you did what you said, but also really useful for presentations.

2 – Data entry.

If moving from hard-copy data sheets to digital format, then data entry is when lots of transcription errors will almost assuredly occur. You need to double check everything. Enter the data as carefully as you can, and then a day later go back through IT ALL. But go through in a different way than you entered it – start at the end of the data, or check columns on the far right first. It’s too easy to get your eyes into a groove where they quickly slide over the numbers without really seeing them.

3 – Once you think the data is “clean”, lock it in place.

For me, this means I will create an .xlsx and label it as something like “Project X_data_raw_DO NOT ALTER.xlsx” – as the name says, I basically should never touch this file ever again. Importantly, make sure that this file also includes all the meta-data. I usually just save this on a second sheet in the Excel file. The meta-data should have a brief synopsis of what this data set is, how it was collected and by whom, and who entered it, and should then list every single variable and define what each one is. Years (or even months) later it may not be super clear what “propMoving50” is supposed to be a measure of, or whether the 0’s in your “sex” category are females, or males.
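One way to confirm that the meta-data really does define every variable is to compare the column names in the data against the list of definitions. What follows is only a minimal sketch in Python (the same check could just as easily be written in R). The function name `undocumented_columns`, the example columns and the example definitions, including the 0/1 coding for sex, are all hypothetical and invented for illustration.

```python
import csv
import io


def undocumented_columns(data_csv_text, dictionary):
    """Return the column names that have no (non-empty) definition."""
    reader = csv.reader(io.StringIO(data_csv_text))
    header = next(reader)  # the first row holds the variable names
    return [name for name in header if not dictionary.get(name, "").strip()]


# Hypothetical data and definitions, made up for this example.
example_data = "id,sex,propMoving50\n1,0,0.42\n2,1,0.77\n"
example_dictionary = {
    "id": "Unique identifier for each individual",
    "sex": "0 = female, 1 = male",
    # "propMoving50" has deliberately been left undefined.
}

print(undocumented_columns(example_data, example_dictionary))
# ['propMoving50']
```

Any name that comes back from a check like this is a variable that a future reader (or a future you) would have to guess at.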

I then save a second copy that is called “Project X_data_analysis.csv” which is the file that I will start with for my analysis (remember .csv files also only save 1 sheet so your meta-data will not be included in this file, hence why you need to have the “DO NOT ALTER” version above).

Unfortunately, it is still super easy to accidentally corrupt any sort of editable file (be it xlsx or csv files) through copy-paste errors or dragging columns or whatever. So because I am now super anxious about this, I also save an UN-editable file (e.g. in pdf format). Yes, it would be absolute hell to have to re-create your Excel file from a pdf file but this is insurance for you for any digital doomsday scenarios where things are lost or get corrupted beyond recognition. Even better if you can email it to yourself so you have a time-stamp of when you created the file.
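A complementary safeguard, which is not part of the workflow described above but serves the same purpose, is to record a checksum of the raw file on the day it is locked. If the file is ever altered, even by a single character, the checksum will no longer match. This is a minimal sketch in Python using only the standard library; the function names `file_sha256` and `is_unchanged` are hypothetical, and the same thing can be done with other tools.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read 64 KB at a time so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Return True if the file still matches the digest recorded earlier."""
    return file_sha256(path) == recorded_digest
```

The digest returned by `file_sha256` could be written in the lab notebook or included in the email to yourself; calling `is_unchanged` later with that recorded value shows whether the "DO NOT ALTER" file has in fact stayed unaltered.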

4a – Cleaning & analysis in R.

There are lots of different programs to analyze data. For the most open and reproducible methods, you need some record of exactly what you did. Here using some sort of scripting statistical software will be super helpful. I use R. I also now try to do all my data cleaning/formatting in R as it’s more reproducible. This way if you have my “Project X_data_analysis.csv” dataset and my R code, you should be able to load this up, and then see all the cleaning/formatting I do (I usually use things in the tidyverse to do this), and then all the statistical tests I perform. The code itself is obviously critical, but just as important are the annotations you will make. Annotate your code! Explain what you want to do, and why, in your code. Trust me when I say that if you need to go back to your code after 6 months, a year, or 6 years, you will very much forget what you were doing. Your annotations are a gift to your future self. Be kind to your future self!

You will probably have several analyses you may want to do and you may end up changing your code lots of different times in all different ways. Keeping track of all these changes in your code can be tough. Here GitHub is a true life-saver. GitHub will keep track of all the versions of your code and let you see when any changes were made to your code and by whom (especially important if you’re working within a collaboration). I will be honest and say that I have not yet fully incorporated GitHub into my analysis pipeline, but this is my new year’s resolution. If you want to also learn GitHub, let me know and we can make a pact that by the next ABS meeting in 2021, we’ll both have learned it! There are boatloads of good tutorials online like this one, or this one. (UPDATE May 2021: I have learned GitHub! I am now using GitHub for version control for my R code, which is essentially seamless with the GitHub Desktop app. Highly recommend! I am sure I am not yet fully utilizing all of GitHub’s functionality but am very pleased with how it’s working for me so far.)

SHORT ASIDE there are so many potential pitfalls that can occur during analysis that can also lead to problems with reproducibility. Things like p-hacking or HARKing (Hypothesizing After Results are Known) are rife in science, and in our field too. Maybe one day I’ll write a post summarizing my thoughts on this, but for now, all I can say is that in the same way you want to be transparent and reproducible in how you handle your data, you need to be transparent in the analyses you do. Be very honest in your papers about which results you tested for from a priori hypotheses (that you should have written down in your lab notebook before you even started collecting data) and which results you discovered in your data during the analyses (and are exploratory in nature).

4b – Cleaning & analysis NOT in R.

If you are not working in R, or some other scripting language (note: even if you are using menu-driven SPSS know that there is code that operates behind the scenes that you can pull out and save as a permanent record of your analytical decisions) that lets you keep track of all the formatting/cleaning you do in the program, then things can get tricky as you may not have a permanent record of exactly how you sorted things or moved things around. So I think the best thing you can do is version control. That is, never overwrite your Excel files. Always ‘save as’ a new file. So if you worked on your “Project X_data_analysis.csv” today (July 31, 2020) you would now save this file as “Project X_data_analysis_200731.csv”. Adding on the date in the format YYMMDD to the end of the file name will keep all your versions in order, so even though you may create lots of versions of the file, they should be easy to keep track of. It might be a good idea to then create a new folder called something like “old data” that you can dump these old versions into, so you have them if you need them but they’re not cluttering things up.

If you are not using GitHub for automatic version control, then using some sort of manual version control like this for your R code is probably a good idea. So when you add/remove big things to your R code don’t just ‘save’ the new file but make sure to ‘save as’ with a new date so that way if/when you break your code later, you can go back and find out what you did wrong.
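The date-stamped ‘save as’ habit described above is easy to forget on a busy day, so it can help to let a small script do the renaming. This is a minimal sketch in Python, assuming the YYMMDD suffix goes just before the file extension as in the example file name above; the function names `dated_name` and `save_dated_copy` are hypothetical, and the same idea could be written in R.

```python
import shutil
from datetime import date
from pathlib import Path


def dated_name(filename, on=None):
    """Insert a _YYMMDD suffix before the file extension."""
    on = on or date.today()
    path = Path(filename)
    return f"{path.stem}_{on.strftime('%y%m%d')}{path.suffix}"


def save_dated_copy(source, on=None):
    """Copy a file next to itself under a date-stamped name."""
    source = Path(source)
    target = source.with_name(dated_name(source.name, on))
    if target.exists():
        # Refuse to overwrite: an existing dated version is part of the record.
        raise FileExistsError(f"{target} already exists")
    shutil.copy2(source, target)  # copy2 also preserves the modification time
    return target


print(dated_name("Project X_data_analysis.csv", date(2020, 7, 31)))
# Project X_data_analysis_200731.csv
```

Refusing to overwrite an existing dated copy is a deliberate choice here: the whole point of the scheme is that an old version, once saved, never changes.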

SHORT ASIDE I also find this method of version control (appending the date in _YYMMDD to the file name) to be super useful in managing your manuscript (or really any file) edits. I then only append additional suffixes when I hit milestones like submissions and revisions. So the “manuscript” folder in any of my project folders looks something like:

“Project X_MS_200629.doc”
“Project X_MS_200630.doc”
“Project X_MS_200630_XX.doc” (when I get comments back from co-authors I save them with their initials at the end)
“Project X_MS_200701.doc” (then I update the date when I incorporate their comments)
“Project X_MS_200630_Submitted.doc” (file I submit to journal!)
“Project X_MS_201115_R1.doc” (after I start incorporating revisions)
“Project X_MS_201120_R1.doc”
“Project X_MS_201125_R1_resubmitted.doc” (file that I re-submitted to journal!)

I create subfolders as I go for “Initial submission_journal” then “R1” then “Final submission” and put the appropriate files in each one. I have no idea if this is The Best Way to organize files, but it seems to work well as a version control method for me as I can quickly find whatever file I need later on. I use similar naming practices for other parts of manuscripts like “Project X_Cover Letter_YYMMDD” or “Project X_Supplemental file_YYMMDD”.

5 – Finalizing results & figures.

After your analysis is complete, you should finalize your code so that it reproduces exactly the results you will put in your paper. For me, I now always produce an R Markdown file at this point. So I clean up my scripts, annotate everything really well and then re-run everything to make sure that I am reproducing the results I expected. I always find mistakes/errors/inconsistencies at this point. R Markdown is great as it will join together each piece of code with the output it produces so you can see it all in one place. You can also add in extensive notes in between the code chunks, creating a really nice narrative that is very easy for someone else to read without them having to load up your R script and run your code themselves. You could also just use your final R script (which should be easy to find if you use GitHub!) though this requires a reviewer to actually run the code themselves to reproduce your results, which is tedious.

Also, when I know that I will eventually be depositing this R Markdown file online, this knowledge just makes me more careful. Knowing that your work will become open for everyone to see is a little (lot) bit scary and I think that’s a good thing. It gives us the extra motivation we need to check and double-check everything we do and to make sure that our choices along the way are well justified. I’ve been seeing lots of chatter on Twitter lately with folks saying they actually send their code to a co-author and ask them to check/run it to look for errors; this seems like an excellent idea!

6 – Deposit everything.

I’m sure you will have absolutely no trouble getting your paper published and once you do, deposit 1) your data, 2) your code and 3) your R Markdown file. This is something that is still relatively new for me (it was my 2019 resolution to start depositing all my code) but I really think this is one of the most critical steps. I load up my R Markdown file as a supplementary file to the manuscript (which is much easier for a reader to read) and then deposit the code and data in a public repository. Depositing your data & code in a repository like Dryad or Figshare means that other people can check your work, but just as importantly, that they can learn from you! How annoying is it when you read a paper that did *exactly* the analyses you want to do, but they didn’t include their code so you’re stuck sifting through Stack Exchange for weeks trying to fix your syntax problems. Ugh. We don’t all need to reinvent the wheel. Sharing code is good scientifically as we can learn from each other, but potentially also good personally in that folks can now cite you not just for your results, but also for your code.

If you remember, I started with the “Project X_data_analysis.csv” data file and then did lots of cleaning and formatting in R. So now I will export my final data file “Project X_data_results.csv” so that I have this to keep. This is often the data file that I deposit online (I just need to comment out the parts of the code where I performed data cleaning/formatting so that folks can see what I did but they don’t have to do it themselves).

So that’s it! At the end of all this, at a bare minimum, you should have all those (versions of) files to help you trace the life cycle of your data. This is what it usually looks like for me:

“Project X_data_raw_DO NOT ALTER.xlsx” (this includes meta-data on a separate sheet)
“Project X_data_raw_DO NOT ALTER.pdf” (this is your digital doomsday insurance)
“Project X_data_analysis.csv” (file you load into R to start cleaning/analysis)
“Project X_data_results.csv” (final data file after cleaning)
“Project X_code_working.R” (working code)
“Project X_deposited results.html” (your R Markdown file that shows exactly the code and results that are shown in your paper)

7 – Other resources:

  • Open Science Framework. Finally, all these practices are basically things you do yourself. Another option is to use the Open Science Framework. This website lets you create an account (for free!) and then you can basically load everything for your project there. I haven’t yet adopted this framework for my own work, but I wouldn’t be surprised if I did at some point. Especially when I start managing more projects through my students. Something to consider!
  • SORTEE. Some very cool folks recently started the Society for Open, Reliable and Transparent Ecology and Evolutionary Biology (SORTEE) as a society to advocate for, well, open, reliable and transparent research practices in our fields! Their website has oodles of good resources for you if you want to learn more about current best practices and they are even hosting their first conference in summer 2021 to help share ideas.

Be a good parent to your precious data!

Documenting our data’s journey as carefully as we can from collection to publication is so critical for so many reasons. First, we are scientists. We live and breathe by the data we collect and so taking good care of it and proving that we’ve done so is sort of a no-brainer. It really isn’t something that needs any justification. If we consider ourselves natural skeptics who want to be able to see the evidence for ourselves, then isn’t this just part of our jobs?

Second, it protects us. Being open from the very get-go makes it very difficult for others to insinuate you are doing something inappropriate. This is sort of an unpleasant thing to think about, but as there are now several high-profile examples of allegations of negligence, misconduct or full-on fraud, it’s not a bad idea to be a little bit proactive about protecting yourself. This way any mistakes that do happen or irregularities that may appear in your data will hopefully have an explanation that you can quickly trace. Mistakes *will* happen; we are humans and it is inevitable. But much better for it to be an obviously honest mistake than having any whiff of it being something else…

Third, along these lines, knowing that you are performing your science in a glasshouse will make you a better scientist! Just like how you finally see the typos in your manuscript (or blog post!) right after you click submit (and know someone else is going to see it), I think the same principle holds for all other parts of science, including the collection and the analysis. Thinking about how we justify all the decisions that go along with data collection, cleaning and analysis as we are making them and carefully documenting this will make our work more rigorous and reproducible.