The Hidden Work of Science Beyond the Paper

Long ago, during my master’s degree, I found myself getting annoyed with academic literature. Often, the deeper you dig, the more dead ends you run into: cycles of chasing references and trying to find the original source, all in hopes of seeing the exact data or figuring out exactly how something was accomplished. This is much more common in older literature, but it still occurs today. Why? Because it’s a lot of work to meticulously document the original data and every transformation applied to it while keeping things simple enough for someone else to pick up and reproduce or replicate your work. Shortly into my Ph.D., I realized I had naturally begun setting up processes and workflows to make this easier. I strove to use platforms like GitHub and Zenodo while working on good old-fashioned documentation. Later I learned about F.A.I.R. data and began improving the way I prepared a manuscript for publication.

F.A.I.R. Data

To be truly useful, data needs to be findable, accessible, interoperable, and reusable (F.A.I.R.). Findability comes first, because knowing that the data exists and where to find it is critical to data management. After all, if no one can find it, no one will use it! Accessibility comes after the data is found. If the data exists but can’t be accessed easily, it is still unlikely to be used again, which is a shame because the more the data gets used, the more worthwhile the hard effort behind it becomes. Those first two points are mostly about making sure people can find and download the data. The last two are about making sure they can do something with it. Interoperability entails adding metadata and using formats that work in a wide range of settings. The final one, reusability, is entirely about metadata that supports reuse of the data beyond a single use in a single paper. To help illustrate this process, I want to pick apart what I did for my most recent publication, Performance and Sensitivity of the Energy Cascade Models for Lettuce Production in Bioregenerative Life Support Systems.

Findable

To accompany that publication, I have a Zenodo repository of all the raw data and input files, a GitHub repository for all the code used to process and analyze the data, and a supplemental file filled with tables of processed data. Each of these items has its own set of metadata and identifiers pointing to the others, in hopes of making everything findable.

Accessible

Making it all accessible is rather easy: everything is available online through open access. Beyond that, however, there is the matter of ensuring long-term access, so selecting evergreen locations that (should) always be available is important. As a last resort, I keep a few backup copies that I manage personally and could distribute or reupload.

Interoperable

Interoperability is where the bulk of the demands on the scientist’s effort really begins. Unless you’re a superstar scientist who makes sure everything is perfectly labeled throughout the project, it’s likely going to be towards the end that you revisit all the files and add the necessary metadata. I had countless unlabeled columns whose meaning I either knew or could figure out from the code. No one else would bother figuring that out unless they were desperate, so it was up to me to write it all down clearly. For this project, it probably took me two full days of dedicated effort to document the data and code. I’m trying to set things up better in advance to avoid this in the future.
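As a rough illustration of what writing it all down can look like, here is a minimal Python sketch of a data dictionary plus a check for columns that are still undocumented. The column names, units, and descriptions are made up for the example and are not the ones from my actual data files.

```python
import csv
import io

# Hypothetical data dictionary. The column names, units, and descriptions
# below are illustrative assumptions, not the real project columns.
DATA_DICTIONARY = {
    "t_day": {"units": "d", "description": "Time after emergence"},
    "ppfd": {"units": "umol m^-2 s^-1", "description": "Photosynthetic photon flux density"},
    "dry_mass": {"units": "g m^-2", "description": "Crop dry biomass"},
}


def undocumented_columns(header, dictionary):
    """Return the column names in header that have no dictionary entry."""
    return [name for name in header if name not in dictionary]


def dictionary_as_csv(dictionary):
    """Render the data dictionary as CSV text to store next to the data."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["column", "units", "description"])
    for name, info in dictionary.items():
        writer.writerow([name, info["units"], info["description"]])
    return buffer.getvalue()


if __name__ == "__main__":
    # A header with one leftover unlabeled column ("col7", also hypothetical).
    header = ["t_day", "ppfd", "dry_mass", "col7"]
    print("Still undocumented:", undocumented_columns(header, DATA_DICTIONARY))
    print(dictionary_as_csv(DATA_DICTIONARY))
```

Running a check like this while the project is still underway is one way to avoid saving all the labeling for the end.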

Reusable

Lastly, to make the code reusable, I needed to not just label all the data and columns but also provide directions for exactly how I carried out the process laid out in the manuscript. After all, the data can be reused if you know what each column is, but a collection of a couple dozen scripts with no directions would take a long time to figure out. As the only person who knew all the ins and outs, it was up to me to create all the documentation necessary to make the project reusable.
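One simple way to give directions for a pile of scripts is to record the run order in a single place and generate the instructions from it. The Python sketch below assumes a hypothetical three-step pipeline; the script names and descriptions are placeholders, not the files in my repository.

```python
# Hypothetical pipeline steps. The script names and purposes are
# placeholders chosen for illustration only.
PIPELINE = [
    ("01_clean_raw_data.py", "Read the raw files and write cleaned tables"),
    ("02_run_models.py", "Run the models on the cleaned inputs"),
    ("03_make_figures.py", "Produce the figures and summary tables"),
]


def render_run_instructions(steps):
    """Build plain-text, numbered run instructions from (script, purpose) pairs."""
    lines = ["How to reproduce the analysis", ""]
    for number, (script, purpose) in enumerate(steps, start=1):
        lines.append(f"{number}. python {script}  # {purpose}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_run_instructions(PIPELINE))
```

Keeping the order in one list means the directions in the README and the order the scripts are actually run in are less likely to drift apart.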

Conclusion

Overall, the bare minimum expected of me as a scientist is just the publication, and maybe the supplemental file. The days of effort to make the entire project F.A.I.R. are often considered above and beyond. I have no guarantee that anyone will see, use, or even appreciate the extra details. However, to me, that is truly what should accompany each publication. Compared to 50 years ago, it is easier than ever to publish all the data, processes, and methods used to arrive at our conclusions. Doing so reinforces the key tenets of reproducibility and replicability in science, making current science, and whatever builds upon it, better.

Proudly written without large language models.

https://doi.org/10.5281/zenodo.19986058

This work is licensed under CC BY 4.0