DISCLAIMER: This is a post about statistics

I was really anxious about studying mathematics and statistics in my undergraduate degree. I had heard all sorts of horror stories about the stats unit in particular from other students and started to panic.

In hindsight, I should have ignored them.

I remember emailing both unit coordinators for advice beforehand. I hadn't been a strong mathematics student in high school, but that was down to pure disinterest rather than any lack of ability. I just didn't like it, so I didn't even bother trying. It was pretty immature of me at the time, but I didn't have a crystal ball...

The advice from the statistics coordinator was:

Statistics is not hard but it is unusual. The fail rates fall within the university's guidelines (or better). If a student follows the directions, and keeps up to date with the work, passing or better is guaranteed. I do notice that some students take the subject too casually and skip lectures and assignments. By the time they attempt to regather their efforts, too many new ideas have been introduced and they are in very unfamiliar territory. So don't be spooked by the gossip. Pass rates are high.

Yes, I still have the actual email from him.

So I did what any normal person would do, and I decided to tackle first-year mathematics and statistics concurrently in Semester 1. I was a part-time student working a full-time job, so enrolling in two units was my comfortable limit. And I figured if I failed one or both units, they were on offer again in Semester 2.

I passed both units. Better than passed, actually - I smashed them: 88% for maths and 90% for stats. Statistics still doesn't come "naturally" to me, but with a methodical approach and advice/guidance from helpful academics, my understanding of the subject has only increased and my ability improved. I have had to use my statistical knowledge in every zoology/ecology unit in my undergraduate degree since.

Do not be afraid of statistics, is what I'm saying. Make friends with them early on because, at the end of the day, you will need them if you want to study zoological science and be a zoologist or entomologist, or a scientist of any description. It's true that you will most probably have access to a statistician later in your research careers, but that doesn't mean you don't need to understand statistics at an intermediate level yourself. It is essential. Full stop.

A screenshot of my computer desktop during my undergraduate degree.

Mislan et al. (2015) have just published a paper in Trends in Ecology & Evolution about the importance of making statistical code available when research is released (which I have reproduced below in line with the user licence), and this is what prompted me to write this little post.


Elevating The Status of Code in Ecology

Code is increasingly central to ecological research but often remains unpublished and insufficiently recognised. Making code available allows analyses to be more easily reproduced and can facilitate research by other scientists. We evaluate journal handling of code, discuss barriers to its publication, and suggest approaches for promoting and archiving code.

The Role of Code in Ecological Research

Most ecologists now commonly write code as part of their laboratory, field, or modelling research. The transition to a greater reliance on code has been driven by increases in the quantity and types of data used in ecological studies, alongside improvements in computing power and software [1]. Code is written in programming languages such as R and Python, and is used by ecologists for a wide variety of tasks including manipulating, analyzing, and graphing data. A benefit of this transition to code-based analyses is that code provides a precise record of what has been done, making it easy to reproduce, adapt, and expand existing analyses.
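To make this concrete, here is a minimal sketch (my example, not from the paper) of the kind of analysis script the authors have in mind, written in R: the script itself is a precise record of how the data were manipulated, analysed, and graphed. The file name and column names are hypothetical.

```r
# Hypothetical analysis script: the code documents exactly what was done.
library(ggplot2)

counts <- read.csv("site_counts.csv")            # load raw field data (hypothetical file)
counts <- subset(counts, !is.na(abundance))      # manipulate: drop records with missing counts

fit <- lm(abundance ~ habitat, data = counts)    # analyse: abundance as a function of habitat
summary(fit)

p <- ggplot(counts, aes(x = habitat, y = abundance)) +   # graph: abundance by habitat
  geom_boxplot()
ggsave("abundance_by_habitat.png", plot = p)
```

Anyone re-running this script reproduces the model and figure exactly, which is the reproducibility benefit described above.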

Scientific code can be separated into two general categories – analysis code and scientific software. Analysis code is code that is used to correct errors in data, simulate model results, conduct statistical analyses, and create figures [2]. Release of analysis code is necessary for the results of a study to be reproducible [2]. The majority of code written for ecological studies is analysis code, and making this code available is valuable even if it is rough because it documents precisely what analyses have been conducted [2, 3, 4]. Scientific software is more general and is designed to be used in many different projects (e.g., R and Python packages). The development of ecological software is becoming more common and software is increasingly recognised as a research product [5, 6].
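As a rough illustration of the two categories (again my example, not the authors'), the function below is the sort of general-purpose, reusable code that belongs in scientific software such as an R package, while the final two lines are one-off analysis code applying it to a single, hypothetical sample.

```r
# "Scientific software": a reusable, general-purpose function (package material).
shannon_diversity <- function(counts) {
  p <- counts / sum(counts)   # convert counts to proportions
  p <- p[p > 0]               # drop zero counts to avoid log(0)
  -sum(p * log(p))            # Shannon diversity index (natural log)
}

# "Analysis code": a one-off use of that function for one study's data.
quadrat_counts <- c(speciesA = 12, speciesB = 3, speciesC = 7)
shannon_diversity(quadrat_counts)
```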

Current Standards for Code in Ecology

Journals are the primary method that ecologists use to communicate results of studies. Therefore, the way journals handle code is important for evaluating the current status of code in ecology. To explore the current status of code in ecology journals, we identified journals through a search of the Journal Citation Reports (JCR) using the following search terms: ‘Ecology’ for category, ‘2013’ for year, ‘SCIE’ (Science Citation Index) and ‘SSCI’ (Social Sciences Citation Index) for editions checked, and ‘Web of Science’ for the category schema. We selected the top 100 results for analysis and, after excluding museum bulletins, a book, and a journal with broken website links, evaluated a total of 96 journals. We searched the author guidelines for each journal to determine if there was any mention of code or software in the context of scientific research. We also conducted more specific searches to determine if journals had a section for documentation of scientific software releases, and if journals had a policy requiring the release of code and/or data for article publication. Data release policies provide a useful comparison to code release policies because there have been ongoing efforts to encourage or require the release of data once results are published (e.g., [7]).

As of June 1, 2015, more than 75% of ecology journals do not mention scientific code in the author guidelines (Figure 1). Of the journals that mention scientific code, only 14% require code to be made available. Nearly threefold more journals (38%) require data to be made available. A very small subset of journals (7%) have created a special section for software releases or have added software releases to a list of options for existing methods sections (Figure 1). These findings are similar to a recent analysis of journal code policies in other scientific fields [8].

Figure 1. Current Status of Code in Ecology Journals. Most ecology journals do not have requirements or guidelines (as of June 1, 2015) for making code and data available. Ecology journals listed in the Journal Citation Reports (JCR) in 2013 were evaluated. Data and code available at http://dx.doi.org/10.5281/zenodo.34689.

Barriers to Publishing Code in Ecology

Elevating the status of code in ecology will require changes in attitude and policy by both journals and researchers. Researchers are often concerned about making their code public for a variety of reasons [4, 9]. One of the main concerns is that publishing code takes time and researchers do not receive sufficient credit to justify this effort. This is compounded by concerns that releasing code may increase the risk of being scooped or hinder the researcher's (or their institution's) ability to commercialize the software [9]. In ecology, we believe that the benefits of publishing code outweigh the potential risks. There is little potential for commercialization of ecological analysis code, or even software, and reuse of code by others will raise the impact of the publications by the author of the code. It is also common for scientists to believe that their code is not useful and that the description of what their code does (typically in the methods section of a journal article) is sufficient to allow the analysis to be reproduced. However, computational and statistical methods have become increasingly complicated, and access to the analysis code is now crucial to understanding precisely how analyses were conducted [2, 4, 9]. Even code that is rough and difficult to run on other systems (owing to software dependencies and differences in computing platforms) still provides valuable information as part of detailed documentation of the analyses [2, 4, 9]. Given the relatively low risk and potentially large benefit to science of releasing code, sufficient incentives are needed to motivate scientists to take the time to do so.

Promoting Code in Ecology

Journals can promote the release of code used in ecological studies by increasing the visibility and discoverability of code and software. One way to increase visibility is to indicate code availability in the table of contents of all formats of the journal and provide direct links from the online table of contents to the code (Figure 2A). In the article, links to code prominently displayed on the first page will also increase visibility (Figure 2B). This article format for data has already been adopted by some ecology journals, including The American Naturalist. In addition, journals can require and verify that code is made available at the time an article is submitted for review or is accepted for publication [10]. Requirements by journals for data to be made available have been very successful [3]. Specialized software sections in journals go a step further in promoting highly refined code that can be used broadly for ecological analyses and visualization, and provide an associated publication [11]. Communicating the availability of software in a well-described journal format to the ecology community highlights software as a product of ecological research. Discoverability can be enhanced if searchable databases for articles (e.g., journal archives, Web of Science, and PubMed) include an option for searching for articles with code. This search capability would make it more feasible to find, compare, and adapt code from multiple research articles for a new study. To increase the value of code releases within the existing academic incentive structure, papers and other scientific products that use publicly available code need to cite the code and associated publication (if there is one). Journals should encourage or require the citing of code, and provide instructions and examples for how to do so in the author instructions. Citing code will increase the impact of journal articles that include code, and provide credit to ecologists developing valuable software resources.
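For R users, one easy way to act on this advice is worth noting here: base R can generate a ready-made citation for any installed package, so there is little excuse not to cite the software used in an analysis. The vegan package below is purely an example and is assumed to be installed.

```r
citation("vegan")              # human-readable citation for an installed package
toBibtex(citation("vegan"))    # the same citation as a BibTeX entry
```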

Figure 2. Recommended Journal Page Layouts. (A) For the table of contents of a journal. (B) For the first page of a journal article. The recommendations are for all formats of the journal including HTML, PDF, and print versions. An important feature is that active links can be clicked in electronic versions to directly access the code. The article titles and author names were made up for the examples.

It is also important to consider how best to make ecological code publicly available. Ecologists may not be aware of the steps needed to archive code or the ease of doing so with available resources [3, 12, 13]. Table 1 compares some of the common resources available for archiving code. A license, which states the conditions under which the code can be used, should be included with a submission to an archive. If a submission does not include a license, then no one will be able to use the code. Most of the resources in Table 1 provide a license or license options, making it easy to add a license when code is submitted. Archives need to be long-term, assuring continuous availability ([14], https://caseybergman.wordpress.com/2012/11/08/on-the-preservation-of-published-bioinformatics-code-on-github/). All of the resources in Table 1 store submissions for the long-term except for GitHub and Bitbucket. Some of the archives assign code submissions a digital object identifier (DOI), which makes code straightforward to cite in scientific publications. Other considerations are whether it is possible to search specifically for code within the archive, the process for uploading code, and the cost of archiving code. Most of the archives host code for free if the code is made publicly available. Overall, Zenodo, Figshare, Dryad, and PANGAEA are good options for archiving because they provide licenses, are long-term, and are easily citable (Table 1).
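As a small practical sketch (my addition, not the authors'), the usethis package offers a one-line way to attach an explicit licence to an analysis project before it is archived. This assumes usethis is installed and that the command is run from inside the project directory; the copyright holder is a placeholder.

```r
library(usethis)
use_mit_license("A. Researcher")   # writes LICENSE/LICENSE.md with the MIT terms
```

The project, now carrying an explicit licence, can then be deposited in an archive such as Zenodo or Figshare, which assigns a DOI at submission so the code can be cited directly.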

Journals can have a significant impact on increasing the value of code within the ecology community. We believe that broad adoption of the suggestions to increase visibility and discoverability of code, require archiving of code, and increase citation incentives for doing so, will motivate more authors to release both analysis code and scientific software. By fostering reproducibility and reuse, more available code can improve the quality and accelerate the rate of research in ecology.

Acknowledgments

K.A.S. was supported by the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery and the Moore/Sloan Data Science Environments Project at the University of Washington. This work was supported in part by the Gordon and Betty Moore Foundation Data-Driven Discovery Initiative (grants GBMF4553 to J.M.H. and GBMF4563 to E.P.W.). We thank Carl Boettiger for thoughtful comments that significantly improved the paper.

References

1. Hampton, S.E. et al. (2013) Big data and the future of ecology. Front. Ecol. Environ. 11, 156–162
2. Peng, R.D. (2011) Reproducible research in computational science. Science 334, 1226–1227
3. Hampton, S.E. et al. (2015) The Tao of open science for ecology. Ecosphere 6, 120
4. Barnes, N. (2010) Publish your computer code: it is good enough. Nature 467, 753
5. Rubenstein, M.A. (2012) Dear Colleague Letter – Issuance of a New NSF Proposal & Award Policies and Procedures Guide, National Science Foundation www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp
6. Poisot, T. (2015) Best publishing practices to improve user confidence in scientific software. Ideas Ecol. Evol. 8, 50–54
7. Whitlock, M.C. et al. (2010) Data archiving. Am. Nat. 175, 145–146
8. Stodden, V. et al. (2013) Toward reproducible computational research: an empirical analysis of data and code policy adoption by journals. PLoS ONE 8, e67111
9. Ince, D.C. et al. (2012) The case for open computer programs. Nature 482, 485–488
10. Nosek, B.A. et al. (2015) Promoting an open research culture. Science 348, 1422–1425
11. Pettersson, L.B. and Rahbek, C. (2008) Editorial: launching Software Notes. Ecography 31, 3
12. Stodden, V. and Miguez, S. (2014) Best practices for computational science: software infrastructure and environments for reproducible and extensible research. J. Open Res. Software 2, 1–6
13. Wilson, G. et al. (2014) Best practices for scientific computing. PLoS Biol. 12, e1001745
14. White, E.P. (2015) Some thoughts on best publishing practices for scientific software. Ideas Ecol. Evol. 8, 55–57


Citation: K.A.S. Mislan, Jeffrey M. Heer, Ethan P. White. Elevating The Status of Code in Ecology. Trends in Ecology & Evolution, published online December 15, 2015. DOI: http://dx.doi.org/10.1016/j.tree.2015.11.006