Open, Reliable and Transparent Data

Iain R. Moodie

Stockholm Mini-symposium

2024-02-28

A brief anecdote

Sexual selection in plants

Pollen tubes interacting with pistil tissues - Jeanne Tonnabel
  • Bateman gradients in angiosperms
    • N = 2 (in 2021)
  • Project goal
    • Conduct a meta-analysis đŸ€”
  • Find datasets that could be re-analysed in this new context
  • Combine into a meta-analysis to test predictions

Sexual selection in plants

  • Initial search
    • N=2167 😊
  • After sorting
    • N=30 đŸ„č
  • After trying to source data
    • N=9 😐

Photo by Annie Spratt

Datasets we couldn’t use

  • Data not archived
    • No way to contact author
    • No response to contact
    • Data had been lost
    • Not willing to share data
  • Data archived
    • Inaccessible
    • Incomplete
    • Incomprehensible

Lost from science

Exxon Valdez oil spill 1989

  • 40.8 million litres of crude oil spilled
  • Settlement funds from Exxon used for research and monitoring the impacts of the spill
  • Between 1989 and 2010, 419 projects were funded
  • In 2012, NCEAS tried to compile all historic datasets
  • 70% were unrecoverable

Lost from science

Transparency in research

Opaque research

  • Publication bias
    • Not all research is published
  • Incomplete or insufficiently detailed methods
  • Selective reporting in results
    • Confirmation bias
    • “HARKing”
    • “P-hacking”
  ‱ Inaccessible underlying data

Photo by Clem Onojeghuo

Opaque research limits science

  • Harder to replicate or re-use methods
  • Harder to build upon to progress the field
  • Harder to interpret results
  • Harder to trust the conclusions

Photo by Karl Hedin

Open, Reliable and Transparent Science

Open, Reliable and Transparent Data

And why you should care about it.

Reproducible and reliable results

  • Promotes accountability and trust
  ‱ Mistakes can be corrected
  • Analytical decisions can be justified
  • Scientific misconduct can’t hide

New questions & new methods

Photo by Monika Manenti
  • Built upon more effectively
    • Deeper understanding of data & analysis
    • Used to develop new tools/methods/protocols
    • E.g. Bumpus 1899
  • Viewed in a new light
    • Beyond the original paper
    • Paradigm shifts
  • Analysed using the latest methods
  ‱ Meta-analysis

More accurate meta-analysis

  • Easy extraction of accurate data
    • No need to extract from figure
    • Reduces ambiguity and error
  • Go beyond the results section
    • Helps reduce bias from selective reporting
    • Capture the full picture of the study
  ‱ Extends the life of the dataset
    • Can always be accessed

Learning and teaching

  • Teaching students using example datasets
    • Real biological “quirks”
    • Real scenarios
    ‱ Can teach good practices from the start
  • Learning and understanding new methods
    • Complexity can be broken down
    • Walkthrough when code also available

Benefits for the data archiver

  • Increased exposure, reach, and trustworthiness
  ‱ Citation advantage (+25%)
  • Your own best collaborator
    • Data is clean and ready to use
    • Well annotated
    • Cannot be lost

Photo by Anton

Reducing research loss & waste

  • Removes need for duplicated data collection effort
    ‱ Time/location/event-dependent data
    • Research animal use
  • Reduces cost of research

How are things going?

Transparency and Openness Promotion (TOP) guidelines

  • “A set of standards applied to journals to measure their alignment with open scientific principles”
    • Specific guidance on data transparency:
      • Level 3: open data + peer review of dataset and analysis
      • Level 2: open data in trusted repository
      • Level 1: mandatory data statement
  • >5000 journals are signatories
  • Field specific advice for ecology and evolution

Top down pressure

  • Journals
    • Mandated archiving has become “the norm”
  • Funding sources
    • Open access requirements extending to datasets
  • Institutions
    • To help staff meet requirements of the above

Community driven approaches

  • Positive attitudes towards data transparency are common
  • Lack of data transparency is seen as a problem
    ‱ 67% of scientists think that lack of access to data is a major impediment to progress in science (Tenopir et al. 2011)

How well are we doing?

Tenopir et al. 2011

How well are we doing?

Published without sufficient data to replicate:

Photo by Steven Wright

How do we improve things?

  ‱ Why don’t we share data?
    • Knowledge barriers
    • Re-use concerns
    • Disincentives
  • How to work towards data transparency

Knowledge barriers

What’s the process?

  • Do not know how to share data effectively
    • Which online data repository to use?
    • What format to share data in?

What’s the process?

  • Online guides and primers
    • British Ecological Society “Guides to Better Science”
    • UKRN Primers
    • SORTEE (coming soon)

British Ecological Society Primer Series

What’s the process?

  • Institutional libraries
    • Often under-utilised advice and guidance
  • FAIR templates and guides
  • Any data is better than no data!
    • Learn by doing

The FAIR Principles

Insecurities

  • Early career researchers can feel especially vulnerable
  ‱ Fear, insecurity and embarrassment are powerful emotions

Insecurities

  • Share before publication
    • Lab meetings or data review sessions
    • Pre-print (private or open)
  ‱ Data being hard to understand is a bigger issue
  ‱ A culture that prioritises learning over criticism

Don’t see value in their data

  • Too niche
  • Too small
  • Why would someone be interested?

Photo by Diego PH

Don’t see value in their data

  • Highly subjective
  • Hard to predict future use
  • + all other benefits

Photo by Diego PH

Re-use concerns

Misinterpretation

  • Fear of inappropriate use
    • Lack of familiarity with particular dataset
    • Miss crucial details and draw misleading conclusions

Misinterpretation

  • High quality metadata
    • Peer review
  • Contactable
  ‱ Not a problem unique to data

Sensitive information

  • Dual use problem
  • Weigh up benefits and costs
  • Ethical (and legal) implications
  • Sharing limited subset

Disincentives

Scooping

Fear of:

  ‱ A researcher performing an analysis on publicly shared data that the original collectors have not yet done
    • Being “scooped”
  • Reduced collaborations
  • Loss of future publications
    • Metric used to assess performance

Photo by Saher Suthriwala

Scooping

Less likely than you would imagine:

  • Ideas are plentiful
  • Original collectors in best position to act
  ‱ Most analyses by the original authors on published data happen within 2 years
  ‱ Analyses by other researchers peak around 5 years

Photo by Saher Suthriwala

Scooping

  • Pre-print to “claim”
  • If major concern:
    ‱ Restrictions can be placed on data use
    ‱ Embargo periods can be set
  • Change in mindset to see data as a valuable contribution

Photo by Saher Suthriwala

How to work towards data transparency

1. Plan to publish your data!

  • What data needs to be recorded?
  • What metadata might be needed?
  • How raw/cleaned should my data be?
  • Talk with collaborators early about plans

2. Identify an appropriate repository

  • Field specific
  • Data type specific
  • Journal preferences
  • Good starting place: re3data.org

Subjects covered by re3data.org

3. Make a nice README file

  • One or more plain text files that describe the data in detail
  • Write early!
  • Check repository guidelines
  • Document your data
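As a sketch, a minimal plain-text README for a small tidy dataset might look like the following. Every file name, column, and value here is a hypothetical example, not a template mandated by any repository; check your chosen repository’s own guidelines for required fields.

```
README — accompanies a hypothetical pollen competition dataset

DESCRIPTION
  Pollen tube growth measurements from a greenhouse experiment.
  Collected 2023; contact: [author email].

FILES
  data/pollen_raw.csv     Raw measurements, one row per flower
  data/pollen_clean.csv   Cleaned data used in the analysis
  code/analysis.R         Script reproducing all figures and tables

COLUMNS (pollen_clean.csv)
  plant_id      Unique plant identifier (factor)
  treatment     Pollination treatment: "self" or "outcross"
  tube_length   Pollen tube length, in mm (numeric)
  date          Measurement date, ISO 8601 (YYYY-MM-DD)

NOTES
  Missing values are coded as NA. Units are metric throughout.
```

Writing this file early, while collecting the data, is far easier than reconstructing the details at publication time.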

4. Pre-peer-review peer-review

  • Ask a colleague to look through your README and dataset
    • Data/code review sessions
    • Can they make sense of it?

Photo by Jason Goodman

5. Publish your data

  • Make sure it has a citable DOI
  • Cite your data in your publication!
  • Talk about it with your colleagues

Thank you for listening