SOAPdenovo2 Case Study



Overview

The SOAPdenovo2 case study is a reproducibility study that explores how existing research objects and workflow enactment engines can help assess, record and preserve scientific workflows and their associated findings. It does so by reviewing a comparison of sequence assembly algorithm performance made in light of the development of the SOAPdenovo2 de novo genome assembler. The case study was a joint effort by the GigaScience journal, the Investigation/Study/Assay (ISA) infrastructure, Nanopublication (Nanopub) and Research Object (RO) communities, and the developers of the SOAPdenovo2 de novo genome assembler.

The following presentation, delivered at the Intelligent Systems for Molecular Biology (ISMB) 2014 workshop "What bioinformaticians need to know about digital publishing beyond the PDF2" in Boston, USA, showcases the SOAPdenovo2 study and explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.

ISMB Workshop 2014 from Alejandra Gonzalez-Beltran

The following publication describes the SOAPdenovo2 case study and our recommendations for improving scholarly publishing using research object models.

PLOS One article
DOI: 10.1371/journal.pone.0127612

An earlier version was available as a pre-print.

This work was also presented at the Bioinformatics Open Source Conference (BOSC) 2015. The slides are:

From peer-reviewed to peer-reproduced: a role for research objects in scholarly publishing in the life sciences from Alejandra Gonzalez-Beltran

Galaxy Workflows

The Galaxy workflows and corresponding histories for reproducing part of the SOAPdenovo2 study (Table 2 of the original paper) are available on the GigaGalaxy server. The specific links are listed below by organism and genome assembler:

Organism / Assembler | SOAPdenovo2       | SOAPdenovo1       | AllPathsLG
S. aureus            | Workflow, History | Workflow, History | Workflow, History
R. sphaeroides       | Workflow, History | Workflow, History | Workflow, History

Data Models: ISA, RO and Nanopublication

The ISA model, with its focus on experimental design, insists on the declaration of study plans (e.g., experimental factors considered) and provides cues for reviewers to assess content and suitability of the plans. Furthermore, the underlying model ensures that inputs and outputs of processes, or workflows, are declared and identified, referring to existing database identifiers when relevant. Initially intended to draw the graph of sample processing through to coarse data processing, the ISA grammar is generic enough to cover computational processing while allowing referencing to more granular forms such as Galaxy files.

ISA and RO provide means to track experimental and computational workflows, respectively. The acknowledged overlap between them is handled by deferring to domain-specific resources, with the Research Object project recommending ISA for the biological domain. Finally, since describing how data are acquired, generated and analyzed is only part of the story, the description of the findings also requires attention. The Nanopublication model tackles what used to be the blind spot of data reporting: capturing experimental conclusions.

Investigation-Study-Assay

Scope:
Experimental Design, Variable, Material Processing, Data Processing workflows.
Outcomes:
  • a tab-delimited archive presenting an overview of the SOAPdenovo2 experiment following the ISA-TAB specification: it includes a description of the experimental design (e.g. independent and response variables), the genomes and data used in SOAPdenovo2 together with stable identifiers, and a description of the experimental and computational workflows used to evaluate SOAPdenovo2 against SOAPdenovo1 and ALLPATHSLG, including their inputs and outputs, provenance and attribution information
  • an explicit OWL/RDF semantic representation generated using the ISA2OWL software component and relying on mappings between the ISA syntax and ontological resources such as the Ontology for Biomedical Investigations (OBI) and the Provenance ontology (PROV-O).
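Because ISA-Tab files are plain tab-delimited tables, they can be inspected with standard tooling. The following is a minimal sketch, assuming a study table whose column headers follow the usual ISA-Tab naming pattern. The rows are illustrative placeholders, not the contents of the SOAPdenovo2 archive, and the helper names `read_isatab_table` and `column_values` are hypothetical.

```python
import csv
import io

# Illustrative fragment in the style of an ISA-Tab study table.
# Rows are placeholders, NOT the contents of the SOAPdenovo2 archive.
SAMPLE_STUDY_TABLE = (
    "Source Name\tCharacteristics[organism]\tSample Name\n"
    "source1\tStaphylococcus aureus\tsample1\n"
    "source2\tRhodobacter sphaeroides\tsample2\n"
)


def read_isatab_table(text):
    """Parse a tab-delimited ISA-Tab table into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def column_values(rows, header):
    """Return the distinct values of one column, in order of first appearance."""
    seen = []
    for row in rows:
        value = row[header]
        if value not in seen:
            seen.append(value)
    return seen


rows = read_isatab_table(SAMPLE_STUDY_TABLE)
organisms = column_values(rows, "Characteristics[organism]")
print(organisms)
```

To read a real file from the archive, the same function can be given the text of that file instead of the sample string.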

Nanopublication

Scope:
Key findings, supporting evidence
Outcome:

Research Object

Scope:
Scientific workflow artifacts
Outcome:

Queries

We provide a set of queries demonstrating how the data models can be used to inspect the information about the SOAPdenovo2 study and its results. The following table summarises the queries and the model(s) used to answer them. The queries themselves and links to execute them can be found through the table and below.

Query Case | RO | ISA | Nanopub
Who was involved in the study? | - | See and execute query | -
What are the inputs and outputs for all the data transformations in the study? | (for inputs) See query | See and execute query | -
What are the Galaxy workflows related to the SOAPdenovo2 case study? | - | See and execute query | -
What was the study design? | - | See and execute query | -
What are the study factors (or independent variables) and their levels (or values they assumed)? | - | See and execute query | -
How many study groups are there? | - | See and execute query | -
Which are the study groups? | - | See and execute query | -
Which are the members of the study groups? | - | See and execute query | -
What are the sizes of the study groups? | - | See and execute query | -
What was the funding agency of the study? | - | See and execute query | -
What is the licence for the metadata? | - | See and execute query | -
What is the PubMed identifier for the associated publication(s) for the study? | - | See and execute query | -
Find all the nanopublications related to the study | See query | See and execute query | See and execute query
Find the authors for each assertion in the nanopublications | - | - | See and execute query
Find the nanopublications and their associated authors | - | - | See and execute query

Next, we list the individual queries and provide links to the results of executing them against a SPARQL endpoint.
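For readers who prefer to run queries programmatically, the sketch below builds a SPARQL 1.1 Protocol GET request scoped to the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2). The endpoint URL is a hypothetical placeholder and the query is a generic triple listing, not one of the .sparql files from this study.

```python
import urllib.parse
import urllib.request

# Named graph cited for the queries on this page.
SOAPDENOVO2_GRAPH = "http://w3id.org/isa/soapdenovo2"

# Hypothetical endpoint URL, used only for illustration.
EXAMPLE_ENDPOINT = "http://example.org/sparql"

# Generic query listing a few triples; the study's own queries are more specific.
EXAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_request(endpoint, query, default_graph):
    """Build a SPARQL 1.1 Protocol GET request scoped to one default graph."""
    params = urllib.parse.urlencode(
        {"query": query, "default-graph-uri": default_graph}
    )
    return urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )


request = build_sparql_request(EXAMPLE_ENDPOINT, EXAMPLE_QUERY, SOAPDENOVO2_GRAPH)
print(request.full_url)
# Against a real endpoint, the request would be sent with:
# urllib.request.urlopen(request).read()
```

The request is only constructed here, not sent, so the sketch runs without network access. Passing the graph as `default-graph-uri` keeps the query text itself free of graph-specific clauses.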

Queries over the ISA research model

Who were the people involved in creating the ISA-Tab representation and what were their roles?

contacts.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

What are the inputs and outputs for all the data transformations in the study?

data_transf_inputs_outputs.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

What are the Galaxy workflows related to the SOAPdenovo2 case study?

galaxy_workflows.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

What was the study design?

study_design_type.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

What are the study factors (or independent variables) and their levels?

factors_and_levels.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

How many study groups are there?

study_group_count.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

Which are the study groups?

study_groups.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

Which are the members of the study groups?

study_group_members.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

What are the sizes of the study groups?

study_groups_sizes.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
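In a full factorial design, the study groups correspond to the combinations of factor levels. The sketch below illustrates that relationship, assuming the two factors are organism and assembler with the levels shown in the Galaxy workflow table on this page. The authoritative factors, levels and groups are the ones returned by the queries above; `study_groups` is a hypothetical helper.

```python
from itertools import product

# Assumed factors and levels, taken from the Galaxy workflow table on this page.
FACTORS = {
    "organism": ["S. aureus", "R. sphaeroides"],
    "assembler": ["SOAPdenovo2", "SOAPdenovo1", "AllPathsLG"],
}


def study_groups(factors):
    """Enumerate every combination of factor levels (full factorial design)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


groups = study_groups(FACTORS)
print(len(groups))
for group in groups:
    print(group)
```

Under these assumptions, two organisms and three assemblers give six groups, one per cell of the workflow table.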

What was the funding agency of the study?

study_funding_agency.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

What is the licence for the metadata?

study_metadata_licence.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

What is the PubMed identifier for the associated publication(s) for the study?

study_publication_pubmedid.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

What are the nanopublications generated for the study?

isa_nanopubs.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

Queries over the Nanopublication research model

Find all the nanopublications related to the study

all_nanopubs.sparql

Execute this SPARQL query.

Find the authors for each assertion in the nanopublications

assertion_author.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).

Find the nanopublications and their associated authors

nanopub_author.sparql

Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
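SPARQL endpoints commonly return SELECT results in the W3C SPARQL JSON results format. The sketch below flattens such a response into plain rows. The response shown is a made-up placeholder with hypothetical variable names (`nanopub`, `author`), not actual output from the queries above.

```python
import json

# Made-up response in the W3C SPARQL JSON results format; values are placeholders.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["nanopub", "author"]},
    "results": {"bindings": [
        {
            "nanopub": {"type": "uri", "value": "http://example.org/np1"},
            "author": {"type": "literal", "value": "Example Author"},
        },
        {
            "nanopub": {"type": "uri", "value": "http://example.org/np2"},
        },
    ]},
})


def bindings_to_rows(response_text):
    """Flatten SPARQL JSON results into dictionaries of plain values."""
    data = json.loads(response_text)
    variables = data["head"]["vars"]
    rows = []
    for binding in data["results"]["bindings"]:
        # Unbound variables are absent from a binding, so default them to None.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


rows = bindings_to_rows(SAMPLE_RESPONSE)
for row in rows:
    print(row)
```

Defaulting unbound variables to `None` keeps every row the same shape, which is convenient when a query uses OPTIONAL patterns.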

Queries over the Research Object

hybrid_inputs_isa_provenance.sparql

hybrid_ro_nanopub.sparql

hybrid_workflow_isa_study.sparql

inputs_derived_gage_output.sparql

inputs_role.sparql

workflow_generated_gage_results.sparql

workflow_in_the_ro.sparql


Contributors

  • Alejandra Gonzalez-Beltran (@agbeltran), Oxford e-Research Centre, University of Oxford, UK
  • Peter Li (@pli888), GigaScience, BGI HK Research Institute, Hong Kong.
  • Jun Zhao, InfoLab21, Lancaster University
  • Mark Thompson, Department of Human Genetics, Leiden University Medical Center, The Netherlands
  • Maria Susana Avila-Garcia, Nuffield Department of Medicine, Experimental Medicine Division, John Radcliffe Hospital, Oxford, UK
  • Ruibang Luo, HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory & Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong.
  • Tak-Wah Lam, HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory & Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong.
  • Tin-Lap Lee, School of Biomedical Sciences and CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, Hong Kong.
  • Marco Roos, Department of Human Genetics, Leiden University Medical Center, The Netherlands
  • Scott Edmunds, GigaScience, BGI HK Research Institute, Hong Kong.
  • Susanna-Assunta Sansone, Oxford e-Research Centre, University of Oxford
  • Philippe Rocca-Serra, Oxford e-Research Centre, University of Oxford

Support or Contact

For discussions about the SOAPdenovo2 case study, please contact Alejandra, Peter and Philippe.

The issue tracker is available at the SOAPdenovo2 case study GitHub site. Please feel free to report issues or request features.