=Paper= {{Paper |id=Vol-2137/paper_18.pdf |storemode=property |title=A Maturity Model for Biomedical Data Curation |pdfUrl=https://ceur-ws.org/Vol-2137/paper_18.pdf |volume=Vol-2137 |authors=Mariam Alqasab,Suzanne M. Embury,Sandra de F. Mendes Sampaio |dblpUrl=https://dblp.org/rec/conf/icbo/AlqasabES17 }} ==A Maturity Model for Biomedical Data Curation== https://ceur-ws.org/Vol-2137/paper_18.pdf
                      A Maturity Model for Biomedical Data Curation
                       Mariam Alqasab∗, Suzanne M. Embury and Sandra Sampaio
            Department of Computer Science, The University of Manchester, Oxford Road, Manchester, UK




ABSTRACT                                                                   of curation that can be done by available curators (Baumgartner, Jr
   Quality is an important aspect that needs to be managed in              et al., 2007). Data curation is a vital task, but one that must be done
databases, as the importance of data is determined by its quality. This    (and done well) with a fraction of the resources needed to complete
leads many database providers to devote attention to curating             the work wholly manually.
their data in order to maintain data quality over time. Also, this leads      This has led curators and researchers to examine and propose
database providers and researchers to investigate the area of data         ways to speed up and improve the curation process. A review of
curation and propose ways to improve it, either through providing          the biomedical literature for the last 5 years (2012-2017) indicates
tools to automate the process or to support human curators in making       a number of publications proposing tools for data curation (such as
changes to the data. However, among all available suggestions to           OntoMate (Liu et al., 2015), PubTator (Wei et al., 2013), MIntAct
improve data curation, to the best of our knowledge, no general            (Orchard et al., 2013) and Data Tamer (Stonebraker et al., 2013)), as
description of the curation process has been given that also provides      well as others describing specific approaches to curation in certain
solutions to improve it, and that can help database providers to           fields (such as using a graph-based approach to improve detecting
assess how mature their approach to data curation is. To fill this gap,    problems in records (Croset et al., 2016), and proposing a middle
this paper proposes a maturity model that describes the maturity           layer to unify curation results (Sernadela et al., 2015)).
levels of biomedical data curation. The proposed Maturity Model aims          These improvement efforts are good news for biomedical science.
to help data providers to identify limitations in their current curation   However, individual communities are at different stages in terms
methods and enhance their curation process.                                of how their curation is performed. Some communities are well
                                                                           established, with documented, agreed-upon processes for data
1   INTRODUCTION                                                           curation and access to a repertoire of curation resources, such as
With the growth of data-driven science, the curation of public and         rich ontologies defining agreed shared vocabularies. Others are just
community data sets has become a necessary task for ensuring               starting out, and are following ad hoc procedures with few quality
the long-term usefulness of scientific data. Scientific data typically     controls. This is often the case, for example, when some new
comes in two forms: experimental results (measurements) and the            experimental technique is developed; it can take a little time before
interpretation of those results in the form of statements about the        community repositories for storing the results can be created, and for
structure, organisation and function of the things being observed.         the needs of the communities using the new data to be understood
There are curation challenges with both types of data, but the most        and supported. During this period, curation of data is less of a focus
substantial difficulties lie in the curation of the interpretive data.     for the burgeoning community than just getting up and running.
This data describes the models and hypotheses about reality that           These communities need a quick and efficient way to introduce
prevail within the community that owns the data. As such, it is            curation regimes to protect and amplify the value of this early data.
often complex in form (requiring several ontologies to describe), it          At present, there is little general advice for curators of bio-
can change rapidly or remain current for many years, it is subject         medical data. An exception is a useful proposal by Hirschman
to disagreement within the community, and can be superseded as             et al. for a general biocuration workflow, but even this proposes
new experimental results come in. Perhaps most significantly, the          a one-size-fits-all solution, which may not be appropriate for
source of this data is not a machine, which spits out experimental         all communities. Instead, we propose the creation of a maturity
results at high volume but in regular and predictable format. This         model for biomedical data curation. A maturity model indicates
interpretive data comes from people, in the form of scientific             the different stages of “maturity” of an organisation or group in
publications. The principal task of a biomedical curator is to ensure      performing some tasks. The stages describe good practice (and even
that the interpretive data in the resource they curate (sometimes          best practice) for aspects of the task under consideration, as well
called metadata or annotations) is kept up-to-date with the prevailing     as commonly occurring forms of poorer practice. The underlying
view of the field as presented in the scientific literature.               assumption behind maturity models is that it is not usually possible
   Thus, the task of biomedical data curation goes beyond fixing           for a group of people to carry out best practice in a new area from
defects in data (although this is part of the curator’s task). Instead,    scratch. The need to understand the particular needs of the task and
curation must be done by human experts in the domain of the data,          the particular abilities of the group means that time and experience are
who are capable of interpreting the scientific literature, resolving       needed to learn the best approaches. The maturity model can tell a
conflicting interpretations, and reflecting the results in the data. The   group where they currently stand in terms of good practice, and can
curation task is time-consuming, and it is not always easy to recruit      indicate plausible steps for gradual improvement over time. Using
curators with the breadth and depth of expertise to be able to do the      the model, newer groups can avoid the mistakes made by other
job well. The speed of arrival of new experimental results, and new        groups, and can improve more quickly. More established groups can
interpretations of these and past results, easily out-paces the amount     identify areas where their (often scarce) resources can be deployed
                                                                           for maximum improvement effect.
∗ Corresponding author: mariam.alqasab@postgrad.manchester.ac.uk







   Such a maturity model will need the support and assistance of          literature to curate, and the extraction of data from the literature.
the biomedical community to refine and test. As a first step, we          Liu et al. (2015) proposed OntoMate, a text annotation tool that
present in this paper our initial version of a maturity model for         tags abstracts of PubMed articles with terms from 16 ontologies
biomedical data curation. The model was created from a survey             using machine learning. The curators can specify query terms
of the research literature on biocuration, and in this first version      and OntoMate returns the abstracts with matching tags. OntoMate
is focussed on literature-based curation. The paper is organised as       also filters and ranks the resulting papers. Wei et al. (2013)
follows. We begin by surveying the literature on biomedical data          implemented PubTator, a web-based tool that also searches for
curation (Section 2) and on maturity models (Section 3). We then          articles in PubMed, retrieves them and adds annotations in order
present our tentative Biomedical Data Curation Maturity Model             to ease the curators’ job. PubTator allows curators to select articles
(Section 4) and illustrate its use with an example (Section 5).           from the list of search results, indicate whether the article is
Finally, we conclude.                                                     curatable or not, and add specifications to data type and relations.
                                                                             One of the most important procedures in data curation is adding
2   STATE OF THE ART: DATA CURATION                                       annotations to the curated data. Verspoor et al. (2013) propose
                                                                          a schema for representing annotations describing human genetic
Scientific data curation1 is the process of associating semantic
                                                                          variants and their relation to disease. The schema was designed for
information with experimental results, to describe their interpretation
                                                                          a specific community but is intended to be more widely used, and
in terms of current scientific thought. The semantic information
                                                                          to save curators the need to redo the design work when creating a
typically takes the form of terms from a controlled vocabulary
                                                                          data format for their annotations. Generally, the schema works as a
or ontology, or of links with other databases. In addition to
                                                                          fundamental stage for those who look for text mining solutions in the
adding new annotations, curators are responsible for the overall
                                                                          human variome. More generally, Goldberg et al. (2015) emphasised
quality of the data, including resolving defects reported in the
                                                                          the importance of providing linked annotations between resources.
experimental data and in the interpretation annotations added
                                                                          Their claim is that such links assist manual curation, since they give
previously, or by other curators. The process is expensive, as we
                                                                          curators ready access to all (or most) available resources that are
have mentioned, because it is important that any such interpretive
                                                                          connected with the artefact under curation.
data is supported by the available scientific evidence. It is also
                                                                             While the literature contains a variety of proposals to improve
important that these annotations are as complete as possible, so
                                                                          the way data curation is carried out, most of the work concerns or
that data-driven science performed on them produces useful results.
                                                                          serves a specific target community or data source. We were not able
Some communities/source owners are able to employ experts to
                                                                          to find much in the way of guidance for setting up a data curation
work as full-time data curators, while others must rely on volunteers
                                                                          activity, nor much work giving general guidance applicable across
from the community giving their time and knowledge. Because of
                                                                          the biomedical field. We propose our maturity model in an attempt
this, some communities have created their own specialist processes
                                                                          to (partially) address this gap.
and tools to try to increase the efficiency and accuracy of data
curation.
   Recent years have seen an increase in the number of publications       3   STATE OF THE ART: MATURITY MODELS
presenting research in the area of biomedical data curation. While
                                                                          Maturity models grew out of work in the 1980s and 90s on business
different aspects of curation are covered by different proposals,
                                                                          process improvement (e.g., Crosby’s Quality Management Maturity
all share the same goal of improving the outcome of the curation
                                                                          Grid (Crosby, 1979)) and, especially, software engineering (e.g.,
process, while using the same or fewer resources.
                                                                          the Capability Maturity Model (Paulk et al., 1993)). Since then, a
   Some authors and communities have attempted to make
                                                                          variety of different maturity models, covering a variety of different
collaboration between curators easier, to avoid overlapping curation
                                                                          fields and process types, have been defined.
work and to make better use of the curation effort available. Orchard
                                                                             Briefly, a maturity model is a sequence of levels or stages that
et al. initiated a project called MIntAct (Orchard et al., 2013), which
                                                                          show the progress needed to reach a mature level of practice in some
merged the IntAct Molecular Interaction database2 with the MINT
                                                                          tasks or areas (Paulk et al., 1993). Each level has specific criteria
database of verified protein-protein interactions3 . MINT is manually
                                                                          that need to be fulfilled in order to move from one level to the next.
curated by experts from the scientific literature. The MIntAct project
                                                                          For example, one of the most well-known and well-used maturity
focussed on sharing the curation efforts from 11 different databases,
                                                                          models, the Capability Maturity Model for Software, aims to model
to gain the maximum value from the curation work performed at
                                                                          maturity of software development processes (Paulk et al., 1993). It
each individual source. Thinking along similar lines, Ravagli et
                                                                          consists of five levels:
al. (2016) created OntoBrowser, an on-line collaboration tool for
                                                                           1.Initial The software process used by teams at this level is
curators, that allows them to work on a single shared working copy,
                                                                             characterised as ad hoc, and occasionally even chaotic. Few
to avoid redundant curation work. Campos et al. (2014) also created
                                                                             processes are defined, and success depends on individual effort.
a curation tool, called Egas, that allows for real-time collaborative
                                                                           2.Repeatable Basic project management processes are established
curation from the scientific literature.
                                                                             to track cost, schedule, and functionality. The necessary process
   Others have implemented tools to speed up data curation by
                                                                             discipline is in place to repeat earlier successes on projects with
automating aspects of the search for relevant papers in the scientific
                                                                             similar applications.
                                                                           3.Defined The software processes for both management and
1 www.dcc.ac.uk/resources/curation-lifecycle-model                           engineering activities are documented, standardised, and
2 www.ebi.ac.uk/intact                                                       integrated into a standard software process for the organisation.
3 mint.bio.uniroma2.it                                                       All projects use an approved, tailored version of the organisation’s





   standard software process for developing and maintaining                Other researchers have studied the whole concept of maturity
   software.                                                            models, and have proposed ways in which new maturity models
 4.Managed Detailed measures of the software process and product        can be created and for making better use of existing models. For
   quality are collected. Both the software process and products are    example, the Institute of Internal Auditors, in the Netherlands,
   quantitatively understood and controlled.                            offers a guide for selecting maturity models for use on business
 5.Optimizing Continuous process improvement is enabled by              process improvement consulting projects (The Institute of Internal
   quantitative feedback from the process and from piloting             Auditors, 2013). The guide contains a description of a maturity
   innovative ideas and technologies.                                   model, and illustrates how to design a maturity model. Pöppelbuß et
Maturity models have several uses (Pöppelbuß and Röglinger,           al. (2011b) focused on investigating the literature of maturity
2011a). They can be used to assess the current level of a group, in     models in business process management. From this investigation,
order simply to understand how the group is performing relative to      they derived some general design principles that can help in
the norms in the field. They can be used to compare the performance     designing a maturity model.
of two different groups (for example, to look for opportunities for
partners for fruitful interactions and discussions — a group may        4     BIOMEDICAL DATA CURATION MATURITY
find it more useful to work with a partner one level higher than              MODEL (BIOC-MM)
it in the maturity model than with one at the other extreme of
                                                                        In order to create a maturity model for biocuration, it was necessary
the model). Principally, however, they are a tool for long-term,
                                                                        to gain a picture of the breadth of activity being undertaken (to
sustained improvement. By assessing a group’s current standing
                                                                        identify the dimensions for our model), and to gather examples of
against the model, and comparing this with the group’s desired level,
                                                                        best practice across different biomedical domains. In order to do
a sequence of manageable improvement actions can be planned.
                                                                        this, we reviewed the literature on curation activities in five different
With the model’s help, the group can target its efforts on areas of
                                                                        biomedical databases, covering a spread of topics across the field:
its performance where there is most scope for useful improvement.
And by looking at the criteria for performing at the level just             •UniProt4
above its current performance, achievable improvement steps can             •BioGRID5
be identified, that can be implemented with the resources available.        •FlyBase6
   A large number of maturity models have been proposed, since              •Saccharomyces Genome Database (SGD)7
their inception in the 1980s. A full survey is beyond the scope of          •Rat Genome Database (RGD)8
this paper, but we mention some representative examples of work in
this area, to give a flavour of what is being done.                     We also aimed to examine sources from both long-established
   New maturity models have been proposed in areas that go well         and more newly established communities, on the grounds that the
beyond the original business process/software process focus of the      longer established communities would (typically) have more mature
earliest models. Ofner et al. (2015), for example, built a maturity     processes in place than those just getting started. (Unfortunately,
model for data quality management at an enterprise level. Another       very new communities are not usually in a position to publish details
proposed maturity model is called the Student Engagement Success        of their curation processes, and are less likely to have the time or
and Retention Maturity Model (SESR-MM) (Clarke et al., 2013). It        confidence to do so.)
focuses on helping higher education institutions (HEIs) to provide         According to our observations of practices in use with these data
a good environment for their students. The model covered different      sources, we found that the curation process mainly takes two forms:
aspects that can raise the level of student engagement to improve       data-oriented curation and literature-oriented curation. The data-
academic success rates and retention. Yet another model is aimed        oriented curation means that the focus of the curation process is
at innovation capabilities within organisations, and the kinds of       to look for defects in the data, whereas literature-oriented curation
support and facility needed to enhance it (Essmann and Du Preez,        means curating data when a new related publication appears in
2009).                                                                  the area, by extracting relevant information from the paper and
   In addition to these business focussed models, a handful of          associating it with the data. The literature-oriented curation has
maturity models in the area of scientific data and data management      three main tasks: searching for new publications, extracting data
have been proposed. For example, Bates and Privette (2012)              from the abstract and extracting data from the full-text.
proposed a maturity matrix for the quality assurance processes used        These observations led us to divide our Maturity Model (1) into
in managing climate data records. Specifically, the model looks         five components as follows:
at whether best practice is employed in the task of converting the      1.Adding and editing repository data.
raw experimental data into a high-quality product. Crowston and         2.Searching for and selecting from new literature.
Qin (2011) proposed a model based on the CMM for Software but           3.Reading and extracting data from the abstract.
adapted for the management of scientific data. They describe key        4.Reading and extracting data from the full paper.
processes and practices that should be in place for effective data      5.Documenting curation results.
management. A further example is provided by a team at Sandia
National Labs in the US, where Oberkampf et al. have constructed a
maturity model for computer modelling and simulation (Oberkampf         4 uniprot.org
                                                                        5 thebiogrid.org
et al., 2007). The model includes a check on the tools and techniques
                                                                        6 flybase.org
used to verify the geometric and physical fidelity of any model
                                                                        7 yeastgenome.org
created.
                                                                        8 rgd.mcw.edu
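
The planning use of a maturity model described in Section 3 (assess a group's current standing, compare it with the desired level, and plan manageable steps) can be illustrated with a minimal sketch. It is not part of the proposed model: it assumes a five-level scale like that of the Capability Maturity Model, uses the five components listed above as dimensions, and the identifiers (`DIMENSIONS`, `Step`, `validate`, `next_steps`) are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# The five components listed above; the short keys are illustrative labels.
DIMENSIONS = (
    "adding_and_editing_repository_data",
    "searching_and_selecting_literature",
    "extracting_from_abstract",
    "extracting_from_full_paper",
    "documenting_curation_results",
)
MIN_LEVEL, MAX_LEVEL = 1, 5  # assumption: five-level scale, as in the CMM


@dataclass(frozen=True)
class Step:
    """One single-level improvement in one dimension."""
    dimension: str
    from_level: int
    to_level: int


def validate(profile):
    """Check that a profile gives an in-range level for every dimension."""
    for dim in DIMENSIONS:
        level = profile.get(dim)
        if not isinstance(level, int) or not MIN_LEVEL <= level <= MAX_LEVEL:
            raise ValueError(f"{dim}: level must be an integer from 1 to 5")


def next_steps(current, desired):
    """Return one step (a single level up) per dimension below its target."""
    validate(current)
    validate(desired)
    return [
        Step(dim, current[dim], current[dim] + 1)
        for dim in DIMENSIONS
        if current[dim] < desired[dim]
    ]


if __name__ == "__main__":
    # A hypothetical group: manual everywhere except literature search.
    current = dict.fromkeys(DIMENSIONS, 1)
    current["searching_and_selecting_literature"] = 3
    desired = dict.fromkeys(DIMENSIONS, 3)
    for step in next_steps(current, desired):
        print(step)
```

Moving one level at a time per dimension reflects the idea that achievable steps are found by looking at the level just above current performance; a real assessment would rest on the criteria of each level rather than on self-reported integers.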







   We identified 5 broad levels for the maturity model from the              of significance or urgency for curation, and will include some
literature. At level 1, all curation is performed manually (as might         notion of paper quality and readiness for curation (e.g. using tools
be the case, for example, in a community that is new to curation).           such as the MiniRECH reporting quality checklist9 ). At level
Then, the process gradually changes to adopt semi-automated ways             4, the tools would include some element of learning, based on
to curate data. The final level is not full automation, which is not         curators' previous decisions about what to curate, that removes
likely to be possible (or desirable) in the foreseeable future due to        some of the search labour for curators. Searches would be run
the need for expert interpretation and decision making, but instead          automatically, rather than being triggered by the curators, and
aims for an optimal distribution of work between the human experts           work is scheduled across available curators, who are notified of
(curators) and the supporting software tools.                                the arrival of papers relevant to them that could be curated. At
   The provisional model is presented in Table 1. We now describe            level 5, text analysis of the paper is used to make good quality
each dimension (column) of the model in turn.                                decisions as to which papers to curate, leaving curators only the
                                                                             task of choosing from amongst a very small number of papers.
Adding and editing repository data: This dimension models levels
                                                                         Reading and extracting data from the abstract: This dimension
 of practice in finding defects in the repository data and correcting
                                                                           relates to the second step of literature-oriented curation, in which
 them. At the initial level, the curators perform their job by
                                                                           annotations that are supported by the abstract of the paper under
 manually searching for defects in data and fixing them. End
                                                                           curation are decided. At level 1, curators read and extract data
 users may also report data errors. At this level, we do not
                                                                           from the abstract entirely manually. At level 2, the curation
 pay attention to the format of data, as manual curation can deal
                                                                           process continues to be manual, but authors of the paper can
 flexibly with a range of formats, to identify how to access and
                                                                           participate in the process. In other words, authors are given
 retrieve the data.
                                                                           the chance to fill in a form with some information about their
 Level 2 focuses on making the curation process more organised
                                                                           publications. At level 3, a semi-automatic tool can be used to
 and repeatable compared with the initial level, as in this level
                                                                           highlight and extract data from the abstract. However, at this
 a number of guidelines to define the process of curation and
                                                                           point, only limited formats of abstract will be covered.
 the things that curators should consider to find defects in data
                                                                           At level 4, tools will support the curator by looking for specific
 are documented. In addition, curators are asked to add an audit
                                                                           features in the abstract, based on a specification of needs from
 trail when making changes to data, giving the reason behind the
                                                                           the curator. For example, specific protein interaction information
 decision to make the change.
                                                                           could be located in the text of the abstract. At level 5, the tool will
 However, curators need semi-automated or automated ways to
                                                                           learn from previous interactions what data needs to be extracted,
 help them cope with the rapid arrival of new experimental results
                                                                           meaning that the curator does not need to do much configuration
 needing curation. This leads to level 3, in which automatic or
                                                                           of the tool.
 semi-automatic tools that can detect defects in data and suggest
 solutions for the detected defects are adopted. The curators can        Reading and extracting data from the full-text: After extracting
 monitor the results of the tools (perhaps through some dashboard)         data from the selected publication(s), the paper needs to be curated
 and authorise changes if applicable.                                      in full — that is, the full text of the paper is examined for
 The next level, level 4, starts from the idea that a number               information relevant to the annotation task. As in the other
 of communities may be working with the data under curation,               dimensions, the curation process of the full-text is done manually
 meaning that multiple curators might be at work on the data.              at level 1. At level 2, the curation process can be assessed using
 This leads to the possibility of redundant curation being done. At        a tool such as Kwon et al., 2014. At level 3, collaboration
 this level, therefore, we look for some support for collaboration         and sharing tools are brought into play, to assist curators in
 between communities of curators. This can be achieved by                  working together to curate a set of papers, sharing information
 providing a common curation platform or provide a sharing                 and avoiding redundant work. For example, one curator might
 mechanism. For example, MIntAct proposed a curation platform              mark up the relevant phrases in a paper, and this markup would
 which allows 11 different databases to share their curation efforts       be visible to other curators. This collaboration can be done by
 (Orchard et al., 2013). In case of sharing data, it is important to       providing curation platform.
 provide a catalogue that standardises the annotations to be created       At level 4, we start to use tools that extract relevant information
 by all communities. This will help curators to be familiar with the       from the paper full text automatically (creating the kinds of mark-
 meaning of other communities’ annotations.                                up that curators create at level 3, but by software rather than
 In level 5, all automatable parts of the process are done                 manually). At level 5, the tools used must go beyond extracting
 automatically, including the creation of links between data items         data from the text of a paper, but will also highlight relevant
 in the curated sources, and links to relevant external sources.           figures and tables. Besides, supplementary materials will also be
                                                                           considered and processed for relevance.
Searching for and selecting from new literature: This dimension
  is concerned with the first step in literature-oriented curation,      Documenting curation results: This dimension focuses on recording
  the identification of the scientific papers that will be the subject    and displaying the curation results, which might help curators
  of the curation. At level 1, searching for new publications in          from varies communities to understand the curation process of
  a specific area is done manually by searching with existing             a specific community. In level 1, any documentation of curation
  publisher web resources. At level 2, semi-automatic tools are           results is done manually, and at the discretion of individual
  used to check for the arrival of new publications and provide the
  results. At level 3, tools will also be used to rank papers in order   9   github.com/miniRECH



  curators. At level 2, a semi-automatic tool can be used to highlight recent changes
  made to data items of interest to curators and end-users, but audit trail
  information is gathered manually and informally. At level 3, the capture of audit
  trail information will be documented and standardised across the community, with
  tools to assist in the capture of this information. At level 4, audit trail
  information will not only be captured, but will be displayed and be capable of being
  queried. At level 5, tools will be able to aggregate audit trail information across
  a data source or set of curators, providing graphs for each attribute and dividing
  the results by change type and reason. This information will be used to identify
  lapses from the documented curation process, and to advise on areas where more
  curation effort is needed.

5 USAGE OF OUR PROPOSED MATURITY MODEL

This section illustrates how our proposed Maturity Model might be used in practice, by
describing an example. In this example, a community that has only recently started to
curate its data wishes to make improvements. They will use BioC-MM to identify
possible “quick wins” for improvement, based on their current practices.
  The community needs to carry out the following steps:
1. Identify the current maturity level of the community's curation process against
   each dimension in the model.
2. Identify the dimensions where improvement is most needed, and select the desired
   maturity level of each one. The desired maturity level should be close to the
   current level for this exercise. The assumption behind the use of maturity models
   is that there is no point in trying to jump from level 2 to level 5 (say) too
   quickly.
3. For each dimension where improvement is needed, use the descriptions of the levels
   between the current level and the target level to plan a series of staged
   improvements.

Let’s consider a simple example of a community that wishes to use BioC-MM to improve
its processes. Assume that this community uses a tool downloaded from elsewhere to
extract new publications from the literature every week, and that it can
semi-automatically detect and extract data from the abstract using a bespoke tool they
have developed. The community uses a basic collaboration platform to curate the full
text of new publications. However, the repository data is still edited manually, and
no audit trail information is gathered (apart from notes kept informally by curators).
   Based on the description of the community mentioned above, this community is at
level 1 for dimension 1, at level 2 for dimension 2, at level 3 for dimension 3, at
level 3 for dimension 4, and at level 1 for dimension 5. The curators feel they are
spending too long searching through new publications to find the ones they need to pay
attention to, and are beginning to struggle with the lack of any formal audit trail,
as errors introduced by inexperienced curators are hard to detect and correct. So, the
goal is set to reach level 3 in dimension 2 and level 2 or 3 in dimension 5. Interest
is also expressed in making data changes easier, so a target of level 2 is set for
dimension 1.
   After deciding the target maturity levels, it is time to go through each dimension
which is below its target, to improve it. Dimension 1 should be moved from manually
editing repository data to semi-automatic editing. If no existing tool can be found,
then a bespoke tool will need to be created. The team might decide that this is not
cost-effective for them at the present time. To reach level 3 in dimension 3, the
community needs to find a tool that can extract relevant information from the
abstracts of papers. They find a suitable text mining tool, but need to put some
effort into configuring it to work with their preferred ontologies. The team has
access to text mining expertise, and decides to go ahead with this improvement.
   The last dimension to be improved is dimension 5. The team decides to jump 2
levels, since they realise that they can adapt an audit trail model from another
closely related community, and also make use of tools provided by that community. The
maturity model has helped them to make informed and defensible decisions about how to
obtain the most improvement value from the available resources.

6 CONCLUSION AND FUTURE WORK

The main goal of this paper is to propose a tentative maturity model for biomedical
data curation, with the aim of soliciting preliminary feedback from the biomedical and
curation communities. The model gives a general explanation of how to identify the
maturity level of each curation step and suggests improvements to reach a sufficient
level of maturity. The aim is to achieve the maximum quality of curation with current
or fewer resources.
   Feedback at this early stage in the work is sought on the overall idea of creating
a maturity model for curation, and also on the details of the form the model takes. At
this stage, we make no strong claims for this set of levels being the “right ones”,
nor for the set of dimensions being complete. Our current work involves gathering
feedback from curators and researchers on the model, and incorporating that feedback.
Once a more stable model has been created, we will create a web resource to allow
curation teams to assess their current model, and to obtain suggestions for
improvements based on their target maturity levels. We hope that the final maturity
model will benefit a range of biomedical communities, by allowing ideas, tools and
best practice to be shared and refined.

REFERENCES
Bates, J. J. and Privette, J. L. (2012). A maturity model for assessing the
    completeness of climate data records. Eos, Transactions American Geophysical
    Union, 93(44), 441–441.
Baumgartner, Jr, W., Cohen, K., Fox, L., Acquaah-Mensah, G., and Hunter, L. (2007).
    Manual curation is not sufficient for annotation of genomic databases.
    Bioinformatics, 23(13), i41.
Campos, D., Lourenço, J., Matos, S., and Oliveira, J. L. (2014). Egas: a collaborative
    and interactive document curation platform. Database, 2014, bau048.
Clarke, J. A., Nelson, K. J., and Stoodley, I. D. (2013). The place of higher
    education institutions in assessing student engagement, success and retention: A
    maturity model to guide practice.
Crosby, P. B. (1979). Quality is free: The art of making quality certain. New York:
    New American Library.
Croset, S., Rupp, J., and Romacker, M. (2016). Flexible data integration and curation
    using a graph-based approach. Bioinformatics, 32(6), 918–925.
Crowston, K. and Qin, J. (2011). A capability maturity model for scientific data
    management: Evidence from the literature. Proceedings of the American Society for
    Information Science and Technology, 48(1), 1–9.
Essmann, H. and Du Preez, N. (2009). An innovation capability maturity model –
    development and initial application. World Academy of Science, Engineering and
    Technology, 53(1), 435–446.
Goldberg, T., Vinchurkar, S., Cejuela, J. M., Jensen, L. J., and Rost, B. (2015).
    Linked annotations: a middle ground for manual curation of biomedical databases
    and text corpora. In BMC Proceedings, volume 9, page A4. BioMed Central.
Kwon, D., Kim, S., Shin, S.-Y., Chatr-aryamontri, A., and Wilbur, W. J. (2014).
    Assisting manual literature curation for protein–protein interactions using
    BioQRator. Database, 2014, bau067.






Component: Adding and editing repository data
  Level 1: Manually identify problems in the data records and fix them.
  Level 2: Define criteria to go through each data record and fix data. Add
           annotations when editing data (manually).
  Level 3: Semi-automatic tool to detect problems in data and suggest solutions to fix
           problems. The curator can then go through suggestions and authorise the
           ideal suggestion.
  Level 4: Provide a catalog that links all types of annotations. Collaboration and
           data sharing: provide a common curation platform to share curation efforts
           between databases.
  Level 5: Completely automated way to detect and fix problems in data.

Component: Searching and choosing for new literature
  Level 1: Check for new publications in the literature manually.
  Level 2: Semi-automated tool to search for literature.
  Level 3: The tool can rank and order the extracted literature.
  Level 4: Set the tool to work every specific period of time, and search in different
           sources of literature.
  Level 5: Totally automated way to search literature and split the extracted papers
           by type.

Component: Reading and extracting data from the abstract
  Level 1: Reading and extracting data manually.
  Level 2: Collaboration: allow the authors of a new publication to participate
           partially in the curation process.
  Level 3: Semi-automated tool to highlight and extract.
  Level 4: The tool can also semi-automatically find protein-protein interactions and
           relationships.
  Level 5: The tool can perform its job automatically.

Component: Reading and extracting data from the full-text
  Level 1: Reading and extracting data manually.
  Level 2: A tool to assess manual curation.
  Level 3: Collaboration: collaborative curation platform between communities and
           curators.
  Level 4: A tool to extract data from text semi-automatically.
  Level 5: Extend the tool so it covers tables, figures etc. At least point out if it
           has something that needs to be reviewed.

Component: Documenting curation results
  Level 1: Does not pay attention to documenting any results.
  Level 2: A semi-automatic tool to help in extracting results of the curation for a
           specific type of data.
  Level 3: The tool has extra features such as specifying the period of time.
  Level 4: The tool will display the reason.
  Level 5: A tool to analyse the curation results.

Table 1. Biomedical Data Curation Maturity Model
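
The three assessment steps described in Section 5 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of BioC-MM: the function name plan_improvements, the MAX_JUMP limit of two levels, and the dictionary layout are assumptions made for this example. The current and target levels are the ones given for the example community in Section 5, with dimensions numbered in the order of the rows of Table 1.

```python
# Dimension numbering follows the order of the rows in Table 1.
DIMENSIONS = {
    1: "Adding and editing repository data",
    2: "Searching for and selecting from new literature",
    3: "Reading and extracting data from the abstract",
    4: "Reading and extracting data from the full-text",
    5: "Documenting curation results",
}

LEVELS = range(1, 6)  # maturity levels 1..5
MAX_JUMP = 2          # assumption: largest jump planned in a single exercise


def plan_improvements(current, target, max_jump=MAX_JUMP):
    """Return {dimension: [levels to reach, in order]} for dimensions below target."""
    plan = {}
    for dim in DIMENSIONS:
        cur = current[dim]
        tgt = target.get(dim, cur)  # no target given means "stay at current level"
        if cur not in LEVELS or tgt not in LEVELS:
            raise ValueError(f"dimension {dim}: levels must be between 1 and 5")
        if tgt - cur > max_jump:
            raise ValueError(f"dimension {dim}: jump from {cur} to {tgt} is too large")
        if tgt > cur:
            # Staged improvement: pass through every intermediate level.
            plan[dim] = list(range(cur + 1, tgt + 1))
    return plan


# Step 1: current levels of the example community.
current = {1: 1, 2: 2, 3: 3, 4: 3, 5: 1}
# Step 2: target levels chosen close to the current ones.
target = {1: 2, 2: 3, 5: 3}
# Step 3: staged plan per dimension.
plan = plan_improvements(current, target)

for dim, stages in plan.items():
    print(f"{DIMENSIONS[dim]}: level {current[dim]} -> {stages}")
```

Rejecting large jumps outright is one possible reading of the advice that the target level should be close to the current level; a team could equally treat the limit as a warning rather than an error.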




Liu, W., Laulederkind, S. J., Hayman, G. T., Wang, S.-J., Nigam, R., Smith, J. R.,
    De Pons, J., Dwinell, M. R., and Shimoyama, M. (2015). OntoMate: a text-mining
    tool aiding curation at the rat genome database. Database, 2015, bau129.
Oberkampf, W. L., Trucano, T. G., and Pilch, M. M. (2007). Predictive capability
    maturity model for computational modeling and simulation. Technical report,
    Sandia National Laboratories.
Ofner, M., Otto, B., and Österle, H. (2015). A maturity model for enterprise data
    quality management. Enterprise Modelling and Information Systems Architectures,
    8(2), 4–24.
Orchard, S., Ammari, M., Aranda, B., Breuza, L., Briganti, L., Broackes-Carter, F.,
    Campbell, N. H., Chavali, G., Chen, C., Del-Toro, N., et al. (2013). The MIntAct
    project - IntAct as a common curation platform for 11 molecular interaction
    databases. Nucleic acids research, page gkt1115.
Paulk, M., Curtis, W., Chrissis, M., and Weber, C. (1993). Capability maturity model,
    version 1.1. IEEE Software, 10(4), 18–27.
Pöppelbuß, J. and Röglinger, M. (2011a). What makes a useful maturity model? a
    framework of general design principles for maturity models and its demonstration
    in business process management. In ECIS.
Pöppelbuß, J. and Röglinger, M. (2011b). What makes a useful maturity model? a
    framework of general design principles for maturity models and its demonstration
    in business process management. In ECIS.
Ravagli, C., Pognan, F., and Marc, P. (2016). OntoBrowser: a collaborative tool for
    curation of ontologies by subject matter experts. Bioinformatics, page btw579.
Sernadela, P., Lopes, P., Campos, D., Matos, S., and Oliveira, J. L. (2015). A
    semantic layer for unifying and exploring biomedical document curation results.
    In International Conference on Bioinformatics and Biomedical Engineering, pages
    8–17. Springer.
Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B.,
    Pagan, A., and Xu, S. (2013). Data curation at scale: The Data Tamer system. In
    CIDR.
The Institute of Internal Auditors (2013). Practice guide: Selecting, using and
    creating maturity models: a tool for assurance and consulting engagements.
Verspoor, K., Yepes, A. J., Cavedon, L., McIntosh, T., Herten-Crabb, A., Thomas,
    Z., and Plazzer, J.-P. (2013). Annotating the biomedical literature for the human
    variome. Database, 2013, bat019.
Wei, C.-H., Kao, H.-Y., and Lu, Z. (2013). PubTator: a web-based text mining tool for
    assisting biocuration. Nucleic acids research, page gkt441.