=Paper=
{{Paper
|id=Vol-2137/paper_18.pdf
|storemode=property
|title=A Maturity Model for Biomedical Data Curation
|pdfUrl=https://ceur-ws.org/Vol-2137/paper_18.pdf
|volume=Vol-2137
|authors=Mariam Alqasab,Suzanne M. Embury,Sandra de F. Mendes Sampaio
|dblpUrl=https://dblp.org/rec/conf/icbo/AlqasabES17
}}
==A Maturity Model for Biomedical Data Curation==
Mariam Alqasab∗, Suzanne M. Embury and Sandra Sampaio
Department of Computer Science, The University of Manchester, Oxford Road, Manchester, UK
ABSTRACT
Quality is an important aspect that needs to be managed in databases, as the importance of data is determined by its quality. This leads many database providers to curate their data in order to maintain its quality over time. It also leads database providers and researchers to investigate the area of data curation and propose ways to improve it, either by providing tools to automate the process or by supporting human curators in making changes to the data. However, among all the available suggestions for improving data curation, to the best of our knowledge, no general description of the curation process has been given that also provides solutions to improve it, and that can help database providers to assess how mature their approach to data curation is. To fill this gap, this paper proposes a maturity model that describes the maturity levels of biomedical data curation. The proposed Maturity Model aims to help data providers to identify limitations in their current curation methods and enhance their curation process.
1 INTRODUCTION
With the growth of data-driven science, the curation of public and community data sets has become a necessary task for ensuring the long-term usefulness of scientific data. Scientific data typically comes in two forms: experimental results (measurements) and the interpretation of those results in the form of statements about the structure, organisation and function of the things being observed. There are curation challenges with both types of data, but the most substantial difficulties lie in the curation of the interpretive data. This data describes the models and hypotheses about reality that prevail within the community that owns the data. As such, it is often complex in form (requiring several ontologies to describe), it can change rapidly or remain current for many years, it is subject to disagreement within the community, and it can be superseded as new experimental results come in. Perhaps most significantly, the source of this data is not a machine, which spits out experimental results at high volume but in regular and predictable format. This interpretive data comes from people, in the form of scientific publications. The principal task of a biomedical curator is to ensure that the interpretive data in the resource they curate (sometimes called metadata or annotations) is kept up-to-date with the prevailing view of the field as presented in the scientific literature.
Thus, the task of biomedical data curation goes beyond fixing defects in data (although this is part of the curator’s task). Instead, curation must be done by human experts in the domain of the data, who are capable of interpreting the scientific literature, resolving conflicting interpretations, and reflecting the results in the data. The curation task is time-consuming, and it is not always easy to recruit curators with the breadth and depth of expertise to be able to do the job well. The speed of arrival of new experimental results, and new interpretations of these and past results, easily out-paces the amount of curation that can be done by available curators (Baumgartner, Jr et al., 2007). Data curation is a vital task, but one that must be done (and done well) with a fraction of the resources needed to complete the work wholly manually.
This has led curators and researchers to examine and propose ways to speed up and improve the curation process. A review of the biomedical literature for the last 5 years (2012-2017) indicates a number of publications proposing tools for data curation (such as OntoMate (Liu et al., 2015), PubTator (Wei et al., 2013), MIntAct (Orchard et al., 2013) and Data Tamer (Stonebraker et al., 2013)), as well as others describing specific approaches to curation in certain fields (such as using a graph-based approach to improve the detection of problems in records (Croset et al., 2016), and proposing a middle layer to unify curation results (Sernadela et al., 2015)).
These improvement efforts are good news for biomedical science. However, individual communities are at different stages in terms of how their curation is performed. Some communities are well established, with documented, agreed-upon processes for data curation and access to a repertoire of curation resources, such as rich ontologies defining agreed shared vocabularies. Others are just starting out, and are following ad hoc procedures with few quality controls. This is often the case, for example, when some new experimental technique is developed; it can take a little time before community repositories for storing the results can be created, and for the needs of the communities using the new data to be understood and supported. During this period, curation of data is less of a focus for the burgeoning community than just getting up and running. These communities need a quick and efficient way to introduce curation regimes to protect and amplify the value of this early data.
At present, there is little general advice for curators of biomedical data. An exception is a useful proposal by Hirschman et al. for a general biocuration workflow, but even this proposes a one-size-fits-all solution, which may not be appropriate for all communities. Instead, we propose the creation of a maturity model for biomedical data curation. A maturity model indicates the different stages of “maturity” of an organisation or group in performing some tasks. The stages describe good practice (and even best practice) for aspects of the task under consideration, as well as commonly occurring forms of poorer practice. The underlying assumption behind maturity models is that it is not usually possible for a group of people to carry out best practice in a new area from scratch. The need to understand the particular needs of the task and the particular abilities of the group means that time and experience are needed to learn the best approaches. The maturity model can tell a group where they currently stand in terms of good practice, and can indicate plausible steps for gradual improvement over time. Using the model, newer groups can avoid the mistakes made by other groups, and can improve more quickly. More established groups can identify areas where their (often scarce) resources can be deployed for maximum improvement effect.
∗ Corresponding author: mariam.alqasab@postgrad.manchester.ac.uk
Such a maturity model will need the support and assistance of the biomedical community to refine and test. As a first step, we present in this paper our initial version of a maturity model for biomedical data curation. The model was created from a survey of the research literature on biocuration, and in this first version is focussed on literature-based curation. The paper is organised as follows. We begin by surveying the literature on biomedical data curation (Section 2) and on maturity models (Section 3). We then present our tentative Biomedical Data Curation Maturity Model (Section 4) and illustrate its use with an example (Section 5). Finally, we conclude.
2 STATE OF THE ART: DATA CURATION
Scientific data curation (www.dcc.ac.uk/resources/curation-lifecycle-model) is the process of associating semantic information with experimental results, to describe their interpretation in terms of current scientific thought. The semantic information typically takes the form of terms from a controlled vocabulary or ontology, or of links with other databases. In addition to adding new annotations, curators are responsible for the overall quality of the data, including resolving defects reported in the experimental data and in the interpretation annotations added previously, or by other curators. The process is expensive, as we have mentioned, because it is important that any such interpretive data is supported by the available scientific evidence. It is also important that these annotations are as complete as possible, so that data-driven science performed on them produces useful results. Some communities/source owners are able to employ experts to work as full-time data curators, while others must rely on volunteers from the community giving their time and knowledge. Because of this, some communities have created their own specialist processes and tools to try to increase the efficiency and accuracy of data curation.
Recent years have seen an increase in the number of publications presenting research in the area of biomedical data curation. While different aspects of curation are covered by different proposals, all share the same goal of improving the outcome of the curation process, while using the same or fewer resources.
Some authors and communities have attempted to make collaboration between curators easier, to avoid overlapping curation work and to make better use of the curation effort available. Orchard et al. initiated a project called MIntAct (Orchard et al., 2013), which merged the IntAct Molecular Interaction database (www.ebi.ac.uk/intact) with the MINT database of verified protein-protein interactions (mint.bio.uniroma2.it). MINT is manually curated by experts from the scientific literature. The MIntAct project focussed on sharing the curation efforts from 11 different databases, to gain the maximum value from the curation work performed at each individual source. Thinking along similar lines, Ravagli et al. (2016) created OntoBrowser, an on-line collaboration tool for curators that allows them to work on a single shared working copy, to avoid redundant curation work. Campos et al. (2014) also created a curation tool, called Egas, that allows for real-time collaborative curation from the scientific literature.
Others have implemented tools to speed up data curation by automating aspects of the search for relevant papers in the scientific literature to curate, and the extraction of data from the literature. Liu et al. (2015) proposed OntoMate, a text annotation tool that tags abstracts of PubMed articles with terms from 16 ontologies using machine learning. The curators can specify query terms and OntoMate returns the abstracts with matching tags. OntoMate will also filter and rank the resulting papers. Wei et al. (2013) implemented PubTator, a web-based tool that also searches for articles in PubMed, retrieves them and adds annotations in order to ease the curators’ job. PubTator allows curators to select articles from the list of search results, indicate whether the article is curatable or not, and add specifications to data type and relations.
One of the most important procedures in data curation is adding annotations to the curated data. Verspoor et al. (2013) propose a schema for representing annotations describing human genetic variants and their relation to disease. The schema was designed for a specific community but is intended to be more widely used, and to save curators the need to redo the design work when creating a data format for their annotations. Generally, the schema works as a fundamental stage for those who look for text mining solutions for the human variome. More generally, Goldberg et al. (2015) emphasised the importance of providing linked annotations between resources. Their claim is that such links assist manual curation, since they give curators ready access to all (or most) available resources that are connected with the artefact under curation.
While the literature contains a variety of proposals to improve the way data curation is carried out, most of the work concerns or serves a specific target community or data source. We were not able to find much in the way of guidance for setting up a data curation activity, nor much work giving general guidance applicable across the biomedical field. We propose our maturity model in an attempt to (partially) address this gap.
3 STATE OF THE ART: MATURITY MODELS
Maturity models grew out of work in the 1980s and 90s on business process improvement (e.g., Crosby’s Quality Management Maturity Grid (Crosby, 1979)) and, especially, software engineering (e.g., the Capability Maturity Model (Paulk et al., 1993)). Since then, a variety of different maturity models, covering a variety of different fields and process types, have been defined.
Briefly, a maturity model is a sequence of levels or stages that show the progress needed to reach a mature level of practice in some tasks or areas (Paulk et al., 1993). Each level has specific criteria that need to be fulfilled in order to move from one level to the next. For example, one of the most well-known and well-used maturity models, the Capability Maturity Model for Software, aims to model the maturity of software development processes (Paulk et al., 1993). It consists of five levels:
1. Initial: The software process used by teams at this level is characterised as ad hoc, and occasionally even chaotic. Few processes are defined, and success depends on individual effort.
2. Repeatable: Basic project management processes are established to track cost, schedule, and functionality. The necessary process discipline is in place to repeat earlier successes on projects with similar applications.
3. Defined: The software processes for both management and engineering activities are documented, standardised, and integrated into a standard software process for the organisation. All projects use an approved, tailored version of the organisation’s
standard software process for developing and maintaining software.
4. Managed: Detailed measures of the software process and product quality are collected. Both the software process and products are quantitatively understood and controlled.
5. Optimizing: Continuous process improvement is enabled by quantitative feedback from the process and from piloting innovative ideas and technologies.
Maturity models have several uses (Pöppelbuß and Röglinger, 2011a). They can be used to assess the current level of a group, in order simply to understand how the group is performing relative to the norms in the field. They can be used to compare the performance of two different groups (for example, to look for opportunities for partners for fruitful interactions and discussions; a group may find it more useful to work with a partner one level higher than it in the maturity model than with one at the other extreme of the model). Principally, however, they are a tool for long-term, sustained improvement. By assessing a group’s current standing against the model, and comparing this with the group’s desired level, a sequence of manageable improvement actions can be planned. With the model’s help, the group can target its efforts on areas of its performance where there is most scope for useful improvement. And by looking at the criteria for performing at the level just above its current performance, the group can identify achievable improvement steps that can be implemented with the resources available.
A large number of maturity models have been proposed since their inception in the 1980s. A full survey is beyond the scope of this paper, but we mention some representative examples of work in this area, to give a flavour of what is being done.
New maturity models have been proposed in areas that go well beyond the original business process/software process focus of the earliest models. Ofner et al. (2015), for example, built a maturity model for data quality management at an enterprise level. Another proposed maturity model is called the Student Engagement Success and Retention Maturity Model (SESR-MM) (Clarke et al., 2013). It focuses on helping higher education institutions (HEIs) to provide a good environment for their students. The model covers different aspects that can raise the level of student engagement to improve academic success rates and retention. Yet another model is aimed at innovation capabilities within organisations, and the kinds of support and facility needed to enhance them (Essmann and Du Preez, 2009).
In addition to these business focussed models, a handful of maturity models in the area of scientific data and data management have been proposed. For example, Bates and Privette (2012) proposed a maturity matrix for the quality assurance processes used in managing climate data records. Specifically, the model looks at whether best practice is employed in the task of converting the raw experimental data into a high-quality product. Crowston and Qin (2011) proposed a model based on the CMM for Software but adapted for the management of scientific data. They describe key processes and practices that should be in place for effective data management. A further example is provided by a team at Sandia National Labs in the US, where Oberkampf et al. have constructed a maturity model for computer modelling and simulation (Oberkampf et al., 2007). The model includes a check on the tools and techniques used to verify the geometric and physical fidelity of any model created.
Other researchers have studied the whole concept of maturity models, and have proposed ways in which new maturity models can be created and better use can be made of existing models. For example, the Institute of Internal Auditors, in the Netherlands, offers a guide for selecting maturity models for use on business process improvement consulting projects (The Institute of Internal Auditors, 2013). The guide contains a description of a maturity model, and illustrates how to design a maturity model. Pöppelbuß et al. (2011b) focused on investigating the literature of maturity models in business process management. From this investigation, they derived some general design principles that can help in designing a maturity model.
4 BIOMEDICAL DATA CURATION MATURITY MODEL (BIOC-MM)
In order to create a maturity model for biocuration, it was necessary to gain a picture of the breadth of activity being undertaken (to identify the dimensions for our model), and to gather examples of best practice across different biomedical domains. In order to do this, we reviewed the literature on curation activities in five different biomedical databases, covering a spread of topics across the field:
• UniProt (uniprot.org)
• BioGRID (thebiogrid.org)
• FlyBase (flybase.org)
• Saccharomyces Genome Database (SGD) (yeastgenome.org)
• Rat Genome Database (RGD) (rgd.mcw.edu)
We also aimed to examine sources from both long-established and more newly established communities, on the grounds that the longer established communities would (typically) have more mature processes in place than those just getting started. (Unfortunately, very new communities are not usually in a position to publish details of their curation processes, and are less likely to have the time or confidence to do so.)
According to our observations of practices in use with these data sources, we found that the curation process mainly takes two forms: data-oriented curation and literature-oriented curation. Data-oriented curation means that the focus of the curation process is to look for defects in the data, whereas literature-oriented curation means curating data when a new related publication appears in the area, by extracting relevant information from the paper and associating it with the data. Literature-oriented curation has three main tasks: searching for new publications, extracting data from the abstract and extracting data from the full text.
These observations led us to divide our Maturity Model (Table 1) into five components as follows:
1. Adding and editing repository data.
2. Searching for and selecting from new literature.
3. Reading and extracting data from the abstract.
4. Reading and extracting data from the full paper.
5. Documenting curation results.
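The five components listed in this section, together with the five maturity levels of the model, can be written down as a small data structure. The following is a minimal sketch only, not part of the published model: the names `DIMENSIONS`, `check_assessment` and `plan_improvements` are hypothetical, and the example numbers are taken from the community described in Section 5, with the dimension 5 target assumed to be level 3 (the text allows level 2 or 3).

```python
# Hypothetical representation of BioC-MM; all names here are illustrative.
DIMENSIONS = {
    1: "Adding and editing repository data",
    2: "Searching for and selecting from new literature",
    3: "Reading and extracting data from the abstract",
    4: "Reading and extracting data from the full paper",
    5: "Documenting curation results",
}
LEVELS = range(1, 6)  # level 1 (fully manual) up to level 5


def check_assessment(levels):
    """Raise ValueError unless every dimension has a level between 1 and 5."""
    if set(levels) != set(DIMENSIONS):
        raise ValueError("an assessment needs exactly one level per dimension")
    for dim, level in levels.items():
        if level not in LEVELS:
            raise ValueError(f"dimension {dim}: level {level} is out of range")


def plan_improvements(current, target):
    """Map each dimension below its target to the levels to reach, in order."""
    check_assessment(current)
    check_assessment(target)
    plan = {}
    for dim in DIMENSIONS:
        if target[dim] > current[dim]:
            # Staged improvement: one entry per intermediate level.
            plan[dim] = list(range(current[dim] + 1, target[dim] + 1))
    return plan


# Current levels from the Section 5 example; the target for dimension 5 is an assumption.
current = {1: 1, 2: 2, 3: 3, 4: 3, 5: 1}
target = {1: 2, 2: 3, 3: 3, 4: 3, 5: 3}

for dim, steps in plan_improvements(current, target).items():
    path = " -> ".join(str(level) for level in steps)
    print(f"{DIMENSIONS[dim]}: level {current[dim]} -> {path}")
```

Listing every intermediate level, rather than only the target, reflects the assumption behind maturity models that improvement is staged and that large jumps between levels are best avoided.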
We identified 5 broad levels for the maturity model from the literature. At level 1, all curation is performed manually (as might be the case, for example, in a community that is new to curation). Then, the process gradually changes to adopt semi-automated ways to curate data. The final level is not full automation, which is not likely to be possible (or desirable) in the foreseeable future due to the need for expert interpretation and decision making, but instead aims for an optimal distribution of work between the human experts (curators) and the supporting software tools.
The provisional model is presented in Table 1. We now describe each dimension (column) of the model in turn.
Adding and editing repository data: This dimension models levels of practice in finding defects in the repository data and correcting them. At the initial level, the curators perform their job by manually searching for defects in data and fixing them. End users may also report data errors. At this level, we do not pay attention to the format of data, as manual curation can deal flexibly with a range of formats, to identify how to access and retrieve the data.
Level 2 focuses on making the curation process more organised and repeatable compared with the initial level. At this level, a number of guidelines are documented that define the process of curation and the things that curators should consider in order to find defects in data. In addition, curators are asked to add an audit trail when making changes to data, giving the reason behind the decision to make the change.
However, curators need semi-automated or automated ways to help them cope with the rapid arrival of new experimental results needing curation. This leads to level 3, in which automatic or semi-automatic tools that can detect defects in data and suggest solutions for the detected defects are adopted. The curators can monitor the results of the tools (perhaps through some dashboard) and authorise changes if applicable.
The next level, level 4, starts from the idea that a number of communities may be working with the data under curation, meaning that multiple curators might be at work on the data. This leads to the possibility of redundant curation being done. At this level, therefore, we look for some support for collaboration between communities of curators. This can be achieved by providing a common curation platform or a sharing mechanism. For example, MIntAct proposed a curation platform which allows 11 different databases to share their curation efforts (Orchard et al., 2013). When data is shared, it is important to provide a catalogue that standardises the annotations to be created by all communities. This will help curators to be familiar with the meaning of other communities’ annotations.
At level 5, all automatable parts of the process are done automatically, including the creation of links between data items in the curated sources, and links to relevant external sources.
Searching for and selecting from new literature: This dimension is concerned with the first step in literature-oriented curation, the identification of the scientific papers that will be the subject of the curation. At level 1, searching for new publications in a specific area is done manually by searching with existing publisher web resources. At level 2, semi-automatic tools are used to check for the arrival of new publications and provide the results. At level 3, tools will also be used to rank papers in order of significance or urgency for curation, and will include some notion of paper quality and readiness for curation (e.g. using tools such as the MiniRECH reporting quality checklist (github.com/miniRECH)). At level 4, the tools would include some element of learning, based on curators’ previous decisions about what to curate, that removes some of the search labour for curators. Searches would be run automatically, rather than being triggered by the curators, and work is scheduled across available curators, who are notified of the arrival of papers relevant to them that could be curated. At level 5, text analysis of the paper is used to make good quality decisions as to which papers to curate, leaving curators only the task of choosing from amongst a very small number of papers.
Reading and extracting data from the abstract: This dimension relates to the second step of literature-oriented curation, in which annotations that are supported by the abstract of the paper under curation are decided. At level 1, curators read and extract data from the abstract entirely manually. At level 2, the curation process continues to be manual, but authors of the paper can participate in the process. In other words, authors are given the chance to fill in a form with some information about their publications. At level 3, a semi-automatic tool can be used to highlight and extract data from the abstract. However, at this point, only limited formats of abstract will be covered.
At level 4, tools will support the curator by looking for specific features in the abstract, based on a specification of needs from the curator. For example, specific protein interaction information could be located in the text of the abstract. At level 5, the tool will learn from previous interactions what data needs to be extracted, meaning that the curator does not need to do much configuration of the tool.
Reading and extracting data from the full-text: After extracting data from the abstract of the selected publication(s), the paper needs to be curated in full; that is, the full text of the paper is examined for information relevant to the annotation task. As in the other dimensions, the curation process of the full text is done manually at level 1. At level 2, the curation process can be assessed using a tool such as that of Kwon et al. (2014). At level 3, collaboration and sharing tools are brought into play, to assist curators in working together to curate a set of papers, sharing information and avoiding redundant work. For example, one curator might mark up the relevant phrases in a paper, and this markup would be visible to other curators. This collaboration can be done by providing a curation platform.
At level 4, we start to use tools that extract relevant information from the paper full text automatically (creating the kinds of mark-up that curators create at level 3, but by software rather than manually). At level 5, the tools used must go beyond extracting data from the text of a paper, and will also highlight relevant figures and tables. In addition, supplementary materials will be considered and processed for relevance.
Documenting curation results: This dimension focuses on recording and displaying the curation results, which might help curators from various communities to understand the curation process of a specific community. At level 1, any documentation of curation results is done manually, and at the discretion of individual
curators. At level 2, a semi-automatic tool can be used to highlight recent changes made to data items of interest to curators and end-users, but audit trail information is gathered manually and informally. At level 3, the capture of audit trail information will be documented and standardised across the community, with tools to assist in the capture of this information. At level 4, audit trail information will not only be captured, but will be displayed and be capable of being queried. At level 5, tools will be able to aggregate audit trail information across a data source or set of curators, providing graphs for each attribute and dividing the results by change type and reason. This information will be used to identify lapses from the documented curation process, and to advise on areas where more curation effort is needed.
5 USAGE OF OUR PROPOSED MATURITY MODEL
This section illustrates how our proposed Maturity Model might be used in practice, by describing an example. In this example, a community that has only recently started to curate its data wishes to make improvements. They will use BioC-MM to identify possible “quick wins” for improvement, based on their current practices. The community needs to carry out the following steps:
1. Identify the current maturity level of the community curation process against each dimension in the model.
2. Identify the dimensions where improvement is most needed, and select the desired maturity level of each one. The desired maturity level should be close to the current level for this exercise. The assumption behind the use of maturity models is that there is no point in trying to jump from level 2 to level 5 (say) too quickly.
3. For each dimension where improvement is needed, use the descriptions of the levels between the current level and the target level to plan a series of staged improvements.
Let’s consider a simple example of a community that wishes to use BioC-MM to improve its processes. Assume that this community uses a tool downloaded from elsewhere to extract new publications from the literature every week, and that it can semi-automatically detect and extract data from the abstract using a bespoke tool they have developed. The community uses a basic collaboration platform to curate the full text of new publications. However, the repository data is still edited manually, and no audit trail information is gathered (apart from notes kept informally by curators).
Based on the description of the community given above, this community is at level 1 for dimension 1, at level 2 for dimension 2, at level 3 for dimension 3, at level 3 for dimension 4, and at level 1 for dimension 5. The curators feel they are spending too long searching through new publications to find the ones they need to pay attention to, and are beginning to struggle with the lack of any formal audit trail, as errors introduced by inexperienced curators are hard to detect and correct. So, the goal is set to reach level 3 in dimension 2 and level 2 or 3 in dimension 5. Interest is also expressed in making data changes easier, so a target of level 2 is set for dimension 1.
After deciding the target maturity levels, it is time to go through each dimension which is below its target, to improve it. Dimension 1 should be moved from manually editing repository data to semi-
dimension 3, the community needs to find a tool that can extract relevant information from the abstracts of papers. They find a suitable text mining tool, but need to put some effort into configuring it to work with their preferred ontologies. The team has access to text mining expertise, and decides to go ahead with this improvement.
The last dimension to be improved is dimension 5. The team decides to jump 2 levels, since they realise that they can adapt an audit trail model from another closely related community, and also make use of tools provided by that community. The maturity model has helped them to make informed and defensible decisions about how to obtain the most improvement value from the available resources.
6 CONCLUSION AND FUTURE WORK
The main goal of this paper is to propose a tentative maturity model for biomedical data curation, with the aim of soliciting preliminary feedback from the biomedical and curation communities. The model gives a general explanation of how to identify the maturity level of each curation step and suggests improvements to reach a sufficient level of maturity. The aim is to achieve the maximum quality of curation with current or fewer resources.
Feedback at this early stage in the work is sought on the overall idea of creating a maturity model for curation, and also on the details of the form the model takes. At this stage, we make no strong claims for this set of levels being the “right ones”, nor for the set of dimensions being complete. Our current work involves gathering feedback from curators and researchers on the model, and incorporating that feedback. Once a more stable model has been created, we will create a web resource to allow curation teams to assess their current model, and to obtain suggestions for improvements based on their target maturity levels. We hope that the final maturity model will benefit a range of biomedical communities, by allowing ideas, tools and best practice to be shared and refined.
REFERENCES
Bates, J. J. and Privette, J. L. (2012). A maturity model for assessing the completeness of climate data records. Eos, Transactions American Geophysical Union, 93(44), 441–441.
Baumgartner, Jr, W., Cohen, K., Fox, L., Acquaah-Mensah, G., and Hunter, L. (2007). Manual curation is not sufficient for annotation of genomic databases. Bioinformatics, 23(13), i41.
Campos, D., Lourenço, J., Matos, S., and Oliveira, J. L. (2014). Egas: a collaborative and interactive document curation platform. Database, 2014, bau048.
Clarke, J. A., Nelson, K. J., and Stoodley, I. D. (2013). The place of higher education institutions in assessing student engagement, success and retention: A maturity model to guide practice.
Crosby, P. B. (1979). Quality is free: The art of making quality certain. New York: New American Library.
Croset, S., Rupp, J., and Romacker, M. (2016). Flexible data integration and curation using a graph-based approach. Bioinformatics, 32(6), 918–925.
Crowston, K. and Qin, J. (2011). A capability maturity model for scientific data management: Evidence from the literature. Proceedings of the American Society for Information Science and Technology, 48(1), 1–9.
Essmann, H. and Du Preez, N. (2009). An innovation capability maturity model–development and initial application. World Academy of Science, Engineering and Technology, 53(1), 435–446.
Goldberg, T., Vinchurkar, S., Cejuela, J. M., Jensen, L. J., and Rost, B. (2015). Linked annotations: a middle ground for manual curation of biomedical databases and text
corpora. In BMC Proceedings, volume 9, page A4. BioMed Central.
automatic editing. If no existing tool can be found, then a bespoke Kwon, D., Kim, S., Shin, S.-Y., Chatr-aryamontri, A., and Wilbur, W. J. (2014).
tool will need to be created. The team might decide that this is Assisting manual literature curation for protein–protein interactions using bioqrator.
not cost-effective for them at the present time. To reach level 3 in Database, 2014, bau067.
{| class="wikitable"
! Component !! Level 1 !! Level 2 !! Level 3 !! Level 4 !! Level 5
|-
| Adding and editing repository data
| Manually identify problems in the data records and fix them (manually)
| Define criteria to go through each data record and fix data; add annotations when editing data
| Semi-automatic tool to detect problems in data and suggest solutions to fix problems. The curator can then go through suggestions and authorise the ideal suggestion
| Completely automated way to detect and fix problems in data
| Providing a catalog that links all types of annotations; collaboration and data sharing, providing a common curation platform to share curation efforts between databases
|-
| Searching for and choosing new literature
| Check for new publications in the literature manually
| Semi-automated tool to search for literature
| The tool can rank and order the extracted literature
| Set the tool to work every specific period of time, and search in different sources of literature
| Totally automated way to search literature and split the extracted papers by type
|-
| Reading and extracting data from the abstract
| Reading and extracting data manually
| Collaboration: allow the authors of new publications to participate partially in the curation process
| Semi-automated tool to highlight and extract
| The tool can also semi-automatically find protein-protein interaction and relationship
| The tool can perform its job automatically
|-
| Reading and extracting data from the full text
| Reading and extracting data manually
| A tool to assess manual curation
| Collaboration: collaborative curation platform between communities and curators
| A tool to extract data from text semi-automatically
| Extend the tool, so it covers tables, figures etc. At least point out if it has something that needs to be reviewed
|-
| Documenting curation results
| Does not pay attention to documenting any results
| A semi-automatic tool to help in extracting results of the curation for a specific type of data
| The tool has extra features such as specifying the period of time
| The tool will display the reason
| A tool to analyse the curation results
|}
Table 1. Biomedical Data Curation Maturity Model
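The staged-improvement exercise from the worked example can be sketched against the five components of Table 1. The following is a minimal Python illustration only, not part of BioC-MM: the names `DIMENSIONS` and `plan_stages` are hypothetical, the dimension numbering assumes the row order of Table 1, and the target for dimension 5 is assumed to be level 3 (the text allows "level 2 or 3").

```python
# Illustrative sketch: all names here are assumptions, not part of BioC-MM.
# Dimension numbers are assumed to follow the row order of Table 1.
DIMENSIONS = {
    1: "Adding and editing repository data",
    2: "Searching for and choosing new literature",
    3: "Reading and extracting data from the abstract",
    4: "Reading and extracting data from the full text",
    5: "Documenting curation results",
}
MIN_LEVEL, MAX_LEVEL = 1, 5


def plan_stages(current, target):
    """Map each dimension below its target to the levels to reach, in order."""
    plan = {}
    for dim, cur in current.items():
        goal = target.get(dim, cur)  # no target given: stay at the current level
        for level in (cur, goal):
            if not MIN_LEVEL <= level <= MAX_LEVEL:
                raise ValueError(f"dimension {dim}: level {level} is out of range")
        if goal > cur:
            # One stage per level, so no level is skipped on the way up.
            plan[dim] = list(range(cur + 1, goal + 1))
    return plan


# Current levels of the example community, as stated in the text.
current = {1: 1, 2: 2, 3: 3, 4: 3, 5: 1}
# Targets from the text; level 3 for dimension 5 is an assumed choice.
target = {1: 2, 2: 3, 5: 3}

plan = plan_stages(current, target)
for dim in sorted(plan):
    path = " -> ".join(str(level) for level in [current[dim]] + plan[dim])
    print(f"Dimension {dim} ({DIMENSIONS[dim]}): {path}")
```

Representing each plan as an explicit list of intermediate levels mirrors the assumption that a team should move one level at a time rather than jump straight to its target.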
Liu, W., Laulederkind, S. J., Hayman, G. T., Wang, S.-J., Nigam, R., Smith, J. R., De Pons, J., Dwinell, M. R., and Shimoyama, M. (2015). OntoMate: a text-mining tool aiding curation at the rat genome database. Database, 2015, bau129.
Oberkampf, W. L., Trucano, T. G., and Pilch, M. M. (2007). Predictive capability maturity model for computational modeling and simulation. Technical report, Sandia National Laboratories.
Ofner, M., Otto, B., and Österle, H. (2015). A maturity model for enterprise data quality management. Enterprise Modelling and Information Systems Architectures, 8(2), 4–24.
Orchard, S., Ammari, M., Aranda, B., Breuza, L., Briganti, L., Broackes-Carter, F., Campbell, N. H., Chavali, G., Chen, C., Del-Toro, N., et al. (2013). The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic acids research, page gkt1115.
Paulk, M., Curtis, W., Chrissis, M., and Weber, C. (1993). Capability maturity model, version 1.1. IEEE Software, 10(4), 18–27.
Pöppelbuß, J. and Röglinger, M. (2011a). What makes a useful maturity model? a framework of general design principles for maturity models and its demonstration in business process management. In ECIS.
Pöppelbuß, J. and Röglinger, M. (2011b). What makes a useful maturity model? a framework of general design principles for maturity models and its demonstration in business process management. In ECIS.
Ravagli, C., Pognan, F., and Marc, P. (2016). Ontobrowser: a collaborative tool for curation of ontologies by subject matter experts. Bioinformatics, page btw579.
Sernadela, P., Lopes, P., Campos, D., Matos, S., and Oliveira, J. L. (2015). A semantic layer for unifying and exploring biomedical document curation results. In International Conference on Bioinformatics and Biomedical Engineering, pages 8–17. Springer.
Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., Zdonik, S. B., Pagan, A., and Xu, S. (2013). Data curation at scale: The Data Tamer system. In CIDR.
The Institute of Internal Auditors (2013). Practice guide: Selecting, using and creating maturity models: a tool for assurance and consulting engagements.
Verspoor, K., Yepes, A. J., Cavedon, L., McIntosh, T., Herten-Crabb, A., Thomas, Z., and Plazzer, J.-P. (2013). Annotating the biomedical literature for the human variome. Database, 2013, bat019.
Wei, C.-H., Kao, H.-Y., and Lu, Z. (2013). PubTator: a web-based text mining tool for assisting biocuration. Nucleic acids research, page gkt441.