=Paper= {{Paper |id=Vol-2042/paper43 |storemode=property |title=Construction of Knowledge-Base for Clinical Interpretation of Genomic Variants |pdfUrl=https://ceur-ws.org/Vol-2042/paper43.pdf |volume=Vol-2042 |authors=Mayumi Kamada,Toshiaki Katayama,Shuichi Kawashima,Ryosuke Kojima,Masahiko Nakatsui,Yasushi Okuno |dblpUrl=https://dblp.org/rec/conf/swat4ls/KamadaKKKNO17 }} ==Construction of Knowledge-Base for Clinical Interpretation of Genomic Variants== https://ceur-ws.org/Vol-2042/paper43.pdf
           Construction of Knowledge-base for Clinical
              Interpretation of Genomic Variants

      Mayumi Kamada1, Toshiaki Katayama2, Shuichi Kawashima2, Fumie Ono1,
             Ryosuke Kojima1, Masahiko Nakatsui1, Yasushi Okuno1
      1 Kyoto University, 54 Shogoin, Sakyo-ku, 606-8397 Kyoto, Japan
2 Database Center for Life Science, 178-4-4 Wakashiba, Kashiwa-shi, 277-0871 Chiba, Japan

                             mkamada@kuhp.kyoto-u.ac.jp

        Abstract.
        Clinical interpretation of variants of uncertain significance is important for
        providing appropriate medical treatment. However, interpreting variants
        clinically requires enormous effort and specialized knowledge. To reduce this
        burden, an automated system is needed that estimates clinical significance using
        knowledge aggregated from public databases and the literature. We are
        constructing a database of disease-related variants in the Japanese population
        in order to improve the interpretation of variants found in Japanese
        individuals. In this work, we convert the public databases needed for variant
        interpretation into RDF and integrate them so that they can be used by an
        estimation system based on a machine learning method.

        Keywords: Clinical interpretation of variants, Estimation of clinical signifi-
        cance, Integrated knowledge base.


1       Introduction
    Improvements in genome sequencing technology have made it possible to apply
clinical sequencing with next-generation sequencers to clinical diagnosis. The purpose
of clinical sequencing is to provide an appropriate medical treatment policy based on
an individual's genetic background. However, for many of the detected sequence variants
the relation to the mechanism of disease is unclear, and they often do not lead to a
clinical decision. These variants are called variants of uncertain significance (VUS),
and they are one of the obstacles to precision medicine. Clarifying the disease
relevance of a VUS requires 1) specialized knowledge of each disease domain,
2) comprehensive interpretation of the enormous amount of information in the literature,
and 3) the clinical background of the individual patient. Aggregating such knowledge
would make it possible to build a system that automatically estimates disease relevance.
    The Database Center for Life Science (DBCLS) in Japan has been working on the
integration of public databases using the Resource Description Framework (RDF),
promoting international collaboration on the standardization of semantics in the life
science and biomedical domains1. We have been developing a machine learning method that
estimates the clinical significance of each variant, using graph-structured data as
training data. To enrich the training data, we are converting the databases required to
interpret disease relevance into RDF.

1   https://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-5-5
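
As an illustration of how RDF-style triples can be prepared as graph-structured input for a learning method, the following minimal sketch maps node and relation names to integer ids and builds an adjacency list. It is not the method used in this work: the triples, the identifiers, the predicate names and the function `index_graph` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical triples; identifiers and predicate names are illustrative only.
TRIPLES = [
    ("variant:v1", "locatedIn", "gene:g1"),
    ("variant:v2", "locatedIn", "gene:g1"),
    ("gene:g1", "interactsWith", "gene:g2"),
    ("drug:d1", "targets", "gene:g2"),
]


def index_graph(triples):
    """Assign integer ids to nodes and relations and build an adjacency list."""
    node_ids = {}
    relation_ids = {}
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        # setdefault keeps the existing id, or assigns the next free one.
        s = node_ids.setdefault(subject, len(node_ids))
        o = node_ids.setdefault(obj, len(node_ids))
        r = relation_ids.setdefault(predicate, len(relation_ids))
        adjacency[s].append((r, o))
    return node_ids, relation_ids, dict(adjacency)


node_ids, relation_ids, adjacency = index_graph(TRIPLES)
```

The integer-indexed form is a common starting point for graph-based learning, because nodes and typed edges can then be referenced by position in feature or embedding arrays.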

2       Target variants and concept of integrated knowledge-base
   The goal of our project is to construct a database that supports appropriate
interpretation of variants found in the Japanese population. It is well known that
disease association is affected by differences in genomic background between
populations. We have therefore been constructing a database of disease-related genomic
information for the Japanese population. The database will store variants and clinical
data collected in five fields: “cancer”, “rare disease”, “infectious disease”,
“dementia”, and “hearing loss”.
   A germline variant is a mutation in a reproductive cell (egg or sperm) and in many
cases causes a single-gene disorder. Germline variants are often interpreted using a
guideline developed by the American College of Medical Genetics and Genomics (ACMG)2.
Based on the ACMG guideline for making medical treatment decisions, we have been
converting the following databases into RDF and integrating them with existing RDF
datasets of reference genome and protein annotations.

    • ClinVar3, COSMIC4 (Pathogenicity of variants)
    • dbNSFP5 (Effects of variants)
    • dbSNP6, dbVar7 (Genetic variants and their frequencies in populations)
    • DGIdb8 (Drug-gene interaction)
    • HINT9, INstruct10 (Molecular interaction)

If individual databases use different terms for the same object, the databases cannot
be used in an integrated manner. We therefore promote the unification of terms through
ontology development and the standardization of URIs through a common prefix
(http://identifiers.org). For each database, the converter from database dump files
(CSV, TSV, etc.) and relational databases (RDB) to RDF will be released as a Docker11
container. We also plan to provide our database of Japanese disease-related genomic
data as RDF so that it can be seamlessly integrated with the knowledge graph
constructed above.
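
The conversion step described above can be sketched as follows: a minimal example that reads a TSV dump and emits N-Triples lines whose subject and object URIs share the http://identifiers.org prefix. This is not the released converter. The column names, the sample row, the namespace paths `dbsnp` and `hgnc.symbol`, and the `http://example.org/vocab#` predicates are assumptions made for illustration.

```python
import csv
import io

IDENTIFIERS_ORG = "http://identifiers.org"
VOCAB = "http://example.org/vocab#"  # hypothetical vocabulary, not from the paper

# Hypothetical dump excerpt; column names and values are illustrative only.
SAMPLE_TSV = "rsid\tgene_symbol\tsignificance\nrs0000001\tGENE1\tPathogenic\n"


def uri(namespace, local_id):
    """Build a URI under the shared identifiers.org prefix (namespace is assumed)."""
    return f"<{IDENTIFIERS_ORG}/{namespace}/{local_id}>"


def literal(text):
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def tsv_to_ntriples(tsv_text):
    """Convert each TSV row into two N-Triples statements about one variant."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        variant = uri("dbsnp", row["rsid"])
        gene = uri("hgnc.symbol", row["gene_symbol"])
        significance = literal(row["significance"])
        lines.append(f"{variant} <{VOCAB}gene> {gene} .")
        lines.append(f"{variant} <{VOCAB}clinicalSignificance> {significance} .")
    return lines


ntriples = tsv_to_ntriples(SAMPLE_TSV)
```

Because every converter would build URIs through the same prefix, records about the same variant or gene from different source databases resolve to the same node when the resulting RDF files are loaded together.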

Acknowledgements. This research is supported by the Program for an Integrated
Database of Clinical and Genomic Information from the Japan Agency for Medical
Research and Development (AMED).


2 Richards S. et al., Genet. Med., 17(5):405-24, 2015.
3 https://www.ncbi.nlm.nih.gov/clinvar/
4 http://cancer.sanger.ac.uk/cosmic
5 https://sites.google.com/site/jpopgen/dbNSFP
6 https://www.ncbi.nlm.nih.gov/projects/SNP/
7 https://www.ncbi.nlm.nih.gov/dbvar
8 http://dgidb.org/
9 http://hint.yulab.org/
10 http://instruct.yulab.org/
11 https://www.docker.com/