=Paper= {{Paper |id=Vol-1986/SML17_paper_9 |storemode=property |title=Construction of Viral Hepatitis Bilingual Bibliographic Database with Protein Text Mining and Information Integration Functions |pdfUrl=https://ceur-ws.org/Vol-1986/SML17_paper_9.pdf |volume=Vol-1986 |authors=Heng Chen,Yongjuan Zhang,Chunhong Lin,Liwen Zhang,Tao Chen |dblpUrl=https://dblp.org/rec/conf/ijcai/ChenZLZC17 }} ==Construction of Viral Hepatitis Bilingual Bibliographic Database with Protein Text Mining and Information Integration Functions== https://ceur-ws.org/Vol-1986/SML17_paper_9.pdf
               Construction of Viral Hepatitis Bilingual Bibliographic
                  Database with Protein Text Mining and Information
                                              Integration Functions

Heng Chen*, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yueyang Road 320, Shanghai 200031, China, chenheng@sibs.ac.cn
Yongjuan Zhang, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yueyang Road 320, Shanghai 200031, China, zhangyj@sbs.ac.cn
Chunhong Lin, ShangTex Workers’ College, Changshou Road 652, Shanghai 200060, China, linch@fzzd.sh.cn
Liwen Zhang, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yueyang Road 320, Shanghai 200031, China, zhangliwen@sibs.ac.cn
Tao Chen, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yueyang Road 320, Shanghai 200031, China, Chentao01@sibs.ac.cn

Abstract
With the fast development of viral hepatitis research, a large number of research achievements have been generated and scattered across the literature. Information service providers face the challenge of satisfying readers’ needs for more efficient and intelligent retrieval. Data mining and information integration are promising and effective ways to meet this challenge, and they are becoming more and more important. Our study describes how to build the viral hepatitis bibliographic database, and how the viral hepatitis related protein information is mined from the viral hepatitis bibliographic database and integrated with the corresponding information in the Universal Protein Resource, the Uniprot database from EBI. With the help of a Chinese and English bilingual protein control vocabulary built by ourselves, mining of the viral hepatitis related protein text in the bilingual bibliographic database is realized, and integration with the corresponding protein information in the Uniprot database is achieved. In a word, our paper describes the integration and mapping between Chinese-English bilingual bibliographic databases and an authoritative factual database (the Uniprot database) through relevant text mining work. It would be useful for the extension, utilization and mining of Chinese-English bilingual bibliographic resources, as well as for cross-lingual information retrieval, integration, and mining.

1 Introduction
At present, a global flood of information affects all aspects of human life. As one of the most active research fields, life science generates countless achievements and datasets every year, and these are scattered across the literature. In the life science field, viral hepatitis is a serious infectious disease caused by various hepatitis viruses, and it is arguably one of the most intensely studied diseases in the history of biomedical research worldwide. With the fast development of viral hepatitis research, a large number of research achievements have been generated and scattered across the literature. Although most of them are accessible through databases and web sites, it is still a problem for readers to identify what they really need from enormous search results. Mining and information integration are therefore essential to meet readers’ needs for more efficient and intelligent retrieval. Different useful information resources can be further integrated after the information is filtered, digitized and mined. The integrated information resources can be chosen, organized and processed according to the needs of different readers or users so as to yield new information resources and new knowledge. The integration of digital information resources includes data integration, information integration and knowledge integration. Knowledge integration is the highest level of the resource integration system; it is the inevitable requirement and result once data and information integration have developed to a certain stage.

* Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
In: Proceedings of IJCAI Workshop on Semantic Machine Learning (SML 2017), Aug 19-25 2017, Melbourne, Australia.

Knowledge mining is a complex process of identifying effective, novel and potentially useful information and knowledge from an information database (Feng and Wang, 2008). Information integration allows users to get the most extensive information, while knowledge mining allows users to quickly find the knowledge they want in the vast ocean of information. The application of information integration and knowledge mining technology and the establishment of a linked and integrated database knowledge service system will allow users to quickly and efficiently find the necessary information and knowledge (Zhang et al., 2010).
Nowadays, many professional databases have entered the era of data mining and integration and of knowledge mining and discovery. They focus strongly on information integration and knowledge mining so as to link and integrate different types of databases in a one-way or two-way mode, which connects the relevant databases into an interactive organic whole and enriches their extension and expansion capabilities. Some successful works have been carried out, such as GOPubMed, which can automatically recognize concepts from a user’s search query to PubMed and display papers containing relevant terms (Doms and Schroeder, 2005), and Entrez, an integrated search system that enables access to multiple National Center for Biotechnology Information (NCBI) databases (Maglott et al., 2011). Similar works are also reported by Alexopoulou et al. (2008), Chen et al. (2013), McGarry et al. (2006), Pasquier (2008), and Sahoo et al. (2007). Different useful information resources can be further integrated after this information is filtered, digitized and mined. Innovation in database design and construction lets users experience the appeal and potential of information integration and knowledge mining.
In summary, with the development of international scientific databases, information integration and knowledge mining have become the mainstream and the trend of digital information resource processing and utilization. The semantic web is the environment of information integration, and ontology is the core and foundation of semantic web construction. Construction of professional domain ontologies, based on the integration and mining of digital information resources, will become the focus of information integration and knowledge mining research (Yan, 2008). Based on an analysis of domestic and foreign database information integration and knowledge mining theory and applications, and learning from advanced foreign information integration and knowledge mining technology, the authors explore the association and integration of the Chinese and English bilingual literature databases of viral hepatitis with the related scientific data databases at home and abroad in the construction of the viral hepatitis special literature knowledge database. Moreover, the authors further study the deep processing of the subject classification index of the literature in the knowledge database from the user’s needs so as to facilitate the readers’ use and retrieval.
Literature databases and protein science databases are among the most important supporting sources for hepatitis virus researchers. So in this paper, we build the viral hepatitis bilingual bibliographic database, perform viral hepatitis related protein text mining, and integrate the results with the Uniprot protein database so as to support the information retrieval and knowledge discovery of hepatitis virus researchers in China and abroad.

2 Materials, Methods, Design and Results

2.1   Materials
Data resources: the Medline database from NCBI for the English dataset, the CNKI database from China National Knowledge Infrastructure for the Chinese dataset, and the Uniprot protein database from EBI (European Bioinformatics Institute) for the protein dataset.
Methods and procedure:
       ① Collect, select and process the viral hepatitis and hepatitis virus A, B, and C related datasets (literature data) from the above Chinese and English databases;
       ② Build the bilingual text mining control vocabulary (dictionary);
       ③ Perform text mining of viral hepatitis related proteins in the viral hepatitis bilingual literature database;
       ④ Perform preliminary research on eliminating false positives from the mining results;
       ⑤ Integrate the viral hepatitis bilingual literature database with the Uniprot protein database on the basis of the mined hepatitis virus A, B and C related proteins.

2.2   Design
System design
1. System architecture: 3-tier structure based on the B/S (browser/server) model (separation of web server and database server). See fig.1 as follows:

           Figure 1 System architecture

2. System hardware platform: IBM 4 core servers
3. System software platform:
Operating system: Linux, Ubuntu 9.04
WEB server: Nginx 0.87
Database software: MySQL 5.6.22
Development language: C++ for the information index module and data mining module, and PHP for the web application module.
4. Integration design architecture of database system platform. See fig.2 as follows:




                                Figure 2 Database system platform structure

Figure 2 demonstration: On the one hand, literature records about viral hepatitis A, B and C from the Medline database on the Web of Science platform (in English) and from the CNKI database of China (in Chinese) were screened, collected and processed into the viral hepatitis related literature knowledge data warehouse. On the other hand, the control vocabulary of the Uniprot protein database from EBI was also screened, collected, processed and translated into the Chinese and English bilingual viral hepatitis related protein text mining control vocabulary. Then the indexed viral hepatitis subject literature knowledge database was built by applying the protein text mining control vocabulary to the processed viral hepatitis related literature data warehouse, using an index program with improved index procedure control and an optimized index algorithm. Finally, integration of the indexed viral hepatitis subject literature knowledge database and the Uniprot protein database was realized by mapping rules, through the protein text or knowledge mining algorithm and machine learning.
5. Viral hepatitis related literature indexing and processing. See fig.3 as follows:




                    Figure 3 Literature indexing and processing flow chart


Figure 3 demonstration: The literature records in the viral hepatitis knowledge data warehouse were indexed and processed in the three stages shown in the flow chart. Stage 1 is preprocessing before indexing. Stage 2 is control during the indexing procedure. Stage 3 is feedback control after indexing. The aim of all three stages is to protect protein text mining from false positive indexing and mining results.
6. Database system function module components:
      ① Information issue/management system
      ② Literature knowledge database processing/maintaining system
      ③ Administration system for user rights and IP addresses
      ④ Information index system
      ⑤ Knowledge mining system
      ⑥ Knowledge inquiry system
      ⑦ Data maintaining system
      ⑧ Web site visiting and statistical system
Construction of Chinese English bilingual control vocabulary dictionary
An example excerpt of the bilingual control vocabulary. See fig.4 as follows:

    Figure 4 Example excerpt of the bilingual control vocabulary of viral hepatitis (A, B, C) proteins

Information integrating and hyperlinking rules and examples for the mined protein text in literature using Chinese English bilingual control vocabulary
HBV related protein text is used as an example to demonstrate the information integrating and hyperlinking rules for mined English protein text in the literature:
①      HBeAg,
http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=HBeAg
②      Capsid protein,
http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=Capsid%20protein
③      Large envelope protein,
http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=Large%20envelope%20protein
④      RNA-directed DNA polymerase,
http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=RNA-irected%20DNA%20polymerase
For mined Chinese protein text in the literature, the Chinese protein text is first translated into English protein text, for example “乙型肝炎e抗原” is translated into “HBeAg” and “衣壳蛋白质” is translated into “Capsid protein”. Information integrating and hyperlinking are then performed according to the rules and examples above.
Main performance index of the database system:
     1. Maximum number of literature records: 0.2 billion.
     2. Index and data mining time: with the database system currently containing 1,470,000 control vocabulary terms and about 20,000 literature records, the index and data mining time is about eighteen minutes. After a single literature record is added, the index and data mining time is about five minutes.
     3. Average retrieval time: < 0.03 s
     4. Concurrency (number of users with simultaneous access): > 50
Viral hepatitis subject literature knowledge database extends three functions through data mining, information integration and hyperlinking
     1. Obtain the protein sequence and annotation information
     2. Perform homology analysis of the protein sequences (BLAST)
     3. Perform alignment of different protein sequences and evolutionary tree mapping

2.3   Results
Function realization and result display of the viral hepatitis subject literature knowledge database
Homepage of the viral hepatitis subject literature knowledge database. See fig.5 as follows:

    Figure 5 Homepage of the viral hepatitis subject literature knowledge database

Realization of protein mining for the viral hepatitis literature knowledge database.


The viral hepatitis related proteins are successfully mined in the viral hepatitis bilingual bibliographic database by using the bilingual control vocabulary, the algorithm and the computer program. Moreover, the viral hepatitis bilingual bibliographic database is linked to the protein database through protein mining and information integration. See fig.6, 7 and 8 as follows:




                          Figure 6 Page of the hepatitis viral protein mining (1)




                          Figure 7 Page of the hepatitis viral protein mining (2)
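
The dictionary-based mining shown in Figures 6 and 7 could be sketched roughly as follows. This is a minimal illustration, not the authors' C++ implementation: the vocabulary entries are the examples given in Section 2.2, while the dictionary layout, the function names `build_matcher` and `mine_proteins`, and the choice of case-insensitive regular-expression matching are all assumptions made for this sketch.

```python
import re

# Hypothetical excerpt of a bilingual control vocabulary (assumed layout):
# surface form as it may appear in a record -> canonical English protein name.
CONTROL_VOCABULARY = {
    "HBeAg": "HBeAg",
    "Capsid protein": "Capsid protein",
    "Large envelope protein": "Large envelope protein",
    "RNA-directed DNA polymerase": "RNA-directed DNA polymerase",
    "乙型肝炎e抗原": "HBeAg",
    "衣壳蛋白质": "Capsid protein",
}


def build_matcher(vocabulary):
    """Compile one pattern that matches any vocabulary term, ignoring case."""
    # Longest terms first, so a longer term is preferred over a shorter one
    # that starts at the same position.
    terms = sorted(vocabulary, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)


def mine_proteins(text, vocabulary=CONTROL_VOCABULARY):
    """Return (start, end, matched_text, canonical_name) for each hit, left to right."""
    lookup = {term.lower(): name for term, name in vocabulary.items()}
    matcher = build_matcher(vocabulary)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


if __name__ == "__main__":
    record = "Serum HBeAg and capsid protein levels were measured; 乙型肝炎e抗原阳性。"
    for hit in mine_proteins(record):
        print(hit)
```

The character offsets returned with each hit are what a result page would need in order to highlight the mined protein names. Plain string matching of this kind is also one place where false positives can arise, which is why the paper adds control stages before, during and after indexing.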




 Figure 8 Page of the hepatitis viral protein of literature database integrating and hyperlinking to the
                                   Uniprot protein scientific database
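
The hyperlinking step behind Figure 8 follows the rule given in Section 2.2: the English protein name is appended, percent-encoded, as the `entrezWord` query parameter, and Chinese names are translated to English first. A minimal sketch is given below. The base URL and the two translation pairs come from the paper; the lookup table layout and the function name `protein_link` are assumptions for illustration only.

```python
from urllib.parse import quote

# Base URL as it appears in the paper's hyperlinking examples.
QUICK_SEARCH_URL = "http://lifecenter.sgst.cn/protein/cn/quickSearch.do"

# Hypothetical Chinese-to-English lookup holding the two pairs named in the paper.
ZH_TO_EN = {
    "乙型肝炎e抗原": "HBeAg",
    "衣壳蛋白质": "Capsid protein",
}


def protein_link(mined_term):
    """Build the quick-search hyperlink for a mined protein name."""
    # Chinese terms are translated first; English terms pass through unchanged.
    english = ZH_TO_EN.get(mined_term, mined_term)
    # quote() percent-encodes spaces as %20, matching the paper's examples.
    return f"{QUICK_SEARCH_URL}?entrezWord={quote(english)}"


if __name__ == "__main__":
    for term in ["HBeAg", "Large envelope protein", "衣壳蛋白质"]:
        print(term, "->", protein_link(term))
```

Routing Chinese terms through the English name means that both language versions of a record point to the same protein entry, which is the mapping between the bilingual literature database and the protein database that the paper describes.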

Viral hepatitis subject literature knowledge database extends three functions through data mining, information integration and hyperlinking
Obtain the hepatitis viral protein sequence and annotation information. See fig.9 as follows:
Result of homology analysis of the protein sequences (BLAST). See fig.10 as follows:
Obtain the evolutionary tree mapping. See fig.11 as follows:




             Figure 9 Page of the protein sequence and annotation information of HBcAg




Figure 10 Page of homological analysis result of the HBcAg protein sequences (BLAST)




        Figure 11 Page of the evolutionary tree mapping of the HBcAg protein




3 Discussion, Conclusion and Future Work

3.1   Discussion

The viral hepatitis bilingual bibliographic database was successfully built, protein text was successfully mined, and two different classes of databases were successfully integrated. However, we encountered some problems, in particular false positive results in bilingual protein text mining. Having investigated the false positives, we think there are probably three causes:
     1) Low quality of the original datasets collected;
     2) Insufficient accuracy and consistency of specialized term usage in building the bilingual control vocabulary;
     3) In data mining and integration, the computer algorithms, the mining mode and route selection, or the algorithm itself are unreasonable, or the system has defects.
To address these problems, we apply manual quality control to the collected original datasets; refer to specialized dictionaries and consult experts to resolve the accuracy and consistency of specialized term usage; and explore different algorithms, mining modes and routes to improve the accuracy and efficiency of data mining and integration.
After the viral hepatitis bilingual bibliographic database was used and demonstrated, we received much feedback from users. Most of them like the convenience of easily searching hepatitis viral protein names, locating highlighted viral protein names in search results, and accessing the UniProt database for detailed protein information through information integration and links. They also raised some questions and offered many suggestions. Overall, however, the feedback has been very positive so far. Based on users’ suggestions and the problems we have discovered, the following issues are currently being considered, and some of them are already being addressed, in order to further enhance the system and make it more efficient and convenient:
     1) Add more hepatitis viral protein names and their features to the English-Chinese controlled-vocabulary dictionary. This work is ongoing, and we also plan to add relationships between hepatitis viral proteins and other relevant information so as to finally construct a Chinese hepatitis viral protein ontology. It would then be possible to realize semantic-based text mining and provide users with a knowledge-based information service.
     2) Integrate more factual scientific databases, especially factual gene databases. Some users are also interested in other special fields, such as evidence-based medicine, AIDS, etc. If search results on a special topic from a bibliographic database can be integrated with relevant factual scientific databases, it is certainly very helpful and convenient for users. This is an interesting direction for information integration and knowledge mining.

3.2   Conclusion

With the fast development of viral hepatitis research, satisfying users’ information needs is becoming an inevitable challenge. So, construction of the viral hepatitis bilingual literature database is important, significant and useful. Integration of two different classes of databases via data mining and linking is innovative and is a trend in database development. Moreover, information integration and data mining are playing a more and more important role in the big data era.

3.3   Future work

In order to solve the problems above, future work must be done as follows:
     1) Constantly extend and update the datasets in the viral hepatitis bilingual literature database;
     2) Constantly improve mining and integrating quality so as to reduce false positive results as far as possible through algorithm improvement and machine learning;
     3) Further improve the accuracy and consistency of the bilingual control vocabulary;
     4) Link the viral hepatitis bilingual literature database to more factual scientific databases via data mining and information integration.

Acknowledgements
This work is supported by The Chosen Excellent Program for Introduced Outstanding Talent of Chinese Academy of Sciences in the Fields of Bibliographical Information and Periodical Publication 2010 (Subject field 100 talent program) and Chinese National Science and Technology Support Project (No.2013BAH21B06).

Reference
Alexopoulou, D., Wächter, T., Pickersgill, L.,
   Eyre, C. and Schroeder, M.: Terminologies for
   text-mining; an experiment in the lipoprotein
   metabolism domain. BMC Bioinformatics.
   9(Suppl 4), S2, 2008
Chen Heng, Jin Yi, Zhao Yan, Zhang Yongjuan,
   Chen Chengcai, Sun Jilin, Zhang Shen.
   Mining and Information Integration Practice
   for Chinese Bibliographic Database of Life
   Sciences. Book title: Advances in Data
   Mining: Applications and Theoretical
   Aspects; Vol.7987, pp.1-10, 2013. Publisher:
   Springer Berlin Heidelberg. Book subtitle:
   13th Industrial Conference, ICDM 2013,
   New York, NY, USA, July 16-21, 2013,
   Proceedings. (DOI: 10.1007/978-3-642-39736-3)
Doms, A. and Schroeder, M.: GoPubMed:
   exploring PubMed with the gene ontology.
   Nucleic Acids Research. Vol.33: 783-786,
   2005
Feng Xinmin and Wang Jiandong. The concept
   dilemma of knowledge mining and the
   broad-sense knowledge mining. Journal of
   Information, Vol.27 (7): 63-65, 2008
Maglott, D., Ostell, J., Pruitt, K. and Tatusova, T.:
   Entrez Gene: gene-centered information at
   NCBI. Nucleic Acids Research. Vol.39:
   52-57, 2011
McGarry, K., Garfield, S. and Morris, N.: Recent
   trends in knowledge and data integration for
   the life sciences. Expert Systems. Vol.23(5):
   330-341, 2006
Pasquier, C.: Biological data integration using
   Semantic Web technologies. Biochimie.
   Vol.90: 584-594, 2008
Sahoo, S., Bodenreider, O., Zeng, K. and Sheth,
   A.: An experiment in integrating large
   biomedical knowledge resources with RDF:
   Application to associating genotype and
   phenotype        information.        In:    16th
   International World Wide Web Conference
   (WWW2007) on Health Care and Life
   Sciences Data Integration for the Semantic
   Web, pp. 8-12. Banff, Canada, 2007
Yan Zhihong. Research on the integration mode
   of digital information resources in Chinese
   University libraries. Thesis for Master degree,
   Chongqing University, 2008
Zhang Xiaojuan, Zhang Yutao, Zhang Jieli and
   Wang Juncheng. The central research issues
   of information resources integration in China.
   Journal of the China Society for Scientific
   and Technical Information, Vol.28 (5):
   791-800, 2010



