=Paper=
{{Paper
|id=Vol-1986/SML17_paper_9
|storemode=property
|title=Construction of Viral Hepatitis Bilingual Bibliographic Database with Protein Text Mining and Information Integration Functions
|pdfUrl=https://ceur-ws.org/Vol-1986/SML17_paper_9.pdf
|volume=Vol-1986
|authors=Heng Chen,Yongjuan Zhang,Chunhong Lin,Liwen Zhang,Tao Chen
|dblpUrl=https://dblp.org/rec/conf/ijcai/ChenZLZC17
}}
==Construction of Viral Hepatitis Bilingual Bibliographic Database with Protein Text Mining and Information Integration Functions==
Heng Chen*, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yueyang Road 320, Shanghai 200031, China, chenheng@sibs.ac.cn
Yongjuan Zhang, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yueyang Road 320, Shanghai 200031, China, zhangyj@sbs.ac.cn
Chunhong Lin, ShangTex Workers’ College, Changshou Road 652, Shanghai 200060, China, linch@fzzd.sh.cn
Liwen Zhang, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yueyang Road 320, Shanghai 200031, China, zhangliwen@sibs.ac.cn
Tao Chen, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yueyang Road 320, Shanghai 200031, China, Chentao01@sibs.ac.cn
Abstract
With the rapid development of viral hepatitis research, a large number of research achievements have been generated and are scattered across the literature. Information service providers face the challenge of satisfying readers’ needs for more efficient and intelligent retrieval. Data mining and information integration are promising and effective ways to meet this challenge, and they are becoming increasingly important. Our study describes how to build the viral hepatitis bibliographic database, and how viral hepatitis related protein information is mined from that database and integrated with the corresponding information in the Universal Protein Resource, the UniProt database from EBI. With the help of a Chinese and English bilingual protein control vocabulary that we built ourselves, mining of viral hepatitis related protein text in the bilingual bibliographic database is realized, and integration with the corresponding protein information in the UniProt database is achieved. In short, our paper describes the integration and mapping between Chinese-English bilingual bibliographic databases and an authoritative factual database (the UniProt database) through text mining. This should be useful for the extension, utilization and mining of Chinese-English bilingual bibliographic data and resources, as well as for cross-lingual information retrieval, integration, and mining.
1 Introduction
At present, a flood of information affects all aspects of human life. As one of the most active research fields, life science generates countless achievements and datasets every year, scattered across the literature. Within life science, viral hepatitis is a serious infectious disease caused by various hepatitis viruses, and hepatitis viruses are, arguably, among the most intensely studied viruses in the history of biomedical research worldwide. With the rapid development of viral hepatitis research, a large number of research achievements have been generated and are scattered across the literature. Although most of them are accessible through databases and web sites, it is still difficult for readers to identify what they really need among enormous search results. Mining and information integration are therefore essential to meet readers’ needs for more efficient and intelligent retrieval. Different useful information resources can be further integrated after the information is filtered, digitized and mined. Information resources can be selected, organized and processed according to the needs of different readers or users so as to yield new information resources and new knowledge. The integration of digital information resources includes data integration, information integration and knowledge integration. Knowledge integration is the highest level of the resource integration system, and it is the inevitable requirement and result of information integration reaching a certain stage.
* Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
In: Proceedings of IJCAI Workshop on Semantic Machine Learning (SML 2017), Aug 19-25 2017, Melbourne, Australia.
In: Proceedings of the 4th International Workshop on Semantic Machine Learning (SML 2017). 19th August, 2017, Melbourne, VIC, Australia.
Knowledge mining is a complex process of identifying effective, novel and potentially useful information and knowledge in an information database (Feng and Wang, 2008). Information integration allows users to get the most extensive information, while knowledge mining allows users to quickly find the knowledge they want in an ocean of information. Applying information integration and knowledge mining technology and establishing a linked and integrated database knowledge service system will allow users to find the necessary information and knowledge quickly and efficiently (Zhang et al., 2010).
Nowadays, many professional databases have entered the era of data mining and integration and of knowledge mining and discovery. They focus strongly on information integration and knowledge mining so as to link and integrate different types of databases in a one-way or two-way mode, which connects the relevant databases into an interactive organic whole and enriches their extension and expansion capabilities. Some successful works have been carried out, such as GoPubMed, which automatically recognizes concepts in a user’s PubMed search query and displays papers containing relevant terms (Doms and Schroeder, 2005), and Entrez, an integrated search system that enables access to multiple National Center for Biotechnology Information (NCBI) databases (Maglott et al., 2011). Similar works are also reported by Alexopoulou et al. (2008), Chen et al. (2013), McGarry et al. (2006), Pasquier (2008), and Sahoo et al. (2007). Different useful information resources can be further integrated after this information is filtered, digitized and mined. Innovation in database design and construction lets users experience the appeal and potential of information integration and knowledge mining.
In summary, with the development of international scientific databases, information integration and knowledge mining have become the mainstream trend in processing and utilizing digital information resources. The semantic web is the environment of information integration, and ontology is the core and foundation of semantic web construction. Construction of professional domain ontologies, based on the integration and mining of digital information resources, will become the focus of information integration and knowledge mining research (Yan, 2008). Based on an analysis of the theory and application of database information integration and knowledge mining in China and abroad, and learning from advanced foreign technology, the authors explore the association and integration of the Chinese and English bilingual literature databases of viral hepatitis with related scientific data databases at home and abroad in constructing the viral hepatitis special literature knowledge database. Moreover, the authors further study the deep processing of the subject classification index of the literature in the knowledge database according to users’ needs, so as to facilitate readers’ use and retrieval.
Literature databases and protein science databases are among the most important supporting resources for hepatitis virus researchers. In this paper, we therefore build the viral hepatitis bilingual bibliographic database, mine viral hepatitis related protein text, and integrate the results with the UniProt protein database, so as to support information retrieval and knowledge discovery by hepatitis virus researchers in China and abroad.
2 Materials, Methods, Design and Results
2.1 Materials
Data resources: the Medline database from NCBI for the English dataset, the CNKI database from China National Knowledge Infrastructure for the Chinese dataset, and the UniProt protein database from EBI (European Bioinformatics Institute) for the protein dataset.
Methods and procedure:
① Collect, select and process the datasets (literature data) related to viral hepatitis and hepatitis viruses A, B and C from the above Chinese and English databases;
② Build the bilingual text mining control vocabulary (dictionary);
③ Perform text mining of viral hepatitis related proteins in the viral hepatitis bilingual literature database;
④ Perform preliminary research on eliminating false positives from the mining results;
⑤ Integrate the viral hepatitis bilingual literature database with the UniProt protein database on the basis of the mined hepatitis virus A, B and C related proteins.
2.2 Design
System design
1. System architecture: 3-tier structure based on the browser/server (B/S) model, with separate web server and database server. See fig.1 as follows:
Figure 1 System architecture
2. System hardware platform: IBM 4-core servers
3. System software platform:
Operating system: Linux, Ubuntu 9.04
Web server: Nginx 0.87
Database software: MySQL 5.6.22
Development language: C++ for the information index module and data mining module, and PHP for the web application module.
4. Integration design architecture of the database system platform. See fig.2 as follows:
Figure 2 Database system platform structure
Figure 2 demonstration: On the one hand, literature records about viral hepatitis A, B and C, in English from the Medline database on the Web of Science platform and in Chinese from the CNKI database of China, were screened, collected and processed into the viral hepatitis related literature knowledge data warehouse. On the other hand, the control vocabulary of the UniProt protein database from EBI was also screened, collected, processed and translated into the Chinese and English bilingual viral hepatitis related protein text mining control vocabulary. Then the indexed viral hepatitis subject literature knowledge database was built by an index program, which includes improved index procedure control and an optimized index algorithm, by applying the protein text mining control vocabulary to the processed viral hepatitis related literature data warehouse. Finally, the indexed viral hepatitis subject literature knowledge database and the UniProt protein database were integrated by mapping rules, through a protein text or knowledge mining algorithm and machine learning.
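The vocabulary-driven indexing step described above can be illustrated with a minimal Python sketch. It is not the authors' C++ implementation: the function names `build_pattern` and `mine_proteins`, the three-entry toy vocabulary and the longest-match-first strategy are all assumptions made for illustration. The vocabulary entries reuse protein names quoted as examples elsewhere in this paper.

```python
import re

# Toy bilingual control vocabulary: surface form -> canonical English name.
# Hypothetical excerpt; a real vocabulary derived from UniProt is far larger.
VOCABULARY = {
    "HBeAg": "HBeAg",
    "Capsid protein": "Capsid protein",
    "衣壳蛋白质": "Capsid protein",
}


def build_pattern(vocabulary):
    """Compile one regex that matches any term, trying longer terms first."""
    terms = sorted(vocabulary, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)


def mine_proteins(text, vocabulary=VOCABULARY):
    """Return (surface form, canonical name, offset) for each hit in text."""
    # Case-insensitive lookup table for mapping a hit back to its entry.
    lookup = {term.lower(): name for term, name in vocabulary.items()}
    hits = []
    for match in build_pattern(vocabulary).finditer(text):
        surface = match.group(0)
        hits.append((surface, lookup[surface.lower()], match.start()))
    return hits
```

Because Chinese and English surface forms map to the same canonical name, a hit in either language can be linked to the same protein record. Plain substring matching of this kind can over-match, which is the sort of false positive the paper's control stages are meant to reduce.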
5. Viral hepatitis related literature indexing and
processing. See fig.3 as follows:
Figure 3 Literature indexing and processing flow chart
Figure 3 demonstration: The literature records in the viral hepatitis knowledge data warehouse were indexed and processed in the three stages shown in the flow chart. Stage 1 is preprocessing before indexing. Stage 2 is control during the indexing procedure. Stage 3 is feedback control after indexing. The aim of all three stages is to protect protein text mining from false positive indexing and mining results.
6. Database system function module components:
① Information issue/management system
② Literature knowledge database processing/maintaining system
③ Administration system for user rights and IP addresses
④ Information index system
⑤ Knowledge mining system
⑥ Knowledge inquiry system
⑦ Data maintaining system
⑧ Web site visiting and statistical system
Construction of Chinese English bilingual control vocabulary dictionary
An excerpt of the bilingual control vocabulary is shown as an example. See fig.4 as follows:
Figure 4 Example excerpt of the bilingual control vocabulary of viral hepatitis (A, B, C) proteins
Information integrating and hyperlinking regulation and examples for the mined protein text in literature using Chinese English bilingual control vocabulary
HBV related protein text is used as an example to demonstrate the information integrating and hyperlinking rule for English protein text mined from the literature. See as follows:
① HBeAg, http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=HBeAg
② Capsid protein, http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=Capsid%20protein
③ Large envelope protein, http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=Large%20envelope%20protein
④ RNA-directed DNA polymerase, http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord=RNA-irected%20DNA%20polymerase
For Chinese protein text mined from the literature, the Chinese protein name is first translated into the English protein name. For example, “乙型肝炎 e 抗原” is translated into “HBeAg” and “衣壳蛋白质” is translated into “Capsid protein”. Information integrating and hyperlinking are then performed according to the rule and examples above.
Main performance index of the database system:
1. Maximum number of literature information records: 0.2 billion.
2. Index and data mining time: in the current state of the database system, which contains 1,470,000 control vocabulary entries and about 20,000 literature records, the index and data mining time is about eighteen minutes. After a single literature record is added, the index and data mining time is about five minutes.
3. Average retrieval time: < 0.03 s
4. Concurrency (number of users accessing simultaneously): > 50
Viral hepatitis subject literature knowledge database extends three functions through data mining, information integration and hyperlinking
1. Obtain the protein sequence and annotation information
2. Perform homology analysis of the protein sequences (BLAST)
3. Perform alignment of the protein sequences and evolutionary tree mapping
2.3 Results
Function realization and result display of the viral hepatitis subject literature knowledge database
Homepage of the viral hepatitis subject literature knowledge database. See fig.5 as follows:
Figure 5 Homepage of the viral hepatitis subject literature knowledge database
Realization of protein mining for the viral hepatitis literature knowledge database.
The viral hepatitis related proteins are successfully mined in the viral hepatitis bilingual bibliographic database by using the bilingual control vocabulary, algorithm and computer program. Moreover, the viral hepatitis bilingual bibliographic database is linked to the protein database through protein mining and information integration. See fig.6, 7 and 8 as follows:
Figure 6 Page of the hepatitis viral protein mining (1)
Figure 7 Page of the hepatitis viral protein mining (2)
Figure 8 Page of the hepatitis viral protein of literature database integrating and hyperlinking to the
Uniprot protein scientific database
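The hyperlinking rule illustrated by the HBV examples in Section 2.2 amounts to appending the percent-encoded English protein name to a fixed query URL, after translating a Chinese name to English where needed. A minimal Python sketch of that rule follows. The function name `protein_link` and the one-entry translation table are hypothetical; only the base URL and the example protein names come from this paper.

```python
from urllib.parse import quote

# Base query URL as it appears in the examples in this paper.
BASE_URL = "http://lifecenter.sgst.cn/protein/cn/quickSearch.do?entrezWord="

# Hypothetical excerpt of the Chinese-to-English control vocabulary.
ZH_TO_EN = {
    "衣壳蛋白质": "Capsid protein",
}


def protein_link(term):
    """Build the hyperlink for a mined protein term.

    Chinese terms found in the table are translated to their English
    name first; other terms are used unchanged. quote() percent-encodes
    spaces as %20, matching the URL examples in the paper.
    """
    english = ZH_TO_EN.get(term, term)
    return BASE_URL + quote(english)
```

With this sketch, `protein_link("Capsid protein")` and `protein_link("衣壳蛋白质")` yield the same link, which reflects how mined Chinese and English mentions are meant to resolve to one protein entry.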
Viral hepatitis subject literature knowledge database extends three functions through data mining, information integration and hyperlinking
Obtain the hepatitis viral protein sequence and annotation information. See fig.9 as follows:
Result of homology analysis of the protein sequences (BLAST). See fig.10 as follows:
Obtain the evolutionary tree mapping. See fig.11 as follows:
Figure 9 Page of the protein sequence and annotation information of HBcAg
Figure 10 Page of homology analysis result of the HBcAg protein sequences (BLAST)
Figure 11 Page of the evolutionary tree mapping of the HBcAg protein
3 Discussion, Conclusion and Future Work
3.1 Discussion
The viral hepatitis bilingual bibliographic database was successfully built, protein text was successfully mined, and two different classes of databases were successfully integrated. However, we encountered some problems, in particular false positive results in bilingual protein text mining. Having investigated the false positives, we think there are probably three causes:
1) Low quality of the original datasets collected;
2) Insufficiently accurate and consistent use of specialized terms in building the bilingual control vocabulary;
3) In data mining and integration, the choice of algorithms, mining mode and route, or the algorithms themselves, are unsuitable, or the system has defects.
To address these problems, we use manual quality control on the collected original datasets; we refer to specialized dictionaries and consult experts to improve the accuracy and consistency of specialized term usage; and we explore different algorithms, mining modes and routes to improve the accuracy and efficiency of data mining and integration.
After the viral hepatitis bilingual bibliographic database was used and demonstrated, we received feedback from many users. Most of them appreciate the convenience of easily searching hepatitis viral protein names, locating highlighted viral protein names in search results, and accessing the UniProt database for detailed protein information through information integration and links. They also raised some questions and offered many suggestions. Overall, however, the feedback is very positive so far. In response to users’ suggestions and the problems we have discovered, the following issues are currently being considered, and some of them are already being addressed, in order to further enhance the system and make it more efficient and convenient:
1) Add more hepatitis viral protein names and their features to the English-Chinese controlled-vocabulary dictionary. This work is ongoing, and we also plan to add relationships between hepatitis viral proteins and other relevant information so as to finally construct a Chinese hepatitis viral protein ontology. It would then be possible to realize semantic-based text mining and provide users with a knowledge-based information service.
2) Integrate more factual scientific databases, especially factual gene databases. Some users are also interested in other special fields, such as evidence-based medicine, AIDS, etc. If search results on a special topic from a bibliographic database can be integrated with relevant factual scientific databases, this is certainly very helpful and convenient for users. This is an interesting direction for information integration and knowledge mining.
3.2 Conclusion
With the rapid development of viral hepatitis research, satisfying users’ information needs is becoming an inevitable challenge, so construction of the viral hepatitis bilingual literature database is important and useful. Integration of two different classes of databases via data mining and linking is innovative and is a trend in database development. Moreover, information integration and data mining are playing an increasingly important role in the big data era.
3.3 Future work
In order to solve the problems above, future work must be done as follows:
1) Constantly extend and update the datasets in the viral hepatitis bilingual literature database;
2) Constantly improve mining and integration quality so as to reduce false positive results as far as possible through algorithm improvement and machine learning;
3) Further improve the accuracy and consistency of the bilingual control vocabulary;
4) Link the viral hepatitis bilingual literature database to more factual scientific databases via data mining and information integration.
Acknowledgements
This work is supported by The Chosen Excellent Program for Introduced Outstanding Talent of Chinese Academy of Sciences in the Fields of Bibliographical Information and Periodical Publication 2010 (Subject field 100 talent program) and Chinese National Science and Technology Support Project (No.2013BAH21B06).
References
Alexopoulou, D., Wächter, T., Pickersgill, L.,
Eyre, C. and Schroeder, M.: Terminologies for
text-mining; an experiment in the lipoprotein
metabolism domain. BMC Bioinformatics.
9(Suppl 4), S2, 2008
Chen Heng, Jin Yi, Zhao Yan, Zhang Yongjuan,
Chen Chengcai, Sun Jilin, Zhang Shen.
Mining and Information Integration Practice
for Chinese Bibliographic Database of Life
Sciences. Book title: Advances in Data
Mining: Applications and Theoretical
Aspects; Vol.7987, pp.1-10, 2013. Publisher:
Springer Berlin Heidelberg. Book subtitle:
13th Industrial Conference, ICDM 2013,
New York, NY, USA, July 16-21, 2013,
Proceedings. (DOI:
10.1007/978-3-642-39736-3)
Doms, A. and Schroeder, M.: GoPubMed:
exploring PubMed with the gene ontology.
Nucleic Acids Research. Vol.33: 783-786,
2005
Feng Xinmin and Wang Jiandong. The concept
dilemma of knowledge mining and the
broad-sense knowledge mining. Journal of
Information, Vol.27 (7): 63-65, 2008
Maglott, D., Ostell, J., Pruitt, K. and Tatusova, T.:
Entrez Gene: gene-centered information at
NCBI. Nucleic Acids Research. Vol.39:
52-57, 2011
McGarry, K., Garfield, S. and Morris, N.: Recent
trends in knowledge and data integration for
the life sciences. Expert Systems. Vol.23(5):
330-341, 2006
Pasquier, C.: Biological data integration using
Semantic Web technologies. Biochimie.
Vol.90: 584-594, 2008
Sahoo, S., Bodenreider, O., Zeng, K. and Sheth,
A.: An experiment in integrating large
biomedical knowledge resources with RDF:
Application to associating genotype and
phenotype information. In: 16th
International World Wide Web Conference
(WWW2007) on Health Care and Life
Sciences Data Integration for the Semantic
Web, pp. 8-12. Banff, Canada, 2007
Yan Zhihong. Research on the integration mode
of digital information resources in Chinese
University libraries. Thesis for Master degree,
Chong Qing University, 2008
Zhang Xiaojuan, Zhang Yutao, Zhang Jieli and
Wang Juncheng. The central research issues
of information resources integration in China.
Journal of the China Society for Scientific
and Technical Information, Vol.28 (5):
791-800, 2010