=Paper=
{{Paper
|id=Vol-2619/paper2
|storemode=property
|title=Enriching Consumer Health Vocabulary Using Enhanced GloVe Word Embedding
|pdfUrl=https://ceur-ws.org/Vol-2619/paper2.pdf
|volume=Vol-2619
|authors=Mohammed Ibrahim,Susan Gauch,Omar Salman,Mohammed Alqahatani
|dblpUrl=https://dblp.org/rec/conf/ecir/IbrahimGSA20
}}
==Enriching Consumer Health Vocabulary Using Enhanced GloVe Word Embedding==
Mohammed Ibrahim [0000-0001-6842-3745], Susan Gauch [0000-0001-5538-7343], Omar Salman [0000-0003-4797-6927], and Mohammed Alqahatani [0000-0002-8872-6513]
University of Arkansas, Fayetteville AR 72701, USA
{msibrahi,sgauch,oasalman,ma063}@uark.edu
Abstract. The Open-Access and Collaborative Consumer Health Vocabulary (OAC CHV, or CHV for short) is a collection of medical terms written in plain English. It provides a list of simple, easy, and clear terms that laymen prefer to use rather than the equivalent professional medical terms. The National Library of Medicine (NLM) has integrated and mapped the CHV terms to its Unified Medical Language System (UMLS). These CHV terms are mapped to 56,000 professional concepts in the UMLS. We found that about 48% of these laymen's terms are still jargon and match the professional terms in the UMLS. In this paper, we present an enhanced word embedding technique that generates new CHV terms from consumer-generated text. We downloaded our corpus from a healthcare social media platform and evaluated our new method, which applies iterative feedback to word embeddings, using ground truth built from the existing CHV terms. Our feedback algorithm outperformed unmodified GloVe, and new CHV terms were detected.
Keywords: Medical Ontology, Word Embeddings
1 Introduction
With the advancement of medical technology and the emergence of Internet social
media, people are more connected than before. Currently, many healthcare social
media platforms provide online consultations for patients. The Pew Research Center
reported that in 2011 about 66% of Internet users looked for advice regarding their health issues [1]. Furthermore, the rate of social media use by physicians reached 90% in 2011 [2]. Physicians, nurses, or any other expert who practices medicine will not be able to interact effectively with laymen unless they have a lexical source or ontology that defines medical jargon in a clear way that laymen can easily understand.
A concerted effort involving experts in different health fields has created several medical ontologies. These ontologies, such as MeSH, SNOMED CT, and many others, were built to describe and connect professional medical concepts. The United States National Library of Medicine (NLM) combined many of these resources into one thesaurus called the Unified Medical Language System (UMLS). The UMLS is a metathesaurus consisting of more than 3,800,000 professional biomedical concepts [3].
Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In contrast to the UMLS, the Open-Access and Collaborative Consumer Health Vocabulary (OAC CHV, or just CHV for short) is a collection of medical terms written in plain English. It provides a list of easy terms that refer to professional medical concepts. The goal of developing CHV terms is to narrow the gap between laymen and medical experts and to improve the accuracy of health information retrieval [4]. Of the 3,800,000 concepts in the UMLS, only 56,000 have been assigned CHV terms. Despite the claim that the CHV contains laymen's terms mapped to professional concepts, our investigation showed that, of the 56,000 CHV terms assigned to UMLS concepts, 27,000 (48%) are still jargon and are merely morphological variations of the professional term. For these, the CHV terms differ from the professional terms only in letter case, a plural 's', or numbers and punctuation.
To address this, we propose a system that processes consumer-generated text from a healthcare platform to find new CHV terms. The system uses the Global Vectors for Word Representation (GloVe) algorithm, which we improve by applying an automatic, iterative feedback approach.
2 Related Work
Building a lexical resource or an ontology with human help can lead to a precise, coherent, and reliable knowledge base. However, it requires a great deal of human effort and consumes a lot of time. To address this, Kietz et al. [5] prototyped an approach to build a company ontology semi-automatically using an existing ontology and a company-related dictionary. Doing-Harris and Zeng-Treitler [4] developed the Open Access Consumer Health Vocabulary (CHV) using statistical approaches and text collected from the Internet. Several methods have been proposed to enrich this consumer vocabulary. He et al. [6] enriched the CHV vocabulary using a similarity-based technique: they collected posts from a healthcare social media platform and applied the k-means algorithm to find similar terms. However, their work is tied to the drawbacks of the k-means algorithm, such as choosing the number of clusters and how to initialize them. Gu et al. [7] also tried to enrich the CHV vocabulary by applying three recent word embedding methods. However, their work is not completely automatic and involves human intervention. In contrast, our proposed system is completely automatic and uses state-of-the-art methods to extract new CHV terms.
3 Methodology
To enrich the CHV terms, we need a corpus that contains many laymen terms for
medical concepts. Medhelp.org is a healthcare social media platform in which people
post information about their health issues. These posts are presented in a ques-
tion/answer format wherein people share their experiences, knowledge, and opinions
within different health communities [8]. Such healthcare social media can be an ex-
cellent source from which to extract new CHV terms.
The other requirement for our system is a set of medical concepts that have associ-
ated laymen terms to use as ground truth. For that, we used the already existing CHV
vocabulary. For each concept, we selected one term as the seed and judged the algorithms by their ability to detect synonyms, i.e., the other terms associated with the concept. We use the seed terms to locate contexts in the corpus from which new CHV vocabulary might be extracted. Figure 1 shows the steps of our system.
Fig. 1. The methodology of extracting new CHV terms
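To make this setup concrete, the following minimal Python sketch (not the authors' code; the names and data layout are illustrative) builds such a ground truth from the existing CHV: one term per concept is chosen as the seed, and the remaining terms become the synonyms an algorithm should recover.

```python
# Hedged sketch of ground-truth construction from existing CHV terms.
# `chv_concepts` maps a concept id (CUI) to its list of CHV terms.
import random

def build_ground_truth(chv_concepts, seed=0):
    rng = random.Random(seed)
    ground_truth = {}
    for cui, terms in chv_concepts.items():
        if len(terms) < 2:
            continue                      # need at least one synonym to judge
        seed_term = rng.choice(terms)
        synonyms = [t for t in terms if t != seed_term]
        ground_truth[cui] = (seed_term, synonyms)
    return ground_truth

# Toy usage: the single-term concept is skipped.
concepts = {"C0035334": ["retinitis pigmentosa", "pigmentary retinopathy"],
            "C0000000": ["headache"]}
print(build_ground_truth(concepts))
```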
3.1 GloVe Embeddings
The GloVe algorithm is a word vector representation method. It builds word embeddings using a log-bilinear model and combines the advantages of local context-window methods and global matrix factorization [9]. GloVe has many hyperparameters that can affect its results, but the window size and the vector dimension are the most influential. This paper reports results using the same settings reported in [9].
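As a rough illustration of the statistic GloVe factorizes, the sketch below (not the authors' implementation) counts weighted word-word co-occurrences within a symmetric context window; the ±10 window and the 1/distance weighting follow the description in [9], while the function and variable names are our own.

```python
# Minimal sketch: building the word-word co-occurrence counts that GloVe
# factorizes, assuming a symmetric context window (default +/-10).
from collections import defaultdict

def cooccurrence_counts(tokenized_docs, window=10):
    """Count how often each (word, context) pair appears within `window`."""
    counts = defaultdict(float)
    for tokens in tokenized_docs:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                # GloVe weights distant co-occurrences less (1/distance).
                counts[(word, tokens[j])] += 1.0 / abs(i - j)
    return counts

# Toy usage:
docs = [["stomach", "pain", "after", "eating"], ["pain", "in", "the", "stomach"]]
print(cooccurrence_counts(docs, window=2)[("stomach", "pain")])
```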
3.2 GloVe Iterative Feedback (GloVeIF)
The idea here is to feed the most similar terms that the GloVe algorithm produces back into the GloVe co-occurrence matrix. This method exploits a potential source of auxiliary information, the corpus itself, through a process of iterative feedback. Figure 2 shows the steps of this method, with the iterative feedback steps highlighted in orange. In this method, GloVe lists the terms most similar to the CHV terms, and these are iteratively fed back to GloVe to boost the counts in the co-occurrence matrix as though additional contexts were available. After GloVe trains its word vectors, a seed term is chosen for every medical concept and its top n most similar terms are listed. Our GloVeIF algorithm then iteratively submits each of these top n terms to the trained vectors to find their top k most similar terms and adds them to the top n list. For example, if top n = 10 and top k = 5, the final list of most similar terms for every seed term contains (10 × 5) + 10 = 60 terms. Once this list is ready, GloVeIF feeds it back to the GloVe model. Thus, there are two feedback steps: the first uses the trained GloVe vectors to expand the top n similar terms, and the second feeds the expanded list back into the GloVe model.
Fig. 2. The GloVeIF architecture
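The sketch below illustrates the expansion step just described, under the assumption that the trained embeddings are available as a Python dictionary of NumPy vectors; the helper names are hypothetical and this is not the authors' code.

```python
# Illustrative sketch of the GloVeIF expansion: top-n neighbors of a seed,
# plus the top-k neighbors of each of those, i.e. at most n + n*k terms.
import numpy as np

def most_similar(vectors, term, top):
    """Return the `top` nearest words to `term` by cosine similarity."""
    v = vectors[term]
    sims = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            for w, u in vectors.items() if w != term}
    return [w for w, _ in sorted(sims.items(), key=lambda x: -x[1])[:top]]

def expand_feedback_list(vectors, seed, top_n=10, top_k=5):
    """Build the list fed back to GloVe to boost co-occurrence counts."""
    first_level = most_similar(vectors, seed, top_n)
    expanded = list(first_level)
    for term in first_level:
        # The seed itself may reappear among second-level neighbors.
        expanded.extend(most_similar(vectors, term, top_k))
    return expanded

# Toy usage with made-up 2-d embeddings:
vecs = {"ray": np.array([1.0, 0.2]), "xray": np.array([0.9, 0.3]),
        "scan": np.array([0.8, 0.5]), "mri": np.array([0.5, 0.9])}
print(expand_feedback_list(vecs, "ray", top_n=2, top_k=1))
```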
4 Evaluation
Corpus and Seed Terms. We built our laymen corpus from different MedHelp communities. We downloaded all the questions in these communities on April 20, 2019. The dataset is roughly 1.3 GB and contains approximately 135,000,000
tokens. The corpus was cleaned of punctuation, numbers, and traditional stopwords. We removed any word shorter than three characters. We also created a special stopword list to remove common medical terms such as 'test', 'procedure', and 'disease'. We applied the same process to our seed term list. Moreover, the seed term list was cleaned to remove words that are morphological variants of professional medical concepts within the UMLS, so we kept only true laymen terms. Finally, only concepts with at least two associated CHV terms that occur at least 100 times are kept in the seed term list, so that there are enough contexts within the corpus for the word embeddings. The final list contains 1,257 concepts along with their associated CHV terms. Table 1 shows some of the UMLS professional concepts and their associated CHV terms.
Table 1. Examples of UMLS concepts with their associated CHV terms.
CUI | Medical Concept | CHV terms
C0035334 | retinitis pigmentosa | pigmentary retinopathy; cone rod
C0034194 | pyloric stenosis | stenos; gastric outlet obstruct
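A simplified sketch of the preprocessing described above is shown below; the generic stopword list is a tiny placeholder and the regular-expression tokenizer is an assumption, since the paper does not specify the exact tools used.

```python
# Hedged sketch of the corpus/seed-term cleaning: drop punctuation, numbers,
# stopwords, very short words, and common medical terms.
import re

GENERIC_STOPWORDS = {"the", "and", "for", "with", "have"}   # placeholder list
MEDICAL_STOPWORDS = {"test", "procedure", "disease"}        # from the paper

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)         # drop punctuation and numbers
    tokens = text.split()
    return [t for t in tokens
            if len(t) >= 3                         # drop words shorter than 3
            and t not in GENERIC_STOPWORDS
            and t not in MEDICAL_STOPWORDS]

print(clean_text("The MRI test showed a 3 cm cyst; no disease was found."))
# -> ['mri', 'showed', 'cyst', 'was', 'found']
```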
Baselines and Evaluation Metrics. We compared GloVeIF with the baselines reported in [7]. One of the baselines is GloVe itself, and the other two are Word2Vec [10] and FastText [11]. For the ground truth dataset, we used the seed term list created from the existing CHV vocabulary. A random term is picked from each concept's associated CHV terms and used as the seed term. The top 10 terms most similar to that seed are listed and compared with the concept's remaining CHV terms. For our accuracy measurements, we applied precision, recall, F-score, and mean reciprocal rank (MRR) to measure the performance of the system. We also measured the number of concepts for which each algorithm was able to detect associated terms.
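The sketch below shows one plausible per-concept formulation of these metrics; it is our assumption of how precision, recall, F-score, and reciprocal rank are computed over a seed's top-n list, as the paper does not give explicit formulas.

```python
# Per-concept evaluation sketch: `retrieved` is the ranked top-n list for a
# seed term, `relevant` the concept's remaining CHV terms (ground truth).
def precision_recall_f1(retrieved, relevant):
    hits = [t for t in retrieved if t in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def reciprocal_rank(retrieved, relevant):
    for rank, term in enumerate(retrieved, start=1):
        if term in relevant:
            return 1.0 / rank          # averaged over concepts to obtain MRR
    return 0.0

print(precision_recall_f1(["itchy", "dry", "rash"], {"itchy", "rash", "eczema"}))
print(reciprocal_rank(["dry", "itchy", "rash"], {"itchy", "rash"}))  # 0.5
```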
5 Results and Discussion
Table 2 shows the results of GloVeIF and the comparison with the baselines. All algorithms were run with their default settings, except that the vector dimension was set to 100 and the window size to ±10. All results are reported with top n = 10, and for GloVeIF the top k is set to 5.
Table 2. Results of running the algorithms.
Algorithm | Precision (%) | Recall (%) | F-score (%) | MRR | Concepts
Word2Vec | 21.24 | 16.66 | 18.61 | 0.26 | 9
GloVe | 15.86 | 12.50 | 13.98 | 0.35 | 18
GloVeIF | 17.56 | 13.39 | 15.19 | 0.27 | 21
Our GloVeIF algorithm outperformed the basic GloVe algorithm with an 8.7% relative improvement in F-score (from 13.98% to 15.19%). GloVeIF also outperformed all other algorithms by identifying the highest number of concepts. However, the basic GloVe algorithm was the best in terms of MRR. On average, both the basic GloVe and GloVeIF found the true synonym of the seed term around the 4th position of the ranked similar terms. Table 2 also shows that
the Word2Vec algorithm with a continuous bag of words (CBOW) model outperformed the other algorithms in terms of precision, recall, and F-score. However, it found synonyms for only half or fewer as many concepts as the other algorithms. We do not report results for the FastText algorithm or for Word2Vec with the Skip-Gram model because they were unable to detect any CHV terms related to the seeds.
The ground truth list has about 1,200 concepts along with their associated CHV terms. Among the five algorithms tested, our GloVeIF algorithm was able to detect associated terms for the largest number of concepts. However, at 21, this number is still very low. We believe the number of detected concepts is so low because the evaluation used the existing CHV terminology. We previously mentioned that 48% of the existing CHV terms are just morphological variations of the UMLS terms (and thus removed from our data set). Moreover, many of the remaining CHV terms are not truly laymen's terms, and the users tend to be either laymen asking questions who use common terms only, or experts providing answers who use professional terms only, so the seeds and the candidate CHV terms do not co-occur frequently enough to be discovered by word embedding methods. Nevertheless, the raw results of GloVeIF are quite promising; it seems to do an excellent job of identifying laymen's terms that are not in the current CHV. We displayed a list of 500 seed terms, along with their topmost similar term, to three judges1 who rated their relatedness as 1 (related) or 0 (not related). This informal human validation found that 80% of the time the most similar term was related to the seed term. Table 3 shows some of the results that GloVeIF detected. The most similar terms in Table 3 are sorted by their degree of similarity to the seed terms.
Table 3. Some of the seed terms and the most similar terms that GloVeIF produced.
Seed term | Term 1 | Term 2 | Term 3 | Term 4
bowel | bladder | constipation | diarrhea | intestine
skin | itch | itchy | dry | irritate
ray | xray | scan | mri | spine
Since MedHelp.org posts are all related to health issues, we can see that some of the seed terms are general, such as the term skin, but their most similar terms are health issues such as skin itching, dry skin, and skin irritation. Although clearly related to the seed term, none of these candidate terms were in the associated CHV concept.
1 The judges were the first, third, and fourth authors of this paper.
6 Conclusion and Future Work
This paper presents a method to enrich a consumer health vocabulary (CHV) with new terms from healthcare social media texts using word embeddings. We began by demonstrating that the CHV contains many terms that are not true laymen's terms. Our algorithm, GloVeIF, is automatic and identifies new terms by locating synonyms of seed CHV terms. We concluded by demonstrating that many of the top synonyms proposed by GloVeIF were related terms, even though they do not appear in the current CHV. For future work, we suggest further investigation of the number of concepts and CHV terms that can be detected. We also suggest implementing the same seed term approach with more recent word embedding methods.
References
1. S. Fox, “Health Topics,” Pew Research Center: Internet, Science & Tech, 01-Feb-2011. https://www.pewinternet.org/2011/02/01/health-topics-3/ (accessed Oct. 21, 2019).
2. M. Modahl, L. Tompsett, and T. Moorhead, “Doctors, Patients & Social Media,” Social Media, p. 16, 2011.
3. O. Bodenreider, “The Unified Medical Language System (UMLS): integrating biomedical terminology,” Nucleic Acids Research, vol. 32, no. suppl_1, pp. D267–D270, 2004.
4. K. M. Doing-Harris and Q. Zeng-Treitler, “Computer-assisted update of a consumer health vocabulary through mining of social network data,” Journal of Medical Internet Research, vol. 13, no. 2, p. e37, 2011.
5. J.-U. Kietz, A. Maedche, and R. Volz, “A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet,” p. 15, Oct. 2000.
6. Z. He, Z. Chen, S. Oh, J. Hou, and J. Bian, “Enriching consumer health vocabulary through mining a social Q&A site: A similarity-based approach,” Journal of Biomedical Informatics, vol. 69, pp. 75–85, 2017.
7. G. Gu et al., “Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach,” JMIR Medical Informatics, vol. 7, no. 2, p. e12704, 2019, doi: 10.2196/12704.
8. H. Kilicoglu et al., “Semantic annotation of consumer health questions,” BMC Bioinformatics, vol. 19, no. 1, p. 34, 2018.
9. J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
10. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
11. P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.