=Paper=
{{Paper
|id=Vol-1179/CLEF2013wn-CLEFeHealth-BedrickEt2013
|storemode=property
|title=Lucene, MetaMap, and Language Modeling: OHSU at CLEF eHealth 2013
|pdfUrl=https://ceur-ws.org/Vol-1179/CLEF2013wn-CLEFeHealth-BedrickEt2013.pdf
|volume=Vol-1179
|dblpUrl=https://dblp.org/rec/conf/clef/BedrickS13
}}
==Lucene, MetaMap, and Language Modeling: OHSU at CLEF eHealth 2013==
Steven Bedrick and Golnar Sheikshabbafghi
Center for Spoken Language Understanding, Oregon Health & Science University, Portland, OR, USA
{bedricks,sheiksh}@ohsu.edu

Abstract. The Oregon Health & Science University team's participation in task #3 ("addressing patients' medical questions") of this year's eHealth CLEF campaign included submissions from two different retrieval systems. The first was a traditional, Lucene-based system modified from one used in previous years' TREC-med campaigns; the second was a novel system that used statistical language modeling techniques to perform text retrieval. Since 2013 was the first year of our participation in this campaign, our focus was on familiarizing ourselves with working on a corpus of web text, as well as putting together a proof-of-concept implementation of a language-model retrieval system. We submitted three runs in total: one from the novel system, and two from our Lucene-based system, one of which made use of the National Library of Medicine's MetaMap tool to perform query expansion. In general, our runs did not perform particularly well, although there were several topics for which our language model-based retrieval system produced the best P@10. Future work will focus on pre-indexing text normalization as well as a more sophisticated approach to query parsing.

Keywords: Lucene, MetaMap, language model, skip-grams

1 Introduction

Most research into medical information retrieval can be categorized into one of two broad and fuzzy groupings. The first ("type A") concerns itself with the information needs of clinicians, and focuses on searching specialized databases in response to specific and well-informed topics. The second category ("type B") deals with so-called "consumer" information needs: searches conducted by non-medical users (often patients or family members) over the Internet. Earlier medical IR evaluation campaigns have tended to focus on the first of these two groupings.

For example, for several years TREC included a medical track in which participants built and tested search systems designed to index and query electronic health records in order to identify patients matching particular textual descriptions (e.g., "Patients admitted to the hospital with end-stage chronic disease who are offered hospice care")[1]. Without in any way diminishing the substantial difficulties inherent in performing high-quality open-class IR over a clinical database, we believe that it is safe to say that this first family of medical IR systems enjoys several significant advantages over the second family (systems designed for the lay population, and intended to work on a general web corpus). A system designed for indexing and querying a clinical data repository of some kind has the advantage of a (relatively speaking) predictable data schema, as well as (again, relatively) a finite amount of content, both in terms of amount (i.e., how many unique records) and kind. After all, while there may be a very large number of types of pathology report, that number is generally both knowable and tractable. A search system designed for consumers and the open Web, however, must be able to handle an essentially infinite variety of input documents and user queries, and must do so with far less context (about both its corpus and its users; while the information literacy of clinicians certainly varies[2], it is safe to say that this variability is smaller than that found among the users of a general-purpose medical search engine) than do its more constrained cousins.
The third task of this year's ShARe/CLEF eHealth track features just such a search scenario. The task is described in full elsewhere[3]; as such, we will keep our description brief. The task was an open-query document search over a corpus consisting of ≈1.6 million web pages ostensibly containing health-related information. The topics consisted of quasi-natural-language phrases (e.g., "is there a connection between multiple sclerosis and dysplasia in oesophagus") that represented the sorts of queries that actual patients might enter into a search engine such as Google.

Our group submitted runs derived from two different retrieval systems (described in Section 2). While our results were, with one or two exceptions, not particularly impressive (see Section 3), we feel that we have laid a solid technical foundation for next year's CLEF campaign. Furthermore, our experience highlights several important differences between "type A" and "type B" medical retrieval systems in terms of how best to use external resources.

2 Methods

We submitted runs from two separate systems. The first system was a very traditional IR system based on the Apache Lucene (http://lucene.apache.org) open-source toolkit. It was essentially a spiritual successor to the system used by the OHSU team for the 2011 and 2012 TREC medical tracks[4]. The second system was a novel system that uses techniques from statistical language modeling to perform retrieval (see Chapter 12 of Manning et al.'s "Introduction to Information Retrieval" for an overview of the general approach[5]). The two systems were quite different in terms of their operation, and we will describe each in turn.

2.1 Traditional System

The first of our two systems followed a very traditional architecture for a text retrieval system, in that it featured a standard inverted index paired with a vector-space retrieval model. As mentioned above, we used the Lucene open-source IR toolkit (version 4.2.1) to build the system, with no major modifications from its out-of-the-box configuration. Our intent was to develop a system to use as a reasonable baseline, and Lucene let us accomplish this with minimal difficulty. Lucene also provided us with robust index creation tools as well as a rich query language. This baseline Lucene system used the default Lucene StandardAnalyzer query processor. Our system had an alternative query processing mode, which we used for one of our additional runs. This mode makes use of our existing MetaMap-based query parser used in previous years' TREC campaigns (see [4]), described in greater detail below.
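As a rough illustration of what an out-of-the-box Lucene 4.x configuration of this kind looks like, the sketch below indexes a document with separate title and body fields and queries it through the default StandardAnalyzer-backed query parser. The field names, index path, and example text are ours for illustration only and are not taken from the actual system.

<pre>
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BaselineSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        Directory dir = FSDirectory.open(new File("index"));  // hypothetical index location

        // Index one document with separate title and body fields.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_42, analyzer));
        Document doc = new Document();
        doc.add(new TextField("title", "Aortic stenosis: treatment and prognosis", Field.Store.YES));
        doc.add(new TextField("body", "Aortic stenosis is a narrowing of the aortic valve ...", Field.Store.NO));
        writer.addDocument(doc);
        writer.close();

        // Query the body field using the StandardAnalyzer-backed query parser.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        QueryParser parser = new QueryParser(Version.LUCENE_42, "body", analyzer);
        ScoreDoc[] hits = searcher.search(parser.parse("treatment for aortic stenosis"), 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            System.out.println(searcher.doc(hit.doc).get("title") + "  " + hit.score);
        }
    }
}
</pre>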
Indexing

Due to the large size of the corpus, building the Lucene index of the documents proved to be a non-trivial task. We chose to use a 500-node Hadoop (http://hadoop.apache.org) cluster to facilitate the process. Hadoop is an open-source implementation of the Map/Reduce pattern, which was first popularized by Google as a way of easily parallelizing certain computing tasks. A complete description of Map/Reduce is beyond the scope of this document (see Dean & Ghemawat[6] for a more thorough description); in short, a Map/Reduce program splits a task into two steps, "map" and "reduce." In the map step, each input document is processed in parallel and is transformed into a set of key/value pairs {<ki, vi>, <kj, vj>, ...}. In the reduce step, the key/value pairs are aggregated by their key into sets of the form {<ki, {vi1, vi2, ...}>, <kj, {vj1, vj2, ...}>}, and each key's associated values are processed together. The ultimate point of this approach is that each execution of the map step can be run completely independently from any others, and a similar amount of parallelization can often be achieved in the reduce phase.

The canonical example is a distributed word-counting operation. In this case, the mapper would take as input a document, and emit for each token in the document a key/value pair in which the key is the token and the value is the number of times that that token appeared in that document. The reducer, then, would take as its input a single token along with a set of counts (one from each input document that contained that token); its job would be to sum the counts and emit the total number of occurrences of that token in the corpus. Because each document can be counted independently of the others, and each token's occurrence counts can be summed independently of any other tokens' counts, a word-counting program using this approach can benefit greatly from a parallel computing environment.

The Map/Reduce model lends itself extremely well to the creation of inverted indices. Consider the most trivial case, in which the map step emits terms and postings as keys and values, and the reduce step produces posting lists. For our purposes, we used Map/Reduce to produce a Lucene index. Each mapper processed a subset of the entire corpus, and produced a single index shard; the shards were then run through the Lucene API's index-merging tools to produce a single, large index. The final index size was approximately 7.5 gigabytes, which proved to be well within Lucene's capabilities. We indexed both the document titles as well as their bodies, after stripping the bodies of their HTML tags.
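To make the "trivial case" above concrete, here is a hedged sketch of a Hadoop mapper and reducer that emit <term, docId> pairs and collect them into posting lists. It assumes a hypothetical one-document-per-line input format ("docId<TAB>body"); our actual job instead had each mapper write a Lucene index shard, which is not shown here.

<pre>
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: for each document, emit one <term, docId> pair per distinct token.
class PostingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed input format: one document per line, "docId<TAB>body".
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) return;
        String docId = parts[0];
        Set<String> seen = new HashSet<String>();
        for (String token : parts[1].toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && seen.add(token)) {
                context.write(new Text(token), new Text(docId));
            }
        }
    }
}

// Reduce step: gather every docId seen for a term into a single posting list.
class PostingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text docId : docIds) {
            if (postings.length() > 0) postings.append(',');
            postings.append(docId.toString());
        }
        context.write(term, new Text(postings.toString()));
    }
}
</pre>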
MetaMap

As mentioned above, our baseline system uses Lucene's default StandardAnalyzer to process free-text queries. During previous years' TREC campaigns, we developed a query parser that uses the National Library of Medicine's (NLM) MetaMap tool[7] to attempt to identify query terms that are "medically-relevant." MetaMap uses a variety of NLP techniques to map unstructured text to concepts from the Unified Medical Language System (UMLS) Metathesaurus[8]. Our query parser has a variety of operational modes, including several that perform query expansion by including sibling entry terms for any concepts matched from the NLM's Medical Subject Headings (MeSH) indexing system. For a complete description of the operation of this part of our system, consult our 2012 TREC Medical Track working notes paper[4]. In short, our query parser takes unstructured text as input, and in an unsupervised manner produces a (sometimes complex, and often suboptimal) query in Lucene's syntax, making use of various Boolean operators as appropriate. In the simplest operation mode, our query parser uses MetaMap to analyze the free-text queries and identify any biomedical concepts. From there, we use the UMLS to identify possible synonyms for these concepts, and link those together using Boolean "OR." Each term group is then linked using Boolean "AND." The parser contains several "stop-word" lists: sets of CUIs or UMLS Semantic Types that it ignores and does not include in its final queries. Typically, CUIs or Semantic Types end up on this list due to being overly common in either queries or in documents (e.g., "Patients").

This approach worked reasonably well when used in previous years' campaigns, which involved querying electronic medical records for specific patient profiles. We were unsure as to how well it would work when querying the less formally-written and similarly less-focused content found in the present task, but decided that it was worth trying.
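The sketch below shows the kind of Boolean structure this produces in Lucene 4.x, assuming the per-concept synonym lists have already been obtained from MetaMap and the UMLS. The synonym groups shown are invented for illustration, and real UMLS synonyms are often multi-word strings that would need phrase rather than single-term queries; this is not the actual parser's output.

<pre>
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SynonymQuerySketch {
    // Build "(syn OR syn OR ...) AND (syn OR ...)" in Lucene's Boolean query model.
    static Query buildQuery(List<List<String>> conceptGroups) {
        BooleanQuery query = new BooleanQuery();
        for (List<String> synonyms : conceptGroups) {
            BooleanQuery group = new BooleanQuery();
            for (String synonym : synonyms) {
                // Synonyms within a concept group are optional ("OR").
                group.add(new TermQuery(new Term("body", synonym)), BooleanClause.Occur.SHOULD);
            }
            // Each concept group as a whole is required ("AND").
            query.add(group, BooleanClause.Occur.MUST);
        }
        return query;
    }

    public static void main(String[] args) {
        // Hypothetical synonym groups, standing in for a MetaMap/UMLS lookup.
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("stenosis", "narrowing"),
                Arrays.asList("treatment", "therapy"));
        // Prints something like: +(body:stenosis body:narrowing) +(body:treatment body:therapy)
        System.out.println(buildQuery(groups));
    }
}
</pre>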
2.2 Language Model System

In addition to our traditional baseline system, we submitted a run from an entirely novel retrieval system. This system used techniques borrowed from statistical language modeling to attempt to identify documents that were statistically similar to the queries. Conceptually, we are using the language model to tell us, for each document, the probability of that document generating the query. More "relevant" documents should, in principle, have a higher probability than less "relevant" documents. Slightly more formally, and using notation from [9], our approach computes a language model Md for each document d in the corpus. Then, for any given query q, we use the language models to compute P(q|Md) for each document d, and can rank the documents in descending order of this probability. This approach presents us with some interesting experimental possibilities. There are a wide variety of different text normalization approaches, model smoothing techniques, scoring functions, etc. that we wish to experiment with (see Section 4).

Text Normalization

In any language modeling task, the decision of how to normalize one's input is critical. We were somewhat aggressive in our normalization approach. We first dropped all non-ASCII letters, since the task a priori only involved English-language queries over English-language documents. After alphabetical pruning, we compiled a list of more than 3 million tokens, most of which were not actual words but were instead fragments, numbers, etc. We then computed document frequencies for each token in order to compile a list of "stop tokens." We were primarily concerned with removing overly common tokens. By manually examining the relative frequencies of tokens, we set a document frequency threshold of 700,000; that is, we considered tokens appearing in more than that many documents to be "too common," and excluded them from both the documents as well as the queries.

Language Model

The language modeling approach we used was very similar to a standard bigram language model with an absolute discounting backoff scheme (we used an absolute discount parameter value of 0.5). However, instead of using strict bigrams, we instead counted tokens that co-occurred within an 11-word window (i.e., word pairs that had a maximum distance of ten words). In other words, if two tokens occurred relatively near to each other, they were counted as a bigram. Furthermore, the distance calculation took place after pruning stop words, so the two tokens could potentially have been more than ten tokens apart in the original document. In the language modeling world, these are referred to as skip-grams, and are often used to allow a language model to capture additional information about the contents of a text (for more details, two good places to start would be the work of Siu & Ostendorf[10] and that of Guthrie et al.[11]). In our case, we chose to use skip-grams in part to compensate for the large difference in length between the documents and the queries, which can cause problems relating to model sparsity. By allowing this "slop" we are effectively increasing the number of possible bigrams from each query, thereby increasing the number of chances for that query's bigrams to occur in the per-document language models.
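As an illustration of the windowed counting described above, the following minimal sketch collects ordered token pairs that co-occur within a configurable maximum distance over an already-normalized token sequence. The method and variable names are ours, not the system's.

<pre>
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SkipGramSketch {
    // Count ordered token pairs that co-occur within the given maximum distance
    // (a distance of 10 corresponds to an 11-word window).
    static Map<String, Integer> skipGramCounts(List<String> tokens, int maxDistance) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < tokens.size(); i++) {
            for (int j = i + 1; j <= Math.min(i + maxDistance, tokens.size() - 1); j++) {
                String pair = tokens.get(i) + " " + tokens.get(j);
                Integer old = counts.get(pair);
                counts.put(pair, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stop tokens are assumed to have been removed already, so these pairs may
        // span more than ten tokens of the original document text.
        List<String> doc = Arrays.asList("multiple", "sclerosis", "dysplasia", "oesophagus");
        System.out.println(skipGramCounts(doc, 10));
    }
}
</pre>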
Scoring & Ranking Documents

Given a normalized query and a normalized set of documents, we compute the log-probability for all pseudo-bigram pairs present in the query occurring in each document, just as one would using a standard bigram language model when estimating the log-probability of a sentence. However, in our case, we are not strictly calculating the log-probability of the query vs. each document. Because of our adjusted bigram scheme, each bigram may potentially be counted multiple times depending on how the documents' tokens are arranged; as such, the actual probabilities we calculate are somewhat smaller than they would be in a traditional bigram scheme. However, as we are calculating it in the same way across documents and queries, the probabilities themselves are directly comparable with one another.

As such, in order to rank the documents in order of "relevance," we need simply to rank the documents in descending order of probability (i.e., the documents with the highest probabilities for a given query are ranked highest in the result list). There are any number of different ways to choose a threshold at which to cut off results; we chose a fairly robust method in which we used the histogram of calculated probabilities for a given query to choose a reasonable threshold.
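A rough sketch of the scoring step is given below. It uses an interpolated absolute-discounting estimate (our exact backoff formulation may differ in detail), takes pre-computed per-document skip-gram statistics as input, and the unigram floor for unseen words is an arbitrary illustrative choice. Documents would then be sorted by this score and cut off at a threshold chosen from the score histogram, as described above.

<pre>
import java.util.Map;

public class AbsoluteDiscountSketch {
    // Simplified interpolated absolute discounting for one query against one document:
    //   P(w2 | w1) = max(c(w1,w2) - D, 0) / c(w1,*)  +  (D * T(w1) / c(w1,*)) * Puni(w2)
    // where c(w1,*) is the total skip-gram count with left word w1, T(w1) is the number
    // of distinct right words seen after w1, and Puni is a unigram fallback model.
    static double logScore(Iterable<String[]> queryPairs,
                           Map<String, Integer> pairCounts,   // "w1 w2" -> skip-gram count in the document
                           Map<String, Integer> leftTotals,   // w1 -> c(w1,*)
                           Map<String, Integer> leftTypes,    // w1 -> T(w1)
                           Map<String, Double> unigramProb,   // w2 -> Puni(w2)
                           double discount) {                 // e.g., 0.5
        double logProb = 0.0;
        for (String[] pair : queryPairs) {
            String w1 = pair[0], w2 = pair[1];
            int total = leftTotals.containsKey(w1) ? leftTotals.get(w1) : 0;
            double uni = unigramProb.containsKey(w2) ? unigramProb.get(w2) : 1e-9;  // arbitrary floor
            double p;
            if (total == 0) {
                // w1 unseen in this document: fall back entirely to the unigram model.
                p = uni;
            } else {
                String key = w1 + " " + w2;
                int c = pairCounts.containsKey(key) ? pairCounts.get(key) : 0;
                int types = leftTypes.containsKey(w1) ? leftTypes.get(w1) : 1;
                p = Math.max(c - discount, 0.0) / total + (discount * types / total) * uni;
            }
            logProb += Math.log(p);
        }
        return logProb;  // documents are then ranked in descending order of this score
    }
}
</pre>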
3 Results

Unfortunately, our runs did not perform particularly well this year. Generally speaking, all three of our runs performed below the median in terms of P@10, and none achieved an overall mean average precision score higher than 0.1. However, there were several topics for which we achieved adequate performance, and there were even several topics for which one or the other of our runs appeared to have the best P@10. Our official results can be seen in Table 1. As is often the case in these kinds of evaluation campaigns, we observed a large amount of variation in system performance across topics (see Figure 1). We also noted a similarly large amount of variation in the number of documents judged relevant for each topic, with one topic having zero relevant documents, seven more having fewer than ten, and one topic apparently having 610 relevant documents (topic qtest19, "is abdominal pain due to helicobacter pylori a symptom of cancer").

Table 1. Official results, including precision at rank 10 (P@10), mean average precision (MAP), and normalized discounted cumulative gain at rank 10 (nDCG@10).
Run                     P@10    MAP     nDCG@10
Traditional (Baseline)  0.2300  0.0953  0.2436
Lang. Model             0.2600  0.0999  0.2344
Baseline w/ MetaMap     0.1620  0.0816  0.1706

3.1 Traditional System

Our baseline system (essentially plain-vanilla Lucene, without external annotations of any sort) did not perform particularly well. In terms of bpref, there were only three topics for which it achieved a score higher than 0.5, and there was a great deal of variability among the remaining topics (see Figure 1). In terms of P@10, its performance was similarly dismal. There were 14 topics for which its P@10 was 0.0; one topic (qtest29, "what is prognosis and the treatment for aortic stenosis") came in at 0.8, with a small handful coming in above 0.5. Most of the topics had P@10 scores of between 0.1 and 0.4.

[Fig. 1. bpref score by topic, "Traditional" (baseline) run (bar chart, one bar per topic, qtest1 through qtest50). Note the extreme between-topic variability in system performance; note also the large number of topics for which the scored bpref was 0.0.]

In addition to our baseline system, we also submitted a run from a system that used MetaMap to perform query expansion (see Section 2.1). In previous evaluation campaigns, this approach had often resulted in improved performance. In this task, the picture was somewhat more complex. Often, rather than retrieving more potentially-useful results, the query expansion (which made rather aggressive use of Boolean query operators) would result in fewer results being retrieved (often many fewer, down to and including zero). Sometimes, as in the case of qtest27, this would result in improved precision (in this case, from a P@10 of 0.0 in the baseline run to 0.6 in the MetaMap run). More often, however, this would result in a decrease in precision, simply due to the query's Boolean criteria preventing more than a handful of articles from being retrieved. Overall, there were 17 topics for which the MetaMap query expansion hurt the system's performance in terms of bpref, and 18 for which it had no effect. That leaves 15 for which the query expansion improved bpref (see Table 2). In three of these cases (qtest7, qtest27, and qtest46), the MetaMap run's non-zero bpref represented an improvement over a baseline bpref of 0.0. The remaining cases of improvement fell into two categories. The first consisted of several runs that featured relatively modest improvements (MetaMap:Baseline bpref ratios ranging from ≈1.06 to ≈1.70). The second consisted of a small number of runs that experienced a much more dramatic improvement, with bpref ratios ranging from ≈4.40 to ≈5.15.

Table 2. bpref scores for topics that benefited from the MetaMap query expansion described in Section 2.1.
Topic    Baseline  MetaMap  Abs. Diff  Rel. Diff
qtest31  0.6426    0.6816   0.039      1.06069
qtest5   0.3751    0.4568   0.0817     1.21781
qtest11  0.0998    0.1315   0.0317     1.31764
qtest30  0.4277    0.5723   0.1446     1.33809
qtest37  0.3412    0.4646   0.1234     1.36166
qtest24  0.3937    0.5728   0.1791     1.45491
qtest32  0.1562    0.2461   0.0899     1.57554
qtest39  0.112     0.1936   0.0816     1.72857
qtest50  0.0463    0.2037   0.1574     4.39957
qtest36  0.1319    0.6019   0.47       4.56331
qtest34  0.1479    0.6805   0.5326     4.60108
qtest17  0.085     0.4375   0.3525     5.14706
qtest7   0         0.1389   0.1389     N/A
qtest27  0         0.4444   0.4444     N/A
qtest46  0         0.0123   0.0123     N/A

Looking at the final processed form of the queries that experienced the most performance improvement, the thing that stands out is that they are simpler and more parsimonious than they were before being processed by MetaMap. This was not a universal result of being run through our MetaMap processor; quite often, the final queries would be quite large and complex, with multiple Boolean clauses. In these cases of notable improvement, it seems as though MetaMap acted as something of a filter, stripping out unnecessary and noisy terms and replacing them with a small number of salient terms. In our future work, we plan to investigate this phenomenon further.

3.2 Language Model System

Strictly speaking, our language model system (LMS) did not perform particularly well. There were several topics for which we did not retrieve very many results, period, much less many relevant results. Overall, our LMS performed even less consistently than did the baseline system (see Figure 2). That said, there were several runs for which the LMS outperformed the baseline system by a wide margin. For example, for qtest49, the LMS:Baseline bpref ratio was 3.69, representing a very notable increase in performance.

All in all, there were 19 topics for which the LMS outperformed our baseline system in terms of bpref, and 7 for which there was no difference in performance. Five of the topics with improved performance saw at least a doubling of bpref as compared to the baseline, and one more came very close to double (bpref ratio of ≈1.97); on the other hand, there were 25 topics for which the LMS exhibited decreased bpref performance as compared to the baseline. In terms of P@10, there were six topics for which our LMS run appears to have achieved the best score. In one of these topics (qtest12), we appear to have outperformed the median by a fairly significant margin; the other five topics (qtest20, qtest31, qtest42, qtest43, and qtest47) appear to have been more challenging, as the degree to which we outperformed the median was smaller.

[Fig. 2. bpref score by topic, "Language Model" run (bar chart, one bar per topic, qtest1 through qtest50). As compared with the baseline system (see Figure 1), the language model system was less consistent but higher-performing.]

One unexpected attribute of the LMS was that diagnosing performance issues was quite difficult. With our baseline system, we can easily examine our generated queries, and can also directly inspect our index. With the LMS, however, things are much more opaque. As such, it is not easy for us to say why it did particularly well or badly on any given topic. Future work will explore ways to improve this state of affairs.
4 Conclusions & Future Work

Our results in this year's evaluation campaign were somewhat disappointing. However, we have now established two different solid baseline systems, and have also demonstrated that our existing query expansion system is not completely useless in a more general medically-themed retrieval task. Clearly, however, there is much for us to do in preparation for next year's evaluation campaign. Our future work will consist of work on several fronts.

The first and most important next step will be to improve our indexing strategy. One issue that our baseline system ran into during development was that many documents in the corpus contained extraneous text that led to numerous false positives (sometimes in the form of page navigation elements, other times from intentional keyword frequency manipulation on the part of the page's authors, e.g., a hidden div filled with a smörgåsbord of medical terms unrelated to the ostensible topic of the page). We will be working on ways to detect content zones within the page such that we only index the truly relevant text from each document. We also plan on experimenting with various other approaches, such as indexing anchor or heading text separately from the rest of the document body, and boosting their value in the search index.

A second area of work will be in improving our MetaMap-based query parser and expansion system. Right now, it is only using a subset of the vocabularies in the UMLS, which we believe limited its ability to successfully map concepts from the queries into UMLS CUIs. Furthermore, in its present form, the parser is relatively inflexible and makes overly-aggressive use of Boolean operators. This led to over-constrained queries as well as queries containing duplicate terms. We hope to improve this state of affairs.

A third area will be to explore ways to make use of syntactic data derived from the queries themselves to better understand what the queries are actually asking about. This could include constituent or dependency parsing, part-of-speech tagging, and so on. We will also experiment with performing this sort of analysis on the documents as well.

Our language model retrieval system has many possible avenues for improvement. Besides making use of the aforementioned syntactic data, we plan to experiment with different skip-gram window widths as well as some of the algorithm's other tuning parameters. This sort of optimization work was difficult to do during the development phase of this campaign due to our lack of supervised training data; our hope is that, with access to the qrels from this year's campaign, we will be able to improve our approach.

We are excited to have participated in this year's evaluation campaign, and eagerly await next year's. Although we were disappointed by our systems' performance, we see a lot of potential in our fundamental approach, and look forward to the opportunity to develop our ideas further.

References

1. Voorhees, E., Hersh, W.: Overview of the TREC 2012 medical records track. In: The Twenty-First Text REtrieval Conference Proceedings (TREC 2012). (2012)
2. Hersh, W.R., Crabtree, M.K., Hickam, D.H., Sacherek, L., Friedman, C.P., Tidmarsh, P., Mosbaek, C., Kraemer, D.: Factors associated with success in searching MEDLINE and applying evidence to answer clinical questions. J Am Med Inform Assoc 9(3) (2002) 283–93
3. Suominen, H., Salanterä, S., Velupillai, S., Chapman, W.W., Savova, G., Elhadad, N., Mowery, D., Leveling, J., Goeuriot, L., Kelly, L., Martinez, D., Zuccon, G.: Overview of the ShARe/CLEF eHealth Evaluation Lab 2013. In: CLEF 2013. Lecture Notes in Computer Science (LNCS), Springer (2013)
4. Bedrick, S., Edinger, T., Cohen, A., Hersh, W.: Identifying patients for clinical studies from electronic health records: TREC 2012 medical records track at OHSU. In: The Twenty-First Text REtrieval Conference Proceedings (TREC 2012). (2012)
5. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
6. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI'04: Sixth Symposium on Operating System Design and Implementation, USENIX Association (2004)
7. Aronson, A.R., Lang, F.M.: An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc 17(3) (2010) 229–36
8. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Database issue) (Jan 2004) D267–70
9. Manning, C.D., Raghavan, P., Schütze, H.: Language Models for Information Retrieval. Chapter 12 of [5]
10. Siu, M., Ostendorf, M.: Variable n-grams and extensions for conversational speech language modeling. IEEE Transactions on Speech and Audio Processing 8(1) (2000) 63–75
11. Guthrie, D., Allison, B., Liu, W., Guthrie, L., Wilks, Y.: A closer look at skip-gram modelling. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006). (2006) 1–4