<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Supervised Visualization of Vocabulary Knowledge towards Explainable Support of Second Language Learners</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yo Ehara</string-name>
          <email>ehara.yo@sist.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Shizuoka Institute of Science and Technology</institution>
          ,
          <addr-line>2200-2, Toyosawa, Fukuroi, Shizuoka</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>25</lpage>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In second language learning, it is crucial to identify gaps in
knowledge of the language between second language
learners and native speakers. Such a gap exists even when learning
a single word in a second language. As the semantic
broadness of a word differs from language to language, language
learners must learn how broadly a word can be used in a
language. For example, certain languages use different words for
“period” in “a period of time” and in “period pains,” yet both are
nouns. Learners whose native languages are such languages
typically have only partial knowledge of a word, even though
they think they know the word “period,” producing a gap
between them and native speakers. Language learners typically
want explanations for these word-usage differences, which
even native speakers find difficult to explain and
costly to annotate. To support language learners in noticing
these challenging differences easily and intuitively, this
paper proposes a novel supervised visualization of the usages
of a word. In our method, the usages of an inputted word in
large corpora written by native speakers are visualized,
taking the semantic proximity between the usages into account.
Then, for the single inputted word, our method makes a
personalized prediction of word usages that each learner may
know, based on his/her results of a quick vocabulary test,
which takes approximately 30 minutes. The experimental
results show that our method’s adjusted usage frequency
counts predict vocabulary test responses better than raw
usage frequency counts, implying that the word usage
prediction is accurate.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>Acquiring a second language requires repeated efforts to
narrow the gap between language learners’ knowledge of the
language and that of native speakers. Making such gaps
intuitively understandable greatly helps language learners
self-teach the language and also helps researchers build
effective language tutoring systems. Some gaps, such as
vocabulary size or time spent in language learning, are intuitively
easy to understand and, hence, are well studied. However,
in second language learning, most gaps are related to
meaning and semantics and are inherently abstract. Hence,
visualizing these gaps is essential to make these gaps intuitively
understandable.</p>
      <p>The broadness of a word, or how a word can be used in
the language to express different concepts, is one such
abstract gap (Read 2000). Because the meaning of a word
differs from language to language, when learning a word in a
second language, there typically exists a gap between what
learners think the word means and how the word is
actually used in the language. Polysemous words are examples
that are easy to understand: “book” can mean an item
associated with reading, or it can mean to make a reservation.
Other than these examples, to which the part-of-speech
tagging techniques in natural language processing (NLP) seem
applicable, some examples are more subtle: some languages
always use different words for “time” in “in a short time”
or “for a time,” in which the word “time” refers to a period,
and “time and space” or “time heals all wounds,” in which
“time” is used as an abstract concept. In another example,
many languages use different words for “period” in “a
period of time”, and “period” in “period pains”. In this way,
the granularity of the word’s senses should be distinguished
for second language acquisition, as it varies from word to
word.</p>
      <p>
        Polysemous words encode different concepts in one word:
hence, they have been one of the central topics in
knowledge engineering. A substantial amount of work has been
conducted to automatically recognize polysemous words for
practical applications by using machine learning,
including those in the previous AAAI-MAKE workshops
        <xref ref-type="bibr" rid="ref12 ref17 ref17">(Ramprasad and Maddox 2019; Hinkelmann et al. 2019;
Laurenzi et al. 2019)</xref>
        . However, even among the few such
applications for second language acquisition
        <xref ref-type="bibr" rid="ref11 ref6">(Heilman et al. 2007;
Dias and Moraliyski 2009)</xref>
        in the artificial intelligence (AI)
community, the challenging problem of the different granularity
of the word’s senses in second language acquisition has not
been addressed. In second language acquisition, learners
are typically not linguistic experts, i.e., they are novices; hence,
systems to support their learning need to be intuitively
understandable. Our goal is to make the gaps among word usages
intuitively understandable, even for novice language
learners.
      </p>
      <p>
        To this end, this paper proposes a novel supervised
visualization method for word usages to assist in learning the
different usages of a word. Our method first searches all
usages of the target word in a large corpus written by
native speakers. Then, it calculates the vector representation of
each usage, or occurrence, of each word by using a
contextualized word embedding method
        <xref ref-type="bibr" rid="ref5">(Devlin et al. 2019)</xref>
        .
Contextualized word embedding methods
        <xref ref-type="bibr" rid="ref27 ref5">(Peters et al. 2018;
Devlin et al. 2019)</xref>
        are recently proposed methods that embed
each occurrence of a word, capturing the context of each
usage.
      </p>
      <p>Then, our method is trained to visualize the
contextualized word embedding vectors by plotting each usage as a
point in a two-dimensional space. Unlike a typical
visualization method that merely projects the vectors to a
two-dimensional space, our method is trained to fit and
visually explain a given supervision dataset. This means that
the same vectors are visualized in different ways if the
supervision dataset differs. Here, the supervision is a
vocabulary test result dataset in matrix format,
recording which learner answered correctly/incorrectly
to which word’s question. The method visualizes the areas a
learner user may know by classifying each usage point in
the visualization into known/not known to the learner. This
classification is conducted in a personalized manner because
learners’ language skills and specialized fields are different.
The learner only needs to take a 30-minute vocabulary test
for this purpose.</p>
      <p>Figure 1 shows an example visualization using our
method. “To haunt” has two different meanings in English,
the first being “to chase” and the other “to curse,” or to be
affected by ghosts or misfortune. Each point shows the usage
of the word in a corpus written by native speakers. The
differences in point colors indicate whether they are predicted
to be known to the learner. The right side of the figure, within
the dotted curve, is predicted to be known to the learner. In
this way, our method visualizes the semantic area the learner
knows.</p>
      <sec id="sec-2-1">
        <title>Our contributions are as follows:</title>
        <p>For second language vocabulary learning, we propose
a novel supervised visualization model that captures
word broadness via a personalized prediction of learner’s
knowledge of usages.</p>
        <p>As our visualization uses a vocabulary test result dataset
as supervision, learners can understand which usages of
the inputted word are predicted to be known/not known to
him/her. Unlike previous methods that output automatic
explanations of machine-learning models, our method is
much more intuitive and novice-friendly for language
learners, in the sense that language learners do not need
to know about machine-learning models.</p>
        <p>
          We evaluated our method in terms of predictive accuracy
of vocabulary test result dataset and achieved better
results compared to baselines.</p>
        <p>
          While deep learning-based methods have outperformed
conventional machine learning methods such as support vector
machines (SVMs) in many tasks, parameters of deep
learning methods are typically more difficult to interpret
compared to those in conventional models. To this end, in the
machine learning and artificial intelligence community, a
number of methods have been proposed to extract
explanations from trained machine-learning models, or to train
models taking explainability into account
          <xref ref-type="bibr" rid="ref16 ref16 ref18 ref20 ref20 ref24 ref29">(Ribeiro, Singh,
and Guestrin 2016; Koh and Liang 2017; Lundberg and Lee
2017; Ribeiro, Singh, and Guestrin 2018)</xref>
          .
        </p>
        <p>However, the purpose of these methods is to explain
machine-learning models to help machine-learning
engineers and researchers in understanding the models.
Obviously, second language learners are usually not
machine-learning engineers or researchers. Therefore, these
studies have different purposes, and it is difficult to
apply their methods to help learners understand the
models. Language learners are typically not even interested in
the models. Rather, learners’ interests reside in
understanding their current learning status and what they should learn
to improve it. Hence, to meet learners’ needs, a model
that lets a learner see his/her current learning status and
what he/she needs to learn in the near future is desirable.</p>
        <sec id="sec-2-1-1">
          <title>Word Embedding Visualization Studies</title>
          <p>
            Word embedding techniques are techniques that have been
extensively studied in natural language processing (NLP) to
obtain vector representations of words typically using
neural networks. Word2vec is a seminal work in this line
of research
            <xref ref-type="bibr" rid="ref23">(Mikolov et al. 2013)</xref>
            . Subsequent papers
report improvements in the accuracy of representing words
as vectors, typically by comparing the distances between
word vectors with human judgments of the semantic proximity
between words
            <xref ref-type="bibr" rid="ref26">(Pennington, Socher, and Manning 2014)</xref>
            .
Early studies on word embeddings addressed how to make
one vector for each word. As one vector representation is
modeled to express one meaning, this limitation is obviously
problematic for dealing with polysemous words. Several
previous studies tackled this problem and proposed methods
to estimate the number of a word’s meanings and an
embedding for each meaning of the word
            <xref ref-type="bibr" rid="ref1 ref18 ref29">(Athiwaratkun, Wilson, and Anandkumar 2018)</xref>
            . However,
recently, contextualized word embeddings
            <xref ref-type="bibr" rid="ref27 ref5">(Peters et al. 2018;
Devlin et al. 2019)</xref>
            quickly became popular. With these
methods, we can obtain an embedding for each usage, or
occurrence, of a word, considering the context of the
occurrence of the word in a running sentence. These methods can
also be seen as a method to estimate word embeddings for
polysemous words, with an extreme assumption that each
occurrence of a word has different meanings. As
contextualized word embeddings are shown to be successful in many
tasks, in current NLP, the former strategy to estimate both
the number of meanings of a word and an embedding for
each meaning is employed only when it is necessary.
          </p>
          <p>
Following the rise of word embedding techniques,
visualization studies were proposed to visualize word embeddings.
The study by (Smilkov et al. 2016) reported the
development of a tool to visualize embeddings for different
words. The study by
            <xref ref-type="bibr" rid="ref19">(Liu et al. 2017)</xref>
            introduced applying
visualization of word embeddings to analyze semantic
relationships between words. Both papers use principal
component analysis (PCA) and t-SNE
            <xref ref-type="bibr" rid="ref22">(Maaten and Hinton
2008)</xref>
for visualization. To our knowledge, we are the first
to visualize contextualized word embeddings, in which each
occurrence of a word, rather than the word itself, is visualized,
for a practical purpose in language education.
          </p>
          <p>
            In addition to the visualization, our method can also
predict the usages that each learner is familiar/unfamiliar with,
in a personalized manner, when vocabulary test result data of
dozens of learners are provided, such as the data in
            <xref ref-type="bibr" rid="ref9">(Ehara
2018)</xref>
            . While there exist previous studies
            <xref ref-type="bibr" rid="ref18 ref18 ref29 ref29 ref9">(Ehara 2018;
Lee and Yeung 2018; Yeung and Lee 2018)</xref>
            for predicting
the words that each learner is familiar/unfamiliar with from
such data by simple machine-learning classification,
our method tackles the more difficult problem of
predicting which usages of a word are known/unknown to the
learner.
          </p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Concordancer studies</title>
          <p>
While our proposed method is novel as a visualization,
software tools that search the usages of an inputted word
for educational purposes and display them are not
novel: such software is known as concordancers.
Concordancers target learners, educators, and linguists as primary
users. They are interactive software tools that retrieve all
usages of the inputted word in a large corpus and display
the list of the usages, each of which comes with the
surrounding word patterns
            <xref ref-type="bibr" rid="ref13">(Hockey and Martin 1987)</xref>
            .
Concordancers were also studied to support translators, who are
second language learners in many cases
            <xref ref-type="bibr" rid="ref14 ref21 ref28">(Wu et al. 2004;
Jian, Chang, and Chang 2004; Lux-Pogodalla, Besagni, and
Fort 2010)</xref>
            .
          </p>
      <p>Figure 2 shows a screenshot from a current concordancer
(https://lextutor.ca/conc/eng/). In this screenshot, the word “book” is searched, and the
list of its usages is shown. Each word usage comes with
surrounding words so that language learners can see how the
word is used. Because the list is sorted in alphabetical order of
the preceding word, “a book”
and “the book” appear in totally different positions, which is not
helpful for language learners. While some concordancers
support listing only the usages of “book” as a noun by annotating texts
with parts of speech in advance, this does not help in seeing
the different usages of a word when the parts of speech
of the usages are identical. For example, the word “bank”
has polysemous meanings sharing the same part of speech:
one as a financial organization, and another as an embankment.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Personalized complex word identification studies</title>
          <p>
In this study, part of our goal is to identify complex
usages of a word in a running text. In other words, for one
word, one usage in running text may be complex
for a learner while another usage is not. There
are previous studies that identify complex words in a
personalized manner in the NLP literature
            <xref ref-type="bibr" rid="ref18 ref29 ref7">(Ehara et al. 2012;
Lee and Yeung 2018)</xref>
            . These studies predict the words that
each learner knows based on each learner’s result of a short
vocabulary test, which typically takes a learner 30 minutes
to solve. There are also many studies that identify
complex usages in a non-personalized manner, as summarized in
            <xref ref-type="bibr" rid="ref24 ref30">(Paetzold and Specia 2016; Yimam et al. 2018)</xref>
            .
          </p>
          <p>However, to our knowledge, the task of identifying
complex usages in a personalized manner is novel. Our method
is also novel in that it trains how to visualize the usages so
that learners can visually understand the usage differences
by using the learners’ vocabulary test data.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Preliminary System and Experiments</title>
      <p>Before entering the technical details of our method
described in the Proposed Method section, we first show the
preliminary system and some experiment results to
introduce the motivation of the proposed method.</p>
      <p>The preliminary system visualizes contextualized word
embeddings by using the conventional visualization of
principal component analysis (PCA). Figure 3 shows the layout
of the preliminary system.</p>
      <sec id="sec-3-1">
        <title>-</title>
        <p>
          Once a user provides a word to
the system, it automatically searches the word in the corpus
in a similar way to typical concordancers. Unlike
concordancers, the system has a database that stores contextualized
word embeddings for each usage or occurrence of each word
in the corpus. We used half a million sentences from the
British National Corpus
          <xref ref-type="bibr" rid="ref4">(BNC Consortium 2007)</xref>
          as the raw
corpus. We built the database by applying the
bert-base-uncased model of the PyTorch Pretrained BERT project
(https://github.com/huggingface/pytorch-pretrained-BERT)
          <xref ref-type="bibr" rid="ref5">(Devlin et al. 2019)</xref>
          to the corpus. We used the last layer,
which is the most distant from the surface input, as the
embeddings.
        </p>
        <sec id="sec-3-1-1">
          <title>Choice of dimension reduction methods</title>
          <p>
            Principal component analysis (PCA) and t-SNE
            <xref ref-type="bibr" rid="ref22">(Maaten
and Hinton 2008)</xref>
            are famous dimension reduction methods,
and t-SNE is notable for its intuitiveness and well-clustered
points.
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>-</title>
        <p>
          Although we know t-SNE well, we did not employ it for
visualization, for the following reasons. First, in our visualization,
the distances between usage points are important. While
t-SNE often produces intuitive clusters of data points,
the distances between points in its visualization are
complicated to interpret compared to those of PCA. Hence, to interpret
distances between points, PCA is preferable. This is stated in the original
t-SNE
          <xref ref-type="bibr" rid="ref22">(Maaten and Hinton 2008)</xref>
          paper. Moreover, many blog
posts for engineers, such as
https://mlexplained.com/2018/09/14/paper-dissected-visualizing-data-using-t-sne-explained/,
address this fact to encourage
the proper understanding of t-SNE. For these reasons, we
employed PCA as the basis of our visualization.
        </p>
        <p>
          Second, even if the data to visualize are fixed, t-SNE
returns different results depending on its hyperparameter
called perplexity. In contrast, PCA returns the same
results if the data to visualize are fixed. This dependence on
the hyperparameter is elaborated in the original t-SNE
paper
          <xref ref-type="bibr" rid="ref22">(Maaten and Hinton 2008)</xref>
          in the first place. We can also
find blog posts targeting engineers that advocate
carefully setting the perplexity parameter, such as
https://distill.pub/2016/misread-tsne/. Varying
results on fixed data can be useful when the data are difficult to
pre-process into a form that the subsequent dimension-reduction
methods can handle easily. However, in this study, the data
to be visualized are embedding vectors; hence, the data can
easily be pre-processed before we feed them into the
dimension-reduction method. Hence, for the purpose of this
study, the property that the results vary on fixed data is
unlikely to be useful. Rather, it may complicate the
interpretation of the visualization.
        </p>
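        <p>
          The determinism claim is easy to check. The following is a minimal sketch, assuming scikit-learn (with the full SVD solver so that no randomized solver is involved) and random stand-in vectors in place of real usage embeddings:

```python
# Sketch: PCA is deterministic on fixed data; t-SNE, by contrast,
# depends on the perplexity hyperparameter and random initialization.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 768))  # stand-in for 768-dim usage embeddings

# "full" SVD avoids scikit-learn's randomized solver, so nothing varies
projection_1 = PCA(n_components=2, svd_solver="full").fit_transform(X)
projection_2 = PCA(n_components=2, svd_solver="full").fit_transform(X)

assert np.allclose(projection_1, projection_2)  # identical on every run
print(projection_1.shape)  # (100, 2)
```

        </p>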
        <p>Third, practically, t-SNE is computationally heavy
compared to PCA. Computing a t-SNE visualization involves
calculations for every pair of the given data points. While
how to deal with this heavy computational complexity is
addressed in studies such as (Tang et al. 2016), in practice
t-SNE is usually computationally heavy when compared to
PCA.</p>
      </sec>
      <sec id="sec-3-3">
        <title>-</title>
        <p>Strictly speaking, PCA has a similar complexity, as
it involves the computation of singular values and vectors
in singular value decomposition (SVD). However, because the
calculation of SVD has a number of applications other than
PCA-based visualization, sophisticated calculation methods
for large data have been proposed (Halko et al. 2011).</p>
        <sec id="sec-3-3-1">
          <title>Preliminary System by using PCA</title>
          <p>
            We built a preliminary system and conducted some
experiments to see how contextualized word embedding vectors
are plotted in the system. Figure 4 depicts such an
example of searching for the word book. Users can directly type
the word in the textbox shown at the top of Figure 4.
Below is the visualization of the usages found and their list.
Each dark-colored point is linked to each usage. Two dark
colors are used to color each usage point according to the
results of a Gaussian mixture model (GMM) clustering with
2 components, as this value was reported to work well
            <xref ref-type="bibr" rid="ref1 ref18 ref29">(Athiwaratkun, Wilson, and Anandkumar 2018)</xref>
            . The light-red
colored point is the probe point: the usages are listed in
order of their proximity to the probe point. No usage is linked to
the probe point. Users can freely and interactively drag and
move the probe point to change the list of usages below the
visualization. Each line of the list shows the usage
identification number and the surrounding words of the usage,
followed by a checkbox to record the usage so that
learners can refer to it later. In Figure 4, the probe point is on
the left part of the visualized figure. In the first several lines
of the list, the system successfully shows the usages of the
word book as a publication. In contrast, Figure 5 depicts
the case in which the users drag the probe point from the
left to the right of the visualization. The first several lines
of the list show the usages of the word book, which means
to reserve. We can see that the words surrounding the word
book vary: merely focusing on the surrounding words, such
as “to” before book, cannot distinguish the usages of book,
which means to reserve, from the usages of book for reading.
          </p>
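          <p>
            The two-color clustering step can be sketched as below; this is a minimal sketch assuming scikit-learn's GaussianMixture, with synthetic two-dimensional points standing in for projected usage embeddings:

```python
# Sketch: color usage points by a 2-component Gaussian mixture (GMM),
# as in the preliminary system; synthetic 2-D clusters stand in for
# projected usage embeddings of a polysemous word such as "book".
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
reading_usages = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(40, 2))
reserve_usages = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(40, 2))
points = np.vstack([reading_usages, reserve_usages])

gmm = GaussianMixture(n_components=2, random_state=0).fit(points)
labels = gmm.predict(points)

# each well-separated sense receives one consistent label (color)
assert len(set(labels[:40])) == 1
assert len(set(labels[40:])) == 1
assert labels[0] != labels[40]
```

          </p>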
        </sec>
        <sec id="sec-3-3-2">
          <title>Clustering Results</title>
          <p>The GMM clustering was accurate but not perfect: 0
errors in the 42 usages of “book” and 1 error in the 22 usages
of “bank”, when manually checked in the excerpt. Hence,
learners can choose not to use this feature, as in the video. Figure 6
shows the variance of the usage vectors of each word against
its log frequency in the excerpt. It showed a statistically
significant moderate correlation (r = 0.56, p &lt; 0.01 by F-test),
implying that frequent words tend to have complex usages.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>Motivating Examples</title>
          <p>From the example of “book” in the previous sections, we can
easily see that the usages of “book” about reading are more
frequent than those of “book” about a reservation. Hence,
when counting the number of usages, it is intuitive to assume
that learners are not familiar with all usages but only with
the usages within a certain radius in the vector space.
This is the motivation for our method described in the next
section.</p>
          <p>Before entering the technical details of our visualization
method in the next section, we show some usage
prediction result examples of our method in a manner similar to
the previous examples of “book” so that readers can
intuitively understand our motivation, as shown in Figure 7
and Figure 8. The markers are changed to triangular to
denote that the colors reflect prediction results, rather than the
GMM-based clustering results explained above. The
coloring and darkness of the points in the visualization follow
those of the previous examples; the red light-colored point
is the probe point, and the other dark points denote usages.
Figure 7 shows an example of the familiar usage prediction
in case of searching the word “haunt”. The right-hand side
of the cross-marked circle is the area in which usages are
predicted to be familiar to this learner. The probe point is
located within the circle. We can see that the usages of “haunt”
about chasing are listed below. Figure 8 shows another
example of “haunt”. As the probe point is located outside of
the circle, in the left side of the visualization, the list below
shows the list of the usages predicted to be unfamiliar to this
learner. We can see that usages of “haunt” meaning “to curse” are mainly
listed.</p>
          <p>The following equations define our model (each term is
explained below):
d_{v_i} = -log(freq(v_i) + 1)   (2)
freq(v_i) = N(c_i; θ; X_i)   (3)
freq(v_i) ≈ Σ_{k=1}^{n_i} tanh(M · ReLU(θ - d_e(G c_i, G x_{k,i})))   (4)
As stated in the Related Work section, some previous
studies address methods to predict the words that a learner knows
based on his/her short vocabulary test result. However,
our application requires a personalized prediction of the
usages of a word that the learner does not know. Hence, we
propose a novel model that does this.</p>
          <p>
            Let us write the set of words as {v_1, ..., v_I}, where I is
the number of words (in type), and write the set of learners
as {l_1, ..., l_J}, where J is the number of learners. Then, in
previous studies, based on the Rasch model
            <xref ref-type="bibr" rid="ref2">(Rasch 1960;
Baker 2004)</xref>
            , the logistic regression model of
Equation 1 is used to predict whether learner l_j knows word v_i or
not. Here, σ(x) := 1/(1 + exp(-x)), and y_{i,j} is the response of the
learner in the vocabulary test: y_{i,j} = 1 if learner l_j answered
correctly to the question of word v_i, and y_{i,j} = 0 otherwise.
We have two types of parameters to tune: a_{l_j} is the ability of
learner l_j, and d_{v_i} is the difficulty of word v_i.
          </p>
          <p>P(y_{i,j} = 1 | l_j, v_i) = σ(a_{l_j} - d_{v_i})   (1)</p>
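          <p>
            The Rasch-style prediction of Equation 1 can be sketched in a few lines of Python; the ability and difficulty values below are illustrative, not taken from the dataset:

```python
# Sketch of Equation 1: P(y_ij = 1 | l_j, v_i) = sigma(a_lj - d_vi),
# the Rasch-style probability that learner l_j knows word v_i.
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_correct(ability, difficulty):
    return sigma(ability - difficulty)

# equal ability and difficulty: a 50/50 chance
assert p_correct(1.0, 1.0) == 0.5
# a strong learner facing an easy word: almost certain success
assert p_correct(3.0, -3.0) > 0.99
```

          </p>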
          <p>
            Here, how to model d_{v_i}, the difficulty parameter of
word v_i, is the key to our purpose. Previous studies report
that the negative logarithm of the word frequency correlates
well with the perceived difficulty of words
            <xref ref-type="bibr" rid="ref3">(Tamayo 1987;
Beglar 2010)</xref>
            . As in Figure 1, our key idea is to count the
frequency of word usages only within a certain distance from
the typical usage of the word. Hence, we propose the
following model to implement this idea.
          </p>
          <p>For each v_i, we have n_i vectors that are the vector
representations of each of the n_i occurrences of word v_i. We write these
vectors as X_i = {x_{1,i}, ..., x_{n_i,i}}. Each vector x_{k,i} is
T_1-dimensional. Among X_i, let c_i be the vector closest to their
center (1/n_i) Σ_{k=1}^{n_i} x_{k,i}. Let freq(v_i) be the number of the vectors
in X_i within distance θ measured from the central vector c_i.
We write this frequency simply as freq(v_i) = N(c_i; θ; X_i).
Here, n_i is the number of usages of word v_i, and each x_{k,i}
is a usage vector obtained from contextualized word
embedding methods. Let ReLU(z) = max(0, z) be the
rectified linear unit. The tricky part is that Equation 3 can be approximately
rewritten as Equation 4, whose parameters can be easily tuned
and optimized by using a neural machine learning framework
such as PyTorch. In Equation 4, due to the ReLU
function, negative values inside the function are simply ignored.
Hence, as d_e is the Euclidean distance, if θ = 0, i.e., the
size of the circle is 0, the terms inside the ReLU are non-positive,
and freq(v_i) = 0. If θ - d_e(G c_i, G x_{k,i}) &gt; 0, due to M and
tanh, the resulting value is almost 1. This means that we are
counting only the cases in which θ surpasses d_e, i.e., counting the
usages within distance θ from c_i.</p>
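          <p>
            The soft count of Equation 4 can be sketched as follows. This is a NumPy stand-in for the PyTorch implementation; the values of θ and M and the random vectors are illustrative only:

```python
# Sketch of Equation 4: usages whose projected distance from the
# central usage c_i is within theta are counted as (almost) 1 each;
# the large constant M sharpens tanh toward a 0/1 step.
import numpy as np

def soft_usage_count(c, X, G, theta, M=1000.0):
    """Approximate freq(v_i): usages in X within Euclidean distance
    theta of the center c, after projecting both with G."""
    distances = np.linalg.norm(X @ G.T - c @ G.T, axis=1)
    return np.sum(np.tanh(M * np.maximum(0.0, theta - distances)))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 768))        # 50 usage vectors of one word
c = X[0]                              # central usage vector
G = rng.normal(size=(2, 768)) * 0.01  # projection to two dimensions

# theta = 0: every ReLU argument is non-positive, so nothing is counted
assert soft_usage_count(c, X, G, theta=0.0) == 0.0
# a huge theta: every usage is counted, so the count approaches 50
assert soft_usage_count(c, X, G, theta=1e6) > 49.9
```

          </p>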
          <p>Notably, the following characteristics are important to
understand our model.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>Not merely a logistic regression</title>
          <p>
            Notably, the proposed model is not merely a logistic
regression. Our model has more parameters such as ; ~ci; alj ; G.
Because of having different extra parameters compared to
the logistic regression, to train our model, we typically need
to use a neural network machine learning framework to
model and optimize, such as PyTorch. To optimize using
such models, as it is difficult to differentiate the loss
functions of such models by hand, the model loss function is
desirable to be mostly continuous and smooth so that its
parameters can be tuned using auto-gradient. We specifically
designed Equation 4 to meet these conditions. In the
experiments, we used the Adam optimization method
            <xref ref-type="bibr" rid="ref15">(Kingma and
Ba 2015)</xref>
            to optimize the loss function.
          </p>
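          <p>
            A minimal training sketch follows, assuming PyTorch and a synthetic response matrix; only the Rasch-style ability and difficulty parameters are shown here (the full model adds θ_j, c_i, and G in the same way):

```python
# Sketch: tuning parameters with Adam via automatic differentiation,
# assuming PyTorch; a synthetic learner-by-word response matrix stands
# in for a real vocabulary test result dataset.
import torch

torch.manual_seed(0)
num_learners, num_words = 20, 30
responses = (torch.rand(num_learners, num_words) > 0.5).float()

ability = torch.zeros(num_learners, requires_grad=True)
difficulty = torch.zeros(num_words, requires_grad=True)
optimizer = torch.optim.Adam([ability, difficulty], lr=0.1)

def loss_fn():
    logits = ability[:, None] - difficulty[None, :]  # a_lj - d_vi
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits, responses)

initial_loss = loss_fn().item()
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn()
    loss.backward()
    optimizer.step()

assert initial_loss > loss_fn().item()  # the loss decreases
```

          </p>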
        </sec>
        <sec id="sec-3-3-5">
          <title>Trainable G</title>
          <p>As Equation 4 is mostly continuous and smooth, the matrix G
can also be trained by using deep-learning framework
software. As G is a projection matrix from T_1 to T_2 dimensions, if we set
T_2 = 2 to obtain a projection to a two-dimensional space,
training G via supervision means training the visualization via
supervision. Here, in our task setting, the supervision is a
vocabulary test dataset of second language learners, i.e., a
matrix in which the (j, i)-th element denotes whether learner
l_j correctly answered the question of word v_i.</p>
        </sec>
        <sec id="sec-3-3-6">
          <title>Personalized θ_j</title>
          <p>In Equation 4, for ease of understanding, we treat θ as a
constant that does not depend on the learner index j. In reality,
we can personalize it by making θ dependent on the learner index
j, as θ_j; in this case, each learner l_j has his/her own region
that he/she can understand, and the radius of this region is
θ_j. This personalized version is the one we used in the
experiments.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Quantitative Results of Prediction</title>
        <p>
          Quantitative evaluation of this personalized prediction of
the usages of a word is difficult: we would need to test each
learner multiple times on different usages of the same word.
However, when tested with the same word multiple times,
learners easily notice that the word has multiple meanings.
Hence, we instead evaluated the accuracy of personalized
prediction of the words that each learner knows, under an
experimental setting similar to
          <xref ref-type="bibr" rid="ref9">(Ehara 2018)</xref>
          . Our proposed
method is based on neural classification with a novel
extension that counts only the frequency of the usages within
the learner's radius. Since a typical logistic-regression classifier is
identical to a one-layer neural classifier, comparing our model
with a typical logistic-regression classifier using a frequency
feature in terms of accuracy indirectly evaluates
whether this adjusted frequency counting is a practical method
for evaluation.
        </p>
        <p>
          The proposed model estimates the number of occurrences,
i.e., usages, that each learner knows. In other words, it
can be regarded as modifying the word frequency so that
the model fits the given vocabulary test dataset. In this
regard, we can evaluate how well the proposed model
corrects word frequency when given an unbalanced corpus.
Each document in the British National Corpus (BNC)
          <xref ref-type="bibr" rid="ref4">(BNC
Consortium 2007)</xref>
          is annotated with a domain (Table 1). We
evaluated how well the proposed model can correct the word
frequency in the “arts” domain.
        </p>
        <p>
          We used vocabulary test result data in which each of
100 learners answered 31 vocabulary questions, from the
publicly available dataset
          <xref ref-type="bibr" rid="ref9">(Ehara 2018)</xref>
          . Of the 3,100 vocabulary
test responses, we used 1,800 to train the model, and the
rest were used for testing. The baseline model is simply a
logistic regression in which the logarithm of word frequency is
the only feature. The logarithm of word frequency has been
used as a simple rough measure for word difficulty and
previously used to analyze and predict word difficulty based
on vocabulary test data
          <xref ref-type="bibr" rid="ref18 ref18 ref29 ref29 ref3">(Beglar 2010; Ehara et al. 2013;
Lee and Yeung 2018; Yeung and Lee 2018)</xref>
          . The proposed
model counts only the number of usages within the learner's
radius. We used the PyTorch neural network framework to
automatically tune the radius and the center of the sphere by
using its powerful automatic gradient support
          <xref ref-type="bibr" rid="ref25">(Paszke et al.
2017)</xref>
          .
        </p>
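<p>A minimal sketch of the baseline, assuming made-up frequencies and labels: a logistic regression whose only feature is the logarithm of word frequency, trained by plain gradient descent (equivalent to a one-layer neural classifier).</p>

```python
import numpy as np

# Illustrative data: word frequencies and "known" labels are simulated,
# with more frequent words more likely to be known (plus 10% label noise).
rng = np.random.default_rng(0)
freq = rng.integers(1, 10000, size=200).astype(float)
x = np.log(freq)
x = x - x.mean()                              # center the single feature
flip = rng.random(200) > 0.9                  # about 10% flipped labels
y = np.logical_xor(x > np.median(x), flip).astype(float)

w, b = 0.0, 0.0
for _ in range(3000):                         # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))    # predicted P(word is known)
    w -= 0.1 * np.mean((p - y) * x)
    b -= 0.1 * np.mean(p - y)

acc = np.mean((p > 0.5) == y)                 # training accuracy
```

<p>The proposed model differs from this baseline only in what it counts: instead of the raw corpus frequency, it counts the usages falling inside each learner's region.</p>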
        <p>First, we performed experiments with T1 = T2 and G = I,
a setting in which no projection is performed and the model
deals with T1-dimensional hyperspheres. Table 2 shows the
results. The accuracy of predicting the learners' vocabulary
test data using the word frequency of the biased arts-domain
text alone is lower than that using the word frequency of all
domains. The proposed method improved the accuracy over
the arts-domain word frequency by counting only the usages
in the region of the contextual word vector space to which
the examinee is estimated to respond. This effect was also
observed for all domains, presumably because the frequency
counting of the proposed method excludes outlier usages.
The improvement in accuracy before and after correction
was statistically significant (p &lt; 0.01, Wilcoxon test) when
modifying word frequencies in the arts domain
alone or in all domains.</p>
        <p>“Trained” visualization.
In the above experiments, we considered the case where no
projection is conducted, by fixing G = I. Next, let us
consider the case where G is a projection to a two-dimensional
space, i.e., G is a 2 × T1 matrix. Tuning G and the radius
to fit the vocabulary test dataset by using Equation 4 means
that we can actually train the visualization to explain the
vocabulary test dataset in a supervised manner.</p>
        <p>Figure 9 and Figure 10 show the result of the
visualization. The initial value of G was set to a two-dimensional
projection matrix by principal component analysis (PCA).
Although the initial value is the PCA projection, it should
be noted that the projection matrix G itself, as well as the
radius, is trained from the vocabulary test dataset.</p>
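<p>The PCA initialization of G can be sketched as follows with toy data; in practice, the rows of X would be the contextual word vectors, and the trained G subsequently departs from this starting point. The variable names are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))     # stand-in usage vectors (T1 = 8)
Xc = X - X.mean(axis=0)           # center the data before PCA
# The top-2 right singular vectors give the 2 x T1 projection used
# to initialize G before supervised training.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
G_init = Vt[:2]
coords = Xc @ G_init.T            # initial 2-D visualization coordinates
```

<p>Starting from the PCA projection gives the supervised training a sensible layout to refine, rather than a random one.</p>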
        <p>From Figure 9 and Figure 10, we can see that the
proposed method counts only the main meanings within the red
circle. To qualitatively evaluate the results, Table 3 shows the
two farthest and the two closest example occurrences of “period”
from its center point, i.e., the center of the red circle in Figure 9.
The farthest cases are uses in technical terms such as “period pain”
and “magnetic field period”, while the closest two cases are
uses as a noun denoting a span of time, such as “this period”
and “the period”.</p>
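<p>Retrieving the farthest and closest occurrences from the center, as in Table 3, amounts to sorting the projected occurrence vectors of a word by their distance from the circle's center. A sketch with made-up vectors (all names illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.normal(size=(20, 2))   # projected occurrence vectors of one word
center = np.zeros(2)              # center of the learner's circle

# Sort occurrences by Euclidean distance from the center.
dist = np.linalg.norm(vecs - center, axis=1)
order = np.argsort(dist)
closest_two = order[:2]           # e.g., ordinary uses such as "this period"
farthest_two = order[-2:]         # e.g., technical uses such as "period pain"
```
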
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper, we proposed a supervised visualization method
to predict which usages of a word are known to each learner,
using a vocabulary test result dataset as supervision. Our
neural method automatically tunes the projection matrix used
for visualization and the radius of each learner in the visualization
so that the counted frequency within the circles fits the
supervision. Experiments on actual learner response data
show that the proposed method predicts learner responses
more accurately by modifying the frequency, even when the
use cases are biased toward a specific domain. As future work,
we plan to make our method more interactive.</p>
      <p>Ramprasad, S., and Maddox, J. 2019. CoKE: Word Sense
Induction Using Contextualized Knowledge Embeddings.
In AAAI Spring Symposium: Combining Machine Learning
with Knowledge Engineering.</p>
      <p>Rasch, G. 1960. Probabilistic Models for Some Intelligence
and Attainment Tests. Copenhagen: Danish Institute for
Educational Research.</p>
      <p>Read, J. 2000. Assessing Vocabulary. Cambridge University
Press.</p>
      <p>Smilkov, D.; Thorat, N.; Nicholson, C.; Reif, E.; Viégas,
F. B.; and Wattenberg, M. 2016. Embedding Projector:
Interactive Visualization and Interpretation of Embeddings.
In Proc. of NIPS 2016 Workshop on Interpretable Machine
Learning in Complex Systems.</p>
      <p>Tamayo, J. M. 1987. Frequency of use as a measure of
word difficulty in bilingual vocabulary test construction and
translation. Educational and Psychological Measurement
47(4):893–902.</p>
      <p>Tang, J.; Liu, J.; Zhang, M.; and Mei, Q. 2016. Visualizing
large-scale and high-dimensional data. In Proc. of WWW,
287–297.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Athiwaratkun</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; Wilson,
          <string-name>
            <given-names>A.</given-names>
            ; and
            <surname>Anandkumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>F. B.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Item Response Theory : Parameter Estimation Techniques, Second Edition</article-title>
          . CRC Press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Beglar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>A rasch-based validation of the vocabulary size test</article-title>
          .
          <source>Language Testing</source>
          <volume>27</volume>
          (
          <issue>1</issue>
          ):
          <fpage>101</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>BNC</given-names>
            <surname>Consortium</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <year>2007</year>
          .
          <article-title>The British National Corpus, version 3 (BNC XML Edition)</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Chang, M.-W.;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dias</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Moraliyski</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Relieving polysemy problem for synonymy detection</article-title>
          .
          <source>In Portuguese Conference on Artificial Intelligence</source>
          ,
          <fpage>610</fpage>
          -
          <lpage>621</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Ehara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sato</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Oiwa, H.; and Nakagawa,
          <string-name>
            <surname>H.</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>Mining words in the minds of second language learners: learnerspecific word difficulty</article-title>
          .
          <source>In Proc. of COLING.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          2013.
          <article-title>Personalized reading support for second-language web documents</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>4</volume>
          (
          <issue>2</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Ehara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Building an english vocabulary knowledge dataset of japanese english-as-a-second-language learners using crowdsourcing</article-title>
          .
          <source>In Proc. LREC</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          2011.
          <article-title>An algorithm for the principal component analysis of large data sets</article-title>
          .
          <source>SIAM Journal on Scientific Computing</source>
          <volume>33</volume>
          (
          <issue>5</issue>
          ):
          <fpage>2580</fpage>
          -
          <lpage>2594</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Heilman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Collins-Thompson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Callan</surname>
            , J.; and Eskenazi,
            <given-names>M.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts</article-title>
          .
          <source>In Proc. of NAACL</source>
          ,
          <fpage>460</fpage>
          -
          <lpage>467</lpage>
          . Rochester, New York: Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Hinkelmann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Blaser</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Faust</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Horst</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Mehli</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Virtual Bartender: A Dialog System Combining Data-Driven and Knowledge-Based Recommendation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Hockey</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1987</year>
          .
          <source>The Oxford Concordance Program Version 2. Digital Scholarship in the Humanities</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>125</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Jian</surname>
          </string-name>
          , J.-Y.;
          <string-name>
            <surname>Chang</surname>
          </string-name>
          , Y.-C.
          ;
          and
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>TANGO: Bilingual collocational concordancer</article-title>
          .
          <source>In Proc. of ACL demo.</source>
          ,
          <fpage>166</fpage>
          -
          <lpage>169</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>In Proc. of ICLR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Koh</surname>
            ,
            <given-names>P. W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Understanding Blackbox Predictions via Influence Functions</article-title>
          .
          <source>In Proc. of ICML</source>
          ,
          <fpage>1885</fpage>
          -
          <lpage>1894</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Laurenzi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hinkelmann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; Ju¨ngling, S.;
          <string-name>
            <surname>Montecchiari</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pande</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Towards an Assistive and Pattern Learning-driven Process Modeling Approach</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yeung</surname>
            ,
            <given-names>C. Y.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Personalizing lexical simplification</article-title>
          .
          <source>In Proc. of COLING</source>
          ,
          <fpage>224</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Bremer, P.-T.;
          <string-name>
            <surname>Thiagarajan</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Srikumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Livnat</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Pascucci</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Visual exploration of semantic relationships in neural word embeddings</article-title>
          .
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          <volume>24</volume>
          (
          <issue>1</issue>
          ):
          <fpage>553</fpage>
          -
          <lpage>562</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Lundberg</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.-I.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>A Unified Approach to Interpreting Model Predictions</article-title>
          . In Guyon, I.; Luxburg, U. V.;
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Wallach,
          <string-name>
            <surname>H.</surname>
          </string-name>
          ; Fergus,
          <string-name>
            <surname>R.</surname>
          </string-name>
          ; Vishwanathan,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; and Garnett, R., eds.,
          <source>Proc. of NIPS</source>
          ,
          <fpage>4765</fpage>
          -
          <lpage>4774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Lux-Pogodalla</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Besagni</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Fort</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>FastKwic, an “intelligent” concordancer using FASTR</article-title>
          .
          <source>In Proc.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          v. d., and
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (Nov):
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Chen,
          <string-name>
            <given-names>K.</given-names>
            ;
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            ; and
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Proc. of NIPS</source>
          ,
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Paetzold</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Specia</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Benchmarking lexical simplification systems</article-title>
          .
          <source>In LREC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Paszke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chintala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Chanan,
          <string-name>
            <given-names>G.</given-names>
            ;
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ;
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ;
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ;
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ; and
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Automatic differentiation in pytorch.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Socher, R.; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proc. of EMNLP</source>
          ,
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>Proc. of NAACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          2004.
          <article-title>Subsentential translation memory for computer assisted writing and translation</article-title>
          .
          <source>In Proc. of ACL demo.</source>
          ,
          <fpage>106</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Yeung</surname>
            ,
            <given-names>C. Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Personalized text retrieval for learners of chinese as a foreign language</article-title>
          .
          <source>In Proc. of COLING</source>
          ,
          <fpage>3448</fpage>
          -
          <lpage>3455</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Yimam</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Biemann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Malmasi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Paetzold,
          <string-name>
            <given-names>G. H.</given-names>
            ;
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ; Sˇtajner, S.;
            <surname>Tack</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          ; and Zampieri,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>A report on the complex word identification shared task 2018</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>