Published in CEUR Workshop Proceedings, Vol-2600: https://ceur-ws.org/Vol-2600/paper21.pdf
 Supervised Visualization of Vocabulary Knowledge towards Explainable Support
                          of Second Language Learners

                                                               Yo Ehara
                                            Shizuoka Institute of Science and Technology,
                                            2200-2, Toyosawa, Fukuroi, Shizuoka, Japan.
                                                         ehara.yo@sist.ac.jp


                           Abstract

  In second language learning, it is crucial to identify gaps in knowledge of the language between second language learners and native speakers. Such a gap exists even when learning a single word in a second language. As the semantic broadness of a word differs from language to language, language learners must learn how broadly a word can be used in a language. For example, certain languages use different words for "period" in "a period of time" and "period" in "period pains," even though both are nouns. Learners whose native languages are such languages typically have only partial knowledge of a word: they think they know the word "period," yet a gap remains between them and native speakers. Language learners typically want explanations for these word usage differences, which even native speakers find difficult to explain and costly to annotate. To support language learners in noticing these challenging differences easily and intuitively, this paper proposes a novel supervised visualization of the usages of a word. In our method, the usages of an inputted word in large corpora written by native speakers are visualized, taking the semantic proximity between the usages into account. Then, for the single inputted word, our method makes a personalized prediction of the word usages that each learner may know, based on his/her results on a quick vocabulary test, which takes approximately 30 minutes. The experimental results show that the usage counts produced by our method predict vocabulary test responses better than raw usage frequency counts, implying that the word usage prediction is accurate.

Copyright (c) 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                        Introduction

Acquiring a second language requires repeated efforts to narrow the gap between language learners' knowledge of the language and that of native speakers. Making such gaps intuitively understandable greatly helps language learners teach themselves the language and also helps researchers build effective language tutoring systems. Some gaps, such as vocabulary size or time spent in language learning, are intuitively easy to understand and, hence, are well studied. However, in second language learning, most gaps are related to meaning and semantics and are inherently abstract. Hence, visualizing these gaps is essential to make them intuitively understandable.

The broadness of a word, or how a word can be used in the language to express different concepts, is one such abstract gap (Read 2000). Because the meaning of a word differs from language to language, when learning a word in a second language, there typically exists a gap between what learners think the word means and how the word is actually used in the language. Polysemous words are examples that are easy to understand: "book" can mean an item associated with reading, or it can mean to make a reservation. Beyond such examples, to which part-of-speech tagging techniques in natural language processing (NLP) seem applicable, some cases are more subtle: some languages always use different words for "time" in "in a short time" or "for a time," in which "time" refers to a period, and in "time and space" or "time heals all wounds," in which "time" is used as an abstract concept. In another example, many languages use different words for "period" in "a period of time" and "period" in "period pains." In this way, the granularity of a word's senses that should be distinguished for second language acquisition varies from word to word.

Polysemous words encode different concepts in one word; hence, they have been one of the central topics in knowledge engineering. A substantial amount of work has been conducted to automatically recognize polysemous words for practical applications by using machine learning, including work in the previous AAAI-MAKE workshops (Ramprasad and Maddox 2019; Hinkelmann et al. 2019; Laurenzi et al. 2019). However, even among the few such applications for second language acquisition (Heilman et al. 2007; Dias and Moraliyski 2009) in the artificial intelligence (AI) community, the challenging problem of the varying granularity of a word's senses in second language acquisition has not been addressed. As learners are typically not linguistic experts, i.e., they are novices, systems that support their learning need to be intuitively understandable. Our goal is to make the gaps among word usages intuitively understandable, even for novice language learners.
   To this end, this paper proposes a novel supervised visualization method for word usages to assist in learning the different usages of a word. Our method first searches for all usages of the target word in a large corpus written by native speakers. Then, it calculates a vector representation of each usage, or occurrence, of the word by using a contextualized word embedding method (Devlin et al. 2019). Contextualized word embedding methods (Peters et al. 2018; Devlin et al. 2019) are recently proposed methods that embed each occurrence of a word, capturing the context of each usage of the word.
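As a minimal sketch of this one-vector-per-occurrence idea (our own illustration: the helper names are hypothetical, and placeholder vectors stand in for the BERT outputs):

```python
from dataclasses import dataclass

@dataclass
class Occurrence:
    sent_id: int     # which sentence the usage appears in
    position: int    # token index of the target word
    context: tuple   # surrounding words (the "usage")
    vector: list     # per-occurrence embedding (placeholder values here)

def index_occurrences(sentences, target, window=2):
    """Collect one record per occurrence of `target`, as a
    contextualized embedding method would (one vector per usage,
    rather than one vector per word)."""
    records = []
    for sid, sent in enumerate(sentences):
        tokens = sent.lower().split()
        for pos, tok in enumerate(tokens):
            if tok == target:
                ctx = tuple(tokens[max(0, pos - window):pos + window + 1])
                # A real system would run BERT on the sentence here; we
                # store a placeholder vector keyed to this occurrence.
                records.append(Occurrence(sid, pos, ctx, [0.0]))
    return records

sentences = ["I want to book a flight",
             "She wrote a book about birds"]
occs = index_occurrences(sentences, "book")  # two distinct usages
```

Each occurrence keeps its own entry, so the two senses of "book" above end up as two separate points rather than one shared vector.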
   Then, our method is trained to visualize the contextualized word embedding vectors by plotting each usage as a point in a two-dimensional space. Unlike a typical visualization method that merely projects the vectors to a two-dimensional space, our method is trained to fit and visually explain a given supervision dataset. This means that the same vectors are visualized in different ways if the supervision dataset differs. Here, the supervision is a vocabulary test result dataset consisting of matrix-format data that records which learner answered which word question correctly or incorrectly. The method visualizes the areas a learner may know by classifying each usage point in the visualization as known or not known to the learner. This classification is conducted in a personalized manner because learners' language skills and specialized fields differ. The learner only needs to take a 30-minute vocabulary test for this purpose.
   Figure 1 shows an example visualization using our method. "To haunt" has two different meanings in English, the first being "to chase" and the other "to curse," or to be affected by ghosts or misfortune. Each point shows a usage of the word in a corpus written by native speakers. The differences in point colors indicate whether they are predicted to be known to the learner. The right side of the figure, within the dotted curve, is predicted to be known to the learner. In this way, our method visualizes the semantic area the learner knows.
   Our contributions are as follows:

• For second language vocabulary learning, we propose a novel supervised visualization model that captures word broadness via a personalized prediction of a learner's knowledge of usages.

• As our visualization uses a vocabulary test result dataset as supervision, learners can understand which usages of the inputted word are predicted to be known or not known to them. Unlike previous methods that output automatic explanations of machine-learning models, our method is much more intuitive and novice-friendly in the sense that language learners do not need to know about machine-learning models.

• We evaluated our method in terms of predictive accuracy on a vocabulary test result dataset and achieved better results than the baselines.

Figure 1: Usage of "haunt" predicted to be familiar to the learner.

Figure 2: An example of a concordancer.

                          Related Work
Explainable machine learning studies
While deep learning-based methods outperform conventional machine learning methods such as support vector machines (SVMs) in many tasks, the parameters of deep learning methods are typically more difficult to interpret than those of conventional models. To this end, in the machine learning and artificial intelligence community, a number of methods have been proposed to extract explanations from trained machine-learning models, or to train models that take explainability into account (Ribeiro, Singh, and Guestrin 2016; Koh and Liang 2017; Lundberg and Lee 2017; Ribeiro, Singh, and Guestrin 2018).
   However, the purpose of these methods is to explain machine-learning models to machine-learning engineers and researchers. Obviously, second language learners are usually not machine-learning engineers or researchers. Therefore, these studies have different purposes, and it is difficult to apply their methods to help learners understand the models. Language learners are typically not even interested in the models. Rather, learners' interests reside in understanding their current learning status and what they should learn
to improve it. Hence, to meet learners' needs, a model is desirable that lets a learner see his/her current learning status and what he/she needs to learn in the near future.

Word Embedding Visualization Studies
Word embedding techniques have been extensively studied in natural language processing (NLP) to obtain vector representations of words, typically using neural networks. The word2vec paper is seminal in this line of studies (Mikolov et al. 2013). Subsequent papers report improvements in how accurately words are represented as vectors, typically by comparing the distances between word vectors with human judgments of the semantic proximity between words (Pennington, Socher, and Manning 2014).
   Early studies on word embeddings address how to make one vector for each word. As one vector representation is modeled to point to one meaning, this limitation is obviously problematic for polysemous words. Several previous studies tackled this problem and proposed methods to estimate the number of a word's meanings and to estimate an embedding for each meaning of the word (Athiwaratkun, Wilson, and Anandkumar 2018). Recently, however, contextualized word embeddings (Peters et al. 2018; Devlin et al. 2019) quickly became popular. With these methods, we can obtain an embedding for each usage, or occurrence, of a word, considering the context of the occurrence of the word in a running sentence. These methods can also be seen as estimating word embeddings for polysemous words under the extreme assumption that each occurrence of a word has a different meaning. As contextualized word embeddings have been shown to be successful in many tasks, in current NLP the former strategy, estimating both the number of meanings of a word and an embedding for each meaning, is employed only when necessary.
   Following the rise of word embedding techniques, visualization studies were proposed to visualize word embeddings. The study by (Smilkov et al. 2016) reported the development of a tool to visualize embeddings for different words. The study by (Liu et al. 2017) applies visualization of word embeddings to analyze semantic relationships between words. Both papers use principal component analysis (PCA) and t-SNE (Maaten and Hinton 2008) for visualization. To our knowledge, we are the first to visualize contextualized word embeddings, in which each occurrence of a word, rather than the word itself, is visualized, with a practical purpose in language education.
   In addition to the visualization, our method can also predict the usages that each learner is familiar or unfamiliar with, in a personalized manner, when vocabulary test result data of dozens of learners are provided, such as the data in (Ehara 2018). While there exist previous studies (Ehara 2018; Lee and Yeung 2018; Yeung and Lee 2018) that predict the words each learner is familiar/unfamiliar with from such data by using simple machine-learning classification, our method tackles a more difficult problem: predicting which usages of a word are known or unknown to the learner.

Concordancer studies
While our proposed method is novel as a visualization, software tools that search for the usages of an inputted word for educational purposes and display them are not themselves novel: such software is known as concordancers. Concordancers target learners, educators, and linguists as primary users. They are interactive software tools that retrieve all usages of the inputted word in a large corpus and display the list of usages, each of which comes with its surrounding word patterns (Hockey and Martin 1987). Concordancers have also been studied to support translators, who are in many cases second language learners (Wu et al. 2004; Jian, Chang, and Chang 2004; Lux-Pogodalla, Besagni, and Fort 2010).
   Figure 2 shows a screenshot from a current concordancer¹. In this screenshot, the word "book" is searched, and the list of its usages is shown. Each usage comes with its surrounding words so that language learners can see how the word is used. As the list is sorted in alphabetical order of the preceding word, "a book" and "the book" appear in totally different positions, which is not helpful for language learners. While some concordancers support listing the usages of "book" as a noun by attaching part-of-speech tags to the texts in advance, this does not help when the different usages of a word share the same part of speech. For example, the word "bank" has polysemous meanings sharing the same part of speech: one as a financial organization and another as an embankment.

Personalized complex word identification studies
In this study, a part of our goal is to identify complex usages of a word in running text. In other words, for one word, one usage in running text may be complex for a learner while another usage is not. There are previous studies in the NLP literature that identify complex words in a personalized manner (Ehara et al. 2012; Lee and Yeung 2018). These studies predict the words that each learner knows based on the learner's result on a short vocabulary test, which typically takes 30 minutes to solve. There are also many studies that identify complex usages in a non-personalized manner, as summarized in (Paetzold and Specia 2016; Yimam et al. 2018).
   However, to our knowledge, the task of identifying complex usages in a personalized manner is novel. Our method is also novel in that it trains how to visualize the usages, using the learners' vocabulary test data, so that learners can visually understand the usage differences.

           Preliminary System and Experiments
Before entering the technical details of our method, described in the Proposed Method section, we first show the preliminary system and some experimental results to introduce the motivation of the proposed method.
   The preliminary system visualizes contextualized word embeddings by using the conventional visualization of principal component analysis (PCA). Figure 3 shows the layout

   ¹ https://lextutor.ca/conc/eng/
of the preliminary system. Once a user provides a word to the system, it automatically searches for the word in the corpus in a similar way to typical concordancers. Unlike concordancers, the system has a database that stores a contextualized word embedding for each usage, or occurrence, of each word in the corpus. We used half a million sentences from the British National Corpus (BNC Consortium 2007) as the raw corpus. We built the database by applying the bert-base-uncased model of the PyTorch Pretrained BERT project² (Devlin et al. 2019) to the corpus. We used the last layer, the one most distant from the surface input, as the embeddings.

Figure 3: System layout. CWE means contextualized word embeddings.

Figure 4: Example of searching the word book.

Figure 5: Another example of searching the word book.

Choice of dimension reduction methods
Principal component analysis (PCA) and t-SNE (Maaten and Hinton 2008) are famous dimension reduction methods, and t-SNE is notable for the intuitive, well-clustered visualizations of data points it produces (Maaten and Hinton 2008).
   Knowing t-SNE, we did not employ it for visualization, for the following reasons. First, in our visualization, the distances between usage points are important. While t-SNE often produces intuitive clusters of data points, the distances between points in its visualizations are complicated to interpret compared to those of PCA. Hence, to interpret distances between points, PCA is preferable. This is stated in the original t-SNE paper (Maaten and Hinton 2008). Moreover, many blog posts for engineers³ address this fact to encourage the proper understanding of t-SNE. For these reasons, we employed PCA as the basis of our visualization.
   Second, even if the data to visualize are fixed, t-SNE returns different results depending on its hyperparameter called perplexity. In contrast, PCA returns the same results if the data to visualize are fixed. This dependence on the hyperparameter is elaborated in the original t-SNE paper (Maaten and Hinton 2008) in the first place. We can also find blog posts targeting engineers⁴ that advocate carefully setting the perplexity parameter. Varying results on fixed data can be useful when the data are difficult to pre-process into a form that downstream dimension-reduction methods handle easily. However, in this study, the data to be visualized are embedding vectors; hence, the data can easily be pre-processed before we feed them into the visualization. Thus, for the purpose of this study, the fact that the results vary on fixed data is unlikely to be useful. Rather, it may complicate the interpretation of the visualization.
   Third, practically, t-SNE is computationally heavy compared to PCA. Computing a t-SNE visualization involves calculations for every pair of the given data points. While studies such as (Tang et al. 2016) address how to deal with this heavy computational complexity, in practice t-

   ² https://github.com/huggingface/pytorch-pretrained-BERT
   ³ https://mlexplained.com/2018/09/14/paper-dissected-visualizing-data-using-t-sne-explained/
   ⁴ https://distill.pub/2016/misread-tsne/
Figure 6: Variance of usage vectors vs. log word frequency.

Figure 7: Recap: usage of "haunt" predicted to be known to the learner.

SNE is usually computationally heavy when compared to PCA. Strictly speaking, PCA has a similar complexity, as it involves the computation of singular values and vectors in singular value decomposition (SVD). However, because the calculation of SVD has a number of applications other than PCA-based visualization, sophisticated calculation methods for large data have been proposed (Halko et al. 2011).

Preliminary System Using PCA
We built a preliminary system and conducted some experiments to see how contextualized word embedding vectors are plotted in the system. Figure 4 depicts an example of searching for the word book. Users can directly type the word in the textbox shown at the top of Figure 4. Below it are the visualization of the usages found and their list. Each dark-colored point is linked to a usage. Two dark colors are used to color each usage point according to the results of Gaussian mixture model (GMM) clustering with 2 components, as this value was reported to work well (Athiwaratkun, Wilson, and Anandkumar 2018). The light-red colored point is the probe point: the usages are listed in order of proximity to the probe point. No usage is linked to the probe point. Users can freely and interactively drag the probe point to change the list of usages below the visualization. Each line of the list shows the usage identification number and the surrounding words of the usage, followed by a checkbox to record the usage so that learners can refer to it later. In Figure 4, the probe point is on the left part of the visualization. In the first several lines of the list, the system successfully shows the usages of the word book as a publication. In contrast, Figure 5 depicts the case in which the user drags the probe point from the left to the right of the visualization. The first several lines of the list then show the usages of the word book that mean to reserve. We can see that the words surrounding the word book vary: merely focusing on the surrounding words, such as "to" before book, cannot distinguish the usages of book that mean to reserve from the usages of book related to reading.

Clustering Results
The GMM clustering was accurate but not perfect: 0 errors in the 42 usages of "book" and 1 error in the 22 usages of "bank", when manually checked in the excerpt. Hence, learners can choose not to use this feature, as in the video. Figure 6 shows the variance of the usage vectors of each word against its log frequency in the excerpt. It shows a statistically significant moderate correlation (r = 0.56, p < 0.01 by F-test), implying that frequent words tend to have complex usages.

Motivating Examples
From the example of "book" in the previous sections, we can easily see that the usages of "book" related to reading are more frequent than those related to a reservation. Hence, when counting the number of usages, it is intuitive to assume that learners are not familiar with all usages but rather with the usages within a certain radius in the vector space. This is the motivation of our method described in the next section.
   Before entering the technical details of our visualization method in the next section, we show some usage prediction examples of our method in a manner similar to the previous examples of "book" so that readers can intuitively understand our motivation, as shown in Figure 7 and Figure 8. The markers are changed to triangles to denote that the colors reflect prediction results rather than the GMM-based clustering results explained above. The coloring and darkness of the points in the visualization follow those of the previous examples; the light-red point is the probe point, and the other, dark points denote usages. Figure 7 shows an example of familiar-usage prediction when searching for the word "haunt". The right-hand side of the cross-marked circle is the area in which usages are predicted to be familiar to this learner. The probe point is located within the circle. We can see that the usages of "haunt" meaning chasing are listed below. Figure 8 shows another example for "haunt". As the probe point is located outside the circle, on the left side of the visualization, the list below shows the usages predicted to be unfamiliar to this learner. We can see that usages of "haunt" meaning "to curse" are mainly listed.
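The "usages within a certain radius" intuition can be made concrete with a hard count over usage vectors. The following toy sketch is our own illustration (the 2-D vectors are made up, with a dense cluster for the frequent sense and a distant cluster for the rarer one):

```python
import math

def usages_within_radius(center, usage_vectors, eps):
    """Count usage vectors whose Euclidean distance to `center`
    is less than the radius `eps`."""
    return sum(1 for x in usage_vectors if math.dist(center, x) < eps)

# Made-up 2-D usage vectors of one word: three usages of a frequent
# sense near the center, two usages of a rarer, distant sense.
usages = [(0.1, 0.0), (0.0, 0.2), (-0.1, 0.1),   # frequent sense
          (3.0, 3.1), (3.2, 2.9)]                # rarer sense
known = usages_within_radius((0.0, 0.0), usages, eps=1.0)  # counts 3
```

A learner familiar only with the frequent sense is modeled by a small radius around its cluster; growing the radius until it covers the distant cluster models fuller knowledge of the word.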
Figure 8: Usage of "haunt" predicted not to be known to the learner.

                      Proposed Method
As stated in the Related Work section, some previous studies address methods to predict the words that a learner knows based on his/her short vocabulary test result. However, our application requires a personalized prediction of the usages of a word that the learner does not know. Hence, we propose a novel model that does this.
   Let us write the set of words as {v_1, ..., v_I}, where I is the number of words (in type), and the set of learners as {l_1, ..., l_J}, where J is the number of learners. Then, in previous studies based on the Rasch model (Rasch 1960; Baker 2004), the following logistic regression model, Equation 1, is used to predict whether learner l_j knows word v_i or not. Here, σ(x) := 1 / (1 + exp(−x)), and y_{i,j} is the response of the learner in the vocabulary test: y_{i,j} = 1 if learner l_j answered the question for word v_i correctly, and y_{i,j} = 0 otherwise. We have two types of parameters to tune: a_{l_j} is the ability of learner l_j, and d_{v_i} is the difficulty of word v_i.

        P(y_{i,j} = 1 | l_j, v_i) = σ(a_{l_j} − d_{v_i})    (1)

   Here, how to model d_{v_i}, or the difficulty parameter of word v_i, is central to our model. Let ReLU denote the rectified linear unit function, and M be a large positive constant, such as 100. Let G be a linear projection matrix from a T_1-dimensional space to a T_2-dimensional space. Let d_e(a, b) be the Euclidean distance between two vectors. By using these formulations, we modeled the difficulty of words as follows:

        d_{v_i} = − log(freq(v_i) + 1)    (2)
        freq(v_i) = N(c_i, ε, X_i)    (3)
                 ≈ Σ_{k=1}^{n_i} tanh(M · ReLU(ε − d_e(G c_i, G x_{k,i})))    (4)

   The tricky part is that Equation 3 can be approximately written as Equation 4, whose parameters can be easily tuned and optimized by using a neural machine learning framework such as PyTorch. In Equation 4, due to the ReLU function, negative values inside the function are simply ignored. Hence, as d_e is the Euclidean distance, if ε = 0, i.e., the size of the circle is 0, the terms inside the ReLU are negative, and freq(v_i) = 0. If ε − d_e(G c_i, G x_{k,i}) > 0, then, due to M and tanh, the resulting value is almost 1. This means that we are counting only the cases in which ε surpasses d_e, i.e., counting the usages within radius ε measured from c_i.
   Notably, the following characteristics are important for understanding our model.

Not merely a logistic regression
Notably, the proposed model is not merely a logistic regression. Our model has more parameters, such as ε, c_i, a_{l_j}, and G. Because of these extra parameters compared to logistic regression, to train our model we typically need a neural network machine learning framework, such as PyTorch, for modeling and optimization. To optimize such models, as it is difficult to differentiate their loss functions by hand, the loss function should be mostly continuous and smooth so that its parameters can be tuned by automatic differentiation. We specifically designed Equation 4 to meet these conditions. In the experiments, we used the Adam optimization method (Kingma and Ba 2015) to optimize the loss function.
word vi , is the key to our purpose. Previous studies report              Trainable G
that the negative logarithm of the word frequency correlates              As Equation 4 is mostly continuous and smooth, matrix G
well with the perceived difficulty of words (Tamayo 1987;                 can also be trained by using deep-learning framework soft-
Beglar 2010). As in Figure 1, our key idea is to count the fre-           ware. As G is a projection matrix from T1 to T2 , if we set
quency of word usages only within a certain distance from                 T2 = 2 to consider a projection to a two-dimensional space,
the typical usage of the word. Hence, we propose the follow-              training G via supervisions means training visualization via
ing model to implement this idea.                                         supervisions. Here, in our task setting, the supervisions are
   For each vi , we have ni vectors that are vector representa-           vocabulary test dataset of second language learners, i.e., a
tion of each of the ni occurrences of word vi . We write these            matrix in which the (j, i)-th element denotes whether learner
vectors as Xi = {~x1,i , . . . , ~xni ,i }. Each vector ~xk,i is T1 di-   lj correctly answered the question of word vi .
       Pni Among Xi , let ~ci be the one closest to their cen-
mensional.
ter n1i k=1   ~xk . Let freq(vi ) be the frequency of the vectors         j : Personalized 
in Xi within distance  measured from the central vector ~ci .            In Equation 4, for easier understanding, we write  to be a
We write this frequency simply as freq(vi ) = N (~ci , , Xi ).           constant that does not depend on learner index j. In reality,
Here, n is the number of usages of word vi and let each ~xk               we can personalize  by making  dependent to learner index
be each usage vector obtained from contextualized word em-                j as j ; in this case, each learner lj has his/her own region
bedding methods. Let ReLU(z) = max(0, z) be the recti-                    that he/she can understand, and the radius of this region is
j . This personalized version is the one that we used in the
experiments.                                                      Table 1: Number of sentences in each domain of the BNC
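To make Equation 4 concrete, the soft counting can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the usage vectors X, the scale M = 50, and the radius eps = 2.0 are stand-in values, whereas in the actual model the radius, the center, and G are trained from the vocabulary test data.

```python
import numpy as np

def soft_usage_count(X, c, G, eps, M=50.0):
    """Soft count of usages within radius eps of the central usage
    vector c, after projecting with G (a sketch of Equation 4)."""
    # Euclidean distances d_e(G c, G x_k) in the projected space.
    d = np.linalg.norm(X @ G.T - c @ G.T, axis=1)
    # tanh(M * ReLU(eps - d)) is ~1 well inside the sphere, 0 outside.
    return np.tanh(M * np.maximum(0.0, eps - d)).sum()

rng = np.random.default_rng(0)
T1 = 8
X = rng.normal(size=(100, T1))        # stand-ins for usage embeddings
c = X[np.argmin(np.linalg.norm(X - X.mean(0), axis=1))]  # occurrence nearest the mean
G = np.eye(T1)                        # G = I: no projection

soft = soft_usage_count(X, c, G, eps=2.0)
hard = float((np.linalg.norm(X - c, axis=1) < 2.0).sum())  # exact count N(~ci, eps, Xi)
```

With eps = 0 every ReLU term vanishes and the soft count is exactly 0, as discussed above; for eps > 0 it closely tracks the exact count, because tanh(M · ReLU(·)) acts almost like a 0/1 indicator for large M while staying differentiable in eps, the center, and G.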
                                                                  corpus in the total of 100, 000 sentences.
                          Experiments
Quantitative Results of Prediction
Quantitative evaluation of this personalized prediction of the usages of a word is difficult; to this end, we would need to test each learner multiple times on different usages of the same word. However, when tested with the same word multiple times, learners easily notice that the word has multiple meanings. Hence, instead, we evaluated the accuracy of the personalized prediction of the words that the learner knows, under an experimental setting similar to that of (Ehara 2018). Our proposed method is based on neural classification with a novel extension that adjusts the counting of the frequency of the usages within distance εj. Since a typical logistic regression classifier is identical to a one-layer neural classifier, comparing our model with a typical logistic regression classifier using a frequency feature in terms of accuracy indirectly evaluates how practical the idea of adjusted frequency is.
   The proposed model estimates the number of occurrences, i.e., usages, that each learner knows. In other words, this can be regarded as modifying the word frequency so that the model fits the given vocabulary test dataset. In this regard, we can evaluate how well the proposed model corrects the word frequency when an unbalanced corpus is given. Each document in the British National Corpus (BNC) (BNC Consortium 2007) is annotated with a domain (Table 1). We evaluated how well the proposed model can correct the word frequency in the "arts" domain.

Table 1: Number of sentences in each domain of the BNC corpus in the total of 100,000 sentences.
        imaginative          21,946
        arts                 18,289
        natural sciences      5,256
        social science        7,777
        commerce              4,378
        leisure              20,300
        belief and thought    3,441
        world news              764
        applied science       2,625
        world affairs        15,224

   We used the vocabulary test result data in which each of 100 learners answered 31 vocabulary questions in the publicly available dataset (Ehara 2018). Of the 3,100 vocabulary test responses, we used 1,800 to train the model, and the rest were used for testing. The baseline model is simply a logistic regression in which the logarithm of word frequency is the only feature. The logarithm of word frequency has been used as a simple, rough measure of word difficulty and has previously been used to analyze and predict word difficulty based on vocabulary test data (Beglar 2010; Ehara et al. 2013; Lee and Yeung 2018; Yeung and Lee 2018). The proposed model counts only the number of usages within the radius εj. We used the PyTorch neural network framework to automatically tune the radius εj and the center of the sphere through its automatic gradient support (Paszke et al. 2017).
   First, we performed experiments with T1 = T2 and G = I, a setting in which no projection is performed and the model deals with T1-dimensional hyperspheres. Table 2 shows the results. It can be seen that the accuracy of predicting the learners' vocabulary test data using the biased text of the arts domain alone is lower than that using the word frequency of all domains. The proposed method was able to improve the accuracy for the arts-domain word frequency by counting the frequency only in the region of the contextualized word embedding vector space in which the examinee is estimated to be responsive. This effect was also observed for all domains. This seems to be the effect of the frequency counting excluding outlier cases. The improvement in accuracy before and after correction was statistically significant (p < 0.01, Wilcoxon test) when modifying word frequencies in the arts domain alone or in all domains.

Table 2: Accuracy of predicting learners' vocabulary test responses by using the raw frequencies and the frequencies corrected by the proposed model in each domain.
            Domain        Correction    Accuracy
            Arts             Raw          0.61
            Arts           Corrected      0.64
            All domains      Raw          0.67
            All domains    Corrected      0.72

"Trained" visualization
In the above experiments, we considered the case in which no projection is conducted, by fixing G = I. Next, let us consider the case in which G is a projection to a two-dimensional space, i.e., G is a 2 × T1 matrix. Tuning G and the radius εj to fit the vocabulary test dataset by using Equation 4 means that we can actually train the visualization to explain the vocabulary test dataset in a supervised manner.
   Figure 9 and Figure 10 show the results of the visualization. The initial value of G was set to a two-dimensional projection matrix obtained by principal component analysis (PCA). Though the initial value is the PCA projection, it should be noted that the projection matrix G itself is trained from the vocabulary test dataset, as is the radius εj.
   From Figure 9 and Figure 10, we can see that the proposed method counts only the main meanings within the red circle. To qualitatively evaluate the results, Table 3 shows the two farthest and the two closest example occurrences of "period" from its center point, i.e., the center of the red circle in Figure 9. It can be seen that the farthest cases are examples of the use of technical terms such as "period pain" and "magnetic field period", while the closest two cases are examples of nouns representing periods, such as "this period" and "the period".
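The PCA initialization of G described above can be sketched as follows. This is a minimal illustration under assumed inputs (random stand-ins for the contextualized usage vectors), not the authors' code; in the paper, the resulting 2 × T1 matrix is only the starting point from which G is further trained on the vocabulary test dataset.

```python
import numpy as np

def pca_projection(X, k=2):
    """Return a k x T1 projection matrix whose rows are the top-k
    principal axes of the vectors in X."""
    Xc = X - X.mean(axis=0)        # center the usage vectors
    # Right singular vectors of the centered data are the principal
    # axes, ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))      # stand-ins for T1 = 8 dim usage embeddings
G = pca_projection(X)              # initial value of G before supervised training
Y = X @ G.T                        # 2-D coordinates used for the initial plot
```

The rows of G are orthonormal, and the first coordinate of Y carries at least as much variance as the second; supervised training then moves G away from this purely unsupervised solution so that the circles of radius εj fit the test responses.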
Table 3: Farthest (F.) and closest (C.) two occurrences of "period" from the center of the circle in Figure 9.
   F.    period pains can be severe and disruptive.
   F.    to produce a slight spread of magnetic field period.
   C.    design during this period was in the plan.
   C.    the pub designer of the period,

Figure 9: Trained visualization example of "period". Each triangle point represents an occurrence, or a usage, of the word "period" in the "arts" domain of the BNC corpus. The entire projection of the original contextualized word embedding vectors to the two-dimensional space, namely G, and the radius εj were optimized to fit the vocabulary test dataset (cf. Equation 1, Equation 3, and Equation 4). Intuitively, a large εj denotes that learner lj has high language ability, as he/she is estimated to understand many of the occurrences of the word "period" within the red circle.

Figure 10: Trained visualization example of "figure". The setting of the training is identical to that of Figure 9.

                          Conclusions
In this paper, we proposed a supervised visualization method to predict which usages of a word are known to each learner, using a vocabulary test result dataset as supervision. Our neural method automatically tunes the projection matrix for visualization and the radius of each learner in the visualization so that the counted frequency within the circles fits the supervision. Experiments on actual learner response data show that the proposed method can predict learner responses more accurately by modifying the frequency, even when the use cases are biased toward a specific domain. As future work, we plan to make our method more interactive.

                          References
Athiwaratkun, B.; Wilson, A.; and Anandkumar, A. 2018. Probabilistic FastText for multi-sense word embeddings. In Proc. of ACL.
Baker, F. B. 2004. Item Response Theory: Parameter Estimation Techniques, Second Edition. CRC Press.
Beglar, D. 2010. A Rasch-based validation of the vocabulary size test. Language Testing 27(1):101–118.
BNC Consortium. 2007. The British National Corpus, version 3 (BNC XML Edition).
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
Dias, G., and Moraliyski, R. 2009. Relieving polysemy problem for synonymy detection. In Portuguese Conference on Artificial Intelligence, 610–621. Springer.
Ehara, Y.; Sato, I.; Oiwa, H.; and Nakagawa, H. 2012. Mining words in the minds of second language learners: learner-specific word difficulty. In Proc. of COLING.
Ehara, Y.; Shimizu, N.; Ninomiya, T.; and Nakagawa, H. 2013. Personalized reading support for second-language web documents. ACM Transactions on Intelligent Systems and Technology 4(2).
Ehara, Y. 2018. Building an English vocabulary knowledge dataset of Japanese English-as-a-second-language learners using crowdsourcing. In Proc. of LREC.
Halko, N.; Martinsson, P.-G.; Shkolnisky, Y.; and Tygert, M. 2011. An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing 33(5):2580–2594.
Heilman, M.; Collins-Thompson, K.; Callan, J.; and Eskenazi, M. 2007. Combining lexical and grammatical features to improve readability measures for first and second language texts. In Proc. of NAACL, 460–467.
Hinkelmann, K.; Blaser, M.; Faust, O.; Horst, A.; and Mehli, C. 2019. Virtual Bartender: A dialog system combining data-driven and knowledge-based recommendation. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering.
Hockey, S., and Martin, J. 1987. The Oxford Concordance Program Version 2. Digital Scholarship in the Humanities 2(2):125–131.
Jian, J.-Y.; Chang, Y.-C.; and Chang, J. S. 2004. TANGO: Bilingual collocational concordancer. In Proc. of ACL demo., 166–169.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
Koh, P. W., and Liang, P. 2017. Understanding black-box predictions via influence functions. In Proc. of ICML, 1885–1894.
Laurenzi, E.; Hinkelmann, K.; Jüngling, S.; Montecchiari, D.; Pande, C.; and Martin, A. 2019. Towards an assistive and pattern learning-driven process modeling approach. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering.
Lee, J., and Yeung, C. Y. 2018. Personalizing lexical simplification. In Proc. of COLING, 224–232.
Liu, S.; Bremer, P.-T.; Thiagarajan, J. J.; Srikumar, V.; Wang, B.; Livnat, Y.; and Pascucci, V. 2017. Visual exploration of semantic relationships in neural word embeddings. IEEE Transactions on Visualization and Computer Graphics 24(1):553–562.
Lundberg, S. M., and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Proc. of NIPS, 4765–4774.
Lux-Pogodalla, V.; Besagni, D.; and Fort, K. 2010. FastKwic, an "intelligent" concordancer using FASTR. In Proc. of LREC.
Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS, 3111–3119.
Paetzold, G., and Specia, L. 2016. Benchmarking lexical simplification systems. In Proc. of LREC.
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP, 1532–1543.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proc. of NAACL.
Ramprasad, S., and Maddox, J. 2019. CoKE: Word sense induction using contextualized knowledge embeddings. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering.
Rasch, G. 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research.
Read, J. 2000. Assessing Vocabulary. Cambridge University Press.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. of KDD, 1135–1144. ACM.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: High-precision model-agnostic explanations. In Proc. of AAAI.
Smilkov, D.; Thorat, N.; Nicholson, C.; Reif, E.; Viégas, F. B.; and Wattenberg, M. 2016. Embedding Projector: Interactive visualization and interpretation of embeddings. In Proc. of the NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems.
Tamayo, J. M. 1987. Frequency of use as a measure of word difficulty in bilingual vocabulary test construction and translation. Educational and Psychological Measurement 47(4):893–902.
Tang, J.; Liu, J.; Zhang, M.; and Mei, Q. 2016. Visualizing large-scale and high-dimensional data. In Proc. of WWW, 287–297.
Wu, J.-C.; Chuang, T. C.; Shei, W.-C.; and Chang, J. S. 2004. Subsentential translation memory for computer assisted writing and translation. In Proc. of ACL demo., 106–109.
Yeung, C. Y., and Lee, J. 2018. Personalized text retrieval for learners of Chinese as a foreign language. In Proc. of COLING, 3448–3455.
Yimam, S. M.; Biemann, C.; Malmasi, S.; Paetzold, G. H.; Specia, L.; Štajner, S.; Tack, A.; and Zampieri, M. 2018. A report on the complex word identification shared task 2018. In Proc. of BEA.