Published in CEUR Workshop Proceedings, Vol-2600: https://ceur-ws.org/Vol-2600/paper21.pdf
 Supervised Visualization of Vocabulary Knowledge towards Explainable Support
                          of Second Language Learners

                                                               Yo Ehara
                                            Shizuoka Institute of Science and Technology,
                                            2200-2, Toyosawa, Fukuroi, Shizuoka, Japan.
                                                         ehara.yo@sist.ac.jp


                           Abstract

  In second language learning, it is crucial to identify gaps in knowledge of the language between second language learners and native speakers. Such a gap exists even when learning a single word in a second language. As the semantic broadness of a word differs from language to language, language learners must learn how broadly a word can be used in a language. For example, certain languages use different words for "period" in "a period of time" and "period" in "period pains," even though both are nouns. Learners whose native languages are such languages typically have only partial knowledge of a word: they think they know the word "period," yet a gap remains between them and native speakers. Language learners typically want explanations for these word usage differences, which even native speakers find difficult to explain and costly to annotate. To support language learners in noticing these challenging differences easily and intuitively, this paper proposes a novel supervised visualization of the usages of a word. In our method, the usages of an inputted word in large corpora written by native speakers are visualized, taking the semantic proximity between the usages into account. Then, for the single inputted word, our method makes a personalized prediction of the word usages that each learner may know, based on his/her results on a quick vocabulary test, which takes approximately 30 minutes. The experimental results show that the usage counts produced by our method predict vocabulary test responses better than raw usage frequency counts, implying that the word usage prediction is accurate.

Copyright (c) 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                        Introduction

Acquiring a second language requires repeated efforts to narrow the gap between language learners' knowledge of the language and that of native speakers. Making such gaps intuitively understandable greatly helps language learners teach themselves the language and also helps researchers build effective language tutoring systems. Some gaps, such as vocabulary size or time spent in language learning, are intuitively easy to understand and, hence, are well studied. However, in second language learning, most gaps are related to meaning and semantics and are inherently abstract. Hence, visualizing these gaps is essential to make them intuitively understandable.

The broadness of a word, or how a word can be used in the language to express different concepts, is one such abstract gap (Read 2000). Because the meaning of a word differs from language to language, when learning a word in a second language, there typically exists a gap between what learners think the word means and how the word is actually used in the language. Polysemous words are examples that are easy to understand: "book" can mean an item associated with reading, or it can mean to make a reservation. Beyond such examples, to which part-of-speech tagging techniques in natural language processing (NLP) seem applicable, some cases are more subtle: some languages always use different words for "time" in "in a short time" or "for a time," in which "time" refers to a period, and in "time and space" or "time heals all wounds," in which "time" is used as an abstract concept. In another example, many languages use different words for "period" in "a period of time" and "period" in "period pains." In this way, the granularity of a word's senses that should be distinguished for second language acquisition varies from word to word.

Polysemous words encode different concepts in one word; hence, they have been one of the central topics in knowledge engineering. A substantial amount of work has been conducted to automatically recognize polysemous words for practical applications by using machine learning, including work in the previous AAAI-MAKE workshops (Ramprasad and Maddox 2019; Hinkelmann et al. 2019; Laurenzi et al. 2019). However, even among the few such applications for second language acquisition (Heilman et al. 2007; Dias and Moraliyski 2009) in the artificial intelligence (AI) community, the challenging problem of the varying granularity of a word's senses in second language acquisition has not been addressed. As learners are typically not linguistic experts, i.e., they are novices, systems that support their learning need to be intuitively understandable. Our goal is to make the gaps among word usages intuitively understandable, even for novice language learners.
   To this end, this paper proposes a novel supervised visualization method for word usages to assist in learning the different usages of a word. Our method first searches for all usages of the target word in a large corpus written by native speakers. Then, it calculates a vector representation of each usage, or occurrence, of the word by using a contextualized word embedding method (Devlin et al. 2019). Contextualized word embedding methods (Peters et al. 2018; Devlin et al. 2019) are recently proposed methods that embed each occurrence of a word, capturing the context of each usage of the word.
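As a minimal sketch of this one-vector-per-occurrence idea (our own illustration: the helper names are hypothetical, and placeholder vectors stand in for the BERT outputs):

```python
from dataclasses import dataclass

@dataclass
class Occurrence:
    sent_id: int     # which sentence the usage appears in
    position: int    # token index of the target word
    context: tuple   # surrounding words (the "usage")
    vector: list     # per-occurrence embedding (placeholder values here)

def index_occurrences(sentences, target, window=2):
    """Collect one record per occurrence of `target`, as a
    contextualized embedding method would (one vector per usage,
    rather than one vector per word)."""
    records = []
    for sid, sent in enumerate(sentences):
        tokens = sent.lower().split()
        for pos, tok in enumerate(tokens):
            if tok == target:
                ctx = tuple(tokens[max(0, pos - window):pos + window + 1])
                # A real system would run BERT on the sentence here; we
                # store a placeholder vector keyed to this occurrence.
                records.append(Occurrence(sid, pos, ctx, [0.0]))
    return records

sentences = ["I want to book a flight",
             "She wrote a book about birds"]
occs = index_occurrences(sentences, "book")  # two distinct usages
```

Each occurrence keeps its own entry, so the two senses of "book" above end up as two separate points rather than one shared vector.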
   Then, our method is trained to visualize the contextualized word embedding vectors by plotting each usage as a point in a two-dimensional space. Unlike a typical visualization method that merely projects the vectors to a two-dimensional space, our method is trained to fit and visually explain a given supervision dataset. This means that the same vectors are visualized in different ways if the supervision dataset differs. Here, the supervision is a vocabulary test result dataset consisting of matrix-format data that records which learner answered which word question correctly or incorrectly. The method visualizes the areas a learner may know by classifying each usage point in the visualization as known or not known to the learner. This classification is conducted in a personalized manner because learners' language skills and specialized fields differ. The learner only needs to take a 30-minute vocabulary test for this purpose.
   Figure 1 shows an example visualization using our method. "To haunt" has two different meanings in English, the first being "to chase" and the other "to curse," or to be affected by ghosts or misfortune. Each point shows a usage of the word in a corpus written by native speakers. The differences in point colors indicate whether they are predicted to be known to the learner. The right side of the figure, within the dotted curve, is predicted to be known to the learner. In this way, our method visualizes the semantic area the learner knows.
   Our contributions are as follows:

• For second language vocabulary learning, we propose a novel supervised visualization model that captures word broadness via a personalized prediction of a learner's knowledge of usages.

• As our visualization uses a vocabulary test result dataset as supervision, learners can understand which usages of the inputted word are predicted to be known or not known to them. Unlike previous methods that output automatic explanations of machine-learning models, our method is much more intuitive and novice-friendly in the sense that language learners do not need to know about machine-learning models.

• We evaluated our method in terms of predictive accuracy on a vocabulary test result dataset and achieved better results than the baselines.

Figure 1: Usage of "haunt" predicted to be familiar to the learner.

Figure 2: An example of a concordancer.

                          Related Work
Explainable machine learning studies
While deep learning-based methods outperform conventional machine learning methods such as support vector machines (SVMs) in many tasks, the parameters of deep learning methods are typically more difficult to interpret than those of conventional models. To this end, in the machine learning and artificial intelligence community, a number of methods have been proposed to extract explanations from trained machine-learning models, or to train models that take explainability into account (Ribeiro, Singh, and Guestrin 2016; Koh and Liang 2017; Lundberg and Lee 2017; Ribeiro, Singh, and Guestrin 2018).
   However, the purpose of these methods is to explain machine-learning models to machine-learning engineers and researchers. Obviously, second language learners are usually not machine-learning engineers or researchers. Therefore, these studies have different purposes, and it is difficult to apply their methods to help learners understand the models. Language learners are typically not even interested in the models. Rather, learners' interests reside in understanding their current learning status and what they should learn
to improve it. Hence, to meet learners' needs, a model is desirable that lets a learner see his/her current learning status and what he/she needs to learn in the near future.

Word Embedding Visualization Studies
Word embedding techniques have been extensively studied in natural language processing (NLP) to obtain vector representations of words, typically using neural networks. The word2vec paper is seminal in this line of studies (Mikolov et al. 2013). Subsequent papers report improvements in how accurately words are represented as vectors, typically by comparing the distances between word vectors with human judgments of the semantic proximity between words (Pennington, Socher, and Manning 2014).
   Early studies on word embeddings address how to make one vector for each word. As one vector representation is modeled to point to one meaning, this limitation is obviously problematic for polysemous words. Several previous studies tackled this problem and proposed methods to estimate the number of a word's meanings and to estimate an embedding for each meaning of the word (Athiwaratkun, Wilson, and Anandkumar 2018). Recently, however, contextualized word embeddings (Peters et al. 2018; Devlin et al. 2019) quickly became popular. With these methods, we can obtain an embedding for each usage, or occurrence, of a word, considering the context of the occurrence of the word in a running sentence. These methods can also be seen as estimating word embeddings for polysemous words under the extreme assumption that each occurrence of a word has a different meaning. As contextualized word embeddings have been shown to be successful in many tasks, in current NLP the former strategy, estimating both the number of meanings of a word and an embedding for each meaning, is employed only when necessary.
   Following the rise of word embedding techniques, visualization studies were proposed to visualize word embeddings. The study by (Smilkov et al. 2016) reported the development of a tool to visualize embeddings for different words. The study by (Liu et al. 2017) applies visualization of word embeddings to analyze semantic relationships between words. Both papers use principal component analysis (PCA) and t-SNE (Maaten and Hinton 2008) for visualization. To our knowledge, we are the first to visualize contextualized word embeddings, in which each occurrence of a word, rather than the word itself, is visualized, with a practical purpose in language education.
   In addition to the visualization, our method can also predict the usages that each learner is familiar or unfamiliar with, in a personalized manner, when vocabulary test result data of dozens of learners are provided, such as the data in (Ehara 2018). While there exist previous studies (Ehara 2018; Lee and Yeung 2018; Yeung and Lee 2018) that predict the words each learner is familiar/unfamiliar with from such data by using simple machine-learning classification, our method tackles a more difficult problem: predicting which usages of a word are known or unknown to the learner.

Concordancer studies
While our proposed method is novel as a visualization, software tools that search for the usages of an inputted word for educational purposes and display them are not themselves novel: such software is known as concordancers. Concordancers target learners, educators, and linguists as primary users. They are interactive software tools that retrieve all usages of the inputted word in a large corpus and display the list of usages, each of which comes with its surrounding word patterns (Hockey and Martin 1987). Concordancers have also been studied to support translators, who are in many cases second language learners (Wu et al. 2004; Jian, Chang, and Chang 2004; Lux-Pogodalla, Besagni, and Fort 2010).
   Figure 2 shows a screenshot from a current concordancer¹. In this screenshot, the word "book" is searched, and the list of its usages is shown. Each usage comes with its surrounding words so that language learners can see how the word is used. As the list is sorted in alphabetical order of the preceding word, "a book" and "the book" appear in totally different positions, which is not helpful for language learners. While some concordancers support listing the usages of "book" as a noun by attaching part-of-speech tags to the texts in advance, this does not help when the different usages of a word share the same part of speech. For example, the word "bank" has polysemous meanings sharing the same part of speech: one as a financial organization and another as an embankment.

Personalized complex word identification studies
In this study, a part of our goal is to identify complex usages of a word in running text. In other words, for one word, one usage in running text may be complex for a learner while another usage is not. There are previous studies in the NLP literature that identify complex words in a personalized manner (Ehara et al. 2012; Lee and Yeung 2018). These studies predict the words that each learner knows based on the learner's result on a short vocabulary test, which typically takes 30 minutes to solve. There are also many studies that identify complex usages in a non-personalized manner, as summarized in (Paetzold and Specia 2016; Yimam et al. 2018).
   However, to our knowledge, the task of identifying complex usages in a personalized manner is novel. Our method is also novel in that it trains how to visualize the usages, using the learners' vocabulary test data, so that learners can visually understand the usage differences.

           Preliminary System and Experiments
Before entering the technical details of our method, described in the Proposed Method section, we first show the preliminary system and some experimental results to introduce the motivation of the proposed method.
   The preliminary system visualizes contextualized word embeddings by using the conventional visualization of principal component analysis (PCA). Figure 3 shows the layout

   ¹ https://lextutor.ca/conc/eng/
of the preliminary system. Once a user provides a word to the system, it automatically searches for the word in the corpus in a similar way to typical concordancers. Unlike concordancers, the system has a database that stores a contextualized word embedding for each usage, or occurrence, of each word in the corpus. We used half a million sentences from the British National Corpus (BNC Consortium 2007) as the raw corpus. We built the database by applying the bert-base-uncased model of the PyTorch Pretrained BERT project² (Devlin et al. 2019) to the corpus. We used the last layer, the one most distant from the surface input, as the embeddings.

Figure 3: System layout. CWE means contextualized word embeddings.

Figure 4: Example of searching the word book.

Figure 5: Another example of searching the word book.

Choice of dimension reduction methods
Principal component analysis (PCA) and t-SNE (Maaten and Hinton 2008) are famous dimension reduction methods, and t-SNE is notable for the intuitive, well-clustered visualizations of data points it produces (Maaten and Hinton 2008).
   Knowing t-SNE, we did not employ it for visualization, for the following reasons. First, in our visualization, the distances between usage points are important. While t-SNE often produces intuitive clusters of data points, the distances between points in its visualizations are complicated to interpret compared to those of PCA. Hence, to interpret distances between points, PCA is preferable. This is stated in the original t-SNE paper (Maaten and Hinton 2008). Moreover, many blog posts for engineers³ address this fact to encourage the proper understanding of t-SNE. For these reasons, we employed PCA as the basis of our visualization.
   Second, even if the data to visualize are fixed, t-SNE returns different results depending on its hyperparameter called perplexity. In contrast, PCA returns the same results if the data to visualize are fixed. This dependence on the hyperparameter is elaborated in the original t-SNE paper (Maaten and Hinton 2008) in the first place. We can also find blog posts targeting engineers⁴ that advocate carefully setting the perplexity parameter. Varying results on fixed data can be useful when the data are difficult to pre-process into a form that downstream dimension-reduction methods handle easily. However, in this study, the data to be visualized are embedding vectors; hence, the data can easily be pre-processed before we feed them into the visualization. Thus, for the purpose of this study, the fact that the results vary on fixed data is unlikely to be useful. Rather, it may complicate the interpretation of the visualization.
   Third, practically, t-SNE is computationally heavy compared to PCA. Computing a t-SNE visualization involves calculations for every pair of the given data points. While studies such as (Tang et al. 2016) address how to deal with this heavy computational complexity, in practice t-

   ² https://github.com/huggingface/pytorch-pretrained-BERT
   ³ https://mlexplained.com/2018/09/14/paper-dissected-visualizing-data-using-t-sne-explained/
   ⁴ https://distill.pub/2016/misread-tsne/
Figure 6: Variance of usage vectors vs. log word frequency.

Figure 7: Recap: usage of "haunt" predicted to be known to the learner.

SNE is usually computationally heavy when compared to PCA. Strictly speaking, PCA has a similar complexity, as it involves the computation of singular values and vectors in singular value decomposition (SVD). However, because the calculation of SVD has a number of applications other than PCA-based visualization, sophisticated calculation methods for large data have been proposed (Halko et al. 2011).

Preliminary System Using PCA
We built a preliminary system and conducted some experiments to see how contextualized word embedding vectors are plotted in the system. Figure 4 depicts an example of searching for the word book. Users can directly type the word in the textbox shown at the top of Figure 4. Below it are the visualization of the usages found and their list. Each dark-colored point is linked to a usage. Two dark colors are used to color each usage point according to the results of Gaussian mixture model (GMM) clustering with 2 components, as this value was reported to work well (Athiwaratkun, Wilson, and Anandkumar 2018). The light-red colored point is the probe point: the usages are listed in order of proximity to the probe point. No usage is linked to the probe point. Users can freely and interactively drag the probe point to change the list of usages below the visualization. Each line of the list shows the usage identification number and the surrounding words of the usage, followed by a checkbox to record the usage so that learners can refer to it later. In Figure 4, the probe point is on the left part of the visualization. In the first several lines of the list, the system successfully shows the usages of the word book as a publication. In contrast, Figure 5 depicts the case in which the user drags the probe point from the left to the right of the visualization. The first several lines of the list then show the usages of the word book that mean to reserve. We can see that the words surrounding the word book vary: merely focusing on the surrounding words, such as "to" before book, cannot distinguish the usages of book that mean to reserve from the usages of book related to reading.

Clustering Results
The GMM clustering was accurate but not perfect: 0 errors in the 42 usages of "book" and 1 error in the 22 usages of "bank", when manually checked in the excerpt. Hence, learners can choose not to use this feature, as in the video. Figure 6 shows the variance of the usage vectors of each word against its log frequency in the excerpt. It shows a statistically significant moderate correlation (r = 0.56, p < 0.01 by F-test), implying that frequent words tend to have complex usages.

Motivating Examples
From the example of "book" in the previous sections, we can easily see that the usages of "book" related to reading are more frequent than those related to a reservation. Hence, when counting the number of usages, it is intuitive to assume that learners are not familiar with all usages but rather with the usages within a certain radius in the vector space. This is the motivation of our method described in the next section.
   Before entering the technical details of our visualization method in the next section, we show some usage prediction examples of our method in a manner similar to the previous examples of "book" so that readers can intuitively understand our motivation, as shown in Figure 7 and Figure 8. The markers are changed to triangles to denote that the colors reflect prediction results rather than the GMM-based clustering results explained above. The coloring and darkness of the points in the visualization follow those of the previous examples; the light-red point is the probe point, and the other, dark points denote usages. Figure 7 shows an example of familiar-usage prediction when searching for the word "haunt". The right-hand side of the cross-marked circle is the area in which usages are predicted to be familiar to this learner. The probe point is located within the circle. We can see that the usages of "haunt" meaning chasing are listed below. Figure 8 shows another example for "haunt". As the probe point is located outside the circle, on the left side of the visualization, the list below shows the usages predicted to be unfamiliar to this learner. We can see that usages of "haunt" meaning "to curse" are mainly listed.
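The "usages within a certain radius" intuition can be made concrete with a hard count over usage vectors. The following toy sketch is our own illustration (the 2-D vectors are made up, with a dense cluster for the frequent sense and a distant cluster for the rarer one):

```python
import math

def usages_within_radius(center, usage_vectors, eps):
    """Count usage vectors whose Euclidean distance to `center`
    is less than the radius `eps`."""
    return sum(1 for x in usage_vectors if math.dist(center, x) < eps)

# Made-up 2-D usage vectors of one word: three usages of a frequent
# sense near the center, two usages of a rarer, distant sense.
usages = [(0.1, 0.0), (0.0, 0.2), (-0.1, 0.1),   # frequent sense
          (3.0, 3.1), (3.2, 2.9)]                # rarer sense
known = usages_within_radius((0.0, 0.0), usages, eps=1.0)  # counts 3
```

A learner familiar only with the frequent sense is modeled by a small radius around its cluster; growing the radius until it covers the distant cluster models fuller knowledge of the word.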
Figure 8: Usage of "haunt" predicted not to be known to the learner.

                      Proposed Method
As stated in the Related Work section, some previous studies address methods to predict the words that a learner knows based on his/her short vocabulary test result. However, our application requires a personalized prediction of the usages of a word that the learner does not know. Hence, we propose a novel model that does this.
   Let us write the set of words as {v_1, ..., v_I}, where I is the number of words (in type), and the set of learners as {l_1, ..., l_J}, where J is the number of learners. Then, in previous studies based on the Rasch model (Rasch 1960; Baker 2004), the following logistic regression model, Equation 1, is used to predict whether learner l_j knows word v_i or not. Here, σ(x) := 1 / (1 + exp(−x)), and y_{i,j} is the response of the learner in the vocabulary test: y_{i,j} = 1 if learner l_j answered the question for word v_i correctly, and y_{i,j} = 0 otherwise. We have two types of parameters to tune: a_{l_j} is the ability of learner l_j, and d_{v_i} is the difficulty of word v_i.

        P(y_{i,j} = 1 | l_j, v_i) = σ(a_{l_j} − d_{v_i})    (1)

   Here, how to model d_{v_i}, or the difficulty parameter of word v_i, is central to our model. Let ReLU denote the rectified linear unit function, and M be a large positive constant, such as 100. Let G be a linear projection matrix from a T_1-dimensional space to a T_2-dimensional space. Let d_e(a, b) be the Euclidean distance between two vectors. By using these formulations, we modeled the difficulty of words as follows:

        d_{v_i} = − log(freq(v_i) + 1)    (2)
        freq(v_i) = N(c_i, ε, X_i)    (3)
                 ≈ Σ_{k=1}^{n_i} tanh(M · ReLU(ε − d_e(G c_i, G x_{k,i})))    (4)

   The tricky part is that Equation 3 can be approximately written as Equation 4, whose parameters can be easily tuned and optimized by using a neural machine learning framework such as PyTorch. In Equation 4, due to the ReLU function, negative values inside the function are simply ignored. Hence, as d_e is the Euclidean distance, if ε = 0, i.e., the size of the circle is 0, the terms inside the ReLU are negative, and freq(v_i) = 0. If ε − d_e(G c_i, G x_{k,i}) > 0, then, due to M and tanh, the resulting value is almost 1. This means that we are counting only the cases in which ε surpasses d_e, i.e., counting the usages within radius ε measured from c_i.
   Notably, the following characteristics are important for understanding our model.

Not merely a logistic regression
Notably, the proposed model is not merely a logistic regression. Our model has more parameters, such as ε, c_i, a_{l_j}, and G. Because of these extra parameters compared to logistic regression, to train our model we typically need a neural network machine learning framework, such as PyTorch, for modeling and optimization. To optimize such models, as it is difficult to differentiate their loss functions by hand, the loss function should be mostly continuous and smooth so that its parameters can be tuned by automatic differentiation. We specifically designed Equation 4 to meet these conditions. In the experiments, we used the Adam optimization method (Kingma and Ba 2015) to optimize the loss function.
word vi , is the key to our purpose. Previous studies report              Trainable G
that the negative logarithm of the word frequency correlates              As Equation 4 is mostly continuous and smooth, matrix G
well with the perceived difficulty of words (Tamayo 1987;                 can also be trained by using deep-learning framework soft-
Beglar 2010). As in Figure 1, our key idea is to count the fre-           ware. As G is a projection matrix from T1 to T2 , if we set
quency of word usages only within a certain distance from                 T2 = 2 to consider a projection to a two-dimensional space,
the typical usage of the word. Hence, we propose the follow-              training G via supervisions means training visualization via
ing model to implement this idea.                                         supervisions. Here, in our task setting, the supervisions are
   For each vi , we have ni vectors that are vector representa-           vocabulary test dataset of second language learners, i.e., a
tion of each of the ni occurrences of word vi . We write these            matrix in which the (j, i)-th element denotes whether learner
vectors as Xi = {~x1,i , . . . , ~xni ,i }. Each vector ~xk,i is T1 di-   lj correctly answered the question of word vi .
       Pni Among Xi , let ~ci be the one closest to their cen-
mensional.
ter n1i k=1   ~xk . Let freq(vi ) be the frequency of the vectors         j : Personalized 
in Xi within distance  measured from the central vector ~ci .            In Equation 4, for easier understanding, we write  to be a
We write this frequency simply as freq(vi ) = N (~ci , , Xi ).           constant that does not depend on learner index j. In reality,
Here, n is the number of usages of word vi and let each ~xk               we can personalize  by making  dependent to learner index
be each usage vector obtained from contextualized word em-                j as j ; in this case, each learner lj has his/her own region
bedding methods. Let ReLU(z) = max(0, z) be the recti-                    that he/she can understand, and the radius of this region is
j . This personalized version is the one that we used in the
experiments.                                                      Table 1: Number of sentences in each domain of the BNC
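To make Equation 4 concrete, the soft counting can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the usage vectors X, the scale M = 50, and the radius eps = 2.0 are stand-in values, whereas in the actual model the radius, the center, and G are trained from the vocabulary test data.

```python
import numpy as np

def soft_usage_count(X, c, G, eps, M=50.0):
    """Soft count of usages within radius eps of the central usage
    vector c, after projecting with G (a sketch of Equation 4)."""
    # Euclidean distances d_e(G c, G x_k) in the projected space.
    d = np.linalg.norm(X @ G.T - c @ G.T, axis=1)
    # tanh(M * ReLU(eps - d)) is ~1 well inside the sphere, 0 outside.
    return np.tanh(M * np.maximum(0.0, eps - d)).sum()

rng = np.random.default_rng(0)
T1 = 8
X = rng.normal(size=(100, T1))        # stand-ins for usage embeddings
c = X[np.argmin(np.linalg.norm(X - X.mean(0), axis=1))]  # occurrence nearest the mean
G = np.eye(T1)                        # G = I: no projection

soft = soft_usage_count(X, c, G, eps=2.0)
hard = float((np.linalg.norm(X - c, axis=1) < 2.0).sum())  # exact count N(~ci, eps, Xi)
```

With eps = 0 every ReLU term vanishes and the soft count is exactly 0, as discussed above; for eps > 0 it closely tracks the exact count, because tanh(M · ReLU(·)) acts almost like a 0/1 indicator for large M while staying differentiable in eps, the center, and G.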
                                                                  corpus in the total of 100, 000 sentences.
                          Experiments
Quantitative Results of Prediction
Quantitative evaluation of this personalized prediction of the usages of a word is difficult; to this end, we would need to test each learner multiple times on different usages of the same word. However, when tested with the same word multiple times, learners easily notice that the word has multiple meanings. Hence, instead, we evaluated the accuracy of the personalized prediction of the words that the learner knows, under an experimental setting similar to that of (Ehara 2018). Our proposed method is based on neural classification with a novel extension that adjusts the counting of the frequency of the usages within distance εj. Since a typical logistic regression classifier is identical to a one-layer neural classifier, comparing our model with a typical logistic regression classifier using a frequency feature in terms of accuracy indirectly evaluates how practical the idea of adjusted frequency is.
   The proposed model estimates the number of occurrences, i.e., usages, that each learner knows. In other words, this can be regarded as modifying the word frequency so that the model fits the given vocabulary test dataset. In this regard, we can evaluate how well the proposed model corrects the word frequency when an unbalanced corpus is given. Each document in the British National Corpus (BNC) (BNC Consortium 2007) is annotated with a domain (Table 1). We evaluated how well the proposed model can correct the word frequency in the "arts" domain.

Table 1: Number of sentences in each domain of the BNC corpus in the total of 100,000 sentences.
        imaginative          21,946
        arts                 18,289
        natural sciences      5,256
        social science        7,777
        commerce              4,378
        leisure              20,300
        belief and thought    3,441
        world news              764
        applied science       2,625
        world affairs        15,224

   We used the vocabulary test result data in which each of 100 learners answered 31 vocabulary questions in the publicly available dataset (Ehara 2018). Of the 3,100 vocabulary test responses, we used 1,800 to train the model, and the rest were used for testing. The baseline model is simply a logistic regression in which the logarithm of word frequency is the only feature. The logarithm of word frequency has been used as a simple, rough measure of word difficulty and has previously been used to analyze and predict word difficulty based on vocabulary test data (Beglar 2010; Ehara et al. 2013; Lee and Yeung 2018; Yeung and Lee 2018). The proposed model counts only the number of usages within the radius εj. We used the PyTorch neural network framework to automatically tune the radius εj and the center of the sphere through its automatic gradient support (Paszke et al. 2017).
   First, we performed experiments with T1 = T2 and G = I, a setting in which no projection is performed and the model deals with T1-dimensional hyperspheres. Table 2 shows the results. It can be seen that the accuracy of predicting the learners' vocabulary test data using the biased text of the arts domain alone is lower than that using the word frequency of all domains. The proposed method was able to improve the accuracy for the arts-domain word frequency by counting the frequency only in the region of the contextualized word embedding vector space in which the examinee is estimated to be responsive. This effect was also observed for all domains. This seems to be the effect of the frequency counting excluding outlier cases. The improvement in accuracy before and after correction was statistically significant (p < 0.01, Wilcoxon test) when modifying word frequencies in the arts domain alone or in all domains.

Table 2: Accuracy of predicting learners' vocabulary test responses by using the raw frequencies and the frequencies corrected by the proposed model in each domain.
            Domain        Correction    Accuracy
            Arts             Raw          0.61
            Arts           Corrected      0.64
            All domains      Raw          0.67
            All domains    Corrected      0.72

"Trained" visualization
In the above experiments, we considered the case in which no projection is conducted, by fixing G = I. Next, let us consider the case in which G is a projection to a two-dimensional space, i.e., G is a 2 × T1 matrix. Tuning G and the radius εj to fit the vocabulary test dataset by using Equation 4 means that we can actually train the visualization to explain the vocabulary test dataset in a supervised manner.
   Figure 9 and Figure 10 show the results of the visualization. The initial value of G was set to a two-dimensional projection matrix obtained by principal component analysis (PCA). Though the initial value is the PCA projection, it should be noted that the projection matrix G itself is trained from the vocabulary test dataset, as is the radius εj.
   From Figure 9 and Figure 10, we can see that the proposed method counts only the main meanings within the red circle. To qualitatively evaluate the results, Table 3 shows the two farthest and the two closest example occurrences of "period" from its center point, i.e., the center of the red circle in Figure 9. It can be seen that the farthest cases are examples of the use of technical terms such as "period pain" and "magnetic field period", while the closest two cases are examples of nouns representing periods, such as "this period" and "the period".
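The PCA initialization of G described above can be sketched as follows. This is a minimal illustration under assumed inputs (random stand-ins for the contextualized usage vectors), not the authors' code; in the paper, the resulting 2 × T1 matrix is only the starting point from which G is further trained on the vocabulary test dataset.

```python
import numpy as np

def pca_projection(X, k=2):
    """Return a k x T1 projection matrix whose rows are the top-k
    principal axes of the vectors in X."""
    Xc = X - X.mean(axis=0)        # center the usage vectors
    # Right singular vectors of the centered data are the principal
    # axes, ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))      # stand-ins for T1 = 8 dim usage embeddings
G = pca_projection(X)              # initial value of G before supervised training
Y = X @ G.T                        # 2-D coordinates used for the initial plot
```

The rows of G are orthonormal, and the first coordinate of Y carries at least as much variance as the second; supervised training then moves G away from this purely unsupervised solution so that the circles of radius εj fit the test responses.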
Table 3: Farthest (F.) and closest (C.) two occurrences of "period" from the center of the circle in Figure 9.
   F.    period pains can be severe and disruptive.
   F.    to produce a slight spread of magnetic field period.
   C.    design during this period was in the plan.
   C.    the pub designer of the period,

Figure 9: Trained visualization example of "period". Each triangle point represents an occurrence, or a usage, of the word "period" in the "arts" domain of the BNC corpus. The entire projection of the original contextualized word embedding vectors to the two-dimensional space, namely G, and the radius εj were optimized to fit the vocabulary test dataset (cf. Equation 1, Equation 3, and Equation 4). Intuitively, a large εj denotes that learner lj has high language ability, as he/she is estimated to understand many of the occurrences of the word "period" within the red circle.

Figure 10: Trained visualization example of "figure". The setting of the training is identical to that of Figure 9.

                          Conclusions
In this paper, we proposed a supervised visualization method to predict which usages of a word are known to each learner, using a vocabulary test result dataset as supervision. Our neural method automatically tunes the projection matrix for visualization and the radius of each learner in the visualization so that the counted frequency within the circles fits the supervision. Experiments on actual learner response data show that the proposed method can predict learner responses more accurately by modifying the frequency, even when the use cases are biased toward a specific domain. As future work, we plan to make our method more interactive.

                          References
Athiwaratkun, B.; Wilson, A.; and Anandkumar, A. 2018. Probabilistic FastText for multi-sense word embeddings. In Proc. of ACL.
Baker, F. B. 2004. Item Response Theory: Parameter Estimation Techniques, Second Edition. CRC Press.
Beglar, D. 2010. A Rasch-based validation of the vocabulary size test. Language Testing 27(1):101–118.
BNC Consortium. 2007. The British National Corpus, version 3 (BNC XML Edition).
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL.
Dias, G., and Moraliyski, R. 2009. Relieving polysemy problem for synonymy detection. In Portuguese Conference on Artificial Intelligence, 610–621. Springer.
Ehara, Y.; Sato, I.; Oiwa, H.; and Nakagawa, H. 2012. Mining words in the minds of second language learners: learner-specific word difficulty. In Proc. of COLING.
Ehara, Y.; Shimizu, N.; Ninomiya, T.; and Nakagawa, H. 2013. Personalized reading support for second-language web documents. ACM Transactions on Intelligent Systems and Technology 4(2).
Ehara, Y. 2018. Building an English vocabulary knowledge dataset of Japanese English-as-a-second-language learners using crowdsourcing. In Proc. of LREC.
Halko, N.; Martinsson, P.-G.; Shkolnisky, Y.; and Tygert, M. 2011. An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing 33(5):2580–2594.
Heilman, M.; Collins-Thompson, K.; Callan, J.; and Eskenazi, M. 2007. Combining lexical and grammatical features to improve readability measures for first and second language texts. In Proc. of NAACL, 460–467.
Hinkelmann, K.; Blaser, M.; Faust, O.; Horst, A.; and Mehli, C. 2019. Virtual Bartender: A dialog system combining data-driven and knowledge-based recommendation. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering.
Hockey, S., and Martin, J. 1987. The Oxford Concordance Program Version 2. Digital Scholarship in the Humanities 2(2):125–131.
Jian, J.-Y.; Chang, Y.-C.; and Chang, J. S. 2004. TANGO: Bilingual collocational concordancer. In Proc. of ACL demo., 166–169.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proc. of ICLR.
Koh, P. W., and Liang, P. 2017. Understanding black-box predictions via influence functions. In Proc. of ICML, 1885–1894.
Laurenzi, E.; Hinkelmann, K.; Jüngling, S.; Montecchiari, D.; Pande, C.; and Martin, A. 2019. Towards an assistive and pattern learning-driven process modeling approach. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering.
Lee, J., and Yeung, C. Y. 2018. Personalizing lexical simplification. In Proc. of COLING, 224–232.
Liu, S.; Bremer, P.-T.; Thiagarajan, J. J.; Srikumar, V.; Wang, B.; Livnat, Y.; and Pascucci, V. 2017. Visual exploration of semantic relationships in neural word embeddings. IEEE Transactions on Visualization and Computer Graphics 24(1):553–562.
Lundberg, S. M., and Lee, S.-I. 2017. A unified approach to interpreting model predictions. In Proc. of NIPS, 4765–4774.
Lux-Pogodalla, V.; Besagni, D.; and Fort, K. 2010. FastKwic, an "intelligent" concordancer using FASTR. In Proc. of LREC.
Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS, 3111–3119.
Paetzold, G., and Specia, L. 2016. Benchmarking lexical simplification systems. In Proc. of LREC.
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP, 1532–1543.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proc. of NAACL.
Ramprasad, S., and Maddox, J. 2019. CoKE: Word sense induction using contextualized knowledge embeddings. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering.
Rasch, G. 1960. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Educational Research.
Read, J. 2000. Assessing Vocabulary. Cambridge University Press.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. of KDD, 1135–1144. ACM.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: High-precision model-agnostic explanations. In Proc. of AAAI.
Smilkov, D.; Thorat, N.; Nicholson, C.; Reif, E.; Viégas, F. B.; and Wattenberg, M. 2016. Embedding Projector: Interactive visualization and interpretation of embeddings. In Proc. of the NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems.
Tamayo, J. M. 1987. Frequency of use as a measure of word difficulty in bilingual vocabulary test construction and translation. Educational and Psychological Measurement 47(4):893–902.
Tang, J.; Liu, J.; Zhang, M.; and Mei, Q. 2016. Visualizing large-scale and high-dimensional data. In Proc. of WWW, 287–297.
Wu, J.-C.; Chuang, T. C.; Shei, W.-C.; and Chang, J. S. 2004. Subsentential translation memory for computer assisted writing and translation. In Proc. of ACL demo., 106–109.
Yeung, C. Y., and Lee, J. 2018. Personalized text retrieval for learners of Chinese as a foreign language. In Proc. of COLING, 3448–3455.
Yimam, S. M.; Biemann, C.; Malmasi, S.; Paetzold, G. H.; Specia, L.; Štajner, S.; Tack, A.; and Zampieri, M. 2018. A report on the complex word identification shared task 2018. In Proc. of BEA.