Latent Semantic Analysis as Method for Automatic Question Scoring

David Tobinski¹ and Oliver Kraft²

¹ Universität Duisburg-Essen, Universitätsstraße 2, 45141 Essen
david.tobinski@uni-due.de, WWW home page: www.kognitivismus.de
² Universität Duisburg-Essen, Universitätsstraße 2, 45141 Essen

Abstract. Automatically scoring open questions in massively multiuser virtual courses is still an unsolved challenge. On most online platforms, the time-consuming process of evaluating student answers is left to the instructor. Unexpressed semantic structures in particular are problematic for machines. Latent Semantic Analysis (LSA) is an attempt to solve this problem in the domain of information retrieval and can be seen as a general approach to representing semantic structure. This paper discusses the rating of one exam item using LSA. The documents of a corpus are used as assessment criteria, and student answers are projected as pseudo-documents into the semantic space. The result shows that as long as the documents are sufficiently distinct from each other, it is possible to use LSA to rate open questions.

Keywords: Latent Semantic Analysis, LSA, automated scoring, open question evaluation

1 Introduction

Using software to evaluate open questions is still a challenge; hence there are many types of multiple-choice tests and short-answer tasks. But no solution is available in which students may train their ability to write answers to open questions, as required in written exams. In online course systems (like Moodle) especially, it is up to the course instructor to validate open questions herself. A common method to analyze text is to search for certain keywords, as done by simple document retrieval systems. This method cannot take into account that different words may have the same or a similar meaning.
In information retrieval this leads to the problem that potentially interesting documents may not be found by a query with too few matching keywords. Latent Semantic Analysis (LSA, Landauer and Dumais 1997) faces this problem by taking the higher-order structure of a text into account. This method makes it possible to retrieve documents which are similar to a query even if they have only a few keywords in common.

Scoring an open question poses a similar problem: exam answers should contain important keywords, but they also carry a semantic structure of their own. This paper attempts to rate students' exam answers using LSA. For that purpose a small corpus based upon the accompanying book of the course "Pädagogische Psychologie" (Fritz et al. 2010) is created manually. It is expected that rating questions this way is possible in general. Furthermore, it is of interest which constraints have to be taken into account when applying LSA to question scoring.

2 Latent Semantic Analysis

LSA was described by Deerwester et al. (1990) as a statistical method for automatic document indexing and retrieval. Its advantage over other indexing techniques is that it creates a latent semantic space. Naive document retrieval methods search for keywords shared by a query and a corpus. They have the disadvantage that it is difficult or even impossible to find documents if the request and a potentially interesting document lack shared keywords. In contrast, LSA finds similarities even if query and corpus have few words in common. Besides its application in the domain of information retrieval, LSA is used in other scientific domains and is discussed as a theory of knowledge acquisition (Landauer and Dumais 1997).

LSA is based upon the Vector Space Model (VSM). This model treats a document and its terms as a vector in which each dimension represents an indexed word.
Multiple documents are combined in a document-term matrix, in which each column represents a document and each row represents a term. Cells contain the term frequency within a document (Deerwester et al. 1990). A matrix created this way may be weighted. There are two types of weighting functions: local weighting is applied to a term i in document j, and global weighting reflects the term's weight in the corpus as a whole:

a_ij = local(i, j) * global(i),

where a_ij addresses a cell of the document-term matrix (Martin and Berry 2011). There are several global and local weight functions. While Dumais (1991) attested that LogEntropy improves retrieval results more than other weight functions, studies by Pincombe (2004) and Jorge-Botana et al. (2010) achieved different results. Although there is no consensus about the best weighting, it has an important impact on retrieval results.

After weighting the document-term matrix, Singular Value Decomposition (SVD) is applied. SVD decomposes a matrix X into the product of three matrices:

X = T0 S0 D0^T    (1)

Component matrix T0 contains the derived orthogonal term factors, D0^T describes the document factors and the diagonal matrix S0 contains the singular values, so that their product recreates the original matrix X. By convention, the singular values in S0 are arranged in descending order; the lower the index of a cell, the more information it contains. By reducing S0 from m to k dimensions (yielding the truncated matrices T, S and D), the product of the three truncated matrices, X̂, is the best rank-k approximation of X. Choosing a good value for k is critical for later retrieval results. If too many dimensions remain in S, unnecessary information stays in the semantic space; choosing k too small removes important information from the semantic space (Martin and Berry 2011).
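The steps above, building a weighted document-term matrix, decomposing it as in formula (1) and truncating to rank k, can be sketched in a few lines of NumPy. The toy matrix, the BinIDF-style weighting and the choice k = 2 are assumptions made for illustration, not data from the paper:

```python
import numpy as np

# Toy document-term matrix X: rows = terms, columns = documents,
# cells = raw term frequencies (assumed example data).
X = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Weighting a_ij = local(i, j) * global(i); here a binary local weight
# times inverse document frequency, one common reading of BinIDF.
idf = np.log2(X.shape[1] / np.count_nonzero(X, axis=1))
A = (X > 0).astype(float) * idf[:, None]

# SVD as in formula (1): A = T0 * S0 * D0^T.
T0, s0, D0t = np.linalg.svd(A, full_matrices=False)

# Truncate to k dimensions, keeping only the k largest singular values;
# the product of the truncated matrices is the best rank-k approximation.
k = 2
T, s, Dt = T0[:, :k], s0[:k], D0t[:k, :]
A_hat = T @ np.diag(s) @ Dt
print(A_hat.shape)  # same shape as A, but rank at most k
```

The truncation simply drops the trailing rows and columns of the factor matrices, since the singular values are already sorted in descending order.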
Once SVD is applied and the reduction done, there are four common types of comparisons, the first two of which are analogous: (i) Comparing documents with documents is done by multiplying D by the square of S and the transpose of D; the value of cell a_ij then contains the similarity of documents i and j in the corpus. (ii) The same method can be used to compare terms with terms. (iii) The similarity of a term and a document can be taken from the cells of X̂. (iv) For the purpose of information retrieval, it is important to find a document described by keywords. According to the VSM, keywords are composed in a vector, which can be understood as a query q. The following formula projects a query into the semantic space; the result is called a pseudo-document Dq (Deerwester et al. 1990):

Dq = q^T T S^-1    (2)

To compute the similarity between documents and the pseudo-document, cosine similarity is generally taken (Dumais 1991). In their studies, however, Jorge-Botana et al. (2010) found that Euclidean distance performs better than cosine similarity.

3 Application configuration

To verify whether LSA is in general suitable for evaluating open questions, students' answers from a psychology exam held in the summer semester of 2010 are analyzed. The exam question requires students to describe how a text can be learned using the three cognitive learning strategies memorization, organization and elaboration. Each correct description is rated with two points. A simple description is enough to answer the question correctly; students are not required to transfer knowledge by giving an example. Brief assessment criteria are available for the evaluation, but due to the short length of each criterion's description, new criteria are created using the accompanying book of the course, as mentioned above. For the assessment a corpus is created in which each document is interpreted as an assessment criterion worth a certain number of points. This way quite small corpora are created.
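The projection of formula (2) and the subsequent cosine comparison can be sketched as follows; this is a minimal, self-contained NumPy illustration with an assumed toy matrix and query, not the paper's actual corpus:

```python
import numpy as np

# Toy document-term matrix (terms x documents), assumed example data.
X = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

# SVD and truncation to k dimensions.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S_inv, D = T0[:, :k], np.diag(1.0 / s0[:k]), D0t[:k, :].T

# Formula (2): project a query q (raw term counts) into the semantic
# space as a pseudo-document Dq = q^T T S^-1.
q = np.array([1.0, 0.0, 1.0, 0.0])  # query containing terms 0 and 2
Dq = q @ T @ S_inv

# Cosine similarity between the pseudo-document and each document row of D.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = [cosine(Dq, d) for d in D]
best = int(np.argmax(sims))  # index of the most similar document
```

In the scoring setting described above, each row of D corresponds to one assessment criterion, and `best` would be the criterion the answer matches most closely.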
For example, if a question is worth four points, the corresponding corpus contains exactly four documents and only a few hundred terms, sometimes even fewer. To reduce noise in the corpus, a list of stopwords is used. Because the students' answers are short, stemming is also applied. Besides stemming and the stopword list, the corpus is weighted. Pincombe (2004, 17) showed that for a small number of dimensions, BinIDF weighting correlates highly with human ratings. Since the number of dimensions is that low (see below) and a human rating is taken as the basis for the evaluation of LSA in this application, the corpus is weighted with BinIDF.

All calculations are done using the GNU R statistical processing language with the "lsa" library [3] provided by CRAN. The library is based upon SVDLIBC [4] by Doug Rohde. It implements multiple functions to determine the value of k. The example below was created using the dimcalc_share function with a threshold of 0.5, which sets k = 2. As a consequence, the matrix S containing the singular values is reduced to two dimensions.

Most answers in the exam are rated with the maximum points. For this test, 20 rated answers are taken; as in the exam, most of them achieved the full number of points. The answers are of varying length: the shortest contain just five to six words, while the longest consist of two or three sentences with thirty or more words. Each of the chosen answers contains a description for all three learning strategies; answers with missing descriptions are ignored. The evaluation done by the lecturers is used as a template to evaluate the results of LSA. It is expected that these answers have a high similarity to their matching criteria, represented by the documents. The rated answers are interpreted as queries; using formula (2), each query is projected into the corpus as a pseudo-document, and because of their short length the pseudo-documents lie near the origin of the corpus.
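The share-based choice of k can be illustrated as follows. This sketch assumes that the criterion keeps the smallest number of dimensions whose cumulative share of the sum of the singular values reaches the given threshold, which is one reading of what the library's dimcalc function does; it is an illustration, not the library's actual code:

```python
import numpy as np

def dim_share(singular_values, threshold=0.5):
    """Smallest k whose cumulative share of the singular-value
    mass reaches the threshold (sketch of a share-based criterion)."""
    s = np.asarray(singular_values, dtype=float)
    share = np.cumsum(s) / s.sum()
    return int(np.searchsorted(share, threshold) + 1)

# Assumed example: the first two values carry half of the total mass,
# so a threshold of 0.5 yields k = 2, as in the application above.
print(dim_share([4.0, 2.0, 1.5, 1.5, 1.5, 1.5], threshold=0.5))  # -> 2
```

With such small corpora the singular-value spectrum is short, so a threshold of 0.5 plausibly lands at k = 2 as reported.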
To calculate the similarity between the pseudo-documents and the documents, cosine similarity is used.

4 Discussion

Figure 1 (a) shows the corpus with all three assessment criteria (0 Memorization, 1 Organization, 2 Elaboration). It is noticeable that the criterion for memorization lies closer to the origin than the other two criteria. This is a result of the relatively short length of the document taken as criterion for memorization. Calculating the similarities between the criteria shows why this is problematic: documents 1 Organization and 2 Elaboration have a cosine similarity of 0.08, so they can be seen as very unequal; while 0 Memorization and 1 Organization have an average similarity of 0.57, criteria 0 Memorization and 2 Elaboration are very similar with a value of 0.87. The assessment criterion for the memorization strategy overlaps the criterion for the elaboration strategy. Therefore, and because of the tendency of pseudo-documents to lie close to the origin, it can be expected that using cosine similarity will not be successful.

Fig. 1. (a) The corpus containing all three assessment criteria; document 0 Memorization lies close to the origin. (b) The corpus without document 0 Memorization. In both (a) and (b) the crosses close to the origin mark the positions of the 20 queries.

Looking at the precision and recall values proves this assumption correct for the corpus plotted in Figure 1 (a). The evaluation of the answers achieves a recall of 0.62, a precision of 0.51 and an accuracy of 0.68. Although the threshold for a correct rating is set to 0.9, both values are too low to be used for rating open questions. Since the two criteria for memorization and elaboration have a high similarity, a description matching one of them gets a high similarity to both. This causes the low precision values for the evaluation.

Figure 1 (b) illustrates the corpus without the document used as criterion for the memorization strategy. Comparing the two remaining documents shows a similarity of 0.06. By removing the problematic document from the corpus, the similarity of the students' answers to the assessment criterion for elaboration can be calculated without being overlapped by the criterion for memorization. Using this corpus for evaluation improves recall to 0.69, precision to 0.93 and accuracy to 0.83. Comparing both results, it is remarkable that precision as a qualitative characteristic improves greatly, while recall stays at an average level. In the context of question rating this means that answers validated as correct by LSA are very likely rated positive by a human rater. Although LSA creates a precise selection of correct answers, the recall rate shows that some positive answers are still missing from the selection. The increase of accuracy from 0.68 to 0.83 illustrates that the number of true negatives increases with the second corpus.

3 http://cran.r-project.org/web/packages/lsa/index.html
4 http://tedlab.mit.edu/~dr/SVDLIBC/, a reimplementation of SVDPACKC written by Michael Berry, Theresa Do, Gavin O'Brien, Vijay Krishna and Sowmini Varadhan (University of Tennessee).

5 Conclusion and Future Work

The results of the experiment are encouraging: the general idea of using LSA to rate open questions is workable. The approach of using documents as assessment criteria and projecting human answers as pseudo-documents into the semantic space constructed by LSA is useful. LSA selects correct answers with high precision, although some positively rated answers are missing from the selection. The application shows, however, that some points need to be considered. All assessment criteria have to be sufficiently distinct from each other and should be of a certain length if cosine similarity is used.
As the criterion for rating the elaboration descriptions shows, it is important that no criterion is overlapped by another; otherwise it is sometimes impossible to distinguish which criterion is the correct one. A criterion overlapping another leads to the problem that both criteria receive a high similarity, which raises the number of false positives and reduces the precision of the result. This is a major difference between the application of LSA as an information retrieval tool and its use for scoring purposes. Concerning the average recall value, it is an option to examine the impact of a synonym dictionary in further studies. In addition, our result shows that BinIDF weighting works well for a small number of dimensions, as Pincombe (2004) described.

For future work, we plan to use this layout in an online tutorial to perform further tests in the winter semester 2013/14. The tutorial is designed as a massively multiuser virtual course and will accompany a lecture in educational psychology attended by several hundred students. It will contain two items, to gain more empirical evidence and experience with this application and its configuration. Examining the impact on learners' long-term memory will be the subject of further studies.

References

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 391–407 (1990)
Dumais, S. T.: Improving the retrieval of information from external sources. Behavior Research Methods 23, 229–236 (1991)
Fritz, A., Hussy, W., Tobinski, D.: Pädagogische Psychologie. Reinhardt, München (2010)
Jorge-Botana, G., Leon, J. A., Olmos, R., Escudero, I.: Latent Semantic Analysis Parameters for Essay Evaluation using Small-Scale Corpora. Journal of Quantitative Linguistics 17, 1–29 (2010)
Landauer, T. K., Dumais, S. T.: A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104, 211–240 (1997)
Landauer, T. K., McNamara, D. S., Dennis, S., Kintsch, W.: Handbook of Latent Semantic Analysis. Routledge, New York and London (2011)
Martin, D. I., Berry, M. W.: Mathematical Foundations Behind Latent Semantic Analysis. In: Landauer et al. (eds.), Handbook of Latent Semantic Analysis, 35–55 (2011)
Pincombe, B.: Comparison of human and latent semantic analysis (LSA) judgments of pairwise document similarities for a news corpus (2004)