<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Latent Semantic Analysis as Method for Automatic Question Scoring</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">David</forename><surname>Tobinski</surname></persName>
							<email>david.tobinski@uni-due.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Universität Duisburg Essen</orgName>
								<address>
									<addrLine>Universitätsstraße 2</addrLine>
									<postCode>45141</postCode>
									<settlement>Essen</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Oliver</forename><surname>Kraft</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Universität Duisburg Essen</orgName>
								<address>
									<addrLine>Universitätsstraße 2</addrLine>
									<postCode>45141</postCode>
									<settlement>Essen</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Latent Semantic Analysis as Method for Automatic Question Scoring</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C30A2ED04AAFED0487EFB5773DB9AE67</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-19T15:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Latent Semantic Analysis</term>
					<term>LSA</term>
					<term>automated scoring</term>
					<term>open question evaluation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Automatically scoring open questions in massively multiuser virtual courses is still an unsolved challenge. On most online platforms, the time-consuming task of evaluating student answers falls to the instructor. Implicit semantic structure in particular is difficult for machines to handle. Latent Semantic Analysis (LSA) addresses this problem in the domain of information retrieval and can be seen as a general approach to representing semantic structure. This paper discusses the rating of one item taken from an exam using LSA. The documents of a corpus are used as assessment criteria, and student answers are projected into the semantic space as pseudo-documents. The results show that as long as the documents are sufficiently distinct from one another, LSA can be used to rate open questions.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Using software to evaluate open questions is still a challenge. As a consequence, there are many kinds of multiple-choice tests and short-answer tasks, but no solution in which students can train their ability to write answers to open questions, as required in written exams. Especially in online course systems (such as Moodle), it is up to the course instructor to evaluate open questions manually.</p><p>A common method of analyzing text is to search for certain keywords, as simple document retrieval systems do. This method cannot take into account that different words may have the same or a similar meaning. In information retrieval this leads to the problem that potentially interesting documents may not be found by a query with too few matching keywords. Latent Semantic Analysis (LSA, <ref type="bibr" target="#b4">Landauer and Dumais 1997)</ref> addresses this problem by taking the higher-order structure of a text into account. It makes it possible to retrieve documents that are similar to a query even if they have only a few keywords in common.</p><p>Scoring an open question poses a problem similar to the one in information retrieval: exam answers should contain important keywords, but they also carry a semantic structure of their own. This paper attempts to rate students' exam answers using LSA. For this purpose, a small corpus based on the accompanying book of the course "Pädagogische Psychologie" <ref type="bibr" target="#b2">(Fritz et al. 2010</ref>) is created manually. We expect that it is possible in general to rate questions this way; a further question is which constraints must be observed when applying LSA to question scoring.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Latent Semantic Analysis</head><p>LSA was described by <ref type="bibr" target="#b0">Deerwester et al. (1990)</ref> as a statistical method for automatic document indexing and retrieval. Its advantage over other indexing techniques is that it creates a latent semantic space. Naive document retrieval methods search for keywords shared by a query and a corpus; their disadvantage is that it is difficult or even impossible to find documents if the request and a potentially interesting document share too few keywords. LSA, by contrast, finds similarities even if query and corpus have few words in common. Besides its application in the domain of information retrieval, LSA is used in other scientific domains and is discussed as a theory of knowledge acquisition <ref type="bibr" target="#b4">(Landauer and Dumais 1997)</ref>.</p><p>LSA is based upon the Vector Space Model (VSM). This model treats a document and its terms as a vector in which each dimension represents an indexed word. Multiple documents are combined into a document-term matrix, in which each column represents a document and each row represents a term. Each cell contains the frequency of the term in the document <ref type="bibr" target="#b0">(Deerwester et al. 1990)</ref>.</p><p>A matrix created this way may be weighted. There are two types of weighting functions: local weighting is applied to a term i in document j, and global weighting reflects the term's weight in the corpus as a whole, so that a_ij = local(i, j) * global(i), where a_ij denotes a cell of the document-term matrix <ref type="bibr" target="#b6">(Martin and Berry 2011)</ref>. There are several global and local weighting functions. While Dumais attested that LogEntropy improves retrieval results more than other weighting functions <ref type="bibr" target="#b1">(Dumais 1991)</ref>, studies by <ref type="bibr" target="#b7">Pincombe (2004)</ref> or <ref type="bibr" target="#b3">Jorge-Botana et al. (2010)</ref> reached different results. Although there is no consensus on the best weighting, it has an important impact on retrieval results.</p><p>After the document-term matrix has been weighted, Singular Value Decomposition (SVD) is applied. SVD decomposes a matrix X into the product of three matrices:</p><formula xml:id="formula_0">X = T_0 S_0 D_0^T (1)</formula><p>The component matrix T_0 contains the derived orthogonal term factors, D_0^T describes the document factors, and S_0 contains the singular values, so that the product of the three matrices recreates the original matrix X. By convention, the diagonal matrix S is arranged in descending order: the lower the index of a cell, the more information it carries. By reducing S from m to k dimensions, the product of the three reduced matrices (X̂) is the best approximation of X with k dimensions. Choosing a good value for k is critical for later retrieval results. If too many dimensions remain in S, unnecessary information stays in the semantic space; choosing k too small removes important information from it <ref type="bibr" target="#b6">(Martin and Berry 2011)</ref>.</p><p>Once SVD is applied and the reduction done, there are four common types of comparisons, of which the first two are essentially the same: (i) comparing documents with documents is done by multiplying D by the square of S and by the transpose of D; cell a_ij of the result then contains the similarity of documents i and j in the corpus. (ii) The same method can be used to compare terms with terms. (iii) The similarity of a term and a document can be read from the cells of X̂. (iv) For the purpose of information retrieval, it is important to find a document described by keywords. According to the VSM, the keywords are composed into a vector, which can be understood as a query (q). The following formula projects a query into the semantic space; the result is called a pseudo-document (D_q) <ref type="bibr" target="#b0">(Deerwester et al. 1990)</ref>:</p><formula xml:id="formula_1">D_q = q^T T S^-1 (2)</formula><p>To compute the similarity between documents and the pseudo-document, cosine similarity is generally used <ref type="bibr" target="#b1">(Dumais 1991)</ref>. In their studies, however, <ref type="bibr" target="#b3">Jorge-Botana et al. (2010)</ref> found that Euclidean distance performs better than cosine similarity.</p></div>
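To make the decomposition of formula (1) and the pseudo-document projection of formula (2) concrete, here is a minimal sketch in Python with NumPy. This is an illustration only, not the R implementation used in the paper; the toy matrix, query, and helper names are invented for the example:

```python
import numpy as np

# Toy document-term matrix: rows are terms, columns are documents (raw frequencies).
X = np.array([[2., 0., 0.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 1., 2.],
              [0., 0., 1.]])

# Formula (1): SVD decomposes X into term factors T, singular values s, document factors D^T.
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values to obtain the reduced semantic space.
k = 2
T_k, s_k, D_k = T[:, :k], s[:k], Dt[:k, :].T   # D_k: one k-dimensional row per document

def project_query(q):
    """Formula (2): fold a query vector into the semantic space as a pseudo-document."""
    return q @ T_k / s_k                        # q^T T S^-1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A query built from the first two index terms, compared against every document.
q = np.array([1., 1., 0., 0., 0.])
d_q = project_query(q)
sims = [cosine(d_q, D_k[j]) for j in range(D_k.shape[0])]
```

Note that projecting a document's own column with formula (2) reproduces that document's row of D_k exactly, which is a useful sanity check for an implementation.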
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Application configuration</head><p>To verify whether LSA is in general suitable for evaluating open questions, students' answers from a psychology exam held in the summer semester of 2010 are analyzed. The exam question requires students to describe how a text can be learned using the three cognitive learning strategies of memorization, organization and elaboration. Each correct description is rated with two points. A plain description is enough to answer the question correctly; students are not required to demonstrate transfer of knowledge by giving an example. Brief assessment criteria are available for the evaluation, but because the description of each criterion is very short, new criteria are created from the accompanying book of the course, as mentioned above.</p><p>For the assessment a corpus is created in which each document is interpreted as an assessment criterion worth a certain number of points. The resulting corpora are quite small: if a question is worth four points, for example, the corresponding corpus contains exactly four documents and only a few hundred terms, sometimes even fewer. To reduce noise in the corpus, a list of stopwords is used. Because the students' answers are short, stemming is applied as well. Besides stemming and stopword removal, the corpus is weighted. <ref type="bibr" target="#b7">Pincombe (2004, 17)</ref> showed that for a small number of dimensions BinIDF weighting correlates highly with human ratings. Since the number of dimensions here is that low (see below) and a human rating is taken as the basis for evaluating LSA in this application, the corpus is weighted with BinIDF.</p><p>All calculations are done in the GNU R statistical language using the "lsa"<ref type="foot" target="#foot_0">3</ref> package provided by CRAN. The package is based upon SVDLIBC<ref type="foot" target="#foot_1">4</ref> by Doug Rohde and implements several functions for determining the value of k. The example below was created using the dimcalc_share function with a threshold of 0.5, which sets k = 2. As a consequence, the matrix S containing the singular values is reduced to two dimensions.</p><p>Most students' answers in the exam were rated with the maximum number of points; accordingly, most of the 20 rated answers taken for this test achieved full points. The answers are of varying length: the shortest contain just five to six words, while the longest consist of two or three sentences with thirty or more words. Each of the chosen answers contains a description of all three learning strategies; answers with missing descriptions are ignored.</p><p>The evaluation done by the lecturers is used as the reference for evaluating the results of LSA. The answers are expected to have a high similarity to their matching criterion, represented by the documents. Each rated answer is interpreted as a query; using formula (2), the query is projected into the corpus as a pseudo-document, and because of their short length these pseudo-documents lie near the origin of the semantic space. To calculate the similarity between the pseudo-documents and the documents, cosine similarity is used.</p></div>
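The two preprocessing choices described above, BinIDF weighting and share-based selection of k, can be sketched as follows. This is a Python analogue under stated assumptions: `binidf` uses one common definition of binary-times-IDF weighting, and `dimcalc_share` mirrors the behaviour described for the lsa package's dimension-calculation function; the exact formulas in the R package may differ.

```python
import numpy as np

def binidf(X):
    """BinIDF weighting: binary local weight times inverse document frequency.
    Assumes every term occurs in at least one document (df > 0)."""
    present = (X > 0).astype(float)      # local weight: 1 if the term occurs in the document
    n_docs = X.shape[1]
    df = present.sum(axis=1)             # document frequency of each term
    idf = np.log2(n_docs / df)           # global weight; terms occurring everywhere get 0
    return present * idf[:, None]

def dimcalc_share(singular_values, share=0.5):
    """Smallest k whose cumulative share of the singular values reaches `share`."""
    s = np.asarray(singular_values, dtype=float)
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, share) + 1)
```

With a 0.5 threshold this procedure keeps only the few dimensions that carry half of the singular-value mass, in line with the k = 2 used for the corpus above.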
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Discussion</head><p>Figure <ref type="figure" target="#fig_0">1</ref> (a) shows the corpus with all three assessment criteria (0 Memorization, 1 Organization, 2 Elaboration). Noticeably, the criterion for memorization lies closer to the origin than the other two criteria, a result of the relatively short length of the document taken as the criterion for memorization. Calculating the similarity between this criterion and the others shows why this is problematic. Documents 1 Organization and 2 Elaboration have a cosine similarity of 0.08 and can thus be regarded as very dissimilar; while 0 Memorization and 1 Organization have a moderate similarity of 0.57, criteria 0 Memorization and 2 Elaboration are very similar, with a value of 0.87. Because of this, and because pseudo-documents tend to lie close to the origin, using cosine similarity cannot be expected to succeed: the assessment criterion for descriptions of the memorization strategy overlaps the criterion for the elaboration strategy.</p><p>The precision and recall values prove this assumption correct for the corpus plotted in Figure <ref type="figure" target="#fig_0">1 (a)</ref>. The evaluation of the answers achieves a recall of 0.62, a precision of 0.51 and an accuracy of 0.68. Although the threshold for a correct rating is set to 0.9, both values are too low for rating open questions. Since the two criteria for memorization and elaboration are highly similar, a description of either one receives a high similarity to both criteria, which causes the low precision of the evaluation.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> (b) illustrates the corpus without the document used as the criterion for the memorization strategy. The two remaining documents have a similarity of 0.06. By removing the problematic document from the corpus, the similarity of the students' answers to the assessment criterion for elaboration can be calculated without being overlapped by the criterion for memorization. Using this corpus for the evaluation improves recall to 0.69, precision to 0.93 and accuracy to 0.83.</p><p>Comparing both results, it is remarkable that precision, as a qualitative characteristic, improves substantially while recall stays at an average level. In the context of question rating this means that answers validated as correct by LSA are very likely to be rated positively by a human rater. Although LSA makes a precise selection of correct answers, the recall rate shows that some positive answers are still missing from the selection. The increase in accuracy from 0.68 to 0.83 shows that the number of true negatives rises with the second corpus.</p></div>
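The recall, precision, and accuracy values reported above follow the standard definitions, with the human rating taken as ground truth and an answer counted as accepted when its cosine similarity reaches the 0.9 threshold. A small sketch of such an evaluation (the example data are invented, not the paper's answers):

```python
def score_metrics(similarities, human_correct, threshold=0.9):
    """Precision, recall and accuracy of threshold-based LSA ratings
    against human ratings taken as ground truth (standard definitions)."""
    predicted = [sim >= threshold for sim in similarities]
    tp = sum(p and h for p, h in zip(predicted, human_correct))
    fp = sum(p and not h for p, h in zip(predicted, human_correct))
    fn = sum(h and not p for p, h in zip(predicted, human_correct))
    tn = sum(not p and not h for p, h in zip(predicted, human_correct))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(similarities)
    return precision, recall, accuracy
```

As the discussion notes, overlapping criteria inflate the false-positive count (fp), which is exactly the term that depresses precision in this computation.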
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion and Future Work</head><p>The results of the experiment are encouraging, and the general idea of using LSA to rate open questions works. The approach of using documents as assessment criteria and projecting student answers as pseudo-documents into the semantic space constructed by LSA is useful: LSA selects correct answers with high precision, although some positively rated answers are missing from the selection. The application shows, however, that some points need to be considered.</p><p>If cosine similarity is used, all assessment criteria have to be sufficiently distinct from each other and should be of a certain length. As the criterion for rating the elaboration descriptions shows, it is important that no criterion is overlapped by another; otherwise it can be impossible to distinguish which criterion is the correct one. A criterion overlapping another leads to both criteria receiving a high similarity, which raises the number of false positives and reduces the precision of the result. This is a major difference between applying LSA as an information retrieval tool and applying it for scoring purposes.</p><p>Concerning the average recall value, one option is to examine the impact of a synonym dictionary in further studies. In addition, our results show that BinIDF weighting works well for a small number of dimensions, as <ref type="bibr" target="#b7">Pincombe (2004)</ref> described.</p><p>For future work, we plan to use this layout in an online tutorial to perform further tests in the winter semester of 2013/14. The tutorial is designed as a massively multiuser virtual course and will accompany a lecture in educational psychology attended by several hundred students. It will contain two items, in order to gain more empirical evidence and experience with this application and its configuration. Examining the impact on learners' long-term memory will be the subject of further studies.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Figure 1 (a) shows the corpus containing all three assessment criteria; document 0 Memorization lies close to the origin. Figure 1 (b) shows the corpus without the document 0 Memorization. In Figures (a) and (b) the crosses close to the origin mark the positions of the 20 queries.</figDesc><graphic coords="5,165.96,115.84,283.44,124.80" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://cran.r-project.org/web/packages/lsa/index.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">http://tedlab.mit.edu/~dr/SVDLIBC/ This is a reimplementation of SVDPACKC written by Michael Berry, Theresa Do, Gavin O'Brien, Vijay Krishna and Sowmini Varadhan (University of Tennessee).</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Indexing by latent semantic analysis</title>
		<author>
			<persName><forename type="first">S</forename><surname>Deerwester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Furnas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">K</forename><surname>Landauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Harshman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society For Information Science</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="391" to="407" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Improving the retrieval of information from external sources</title>
		<author>
			<persName><forename type="middle">S T</forename><surname>Dumais</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Behavior Research Methods</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="229" to="236" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Fritz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hussy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tobinski</surname></persName>
		</author>
		<title level="m">Pädagogische Psychologie</title>
				<meeting><address><addrLine>München</addrLine></address></meeting>
		<imprint>
			<publisher>Reinhardt</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Latent Semantic Analysis Parameters for Essay Evaluation using Small-Scale Corpora</title>
		<author>
			<persName><forename type="first">G</forename><surname>Jorge-Botana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>León</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Olmos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Escudero</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Quantitative Linguistics</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="1" to="29" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A solution to Plato&apos;s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">K</forename><surname>Landauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychological Review</title>
		<imprint>
			<biblScope unit="volume">104</biblScope>
			<biblScope unit="page" from="211" to="240" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Handbook of Latent Semantic Analysis</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">K</forename><surname>Landauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>McNamara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dennis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kintsch</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
			<publisher>Routledge</publisher>
			<pubPlace>New York and London</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Mathematical Foundations Behind Latent Semantic Analysis</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">I</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Berry</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Handbook of Latent Semantic Analysis</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="35" to="55" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Comparison of human and latent semantic analysis (LSA) judgments of pairwise document similarities for a news corpus</title>
		<author>
			<persName><forename type="first">B</forename><surname>Pincombe</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
