<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Incorporating Wide Context Information for Deep Knowledge Tracing using Attentional Bi-interaction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Raghava</forename><surname>Krishnan</surname></persName>
							<email>raghava.krishnan@fujixerox.co.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">Fuji Xerox Co Ltd</orgName>
								<address>
									<settlement>Yokohama</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Janmajay</forename><surname>Singh</surname></persName>
							<email>janmajay.singh@fujixerox.co.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">Fuji Xerox Co Ltd</orgName>
								<address>
									<settlement>Yokohama</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Masahiro</forename><surname>Sato</surname></persName>
							<email>sato.masahiro@fujixerox.co.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">Fuji Xerox Co Ltd</orgName>
								<address>
									<settlement>Yokohama</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Qian</forename><surname>Zhang</surname></persName>
							<email>qian.zhang@fujixerox.co.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">Fuji Xerox Co Ltd</orgName>
								<address>
									<settlement>Yokohama</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tomoko</forename><surname>Ohkuma</surname></persName>
							<email>ohkuma.tomoko@fujixerox.co.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">Fuji Xerox Co Ltd</orgName>
								<address>
									<settlement>Yokohama</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Incorporating Wide Context Information for Deep Knowledge Tracing using Attentional Bi-interaction</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">82718E63D482A9B06AA14DE3D9F0A1EB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T15:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Computer Aided Education</term>
					<term>Adaptive learning</term>
					<term>personalization</term>
					<term>sequential modeling</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Online learning platforms, also known as Computer Aided Education systems, have recently grown in importance owing to their ability to personalize study plans in accordance with individual student requirements. Learning platforms have modeled the student knowledge state from student responses using the recently popular Deep Knowledge Tracing (DKT) technique. Context information has also proven effective in various predictive problems, prompting learning platforms to store a variety of context features about a student's performance history. An example context is response time, where a shorter time to answer a question may indicate higher mastery of a skill. It is therefore crucial to incorporate context features in the most effective way possible. Most research in DKT either uses no context features or uses a set of context features spanning only a narrow range of student characteristics. To overcome this, we identify a wide set of context features and incorporate their interactions into the DKT model. We then observe the effects of incorporating these additional context feature interactions and also propose an adaptive weighting technique that learns appropriate context feature interaction weights. These techniques are compared with state-of-the-art baselines and their performances are evaluated using AUC scores.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Computer Aided Education (CAE) systems aim to personalize the study plan of a user to best suit their needs. This is achieved through the process of Knowledge Tracing, where the current knowledge state of the user is estimated using the history of their interactions with the system, and this estimated knowledge state is used to predict the future performance of the user. Accurately predicted future student performances are then used as cues to better personalize the study plan of each user. In addition to a history of user responses, CAE systems usually also store additional metadata related to user performance history, such as response time, type of question, number of attempts, etc. Adomavicius et al. <ref type="bibr" target="#b0">[1]</ref> refer to this additional information as contexts or context features, and give an interactional view of context as having a cyclical relationship with an underlying activity. In our case, the activity is a student's response and the context is the additional information.</p><p>A popular approach to knowledge tracing over the past few years has been Deep Knowledge Tracing <ref type="bibr" target="#b1">[2]</ref> (DKT), which learns a continuous representation of the knowledge state, as compared to the discrete variable representation used in Bayesian Knowledge Tracing <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref> (BKT). The drawback of DKT is that it uses only the history of student responses, while other factors, such as forgetting and learning ability, also affect a student's performance on an online learning platform. This is partially overcome in <ref type="bibr" target="#b5">[6]</ref>, which uses a technique called bi-interaction in a framework called Bi-interaction Deep Knowledge Tracing (BIDKT). This technique incorporates context in the form of second-degree interactions between the input question response and context features relating to forgetting behavior, where an interaction is the inner product or Hadamard product of the embedding vectors of the context features. This technique showed that using context feature interactions is an effective way of incorporating the above-mentioned factors into a knowledge tracing model.</p><p>A drawback of BIDKT was its use of only a small (narrow) set of additional context features, in this case those describing student forgetting behavior. While including only a few features led to a reasonable improvement in performance, it is of interest whether this trend would continue as more related contexts are identified and included in the model. Additionally, the bi-interaction technique used assigned the same weight to all feature interactions. This may be a problem when a larger set of context features is used, as important interactions might get diluted along with unimportant ones, resulting in either saturation or even a drop in model performance.</p><p>In this paper we posit that using additional (wide) context features should lead to improved performance of knowledge tracing models. We further hypothesize that existing models may not be well suited to effectively using additional contexts since they do not weigh contexts by their importance. To verify these ideas, we first identify additional contexts that may provide important cues for future student performance prediction. We then analyze how the performance of the current best model changes with wider contexts. Finally, we propose a new technique by modifying BIDKT to adaptively learn context weights via an attention network similar to <ref type="bibr" target="#b6">[7]</ref>, and see how it compares to the identified baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Knowledge Tracing: Since the emergence of Long Short-Term Memory (LSTM) networks, the Deep Knowledge Tracing model has been the most popular knowledge tracing technique <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b7">8]</ref>. There have been variations and extensions of DKT, such as <ref type="bibr" target="#b8">[9]</ref>, which uses Memory Networks to model individual skill levels more effectively, and <ref type="bibr" target="#b9">[10]</ref>, which uses hop LSTMs to select only relevant past exercises when estimating the current skill level. There have also been efforts to separately model the student's ability in the Dynamic Key-Value Memory Networks for Knowledge Tracing framework <ref type="bibr" target="#b10">[11]</ref>. Although most efforts at knowledge tracing use only sequential models <ref type="bibr" target="#b11">[12]</ref>, <ref type="bibr" target="#b12">[13]</ref> use Convolutional Neural Networks for knowledge tracing, while <ref type="bibr" target="#b13">[14]</ref> uses sequential models such as LSTMs to estimate the parameters of IRT. There have also been a few attempts at using attention networks in knowledge tracing. Pandey et al. <ref type="bibr" target="#b14">[15]</ref> use a self-attention mechanism to identify relevant Knowledge Components from the past learning interactions of the student.</p><p>Using Context Features in Knowledge Tracing: Given the success of using context features and their interactions in other domains <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b20">21]</ref>, there have recently been efforts in knowledge tracing to incorporate context features into predictive models as well. Sun et al. <ref type="bibr" target="#b21">[22]</ref> use a wide variety of context features for the task of knowledge tracing; they achieve this by ensembling one of several algorithms, such as Decision Trees, Support Vector Machines, or Linear Regression, with the Dynamic Key-Value Memory Network architecture of <ref type="bibr" target="#b8">[9]</ref>. Zhang et al. <ref type="bibr" target="#b7">[8]</ref> propose an Autoencoder architecture to reduce the dimensionality of the large number of features input to DKT. Attention networks have been used to incorporate context features as well: Pandey et al. <ref type="bibr" target="#b22">[23]</ref> use a self-attention mechanism to incorporate contextual information relating to exercise relations and forgetting.</p><p>There have also been efforts to incorporate context features in the form of interactions for the task of knowledge tracing. Vie et al. <ref type="bibr" target="#b23">[24]</ref> use Factorization Machines to model the interactions between a wide variety of features. Nagatani et al. <ref type="bibr" target="#b5">[6]</ref>, on the other hand, model feature interactions using bi-interaction, a variant of Factorization Machines, additionally input these interactions to an LSTM, and achieve reasonable results. The model in <ref type="bibr" target="#b5">[6]</ref> also forms the basis of our work. Our proposed model aims to improve the model proposed in <ref type="bibr" target="#b5">[6]</ref> by increasing the variety of context features and by proposing a technique that utilizes the additional context information effectively using the attention mechanism from <ref type="bibr" target="#b6">[7]</ref>. While there have been other efforts at using contextual features in the DKT framework <ref type="bibr" target="#b7">[8]</ref> and the Factorization Machines framework <ref type="bibr" target="#b23">[24]</ref>, this is the first attempt at using attentional bi-interaction to incorporate context feature interactions into the DKT model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Background</head><p>In this section, we provide background on the domain of knowledge tracing and describe the architectures of the DKT and BIDKT models, which are the basis for our proposed architecture.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Knowledge Tracing</head><p>Knowledge tracing is the process of estimating a student's current knowledge state and using it to predict future performance. Given a sequence of past learning attempts 𝐱 0 ⋯ 𝐱 𝑡 , we need to predict the student's performance for attempt 𝐱 𝑡+1 . In general, an attempt 𝐱 𝑡 = (𝑞 𝑡 , 𝑎 𝑡 ) is defined as a tuple that contains the skill set id (𝑞 𝑡 ) of a question at time step 𝑡 and whether the student response (𝑎 𝑡 ) to the question is correct or not. In this case, 𝑞 𝑡 is identified as a skill set id from a set of skills 𝑄 and 𝑎 𝑡 is a binary variable. We need to predict 𝑎 𝑡+1 for 𝑞 𝑡+1 .</p></div>
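The formulation above can be sketched as a small data-handling example. The tuple layout and the `split_history` helper below are our illustrative assumptions, not part of any model in the paper:

```python
# Hypothetical sketch of the knowledge-tracing data format described above.
# Each attempt x_t is a (skill_id, correct) tuple; the task is to predict
# the correctness a_{t+1} of the next attempt on skill q_{t+1}.
from typing import List, Tuple

Attempt = Tuple[int, int]  # (q_t: skill set id in Q, a_t: 1 if correct else 0)

def split_history(attempts: List[Attempt]):
    """Split a sequence into the observed history and the prediction target."""
    history, (q_next, a_next) = attempts[:-1], attempts[-1]
    return history, q_next, a_next

# A student attempted skill 3 twice, then skill 7; predict the last response.
history, q_next, a_next = split_history([(3, 1), (3, 0), (7, 1)])
```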
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Deep Knowledge Tracing</head><p>Deep Knowledge Tracing (DKT), shown in Figure <ref type="figure" target="#fig_0">1</ref>(a), models students' knowledge state transitions using an LSTM, a modified version of the RNN. The architecture of the DKT model is from <ref type="bibr" target="#b1">[2]</ref>, where, at time step 𝑡, the knowledge state is represented as 𝐡 𝑡 ∈ ℝ 𝑘 , with 𝑘 the hidden state dimension. The DKT model in Figure <ref type="figure" target="#fig_0">1</ref>(a) shows the two processes the model performs, i.e., estimating the current knowledge state and predicting future performance.</p><p>In the case of DKT, the input 𝐱 𝑡 is a one-hot vector over the Cartesian product of 𝑞 𝑡 and 𝑎 𝑡 . 𝐱 𝑡 is then embedded into a dense real-valued vector 𝐯 𝑡 . During the knowledge state estimation process, for a given input 𝐱 𝑡 = (𝑞 𝑡 , 𝑎 𝑡 ) at each time step 𝑡, the knowledge state 𝐡 𝑡 is updated: 𝐡 𝑡 is estimated from the embedded vector 𝐯 𝑡 and the previous knowledge state 𝐡 𝑡−1 using the LSTM module. For the prediction process, the output layer is implemented as a linear layer with sigmoid activation. The predicted probabilities of correct responses to all skill sets, 𝐲 𝑡 ∈ ℝ |𝑄| , form the model output.</p></div>
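As a rough illustration of the input encoding described above, the snippet below builds the one-hot vector over the Cartesian product of skill id and response. The index convention (correct responses occupying the second half of the vector) is our assumption; the paper does not fix one:

```python
# Illustrative sketch (not the authors' code) of the DKT input encoding:
# x_t is a one-hot vector over the Cartesian product of skill id q_t and
# binary response a_t, giving a vector of length 2*|Q|.
def one_hot_input(q_t: int, a_t: int, num_skills: int):
    x = [0.0] * (2 * num_skills)
    x[q_t + a_t * num_skills] = 1.0  # index convention is an assumption
    return x

# Skill 2 answered correctly out of |Q| = 4 skills -> index 2 + 4 = 6 is hot.
x = one_hot_input(q_t=2, a_t=1, num_skills=4)
```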
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Bi-Interaction Deep Knowledge Tracing</head><p>Bi-Interaction Deep Knowledge Tracing (BIDKT) <ref type="bibr" target="#b5">[6]</ref>, shown in Figure <ref type="figure" target="#fig_0">1(b</ref>), is an extension of the DKT model that integrates interactions between the input question response and context features related to forgetting behavior into the DKT model. The context features used were repeated time gap, sequence time gap and past trial counts, which are described further in Section 5.2. The input to the RNN module, 𝐯 𝑐 𝑡 , is computed using an integration technique called bi-interaction: 𝐯 𝑐 𝑡 is the sum of the interactions between 𝐯 𝑡 , the embedded dense real-valued vector of the input 𝐱 𝑡 , and 𝐜 𝑖 𝑡 , the embedded dense real-valued vector of each context feature relating to forgetting.</p><formula xml:id="formula_0">\mathbf{v}^{c}_{t} = \sum_{i=1}^{n} \mathbf{v}_{t} \odot \mathbf{c}^{i}_{t} \quad (1)</formula><p>Here, 𝑛 is the number of context features. The current knowledge state is computed using the previous knowledge state 𝐡 𝑡−1 and the integrated vector 𝐯 𝑐 𝑡 as:</p><formula xml:id="formula_1">\mathbf{h}_{t} = \phi(\mathbf{v}^{c}_{t}, \mathbf{h}_{t-1})<label>(2)</label></formula><p>To predict the student's performance at the next attempt, the interaction between the current knowledge state 𝐡 𝑡 and the context at the next attempt, 𝐜 𝑖 𝑡+1 , is computed. The context embedding parameters are shared between the current knowledge state estimation step and the future performance prediction step.</p><formula xml:id="formula_2">\mathbf{h}^{c}_{t} = \sum_{i=1}^{n} \mathbf{h}_{t} \odot \mathbf{c}^{i}_{t+1}<label>(3)</label></formula><p>Finally, the probability of answering correctly, 𝐲 𝑡 ∈ ℝ |𝑄| , is computed as:</p><formula xml:id="formula_3">\mathbf{y}_{t} = \sigma(\mathbf{b}_{out} + \mathbf{W}_{out}\mathbf{h}^{c}_{t}),<label>(4)</label></formula><p>where 𝜎(⋅) is the sigmoid function, 𝐖 𝑜𝑢𝑡 ∈ ℝ |𝑄|×𝑘 is the weight matrix, and 𝐛 𝑜𝑢𝑡 ∈ ℝ |𝑄| is the bias vector of the output. The implementation of the output layer is similar to that of DKT.</p></div>
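A minimal numerical sketch of the bi-interaction step of Eq. (1), using plain Python lists in place of the learned embeddings; all vector values here are illustrative, not trained parameters:

```python
# Sketch of bi-interaction: v_c_t is the sum over context features of the
# Hadamard (element-wise) product between the input embedding v_t and each
# context embedding c_i_t. Values below are illustrative only.
import math

def hadamard(u, w):
    return [ui * wi for ui, wi in zip(u, w)]

def bi_interaction(v_t, contexts):
    """Eq. (1): sum_i of v_t (Hadamard) c_i_t."""
    v_c = [0.0] * len(v_t)
    for c in contexts:
        v_c = [a + b for a, b in zip(v_c, hadamard(v_t, c))]
    return v_c

def sigmoid(z):
    """Element of the output layer activation in Eq. (4)."""
    return 1.0 / (1.0 + math.exp(-z))

v_t = [1.0, 2.0]                       # stand-in for the input embedding
contexts = [[0.5, 0.5], [1.0, -1.0]]   # stand-ins for two context embeddings
v_c = bi_interaction(v_t, contexts)    # [1*0.5 + 1*1, 2*0.5 + 2*(-1)]
```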
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Proposed Approach</head><p>Our proposed model, Attentional Bi-Interaction Deep Knowledge Tracing (ABIDKT), shown in Figure <ref type="figure" target="#fig_0">1(c</ref>), is an extension of the BIDKT model that weights the interactions between the skill id and context features in the BIDKT model. The original BIDKT model uses a narrow set of context features related only to the long-term trait of forgetting, but our goal is to use a wider set of features so as to better estimate the knowledge state and accurately predict future performance. The additional context features are wins, fails, question type, previous attempt response time and difference in previous attempt response time, which are described in detail in Section 5.2. The wins and fails context features were adopted from <ref type="bibr" target="#b10">[11]</ref>, which states that these features can be a good indication of the student's learning ability. The question type feature is important because each question type is associated with a different level of difficulty, and this feature therefore serves as a strong indicator of correct response probability. The previous attempt response time feature was adopted from <ref type="bibr" target="#b7">[8]</ref>, and the difference in previous attempt response time feature was used because preliminary analysis showed that it is a good indicator of skill mastery.</p><p>However, using this wider set of features could lead to the issue of important interactions being averaged out. Therefore, the ABIDKT model uses an attention network in a modified integration technique to weight the important interactions and ensure that they do not get averaged out.</p><p>In this case the input to the RNN module, 𝐯 𝑐 𝑡 , is computed using a modified integration technique called attentional bi-interaction. In this integration method, 𝐯 𝑐 𝑡 is the weighted sum of the interactions between 𝐯 𝑡 , the embedded dense real-valued vector of the input 𝐱 𝑡 , and 𝐜 𝑖 𝑡 , the embedded dense real-valued vector of each context feature.</p><formula xml:id="formula_4">\mathbf{v}^{c}_{t} = \sum_{i=1}^{n} p_{i}\,(\mathbf{v}_{t} \odot \mathbf{c}^{i}_{t})<label>(5)</label></formula><p>Here 𝑛 is the number of context features and 𝑝 𝑖 ∈ ℝ is the normalized attention weight of the interaction, calculated using the attention layer. The attention score 𝑝 ′ 𝑖 and the attention weight 𝑝 𝑖 , normalized by the Softmax function, are computed as:</p><formula xml:id="formula_5">p'_{i} = \mathbf{h}^{T} \tanh(\mathbf{W}_{att}(\mathbf{v}_{t} \odot \mathbf{c}^{i}_{t}) + \mathbf{b}_{att}) \quad \text{and} \quad p_{i} = \frac{\exp(p'_{i})}{\sum_{i=1}^{n} \exp(p'_{i})}<label>(6)</label></formula><p>Similar to BIDKT, the current knowledge state is computed using the previous knowledge state 𝐡 𝑡−1 and the integrated vector 𝐯 𝑐 𝑡 as:</p><formula xml:id="formula_6">\mathbf{h}_{t} = \phi(\mathbf{v}^{c}_{t}, \mathbf{h}_{t-1})<label>(7)</label></formula><p>To predict the student's performance at the next attempt, the weighted interaction between the current knowledge state and the context at the next attempt is computed as:</p><formula xml:id="formula_7">\mathbf{h}^{c}_{t} = \sum_{i=1}^{n} p_{i}\,(\mathbf{h}_{t} \odot \mathbf{c}^{i}_{t+1})<label>(8)</label></formula><p>The probability of a correct answer, 𝐲 𝑡 ∈ ℝ |𝑄| , is computed in the same way as for BIDKT, and the implementation of the output layer is the same as in the DKT and BIDKT models. As in the BIDKT architecture, the context embedding parameters are shared between the current knowledge state estimation step and the future performance prediction step in the ABIDKT model as well. For the attention network parameters, two variations were experimented with: one where the attention network parameters are shared and one where they are not.</p><p>The trainable parameters of BIDKT are the skill id (𝐱 𝑡 ) embedding matrix 𝐀, the weights of the RNN, the weight 𝐖 𝑜𝑢𝑡 and bias 𝐛 𝑜𝑢𝑡 for prediction, and the embedding matrix 𝐂 for the context information. In the case of ABIDKT we additionally have to train the weight 𝐖 𝑎𝑡𝑡 , bias 𝐛 𝑎𝑡𝑡 and parameter 𝐡 of the attention layer. These parameters are jointly learned by minimizing a standard cross entropy loss between the predicted probability of correctly answering the next question for skill id 𝑞 𝑡+1 and the true label 𝑎 𝑡+1 :</p><formula xml:id="formula_8">\mathcal{L} = -\sum_{t} \left( a_{t+1} \log(\mathbf{y}^{T}_{t}\,\delta(q_{t+1})) + (1 - a_{t+1}) \log(1 - \mathbf{y}^{T}_{t}\,\delta(q_{t+1})) \right)<label>(9)</label></formula><p>where 𝛿(𝑞 𝑡+1 ) is the one-hot encoding of the skill id answered at the next time step 𝑡 + 1.</p><p>The training process for the ABIDKT model is the same as for BIDKT and DKT; the main difference between the models lies in the set of trainable parameters.</p></div>
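The attentional bi-interaction of Eqs. (5)-(6) can be sketched numerically as below. Here `W_att`, `b_att` and `h` stand in for the trained attention parameters, and all values are illustrative assumptions, not the authors' implementation:

```python
# Sketch of attentional bi-interaction: each interaction v_t (Hadamard) c_i_t
# is scored by a small attention layer and the scores are softmax-normalized
# into weights p_i, which then weight the sum of interactions (Eq. 5).
import math

def hadamard(u, w):
    return [ui * wi for ui, wi in zip(u, w)]

def attention_weights(v_t, contexts, W_att, b_att, h):
    """Eq. (6): p'_i = h^T tanh(W_att (v_t . c_i) + b_att), then softmax."""
    scores = []
    for c in contexts:
        inter = hadamard(v_t, c)
        z = [math.tanh(sum(W_att[r][k] * inter[k] for k in range(len(inter)))
                       + b_att[r])
             for r in range(len(W_att))]
        scores.append(sum(hr * zr for hr, zr in zip(h, z)))
    m = max(scores)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentional_bi_interaction(v_t, contexts, p):
    """Eq. (5): weighted sum of interactions."""
    v_c = [0.0] * len(v_t)
    for p_i, c in zip(p, contexts):
        v_c = [a + p_i * b for a, b in zip(v_c, hadamard(v_t, c))]
    return v_c

# Toy values: two symmetric context embeddings receive equal weight.
v_t = [1.0, 1.0]
contexts = [[1.0, 0.0], [0.0, 1.0]]
W_att = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for trained W_att
b_att = [0.0, 0.0]                # stand-in for trained b_att
h = [1.0, 1.0]                    # stand-in for trained h
p = attention_weights(v_t, contexts, W_att, b_att, h)
```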
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>Experiments were conducted to compare the performance of the proposed ABIDKT architecture with BIDKT and DKT under different combinations of context features. The experiments were conducted to verify the following two hypotheses:</p><p>1. The bi-interaction technique used in the BIDKT architecture cannot effectively leverage a wider set of context features than those used in [6].</p><p>2. Weighting context feature interactions using an attention network ensures that performance does not saturate even as the number of context features increases.</p><p>5-fold cross validation was performed using a 70% ∶ 10% ∶ 20% ratio for the train:validation:test split, as done in the experimental setting of <ref type="bibr" target="#b5">[6]</ref>. The details of the datasets used, experiments conducted, and results obtained are given below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Datasets</head><p>The datasets chosen for the experiments are the Assistments 2012-2013 <ref type="bibr" target="#b24">[25]</ref> dataset which contains information about students studying school level Mathematics with multiple question types, and the Slepemapy.cz <ref type="bibr" target="#b25">[26]</ref> dataset which contains data from an online platform that teaches primary school Geography mainly consisting of 2 question types. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Preprocessing</head><p>Records where a user made only a single attempt at a single skill set item were removed, as in <ref type="bibr" target="#b5">[6]</ref>. Additionally, a few noisy records with negative response times were removed. Continuous-valued context features were preprocessed and discretized for use in the BIDKT and ABIDKT models. Further details are as follows:</p><p>1. repeated time gap: calculated as the difference in time stamp between the current and previous attempt of the same skill, in minutes. 2. sequence time gap: calculated as the difference in time stamp between the current and previous attempt (independent of skill id), in minutes. 3. past trial counts: calculated as the number of times the same skill has been attempted in the past. 4. wins: calculated as the count of correct responses in past trials of the same skill. 5. fails: calculated as the count of incorrect responses in past trials of the same skill. 6. question type:</p><p>• a discrete value in the range 0-5 for the Assistments 2012-2013 dataset.</p><p>• a binary discrete value (0, 1) for the Slepemapy.cz dataset. 7. previous attempt response time: the time taken to respond to the previous attempt of the same skill, in seconds. 8. difference in previous attempt response time: calculated as the difference in response times of the last 2 attempts of the same skill, in seconds.</p><p>All features except question type were discretized using the 𝑙𝑜𝑔 2 scale. The repeated time gap, sequence time gap and past trial counts features are the same as the context features used in BIDKT <ref type="bibr" target="#b5">[6]</ref>. The additional context features were determined based on common features available across datasets and common features used in the KT literature <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b10">11]</ref>.</p></div>
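One possible implementation of the log2 discretization described above; the exact binning scheme is not specified in the paper, so the zero-bin convention here is an assumption:

```python
# Hypothetical log2 discretization for continuous context features
# (e.g. time gaps in minutes): values below 1 go to bin 0, otherwise the
# bin index grows logarithmically with the value.
import math

def log2_bin(value: float) -> int:
    """Discretize a non-negative feature onto a log2 scale."""
    if value < 1:
        return 0  # bin for zero / sub-unit values (assumed convention)
    return 1 + int(math.floor(math.log2(value)))

# Gaps of 0, 1, 3, 8 and 100 minutes land in progressively coarser bins.
bins = [log2_bin(v) for v in [0, 1, 3, 8, 100]]
```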
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Hyper-parameters</head><p>The set of hyperparameters that maximized the average AUC over 5-fold cross validation was chosen for the final model implementations. Final results on the corresponding test sets were also reported using the AUC metric. The hyper-parameters were set as follows:</p><p>1. learning rate: varying the learning rate did not have a significant effect on the maximum value of AUC. Various learning rate values between 0.001 and 1 were tried, at values that were approximate multiples of 3, i.e., 0.001, 0.003, 0.01, 0.03, etc., and the value was further fine-tuned around the best performing one. Finally, the learning rate was set at 0.7 for the Assistments 2012-2013 dataset, except for the DKT architecture, for which it was fixed at 0.5. For the Slepemapy.cz dataset, the learning rate was fixed at 0.9 for all architectures. 2. hidden layer dimensions: different values of the hidden layer dimension between 10 and 100 were tried in increments of 10, and the value was empirically set at 30 for all variations of architectures and datasets. 3. dropout: dropout was set at 0.3, the best value found in the experiments of <ref type="bibr" target="#b5">[6]</ref>. 4. weight decay: weight decay values were varied between 10⁻⁶ and 10⁻³ at multiples of 10, i.e., 10⁻⁵ and 10⁻⁴; the best value varied between different folds in the k-fold cross validation. 5. mini batch size: this value was set at 100 for both datasets. For the Slepemapy.cz dataset, although the batch size in <ref type="bibr" target="#b5">[6]</ref> was set at 30, we set it at 100 to speed up processing. 6. epochs: the number of epochs was set to 1.5 times the maximum number of epochs required for the AUC to converge across all 5 folds for the BIDKT architecture. The number of epochs was thus set at 600 and 200 for the Assistments 2012-2013 and Slepemapy.cz datasets respectively, and the highest test AUC score among these epochs was reported.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results and Discussion</head><p>The results shown are for the DKT, BIDKT and 2 variations of the ABIDKT architecture respectively. The results for the BIDKT and ABIDKT architectures are shown for different numbers of features. The feature combinations are: The variations of the ABIDKT architecture are as follows:</p><p>• ABIDKT-SP: the parameters of the attention network and bi-interaction layer are shared between the knowledge state estimation step and the future performance prediction step, similar to the BIDKT architecture.</p><p>• ABIDKT: the parameters of the bi-interaction layer are shared between the knowledge state estimation step and the future performance prediction step, while the parameters of the attention network are trained independently.</p><p>Figures 2(a) and 2(b) show the average test AUC results across 5 folds for different combinations of features incorporated into the different architectures. From the results we observe that sharing trainable parameters between the knowledge state estimation step and the future performance prediction step (ABIDKT-SP) does not have a significant impact on performance, although not sharing parameters (ABIDKT) does perform marginally better when the number of features is increased, for both datasets. The main takeaways from the results are as follows:</p><p>Hyperparameter Tuning and Reproducibility. All baselines were reproduced and their hyperparameters were tuned using the same methodology as for the proposed model. We found that our tuning method led to an AUC improvement of 0.7% for both models on the Assistments 2012-2013 dataset compared to the values stated in <ref type="bibr" target="#b5">[6]</ref>. For Slepemapy.cz, while DKT could be reproduced, we could not match the AUC for BIDKT, primarily because the batch size mentioned in the paper was very small and made computation very time consuming. On the other hand, while setting a larger batch size led to a more reasonable runtime, the model saw a 1.1% drop in AUC.</p><p>Effect of Additional Context Features. Including wide context features led to improvements in AUC for both the BIDKT and ABIDKT models on both datasets, suggesting that the identified features encapsulate information indicative of future student performance. Also, in support of hypothesis 1 stated in Section 5, the extent of improvement in BIDKT tapered off, with negligible change when the number of features was increased from 5 to 8.</p><p>Effect of Attention Layers. Contrary to hypothesis 2, adaptively learned context weights in the form of attention layers did not provide a substantial improvement in model performance, instead consistently achieving an AUC 0.1% lower than the BIDKT counterpart. This may be because the added context features are not large in number, while attention layers involve more trainable parameters: trained on the same amount of data, the benefit of fewer trainable parameters in BIDKT outweighs the adaptive weight assignments learned by the attention layers.</p><p>We conducted further analysis by computing the micro-AUC after binning the predictions based on past trial counts, as shown in Figure <ref type="figure" target="#fig_2">3</ref>, and computing the percentage improvement of the ABIDKT architecture over the BIDKT architecture for each bin. The bin sizes were chosen so as to balance the number of samples in each bin. This analysis was performed on the Assistments 2012-2013 dataset, as this is a Mathematics tutor dataset in which each user is bound to have a large number of trials. From these results we observe that for low trial counts, ABIDKT does not show an improvement over BIDKT, but as the number of trials increases, the percentage improvement of ABIDKT over BIDKT also steadily increases for all sets of features. This shows that ABIDKT may be useful on datasets where each student has a large number of trials.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>The focus of this paper was to observe the effect of using a wider range of context features in the BIDKT model and to propose techniques to incorporate them effectively. We first identified a wider set of context features and incorporated them into the BIDKT model. Experimental results on 2 datasets showed that increasing the number of context features improves the performance of BIDKT significantly, but the performance begins to taper off as the number of features is increased from 5 to 8. We postulated that this could be because important feature interactions are diluted by other, unimportant ones. To overcome this drawback, we proposed a technique that adaptively learns the weights of feature interactions and incorporated it as an attention layer in the BIDKT model. Experimental results on the same 2 datasets show that this weighting technique was not sufficient to improve the performance of our models, possibly because we were trying to learn additional parameters from the same amount of data. We therefore analyzed the performance of our model across different trial counts and found that our model does outperform BIDKT when the number of past trial counts is high.</p><p>In future work, we first plan to apply our models to datasets that have a higher number of trial counts per student. We also plan to modify the attention architecture and see whether this can perform better than the ABIDKT model. 
Additionally, we plan to try these approaches in an architecture where skill is modeled separately, as in Dynamic Key-Value Memory Networks for Knowledge Tracing.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Architectures for (a) deep knowledge tracing, (b) bi-interaction deep knowledge tracing and (c) the proposed model: attentional bi-interaction deep knowledge tracing. In each architecture, the blue arrows describe the process of modeling a student's knowledge, while the orange arrows describe the process of predicting a student's performance. In our proposed model, the context information (shown in green) is incorporated in the form of context interactions (represented in purple), which are weighted according to importance to obtain a weighted interaction vector (purple with multi-colored components). The incorporation of context information happens at time steps 𝑡 and 𝑡 + 1 as 𝐜 𝑡 and 𝐜 𝑡+1.</figDesc><graphic coords="4,89.29,340.16,416.68,243.54" type="bitmap" /></figure>
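As a companion to the conclusion's summary of adaptively weighted feature interactions, the attention layer over pairwise bi-interactions can be sketched in the style of attentional factorization machines. This is a minimal numpy sketch under stated assumptions: the attention-network sizes, parameter names, and ReLU activation are illustrative choices, not the paper's exact ABIDKT configuration.

```python
import numpy as np

def attentional_bi_interaction(embeddings, W, h, b):
    """
    AFM-style attention over pairwise element-wise feature interactions.
    embeddings: (n_features, d) context feature embeddings
    W: (d, a), b: (a,), h: (a,)  -- parameters of a one-layer attention MLP
    Returns the attention-weighted interaction vector of size d.
    """
    n, d = embeddings.shape
    # All pairwise element-wise products (the bi-interaction terms)
    pairs = np.stack([embeddings[i] * embeddings[j]
                      for i in range(n) for j in range(i + 1, n)])
    # Score each interaction with a small MLP, then softmax-normalize
    scores = np.maximum(pairs @ W + b, 0.0) @ h   # ReLU hidden layer -> scalar
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                   # attention weights
    # Weighted pooling of the interaction vectors
    return (alpha[:, None] * pairs).sum(axis=0)
```

With all attention parameters at zero, the weights reduce to a uniform average of the interactions, which recovers plain bi-interaction pooling up to a constant factor; training the attention parameters is what lets the model up- or down-weight individual context interactions.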
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Average test AUC scores of different architectures on (a) the Assistments 2012-2013 dataset and (b) the Slepemapy dataset</figDesc><graphic coords="9,97.64,84.19,226.77,153.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Percentage improvement in micro-AUC of the ABIDKT architecture over the BIDKT architecture on the Assistments 2012-2013 dataset for different feature sets. The AUC scores have been computed by binning the predictions based on past trial counts.</figDesc><graphic coords="10,158.74,405.31,277.80,191.72" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Statistics of the data.</figDesc><table><row><cell>Dataset</cell><cell cols="2">#records #users #items</cell></row><row><cell>Assistments 2012-2013</cell><cell>5,818,868 45,675</cell><cell>266</cell></row><row><cell>slepemapy.cz</cell><cell>10,087,305 87,952</cell><cell>1,458</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Context-aware recommender systems</title>
		<author>
			<persName><forename type="first">G</forename><surname>Adomavicius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mobasher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ricci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tuzhilin</surname></persName>
		</author>
		<idno type="DOI">10.1609/aimag.v32i3.2364</idno>
		<ptr target="https://doi.org/10.1609/aimag.v32i3.2364" />
	</analytic>
	<monogr>
		<title level="j">AI Magazine</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="67" to="80" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Deep knowledge tracing</title>
		<author>
			<persName><forename type="first">C</forename><surname>Piech</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bassen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ganguli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sahami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Guibas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sohl-Dickstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="505" to="513" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Knowledge tracing: Modeling the acquisition of procedural knowledge</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Corbett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Anderson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">User Modeling and User-Adapted Interaction</title>
		<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="253" to="278" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Khajah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">V</forename><surname>Lindsey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Mozer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1604.02416</idno>
		<title level="m">How deep is knowledge tracing?</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Individualized bayesian knowledge tracing models</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">V</forename><surname>Yudelson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Koedinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J</forename><surname>Gordon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on artificial intelligence in education</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="171" to="180" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Augmenting knowledge tracing by considering forgetting behavior</title>
		<author>
			<persName><forename type="first">K</forename><surname>Nagatani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ohkuma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The World Wide Web Conference</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3101" to="3107" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-S</forename><surname>Chua</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.04617</idno>
		<title level="m">Attentional factorization machines: Learning the weight of feature interactions via attention networks</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Incorporating rich features into deep knowledge tracing</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Botelho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">T</forename><surname>Heffernan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourth (2017) ACM Conference on Learning@ Scale</title>
				<meeting>the Fourth (2017) ACM Conference on Learning@ Scale</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="169" to="172" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Dynamic key-value memory networks for knowledge tracing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>King</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D.-Y</forename><surname>Yeung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th international conference on World Wide Web</title>
				<meeting>the 26th international conference on World Wide Web</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="765" to="774" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Knowledge tracing with sequential key-value memory networks</title>
		<author>
			<persName><forename type="first">G</forename><surname>Abdelrahman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="175" to="184" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Dynamic student classiffication on memory networks for knowledge tracing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Minn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Desmarais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Pacific-Asia Conference on Knowledge Discovery and Data Mining</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="163" to="174" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Convolutional knowledge tracing: Modeling individualization in student learning process</title>
		<author>
			<persName><forename type="first">S</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3397271.3401288</idno>
		<ptr target="https://doi.org/10.1145/3397271.3401288" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;20</title>
				<meeting>the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;20<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1857" to="1860" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2008.01169</idno>
		<title level="m">Deep knowledge tracing with convolutions</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">C.-K</forename><surname>Yeung</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.11738</idno>
		<title level="m">Deep-irt: Make deep learning based knowledge tracing explainable using item response theory</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">A self-attentive model for knowledge tracing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Pandey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Karypis</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.06837</idno>
		<ptr target="http://arxiv.org/abs/1907.06837" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Matrix factorization techniques for recommender systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Koren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Volinsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="30" to="37" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Factorization machines</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rendle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2010 IEEE International Conference on Data Mining</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="995" to="1000" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Sparse factorization machines for clickthrough rate prediction</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 16th International Conference on Data Mining (ICDM)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="400" to="409" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Field-weighted factorization machines for click-through rate prediction in display advertising</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Lu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 World Wide Web Conference</title>
				<meeting>the 2018 World Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1349" to="1357" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">AFS: An attention-based mechanism for supervised feature selection</title>
		<author>
			<persName><forename type="first">N</forename><surname>Gui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="3705" to="3713" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2003.11235</idno>
		<title level="m">Autofis: Automatic feature interaction selection in factorization models for click-through rate prediction</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Muti-behavior features based knowledge tracking using decision tree improved DKVMN</title>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM Turing Celebration Conference-China</title>
				<meeting>the ACM Turing Celebration Conference-China</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Pandey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Srivastava</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2008.12736</idno>
		<title level="m">Rkt: Relation-aware self-attention for knowledge tracing</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Knowledge tracing machines: Factorization machines for knowledge tracing</title>
		<author>
			<persName><forename type="first">J.-J</forename><surname>Vie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kashima</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="750" to="757" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Addressing the assessment challenge with an online system that tutors as it assesses</title>
		<author>
			<persName><forename type="first">M</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Heffernan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Koedinger</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11257-009-9063-7</idno>
		<ptr target="https://doi.org/10.1007/s11257-009-9063-7" />
	</analytic>
	<monogr>
		<title level="j">User Modeling and User-Adapted Interaction</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="243" to="266" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Adaptive geography practice data set</title>
		<author>
			<persName><forename type="first">J</forename><surname>Papousek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pelánek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stanislav</surname></persName>
		</author>
		<idno type="DOI">10.18608/jla.2016.32.17</idno>
		<ptr target="https://doi.org/10.18608/jla.2016.32.17" />
	</analytic>
	<monogr>
		<title level="j">Journal of Learning Analytics</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="317" to="321" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
