Analyzing collaborative filtering for a UNED first-year student
enrolment recommendation system
Adrián Clavero 1, Víctor Fresno 2, Fernando Latorre Torres 2 and Salvador Ros 3
1 Centro Asociado de la UNED en Barbastro, Spain
2 Computer Systems and Languages Department, ETSI Informática, UNED, Spain
3 Communication and Control System Department, ETSI Informática, UNED, Spain


Abstract
First-year enrolment is an important decision that can influence students' future results. An
adequate decision about how many and which subjects best fit the personal characteristics of a
first-year student contributes to decreasing dropout and improving results. This work presents a
study of different algorithms for developing a collaborative filtering recommendation system for
UNED first-year students. Two algorithms have been evaluated and analyzed. The best algorithm for
the recommendation system is based on cosine similarity, which in the best scenario reduces the
percentage of failed subjects by up to 50%.

Keywords
Recommendation systems, learning analytics, dropout, education.

1. Introduction

   The National Distance Education University (UNED) is a distance university characterized by its
special methodology and the special profile of its students. The typical UNED student is in their
thirties and combines studies with professional activity or family responsibilities. This profile
determines the way these students enroll at the university. While sophomores usually choose a few
subjects and select them carefully according to a principle of effectiveness, first-year students tend to
enroll in the whole year or, if they decide on a partial enrollment, do not have enough information apart
from the syllabus to make the most effective enrolment.
   In the public data of the UNED's statistical portal (https://app.uned.es/evacaldos/), the difference
in evaluation rates between the first two years and the last two years is remarkable, the latter being
clearly higher. These data could be interpreted as meaning that first-year students have not had enough
time to study the subjects, or have even selected subjects that required content delivered in another
subject. In either case, poor study planning is behind these data. In contrast, the high rates among the
sophomores suggest better planning: they select subjects more carefully and according to their needs,
and so make better use of their efforts and obtain better results.
   This paper presents an enrollment recommendation system for the UNED's first-year students. Its
objective is to suggest the number of subjects to enroll in, based on the individual features of the
student, with the aim of achieving the best academic results and reducing the university dropout rate.
This work has been carried out as a Final Degree Project in the Computer Engineering Degree at the
UNED. The data used for the analysis were provided by the central university services, having been
previously anonymized according to the General Data Protection Regulation (GDPR; RGPD in Spanish).


Learning Analytics Summer Institute Spain (LASI Spain) 2022, June 20–21, 2022, Salamanca, Spain
EMAIL: aclavero@barbastro.uned.es (A. 1); flatorre@barbastro.uned.es (A. 2); vfresno@scc.uned.es (A. 3); sros@scc.uned.es (A. 4)
© 2022 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)


2. State of the art

    Recommender systems are now a cornerstone of many successful business models. Amazon, YouTube,
Spotify, Netflix [1] and other content streaming companies use recommenders to improve their offer and
build customer loyalty. To this end, they analyze customer behavioral data to recommend their products
efficiently and increase their profits. In the same way, these systems can be used to improve different
scenarios in learning processes. According to [2], recommendation systems have been applied to several
areas of the education field, academic choice being the area with the most applications. This area
includes tasks such as selecting a university, course, or specific discipline. Other areas relate to
academic performance and to content or learning resources, where the aim is to predict content
preferences from student profiles [3].
    Student enrolment is a matter of interest for universities, since a good selection of subjects
contributes to good student performance and decreases the dropout rate.
    The most common technique used in recommendation systems is collaborative filtering [4-7]. This
technique uses information about users and the correlations among them to make a recommendation. If
this information is organized by users' preferences, we speak of user-based filtering, which groups
users with the same preferences on the hypothesis that users with similar preferences require similar
products. If, instead, we consider product rating patterns, we speak of element-based (item-based)
filtering. Collaborative filtering provides better results when a large amount of data is available and,
conversely, has problems when new elements are introduced [8]. Other disadvantages of this technique
are the cold-start effect and sparsity. The cold-start problem arises when a new user starts using the
system and very little information about them is available. Sparsity means that data are lacking or very
irregular. This problem is relevant in collaborative filtering systems since they are based on user
data [9].
    Other approaches include content-based recommenders, whose primary goal is to recommend products
similar to those that the user has used and liked. This approach resembles the element-based
collaborative filter, with the main difference that it only collects data from the user who receives the
recommendation. Content-based filtering has the advantage of not depending on the preferences of other
users, since it is based on the comparison between elements. The disadvantages of this technique are
similar to those of collaborative filtering [9]. There are also knowledge-based recommenders. These
make deterministic recommendations, are not affected by the introduction of new elements, and use
information obtained from the user's previous actions. The information needed for this technique is
captured using different methods, and this acquisition of knowledge usually has a high cost in terms of
computation, time and resources [9].
    Finally, hybrid recommenders combine different approaches to obtain more robust recommendation
models, exploiting the main advantages of each model and reducing the negative effects of the chosen
approaches. Most recommender applications are hybrid, so different techniques are combined to obtain
the result, collaborative filtering being the most used individual technique. It should be noted that the
use of this approach depends on the available data, and it is not always possible to apply. As
highlighted before, these systems are time-consuming, so they may require a long execution time
depending on the algorithm and the amount of data processed. In these cases, it is convenient to
evaluate the results with different data volumes to balance result quality and execution time.
    Other considerations about recommender implementations are [9]:
    a) Scalability: the larger the dataset, the lower the efficiency may become, so big data
architectures should be considered.
    b) Privacy protection, especially in the learning process: the data must be protected effectively,
and the system should not require more data than needed.
    c) Over-specialization: this problem occurs when there is no diversity in the pattern of
recommendations, and the chances of the user discovering something beneficial are practically nil.
    d) Gray sheep: this effect refers to a user who does not show a clear preference or behaves
inconsistently. The recommendation therefore becomes difficult, and the effectiveness of the
recommender system decreases.


3. UNED enrollment recommender system

   The proposed recommender focuses on the academic enrollment of UNED first-year students. The
recommender system helps first-year students choose the number of subjects that best suits their
characteristics. We have selected this area of application because it is a crucial point in preventing
dropouts, as can be seen in the data of the UNED statistical portal. These data reflect a significant
difference between the number of students enrolled in first-year subjects and the number who sit the
exams.
   The recommender system uses a collaborative filtering technique, as we rely on the experiences of
students from previous academic years to make the recommendation. We discarded the content-based
filter because we focus on indicating a range for the number of subjects rather than specific subjects.
Knowledge-based systems do not suit our needs either, as we want to consider the experience of
students from previous years, which provides valuable information.
   Since this is a first approach to the problem, our recommender works without information about the
complexity of the subjects, so it can only recommend a range for the number of subjects that is likely
to lead to good performance.

3.1.    Dataset

    One of the principal problems when implementing a recommender in a university is obtaining the
data [10]. This project has been supported by the Vice-rectorate of Technologies, which gave us full
access to the data in anonymized form. All student identifiers have been modified so that no individual
student can be identified.
    Figure 1 shows the entity-relation scheme of the available information. We have information about
subjects, enrolment, students, evaluation performance, access to the university, gender and UNED
associate centers. These tables contain information from the last eleven years, and all the information
used was anonymized before processing. Taking into account that the UNED is the largest university in
Spain, with almost 200,000 students enrolled per year, the volume of data is huge.


Figure 1: Entity-relation scheme of the available data. Labels are in Spanish. All the ID fields correspond
to anonymized data.

   However, although we have a large amount of data, we must choose those that meet the requirements
of our project. For the recommender implementation, we only select data from students in their first
year who have taken the exams at least once. We have also preprocessed the data to keep only the
records that provide valuable information, eliminating those that contained an inadequate or null value
and those of students who did not take the exam in any of the subjects enrolled; such cases may
correspond to extraordinary situations, and we considered them outliers.
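The filtering rule described above can be illustrated with a minimal sketch. The record layout and field names (`student`, `age_range`, `enrolled`, `exams_taken`) are assumptions made for the example and are not taken from the UNED scheme in Figure 1.

```python
# Hypothetical records, one per first-year student. Field names and values
# are invented for illustration; they are not the real UNED tables.
records = [
    {"student": "a1", "age_range": 25, "enrolled": 6, "exams_taken": 4},
    {"student": "a2", "age_range": None, "enrolled": 5, "exams_taken": 3},  # null value
    {"student": "a3", "age_range": 35, "enrolled": 4, "exams_taken": 0},    # sat no exam
    {"student": "a4", "age_range": 30, "enrolled": 3, "exams_taken": 3},
]

def is_usable(record):
    """Keep a record only if no field is null and the student sat at least one exam."""
    has_nulls = any(value is None for value in record.values())
    return not has_nulls and record["exams_taken"] > 0

clean = [r for r in records if is_usable(r)]
print([r["student"] for r in clean])  # ['a1', 'a4']
```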

3.2.    Characterization of the UNED´s first-year student

    Since our user is a new student and we have no information about their behavior (i.e., previous
enrollment information), we can only use partial information from our scheme for collaborative
filtering. The profile of a first-year student is defined by features obtained from the data stored in
the University central systems. We have defined five main features: a) age range (ages from 0 to 115
are split into ranges); b) access method to the university, for which we have identified twelve methods;
c) gender, for which only two possible values are identified in the data; d) UNED degree in which the
student wants to enroll, to limit the recommendation; and e) associate center where the student enrolls.
The associate center is an essential field because it is related to the UNED geographical structure and
allows us to segment information geographically. Therefore, we look for students similar to our
first-year student in age, method of access to the university, gender and associate center. Once we have
identified similar students, we check how many subjects they enrolled in and whether they were
successful, which allows us to recommend a range for the number of subjects.
    For computational purposes, this profile is coded as a one-hot vector whose fields encode the
selected features (Table 1).

Table 1
First-year student one-hot vector
 Features   UNED's degree        Age range         Access          Campus          Gender
 Fields     7101   7102   …      20    25    …     1    2    …     1    2    …     M    F
 Values     1      0      0      0     1     0     0    1    0     1    0    0     1    0

    Thus, when the student has one of these characteristics, the corresponding component is set to 1;
otherwise, it is set to 0. For example, a student whose age range corresponds to 20 will have a 1 in
that component and a 0 in the remaining age-range components.
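As an illustration, the encoding can be sketched as follows (this kind of 0/1 vector is usually called a one-hot encoding). The category lists are shortened and partly hypothetical; the real vector covers every degree, age range, access method and center. The example profile reproduces the values shown in Table 1.

```python
# Shortened, partly hypothetical category lists, in the column order of Table 1.
FEATURES = {
    "degree": ["7101", "7102", "6301"],
    "age_range": [20, 25, 30],
    "access": [1, 2, 3],
    "campus": [1, 2, 3],
    "gender": ["M", "F"],
}

def encode(profile):
    """Concatenate one 0/1 block per feature, following the order of FEATURES."""
    vector = []
    for name, categories in FEATURES.items():
        vector.extend(1 if profile[name] == category else 0 for category in categories)
    return vector

student = {"degree": "7101", "age_range": 25, "access": 2, "campus": 1, "gender": "M"}
print(encode(student))  # [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
```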
    The new first-year student provides this information through a simple web interface (Figure 2).


Figure 2: User interface to collect first-year student information.


3.3.    Applied Algorithms

   In this work, two algorithms for making recommendations were analyzed: a) K-means and b) cosine
similarity. We selected these two algorithms to compare two different approaches and to gain insight
into the scenarios in which each could best be used.

3.3.1. K-means based recommender
   K-means is an unsupervised machine learning algorithm for clustering. It avoids relying on
similarity measures between pairs of items and aims to divide all the input elements into K groups
within which similarity is maximized.


Figure 3: K-value calculated by the elbow method.

   In a first step, the elements that will be the centroids of each group are defined (either by the
user or by random initialization). In subsequent steps, these centroids and the elements belonging to
each group are redefined. Once the groups have been created, the algorithm allows us to assign a new
element to the most suitable group, i.e., the group whose centroid is most similar. This algorithm has a
high time cost during learning, but the classification of a new student is then fast, which makes it
possible to obtain a recommendation quickly.
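The steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the study: K is fixed at 2 instead of being chosen with the elbow method, the initial centroids are supplied by hand, the profiles are invented, and deriving the range from the minimum and maximum number of subjects passed by the students in the group is an assumption made only for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain K-means starting from user-supplied centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, members in enumerate(groups):
            if members:  # an empty group keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Invented 0/1 profiles of previous students (three two-valued features each)
# and the number of subjects each of them passed.
profiles = [
    [1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1], [0, 1, 0, 1, 0, 1], [0, 1, 0, 1, 1, 0],
]
passed = [4, 5, 3, 2, 1, 2]

centroids = kmeans(profiles, centroids=[profiles[0], profiles[3]])

# Assign the new student to a group and read the range from that group.
new_student = [1, 0, 1, 0, 0, 1]
group = nearest(new_student, centroids)
peers = [n for p, n in zip(profiles, passed) if nearest(p, centroids) == group]
recommended_range = (min(peers), max(peers))
print(group, recommended_range)  # 0 (3, 5)
```

Once the centroids are computed, assigning a new student only requires one distance per centroid, which is why the recommendation step is fast.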

3.3.2. Cosine similarity-based recommender

    The similarity estimate offered by this method is based on the cosine of the angle formed by two
vectors. If the angle between two vectors $e_1$ and $e_2$ is 0º, the cosine value is 1 and both vectors
have the same direction. On the contrary, if the angle is 180º, the cosine value is -1 and the vectors
have opposite directions. Mathematically, it is defined as follows:

\[ \cos(e_1, e_2) = \frac{e_1 \cdot e_2}{\lVert e_1 \rVert \, \lVert e_2 \rVert} \]
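A minimal sketch of this measure applied to student profiles is shown below. The profiles and the identifiers `s1` to `s3` are invented, and ranking previous students by similarity is only one plausible way of using the measure.

```python
import math

def cosine(e1, e2):
    """Cosine of the angle between e1 and e2; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(e1, e2))
    norms = math.sqrt(sum(x * x for x in e1)) * math.sqrt(sum(y * y for y in e2))
    return dot / norms if norms else 0.0

# Invented profiles of previous students, keyed by a hypothetical identifier.
previous = {
    "s1": [1, 0, 1, 0, 1, 0],
    "s2": [1, 0, 1, 0, 0, 1],
    "s3": [0, 1, 0, 1, 0, 1],
}
new_student = [1, 0, 1, 0, 0, 1]

# Rank previous students from most to least similar to the new one.
ranking = sorted(previous, key=lambda s: cosine(new_student, previous[s]), reverse=True)
for s in ranking:
    print(s, round(cosine(new_student, previous[s]), 2))
```

Because the profile vectors only contain zeros and ones, their cosine similarity always lies between 0 and 1, so the -1 case does not occur for this data.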


3.4.    Experimentation

   To determine which algorithm best fits our problem, we selected a small dataset built with data
from five different degrees. These degrees were selected for their different global success rates, in
order to determine whether this parameter influences the choice of algorithm. The selected degrees are
listed in Table 2. These data were obtained from the statistical portal of the UNED.

Table 2
Degrees selected to evaluate the recommender.
        Code                    Degree Denomination                          Global Success Rate
        6301                        Social Education                               58.12%
        6502          Business Administration and Management                       31.89%
        6702                         History of Art                                48.64%
        6801                     Electrical Engineering                            16.84%
        7101                    Computer Engineering                               27.41%

    For the experimentation, we used all the data for the last eleven years except the 2019-2020
academic year, which was excluded because, due to the COVID-19 pandemic, it does not follow the trend
of previous years. We also eliminated the records of enrollments after the first year, since we focus
on the enrollment recommendation for the first year. We split our dataset following the 80:20 rule: 80%
for training and 20% for testing. The students used as reference were those who first enrolled in the
academic year 2018-2019.
    To assess the adequacy of the recommendation, we used two error metrics. First, we define the Real
Fail Rate (RFR), which is the proportion of enrolled subjects that a student actually failed:

\[ RFR = \frac{FG + NP}{NS} \]

   where FG is the number of failed subjects, NP is the number of subjects for which the student
ultimately did not take the exam, and NS is the number of subjects in which the student had enrolled.
Second, we define RMFR, the analogous fail rate computed against the recommended number of subjects:

\[ RMFR = \frac{RNS - RA}{RNS} \]

   where RNS is the number of subjects recommended for enrollment, and RA is the number of subjects
the student passed.
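Both metrics can be written directly from their definitions. The sketch below uses an invented student (10 subjects enrolled, 3 failed, 4 not taken, 3 passed) and an invented recommended range of 4 to 6 subjects; the function names `rfr` and `rmfr` are chosen only for this example.

```python
def rfr(fg, np_, ns):
    """Real Fail Rate: (failed + not taken) / enrolled subjects."""
    return (fg + np_) / ns

def rmfr(rns, ra):
    """RMFR: (recommended - passed) / recommended subjects."""
    return (rns - ra) / rns

# Invented student: 10 subjects enrolled, 3 failed, 4 not taken, 3 passed.
print(rfr(fg=3, np_=4, ns=10))  # 0.7

# Invented recommended range of 4 to 6 subjects, evaluated at both ends.
print(rmfr(rns=4, ra=3))  # 0.25
print(rmfr(rns=6, ra=3))  # 0.5
```

In this example the student failed 70% of the enrolled subjects, while the same number of passed subjects measured against the recommended range gives a rate between 25% and 50%.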

3.5.    Results

   To evaluate both algorithms, we built the one-hot vector of each test student, ran the algorithm
against the dataset, and computed RFR for each degree. We calculated RMFR for the two extreme values of
the recommended range: RMFRL for the lower (left) end of the range and RMFRR for the upper (right) end.
Table 3 shows the test results for each algorithm.

Table 3
RFR and RMFR results for all the degrees in the dataset
  Code    Degree Denomination                        Real Data    Cosine Similarity      K-means
                                                        RFR        RMFRL    RMFRR      RMFRL    RMFRR
  6502    Business Administration and Management        0.51        0.31     0.45       0.42     0.45
  6301    Social Education                              0.35        0.33     0.39       0.43     0.42
  6801    Electrical Engineering                        0.68        0.18     0.34       0.16     0.36
  6702    History of Art                                0.32        0.37     0.50       0.78     0.78
  7101    Computer Engineering                          0.59        0.22     0.34       0.39     0.41



4. Discussion and Conclusions

    After all the experimentation and the analysis of the data, we can state that the K-means
algorithm offers the worse results, the recommendation computed with cosine similarity being notably
better. The clustering algorithm cannot produce groups precise enough to improve the success rate or
reduce dropout. One of the reasons is the lack of information we have about first-year students. Our
one-hot vector is simple, and it could be enriched with more knowledge to better define the groups of
students that share the same characteristics. The selection of K is another drawback of this approach.
To obtain the K parameter, we applied the elbow method dynamically to adjust the algorithm to the user
profile. More work on this selection would improve the system.
    On the other hand, it is noteworthy that the recommender system does not improve the RFR of
degrees with a low RFR, such as History of Art, but it would help to significantly reduce the number of
failed subjects in degrees in which the percentage of failed subjects is currently very high. According
to the results obtained, the system would make it possible to reduce the percentage of failed subjects
by up to 50% in the best scenario.
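A quick check of this figure can be made from Table 3, assuming that the 50% refers to the absolute difference between RFR and the cosine-similarity RMFRL. That reading is an assumption of this sketch; the values are copied from the table and keyed by degree code.

```python
# (RFR, cosine-similarity RMFRL) per degree code, copied from Table 3.
table3 = {
    "6502": (0.51, 0.31),
    "6301": (0.35, 0.33),
    "6801": (0.68, 0.18),
    "6702": (0.32, 0.37),
    "7101": (0.59, 0.22),
}

# Absolute reduction in fail rate at the lower end of the recommended range.
reduction = {code: round(rfr - rmfrl, 2) for code, (rfr, rmfrl) in table3.items()}
best = max(reduction, key=reduction.get)

print(reduction)
print(best, reduction[best])  # 6801 0.5
```

Under this reading, the largest reduction is 0.50 (code 6801), while the degree with the lowest RFR (code 6702) shows a slightly negative value, which matches the observation that degrees with a low RFR do not benefit.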
    As future work, it would be especially useful to add information on subjects and degrees because,
although students may be similar, the degrees they take may change the results of the recommendation
given to each of them. This would even make it possible to recommend certain combinations of subjects
to create a more efficient and balanced sequence of study.
    We would like to emphasize the need for meaningful learning among students, since it would
facilitate understanding the contents of each subject. Although no specific order is established for the
study of the subjects, structuring them well would help to scaffold the concepts correctly. This can be
achieved by including additional information on the subjects.

5. References

[1] Gironacci, I. (2021). Literature Review of Recommendation Systems (pp. 119–129).
    https://doi.org/10.4018/978-1-7998-4339-9.ch009
[2] Rivera, A. C., Tapia-Leon, M., & Lujan-Mora, S. (2018). Recommendation Systems in Education:
    A Systematic Mapping Study. In Á. Rocha & T. Guarda (Eds.), Proceedings of the International
    Conference on Information Technology & Systems (ICITS 2018) (pp. 937–947). Springer
    International Publishing. https://doi.org/10.1007/978-3-319-73450-7_89
[3] Charnelli, M. E., Lanzarini, L. C., & Díaz, F. J. (2018). Sistemas recomendadores aplicados en
    educación. XX Workshop de Investigadores en Ciencias de la Computación (WICC 2018,
    Universidad Nacional del Nordeste). http://sedici.unlp.edu.ar/handle/10915/67261
[4] Lynn, N. D., & Emanuel, A. (2021). A review on Recommender Systems for course selection in
    higher education. https://doi.org/10.1088/1757-899X/1098/3/032039
[5] Maphosa, M., Doorsamy, W., & Paul, B. (2020). A Review of Recommender Systems for
    Choosing Elective Courses. https://doi.org/10.14569/ijacsa.2020.0110933
[6] O’Mahony, M. P., & Smyth, B. (2007). A recommender system for on-line course enrolment: An
    initial study. Proceedings of the 2007 ACM Conference on Recommender Systems, 133–136.
    https://doi.org/10.1145/1297231.1297254
[7] Ricci, F., Rokach, L., & Shapira, B. (2010). Recommender Systems Handbook. In Recommender
    Systems Handbook (Vols. 1–35, pp. 1–35). https://doi.org/10.1007/978-0-387-85820-3_1
[8] Gupta, S., & Dave, M. (2020). An Overview of Recommendation System: Methods and
    Techniques (pp. 231–237). https://doi.org/10.1007/978-981-15-0222-4_20
[9] Sharma, R., & Singh, R. (2016). Evolution of Recommender Systems from Ancient Times to
    Modern Era: A Survey. Indian Journal of Science and Technology, 9(20), 1–12.
    https://doi.org/10.17485/ijst/2016/v9i20/88005
[10] Fernández-García, A. J., Rodríguez-Echeverría, R., Preciado, J. C., Manzano, J. M. C., &
    Sánchez-Figueroa, F. (2020). Creating a Recommender System to Support Higher Education
    Students in the Subject Enrollment Decision. IEEE Access, 8, 189069–189088.
    https://doi.org/10.1109/ACCESS.2020.3031572

