Exploring the function of discussion forums in MOOCs: comparing data mining and graph-based approaches

Lorenzo Vigentini
Learning & Teaching Unit, UNSW Australia
Lev 4 Mathews, Kensington 2065
+61 (2) 9385 6226
l.vigentini@unsw.edu.au

Andrew Clayphan
Learning & Teaching Unit, UNSW Australia
Lev 4 Mathews, Kensington 2065
+61 (2) 9385 6226
a.clayphan@unsw.edu.au

ABSTRACT
In this paper we present an analysis (in progress) of a dataset containing forum exchanges from three different MOOCs. The forum data is enriched: together with the exchanges and the full text, we have a description of the design and pedagogical function of forums in these courses and a certain level of detail about the users, which includes achievement, completion and, in some instances, further details such as education, employment, age and prior MOOC exposure.

Although a direct comparison between the datasets is not possible because the participants and the courses differ in nature, what we hope to identify using graph-based techniques is a characterization of the patterns in the nature and development of communication between students and the impact of 'teacher presence' in the forums. With awareness of these differences, we hope to demonstrate that student engagement can be directed 'by design' in MOOCs: teacher presence should therefore be planned carefully in the design of large-scale courses.

Keywords
MOOCs, Discussion forums, graph-based EDM, pedagogy.

1. INTRODUCTION
In the past couple of years MOOCs (Massive Open Online Courses) have become the center of much media hype as disruptive and transformational [1, 2]. Although the focus has been on a few characteristics of MOOCs (free courses, massive numbers, massive dropouts and implicit quality warranted by the status of the institutions delivering these courses), a rapidly growing body of research has started to question the effectiveness of MOOCs for learning and their pedagogies. Even if one ignores entirely the philosophies of teaching driving the design and delivery of MOOCs, which range from the socio-constructivist (cMOOC, [4, 5]) to the instructivist (xMOOC, [3]), at the practical level instructors have to make specific choices about how to use the tools available to them. One of these tools is the discussion forum. Forums are one of the most popular asynchronous tools to support students' communication and collaboration in web-based learning environments [6]. They can be deployed in a variety of ways, ranging from a tangential support resource which students can refer to when they need help, to a space for learning with others, driven by the activities students have to carry out (usually sharing work and eliciting feedback). The latter, in a sense, emulates class time in traditional courses, providing a space for structured discussions about the topics of the course. One could argue that, as in face-to-face classes, the value of the interaction depends on the importance attributed to the forums by the instructors. This makes forums an interesting setting in which to explore teachers' presence and the value of their input in directing such conversations.

Mazzolini & Maddison characterize the role of the teacher and teacher presence in online discussion forums as varying from being the 'sage on the stage', to the 'guide on the side' or even 'the ghost in the wings' [7]. Furthermore, they argue that the 'ideal' degree of visibility of the instructor in discussion forums depends on the purpose of the forums and their relationship to assessment. There are also a number of accounts indicating that students' learning in forums is not very effective [8, 9]. However, if one looks at the data, there are numerous examples indicating that behaviours in forums are good predictors of performance in the courses using them, particularly if forum activities are assessed [10, 11, 12, 13]. Yet forums in MOOCs tend to attract only a small portion of the student activity [14]. This sets forums in MOOCs apart from the 'tutorial-type' forums used to support students' learning in online or blended courses in higher education. Furthermore, some argue that active engagement is not the only way of benefiting from discussion forums [15], and that students' characteristics and preferences could be more important than the course design in determining the way in which they take full advantage of online resources [16].

2. THE THREE MOOCS IN DETAIL
In order to investigate the way in which students use the discussion forums, we have extracted data from three MOOCs delivered by a large, research-intensive Australian university. The three courses are: P2P (From Particles to Planets - Physics); LTTO (Learning to Teach Online); and INTSE (Introduction to Systems Engineering), which are broadly characterised in the top of Table 1. The courses were specifically designed in quite different ways to test hypotheses about their design, delivery and effectiveness. In particular, P2P was designed to emulate a traditional university course in a sequential manner. All content was released on a week-by-week basis, dictating the pace of instruction. LTTO and INTSE were instead designed to provide a certain level of flexibility for the students to choose their learning paths. All content was readily available at the start; however, for LTTO the delivery followed a week-on-week pattern focusing on the interaction with students and selective attention to particular weekly topics (i.e. weekly feedback videos driven by the discussion forums as well as weekly announcements). Although announcements were also used in INTSE, the lack of weekly activities in the forums did not impose a strong pacing. In INTSE, the forums had only a tangential support value and were used mainly to respond to students' queries and to clarify specific topics emerging from the quizzes. Table 1 provides an overview of the different courses. It also shows that the forum activity in the various courses is a very small portion of all actions emerging from the activity logs, as has been reported in the literature [9].

                    | INTSE            | LTTO                   | P2P
--------------------+------------------+------------------------+-------------------------
Target group        | Engineers        | Teachers at all levels | High school and teachers
Course length       | 9 weeks          | 8 weeks                | 8 weeks
Forums              | 54               | 105                    | 63
                    | (14 top level)   | (17 top-level)         | (15 top-level)
Design mode         | All-at-once      | All-at-once            | Sequential
Delivery mode       | All-at-once      | Staggered              | Staggered
Use of forums       | Tangential       | Core activity          | Support
--------------------+------------------+------------------------+-------------------------
N in forum          | 422 (2.1%)       | 1685 (9.3%)            | 293 (2.8%)
Tot posts           | 1361 (avg=3.3)   | 6361 (avg=3.8)         | 1399 (avg=4.8)
Tot comments        | 285 (avg=0.7)    | 2728 (avg=1.7)         | 901 (avg=3.1)
Registrants         | 32705            | 28558                  | 22466
Active students (1) | 60%              | 63%                    | 47%
Completing (2)      | 4.2% (0.3% D)    | 4.4% (2.4 D)           | 0.7% (0.2%)

Table 1. Summary of the courses under investigation. NOTE: 1. Active students are those appearing in the clickstream; 2. Completing students achieve the pass grade or Distinction (D)

3. DETAILS OF THE DATASET
3.1 The dataset
The data under consideration is an export from the Coursera platform. Raw forum database tables (posts, comments, tags, votes) as well as a JSON-based web clickstream were used. Each clickstream event carries a key which specifies the action, either a 'pageview' or a 'video' item. Forum clickstream events were identified by a common '/forum' prefix.

The clickstream was further classified into: browsing; profile lookups; social interaction (looking at contributions); search; tagging; and threads. From the classification it became evident that the clickstream did not record all events, such as when a post or comment was made, or when votes were applied. For these, specific database tables were used. In order to manage different datasets and sources, a standardized schema was built, allowing disparate sources to feed into it while exposing a common interface for conducting analysis over forum activities. This is shown in Figure 1.

Figure 1. Forum data transformation process

3.2 An overview of forums activity
There are very interesting trends which require more detailed examination (bottom of Table 1). As expected, in LTTO the forum activity is larger than in the other courses, and this is probably due to the fact that students were asked to submit posts in forums following the learning activities. The proportion of active students in the forum is about four times that of the other courses. Yet, if we look at the average number of posts or comments, the patterns are not straightforward to interpret, as the level of engagement is similar across the courses with 3 to 5 posts per student and 1 to 3 comments (i.e. replies to existing posts), but with P2P showing a higher level of engagement than the other courses. One possible explanation is the different target group of the different courses, with INTSE including a majority of professional engineers with postgraduate qualifications, P2P focusing on high school students and teachers, and LTTO targeting a broad base of teachers across different educational levels.

The type of activity is summarised in Figure 2. In the chart, the five categories refer to the following: View corresponds to listing forums, threads and viewing posts; Post is the writing of a post or start of a new thread; Comment is a reply to an existing post; Social refers to all actions engaging directly with others' status (up-vote, down-vote and looking at profiles/reputation); Engage refers to the additional interaction with forum content (searching, tagging, 'watching' or subscribing to posts or threads). Viewing is the most prominent behaviour for both the student and instructor groups, and the figures are very similar across the board. A two-way ANOVA (2x5, role by activity) on the percentage distributions shows that there is no significant difference between students and instructors, but there is an obvious difference between views and the other types of behaviour (F(4,29) = 1656.3, p < .01).

Figure 2. Distribution of forum activities by role

If we consider the engagement over the timeline and compare the type of activities carried out by students and instructors, Figure 3 (end of the paper) shows the patterns for the three courses. The most striking pattern is that there does not seem to be an obvious one. As far as posts and views are concerned, in all three courses there is a sense of synchronicity between the two groups; however, from this chart it is not possible to understand in more detail what the connections are between what students and teachers do. Instructors' comments are slightly offset, possibly as a reaction to students' posts. An interesting aspect is the amount of 'social' engagement in the P2P course, which merits further analysis.

4. DIRECTIONS AND OPEN QUESTIONS
From this coarse analysis there seem to be minimal behavioural differences in the way students and instructors interact in the different courses; however, more analysis is required to tackle questions about the individual differences in students' and instructors' patterns of interaction and their interrelations. Furthermore, little can be said about how the nature of interactions drives the development of communication and engagement. A number of questions like the following remain open: How do discussions develop over time? How does teacher presence affect the development of discussions? Does the number of forums affect how students engage with them (e.g. by causing disorientation)?

4.1 The DM and graph-based approaches
A possible way to answer the questions about the types/patterns of behaviours, the structure and development of networks and the growth of groups/communities over time might be to use data mining and graph-based approaches. For example, Romero et al. [6] used a combination of quantitative, qualitative and social network information about forum usage to predict students' success or failure in a course by applying classification algorithms and classification via clustering algorithms. In their approach the activity of students in the forums is organized according to a set of commonly used quantitative metrics and a couple of measures borrowed from Social Network Analysis (Table 2). Although this seems to be a promising approach, there are two issues with this methodology in MOOCs: 1) only a tiny proportion of students can be considered active and 2) it is hard to scale the instructor's evaluation. The first problem is not easily resolved and it is an issue in the literature reviewed [17, 18]; non-posting behavior is considered as an index of disengagement, partly because this is easy to measure. In principle the second, the instructor's evaluation, could be replaced by peer evaluation (up-vote, down-vote), but there is no easy way to ensure consistency.

Indicator   | Type         | Description
------------+--------------+--------------------------------------------------------------------------
Messages    | Quantitative | Number of messages written by the student.
Threads     | Quantitative | Number of new threads created by the student.
Words       | Quantitative | Number of words written by the student.
Sentences   | Quantitative | Number of sentences written by the student.
Reads       | Quantitative | Number of messages read on the forum by the student.
Time        | Quantitative | Total time, in minutes, spent on forum by the student.
AvgScoreMsg | Qualitative  | Average score on the instructor's evaluation of the student's messages.
Centrality  | Social       | Degree centrality of the student.
Prestige    | Social       | Degree prestige of the student.

Table 2. Possible indicators characterising forum engagement

An alternative that can be explored is the use of graph-based approaches. For example, Bhattacharya et al. [19] used graph-based techniques to explore the evolution of software and source branching, providing an insight into the process. Kruck et al. [20] developed GSLAP, an interactive, graph-based tool for analyzing web site traffic based on user-defined criteria. Kobayashi et al. [18] used a method to quickly identify and track the evolution of topics in large datasets using a mix of assignment of documents to time slices and clustering to identify discussion topics. Yang et al. [21] integrated graph-based clustering to characterize the emergence of communities and text-based analysis to portray the nature of exchanges. In fact, students move through the various sub-forums taking different roles or stances as they engage with different subsets of students. As the reasons to engage in these discussions are partly determined by different interests, goals and issues, it is possible to construct a social network graph based on the post-reply-comment structure within threads. The network generated provides a possible view of a student's social participation within a MOOC, which may indicate some detail about their values, beliefs and intentions.

Furthermore, Brown et al. [22] have already shown the value of exploring the communities in discussion forums in MOOCs, particularly with regard to the homogeneity of performance but dissimilarity of motivations characterizing student hubs.

4.2 Discussion points
The examples above provide evidence of the potential for using graph-based methods to obtain better insights into the process and content analysis for our dataset and to extend their applicability to MOOCs; however, there are a number of contentious points to raise which will provide opportunities for discussion.

Firstly, the number of students who are actively involved in discussion is a very small proportion of the active participants. This means that the subset may not be representative at all. One could argue that these students are already engaged or desperately need help. Previous literature [21, 22, 23] focused on the ability to predict performance and on the peer effect which can emerge from the analysis of the graphs/social networks.

Secondly, one could question the value of the communities in xMOOCs: when courses are designed with an instructivist approach leading to mastery, this is by definition an individualistic perspective focused on the testing of one's own skills/learning. Of course in cMOOCs, which are connectivist by design, the development of social support is essential. This seems to be supported by Brown et al. [22]: they were not able to uncover a direct relation between stated goals and motivations and the participation in forums, and attributed this to pragmatic needs. However, as suggested earlier, the instructors might play a fundamental role in shaping the communities based on the value attributed to forums in their plans/design and the level of engagement/interaction. Considering the split between cMOOCs and xMOOCs again, interesting work might come out of the experiment conducted by Rosé and colleagues in the DALMOOC, in which automated agents were deployed to support students' conversations. In Coursera the deployment of 'community mentors' will be an interesting space to explore, given that the importance of design seems to be removed from instructors in the 'on-demand' model.

Lastly, more research is needed on the time-based dimension of the development of forums in MOOCs. How students bond and create stable relations, how they become authoritative and what motivates them to contribute over time are all open questions which the analysis of graphs over time might be able to address.

5. REFERENCES
[1] Dirk Jan van den Berg and Edward Crawley. Why MOOCS Are Transforming the Face of Higher Education. Retrieved April 12, 2015 from http://www.huffingtonpost.co.uk/dirk-jan-van-den-berg/why-moocs-are-transforming_b_4116819.html
[2] Chris Parr. The evolution of Moocs. Retrieved April 12, 2015 from http://www.timeshighereducation.co.uk/comment/opinion/the-evolution-of-moocs/2015614.article
[3] C. Osvaldo Rodriguez. 2012. MOOCs and the AI-Stanford Like Courses: Two Successful and Distinct Course Formats for Massive Open Online Courses. European Journal of Open, Distance and E-Learning (January 2012).
[4] George Siemens. 2005. Connectivism: A learning theory for the digital age. International journal of instructional technology and distance learning 2, 1 (2005), 3–10.
[5] Stephen Downes. 2008. Places to go: Connectivism & connective knowledge, Innovate.
[6] Cristóbal Romero, Manuel-Ignacio López, Jose-María Luna, and Sebastián Ventura. 2013. Predicting students' final performance from participation in on-line discussion forums. Computers & Education 68 (October 2013), 458–472. DOI: http://dx.doi.org/10.1016/j.compedu.2013.06.009
[7] Margaret Mazzolini and Sarah Maddison. 2007. When to jump in: The role of the instructor in online discussion forums. Computers & Education 49, 2 (September 2007), 193–213. DOI: http://dx.doi.org/10.1016/j.compedu.2005.06.011
[8] M.J.W. Thomas. 2002. Learning within incoherent structures: the space of online discussion forums. Journal of Computer Assisted Learning 18, 3 (September 2002), 351–366. DOI: http://dx.doi.org/10.1046/j.0266-4909.2002.03800.x
[9] Daniel F.O. Onah, Jane Sinclair, and Russell Boyatt. 2014. Exploring the use of MOOC discussion forums. In Proceedings of London International Conference on Education. London: LICE, 1–4.
[10] Alstete, J.W. and Beutell, N.J. Performance indicators in online distance learning courses: a study of management education. Quality Assurance in Education 12, 1 (2004), 6–14.
[11] Cheng, C.K., Paré, D.E., Collimore, L.-M., and Joordens, S. Assessing the effectiveness of a voluntary online discussion forum on improving students' course performance. Computers & Education 56, 1 (2011), 253–261.
[12] Palmer, S., Holt, D., and Bray, S. Does the discussion help? The impact of a formally assessed online discussion on final student results. British Journal of Educational Technology 39, 5 (2008), 847–858.
[13] Patel, J. and Aghayere, A. Students' Perspective on the Impact of a Web-based Discussion Forum on Student Learning. Frontiers in Education Conference, 36th Annual, (2006), 26–31.
[14] Jacqueline Aundree Baxter and Jo Haycock. 2014. Roles and student identities in online large course forums: Implications for practice. The International Review of Research in Open and Distributed Learning 15, 1 (January 2014).
[15] Vanessa Paz Dennen. 2008. Pedagogical lurking: Student engagement in non-posting discussion behavior. Computers in Human Behavior 24, 4 (July 2008), 1624–1633. DOI: http://dx.doi.org/10.1016/j.chb.2007.06.003
[16] René F. Kizilcec, Chris Piech, and Emily Schneider. 2013. Deconstructing Disengagement: Analyzing Learner Subpopulations in Massive Open Online Courses. In Proceedings of the Third International Conference on Learning Analytics and Knowledge. LAK '13. New York, NY, USA: ACM, 170–179. DOI: http://dx.doi.org/10.1145/2460296.2460330
[17] Dennen, V.P. Pedagogical lurking: Student engagement in non-posting discussion behavior. Computers in Human Behavior 24, 4 (2008), 1624–1633.
[18] Kobayashi, M. and Yung, R. Tracking Topic Evolution in On-Line Postings: 2006 IBM Innovation Jam Data. In T. Washio, E. Suzuki, K.M. Ting and A. Inokuchi, eds., Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2008, 616–625.
[19] Bhattacharya, P., Iliofotou, M., Neamtiu, I., and Faloutsos, M. Graph-based Analysis and Prediction for Software Evolution. Proceedings of the 34th International Conference on Software Engineering, IEEE Press (2012), 419–429.
[20] Kruck, S.E., Teer, F., and Jr, W.A.C. GSLAP: a graph-based web analysis tool. Industrial Management & Data Systems 108, 2 (2008), 162–172.
[21] D. Yang, M. Wen, A. Kumar, E. P. Xing, and C. P. Rose, "Towards an integration of text and graph clustering methods as a lens for studying social interaction in MOOCs," The International Review of Research in Open and Distributed Learning, vol. 15, no. 5, Oct. 2014.
[22] R. Brown, C Lynch, Y. Wang, M. Eagle, J. Albert, T. Barnes, R. Baker, Y. Bergner, D. McNamara. Communities of performance and communities of preference. GEDM 2015, in press.
[23] M. Fire, G. Katz, Y. Elovici, B. Shapira, and L. Rokach, "Predicting Student Exam's Scores by Analyzing Social Network Data," in Active Media Technology, R. Huang, A. A. Ghorbani, G. Pasi, T. Yamaguchi, N. Y. Yen, and B. Jin, Eds. Springer Berlin Heidelberg, 2012, pp. 584–595.

Figure 3. Time sequence of activity in the forums in the three courses by students and instructors grouped by activity type
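
As a minimal sketch of the graph construction outlined in Section 4.1, the following Python fragment builds a directed reply network from post and comment records and computes the two social indicators listed in Table 2. The record layouts, function names and sample data are assumptions made for illustration; they are not the Coursera export schema or the pipeline used in this work. Degree centrality is taken here as normalised out-degree and degree prestige as normalised in-degree, which is one common convention and may differ from the definition used in [6].

```python
from collections import defaultdict


def build_reply_graph(posts, comments):
    """Return (nodes, out_edges) with directed edges replier -> post author.

    posts:    iterable of (post_id, author_id)             -- hypothetical layout
    comments: iterable of (comment_id, post_id, author_id) -- hypothetical layout
    """
    post_author = {post_id: author for post_id, author in posts}
    nodes = set(post_author.values())
    out_edges = defaultdict(set)
    for _comment_id, post_id, replier in comments:
        nodes.add(replier)
        target = post_author.get(post_id)
        # Skip orphaned comments and self-replies: neither links two people.
        if target is None or target == replier:
            continue
        out_edges[replier].add(target)
    return nodes, out_edges


def degree_measures(nodes, out_edges):
    """Normalised out-degree (centrality) and in-degree (prestige) per node."""
    denom = max(len(nodes) - 1, 1)  # avoid division by zero for a single node
    in_count = defaultdict(int)
    for targets in out_edges.values():
        for target in targets:
            in_count[target] += 1
    return {
        node: {
            "centrality": len(out_edges.get(node, ())) / denom,
            "prestige": in_count[node] / denom,
        }
        for node in nodes
    }


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    posts = [("p1", "alice"), ("p2", "bob")]
    comments = [
        ("c1", "p1", "bob"),
        ("c2", "p1", "carol"),
        ("c3", "p2", "alice"),
        ("c4", "p1", "alice"),  # self-reply, ignored
    ]
    nodes, edges = build_reply_graph(posts, comments)
    for student, scores in sorted(degree_measures(nodes, edges).items()):
        print(student, scores)
```

Edges are stored as sets, so repeated replies between the same pair of students count once; a weighted variant would keep a counter per pair instead. Restricting the input records to a time window before building the graph would give one snapshot per week, which is one possible way to approach the time-based questions raised in Section 4.2.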