Educating Future Software Architects in the Art and Science of Analysing Software Data

Wolfgang Mauerer
Technical University of Applied Sciences Regensburg
Siemens AG, Corporate Research, Munich
Germany
wolfgang.mauerer@oth-regensburg.de

Stefanie Scherzinger
Technical University of Applied Sciences Regensburg
Regensburg, Germany
stefanie.scherzinger@oth-regensburg.de

Abstract: We report the design and teaching experience of a Master-level seminar course on quantitative and empirical software engineering. The course combines elements of traditional literature seminars with active learning by scientific project work, in particular quantitative mixed-method analyses of open source systems. It also provides short introductions and refreshers to data mining and statistical analysis, and discusses the nature and practice of scientific knowledge inference. Student presentations of published research, augmented by summary reports, bridge to standard seminars. We discuss our educational goals and the course structure derived from them. We review research questions addressed by students in mini research reports, and analyse them as tokens of how junior-level software engineers perceive the potential of empirical software engineering research. We assess challenges faced, and discuss possible solutions.

Index Terms: Empirical Software Engineering, Teaching Quantitative Methods, Statistical Analysis, Literature Seminar

I. INTRODUCTION

Effective decision making is a crucial part of being successful in software engineering (SWE). Architects, programmers and even technical managers need to decide, among other things, how to best organise team collaboration, how to choose appropriate software components and frameworks, and how to design entire software architectures.

Scientifically sound decision making is ideally based on measurable facts. Consequently, substantial portions of SWE research rest on empirical, quantitative methods. This constitutes a teaching challenge: Beyond covering an already large syllabus, advanced statistical methodology must be introduced to create an understanding of the benefits and limits of scientific knowledge inference.

In the Master-level seminar described in this paper, we address these challenges in a setting targeted at advanced students with a focus on practical engineering: We augment a traditional scientific seminar, without imposing undue workload, with active, creative learning components, challenging students with the quantitative, data-driven investigation of a research question of their choice. At the same time, we re-use this setting to learn from our students (many of whom are part-time employees in the local software industry,¹ or have previous work experience²) how empirical methods are perceived by junior-level software professionals.

¹ In-house surveys show that 40% of all Master students dedicate over 40% of their time to casual work (usually as programmers). Details upon request.
² All undergraduate students at Technical University of Applied Sciences Regensburg complete a mandatory, 18-week internship. Additionally, 40% of Bachelor graduates report in in-house surveys that they have held full-time occupations in the private sector before taking up their studies.

II. COURSE DESIGN

The computer science Master curriculum at Technical University of Applied Sciences Regensburg requires students to complete a scientific seminar worth 5 ECTS credits. In the following, we detail organisation, learning goals and timeline of the course. So far, we have taught two iterations.

A. Learning Goals

The course description for the scientific seminar³ states these learning goals: The students learn to 1) independently research an area within the field of computer science, 2) critically reflect and summarise central ideas of scientific work, 3) perform literature search and reviews, 4) give a professional presentation, and 5) engage in an academic discussion.

To reach these goals, scientific seminars traditionally comprise a seminar presentation as well as a seminar report on an existing body of research. However, we also made it our goal that students actively experience empirical SWE (eSWE), beyond merely analysing existing research. They should gather background knowledge as to why (and when) an empirical, quantitative approach is preferable to more orthodox SWE, and experience benefits, limitations and challenges of quantitative work. Consequently, we desire that they 1) do not merely read up on principles, but acquire a certain level of proficiency in using and also mining version control systems⁴, 2) gain first-hand experience with the technical and conceptual pitfalls in exploring a research question, 3) write a mini research report as a “training” opportunity before handing in the graded seminar report, and 4) are aware that not only technical aspects of building software, but also socio-technical and social aspects of software development can be quantified.

³ This course is detailed in the department module guidelines.
⁴ The ubiquitous version control system git is an obvious choice, since it is a popular data source in research; using the system for data engineering usually implies a proficiency boost in daily work, too.

S. Krusche, S. Wagner (eds.): SEUH 2020. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

[Figure 1: timeline with week marks 0, 1, 4, 7, 8, 9, 14, 15; phases: Self-study, Prepare lab report, Prepare seminar report and presentations; events: Kickoff, Lab session, DDL 1, Presentations, DDL 2]
Fig. 1. Timeline breakdown of the 15-week term. “DDL” denotes a student deadline for submitting or presenting results.

B.
Organisation and Timeline

Figure 1 summarises the timeline of our seminar, broken down across the 15-week term, and highlights the main events.

a) Kickoff session (week 1): Students enrolled in the Master's program are assigned to one of several parallel seminar tracks (organised by different professors), according to their topical preferences. A track comprises 20 participants. We asked our students to work through the online course “Version Control with Git”.⁵ This course includes hands-on exercises, so our course participants can operate git directly on the command line (and not just via feature-constrained colourful user interfaces). This includes advanced work with different branches, cloning, fetching, forking, and cherry picking, as well as a basic understanding of the data storage model.

⁵ The course is available on the Udacity MOOC platform.

b) Lab session (week 4): The lab session is an all-day workshop where the students focus on practical exercises. This includes answering questions on more advanced aspects of git (the full list is available in the online supplement⁶). This allows students to self-assess their level of proficiency in handling git. (Additionally, we schedule two papers [1], [2] on the subject early in the paper presentation stage, see (d).)

⁶ Blue coloured text provides a link in the electronic version of this paper.

[Figure 2: annotated BigQuery screenshot. Annotations: “SQL query over relations holding GitHub data.”, “Response within seconds.”, “The top-n Go packages.”]
Fig. 2. Using BigQuery to identify the most frequently imported Go packages.

Likewise, we provide small challenges that must be solved using Google BigQuery. BigQuery is a cloud-based data warehouse.⁷ It provides various open data sets, among them the GitHub activity data. As of October 2019, this contains a snapshot of open source software (OSS) repositories amounting to over 3 TiB of data (currently, over 2.8 million repositories, 145 million unique commits, and over 2 billion different files). Queries such as “What are the most frequently imported Go packages?” may be stated declaratively, using SQL, as Figure 2 illustrates. Together, we discussed the issue of reproducibility, as the data collection is updated regularly.

⁷ While BigQuery is a commercial service, it can be used without billing enabled, but requires that students are comfortable with a Google account.

We further provided a refresher on statistics and on scientific writing, the latter based on Zobel’s book [3].

For a mandatory two-page mini research report (discussed in detail in Section III-B) that explores a self-chosen research question, students collected ideas in an interactive brainstorming session. We gave feedback on the validity and feasibility of each question, taking into account the temporal constraints. We also commented on apparent threats to validity.

c) Mini research report deadline (week 7, Deadline 1): We graded the submitted mini research reports by detailed criteria that we made public beforehand.⁸

⁸ The grading rubric is available in the online supplement.

d) Seminar presentations (weeks 8 and 9): Each student is assigned one (usually seminal) original research paper, or a book chapter from Ref. [4], as a basis for the seminar presentation and report. Prior to presentation and discussion, students were mentored one-on-one, as in traditional seminars.

e) Seminar report (week 14, Deadline 2): The five-page seminar report is prepared by week 14. It wraps up the core ideas of the underlying article or book chapter, and discusses it critically in the context of related work, methodological soundness⁹, and practical utility.

⁹ This requires substantial individual guidance from the instructors. Additionally, the statistical refresher points out commonly encountered problems.

III. EXPERIENCE REPORT

We next report on our experience. We begin by discussing Google BigQuery as a means of evaluating research questions. We then reflect on the mini research reports. We review encountered challenges in the upcoming section.

A. Data Provisioning with git and Google BigQuery

Felderer and Kuhrmann [5] confirm that students tend to underestimate the effort of data collection and preparation, in agreement with common experience in data science. This calls for using sophisticated tools that come with powerful data preparation pipelines. Yet unfortunately, we found that many of the software solutions used by professional researchers lack quality and maturity, particularly regarding ease of installation and setup, completeness of documentation, and usability, which was confirmed after consultation with the tool authors. In short, we failed to get any of the state-of-the-art tools¹⁰ used in academic research to work for in-classroom use within reasonable effort (we grudgingly need to accept a share of the blame since this also holds for our own tools). Thus, research questions based on complex socio-technical observations or multi-modal data sources cannot be addressed in mini research reports. To compensate, we devote a substantial share of the discussed literature to such research (e.g., the seminal series of papers on socio-technical congruence by Cataldo, Herbsleb and co-workers, initiated in Ref. [6]).

¹⁰ Easy to install and use tools like gitstats are too simplistic even for less ambitious research questions chosen by students.

We settled on two recommendations for how students can conduct their own research. (1) First, we proposed individual, programmatic analysis using either scripted calls of git or (preferably) using git front-end libraries from scripting languages for data collection (we recommend PyGit2, GitPython, and Git2R). (2) Alternatively, we proposed to use BigQuery, as already discussed. The well-curated data relations of the latter alleviate common issues that trouble the collection of “big” data; students can focus on writing SQL.

Following these recommendations can reduce the effort spent with data ingestion nuisances like parsing (broken) dates, parsing (broken) strings, handling (broken and/or mixed) encodings, or handling other (broken) system details.

B. Mini Research Reports

Mini research reports could be produced by teams of two, and students had free choice on the topic. Each run of the seminar produced about 20 suggestions with some overlap, resulting in 30 unique candidate questions (the full list of candidate questions is available in the online supplement). We identified six topical groups, as shown in Figure 3, along with typical research questions. We also show the distribution of the questions according to our categorisation.

Students’ Research Questions
• 30%: Relationships between straightforward observables
  Time of day versus bug introduction?
  Does the number of bugs per developer vary with project age?
• 23%: Velocity of changes to observable quantities
  Speed of Java dependency updates after the weekly security issue?
  How fast are bug tickets closed?
• 13%: Testing: effort, coverage, and utility
  How are unit tests distributed by programming language?
  How does test coverage evolve?
• 10%: Hidden and indirect project properties
  How many OSS projects are company supported?
• 10%: Test (anecdotal or established) SWE conjectures
  Developer group size versus the 7±2 scrum assumption?
  Do codes of conduct have measurable effects?
• 10%: Trivia
  Do bigger files change more often?
Fig. 3. Distribution of student research ideas, categorised (subjectively) by topic, along with typical research questions.

We additionally categorised each research question concerning the research methodology: 1) Scope: Is the research question related to a single project or does it pertain to multiple projects? 2) Analysis Method: Is a simple (count-based) measurement considered, are (correlations or stronger forms of) relationships between measured variables addressed, or does the research question try to resolve a specific hypothesis? 3) Time Resolution: Is the question applied in a time-resolved way (i.e., did students consider that properties may change over time), or is each project analysed as a single static entity?

Figure 4 provides a mosaic plot [7] of the resulting three-way contingency table. The largest group concerns the analysis of several projects, and considers relationships between variables, but without accounting for possible changes in the relationship over time. At the same time, no explicit testing of a hypothesis on a single project was suggested.

[Figure 4: mosaic plot. Dimensions: Analysis Method (Measurement, Relationship, Hypothesis), Project Scope (Multiple, Single), Time Resolved (No, Yes)]
Fig. 4. Classification of mini research questions by methodological properties. Areas are proportional to the occurrence count of combinations, and flat lines denote absent combinations.

Discussion: We believe that to some degree, the topics chosen for mini research reports mirror the students’ expectations and way of thinking: All seminar participants hold a Bachelor degree. Thus, they are fully qualified as junior-level developers. A survey after the winter 19/20 run showed that 75% of the participants have substantial work experience: 50% claimed work experience equivalent to about one year of full-time employment, 25% even more than three years; details online. The students’ intuition should therefore reflect the intuition about eSWE in practice.

Most research questions proposed by the students concern measuring multiple projects instead of in-depth evaluations of a single one. Interestingly, all mini research reports involve quantitative measurements, and do not suggest any ethnographic or qualitative research, which does not mirror the topical distribution observed for published work. More than half of the research questions concern relationships between observed variables. This might indicate that students are interested in finding universal relationships valid beyond the scope of one particular undertaking, which meets our expectations towards Master-level students.

Usually, either a visual description or simpler measures like correlations or a univariate linear regression model are employed. Given the short time frame, this is understandable, but it might also indicate unease with more advanced analysis techniques. No team chose a machine learning-type analysis, despite the popularity of these methods among students.

Straightforward measurements of a single variable are usually intended to act as proxy for an (explicitly given, but often only diffusely defined) quality property. For instance, the number of tests is used as proxy for code quality, and the number and staleness of TODO entries in the code as proxies for project progress. Students did not consistently realise that relations between proxy and indirect observables are not always in direct proportion, and that assuming such connections in the first place is a threat to validity. Thus, some research questions might even be categorised as “bad smells”, as defined by Menzies and Shepperd [8]. Of course, we do not hold this against our students, who are novices in eSWE. Rather, we hope that by attending the seminar, the students learn to recognise “smelly” research questions.

Interestingly, hardly any students set out to apply principles and measures that are part of the standard SWE curriculum [9] of their SWE lecture, such as code metrics or code coverage.

Overall, we typically see simple statistical analyses for relations. What is missing is the question of how any of the measured co-variables influence quality or other properties of projects, or can even induce actionable consequences. This indicates that prior to the seminar, there was no established notion of whether and how complex decisions in SWE projects can be based on evidence- and measurement-based reasoning. Only 50% of the participants of the winter 19/20 run reported prior literature experience with eSWE methods; interestingly, no one reported prior use of eSWE in commercial projects.

IV. CHALLENGES

In the following, we highlight several challenges that we encountered in teaching the course.

A. Scientific Method: Theory and Application

In the computer science curriculum at Technical University of Applied Sciences Regensburg, the scientific seminar is only taught at the Master level. This exposes students later than desirable¹¹ to scientific processes and methodology, and to conducting systematic research.¹² Students usually need to sharpen their understanding of the differences between hypotheses, theories, laws, observations, and conjectures, that is, the basic building blocks of scientific insight, as we frequently observe when supervising student theses.

¹¹ Experiences from multiple half-day refresher courses on scientific data evaluation for early-stage PhD candidates confirmed, as far as the value of anecdotal evidence goes, that opportunities for improvement are not exclusively restricted to early-stage Master students.
¹² Related lectures include a compulsory course on Automata, Formal Languages and Computation (4 ECTS) that discusses nature and limits of scientific inference; a checklist for preparing a scientific experience report on a mandatory industrial internship; the preparation of a Bachelor’s thesis (12 ECTS, albeit often performed in industrial settings); and an elective short course on conducting research, intermittently taught by the authors of this paper. The omission of a dedicated course on scientific procedure is in line with the German computer science curriculum recommendations [9], and therefore probably extends to many other academic institutions as well.

Both authors have worked in industry, and have professionally built commercial software, before returning to academia. We find that exposure to the scientific method is useful for properly evaluating and understanding contemporary results of empirical software engineering research (Q1), for assessing the value of marketing claims of commercial vendors (Q2), and for comparing the novelty of approaches and solutions to the large body of existing work (Q3). The survey results in Figure 5 show that students mostly agree, except for Q2. The students share our enthusiasm that eSWE methods will help them become better software engineers (Q4), although confusingly only one student in four plans to employ such methods in the future (Q5).

[Figure 5: bar chart of answers (Yes, Neutral, No) to questions Q1 to Q5; count axis from 0 to 10]
Fig. 5. Student opinion after the winter 19/20 run regarding the usefulness of eSWE and scientific methods in practice (questions Qn: see text).

The attitude towards philosophical aspects of science versus the acquisition of practical knowledge is, for many students, not unambiguously in favour of the former. Two aspects require particular attention in teaching: Firstly, software engineering comprises technical and social aspects, and it is usually impossible to derive quantitative a-priori theories in fields with such characteristics. Statistical inference therefore needs to be understood as the predominant means of establishing certainty. Many statements that prevail in the industrial domain, however credible they may sound from “experience”, can only be rationalised or refuted in this way.¹³ Secondly, conducting a too delicately faceted discussion on the nature of science would distract from the seminar core. Differences between scientific research and actions dictated by practical necessity can be exposed by entertaining the pragmatic viewpoint of equating scientific insight with systematicity [10].

¹³ It seems not entirely impertinent to remark that many popular textbooks on the decades-old agile credo do not ease the situation.

Providing or refreshing the aforementioned knowledge necessitates covering a substantial body of topics that often exceed what is covered in the non-elective parts of the curriculum. The lab session contains general guidance on these issues, but we further equip students with a comprehensive slide deck that details some of the aspects, and contains appropriate pointers for self-study. Care is needed to not put undue burden (or any perception of undue load) on the participants, to keep the workload comparable between parallel seminar tracks.

B. Statistics, Machine Learning and Data Analysis

Software engineering research rests on a wide body of statistical methods, but is also sometimes known to employ these techniques in inappropriate or flawed ways [11]. We believe this implies three challenges that need to be solved: Firstly, popular statistical methods in research (such as advanced forms of multivariate regression, mixed models, association rules etc.) are usually not covered in compulsory undergraduate lectures. Secondly, students found it challenging to apply their method knowledge to practical data sets (e.g., knowing the principles of linear regression is not sufficient to interpret the comprehensive output delivered by statistical software, and the same holds for evaluating the quality or aptitude of models for a given body of data). Thirdly, students predominantly perceive minimising the prediction error in statistical models as the sole quality criterion (most likely caused by the current surge of interest in machine learning and artificial intelligence), which overshadows other schools of statistical thinking [12]. Software engineering research, in particular, is often concerned with parsimonious and interpretable models, and it was, for instance, necessary to remind students that common measures like the ubiquitously used R² value in linear regression are sub-optimal discriminators to judge models, since closeness of a model to data can (with over-fitting in mind) usually not immediately be related to model quality.

C. Availability of Full-Fledged Textbooks

We are not aware of a textbook for SWE that is not an edited collection of contributions by a large number of authors, or a collection of (essentially) research papers. We therefore decided to blend chapters from Ref. [4] with selected scientific works on research issues, in particular Refs. [1], [2], augmented by Easterbrook et al. [13] on method selection for empirical research. Especially for presentations that establish base method knowledge, students identified differences in technical depth, scientific rigour, and focus, perhaps not entirely without justification. Fully escaping this problem in a setting that discusses original research seems impossible.

D. Grading Based on Methods Preached

The difficulty of grading SWE projects is well known [14], and extends to student work produced in this seminar. Our major learning objective is to create awareness for data-driven methods, so we found it pertinent to hold grading to this standard. As an experiment, the mini research reports were therefore independently graded by both authors, and results were subjected to various statistical analyses and comparisons, which showcases their practical utility on an issue exposed to much student curiosity. Fig. 6 not only demonstrates a satisfactory consistency between graders, but can also be used to remind students of the implications of residual correlation.

[Figure 6: scatter plot of the two graders' scores; both axes from 0 to 50 points]
Fig. 6. Comparison of grading results (winter 18/19 run, 50 points maximum), together with a simple linear regression model (solid black), ideal correlation curve (dashed red), and 95% confidence interval (shade of grey).

V. RELATED WORK AND CONCLUSION

The idea of students writing mini research reports has been pursued before. Our concept of mini research reports best matches the experiments proposed by Fagerholm et al. [15] and Ref. [16], the former of which gives detailed guidelines for including empirical studies in SWE education.

Researchers have suggested ideas on how to enable students to build up skills in eSWE. Wohlin [17] proposes (i) integration with a software engineering course, (ii) a stand-alone course, or (iii) a dedicated research method course. Fagerholm et al. [15] suggest using eSWE methods as Master’s thesis topics, which creates person-specific in-depth understanding, but unfortunately does not widely distribute method awareness. Option (iii) best matches our scientific seminar, whereas [5], [18], [15] report on courses that match options (i) and (ii).

For other courses comprising mini eSWE projects, students reported hands-on experience as beneficial for their future careers [5], which confirms our motivation and matches our experience after two iterations in a high teaching-load, application-oriented environment.

We acknowledge helpful comments from our study dean Markus Westner.

REFERENCES

[1] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. German, and P. Devanbu, “The promises and perils of mining git,” in 6th IEEE Int Working Conference on Mining Software Repositories, 2009.
[2] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “The promises and perils of mining GitHub,” in Proceedings of the 11th working conference on mining software repositories, 2014.
[3] J. Zobel, Writing for Computer Science, 3rd ed. Springer Publishing Company, Incorporated, 2015.
[4] C. Bird, T. Menzies, and T. Zimmermann, The Art and Science of Analyzing Software Data, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2015.
[5] M. Felderer and M. Kuhrmann, “Using Mini-Projects to Teach Empirical Software Engineering,” in Tagungsband des 16. Workshops “Software Engineering im Unterricht der Hochschulen”, 2019, pp. 75–86.
[6] M. Cataldo, J. D. Herbsleb, and K. M. Carley, “Socio-technical congruence: A framework for assessing the impact of technical and work dependencies on software development productivity,” in Proc. of the 2nd ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM ’08. New York, NY, USA: ACM, 2008, pp. 2–11.
[7] K. Hornik, A. Zeileis, and D. Meyer, “The Strucplot Framework: Visualizing Multi-way Contingency Tables with vcd,” Journal of Statistical Software, vol. 17, no. 3, pp. 1–48, 2006.
[8] T. Menzies and M. Shepperd, “Bad smells in software analytics papers,” Information and Software Technology, vol. 112, pp. 35–47, 2019.
[9] Gesellschaft für Informatik, “GI-Empfehlungen Bachelor-Master,” https://gi.de/fileadmin/GI/Hauptseite/Aktuelles/Meldungen/2016/GI-Empfehlungen Bachelor-Master-Informatik2016.pdf, 2016, [Online; accessed 08-Jan-2020].
[10] P. Hoyningen-Huene, “Systematicity: The Nature of Science,” Philosophia, vol. 36, pp. 167–180, 06 2008.
[11] R. P. Reyes, O. Dieste, E. R. Fonseca, and N. Juristo, “Statistical Errors in Software Engineering Experiments: A Preliminary Literature Review,” in Proceedings of the 40th International Conference on Software Engineering, ser. ICSE ’18, 2018, pp. 1195–1206.
[12] L. Breiman, “Statistical modeling: The two cultures (with comments and a rejoinder by the author),” Statistical science, vol. 16, no. 3, 2001.
[13] S. M. Easterbrook, J. Singer, M.-A. D. Storey, and D. E. Damian, “Selecting Empirical Methods for Software Engineering Research,” in Guide to Advanced Empirical Software Engineering, F. Shull, Ed. Springer London, 2008.
[14] O. Hummel, “Transparente Bewertung von Softwaretechnik-Projekten in der Hochschullehre,” in Tagungsband des 13. Workshops “Software Engineering im Unterricht der Hochschulen”, 2013, pp. 103–114.
[15] F. Fagerholm, M. Kuhrmann, and J. Münch, “Guidelines for Using Empirical Studies in Software Engineering Education,” in Software Engineering und Software Management 2018, Fachtagung des GI-Fachbereichs Softwaretechnik, 2018, pp. 85–87.
[16] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, and B. Regnell, Experimentation in Software Engineering. Springer, 2012.
[17] C. Wohlin, Empirical Software Engineering: Teaching Methods and Conducting Studies. Berlin, Heidelberg: Springer, 2007, pp. 135–142.
[18] M. Kuhrmann, “Teaching Empirical Software Engineering Using Expert Teams,” in Tagungsband des 15. Workshops “Software Engineering im Unterricht der Hochschulen”, 2017, pp. 20–31.
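As a supplementary illustration of the “scripted calls of git” route recommended in Section III-A, the following Python fragment is a minimal sketch of how commits could be counted per hour of day, a possible first step towards a question such as “Time of day versus bug introduction?”. The helper names author_dates and commits_per_hour, and the choice of git's %aI date format, are assumptions made for this illustration and are not taken from the course material. Broken or empty date lines are skipped rather than repaired.

```python
import subprocess
from collections import Counter
from datetime import datetime


def author_dates(repo_path):
    """Return one author date string per commit of the repository at repo_path."""
    # %aI asks git for the author date in strict ISO 8601 format,
    # which yields one line per commit.
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%aI"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def commits_per_hour(date_lines):
    """Count commits per hour of day in the author's local time zone."""
    hours = Counter()
    for line in date_lines:
        try:
            stamp = datetime.fromisoformat(line.strip())
        except ValueError:
            # Broken or empty date: skip it instead of guessing a value.
            continue
        hours[stamp.hour] += 1
    return hours
```

For a local clone, commits_per_hour(author_dates("path/to/clone")) returns a Counter keyed by hour (0 to 23); the path is a placeholder. Separating the git call from the parsing step keeps the parsing testable on plain strings, and makes the decision to drop unparsable dates explicit, which is one of the threats to validity that a mini research report would need to state.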