Predicting Workout Quality to Help Coaches Support Sportspeople

Ludovico Boratto
Data Science and Big Data Analytics, EURECAT, Centre Tecnológic de Catalunya, Barcelona, Spain
ludovico.boratto@acm.org

Salvatore Carta
Dip.to di Matematica e Informatica, Università di Cagliari, Cagliari, Italy
salvatore@unica.it

Walid Iguider
Dip.to di Matematica e Informatica, Università di Cagliari, Cagliari, Italy
w.iguider@studenti.unica.it

Fabrizio Mulas
Dip.to di Matematica e Informatica, Università di Cagliari, Cagliari, Italy
fabrizio.mulas@unica.it

Paolo Pilloni
Dip.to di Matematica e Informatica, Università di Cagliari, Cagliari, Italy
paolo.pilloni@unica.it

ABSTRACT

The support of a qualified coach is crucial to keep the motivation of sportspeople high and help them pursue an active lifestyle. In this paper, we discuss the scenario in which a coach follows sportspeople remotely by means of an eHealth platform named u4fit. Having to deal with several users at the same time, with no direct human contact, makes it hard for the coach to quickly spot who, among the people she follows, needs more timely support. To this end, we present an automated approach that analyzes the adherence of sportspeople to their planned workout routines and suggests to the coach the sportspeople who need earlier support due to a poor performance. Experiments on real data, evaluated through classic accuracy metrics, show the effectiveness of our approach.

CCS CONCEPTS

• Information systems → Mobile information processing systems; Data mining;

KEYWORDS

Personalized Persuasive Technologies, Health Recommendation, Healthy Lifestyle, eCoaching, Motivation.

ACM Reference Format:
Ludovico Boratto, Salvatore Carta, Walid Iguider, Fabrizio Mulas, and Paolo Pilloni. 2018. Predicting Workout Quality to Help Coaches Support Sportspeople. In Proceedings of the Third International Workshop on Health Recommender Systems co-located with the Twelfth ACM Conference on Recommender Systems (HealthRecSys'18), Vancouver, BC, Canada, October 6, 2018, 5 pages.

HealthRecSys'18, October 6, 2018, Vancouver, BC, Canada. © 2018 Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION

Regular physical activity is key to maintaining good health [22]. In order to keep motivation high, eHealth persuasive technologies (eHPT) are designed to help people change their habits and overcome their frictions to healthier behaviors [7, 8, 10].

The u4fit platform (www.u4fit.com; note that the coaches marketplace is visible only when the Italian language is set on the platform) connects users with human coaches, allowing for a tailored exercise experience at a distance [1, 14]. Users receive tailored workout plans from coaches and, thanks to a mobile application, they are guided to execute the workout correctly. Moreover, coaches receive the results of a workout and can interact with the users via a live chat.

However, a coach usually follows many sportspeople, so, after a workout, it is not trivial to understand which sportsperson should be supported first (e.g., who she should chat with). Indeed, a training result is made up of several metrics that need to be carefully analyzed (e.g., speed and covered distance, just to name a few), so the effectiveness of a workout cannot be easily and quickly estimated.

To help coaches support first the sportspeople who performed a poor workout (since they are, trivially, those who need the most urgent support), in this paper we propose an approach that predicts the quality of a workout result by means of a rating. Based on the features that characterize previous workouts and the ratings assigned to them by the coaches, we train a classifier to predict the rating of the new workouts that the coach has not considered yet. This allows us to recommend to the coach the workouts (and, thus, the sportspeople who performed them), ordered by increasing predicted rating (i.e., those with a low rating are presented first), allowing the coach to take action. (In case two users need equally urgent support, different strategies can be carried out, such as supporting first the older sportsperson, or the one who has not received support for a longer amount of time. Deciding how to rank equally important cases goes beyond the scope of this paper and is left as future work, to be addressed when the approach is implemented in the u4fit platform.)

Being able to provide effective and timely support to the users who need it the most is a powerful form of motivation, which is crucial for long-term adherence to a training routine [13].

Recommender systems (RS) can help support decisions in health environments. As highlighted in [23], when a RS is developed for health professionals (as in our case), it provides information that allows them to address specific cases. Moreover, health RS help provide reliable and trustworthy information to the end users [23]. The goal of health RS is usually to lead to lifestyle changes [20], to support users who are losing motivation when exercising [15], and to improve patients' safety [5]. Readers can refer to [3] for a survey on health RS.

To the best of our knowledge, no recommender system helps coaches by suggesting to them the sportspeople who need more timely support. Such an approach can help coaches provide focused interventions to motivate poorly performing users. Indeed, coaches can intervene quickly to persuade users to change their negative attitude towards physical activity, so as to favor longer-term adherence to their training routines. More specifically, our contributions are the following:

• we provide, for the first time in the literature of health RS, an approach that recommends to a coach the sportspeople she follows who need timely support, considering the workouts they recently performed and that the coach has not considered yet;
• we validated our proposal on a real-world dataset made up of approximately 3 years of data, by comparing different classifiers on standard accuracy metrics;
• our solution can be embedded in real-world persuasive eHealth systems, thus finding practical and effective applications.

We organize the rest of the paper as follows: in Section 2 we introduce the dataset and in Section 3 we present the techniques we employed to preprocess the data. Section 4 presents the classifiers we considered in this study, while in Section 5 we present the experimental framework and results. We conclude the paper in Section 6 with some final remarks and future developments.
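To make the recommendation step described above concrete, the following minimal sketch (in Python) orders the workouts a coach has not reviewed yet by increasing predicted rating, so that the poorest performances surface first. The names used here (Workout, rank_for_coach, the stand-in predictor) are our own illustrative assumptions and do not come from the u4fit code base.

```python
from dataclasses import dataclass

@dataclass
class Workout:
    user: str
    features: dict  # aggregate statistics, e.g. {"avg_speed_kmh": 9.8, "distance_m": 5200}

def rank_for_coach(pending_workouts, predict_rating):
    """Order unreviewed workouts by increasing predicted rating,
    so the sportspeople who performed worst come first."""
    return sorted(pending_workouts, key=lambda w: predict_rating(w.features))

# Example with a trivial stand-in predictor (the paper uses a trained classifier).
workouts = [Workout("Anna", {"avg_speed_kmh": 11.2}), Workout("Luca", {"avg_speed_kmh": 7.4})]
fake_predictor = lambda f: 5 if f["avg_speed_kmh"] > 10 else 2
for w in rank_for_coach(workouts, fake_predictor):
    print(w.user, fake_predictor(w.features))
```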
2 DATASET

This research work is based on data collected by means of the u4fit platform. The dataset contains 3593 workouts, which u4fit coaches evaluated by assigning a rating ranging between 1 (poorly performed) and 5 (well performed). Each workout result is represented by the following aggregate statistics:

• Covered distance (in meters);
• Workout duration (in seconds);
• Rest time (in seconds);
• Average speed (in km/h);
• Maximum speed (in km/h);
• User age;
• User gender;
• Burnt calories.

Ratings were distributed as described in Table 1, where "Count" indicates the number of samples having the corresponding rating.

Table 1: Samples count for each rating

Rating  Count
1       216
2       723
3       994
4       977
5       683

Figure 1: Ratings distribution (graphical representation of the counts in Table 1).

The workouts we considered are those performed by means of the u4fit mobile app. We excluded those performed by means of running watches, since users have to program their workout routines manually on those devices and the resulting workouts sometimes do not match exactly the workout built by the coach. Users of the mobile application, instead, receive their workout plan seamlessly inside the app, so the performed workouts always match those designed by their coaches. This allows the coaches to make a fair evaluation of the workout.

As we are dealing with real-world data, the main issues we encountered were the data imbalance and the small size of the minority classes, as can be clearly seen from the distribution of ratings in Table 1 and Figure 1.
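As an illustration of how a workout result could be encoded for a classifier, the sketch below builds a feature matrix from records holding the eight aggregate statistics listed above together with the coach's rating. The column names, the gender encoding, and the loading code are our own assumptions, not the actual u4fit data schema.

```python
import pandas as pd

# Hypothetical column names for the eight aggregate statistics plus the coach's rating.
FEATURES = ["distance_m", "duration_s", "rest_s", "avg_speed_kmh",
            "max_speed_kmh", "age", "gender", "calories"]

def build_feature_matrix(records):
    """Turn a list of workout dictionaries into X (features) and y (coach ratings)."""
    df = pd.DataFrame(records)
    df["gender"] = df["gender"].map({"F": 0, "M": 1})  # simple numeric encoding
    return df[FEATURES].values, df["rating"].values

# Example record (invented values, for illustration only).
X, y = build_feature_matrix([{
    "distance_m": 5200, "duration_s": 1980, "rest_s": 120, "avg_speed_kmh": 9.5,
    "max_speed_kmh": 13.1, "age": 34, "gender": "F", "calories": 410, "rating": 4,
}])
```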
3 PREPROCESSING

Most machine learning classifiers run into trouble when dealing with imbalanced data, since their learning phase may be biased towards the instances that appear most frequently in the dataset [11, 19].

To deal with imbalanced data, researchers have suggested two main approaches: the first consists of adapting the data by performing a sampling, while the other is to tweak the learning algorithm [11]. For the sake of simplicity, and due to its effectiveness on our data, we employed the first approach.

Data sampling aims at modifying the data so that all the classes have the same distribution in the training set. There exist two data sampling approaches, known as oversampling and undersampling. Oversampling balances the training set by duplicating instances of the minority classes or by algorithmically generating new synthetic instances. Undersampling instead proceeds by removing instances from the majority class.

In our case, we adopted the oversampling approach, since it proved to be more effective for small datasets [21]. More specifically, we opted for the Synthetic Minority Over-sampling Technique (SMOTE), since it creates completely new samples instead of replicating existing ones, which offers more examples for the classifier to learn from [4]. The minority classes are oversampled by introducing synthetic examples generated from the k nearest neighbors of each minority-class instance [4].
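A minimal usage sketch of this step with the imbalanced-learn package cited above is shown below. The feature matrix X_train and labels y_train are placeholders, and the parameter values are illustrative rather than those reported in the paper; note also that only the training folds should be resampled.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

def balance_training_set(X_train, y_train, k=5, seed=42):
    """Oversample the minority rating classes with SMOTE.

    The test fold must keep the original (imbalanced) distribution,
    so only the training portion is passed here.
    """
    smote = SMOTE(k_neighbors=k, random_state=seed)
    # On very old imbalanced-learn versions this method is called fit_sample.
    X_bal, y_bal = smote.fit_resample(X_train, y_train)
    print("before:", Counter(y_train), "after:", Counter(y_bal))
    return X_bal, y_bal
```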
4 CLASSIFICATION

In order to identify the classification algorithm most suited to our use case, we compared tree-based and ensemble classifiers, since they perform better than non-ensemble, non-tree-based classifiers when dealing with low-dimensionality data [19]. We evaluated and compared the performance of three of the most effective classifiers in the state of the art [6].

Gradient Boosting (GB) is an ensemble algorithm that improves the accuracy of a predictive function through incremental minimization of the error term. After the initial base learner (almost always a tree) is grown, each tree in the series is fit to the so-called "pseudo-residuals" of the prediction from the earlier trees, with the purpose of reducing the error [2].

Random Forest (RF) is a meta-estimator of the family of ensemble methods. It fits a number of decision tree classifiers, such that each tree depends on the values of a random vector sampled independently and with the same distribution for all the trees in the forest.

Decision Tree (DT) is a non-parametric supervised learning method used for classification and regression. One of the main advantages of decision trees with respect to other classifiers is that they are easy to inspect, interpret, and visualize, since they are less complex than the trees generated by other algorithms addressing non-linear needs [16].

5 EXPERIMENTAL FRAMEWORK

In this section, we present the experimental setup and strategy, the evaluation metrics, and the obtained results.

5.1 Experimental Setup and Strategy

The experimental framework exploits the Python scikit-learn 0.19.1 library. The experiments were executed on a computer equipped with a 3.1 GHz Intel Core i7 processor and 16 GB of RAM. To balance the data we applied SMOTE, using imbalanced-learn, a package offering several sampling techniques for datasets showing strong class imbalance [12]. The classification was performed with 10-fold cross-validation. Both the parameters and the feature importances of the classifiers were estimated using Grid Search.

The classifiers were run with their default parameters, except for the number of boosting stages in Gradient Boosting (n_estimators parameter) and the maximum depth of each tree in Gradient Boosting (max_depth parameter). This is because a larger number of boosting stages (n_estimators) improves the performance of Gradient Boosting, while max_depth limits the number of nodes of each tree in the boosting stages. The best parameters turned out to be max_depth equal to 9 and n_estimators equal to 400.

We performed four sets of experiments (a sketch of the comparison pipeline follows this list):

(1) Classifiers comparison. We evaluated the classifiers by running them on all the features, then we compared the accuracy metrics they obtained to determine the most effective one.
(2) Feature sets importance evaluation. During the feature selection phase, we used the Grid Search algorithm to evaluate the impact of each feature on the result of the classification, for the most effective classifier of the previous experiment.
(3) Evaluation of the classifier with fewer features. After choosing the most effective classifier, we removed the least important features one by one and evaluated the classification accuracy, to check how the less relevant features affected the effectiveness of the classifier.
(4) Features impact on rating values. In the last set of experiments, we measured the correlation between the value that each feature took in a workout and the rating the workout received. This allows us to evaluate how each feature impacts the quality of a workout.
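The sketch below illustrates this setup, assuming the (balanced) feature matrix from Section 3. The parameter grid and the use of accuracy as the cross-validation score are our own illustrative choices; only the Gradient Boosting values reported in the paper (max_depth=9, n_estimators=400) are taken from the text, and resampling inside each fold (the more rigorous option) is omitted for brevity.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

def tune_gradient_boosting(X, y):
    """Grid-search the two Gradient Boosting parameters the paper tunes."""
    grid = {"n_estimators": [100, 200, 400], "max_depth": [3, 6, 9]}
    search = GridSearchCV(GradientBoostingClassifier(), grid, cv=10)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_  # paper reports max_depth=9, n_estimators=400

def compare_classifiers(X, y):
    """10-fold cross-validated accuracy for the three compared classifiers."""
    models = {
        "GB": GradientBoostingClassifier(n_estimators=400, max_depth=9),
        "RF": RandomForestClassifier(),
        "DT": DecisionTreeClassifier(),
    }
    return {name: cross_val_score(clf, X, y, cv=10).mean() for name, clf in models.items()}
```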
5.2 Metrics

In order to evaluate the performance of our multi-class model, we had to choose metrics that are suitable for multi-class datasets. However, the majority of the performance measures in the literature are designed only for two-class problems [9]. Several performance metrics for two-class problems have been adapted to the multi-class setting. The measures that fit our needs, give us relevant information about the performance of our classifier, and are successfully applied to multi-class problems are: Accuracy, Recall, Precision, F1-score, Informedness, and Cohen's Kappa [9]. In what follows, we present these metrics in detail.

Accuracy is defined as (TP + TN)/(P + N), where P represents the positively labeled instances and N the negatively labeled ones; TP represents the true positives (i.e., instances of the positive class that are correctly labeled as positive by the classifier) and TN the true negatives (i.e., instances of the negative class that are correctly labeled as negative by the classifier). It is the fraction of all instances that are correctly classified.

Recall is defined as TP/P and measures the completeness of a classifier.

Precision is defined as TP/(TP + FP), where FP denotes the false positives, and measures the exactness of a classifier.

F1-score is defined as

    F1 = 2 * TP / (2 * TP + FP + FN)    (1)

where FN denotes the false negatives; it is a metric that considers both recall and precision.

None of the metrics presented so far takes into account the true negative rate (defined as TN/N), and this is an issue when dealing with imbalanced datasets [17]. Considering this, we decided to also measure Informedness, which is the clearest measure of the predictive value of a system [18]. Informedness is defined as Recall + true_negative_rate - 1, where true_negative_rate is TN/N. It ranges between -1 and 1, where 1 represents a perfect prediction, 0 a prediction no better than random, and -1 total disagreement between prediction and observation.

Cohen's Kappa is an alternative measure to Accuracy, as it compensates for randomly classified instances. As opposed to Accuracy, Cohen's Kappa evaluates the portion of correctly classified instances that can be attributed to the classifier itself, relative to all the classifications that cannot be attributed to chance alone. Its formula is:

    Kappa = (Accuracy - RandomAccuracy) / (1 - RandomAccuracy)    (2)

where RandomAccuracy is defined as:

    RandomAccuracy = ((TN + FP) * N + (FN + TP) * P) / (P + N)^2    (3)

Cohen's Kappa ranges from -1 (total disagreement), through 0 (random classification), to 1 (perfect agreement). This metric is particularly effective for multi-class problems, as opposed to plain accuracy [9]. Indeed, it scores and aggregates successes independently for each class and is thus less sensitive to the randomness caused by the different number of instances in each class.
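A possible implementation of this evaluation with scikit-learn is sketched below. Macro-averaging over the five rating classes is our own assumption for extending the two-class definitions, since the paper does not state which averaging it uses; the true-negative-rate term needed for Informedness is computed per class in a one-vs-rest fashion.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    """Multi-class versions of the metrics used in the paper (macro-averaged)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recall = recall_score(y_true, y_pred, average="macro")
    # Macro-averaged true negative rate (per-class specificity, one-vs-rest),
    # needed for Informedness = Recall + TNR - 1.
    tnr = np.mean([np.mean(y_pred[y_true != c] != c) for c in np.unique(y_true)])
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, average="macro"),
        "Recall": recall,
        "F1": f1_score(y_true, y_pred, average="macro"),
        "Informedness": recall + tnr - 1,
        "Cohen's Kappa": cohen_kappa_score(y_true, y_pred),
    }
```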
5.3 Experimental Results

5.3.1 Classifiers comparison. Table 2 shows that Gradient Boosting is the classifier that performs best on all the metrics. The accuracy is about 78%, which means that we correctly predict the rating of a workout in about 78% of the cases. This means that, in the vast majority of the cases, the coach would be able to properly support the sportspeople she follows, since she would receive an accurate ranking of those who performed worst in their training.

Table 2: Classifiers comparison.

Metric          GB    RF    DT
Accuracy        0.78  0.78  0.76
F1              0.49  0.48  0.44
Recall          0.51  0.50  0.44
Precision       0.48  0.47  0.44
Informedness    0.36  0.36  0.29
Cohen's Kappa   0.35  0.34  0.29

5.3.2 Feature sets importance evaluation. The feature selection process has shown that the ranking of the features, based on their impact on the classification process (from the most important to the least important), is:

(1) Average speed;
(2) Covered distance;
(3) Burnt calories;
(4) Workout duration;
(5) Maximum speed;
(6) User age;
(7) Rest time;
(8) User gender.

To analyze the relevance of these features in more detail, Figure 2 shows the importance of each feature on a scale ranging from 0 (no importance) to 100 (very important); we can see that every feature has an impact on the classification process, since none has a zero importance.

Figure 2: Features' importance.

5.3.3 Evaluation of the classifier with fewer features. After evaluating the importance of the features, we removed them one by one to see how they affect the performance of the Gradient Boosting classifier. Table 3 contains the results obtained by removing the features in the previous list one by one, starting from the least important one (i.e., setting 1 contains all the features, setting 2 runs the classifier without the user gender, setting 3 removes both the user gender and the rest time, and so on). As the results show, none of the features negatively affects the performance of the classifier, since the best results were obtained when using all the features.

Table 3: Results returned by training Gradient Boosting with different sets of features (columns are settings 1 to 8).

Metric          1     2     3     4     5     6     7     8
Accuracy        0.78  0.78  0.63  0.63  0.63  0.63  0.68  0.68
F1              0.49  0.49  0.03  0.03  0.03  0.03  0.08  0.22
Recall          0.51  0.50  0.20  0.20  0.20  0.20  0.20  0.21
Precision       0.49  0.48  0.04  0.04  0.02  0.04  0.55  0.27
Informedness    0.36  0.36  0.00  0.00  0.00  0.00  0.01  0.01
Cohen's Kappa   0.35  0.35  0.00  0.00  0.00  0.00  0.01  0.03

5.3.4 Features impact on rating values. After analyzing the impact of the features on the rating, we noticed that the workouts with lower ratings are those where the values of the features are low. So, the runners putting more effort into their workouts are more likely to receive a higher rating. The results of the individual experiments are omitted due to space constraints.

6 CONCLUSIONS AND FUTURE WORK

In this paper, we proposed and validated an approach to identify sportspeople who need immediate coach intervention due to poor-quality workouts, so that we can suggest to their coaches to contact them with higher priority.

Our approach takes into account a set of workouts performed by a given user, to which the coach assigned a rating. By exploiting these data, we trained a classifier to predict the rating of new workout results. Thanks to these predicted ratings, we can notify the coach when the algorithm detects that a user is performing poorly. In this way, the coach can intervene quickly to try to overcome this situation.

Experimental results show the effectiveness of our method. As future work, we will integrate this recommender system in the u4fit platform, in order to investigate the relationship between workout quality and user motivation. Moreover, we will also analyze the chats between coaches and their users.

ACKNOWLEDGMENTS

The authors would like to thank Marika Cappai, Davide Spano, and Daniela Lai for their contribution to this research work.

This work is partially funded by Regione Sardegna under projects AI4fit (Artificial Intelligence & Human Computer Interaction per l'e-coaching), through AIUTI PER PROGETTI DI RICERCA E SVILUPPO - POR FESR SARDEGNA 2014 - 2020, and NOMAD (Next generation Open Mobile Apps Development), through PIA - Pacchetti Integrati di Agevolazione "Industria Artigianato e Servizi" (annualità 2013).
REFERENCES

[1] Ludovico Boratto, Salvatore Carta, Fabrizio Mulas, and Paolo Pilloni. 2017. An e-Coaching Ecosystem: Design and Effectiveness Analysis of the Engagement of Remote Coaching on Athletes. Personal Ubiquitous Comput. 21, 4 (Aug. 2017), 689–704. https://doi.org/10.1007/s00779-017-1026-0
[2] Iain Brown and Christophe Mues. 2012. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications 39, 3 (2012), 3446–3453.
[3] André Calero Valdez, Martina Ziefle, Katrien Verbert, Alexander Felfernig, and Andreas Holzinger. 2016. Recommender Systems for Health Informatics: State-of-the-Art and Future Perspectives. Springer International Publishing, Cham, 391–414.
[4] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[5] Robert G. Farrell, Catalina M. Danis, Sreeram Ramakrishnan, and Wendy A. Kellogg. 2012. Increasing Patient Safety Using Explanation-driven Personalized Content Recommendation. In Proceedings of the Workshop on Recommendation Technologies for Lifestyle Change (LIFESTYLE 2012) (CEUR Workshop Proceedings). CEUR-WS.org, 24–28.
[6] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research 15, 1 (2014), 3133–3181.
[7] Brian J. Fogg. 1999. Persuasive technologies. Commun. ACM 42, 5 (1999), 27–29.
[8] Brian J. Fogg. 2002. Persuasive technology: using computers to change what we think and do. Ubiquity 2002, December (2002), 5.
[9] Mikel Galar, Alberto Fernández, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. 2011. An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes. Pattern Recognition 44, 8 (2011), 1761–1776.
[10] Wijnand IJsselsteijn, Yvonne de Kort, Cees Midden, Berry Eggen, and Elise van den Hoven. 2006. Persuasive Technology for Human Well-Being: Setting the Scene. Springer Berlin Heidelberg, Berlin, Heidelberg, 1–5.
[11] William Klement, Szymon Wilk, Wojtek Michalowski, and Stan Matwin. 2009. Dealing with severely imbalanced data. In Proc. of the PAKDD Conference. Citeseer, 14.
[12] Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research 18, 17 (2017), 1–5. http://jmlr.org/papers/v18/16-365
[13] Geneviève A. Mageau and Robert J. Vallerand. 2003. The coach-athlete relationship: a motivational model. Journal of Sports Sciences 21, 11 (2003), 883–904. https://doi.org/10.1080/0264041031000140374 PMID: 14626368.
[14] Fabrizio Mulas, Paolo Pilloni, Matteo Manca, Ludovico Boratto, and Salvatore Carta. 2013. Linking Human-Computer Interaction with the Social Web: A web application to improve motivation in the exercising activity of users. In Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference on. 351–356. https://doi.org/10.1109/CogInfoCom.2013.6719270
[15] Paolo Pilloni, Luca Piras, Ludovico Boratto, Salvatore Carta, Gianni Fenu, and Fabrizio Mulas. 2017. Recommendation in Persuasive eHealth Systems: an Effective Strategy to Spot Users' Losing Motivation to Exercise. In Proceedings of the 2nd International Workshop on Health Recommender Systems co-located with the 11th International Conference on Recommender Systems (RecSys 2017), Como, Italy, August 31, 2017 (CEUR Workshop Proceedings), Vol. 1953. CEUR-WS.org, 6–9. http://ceur-ws.org/Vol-1953/healthRecSys17_paper_5.pdf
[16] Paolo Pilloni, Luca Piras, Salvatore Carta, Gianni Fenu, Fabrizio Mulas, and Ludovico Boratto. 2018. Recommender System Lets Coaches Identify and Help Athletes Who Begin Losing Motivation. Computer 51, 3 (2018), 36–42.
[17] David Martin Powers. 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. (2011).
[18] David M. W. Powers. 2012. The problem with kappa. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 345–355.
[19] Santosh S. Rathore and Sandeep Kumar. 2017. A decision tree logic based recommendation system to select software fault prediction techniques. Computing 99, 3 (2017), 255–285.
[20] Haggai Roitman, Yossi Messika, Yevgenia Tsimerman, and Yonatan Maman. 2010. Increasing Patient Safety Using Explanation-driven Personalized Content Recommendation. In Proceedings of the 1st ACM International Health Informatics Symposium (IHI '10). ACM, New York, NY, USA, 430–434.
[21] José A. Sáez, Bartosz Krawczyk, and Michał Woźniak. 2016. Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets. Pattern Recognition 57 (2016), 164–178.
[22] Darren E. R. Warburton, Crystal Whitney Nicol, and Shannon S. D. Bredin. 2006. Health benefits of physical activity: the evidence. CMAJ 174, 6 (2006), 801–809. https://doi.org/10.1503/cmaj.051351
[23] Martin Wiesner and Daniel Pfeifer. 2014. Health Recommender Systems: Concepts, Requirements, Technical Basics and Challenges. International Journal of Environmental Research and Public Health 11, 3 (Mar 2014), 2580–2607.