Prediction of Useless and Irrelevant Comments in C Language as per Surrounding Code Context

Bikram Ghosh2, Pankaj Chowdhury1* and Utpal Sarkar1
1 Department of Computer Science and Engineering, Jadavpur University, India
2 Department of Mechanical Engineering, Jadavpur University, India

Abstract
Irrelevant and useless comments are a prevalent problem that programmers struggle with every day. Although comments increase the size of a code file, they are very useful for a new programmer who wants to understand what work has already been done. With the help of comments, a programmer can use a function without having to understand its internal logic. However, unnecessary or out-of-context comments not only decrease the readability of a codebase but also, at times, confuse the programmer. This motivates the automatic detection of such comments, which helps the programmer understand the code context, add new features, or debug the code. In this paper, we present machine learning models that detect useless comments based on the surrounding code context. Specifically, we describe the models submitted to the shared task on Comment Classification in C Language at FIRE 2022. The task is a binary classification of comments into two classes, namely useful and useless comments. Overall, our performance is good, but it can be improved further with additional data pre-processing techniques. Our scores are encouraging enough to work towards better results in the future.

Keywords
Naive Bayes classification, SVM, TF-IDF, Logistic Regression

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
*Corresponding author
bikramghosh9547@gmail.com (B. Ghosh); pankajchowdhury497@gmail.com (P. Chowdhury); sutpal872@gmail.com (U. Sarkar)
https://github.com/bikramghosh-ux/IRSE_FIRE_2022_Submission
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
Commenting involves placing human-readable descriptions inside computer programs that explain what the code is doing. Proper use of comments can make code maintenance much easier and helps in finding bugs faster. Further, comments are very important when writing functions, so that other people can use a function simply from its documented purpose. A well-documented code is as important as a correctly working code. In the C language, the syntax for writing comments is:
use // for a single-line comment
use /* ... */ for a multi-line comment
However, when a codebase is filled with irrelevant or useless comments, it creates problems. Sometimes it even leads to misinterpretation of the code when a programmer trusts such a comment. A comment is also clearly useless if it merely repeats the signature of a function. Useless comments reduce the clarity of well-expressed code, take time for the pre-processor to remove, and occupy screen space without any valuable contribution. Artificial intelligence and machine learning techniques can be applied to detect and remove such useless comments from a codebase. In this paper, we implement several machine learning models to classify the comments in the provided dataset into two classes: Useful (1) and Not-useful (0). Out of all the algorithms implemented, we achieved a maximum F1 score of 0.83.
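To make the distinction between the two classes concrete, the following small, hypothetical example contrasts a comment that merely restates a function signature with one that adds information the code cannot convey. The snippets, field names, and labels are illustrative assumptions written as Python dictionaries purely for this paper; they are not taken from the shared-task dataset.

# Hypothetical comment/code-context pairs; labels follow the Useful(1)/Not-useful(0)
# convention used in this paper. The C snippets are embedded as strings for illustration.
not_useful_example = {
    "comment": "/* int add(int a, int b) */",                  # merely restates the signature
    "code_context": "int add(int a, int b) { return a + b; }",
    "label": 0,
}
useful_example = {
    "comment": "/* caller must free the returned buffer */",   # states a contract the code cannot show
    "code_context": "char *read_line(FILE *fp);",
    "label": 1,
}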
2. Related Work
This task can be viewed as a text classification problem, since comments have to be assigned to one of two classes. Several works have been proposed that can be used effectively to classify text into two classes based on a dataset. Mukesh Zaveri et al. [1] proposed an automatic text classification model applied to the content of blog posts to classify them as structured or unstructured. Kamel Alreshedy et al. [2] presented work on machine learning algorithms that classify code snippets into the programming languages they belong to. Kriti Kumari et al. [3] participated in HASOC 2019 Task 1, the identification of abusive content in text, and used GloVe and fastText embeddings to classify whether comments are abusive or not. The problems discussed in these references are not the same as ours; however, they share many similarities and were therefore very useful for gaining intuition about our approach.

3. Task and Dataset Description
In this section, we describe the Comment Classification shared task and the dataset provided to the participants. The FIRE 2022 shared task aims to classify comments written in the C language into two classes, namely Useful (U) and Not Useful (N):
1. (U) Useful - the comment gives vital information that needs to be taken care of while working with the surrounding code context.
2. (N) Not Useful - the comment is not useful; it is provided unnecessarily or, at times, even makes no sense in the codebase.
Along with the comments, their respective surrounding code context was also provided, on the basis of which the comments had to be classified. We used the dataset made available for IRSE 2022. It consists of 8,047 comments provided as a separate training dataset and 1,001 comments provided as a testing dataset, with a balanced distribution of the two classes. There was therefore no need to split the data into training and testing portions ourselves.

4. System Description
4.1. Text Pre-processing
We removed all punctuation, numbers, and stop words from the surrounding code context. We also removed the opening two characters ("/*") and the closing two characters ("*/") from the comments, as these are simply the syntax for writing comments in the C language. We then converted all alphabetic characters to lowercase. Finally, we applied lemmatization to group the different forms of a word into a single word, using the NLTK WordNet lemmatizer. The same pre-processing was applied to both the training and the test data.

4.2. Feature Extraction
For the Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine algorithms, we used the TF-IDF vectorizer [4] from the scikit-learn library. The TfidfVectorizer converts the text into numerical features. A scikit-learn Pipeline is used to chain TF-IDF vectorization and classification.

4.3. Machine Learning Models
We submitted runs based on three different algorithms, namely Logistic Regression [5], Support Vector Machine [6], and Multinomial Naive Bayes [7], all implemented with the scikit-learn library. We scored a maximum F1 score of 0.83 using both SVM and Logistic Regression for this task. For the SVM we used a linear kernel with C=1.0, gamma='auto', and degree=3. A sketch of the resulting pipeline is given below.
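The following minimal sketch illustrates the pre-processing and classification pipeline described in Sections 4.1-4.3. The cleaning helper, variable names, and the two toy samples are illustrative assumptions, not the exact code of our submission; the NLTK corpora stopwords and wordnet must be downloaded beforehand with nltk.download.

# Minimal sketch of the pipeline in Sections 4.1-4.3 (illustrative, not the submitted code).
import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean(text):
    # Strip C comment delimiters, digits, punctuation and stop words, then lemmatize.
    text = text.replace("/*", " ").replace("*/", " ").replace("//", " ")
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    return " ".join(lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words)

# Toy placeholders (echoing the introduction's example) standing in for the
# IRSE 2022 training file: comment joined with its surrounding code, label 1/0.
comments = [
    "/* caller must free the returned buffer */ char *read_line(FILE *fp);",
    "/* int add(int a, int b) */ int add(int a, int b) { return a + b; }",
]
labels = [1, 0]
X = [clean(c) for c in comments]  # identical pre-processing is applied to the test data

# TF-IDF vectorization chained with each classifier in a scikit-learn Pipeline.
classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": SVC(kernel="linear", C=1.0, gamma="auto", degree=3),
    "multinomial_nb": MultinomialNB(),
}
pipelines = {name: Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
             for name, clf in classifiers.items()}
for name, pipe in pipelines.items():
    pipe.fit(X, labels)
    print(name, pipe.predict(X))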
Proper values of C and gamma need to be chosen to optimize the performance of the SVM classifier.

5. Results and Discussion
The results of the task are reported in terms of macro F1, macro precision, macro recall, and accuracy (shown in the table below). The primary metric is the macro F1 score; we obtained a maximum macro precision of 0.79 with the Logistic Regression model.

Run                                   Macro F1   Macro Recall   Macro Precision   Accuracy
Logistic Regression                   0.80       0.80           0.79              83%
Linear Support Vector Machine (SVM)   0.79       0.80           0.79              83%
Multinomial Naive Bayes               0.63       0.70           0.67              63%

6. Conclusion and Future Work
We completed the given task using various text classification algorithms and evaluated their performance for comment classification. The task was interesting and unique; we learned a lot from it and were able to test our knowledge of machine learning algorithms. We look forward to experimenting with more advanced algorithms and neural network models. Fine-tuning the parameters of the algorithms can also help improve the overall performance, and the results of more than one classification algorithm can be combined to obtain an overall better score, for example via a simple voting ensemble as sketched below. We shall explore these directions in the coming days.
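As a concrete illustration of the combination idea mentioned above, the sketch below builds a hard-voting ensemble of the three classifiers with scikit-learn's VotingClassifier. The estimator names and the reuse of X and labels from the Section 4 sketch are assumptions for illustration; we have not evaluated this configuration.

# One possible way to combine the three classifiers: a hard-voting ensemble.
# Reuses X and labels (pre-processed comments and 0/1 labels) from the earlier sketch.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

ensemble = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("vote", VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("svm", SVC(kernel="linear", C=1.0)),
            ("mnb", MultinomialNB()),
        ],
        voting="hard",  # majority vote over the three predicted labels
    )),
])
ensemble.fit(X, labels)
print(ensemble.predict(X))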
References
[1] M. Zaveri, M. K. Dalal, Automatic Text Classification: A Technical Review, Sardar Vallabhbhai National Institute of Technology, Surat, India.
[2] K. Alreshedy, D. Dharmaretnam, D. M. German, V. Srinivasan, T. A. Gulliver, SCC: Automatic Classification of Code Snippets, 21 September 2018.
[3] K. Kumari, J. Singh, Deep Learning Approach for Classification of Abusive Text, National Institute of Technology, Patna, HASOC 2019.
[4] V. Kumar, B. Subba, A TF-IDF vectorizer and SVM based sentiment analysis framework for text data corpus, in: 2020 National Conference on Communications (NCC), 2020, pp. 1–6. doi:10.1109/NCC48643.2020.9056085.
[5] Logistic Regression, 2010, pp. 631–631. doi:10.1007/978-0-387-30164-8_493.
[6] Support Vector Machines, in: C. Sammut, G. I. Webb (eds.), Encyclopedia of Machine Learning and Data Mining, Springer, Boston, MA. doi:10.1007/978-1-4899-7687-1_810.
[7] Multinomial Naive Bayes for Text Categorization Revisited, in: G. I. Webb, X. Yu (eds.), AI 2004: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 3339, Springer, Berlin, Heidelberg. doi:10.1007/978-3-540-30549-1_43.
[8] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. D. Clough, S. Chattopadhyay, P. Majumder, Overview of the IRSE track at FIRE 2022: Information Retrieval in Software Engineering, in: Forum for Information Retrieval Evaluation, ACM, December 2022.
[9] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, COMMENT-MINE - A Semantic Search Approach to Program Comprehension from Code Comments, in: R. Chaki, A. Cortesi, K. Saeed, N. Chaki (eds.), Advanced Computing and Systems for Security, Advances in Intelligent Systems and Computing, vol. 1136, Springer, Singapore, 2020. doi:10.1007/978-981-15-2930-6_3.
[10] S. Majumdar, B. Bansal, P. P. Das, P. D. Clough, K. Dutta, S. K. Ghosh, Automatic Evaluation of Comments to Aid Software Maintenance, 2022. doi:10.1002/smr.2463.
[11] J. Singer, T. Lethbridge, N. Vinson, N. Anquetil, An examination of software engineering work practices, in: CASCON First Decade High Impact Papers, 2010, pp. 174–188.
[12] L. H. Etzkorn, C. G. Davis, L. L. Bowen, The language of comments in computer software: A sublanguage of English, Journal of Pragmatics 33 (11) (2001) 1731–1756.
[13] T. Tilus, J. Koskinen, J. J. Ahonen, H. Lintinen, H. Sivula, I. Kankaanpää, Industrial application and evaluation of a software evolution decision model, in: Technologies for Business Information Systems, Springer, Dordrecht, 2007, pp. 417–427.
[14] S. M. Dehaghani, N. Hajrahimi, Which factors affect software projects maintenance cost more?, Acta Informatica Medica 21 (1) (2013) 63–66. doi:10.5455/AIM.2012.21.63-66. PMID: 23572866; PMCID: PMC3610582.