Standard

A Large Empirical Assessment of the Role of Data Balancing in Machine-Learning-based Code Smell Detection. / De Roover, Coen.

In: Journal of Systems and Software, Vol. 169, 110693, 11.2020.

Research output: Contribution to journal › Article

BibTeX

@article{fe24c85094584d6aaa09547b941ed3a5,
title = "A Large Empirical Assessment of the Role of Data Balancing in Machine-Learning-based Code Smell Detection",
abstract = "Code smells can compromise software quality in the long term by inducing technical debt. For this reason, many approaches aimed at identifying these design flaws have been proposed in the last decade. Most of them are based on heuristics in which a set of metrics is used to detect smelly code components. However, these techniques suffer from subjective interpretations, a low agreement between detectors, and threshold dependability. To overcome these limitations, previous work applied Machine-Learning that can learn from previous datasets without needing any threshold definition. However, more recent work has shown that Machine-Learning is not always suitable for code smell detection due to the highly imbalanced nature of the problem. In this study, we investigate five approaches to mitigate data imbalance issues to understand their impact on Machine Learning-based approaches for code smell detection in Object-Oriented systems and those implementing the Model-View-Controller pattern. Our findings show that avoiding balancing does not dramatically impact accuracy. Existing data balancing techniques are inadequate for code smell detection leading to poor accuracy for Machine-Learning-based approaches. Therefore, new metrics to exploit different software characteristics and new techniques to effectively combine them are needed.",
keywords = "code smells, machine learning, data balancing, model-view-controller smells, object-oriented smells",
author = "{De Roover}, Coen",
year = "2020",
month = "11",
doi = "10.1016/j.jss.2020.110693",
language = "English",
volume = "169",
journal = "Journal of Systems and Software",
issn = "0164-1212",
publisher = "Elsevier Inc.",

}

RIS

TY - JOUR

T1 - A Large Empirical Assessment of the Role of Data Balancing in Machine-Learning-based Code Smell Detection

AU - De Roover, Coen

PY - 2020/11

Y1 - 2020/11

AB - Code smells can compromise software quality in the long term by inducing technical debt. For this reason, many approaches aimed at identifying these design flaws have been proposed in the last decade. Most of them are based on heuristics in which a set of metrics is used to detect smelly code components. However, these techniques suffer from subjective interpretations, a low agreement between detectors, and threshold dependability. To overcome these limitations, previous work applied Machine-Learning that can learn from previous datasets without needing any threshold definition. However, more recent work has shown that Machine-Learning is not always suitable for code smell detection due to the highly imbalanced nature of the problem. In this study, we investigate five approaches to mitigate data imbalance issues to understand their impact on Machine Learning-based approaches for code smell detection in Object-Oriented systems and those implementing the Model-View-Controller pattern. Our findings show that avoiding balancing does not dramatically impact accuracy. Existing data balancing techniques are inadequate for code smell detection leading to poor accuracy for Machine-Learning-based approaches. Therefore, new metrics to exploit different software characteristics and new techniques to effectively combine them are needed.

KW - code smells

KW - machine learning

KW - data balancing

KW - model-view-controller smells

KW - object-oriented smells

UR - http://www.scopus.com/inward/record.url?scp=85086465273&partnerID=8YFLogxK

U2 - https://doi.org/10.1016/j.jss.2020.110693

DO - 10.1016/j.jss.2020.110693

M3 - Article

VL - 169

JO - Journal of Systems and Software

JF - Journal of Systems and Software

SN - 0164-1212

M1 - 110693

ER -

ID: 52130310