THE EFFICIENCY OF MACHINE LEARNING ALGORITHMS IN CLASSIFYING NON-FUNCTIONAL REQUIREMENTS

Machine learning (ML) algorithms are increasingly applied in various types of systems, and research related to them is growing accordingly. One area of research is the classification of non-functional requirements (NFRs) using ML algorithms. This area is important because automatic classification of NFRs using high-performance ML algorithms and appropriate features helps requirements engineers classify non-functional requirements more accurately. This paper examines ML algorithms suitable for solving classification problems and their effectiveness in classifying non-functional requirements. Following the described stages of the research methodology, ML models were compared using the accuracy, precision, recall and F-score metrics. A majority voting classifier model was created using the Support Vector Machine, Naïve Bayes and K Nearest Neighbor algorithms. After K-Fold cross-validation, the following results were obtained: accuracy 0.710 (on a scale from 0 to 1), precision 0.845, recall 0.814 and F-score 0.815.


Introduction
Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that uses large amounts of data to build prediction and decision-making systems that would otherwise be difficult to develop. Decision-making systems based on ML are increasingly used in decisions about bank loans (Karthiban et al., 2019), employment (Imam & Ananda, 2022), clinical trials (Miller et al., 2023) and many other areas. The success of ML-enabled systems depends on the properties of the ML solutions (such as performance, transparency, maintainability and interoperability), which are known as non-functional properties in requirements engineering. In addition, the non-functional requirements (NFRs) of ML systems may differ from those of traditional systems in their definition, measurement, scope and relative importance (Habibullah & Horkoff, 2021). Our understanding of these aspects is still limited compared to our knowledge of NFRs in traditional domains.
In addition, requirements classification stands out as a task in software engineering because manual classification of non-functional requirements is subjective, error-prone, complex and time-consuming. Automating this process with ML could reduce these drawbacks. This research area is relevant because the automatic classification of NFRs using high-performance ML algorithms and appropriate features helps requirements engineers classify non-functional requirements more accurately (Khurshid et al., 2022). Therefore, this study compares ML algorithms on the problem of non-functional requirements classification to answer two questions: "Which ML algorithm works best for classifying software requirements into non-functional requirement categories?" and "Which ML algorithm provides the best performance for the requirements classification task?".
The purpose of this work is to explore the performance of ML algorithms for the classification of non-functional requirements and to propose a solution that improves the accuracy of the classifier. The expected result is a more accurate ML-based method for NFR classification. The novelty of the research is as follows:
■ The study highlights the need for, and the problem of, classifying non-functional requirements using ML algorithms.
■ The study focuses on using machine learning algorithms to classify non-functional requirements, a growing research area.
■ The results of the study show a more accurate ML-based method for classifying non-functional requirements.
The rest of this paper is structured as follows. Section 2 presents related work on the classification of non-functional requirements by ML algorithms. Section 3 presents preliminaries related to the research topic. Section 4 describes the methodology and use case. Section 5 presents the improvement of model accuracy. Finally, Section 6 provides the conclusions and outlines future work.

Related works
This section presents related work on ML algorithms for classifying non-functional requirements. Its aim is to determine which ML algorithms researchers consider suitable for the classification of NFRs. Studies on the classification of non-functional requirements using AI techniques that are relevant to this analysis are summarized in Table 1, which has the following columns: 1) reference (Ref.) of the scientific paper; 2) the ML algorithm that showed the highest accuracy; 3) the data set from which the requirements for the analyses were taken; 4) the instruments used in the research; 5) results, i.e. the best ML algorithm for NFR classification; 6) the accuracy obtained with the best ML algorithm for NFR classification.
The data in Table 1 show that the authors reported the following ML methods as the most accurate for the classification of NFRs: Support Vector Machine (SVM), Naïve Bayes (NB) and K Nearest Neighbor Algorithm (KNN).

ML algorithms
ML systems aim to learn from data (Mahesh, 2020) to automate the process of building an analytical model and solving related tasks (Janiesch et al., 2021). However, researchers emphasize that no single ML algorithm solves every problem best. The choice of algorithm depends on the problem to be solved, its type, the number of variables, the most appropriate model and other factors (Mahesh, 2020). ML algorithms are commonly divided into two types of learning: supervised and unsupervised (Carta, 2022). The fundamental difference between the two is whether the examples given to the learning algorithm are labelled or not (Bao et al., 2022).

According to Sarker et al. (2020), supervised learning applies when specific goals are to be achieved from a given set of inputs, i.e., it is a task-driven approach. In this paper, supervised learning is the most suitable approach to NFR classification because a labelled data set is available for the research. Table 2 lists the ML algorithms that are subsequently used to evaluate performance in NFR classification. These algorithms were selected after analyzing the scientific literature in the subject area, which indicated that they perform well in NFR classification tasks. Accuracy was also one of the most important properties considered when choosing the algorithms.

End of Table 2 (algorithm: characteristics; problems addressed):
■ …: considered a powerful tool to make accurate predictions (Ho et al., 2021); regression and classification problems (Yang & Shami, 2020).
■ Naïve Bayes (NB): its simplicity allows all features to contribute equally to the final decision (Ibrahim & Abdulazeez, 2021); a popular statistical method for spam filtering (Wickramasinghe & Kalutarage, 2021).
■ K Nearest Neighbor Algorithm (KNN): a simple ML algorithm that classifies data points by calculating distances between them (Yang & Shami, 2020); used to solve classification and regression problems (Ibrahim & Abdulazeez, 2021).
■ Logistic Regression (LR): classified as a statistical ML method (Rymarczyk et al., 2019); binary classification tasks, solved by predicting a probability (Kanade, 2022).

Thus, each ML algorithm has its own characteristics and is best suited to certain problems, as listed in Table 2. Next, models are developed with the analyzed ML algorithms to verify their suitability for NFR classification.
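The distance-based idea behind KNN in Table 2 can be sketched in Python, the language used in this study. This is a minimal illustration rather than the authors' implementation. It assumes that requirements have already been converted to numeric feature vectors (for example, TF-IDF), that distances are cosine distances, and that k = 3. The paper states neither the distance measure nor the value of k, and the feature vectors below are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an all-zero vector is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    ranked = sorted(
        (cosine_distance(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in ranked[:k]]
    # most_common keeps first-encountered order for equal counts,
    # so a tie goes to the tied label whose neighbour is closest
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-dimensional feature vectors for two NFR classes
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = ["performance", "performance", "usability", "usability"]
print(knn_predict(train_x, train_y, [1.0, 0.05]))  # performance
```

Cosine distance is a common choice for sparse text vectors because it ignores document length. Euclidean distance would be a drop-in alternative in `cosine_distance`.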

Measurement metrics
The study results are measured against the following metrics.
Accuracy is the proportion of correctly classified data units out of the total number of data units (Harikrishnan, 2019). It is calculated as follows (Silwal, 2022):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

where TP is the number of true positives (requirements correctly assigned to a class), TN is the number of true negatives (requirements correctly not assigned to that class), FP is the number of false positives (requirements incorrectly assigned to that class), and FN is the number of false negatives (requirements of that class that were assigned to another class).
Precision is defined as the number of true positives divided by the sum of true positives and false positives (Koehrsen, 2018). It is calculated as follows (Binkhonain & Zhao, 2019):

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

where TP is the number of true positives and FP is the number of false positives.
Recall measures the proportion of NFRs of a class that are correctly classified (Binkhonain & Zhao, 2019). It can be interpreted as the ability of a model to find all data points of interest in the data set (Koehrsen, 2018). It is calculated as follows (Binkhonain & Zhao, 2019):

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

where TP is the number of true positives and FN is the number of false negatives. The F-score combines precision and recall as their harmonic mean (Ghoneim, 2019) and is calculated as follows (Shung, 2018):

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4)$$
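Equations (1) to (4) can be checked with a short Python sketch. It is illustrative rather than the authors' code, and it rests on two assumptions that the paper does not state:
■ For the multi-class NFR setting, precision, recall and F-score are computed per class (one class against the rest) and averaged with weights proportional to class size.
■ Accuracy is the overall share of correctly classified requirements, which equals Eq. (1) in the two-class case.
The labels below are hypothetical.

```python
from collections import Counter


def safe_div(num, den):
    # Assumption: a ratio with a zero denominator is reported as 0.0
    return num / den if den else 0.0


def class_counts(y_true, y_pred, label):
    """TP, FP and FN for one class, treating it as positive and all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def weighted_scores(y_true, y_pred):
    """Accuracy and support-weighted precision, recall and F-score (Eqs. 1-4)."""
    n = len(y_true)
    support = Counter(y_true)
    scores = {"accuracy": safe_div(sum(t == p for t, p in zip(y_true, y_pred)), n),
              "precision": 0.0, "recall": 0.0, "f_score": 0.0}
    for label, count in support.items():
        tp, fp, fn = class_counts(y_true, y_pred, label)
        precision = safe_div(tp, tp + fp)                               # Eq. (2)
        recall = safe_div(tp, tp + fn)                                  # Eq. (3)
        f_score = safe_div(2 * precision * recall, precision + recall)  # Eq. (4)
        weight = count / n  # share of requirements that belong to this class
        scores["precision"] += weight * precision
        scores["recall"] += weight * recall
        scores["f_score"] += weight * f_score
    return scores


# Hypothetical labels for four requirements
y_true = ["usability", "usability", "performance", "security"]
y_pred = ["usability", "performance", "performance", "security"]
print(weighted_scores(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.875, 'recall': 0.75, 'f_score': 0.75}
```

With support weighting, the weighted recall always equals accuracy, because each class's recall is multiplied by that class's share of the data. Macro averaging (an unweighted mean over classes) does not have this property.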

Methodology and use case for Classifying Non-Functional Requirements
Figure 1 shows the principal schema for analyzing ML models, based on Haque et al. (2019).
The models are created using ML algorithms for NFR classification. The steps involved in analyzing and fitting the ML models are as follows:
■ Data processing. As in most ML projects with big data, data preprocessing is a necessary first step (Baker et al., 2019). In this step, the data set from the Kaggle platform is prepared for classification. The Kaggle data set used consists of 976 requirements, 346 of which are non-functional and divided into 11 categories (Shukla, 2023). The data set was last modified in February 2023. Its NFR classes are fault tolerance (10 requirements), maintainability (17), performance (54), portability (2), scalability (21), security (56), usability (63), legal and licensing (10), availability (21), look and feel (34) and operability (58). To make the data set as simple as possible to classify, special characters are removed, uppercase letters are converted to lowercase, stop words that add noise for the algorithm (such as "a", "an", etc.) are removed, and tokenization is performed, which divides the text into smaller parts (Haque et al., 2019). The Python programming language and the Google Colab environment were used for data processing.
■ Converting a list of essential words into a set of features. The text is transformed into a numerical representation that ML algorithms can process.
■ Creating models. ML models are created by applying the DT (Decision Tree), RF (Random Forest), SVM (Support Vector Machine), NB (Naïve Bayes), KNN (K Nearest Neighbor) and LR (Logistic Regression) algorithms for data classification. The models are trained to classify NFRs, and their results are tested.
■ Comparison of results. The results of all models were compared by accuracy, precision, recall and F-score, as shown in Table 3. The SVM model provided the best accuracy, while the accuracy of the decision tree reached only 0.529, the lowest among the models examined. The best precision was obtained with the Naïve Bayes classifier (0.811). The other models achieved similar precision, except for the decision tree model, which reached only 0.539. The SVM model also achieved the best recall. The best F-score was obtained with SVM (0.788), and the Naïve Bayes classifier gave a similar F-score of 0.785. The decision tree model produced the worst F-score.
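The data-processing step described above can be sketched in Python as follows. It is a minimal illustration under stated assumptions, not the authors' pipeline. The stop-word list is a hypothetical abbreviated one, and tokenization is a plain whitespace split. The character filter keeps only ASCII letters and digits, which suits English requirements. A library tokenizer and a full stop-word list could be substituted.

```python
import re

# Hypothetical, abbreviated stop-word list; the paper only names "a", "an", etc.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "be", "for", "by", "on", "with"}


def preprocess(requirement):
    """Turn one requirement sentence into a list of lowercase content tokens."""
    text = requirement.lower()                 # convert uppercase letters to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace special characters with spaces
    tokens = text.split()                      # simple whitespace tokenization
    return [token for token in tokens if token not in STOP_WORDS]  # drop stop words


print(preprocess("The system shall refresh the display every 60 seconds!"))
# ['system', 'shall', 'refresh', 'display', 'every', '60', 'seconds']
```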
In summary, the SVM model achieved the best indicators, while the results of the other models, except the decision tree model, can also be considered good. According to Hendricks (n.d.), results between 70% and 90% are considered good by industry standards, and anything above 70% is acceptable as valuable model output. The indicators of the decision tree model do not fall within this range.

Improvement of ML model accuracy
To create a more accurate method for NFR classification, it is worth considering combining ML algorithms, and ensemble models can be used for this purpose. Combining different individual ML models can improve the stability of the overall model, resulting in more accurate predictions (Nelson, 2020). One ensemble ML technique is the majority voting classifier, which combines the predictions of several individual classifiers to make a final prediction (Kumar, 2023). Majority voting classifiers are commonly used for classification problems (Singh, 2023), and their advantage is that they can reduce the prediction error rate (Bajaj, 2023).
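Hard (majority) voting can be sketched in a few lines of Python. The sketch assumes that each base model (for instance, SVM, NB and KNN) has already produced a list of predicted labels for the same requirements. It breaks ties in favour of the model listed first, which is an assumption, not something the paper specifies. In practice, scikit-learn's `VotingClassifier` with `voting="hard"` provides this behaviour for fitted estimators, with its own tie-breaking rule. The predictions below are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions from several models by hard (majority) voting.

    predictions_per_model: one list of predicted labels per model, all the same length.
    Ties are resolved in favour of the model listed first (an assumption).
    """
    combined = []
    for votes in zip(*predictions_per_model):  # votes = one prediction per model
        counts = Counter(votes)
        top = max(counts.values())
        # first label, in model order, that received the highest number of votes
        combined.append(next(label for label in votes if counts[label] == top))
    return combined


# Hypothetical predictions from three base models for four requirements
svm_pred = ["usability", "performance", "security", "look and feel"]
nb_pred = ["usability", "security", "security", "operability"]
knn_pred = ["performance", "performance", "usability", "maintainability"]
print(majority_vote([svm_pred, nb_pred, knn_pred]))
# ['usability', 'performance', 'security', 'look and feel']
```

With three voters, a strict majority exists whenever at least two models agree. The tie rule only matters when all three disagree, as in the last requirement above.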
In the experiments to improve NFR classification accuracy, the best results were achieved by a majority voting classifier combining the SVM, NB and KNN algorithms. Figure 2 shows the classification report of the resulting model. The majority voting classifier outperformed the individual ML algorithms in precision and F-score, while the SVM model achieved the same accuracy and recall as the majority voting classifier. Overall, the combined model showed better results in NFR classification. To assess the stability of the model, K-Fold cross-validation was performed with 10 subsets (folds). Table 4 shows the results for each subset and their average.
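The 10-fold procedure can be sketched as follows. This is a hypothetical illustration. The paper does not state whether its folds were shuffled or stratified by class, so this sketch uses plain contiguous folds; shuffling or stratifying the indices first would be natural variations. The `evaluate` callback is a hypothetical stand-in for training and scoring the majority voting classifier on each split.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous, non-overlapping folds.

    The first n_samples % k folds get one extra sample, so fold sizes differ by at most one.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, evaluate, k=10):
    """Return the per-fold scores and their mean.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains a fresh model
    on the training indices and returns its score on the test indices.
    """
    scores = [evaluate(train_idx, test_idx) for train_idx, test_idx in k_fold_indices(n_samples, k)]
    return scores, sum(scores) / len(scores)


# For example, 346 items split into 10 folds: six folds of 35 and four of 34
print([len(test) for _, test in k_fold_indices(346, 10)])
# [35, 35, 35, 35, 35, 35, 34, 34, 34, 34]
```

Averaging per-fold scores, as `cross_validate` does, corresponds to the averaged row reported for the 10 subsets.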
Across the folds, accuracy ranged between 0.559 and 0.829 (on a scale from 0 to 1), precision between 0.617 and 0.861, recall between 0.559 and 0.829, and F-score between 0.525 and 0.817. Figure 3 presents the confusion matrix of the majority voting classifier model averaged over the 10 subsets after K-Fold cross-validation. A confusion matrix shows, for each actual class, how many requirements were predicted as each class. Table 5 compares the cross-validated results of the new majority voting model with those of the previously best-performing SVM model (SVC). Compared with the SVC model, the majority voting classifier shows a 1.7% increase in accuracy, a 7.9% increase in precision, a 1.7% increase in recall and a 3.3% increase in F-score.

Conclusions
The analysis of articles on ML algorithms for classifying non-functional requirements showed that NFR classification using ML algorithms is an active research direction. A deeper analysis found that most researchers report SVM as the most accurate algorithm for classifying NFRs, although opinions differ.
Using the selected Kaggle data set and suitable tools, ML models were developed and their performance indicators obtained, and these were used to compare the accuracy of ML algorithms in classifying NFRs. Among the individual ML algorithms, SVM provided the best results, with accuracy 0.814 (on a scale from 0 to 1), precision 0.770, recall 0.814 and F-score 0.788.
Experiments that combined ML algorithms produced a more accurate ML-based method for NFR classification. The best results were obtained with a majority voting classifier combining SVM, NB and KNN: accuracy 0.814 (on a scale from 0 to 1), precision 0.845, recall 0.814 and F-score 0.815. After K-Fold cross-validation of this classifier, the model achieved accuracy 0.710, precision 0.845, recall 0.814 and F-score 0.815.
Future research could aim to improve the results of the non-functional requirements classification model by using hybrid algorithms, or explore solutions from other perspectives and AI techniques, such as Search-Based Software Engineering.

Figure 1. Principal schema for analyzing ML models

Figure 2. Classification report of the majority voting classifier model

Table 1. Analysis of studies on NFR classification by ML algorithms

Table 2. Supervised ML algorithms for classification
The data are divided into groups used for model training and requirements classification: the data set is automatically partitioned into 80% for training and 20% for testing. The training and testing sets are then vectorized using Term Frequency-Inverse Document Frequency (TF-IDF). Other vectorization techniques, such as Bidirectional Encoder Representations from Transformers (BERT), were also tried, but the best results were obtained with TF-IDF.
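The 80/20 split and TF-IDF vectorization can be sketched in Python as follows. This is an illustrative, dependency-free sketch, not the authors' code. The shuffling seed, the rounding of the test-set size and the smoothed IDF variant are assumptions. The smoothed IDF shown matches the default of scikit-learn's `TfidfVectorizer`. The IDF weights are learned from the training set only, so no information from the test set leaks into the features. The `SimpleTfidf` class and the example token lists are hypothetical.

```python
import math
import random
from collections import Counter


def split_80_20(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and return (train, test); the seed makes it repeatable."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


class SimpleTfidf:
    """Minimal TF-IDF over token lists: term counts x smoothed IDF, then L2 normalisation."""

    def fit(self, token_lists):
        n_docs = len(token_lists)
        doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
        self.vocabulary = sorted(doc_freq)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1 keeps every weight finite and positive
        self.idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in self.vocabulary}
        return self

    def transform(self, tokens):
        counts = Counter(tokens)
        # Only vocabulary terms are looked up, so terms unseen during fit are ignored
        vector = [counts[t] * self.idf[t] for t in self.vocabulary]
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector


# Hypothetical preprocessed requirements (token lists)
docs = [["system", "respond", "seconds"], ["system", "encrypt", "data"],
        ["user", "interface", "simple"], ["system", "available", "hours"], ["user", "login", "secure"]]
train_docs, test_docs = split_80_20(docs)
vectorizer = SimpleTfidf().fit(train_docs)  # IDF learned from training data only
test_vectors = [vectorizer.transform(d) for d in test_docs]
```

Terms that appear in many requirements (such as "system") receive a lower IDF weight than rarer, more class-specific terms (such as "encrypt"). This is why TF-IDF tends to emphasise words that distinguish NFR categories.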

Table 3. Comparison of the accuracy, precision, recall and F-score results of the models

Table 4. Results for each model subset

Figure 3. Confusion matrix of the majority voting classifier model

Table 5. Model results after cross-validation