APPLICATION OF MACHINE LEARNING ALGORITHMS TO PREDICT HOTEL OCCUPANCY

The development and availability of information technology, together with the possibility of deep integration of internal IT systems with external ones, provide a powerful opportunity to analyze data online using external data providers. Recently, machine learning algorithms have come to play a significant role in predicting different processes. This research aims to apply several machine learning algorithms to predict high-frequency daily hotel occupancy at a Chinese hotel. Five machine learning models (bagged CART, bagged MARS, XGBoost, random forest, SVM) were optimized and applied to predict occupancy. All models are compared using different model accuracy measures and against an ARDL model chosen as a benchmark. It was found that the bagged CART model showed the most relevant results (R² > 0.50) in all periods, but the model could not beat the traditional ARDL model. Thus, despite the original use of machine learning algorithms for solving regression tasks, the models used in this research were not more effective than the benchmark model. In addition, variable importance was used to check the hypothesis that the Baidu search index and its components can be used in machine learning models to predict hotel occupancy.


Introduction
Machine learning and data-driven methods have been widely used in the research and practice of forecasting (Strielkowski et al., 2023). As a scientific technology concerned with how computers mimic human learning behavior (Al Shehhi & Karathanasopoulos, 2020), machine learning acquires new knowledge or experience and reorganizes existing knowledge structures to improve performance. It can learn the laws and patterns of massive data through computers, mining potential information, and is widely used to solve problems such as classification, regression, and clustering (Ahani et al., 2019; Aryai & Glodsworthy, 2023; Divasón et al., 2023; Kaya et al., 2022; Viverit et al., 2023). Because it helps machines learn patterns from existing complex data, it is also widely used to forecast future behavioral outcomes and trends (Boriratrit et al., 2023; Gong et al., 2023; Kamm et al., 2023; Khalil et al., 2022; Kolomoyets & Dickinger, 2023; Sayed et al., 2023). Many types of machine learning methods have been published over the past decades. They can be classified by learning strategy, for example into (1) machine learning that simulates the human brain and (2) machine learning that directly adopts mathematical methods. The mathematical approach, statistical machine learning, mainly consists of three elements: model, strategy, and algorithm. Time series analysis is a classic trend prediction method that assumes future values are composed of patterns in current and historical data; in machine learning, it is applied by constructing a model from historical data and then predicting future data (Caicedo-Torres & Payares, 2016; Mehmood et al., 2022; Qin et al., 2023; Sun & Lu, 2023). According to Calero-Sanz et al. (2022), Jiang et al. (2023), and Yang et al. (2015), the tourism sector, especially the hotel industry, has adopted machine learning for forecasting room booking and cancellation, demand, prices, and occupancy. This trend is confirmed by Huang and Zheng (2023), Li et al. (2020), Sánchez et al. (2020), Sánchez-Medina and Sánchez (2020), Koupriouchina et al. (2014), Al Shehhi and Karathanasopoulos (2020), and Zhai et al. (2023).
The accuracy of such hotel forecasts has received widespread attention from scholars and industry. Hotel occupancy is the ratio of the number of occupied rooms to the total number of rooms in a hotel, usually expressed as a percentage. The hotel occupancy rate is one of the essential indicators of a hotel's operational status and management level, and it directly affects the hotel's revenue and profits. Various factors, such as seasonality, pricing strategies, and market demand, influence hotel occupancy rates. In order to improve hotel occupancy, hotel managers can adopt various strategies, such as regular market analysis and competitive intelligence collection, formulating reasonable pricing and promotion strategies, and improving customer satisfaction and loyalty. A hotel occupancy rate of over 60% is considered ideal, although the appropriate standard also depends on the type of hotel and on market demand. Therefore, hotel occupancy prediction is an essential managerial tool for hotels.
Hotel occupancy prediction can use data analysis and machine learning algorithms to predict the future occupancy rates of hotels. Such predictions can help hotel managers set room prices, promotion strategies, and management strategies to meet market demand and improve operating efficiency. Hotel occupancy forecasts usually use historical data, market trends, and other information for model training and forecasting. Prediction algorithms include time series analysis, regression analysis, neural networks, random forest, etc. These algorithms predict future occupancy rates by learning from historical data while evaluating the impact of different factors on occupancy rates. Hotel managers can also use online prediction tools, which usually rely on real-time data and market trends to make predictions and provide suggestions and decision support based on the results. However, one of the worst impacts of COVID-19 on the hotel industry was a significant decline in hotel guest flow. Due to the uncertainty of mobility control policies, travel was sometimes allowed but unpredictably restricted. Many tourists whose mobility had been restricted for a long time and who had started planning to travel suddenly canceled their plans because of lockdowns in their residential areas or tourist destinations. Hotel guest flow decreased significantly, seriously affecting occupancy rates and profits. This led to increased hotel operating costs and to adjustments in hotel marketing strategies, hotel formats, and service models, and it challenged the accuracy of hotel occupancy forecast models.
This paper aims to apply several machine learning algorithms to predict high-frequency daily hotel occupancy at a Chinese hotel. Five machine learning algorithms (bagged CART, bagged MARS, XGBoost, random forest, and SVM) were used to achieve the research purpose. These methods are most often used for classification tasks; in this paper, they were instead used for regression tasks (all of these algorithms can be applied to classification as well as to regression). In addition, all five machine learning tools were compared with a benchmark represented by a traditional econometric model based on the ARDL methodology. The following hypotheses were checked:
1. The machine learning tools are more effective than a traditional forecasting method (the ARDL model) and should show higher forecasting power.
2. The Chinese Baidu search engine index and its components can be used for predicting hotel occupancy.
The paper is organized as follows. The introduction motivates the research. The section "Bibliographic mapping" provides the results of the network visualization based on the bibliometric data obtained from the Web of Science database using two search keywords, "machine learning" and "hotel occupancy". The next section, "Description of variables used in the research", explains the variables used, drawing on those found to have a significant impact on tourism demand. A description of the machine learning algorithms used follows. The measures of model accuracy, which allow us to compare the models with one another and to assess their forecasting power, are presented next, followed by the research results. The conclusion puts forward the main findings on the forecasting power (accuracy) of the machine learning algorithms and the results of testing the hypotheses.

Bibliographic mapping
Figure 1 shows the results of the network visualization based on the bibliometric data obtained from the Web of Science database using two search keywords, "machine learning" and "hotel occupancy". In the network visualization, all items related to the two keywords in the scientific papers are represented by the words (labels) that occur most frequently in the articles. The font size of the label and the area of the circle of any item are determined by the item's weight in the total number of words found: the higher the frequency (weight) of an item, the larger the font size of its label and the area of its circle. The different colors of items describe the clusters to which the corresponding items belong. Colored lines represent the links between the items. In addition, the distance between two items approximately indicates their relatedness, i.e., the closer two items are, the stronger their relatedness (van Eck & Waltman, 2023). The results shown in Figure 1 can be analyzed by the clusters highlighted in different colors:
1. Algorithms. The most frequent words are random forest, deep learning, neural networks, and support vector machine. All these words show the main trends in implementing the corresponding machine learning methods to predict hotel occupancy.
2. Fields of forecasting. There are four main words associated with forecasting: hotel occupancy, demand, accuracy, and forecasting itself.
3. A cluster of the words "hotel", "algorithms", "model", "arrivals", "timeseries", and "forecasting tourism demand" can be considered evidence of the use of different algorithms and models based on time series data to predict hotel occupancy or demand.
4. The last, green cluster links occupancy, revenue management, tourism, big data, and genetic algorithm. Such a relationship between the words can be interpreted as follows: the use of big data in tourism, particularly for occupancy, leads to more effective revenue management, and some genetic algorithms (probably as a class of optimizers) are used for this purpose.
Bibliographic mapping through network visualization showed that machine learning algorithms such as random forest, support vector machine, and neural networks (recently referred to as deep learning algorithms) are widely used in the hotel business, including for forecasting hotel occupancy. Following this trend, similar methods were used in this research.

Description of variables used in the research
Analyzing the scientific literature, Lim (1997) showed that over 1961-1994 a wide range of variables had a significant impact on tourism demand, such as income, tourism prices, relative prices, transportation costs, exchange rates, sociodemographic characteristics, ethnic or immigration factors, destination marketing, tourism attractiveness, and climate. Kim (2010) expanded the list of variables by discovering the significance of traveler data from luxury hotels in Seoul, South Korea. Yang et al. (2014) conducted one of the most interesting studies, showing how Baidu search data can be used to predict the passenger flow received by tourist attractions in Hainan Province, China.
This paper used five main variables and their derivatives to predict hotel occupancy from July 1, 2017, to Nov 30, 2021, in a daily format. Table 1 shows the description of the variables. The Baidu search index variable consists of the Baidu index itself and five components: "Guiyang tourism", "Guiyang food", "Guiyang air ticket", "Guiyang tourist attractions", and "Guiyang hotels". These keywords can reflect tourists' travel motives and needs and support subsequent research.
In addition to the five main variables, the following derivatives were added:
- differenced variables of temperature, the Baidu search index, and the five Baidu search index components;
- lagged variables of temperature, the Baidu search index, and its five components, with lags from 1 to 7.
Differenced and lagged variables were added to the set of variables based on the results of the autocorrelation function (ACF) analysis (see Figures 2-9). Except for hotel occupancy and the Guiyang air ticket keyword index value, all variables show a clear trend, indicated by slowly decaying ACF values. Based on the wave structure of the ACF of hotel occupancy, it can be concluded that the variable contains a seasonal component (the maximal peaks are repeated every seventh day) (see Figure 2). The behavior of the Guiyang air ticket keyword index values is similar to that of hotel occupancy but does not show clear seasonality (see Figure 4). The seasonal component at lag seven also motivated the creation of lagged variables. It should be noted that machine learning tools do not require differenced and lagged variables. Nevertheless, an enhanced set of variables can lead to better results.
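The ACF check and the derived variables described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' R code: the toy series, the function names, and the `_lag<k>` naming are assumptions made for the example (the `d` prefix mirrors names such as dTemperature used later in the paper).

```python
from statistics import mean

def acf(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    m = mean(series)
    dev = [x - m for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, len(dev))) / denom
        for k in range(max_lag + 1)
    ]

def difference(series):
    """First difference; the first day has no predecessor, so it is None."""
    return [None] + [series[t] - series[t - 1] for t in range(1, len(series))]

def lag(series, k):
    """Series shifted back by k days (k >= 1); the first k entries are None."""
    return [None] * k + list(series[:-k])

def build_features(columns, max_lag=7):
    """columns maps a variable name to its daily values.
    Returns the differenced and lagged versions of every variable."""
    features = {}
    for name, values in columns.items():
        features["d" + name] = difference(values)
        for k in range(1, max_lag + 1):
            features[f"{name}_lag{k}"] = lag(values, k)
    return features

# Toy data (hypothetical): eight weeks with a weekend peak, and a slow trend.
occupancy = [50, 52, 51, 53, 55, 80, 78] * 8
temperature = [20.0 + 0.1 * t for t in range(len(occupancy))]

rho = acf(occupancy, 14)            # peaks at lags 7 and 14 reveal weekly seasonality
feats = build_features({"Temperature": temperature})
```

Rows whose lagged or differenced values are None (the first seven days here) would have to be dropped or imputed before model training; which of the two was done in the paper is not stated.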

Machine learning algorithms
In this research, the following machine learning algorithms were used:
1. Bagged CART;
2. Bagged MARS;
3. XGBoost;
4. Random Forest;
5. SVM.
Bagging, or bootstrap aggregation, is a specific methodology for reducing the forecasting error of learning algorithms. Breiman presented empirical evidence that bagging can reduce prediction error (Buja & Stuetzle, 2006). In general, bagging is realized in several steps:
1. Creating bootstrap samples from the training sample;
2. Applying the corresponding learning algorithm to each bootstrap sample;
3. Predicting by aggregating (usually averaging) the predicted values for test observations.
It was proved that bagging is highly effective for CARTs (classification and regression trees) (Breiman, 1984).
Bagged CART was realized using the bagging function from the ipred library of R.
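The three bagging steps can be illustrated with a small self-contained sketch. It is written in Python rather than R and is not the ipred implementation: a one-split regression stump stands in for a full CART, and the data, function names, and the number of models are assumptions chosen for the example.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (a simplified stand-in for CART)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        l_mean, r_mean = mean(left), mean(right)
        sse = (sum((y - l_mean) ** 2 for y in left)
               + sum((y - r_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, l_mean, r_mean)
    if best is None:                      # all x identical: predict the mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, l_mean, r_mean = best
    return lambda x: l_mean if x < threshold else r_mean

def bagged_fit(xs, ys, n_models=25, seed=0):
    """Steps 1 and 2: draw bootstrap samples and fit one learner per sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_predict(models, x):
    """Step 3: aggregate the individual predictions by averaging."""
    return mean(m(x) for m in models)

# Toy data (hypothetical): a step function of a single predictor.
xs = list(range(20))
ys = [10.0 if x < 10 else 30.0 for x in xs]
models = bagged_fit(xs, ys)
low, high = bagged_predict(models, 2), bagged_predict(models, 17)
```

Averaging over bootstrap replicates smooths the instability of individual trees, which is the reason bagging tends to help high-variance learners such as CART.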
Bagged MARS is multivariate adaptive regression splines with bagging and is usually used for solving complex non-linear regression problems.As Friedman wrote in his article, "the model takes the form of an expansion in product spline basis functions, where the number of basis functions as well as the parameters associated with each one (product degree and knot locations) are automatically determined by the data" (Friedman, 1991).
Bagged MARS was realized using the earth function from the earth library of R.
The extreme gradient boosting package xgboost is an efficient and scalable implementation of the gradient boosting framework of Friedman (2001) and Friedman et al. (2000). This package can solve both classification and regression tasks (Chen & He, 2023).
XGBoost was realized using the xgboost function from the xgboost library of R.
Random forest is a supervised machine learning algorithm based on a group of decision tree models that can also be used for solving regression tasks (Afriyie et al., 2023). In practice, random forest is characterized by accurate predictions, estimation of variables' importance, etc. (Prajwala, 2015).
Random forest was realized using the randomForest function from the randomForest library of R.
Support vector regression (SVR) is the form of the support vector machine (SVM) adapted to the case when the dependent variable is numerical rather than categorical. The basic principle of SVR is to map the indistinguishable sample data from a low-dimensional space to a high-dimensional space, in which they become distinguishable, using the kernel function. Then, the SVR establishes a decision function based on the theory of structural risk minimization for regression analysis on the distinguishable sample data (Sun et al., 2023).
SVM was realized using the svm function from the e1071 library of R.
The period of research covers 1,612 observations from Jul 1, 2017, to Nov 30, 2021. The whole period is divided into three parts called "before", "during", and "after" with respect to the COVID-19 situation in China. The boundary between the "during" and "after" COVID-19 periods is the July 2020 announcement by the Guizhou Provincial Tourism and Culture Department of the full resumption of cross-province (region and city) group travel in Guizhou.
All models were trained on the first 80% of data (724 observations) in the before-COVID-19 time period. After that, all models were tested on three other sets (see Figure 10). The first testing set belongs to the before-COVID-19 time period and represents the last 20% of its data (183 observations). The next, during-COVID-19 time period (red line in Figure 10) covers 212 observations. The third testing period represents the after-COVID-19 time period and contains 486 observations. The three testing periods were used separately to assess the forecasting power of the models in each period and to understand whether there were changes in the dynamics of hotel occupancy.
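The chronological splitting scheme described above (no shuffling, one training block and three test blocks) can be sketched as follows. The function name, the dictionary keys, and the toy period lengths are assumptions made for illustration and do not reproduce the paper's observation counts.

```python
def split_periods(rows, n_before, n_during, train_share=0.8):
    """Split time-ordered rows into one training set and three test sets.

    rows      -- observations in chronological order
    n_before  -- number of rows in the before-COVID-19 period
    n_during  -- number of rows in the during-COVID-19 period
    The remaining rows form the after-COVID-19 period.
    """
    before = rows[:n_before]
    during = rows[n_before:n_before + n_during]
    after = rows[n_before + n_during:]
    cut = int(len(before) * train_share)   # first 80% of "before" for training
    return {
        "train": before[:cut],
        "test_before": before[cut:],       # last 20% of "before"
        "test_during": during,
        "test_after": after,
    }

# Toy example (hypothetical lengths): 200 daily rows, 100 before, 40 during.
rows = list(range(200))
parts = split_periods(rows, n_before=100, n_during=40)
```

Keeping the order intact matters for time series: a random split would let the models see future days during training and would overstate their forecasting power.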

Benchmark
An ARDL model was chosen as a benchmark for comparison with the machine learning tools. The ARDL model was built on the same training set as the other models. The model was estimated in EViews 12 using automatic selection (see Figure 11 and Figure 12). Checking the Gauss-Markov conditions, it was concluded that:
1. The mean of the residuals is -3.75 × 10⁻¹⁵, i.e., effectively zero.
2. The existence of serial correlation in the residuals was checked by correlogram Q-statistics (see Figure 13) and the Breusch-Godfrey serial correlation LM test (see Figure 14). Although the Q-statistics did not show any serially correlated residuals (p-values > 0.05), it was decided to remediate serial correlation in the residuals by the HAC correction because the second test showed the presence of serial correlation in the model residuals (p-value equals zero).
Figure 13. ACF of the model residuals
Figure 14. ACF of the model residuals
The problem of heteroskedastic residuals (p-value < 0.05) (see Figure 15) was remediated by the HAC correction (see Figure 12). As a result, the model used as a benchmark can be expressed mathematically as Equation (1). The value of hotel occupancy calculated by Equation (1) was rounded down.

Variable importance
Variable importance was used to check the hypothesis that the Baidu search index values and its components can be utilized to predict hotel occupancy. The importance values are obtained automatically by the corresponding function of R. The measurement units and the mathematical apparatus used in each case can be found in the documentation of the corresponding function. All functions used to extract variable importance are mentioned in the main text for each machine learning tool. The main focus was on the ranking of variables by importance, based on the quantitative assessments provided by the corresponding functions.

Measures of model accuracy
The measures of model accuracy allow us to compare the models with one another as well as to assess their forecasting power. The primary focus is on the coefficient of determination (R²) as the simplest value for understanding the model's forecasting power.
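Because the table of accuracy measures is not reproduced in the text, the following sketch uses the standard textbook definitions as an assumption. In particular, forecast bias is taken here as the sum of (forecast - actual), a sign convention chosen so that negative values mean underestimation, in line with the interpretation used in the results. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean

def r_squared(actual, forecast):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    It is negative when the forecast is worse than predicting the mean."""
    y_bar = mean(actual)
    ss_res = sum((y - f) ** 2 for y, f in zip(actual, forecast))
    ss_tot = sum((y - y_bar) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, forecast):
    """Root mean squared error."""
    return sqrt(mean((y - f) ** 2 for y, f in zip(actual, forecast)))

def forecast_bias(actual, forecast):
    """Assumed definition: sum of (forecast - actual).
    Negative values indicate underestimation."""
    return sum(f - y for y, f in zip(actual, forecast))

actual = [10, 20, 30, 40]
mean_only = [25, 25, 25, 25]     # always predicting the mean gives R² = 0
reversed_fc = [40, 30, 20, 10]   # worse than the mean gives a negative R²
low_fc = [8, 18, 28, 38]         # systematic underestimation
```

With these definitions, an R² below zero on a test set simply means that the residual sum of squares exceeds the variance around the test-set mean, which is how the negative values reported for some models should be read.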

Empirical results
As previously mentioned, all models were built on the first 80% of the before-COVID-19 time period covering the first 731 observations from July 1, 2017 to July 1, 2019 and containing 68 variables.

Bagged CART
The bagging function contains an internal parameter minsplit, which refers to the minimum number of observations required at a node for it to be split further. Examining the dependence between the values of this parameter and the model quality measure (see Figure 16), it was found that the minimal out-of-bag estimate of root mean squared error, equal to 31.22, is reached when minsplit = 10. Table 3 displays the TOP-10 of variable importance in the model. Except for temperature, all variables are related to people's activity in the Baidu search engine. Applying the optimal bagged CART model to the remaining 20% of data in the before-COVID-19 time period as well as to the during- and after-COVID-19 time periods, the following results were obtained for model accuracy (see Table 4). The forecast bias on the training data is negative and smaller in magnitude than the value of -1,452 obtained on the last 20% of data in the before-COVID-19 period. A negative value signals underestimation by the forecasts. During the extreme period when the COVID-19 pandemic was officially announced, the value of forecast bias became positive, showing that the model started to overestimate. This is logical because sharply falling demand for hotel rooms and total restrictions seriously decreased hotel occupancy, and the model could not follow these tendencies so quickly. However, in the after-COVID-19 time period the value of forecast bias fell significantly, reaching -4,952 and signaling substantial underestimation. Nevertheless, in all time periods, the model could explain more than 50% of the initial variance: 53% on the testing data, 54% in the during-COVID-19 time period, and 68% after.

Bagged MARS
A basic bagged MARS model was built using the earth function from the earth library and optimized by RMSE using a search grid. Two parameters were varied: degree (potential interactions between different hinge functions) and nprune (number of terms to retain). The best model has nprune = 23 and degree = 1. Based on the best model, the TOP-10 of variable importance was obtained (see Table 5). The values of importance were obtained using the vip function from the vip library. Despite the quite advanced mathematics built into this algorithm, the behavior of this model is unsatisfactory: the coefficient of determination is negative, meaning that the model is worse than simply predicting the average of the values.
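A grid search of this kind can be sketched generically. The example below is a Python illustration under stated assumptions, not the authors' tuning code: `grid_search` is a hypothetical helper, and `toy_rmse` is an artificial stand-in for "fit the model with these parameters and return its validation RMSE", with its minimum placed at an arbitrary grid point.

```python
from itertools import product

def grid_search(score, grid):
    """Evaluate score(**params) on every combination in the grid and
    return (best_params, best_score) for the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score

def toy_rmse(degree, nprune):
    """Artificial error surface (hypothetical), minimal at degree=2, nprune=11.
    In practice this would train a model and compute RMSE on held-out data."""
    return (degree - 2) ** 2 + abs(nprune - 11) / 10

grid = {"degree": [1, 2, 3], "nprune": list(range(5, 40, 3))}
best_params, best_score = grid_search(toy_rmse, grid)
```

An exhaustive grid is feasible here because only two parameters with a few dozen combinations are involved; the same loop applies to the minsplit search for bagged CART and to the number of trees for random forest.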

XGBoost
An XGBoost model was created using the xgb.train function from the xgboost library. In this function, two parameters can be used in optimization: max.depth (maximum depth of a tree) and nrounds (maximum number of boosting iterations). Figure 17 presents the results of optimizing the number of boosting iterations against the model RMSE. The RMSE value stabilizes at approximately 20 for both lines. The XGBoost algorithm showed a near-perfect result on the training data (R² = 99%), but on the testing data the result became unexpectedly poor: the coefficient of determination was negative. On new data from the during-COVID-19 and the after-COVID-19 time periods, the model showed positive values of R², but they were below the desirable level of 50%.

Random forest
The random forest algorithm requires defining the number of trees. To solve this problem, the relationship between model quality, expressed through the mean squared residuals (MSR), and the number of trees was analyzed (see Figure 18). Based on the variance explained and the behavior of MSR, it was decided to set the number of trees to 60.
Assessing the model accuracy metrics, it was found that the model has the same characteristics as the previous ones (see Table 9). Forecast bias shows the same problems with underestimation (except in the during-COVID-19 time period), and the coefficient of determination (R²) is about zero. Summarizing the results produced by the random forest model, it can be concluded that the model cannot predict hotel occupancy appropriately despite the significant results obtained on the training data. All R² values under 50% were considered insufficient.

SVM
When the support vector machine technique was applied to the training data, the coefficient of determination (R²) was sufficiently high (0.81). Nevertheless, when the model was extended to the testing data in each separate time period, this metric decreased dramatically, down to zero (see Table 11), signaling the inability of the model to forecast hotel occupancy at an appropriate level. In the next two periods, the coefficient of determination remained under 50%.

Benchmark
The following results were obtained by applying the benchmark model to the different time periods (see Table 13). In the case of forecast bias, the picture is relatively typical compared with the other models: the model tends to underestimate in all time periods except the during-COVID-19 time period. In terms of the coefficient of determination, the model shows relatively stable results. On the training data, the model was able to explain 69% of the initial variance. On the remaining 20% of the before-COVID-19 time period, the model explained 53%, which can be considered appropriate. Notably, in the during-COVID-19 time period the model improved its forecasting power to R² = 84%. Afterwards, the coefficient fell to 76%.

Conclusions
Considering all results obtained during the research, it can be concluded that:
1. The behavior of all models, including the benchmark, was quite typical: on the training data the models showed negative bias, signaling underestimation of hotel occupancy. Almost all models had a forecast bias of approximately 300 in absolute value (except XGBoost, where forecast bias = -806). The underestimation remained on the testing data, showing a significant difference between actual and forecasted data (forecast bias was expressed in negative thousands). Nevertheless, in the during-COVID-19 time period overestimation appeared. This effect can be explained by the significant decrease in hotel occupancy caused by external factors (the substantial restrictions and limitations imposed by the Chinese government during the COVID-19 pandemic). When the situation with COVID-19 stabilized, the models again showed a more serious negative forecast bias. Thus, bagged CART, bagged MARS, random forest, XGBoost, and SVM tend to underestimate hotel occupancy in a daily format. For the benchmark (ARDL) model, forecast bias increased only insignificantly compared with the other models.
2. The forecasting power (accuracy) of the machine learning algorithms used in this research can be assessed as weak because all coefficients of determination are less than 50% on the testing data (the remaining 20% of the before-COVID-19 time period and the during- and after-COVID-19 time periods), except for the bagged CART model, which could explain 53%, 54%, and 68% of the initial variance, respectively. The bagged CART model was the only one that could ensure an R² coefficient greater than 50%. However, the benchmark model based on the ARDL model class showed better results: 53%, 84%, and 76%, respectively.
3. In the context of variable importance, it was found that the Baidu search index and its components related to hotel booking and visits to the corresponding city can be used to build a model to predict hotel occupancy. Summarizing how often each variable was used, the TOP-10 looks as follows:
- Guiyang.cuisine
- Baidu.index
- Guiyang.attractions
- Guiyang.air.ticket
- Guiyang.Hotel
- Guiyang.tourism
- Temperature
- dGuiyang.Hotel
- dTemperature
- dGuiyang.cuisine
4. The weak forecasting power of the models (excluding bagged CART) suggests that the task of classification (not regression) is more typical for the machine learning algorithms used in this research. Nevertheless, at least one algorithm, bagged CART, was able to predict high-frequency daily hotel occupancy more or less adequately.
Summarizing all of the above, the following conclusions on the two hypotheses can be drawn:
1. In this case, the machine learning tools used in this research did not show more effective results than a traditional model based on the ARDL methodology. Thus, the first hypothesis on the higher effectiveness of machine learning tools compared to the traditional forecasting model based on the ARDL class of models is rejected.
2. As variable importance showed, the Chinese Baidu search engine index and its components were used in the machine learning models. Thus, the second hypothesis on the possible use of the Baidu search engine index values and its components is accepted.

Figure 1. Network visualization

Figure 15. Test on homoskedasticity of the residuals
The correlation coefficient between d(Baidu index) and d(Temperature) is 0.14, with the p-value = 0, indicating the absence of a multicollinearity problem (the correlation coefficient is below 0.7).

Figure 16. Model RMSE vs. minsplit value
The values of importance were calculated automatically by the varImp function from the caret library. As seen, except for temperature, all variables are related to people's activity in the Baidu search engine.

Figure 17. Training and testing RMSE
Assessment of model accuracy and TOP-10 of variable importance are shown in Table 7 and Table 8, respectively.

Figure 18. MSR vs. number of trees

Table 1. A list of five main variables
# | Variable | Description
1 | Hotel occupancy | This is the target of interest. The variable is used as a dependent variable in the models considered in this paper.
2 | Weekend | It is a dummy variable: 1 if Saturday or Sunday and 0, otherwise.
3 | Temperature | It represents the weather conditions by the average temperature in the region. The values are obtained from tianqihoubao.com.
4 | Public holidays | It is a dummy variable: 1 if national holiday and 0, otherwise. The information on the public holidays formulated by the General Office of the National People's Congress (NPC) over the study period was obtained from www.gov.cn.
5 | Baidu index | Consists of the Baidu index itself and five components: "Guiyang tourism", "Guiyang food", "Guiyang air ticket", "Guiyang tourist attractions", "Guiyang hotels".
Table 2 displays all widely used measures used in this paper. In this table, ŷ_t is the forecasted value in time period t, y_t is the actual value in time period t, and ȳ is the mean of y.

Table 4. Assessment of model accuracy (bagged CART)

Table 5 Table 7 .
Assessment of model accuracy (bagged MARS)

Table 9. Assessment of model accuracy (Random Forest)

Table 10 displays the TOP-10 of variable importance. The values of importance were extracted by the importance function from the randomForest library.

Table 11. Assessment of model accuracy (SVM)
Table 12 displays the TOP-10 of variable importance. The values of importance were calculated by the Importance function from the rminer library.

Table 13. Assessment of model accuracy (benchmark)