THE ECONOMIC EXPLAINABILITY OF MACHINE LEARNING AND STANDARD ECONOMETRIC MODELS: AN APPLICATION TO THE U.S. MORTGAGE DEFAULT RISK

This study aims to bridge the gap between two perspectives of explainability (machine learning and engineering, and economics and standard econometrics) by applying three marginal measurements. The existing real estate literature has primarily used econometric models to analyze the factors that affect the default risk of mortgage loans. However, in this study, we estimate a default risk model using a machine learning-based approach with the help of a U.S. securitized mortgage loan database. Moreover, we compare the economic explainability of the models by calculating the marginal effect and marginal importance of individual risk factors using both econometric and machine learning approaches. Machine learning-based models are quite effective in terms of predictive power; however, the general perception is that they do not efficiently explain the causal relationships within them. This study utilizes the concepts of marginal effects and marginal importance to compare the explanatory power of individual input variables in various models. This can simultaneously help improve the explainability of machine learning techniques and enhance the performance of standard econometric methods.


Introduction
This study uses machine learning (ML) and standard econometric methodologies to analyze the U.S. mortgage default risk data obtained from Freddie Mac. Compared to standard econometric techniques, ML techniques are considered to have greater learning and forecasting capabilities; therefore, in practice, they play an effective role in the detailed management of default risk. However, unlike standard econometric approaches, ML techniques contribute little to explaining the causes of phenomena and are difficult to apply in practice in an accountable manner. Such techniques are viewed less favorably in the highly regulated field of credit risk management because the complexity and non-linearity of ML models, and the way their predictions are derived, make them difficult for ordinary people to understand (Štrumbelj & Kononenko, 2011). Recently, numerous attempts have been made in the field of artificial intelligence to solve the problem of ML explainability and interpretability. Referred to as explainable artificial intelligence (XAI), this field attempts to produce results that can be understood by humans.
In this context, explainability can be addressed from two different perspectives: ML and engineering, and economics and standard econometrics. While Campbell and Cocco (2015) explain residential mortgage default from the perspective of economics and standard econometrics, Bracke et al. (2019) observe it from an ML and engineering perspective. Although words such as explain, explanation, explainable, and explainability are closely related in terms of their meaning, Campbell and Cocco's (2015) use of such terms has virtually no relationship with their use in Bracke et al. (2019).
This study aims to bridge the gap between the two perspectives of Bracke et al. (2019) and Campbell and Cocco (2015) using economic and econometric methods. First, we describe the current interpretation of explainability in ML models by evaluating the various stakeholders in this field (Preece et al., 2018). Second, we list similar concepts, including explainable, interpretable, understandable, comprehensible, accountable, responsible, justifiable, ethical, transparent, causality, trustworthy, and glass box (Arrieta et al., 2020). Third, we introduce the concept of marginal effect (ME) from an economics perspective, which is based on the rigorous notion of economic causality.
Interpretability and explainability are important to stakeholders such as data scientists, end users, and regulators, although for different reasons (Kabul, 2017). Data scientists aim at building models whose predictions are highly accurate (developers and theorists in Preece et al., 2018) by determining the best algorithm to improve a specific model. In this context, the major stakeholders may be the end users who primarily focus on understanding the factors behind a particular model's generation of a certain prediction (expert users, directors of a company, data subjects, and affected customers in Arrieta et al., 2020). These stakeholders want to know how such decisions affect them. Further, end users try to investigate whether they are subject to unbiased treatment and the necessity of objecting to a particular decision. Regulators and lawmakers try to protect end users; hence, regulations have been recently issued, such as the General Data Protection Regulation and EU Expert Group on AI (Bücker et al., 2020), Equal Credit Opportunity Act and Fair Credit Reporting Act (Chen, 2018), and Payment Service Directives 2 (Torrent et al., 2020).
However, with the inevitable rise in complex ensemble ML algorithms, regulators are becoming increasingly concerned about the decisions made by ML models, especially in the field of financial services. Initially, an ensemble model simply referred to a conglomeration of models (Nanni & Lumini, 2009). However, at present, the ensemble approach combines virtually anything from models to feature selection methods to improve prediction performance (Dahiya et al., 2017; Koutanaei et al., 2015). This trend of increasing ensemble modeling has not depended on the types of classifiers. In this context, prior studies have investigated decision tree (Wang et al., 2012) and random forest (RF) ensembles (Chopra & Bhilare, 2018), support vector machines (Pławiak et al., 2019), and neural networks, especially in the fields of credit scoring and credit risk management (Tsai & Wu, 2008). Overall, stakeholders want black box models to exhibit qualities similar to those of other causal models, such as transparency, trustworthiness, and explainability (Shead, 2020).
Analyses in the field of economics and econometrics involve a strong notion of cause-effect relationships, applying a rigorous background of utility (profit or cost) theory (Campbell & Cocco, 2015). In contrast, causality is not among the most important goals for studies related to engineering (Arrieta et al., 2020).
In this study, causality and explainability are described from the perspective of economics and standard econometrics; we explain our model via ME and test for statistical significance. Studies in the field of ML prefer using the term marginal contribution, that is, significant influence or causal importance, to measure explainability.
To understand both the model and its prediction, we estimate the model and calculate the ME of each independent variable on default risk, which is taken as the dependent variable. The explanatory power of ML models can be evaluated using ME analysis. A comparison of the alternative models can be conducted by implementing the concepts of ME and marginal importance (partial dependence plot [PDP] and Shapley additive explanation value [SHAP]). However, we do not compare their prediction power but focus on three different measures of explainability based on both economics and ML perspectives.
This study uses residential mortgage data comprising 15,904 loans originated from January 2005 to November 2010. We develop a residential mortgage risk prediction model by estimating the probability of default for each loan. To achieve this, we estimate a default risk model by employing logit regression, which is a standard econometric method, as well as the artificial neural network (ANN) and RF methods used in ML.
We choose RF and ANN among other ML approaches because both models are relatively simple compared to other deep learning approaches. If competing models are too different from each other, we might fail to compare them on an apples-to-apples basis. This restriction of similarity also influences the logit model. RF cannot be applied to non-i.i.d. data, inclusive of panel datasets (Pearl, 2019; Steinwart et al., 2009). To feasibly compare these three approaches, we fit a simple logit model instead of a panel logit model and use the information in the event month, where "event" refers to either default or right censoring in the last observation month of September 2013, following Bracke et al. (2019).
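The reduction from a loan-month panel to one record per loan can be illustrated with a minimal sketch. The field names (`loan_id`, `month`, `dpd`), the 90-day threshold constant, and the toy records are hypothetical and are not taken from the Freddie Mac files; the sketch only shows the idea of keeping the default month if a loan defaults and the last observed month otherwise.

```python
from collections import defaultdict

CENSOR_MONTH = "2013-09"   # last observation month (right-censoring point)
DEFAULT_DPD = 90           # assumed days-past-due threshold for default

def event_rows(panel):
    """Collapse loan-month records into one event row per loan."""
    by_loan = defaultdict(list)
    for rec in panel:
        if rec["month"] <= CENSOR_MONTH:       # "YYYY-MM" strings sort chronologically
            by_loan[rec["loan_id"]].append(rec)
    rows = []
    for loan_id, recs in by_loan.items():
        recs.sort(key=lambda r: r["month"])
        # First month in which the loan reaches the default threshold, if any.
        first_default = next((r for r in recs if r["dpd"] >= DEFAULT_DPD), None)
        # Otherwise the loan is treated as censored at its last observed month.
        event = first_default or recs[-1]
        rows.append({"loan_id": loan_id,
                     "event_month": event["month"],
                     "default": int(first_default is not None)})
    return rows

panel = [
    {"loan_id": "A", "month": "2013-07", "dpd": 30},
    {"loan_id": "A", "month": "2013-08", "dpd": 90},
    {"loan_id": "A", "month": "2013-09", "dpd": 120},
    {"loan_id": "B", "month": "2013-08", "dpd": 0},
    {"loan_id": "B", "month": "2013-09", "dpd": 0},
]
rows = event_rows(panel)
```

Each resulting row can then be treated as one independent observation, which is what allows the logit, ANN, and RF models to be fitted on the same cross-section.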
The contributions of this study are as follows. First, our approach is distinctive in the sense that we try to bridge both approaches by applying ME from the economic perspective. Specifically, we explain the ML technique using the concept of ME, which is more conducive or intuitive for human understanding while also being based on economic theory. ME has many advantages; it is a model agnostic approach, as it does not depend on the model specification. It produces results very quickly, and implementation time does not depend on either the number of variables or the model specification. Moreover, ME requires neither model re-training nor re-estimation. ME analysis reports the directional sign of the independent variable's influence on the dependent variable. In a measure of explainability, statistical significance is of primary importance (Chernozhukov et al., 2018; Mullainathan & Spiess, 2017); ME provides the useful tool of a statistical significance test. For government regulators, ME offers feasible guidelines regarding the required size (strength) of a policy variable change based on the targeted statistical significance level, such as 1%, 5%, or 10%.
Second, we also utilize the marginal importance measure of the ML model to conduct a detailed supplementary analysis of the standard econometrics model, following Athey and Imbens (2017), Mullainathan and Spiess (2017), Chernozhukov et al. (2018), Rudin (2019), and Torrent et al. (2020). Third, we examine specific non-linear functional forms of input variables, such as squared and cubed polynomials, and test the similarity of the models using the Mann-Whitney test.
Overall, this study utilizes the advantages of two alternative methodologies to propose a better model. By successfully formulating a model, we expect to achieve a more transparent and trustworthy ML model that is based on economic explainability and produces more accurate predictions.
The paper is structured as follows. The introduction explains the background and purpose of the study. In Section 1, we examine prior studies related to the econometrics model and ML approach by considering residential mortgage default and credit scoring models. In Section 2, we describe the measures of explainability. In Section 3, the results of the empirical analyses of the default model through econometrics and ML models are compared using the concepts of ME and marginal importance. We conclude the paper by summarizing the results and acknowledging its limitations.

Prior research
While the history of the development of basic mortgage modeling is studied by Wallace (2005), Campbell and Cocco (2015) examine its theoretical modeling from an economic perspective. In this context, we review the following related studies in the existing literature.
From the perspective of social science, Feldman and Gross (2005) were the first to use classification and regression trees (CART) in mortgage credit analysis. Sirignano et al. (2016) use a deep learning approach, while Galindo and Tamayo (2000) compare the prediction error of both standard econometric and ML models and find that CART performs the best. Fuster et al. (2020) observe the effect of the change in practice from linear to non-linear estimation models from the perspective of the people affected, especially in terms of race and gender. This change to non-linear models may produce winners and losers and reduce the cross-subsidy resulting from the linear model, which depends on the conditional expectations among multiple groups.

ML-based explainability measures
Explainability measures have played an exceptionally important role in current studies. Here, we analyze only the studies related to credit risk management. Explainability measures include PDP and individual conditional expectation (ICE) (Bücker et al., 2020; Fahner, 2018; Goldstein et al., 2015; Zhao & Hastie, 2021), SHAP (Wang et al., 2021), and LIME (Bücker et al., 2020). This study reports only PDP and SHAP measures for simplicity.
To improve the explainability of ML models, Rudin (2019) insists on building a model that can be explained fundamentally, rather than trying to explain a complex model. To this end, the model should reflect a viewpoint that emphasizes both domain knowledge (Chen, 2018; Fahner, 2018; Torrent et al., 2020; Zhao & Hastie, 2021) and the behavioral characteristics of stakeholders (Athey & Imbens, 2017; Mullainathan & Spiess, 2017; Rudin, 2019).

Methodology
Due to their non-parametric approach and non-linear structure, it is difficult to explain the prediction results of ML techniques; therefore, it is difficult to understand the model's internal structure (Arrieta et al., 2020). Moreover, there are few to no generally accepted definitions of explainability (Preece et al., 2018), similar to the definition of mortgage default itself.
During the process of analyzing explainability, the econometric and ML approaches are separately applied to the dataset to derive the model estimation results. Tools for estimating explainability are applied to the derived model and comparison analysis is discussed.
Specifically, we employ logit regression analysis along with ANN and RF techniques to estimate a default risk model for U.S. residential mortgage loans. Regarding the parameters that influence the ML models, the ANN had one or two hidden layers, and the number of hidden nodes was either matched with the number of input variables or increased up to three times the number of variables. As the baseline, one hidden layer was used with the same number of hidden nodes as input variables to determine whether the explainability of the features applied to the ML-based model can be estimated utilizing econometrics-based ME techniques.
RF is an ensemble technique that makes a final decision through the majority vote of the results that were determined by multiple decision trees. For the prediction result (probability value) to be continuously distributed, a sufficient number of trees must be generated; a minimum of 100 trees was deemed appropriate for this study to allow continuous calculation of the probability value. Therefore, 100 decision trees were utilized.
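The link between the number of trees and the granularity of the predicted probability can be made concrete with a short sketch. It assumes, as in the description above, that each tree casts a hard 0/1 vote and that the probability is the share of trees voting for default; implementations that instead average per-tree class probabilities would produce finer values. The simulated votes and the 7% vote rate are illustrative only.

```python
import random

def vote_share(votes):
    """Fraction of trees voting 'default' (1) for a single loan."""
    return sum(votes) / len(votes)

def attainable_probabilities(n_trees):
    """Every value a vote share can take when n_trees trees vote."""
    return [k / n_trees for k in range(n_trees + 1)]

rng = random.Random(0)
# Simulated hard votes from 100 trees for one loan (illustrative, not real output).
votes = [1 if rng.random() < 0.07 else 0 for _ in range(100)]
p_default = vote_share(votes)

grid_10 = attainable_probabilities(10)    # steps of 0.10
grid_100 = attainable_probabilities(100)  # steps of 0.01
```

With 10 trees the predicted probability can only move in steps of 0.10, whereas 100 trees give steps of 0.01, which is why a larger forest is needed before small input perturbations can register as a change in the prediction.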
The following provides a brief introduction to the methods of estimating explainability in ML. First, utilizing the visualization tool, PDP, we graph the changes in the dependent variable that result from a change in the value of the independent variable(s). Additionally, the SHAP value is estimated to comprehend the influence of individual variable(s).
In addition to methods such as PDP and the SHAP value, this study introduces the concept of ME to estimate the explainability of logit regression analysis and ML models. ME indicates how infinitesimal changes in the independent variable affect the dependent variable. For example, ME has the same meaning as the regression coefficients of each variable in a linear regression model. Due to the absence of regression coefficients in ML, the concept of ME is introduced and utilized operationally. Non-linear models, such as the nested logit model, also utilize ME to calculate the scale of the causal effect of independent variables on dependent variables.
A simple ME approach is applied to perturb single inputs and measure the change in dependent variables in the model, that is, changing one input to observe the outputs. This method estimates the response of the dependent variable in the model to a specific independent variable on a single point as a linear approximation (Lundberg & Lee, 2017). By repeating this procedure with various variable values, a picture of the model's behavior can be illustrated. In numerous practical domains, ME results have been observed to be direct and intuitive. Cameron and Trivedi (2005) define the ME of dummy and continuous variables separately; the ME of a continuous variable is defined in Equation (2).
Second, a rigorous discussion regarding the reliability of the estimated ME results is required; however, it is difficult for the estimated ME results of an ML-based model to be consistent. This is due to the non-linear or non-parametric characteristics typically inherent in an ML model, which cause high variance in the dependent variable values in ME depending on the location of the independent variable being estimated. In addition, if the dependent variable is discontinuous, it may not respond to Delta (∆), which denotes infinitesimal changes in the independent variable. As a result, a t-test is conducted to explain the statistical significance of the changes observed in ME to check its reliability.
Since ML models take a non-linear form, infinitesimal changes in the independent variable may have no effect on the dependent variable (Bracke et al., 2019). To prevent such an occurrence, we identify the critical value at which there is a statistically significant difference in the default risk by applying various ranges of change, that is △, to the independent variable. For continuous independent variables, ME was estimated through 1% stepwise changes from 0 to 100% of the independent variable's standard error (SE) for △. The procedure and code for the ME algorithm are both included in Algorithm 1 in the Appendix. Accordingly, a t-test was utilized to identify the critical value at which the null hypothesis of ME can be rejected at the 10% and 1% significance levels. Since it is unrealistic to apply △, which represents infinitesimal changes, to the SE for dummy variables, we do not conduct the ME analysis for dummy variables.
By estimating ME, we identify the extent and direction of how the individual independent variables affect the model's dependent variable; further, it is possible to comprehend the range of change in the independent variable that has a statistically significant effect on the change in default risk probability.

Dataset and variables
This study utilizes Freddie Mac's U.S. Single-Family Loan Dataset. Here, default is operationally defined as a loan that is at least 90 days delinquent (90 DPD). Unlike the competing risks model approach undertaken in prior econometric analyses, the ML approach is generally applied to a single default risk model (without prepayment risk analysis). The mortgages are homogeneous 30-year fixed-rate mortgages, such that there exists a borrower with 1 unit of property and a purchase-money mortgage. Table 1 exhibits a left censored dataset from 2011 to a right censored one in September 2013. The number of loans originated by year and number of yearly defaults during the observation period are provided in columns (A) and (B), respectively. There is a total of 1,102 3-year loan defaults, which is 6.93% of all observations. The definitions of the independent variables and how they are calculated are provided in Table A.1 in the Appendix; the descriptive statistics of each variable are found in Table A.2 in the Appendix.

Logit model
To investigate the differences between the two main approaches and the possibility of mutual complementary support (Bücker et al., 2020; Chernozhukov et al., 2018; Mullainathan & Spiess, 2017; Rudin, 2019), we estimate a logit regression model (see Table A.3 in the Appendix).
The unemployment rate, current loan to value (LTV), original debt to income (DTI), number of times thirty days delinquent in the last 12 months (30 DPD) and number of times sixty days delinquent (60 DPD) displayed positive (+) effects on default, while the origination 2008 and 2009 dummy, original LTV, credit score, power of sale states dummy, and sand states dummy displayed negative (-) relationships. However, the relationships of the mortgage age-related variables, house price appreciation rate, owner occupied, first-time home buyer, piggyback, and equity ratio-related variables were not found to be statistically significant. While the model estimation results had no significant differences when compared to prior studies, differences were found in the case of original LTV. In existing literature, Foote et al. (2008) state that a higher original LTV (+) and interest rate (+) tend to raise the default probability. Moreover, Bracke et al. (2019) find that current LTV and original LTV have positive (+) effects on default risk in their logit model. However, the results of the original LTV in our model, which are contrary to common knowledge, may be explained in terms of endogeneity bias (Bhardwaj & Sengupta, 2010). Table 2 presents the results of ME based on our logit regression model. An additional amount of △ disruption was induced in the current value of each independent variable to calculate the changes observed in the dependent variable. All the ME analyses were conducted using Python software with the help of the scikit-learn ML toolkit and other ML libraries run on Ubuntu Linux and Amazon Web Services with 4 CPUs and 16 GB memory.
Further, a t-test was conducted after computing the mean of the ME vector and the SE of each continuous independent variable. To compare models, the size of the △ with respect to the SE of a variable was increased in 1% increments from 0% to 100% until ME (change in default probability) was statistically significant at either the 1% or the 10% significance level. Columns (a) and (b) of Table 2 exhibit the ME values calculated using the critical values that reject the null hypothesis at the 1% statistical significance level. Columns (c) and (d) present the size of △ that rejects the null hypothesis of ME at the 1% or 10% significance level. Here, we need to explain the economic implications of the results of the ME analysis in terms of mortgage age, house price appreciation rate, and equity to debt ratio, which are not statistically significant in the logit model estimation. Since the ME values of insignificant variables do not have any economic implications, we only add them for the purpose of comparing them with other ML models. Table 2 shows that a disruption of 40% in the SE of the original DTI variable results in an ME of 0.1037, which is found to be statistically significant at the 1% significance level. For the current LTV variable, a disruption size of 4% implies that the null hypothesis of ME is rejected at the 10% significance level. Therefore, compared to the original DTI variable, the ME of the current LTV variable is more sensitive to a small disruption (4% of the standard deviation of a variable). This means that even if the variables possess the same level of statistical significance, their ME sensitivities can differ widely. Overall, since the ME of the logit model is calculated through a logit regression, the direction of all variables remains the same but the scales of their responses do not appear to be uniform.
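The stepwise search for the critical △ can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure in Algorithm 1 of the Appendix: the `predict` function is a toy logistic model with made-up coefficients, the data are synthetic, and the significance check is assumed to be a Welch t-test between the predicted probabilities before and after the perturbation, with large-sample normal critical values (1.645 and 2.576).

```python
import math
import random
import statistics

Z_10, Z_01 = 1.645, 2.576  # two-sided normal critical values for 10% and 1%

def predict(row):
    """Toy default-probability model; the coefficients are made up."""
    z = -2.5 + 0.8 * row["dti"] - 0.5 * row["score"]
    return 1.0 / (1.0 + math.exp(-z))

def welch_t(a, b):
    """Welch t-statistic for mean(b) - mean(a)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(b) - statistics.fmean(a)) / se

def critical_delta(data, var, z_crit):
    """Smallest perturbation (in % of the variable's SD) with |t| >= z_crit.

    Returns (percent, mean change in predicted probability), or (None, None)
    if no step between 1% and 100% is significant.
    """
    sd = statistics.stdev([r[var] for r in data])
    base = [predict(r) for r in data]
    for pct in range(1, 101):
        shifted = [predict({**r, var: r[var] + sd * pct / 100}) for r in data]
        if abs(welch_t(base, shifted)) >= z_crit:
            return pct, statistics.fmean(shifted) - statistics.fmean(base)
    return None, None

rng = random.Random(0)
data = [{"dti": rng.gauss(0, 1), "score": rng.gauss(0, 1)} for _ in range(2000)]
pct_10, me_10 = critical_delta(data, "dti", Z_10)
pct_01, me_01 = critical_delta(data, "dti", Z_01)
```

Because the stricter 1% threshold can never be met before the 10% threshold, `pct_01` is at least as large as `pct_10`; the gap between the two gives a rough sense of how quickly a variable's effect becomes detectable as the perturbation grows.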
Square of equity to debt ratio    2.558***    1.108    1    1
Note: # and * indicate the statistical significance level in the logit model estimation and the ME of the logit model, respectively.

ANN model
In complex non-linear ANN models, inducing separate changes in an individual variable has a limited effect on the dependent variable (Bracke et al., 2019). Table 3 displays the results of calculating ME for an ANN model. ME was calculated using the same process as that used for the logit regression model. The ANN model applied in this study is a relatively simple network with a single hidden layer and an equal number of hidden nodes and independent variables. Despite its simplicity, the ANN model shown required relatively large changes in independent variables for its ME to be statistically significant.
We discuss the estimation results from an economically intuitive perspective. The credit score variable has played an important role in most standard econometric models (Archer & Smith, 2013; Bhardwaj & Sengupta, 2010; Elul et al., 2010), although it has been deemed unimportant in the case of ML models (Bracke et al., 2019). However, it is significantly important in Sirignano et al. (2016), and in our ANN model, credit score is found to be the fifth most important variable.
The original DTI does not play an important role in the ANN model estimation, although it is statistically significant in the ME analysis. We conjecture that the lack of importance of the economically important original DTI variable in the ML model might be the result of an effect that is similar to that of multicollinearity in standard econometrics; that is, the original DTI variable might be explained by other independent variables in a non-linear manner in the ANN model estimation process. This is certainly an advantage of the ME analysis, which provides ML modelers with both the directional signs of individual variable influences and the results of tests of statistical significance to consider further hand-tuning of the model.
In contrast to the ANN model, the ME analyses find that the coefficients from the logit model are more sensitive to small changes in the mean of the absolute values of ME. This is significant from the viewpoint of policy interventions, such as in the case of a predictive policing system (Datta et al., 2016). Since human regulators have limited capabilities for analyzing the influence of simultaneous changes in multiple variables and manipulating multiple variables jointly and marginally, it is more practical to use a standard econometrics approach such as an ME analysis. Even though ML models consider the presence of many correlated input variables, the fact that the ML approach underestimates the influence of an individual input variable can result in serious problems. Table 4 presents a comparison between the ME of the logit regression and ANN models. In contrast to the logit model, the house price appreciation rate and square of equity to debt ratio variables in the ANN model fail to achieve statistical significance in the ME analysis. Given that the house price appreciation rate variable is the 10th most important variable under the ANN model, the results do not appear to be plausible. Therefore, while the two models are quite similar, the results of the logit model are slightly more consistent than those of the ANN model estimation.
By perturbing two variables simultaneously, we measured the combined ME on default probability. While the experiment can be conducted by grouping all variables, considering two variables at a time, this study focuses on the original LTV (ltv) and original DTI (dti), which are representative variables of the double trigger property (equity and cash-flow considerations). During this procedure, increasing the original LTV (ltv) and original DTI (dti) simultaneously was shown to increase default probability; therefore, both LTV and DTI were increased in the positive direction to identify the area where the default probability is found to exhibit a statistically significant change. We expect that in comparison to changing one variable, changing two variables simultaneously will result in a more sensitive response in the dependent variable, such that the two variables should display a relationship of substitution.
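A two-variable perturbation of this kind can be sketched as a small grid of joint shifts. The `predict` function below is a toy logistic model with made-up positive coefficients for both inputs (matching the expectation stated above, not any estimated model), the loans are synthetic and standardized, and the step sizes are illustrative.

```python
import math
import random
import statistics

def predict(ltv, dti):
    """Toy default-probability model with made-up positive coefficients."""
    return 1.0 / (1.0 + math.exp(-(-2.5 + 0.6 * ltv + 0.4 * dti)))

def mean_change(loans, d_ltv, d_dti):
    """Mean change in predicted probability after shifting both inputs."""
    return statistics.fmean(
        predict(ltv + d_ltv, dti + d_dti) - predict(ltv, dti) for ltv, dti in loans
    )

rng = random.Random(1)
loans = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]

steps = [0.0, 0.25, 0.5]  # perturbation sizes in SD units (inputs are standardized)
# Combined ME for every pair of shifts; keys are (shift in ltv, shift in dti).
surface = {(a, b): mean_change(loans, a, b) for a in steps for b in steps}
```

In this toy setting the joint shift always moves the prediction more than either single shift. Flipping the sign of one coefficient reverses the direction of the surface along that axis, so the shape of the surface depends directly on the signs of the individual MEs.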
In contrast to our expectation, Panel A of Figure 1 shows complementary relationships. This is simply because the original LTV in the ANN model has a negative ME. Similarly, Sirignano et al. (2016) conduct analyses by pairing the original interest rate, interest rate differentials, original loan term, FICO score, loan balance, and past delinquency behavior, which were deemed to have high integrated effects. In their study, the interaction of the original interest rate and FICO score is explained with a contour plot in the presence of several regimes in the FICO score, such that for very high FICO scores, borrowers rarely become delinquent irrespective of how high the original interest rate is. In contrast, the study states that for low FICO scores, there exists a non-linear relationship with the interest rate, such that the likelihood of delinquency increases as the interest rate increases.
With the help of PDP, changes were observed in the dependent variable as a result of changes in the values of each independent variable. While the PDP of most variables displayed the same directions as that of ME, the PDP of the house price appreciation rate had a quadratic functional form, as shown in Figure 2. This may suggest that an additional sub-group analysis for house price appreciation rate values is necessary, following Athey and Imbens (2016), Chernozhukov et al. (2018), and Torrent et al. (2020). If these two sub-groups pass the Chow test, then we should follow this direction of sub-group analysis.
Changes in default probability resulting from changes in double trigger variables, such as the original LTV and original DTI, were estimated using the PDP technique. The results from the PDP analysis are similar to those of the ME analysis. If we increase original DTI (dti) and decrease original LTV (ltv) simultaneously, the default probability increases (Figure 3).
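The way a PDP is computed can be sketched by hand for a single variable. The example is illustrative only: `predict` is a toy function deliberately built to be U-shaped in a hypothetical `hpa` (house price appreciation) input, the data are synthetic, and the grid is arbitrary.

```python
import random
import statistics

def predict(row):
    """Toy model: risk is U-shaped in hpa and rises with dti (made-up form)."""
    return min(1.0, 0.05 + 0.04 * row["hpa"] ** 2 + 0.02 * max(row["dti"], 0.0))

def partial_dependence(data, var, grid):
    """Average prediction when `var` is set to each grid value for every row."""
    return [statistics.fmean(predict({**r, var: v}) for r in data) for v in grid]

rng = random.Random(2)
data = [{"hpa": rng.uniform(-2, 2), "dti": rng.gauss(0, 1)} for _ in range(500)]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
pdp_hpa = partial_dependence(data, "hpa", grid)
```

Here the curve is lowest at the middle of the grid and higher at both ends. A single ME evaluated near the center would report an effect close to zero, while MEs evaluated at the two ends would have opposite signs, which is the kind of pattern that motivates a sub-group analysis.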
Further, the SHAP value allows an evaluation of each variable's marginal contribution by considering all correlations and interactions between them. However, this increases the time consumed for estimation. Taking this issue into consideration, a random sample of 1,000 observations was drawn to calculate the SHAP value, because it is not possible to go through all the possible combinations of input variables (Štrumbelj & Kononenko, 2011).
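Why sampling is needed, and how it can stand in for a full enumeration, can be illustrated with a generic Monte Carlo approximation of Shapley values. This is a sketch under stated assumptions: it samples feature orderings and reference observations, which is not the same as the observation sampling described above and is not the algorithm of any particular SHAP library. The model, its coefficients, and the data are made up.

```python
import math
import random

def predict(x):
    """Toy default-probability model with an interaction term (made up)."""
    z = -2.0 + 0.9 * x[0] - 0.6 * x[1] + 0.3 * x[0] * x[2]
    return 1.0 / (1.0 + math.exp(-z))

def sampled_shapley(x, background, n_samples, rng):
    """Monte Carlo Shapley values of predict at x, plus the sampled baseline."""
    n = len(x)
    phi = [0.0] * n
    baseline = 0.0
    for _ in range(n_samples):
        z = list(rng.choice(background))   # random reference observation
        order = rng.sample(range(n), n)    # random feature ordering
        prev = predict(z)
        baseline += prev
        for j in order:                    # switch features from z to x one at a time
            z[j] = x[j]
            cur = predict(z)
            phi[j] += cur - prev           # marginal contribution of feature j
            prev = cur
    return [p / n_samples for p in phi], baseline / n_samples

rng = random.Random(3)
background = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
x = [1.5, -0.5, 1.0]
phi, base = sampled_shapley(x, background, 1000, rng)
```

By construction, the estimated contributions add up to the gap between the prediction for `x` and the sampled baseline, so the decomposition stays additive even though only a small share of all orderings is visited.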
Th e results of the SHAP value in the ANN model show that the current LTV (cu_ltv) variable is of primary importance, followed by credit score, mortgage age, origination sensitive response. We conjecture that it is a general property of an ML model, which allows for correlated changes among input variables, as compared to the logit model, which assumes the input variables are independent. Table 6 compares the ME from the logit regression and RF models. While the original LTV had a negative ME in both the logit regression and ANN model, it the ME in RF was positive. Logically, it should positively affect defaults; therefore, the RF model is deemed to be more appropriate in economic theory.
We need to be cautious while considering that we have mutually contradictory ME results between models. Similar findings from a comparison among competing models were reported in Bracke et al. (2019). While year, current rate type, current LTV, current interest, and outstanding balance, in this decreasing order, were considered important in the logit model estimation, the variables in the Gradient Tree Boosting model appear in the order of year, current interest, current LTV, gross income, and current rate type, indicating wide differences between the models. While it may be difficult to generalize these results in a broad application, we assert that they would provide insights when selecting the independent variable for every default model in the spirit of hand-curating, following Mullainathan and Spiess (2017).
In addition to these results, we estimated how perturbing two variables simultaneously affects the default probability. The original LTV (ltv), current LTV (cu_ltv), and original DTI (dti) were selected. The results between original DTI and current LTV are quite similar to those found in the ANN model, while the results of the original DTI and original LTV are quite different ( Figure 5). This is due to the difference in the sign of the original LTV variables in the two ML models.  Figure 4). Additionally, an analysis of the top 10 important variables provides the same results as those found in the logit estimation analysis. Therefore, these top 10 ANN model variables coincide with the logit model variables with statistical significance at the 1% level, except in the case of mortgage age and square of mortgage age. However, 60 DPD, with significance at the 1% level in the logit analysis, is weakly important in the SHAP value analysis of the ANN model. Furthermore, house price appreciation rate and equity to debt ratio, which were not statistically significant in the logit analysis, exhibited no or minimal effect in the SHAP value analysis of the ANN model. Bracke et al. (2019) utilize a logit model to calculate unary and SHAP values and find similar characteristics to those of their logit model. The SHAP value results from their study, in decreasing order of importance, indicate that year, current rate type, current LTV, current interest, outstanding balance, gross income, and LTV affect default risk.
When we compare the ME and SHAP value analyses of the ANN model, the house price appreciation rate and squared equity to debt ratio, which were not statistically significant in the ME analysis, also had no or minimal effects in the SHAP value analysis of the ANN model. Compared to the SHAP analysis, our proposed ME analysis is much easier to implement and understand. The results of the ME analysis provide both the directional sign and statistical significance, in addition to the size of the impact provided by the SHAP value analysis.

RF model
Table 5 displays the results of calculating the ME for the RF model. The process of applying the ME was similar to that in the logit regression and ANN models. To achieve continuity of change in the probability value, 100 decision trees were generated for this analysis. Similar to the ANN-based model, which requires a relatively significant change for the ME, the RF model also has similar characteristics.
Unlike the ANN model, the RF technique displays the characteristics of a discrete model based on decision trees, resulting in a complex, non-linear, quadratic PDP. In contrast to the smoothly curved or linear forms of the ANN, drastically different forms were observed. Both positive and negative directions, depending on the value of the variable, were confirmed through PDP, as shown in Figure 6. Further, the original LTV displayed a negative direction in the logit and ANN analyses, while it takes a quadratic functional form in the case of the RF model.
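The need for a relatively large change when computing the ME of a tree-based model can be illustrated with a short sketch. It assumes that the ME is approximated by an average central finite difference in the predicted probability, which is one common choice and not necessarily the exact procedure behind Table 5. The predictor `tree_like_prob` and its threshold are hypothetical.

```python
import numpy as np


def marginal_effect(predict, X, j, delta):
    """Average change in predicted probability per unit change in
    feature j, using a central finite difference of width delta."""
    up, down = X.copy(), X.copy()
    up[:, j] += delta / 2.0
    down[:, j] -= delta / 2.0
    return (predict(up) - predict(down)).mean() / delta


def tree_like_prob(X):
    # Hypothetical piecewise-constant predictor, mimicking a tree split
    # on the first feature at 0.8.
    return 0.1 + 0.3 * (X[:, 0] > 0.8)


# One feature on an evenly spaced grid that does not hit the threshold
X = ((np.arange(100) + 0.5) / 100.0).reshape(-1, 1)

me_tiny = marginal_effect(tree_like_prob, X, 0, 1e-6)  # step too small
me_wide = marginal_effect(tree_like_prob, X, 0, 0.2)   # wider step
```

Because the predictor is flat between split points, the tiny step finds no change at all, while the wider step recovers a positive average effect. Averaging over many trees smooths the steps but does not remove them, which is consistent with using a larger perturbation for RF.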
Although the variable had a positive direction in ME, both negative and positive directions can be confirmed via PDP. This signifies that, under different circumstances, the variable can have different effects on the dependent variable. If the subgroups formed on the basis of the PDP analysis pass the Chow test, additional subgroup analysis should improve the quality of standard econometric approaches.
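The sub-grouping check mentioned above can be sketched as follows. The classical Chow test is defined for linear regressions; for a logit model, a likelihood-ratio analogue would be the natural counterpart, so the linear version below is an illustration only. The simulated data, the split point at an LTV of 0.8, and all coefficients are hypothetical.

```python
import numpy as np
from scipy import stats


def chow_test(X1, y1, X2, y2):
    """Chow F-test for equal OLS coefficients in two subgroups."""

    def rss(X, y):
        Z = np.column_stack([np.ones(len(y)), X])  # add intercept
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        return float(resid @ resid), Z.shape[1]

    rss_pooled, k = rss(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    rss1, _ = rss(X1, y1)
    rss2, _ = rss(X2, y2)
    dof = len(y1) + len(y2) - 2 * k
    f_stat = ((rss_pooled - rss1 - rss2) / k) / ((rss1 + rss2) / dof)
    p_value = stats.f.sf(f_stat, k, dof)
    return f_stat, p_value


# Hypothetical subgroups split where a PDP changes direction
rng = np.random.default_rng(1)
ltv_low = rng.uniform(0.4, 0.8, size=(200, 1))
ltv_high = rng.uniform(0.8, 1.2, size=(200, 1))
y_low = 0.10 - 0.05 * ltv_low[:, 0] + rng.normal(0.0, 0.02, 200)
y_high = -0.30 + 0.45 * ltv_high[:, 0] + rng.normal(0.0, 0.02, 200)

f_stat, p_value = chow_test(ltv_low, y_low, ltv_high, y_high)
```

A small p-value indicates that the two subgroups do not share the same coefficients, which would support estimating them separately.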
The changes in default probability resulting from changes in the double trigger variables (original LTV (ltv) and original DTI (dti)) are visualized in Figure 7. The PDP that responds to changes in both original DTI and original LTV shows a squiggly elliptical contour. This result differs from that of the ANN model and from the RF model's ME analysis. Figure 8 displays the results of applying the SHAP value to the RF model. The top 10 most important variables from the SHAP value analysis are also statistically significant at the 1% level in the RF model's ME, except for two mortgage age-related variables.
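A two-variable PDP of the kind described above can be computed by fixing both variables at each pair of grid values and averaging the predictions over the sample. The following is a minimal sketch under that standard definition; the predictor `double_trigger_prob`, its thresholds, and the grids are hypothetical and do not reproduce Figure 7.

```python
import numpy as np


def partial_dependence_2d(predict, X, j, k, grid_j, grid_k):
    """Average prediction with features j and k fixed at each pair of
    grid values, all other features left at their observed values."""
    surface = np.empty((len(grid_j), len(grid_k)))
    for a, vj in enumerate(grid_j):
        for b, vk in enumerate(grid_k):
            X_mod = X.copy()
            X_mod[:, j] = vj
            X_mod[:, k] = vk
            surface[a, b] = predict(X_mod).mean()
    return surface


def double_trigger_prob(X):
    # Hypothetical predictor: default risk rises only when both
    # LTV (column 0) and DTI (column 1) are high.
    return 0.02 + 0.3 * ((X[:, 0] > 0.9) & (X[:, 1] > 0.45))


rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(300, 3))
ltv_grid = np.linspace(0.5, 1.2, 8)
dti_grid = np.linspace(0.2, 0.6, 5)

surface = partial_dependence_2d(double_trigger_prob, X, 0, 1,
                                ltv_grid, dti_grid)
```

Plotting `surface` as a contour over the two grids gives the two-variable PDP; in this toy case, the probability is elevated only in the corner where both variables are high.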
Further, as argued for the ANN model, our proposed ME analysis is much easier to implement and understand than the SHAP analysis.

Model explainability analysis through the addition of polynomial variables
To analyze the changes in the independent variables within the black box of the ML models, the ML models were re-estimated after adding the squares and cubes of all continuous variables (see Table A.1 in the Appendix). Further, the Mann-Whitney test was conducted to confirm whether the artificial addition of polynomial variables (hereinafter referred to as the "polynomial models") led to significantly different predictions compared to those of the original model. As Table 7 shows, based on default probability, only the polynomial ANN model differed from the original model at the 1% significance level. In contrast, based on the classification results of default, the null hypothesis that there are no differences between the RF models could not be rejected. As a result, the argument that the ML model considers all possible function transformations is weakly rejected, which differs from Sirignano et al. (2016). Hence, modelers must continue to make diligent efforts in the selection of input variables in the spirit of hand-curating, following Mullainathan and Spiess (2017).
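The comparison described above can be sketched in two steps: augmenting the continuous variables with their squares and cubes, and testing whether two sets of predicted probabilities differ. The helper names, the simulated probabilities, and the 1% threshold as a default argument are assumptions made for illustration; the actual predictions behind Table 7 are not reproduced here.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def add_polynomial_terms(X, continuous_cols):
    """Append squares and cubes of the selected continuous columns."""
    cont = X[:, continuous_cols]
    return np.hstack([X, cont ** 2, cont ** 3])


def predictions_differ(p_a, p_b, alpha=0.01):
    """Two-sided Mann-Whitney U test on two sets of predictions."""
    _, p_value = mannwhitneyu(p_a, p_b, alternative="two-sided")
    return p_value < alpha, p_value


rng = np.random.default_rng(42)

# Feature augmentation on a hypothetical design matrix
X = rng.normal(size=(50, 4))
X_poly = add_polynomial_terms(X, continuous_cols=[0, 2])

# Hypothetical predicted default probabilities from two models
p_original = rng.beta(2, 20, size=1000)
p_polynomial = rng.beta(4, 20, size=1000)  # deliberately shifted

differs, p_value = predictions_differ(p_original, p_polynomial)
```

In practice, `p_original` and `p_polynomial` would be the predicted probabilities of the original and polynomial models on the same test sample. Because such predictions are paired by loan, a paired test such as the Wilcoxon signed-rank test is a possible alternative.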

Conclusions
This study employs both standard econometrics and ML techniques to estimate a residential mortgage loan default risk model and introduces economic explainability measures for its interpretation. We estimated the models' explainability measures to identify the causal influences of individual independent variables through ME, PDP, and SHAP value analyses. Additionally, the significance levels of the ME were utilized to conduct statistical tests on the ME of the respective variables on default probability. Existing empirical studies on residential mortgage default have mainly explained default factors through an econometric approach. In contrast, studies based on ML highlight the high suitability of ML models in the context of predictive power. However, this study simultaneously employs both the explainability measures of ML techniques and the ME of standard econometric approaches to illustrate that our proposed methods can be utilized to develop a better mortgage default risk model. Specifically, for a standard econometric approach, analyses using ML approaches, such as PDP and SHAP value, can provide additional information regarding the need for subgroup analysis (Bücker et al., 2020; Chernozhukov et al., 2018; Mullainathan & Spiess, 2017; Rudin, 2019). For ML models, the ME analysis provides a quasi-statistical inference technique with sign and statistical significance.
Figure 8. Results of the SHAP value in the RF model
The results from comparing the logit regression, ANN, and RF models based on the evaluation of explainability show that while the PDP, ME, and SHAP values of logit regression and ANN display similar characteristics, the characteristics of the RF are different. This is due to the structural characteristics of the RF model estimation: because RF employs a greedy heuristic approach, the differences between models do not depend on the number of trees.
This study enabled us to interpret the ML-based mortgage default risk model, which possesses the characteristics of a black box structure, in terms of ME, which has its foundations in economic theory. We expect our findings to play a practical guiding role when mortgage lending institutions utilize ML techniques for risk management of mortgage loans. From the perspective of policy formulation, our findings can be utilized by government regulators who are responsible for implementing macroprudential policies to manage household debt in the real estate financial market.
However, several limitations are associated with this study. First, it only uses event-month data instead of panel data with monthly observations. Second, we do not use a competing risks model that simultaneously deals with prepayment risk. Third, we face a number of contradictory results, depending on the estimation models, ME, and marginal importance techniques. Considering that both the standard econometric and ML models use the same data and share the same research purpose in terms of explainability and accuracy, future studies should aim to develop a unified research framework. We believe that our investigation can serve as a stepping stone toward a unified approach that can sufficiently offset this study's limitations.