--- title: "Getting Started with Fitting Fuzzy Linear Regression Models in R" author: "Pavel Škrabánek, Natália Martínková" date: "`r Sys.Date()`" output: bookdown::html_document2: base_format: rmarkdown::html_vignette toc: true vignette: > %\VignetteIndexEntry{Getting Started with Fitting Fuzzy Linear Regression Models in R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#", fig.path = "figures/", fig.height = 4.5, fig.width = 6 ) ``` Fuzzy regression provides an alternative to statistical regression when the model is indefinite, the relationships between model parameters are vague, sample size is low or when the data are hierarchically structured. The fuzzy regression is thus applicable in cases, where the data structure prevents statistical analysis. Here, we explain the implementation of fuzzy linear regression methods in the `R` package `fuzzyreg`. The chapter **Quick start** guides the user through the direct steps necessary to obtain a fuzzy regression model from crisp (not fuzzy) data. Followed by the Chapter **Interpreting the fuzzy regression model**, the user will be able to infer and interpret the fuzzy regression model. #Quick start To install, load the package and access help, run the following `R` code: ```{r, eval=FALSE} install.packages("fuzzyreg", dependencies = TRUE) require(fuzzyreg) help(package = "fuzzyreg") ``` ```{r, echo=FALSE} if(!require(fuzzyreg)) install.packages("fuzzyreg", dependencies = TRUE) ``` Next, load the example data and run the fuzzy linear regression using the wrapper function `fuzzylm` that simulates the established functionality of other regression functions and uses the `formula` and `data` arguments: ```{r} data(fuzzydat) f <- fuzzylm(formula = y ~ x, data = fuzzydat$lee) ``` The result shows the coefficients of the fuzzy linear regression in form of non-symmetric triangular fuzzy numbers. ```{r} print(f) ``` We can next plot the regression fit, using shading to indicate the degree of membership of the model predictions. ```{r plrls, fig.width = 6, fig.height = 4.5, fig.align = 'center', fig.cap = 'Fuzzy linear regression using the PLRLS method'} plot(f, res = 20, col = "lightblue", main = "PLRLS") ``` Shading visualizes the degree of membership to the triangular fuzzy number (TFN) with a gradient from lightblue to white, indicating the decrease of the degree of membership from 1 to 0 (Figure \@ref(fig:plrls)). The central tendency (thick line) in combination with the left and the right spreads determine a support interval (dotted lines) of possible values, i.e. values with non-zero degrees of membership of the model predictions. The left and the right spreads determine the lower and upper boundary of the interval, respectively, where the degree of membership equals to 0. We can display the model equations with the `summary` function. ```{r} summary(f) ``` # Triangular fuzzy numbers The package [`FuzzyNumbers`](https://CRAN.R-project.org/package=FuzzyNumbers) provides an excellent introduction into fuzzy numbers and offers a great flexibility in designing the fuzzy numbers. Here, we implement a special case of fuzzy numbers, the triangular fuzzy numbers. A fuzzy real number $\tilde{A}$ is a fuzzy set defined on the set of real numbers. Each real value number $x$ belongs to the fuzzy set $\tilde{A}$, with a degree of membership that can range from 0 to 1. The degrees of membership of $x$ are defined by the membership function $\mu_{\tilde{A}}(x):x\to[0,1]$, where $\mu_{\tilde{A}}(x^*)=0$ means that the value of $x^*$ is not included in the fuzzy number $\tilde{A}$ while $\mu_{\tilde{A}}(x^*)=1$ means that $x^*$ is positively comprehended in $\tilde{A}$ (Figure \@ref(fig:TFN)). In `FuzzyNumbers`, the fuzzy number is defined using side functions. In `fuzzyreg`, we simplify the input of the TFNs as a vector of length 3. The first element of the vector specifies the central value $x_c$, where the degree of membership is equal to 1, $\mu(x_c)=1$. The second element is the left spread, which is the distance from the central value to a value $x_l$ where $\mu(x_l)=0$ and $x_lx_c$. The right spread is equal to $x_r-x_c$. The central value $x_c$ of the TFN is its core, and the interval $(x_l,x_r)$ is the support of the TFN. ```{r, TFN, echo=FALSE, eval=TRUE, fig.width = 6.5, fig.height = 3.5, fig.align = 'center', fig.cap = 'Triangular fuzzy numbers.'} plot(1, type = "n", xlim = c(0,3), ylim=c(0,1.1), xlab = expression(italic("x")), ylab = "degree of membership", las = 1) points(matrix(c(.5,1.5,2.5,1,1,1), ncol=2), pch=19) text(c(.5,1.5,2.5),1, labels=as.expression(lapply(LETTERS[1:3], function(x) bquote(italic(.(x))))),pos=2) segments(.5,0,.5,1,lty=2) segments(1.5,0,1.5,1,lty=2) segments(2.5,0,2.5,1, lty=2) segments(.7,0,1.5,1) segments(1.9,0,1.5,1) segments(2,0,2.5,1) segments(3,0,2.5,1) ``` The crisp number $a=0.5$ can be written as a TFN $A$ with spreads equal to 0: ```{r, eval=FALSE} A <- c(0.5, 0, 0) ``` The non-symmetric TFN $B$ and the symmetric TFN $C$ are then: ```{r, eval = FALSE} B <- c(1.5, 0.8, 0.4) C <- c(2.5, 0.5, 0.5) ``` # Converting crisp numbers to TFNs When the collected data do not contain spreads, we can directly apply the PLRLS method to model the relationship between the variables in the fuzzy set framework (**Quick start**). For other methods, the spreads must be imputed. * The TFN might have spreads equal to 0 as shown for $A$ in Figure \@ref(fig:TFN). ```{r} A <- rnorm(3) fuzzify(x = A, method = "zero") ``` * Small random spreads might be generated, e.g. using `abs(runif(n) * 1e-6)`, where `n` is the number of the observations. ```{r} fuzzify(x = A, method = "err", err = abs(runif(2 * 3) * 1e-6)) ``` * Spreads might represent a known measurement error of the device used to collect the data. ```{r} fuzzify(x = A, method = "err", err = 0.2) ``` * Spreads might be calculated as a statistic from the data. ```{r} fuzzify(x = A, method = "mean") fuzzify(x = A, method = "median") ``` The spreads must always be equal to or greater than zero. Note that for the statistics-based methods, the function `fuzzify` uses a grouping variable `y` that determines which observations are included. ```{r} fuzzify(x = A, y = c(1, 2, 2), method = "median") fuzzify(x = A, y = c(1, 1, 2), method = "median") ``` # Converting TFN from class `FuzzyNumber` The conversion from an object of the class `FuzzyNumber` to TFN used in `fuzzyreg` requires adjusting the core and the support values of the `FuzzyNumber` object to the central value and the spreads. For example, let's define a trapezoidal fuzzy number $B_1$ that is identical with TFN $B$ displayed in Figure \@ref(fig:TFN), but that is an object of class `FuzzyNumber`. ```{r} require(FuzzyNumbers) B1 <- FuzzyNumber(0.7, 1.5, 1.5, 1.9, left = function(x) x, right = function(x) 1 - x, lower = function(a) a, upper = function(a) 1 - a) B1 ``` The core of $B_1$ will be equal to the central value of the TFN if the `FuzzyNumber` object is a TFN. However, the `FuzzyNumbers` package considers TFNs as a special case of trapezoidal fuzzy numbers. The core of $B_1$ thus represents the interval, where $\mu_{B_1}(x^*)=1$, and the support is the interval, where $\mu_{B_1}(x^*)>0$. We can use these values to construct the TFN $B$. ```{r} xc <- core(B1)[1] l <- xc - supp(B1)[1] r <- supp(B1)[2] - xc c(xc, l, r) ``` ```{r, echo = FALSE} detach("package:FuzzyNumbers", unload = TRUE) require(fuzzyreg) ``` When the trapezoidal fuzzy number has the core wider than one point, we need to approximate a TFN. The simplest method calculates the mean of the core as `mean(core(B1))`. We can also defuzzify the fuzzy number and approximate the central value with the `expectedValue()` function. However, the expected value is a midpoint of the expected interval of a fuzzy number derived from integrating the side functions. The expected values will not have the degree of membership equal to 1 for non-symmetric fuzzy numbers. Constructing the mean of the core might be a more appropriate method to obtain the central value of the TFN for most applications. Fuzzy numbers with non-linear side functions may have large support intervals, for which the above conversion algorithm might skew the TFN. The function `trapezoidalApproximation()` can first provide a suitable approximation of the fuzzy number with non-linear side functions, for which the core and the support values will suitably reflect the central value and the spreads used in `fuzzyreg`. # Methods for fitting fuzzy regression models Methods implemented in `fuzzyreg 0.6` fit fuzzy linear models include: ```{r, methods, echo = FALSE} tab = data.frame(Method = c("PLRLS", "PLR", "OPLR", "FLS", "MOFLR", "BFRL"), m = c("$\\infty$","$\\infty$","$\\infty$","1","$\\infty$","1"), x = c("crisp","crisp","crisp","crisp","sTFN","crisp"), y = c("crips","sTFN","sTFN","nsTFN","sTFN","nsTFN"), yhat = c("nsTFN","sTFN","sTFN","nsTFN","sTFN","nsTFN"), ref = c("[Lee & Tanaka 1999](https://doi.org/10.15807/jorsj.42.98)", "[Tanaka et al. 1989](https://doi.org/10.1016/0377-2217(89)90431-1)", "[Hung & Yang 2006](https://doi.org/10.1016/j.fss.2006.08.004)", "[Diamond 1988](https://doi.org/10.1016/0020-0255(88)90047-3)", "[Nasrabadi et al. 2005](https://doi.org/10.1016/j.amc.2004.02.008)", "[Škrabánek et al. 2021](https://doi.org/10.3390/math9060685)")) colnames(tab) = c("Method", "$m$", "$x$", "$y$", "$\\hat{y}$", "Reference") knitr::kable(tab, format = "html", escape = FALSE, table.attr = "style='width:90%;'", caption = "Methods for fitting fuzzy linear regression with `fuzzyreg`. $m$ -- number of allowed independent variables $x$; $x$, $y$, $\\hat{y}$ -- type of expected number for independent, dependent variables and predictions; s -- symmetric; ns -- non-symmetric.", align = c("l", rep("c", 5))) ``` Methods that require symmetric TFNs handle input specifying one spread, but in methods expecting non-symmetric TFN input, both spreads must be defined even in cases when the data contain symmetric TFNs. A *possibilistic linear regression* (PLR) is a paradigm of whole family of possibilistic-based fuzzy estimators. It was proposed for crisp observations of the explanatory variables and symmetric fuzzy observations of the response variable. `fuzzyreg` uses the min problem implementation that estimates the regression coefficients in such a way that the spreads for the model response are minimal needed to include all observed data. Consequently, the outliers in the data will increase spreads in the estimated coefficients. The *possibilistic linear regression combined with the least squares* (PLRLS) method fits the model prediction spreads and the central tendency with the possibilistic and the least squares approach, respectively. The input data represent crisp numbers and the model predicts the response in form of a non-symmetric TFN. Local outliers in the data strongly influence the spreads, so a good practice is to remove them prior to the analysis. The OPLR method expands *PLR by adding an omission approach for detecting outliers*. We implemented a version that identifies a single outlier in the data located outside of the Tukey's fences. The input data include crisp explanatory variables and the response variable in form of a symmetric TFN. *Fuzzy least squares* (FLS) method supports a simple FLR for a non-symmetric TFN explanatory as well as a response variable. This probabilistic-based method (FLS calculates the fuzzy regression coefficients using least squares) is relatively robust against outliers compared to the possibilistic-based methods. A *multi-objective fuzzy linear regression* (MOFLR) method estimates the fuzzy regression coefficients with a possibilistic approach from symmetric TFN input data. Given a specific weight, the method determines a trade-off between outlier penalization and data fitting that enables the user to fine-tune outlier handling in the analysis. The *Boscovich fuzzy regression line* (BFRL) fits a simple model predicting non-symmetric fuzzy numbers from crisp descriptor. The prediction returns non-symmetric triangular fuzzy numbers. The intercept is a non-symmetric triangular fuzzy number and the slope is a crisp number. # Fitting a fuzzy linear regression The TFN definition used in `fuzzyreg` enables an easy setup of the regression model using the well-established syntax for regression analyses in `R`. The model is set up from a `data.frame` that contains all observations for the dependent variable and the independent variables. The `data.frame` must contain columns with the respective spreads for all variables that are TFNs. The example data from [Nasrabadi et al. (2015)](https://doi.org/10.1016/j.amc.2004.02.008) contain symmetric TFNs. The spreads for the independent variable $x$ are in the column `xl` and all values are equal to 0. The spreads for the dependent variable $y$ are in the column `yl`. ```{r} fuzzydat$nas ``` Note that the data contain only one column with spreads per variable. This is an accepted format for symmetric TFNs, because the values can be recycled for the left and right spreads. The `formula` argument used to invoke a fuzzy regression with the `fuzzylm()` function will relate $y \sim x$. The columns `x` and `y` contain the central values of the variables. The spreads are not included in the `formula`. To calculate the fuzzy regression from TFNs, list the column names with the spreads as a character vector in the respective arguments of the `fuzzylm` function. ```{r} f2 <- fuzzylm(formula = y ~ x, data = fuzzydat$nas, fuzzy.left.x = "xl", fuzzy.left.y = "yl", method = "moflr") ``` Calls to methods that analyse non-symmetric TFNs must include both arguments for the left and right spreads, respectively. The arguments specifying spreads for the dependent variable are `fuzzy.left.y` and `fuzzy.right.y`. However, if we wish to analyse symmetric TFNs using a method for non-symmetric TFNs, both argumets might call the same column with the values for the spreads. ```{r} f3 <- fuzzylm(y ~ x, data = fuzzydat$nas, fuzzy.left.y = "yl", fuzzy.right.y = "yl", method = "fls") ``` As the spreads are included in the model using the column names, the function cannot check whether the provided information is correct. The issue gains importance when developing a multiple fuzzy regression model. The user must ascertain that the order of the column names for the spreads corresponds to the order of the variables in the ```formula``` argument. The fuzzy regression models can be used to predict new data within the range of data used to infer the model with the ```predict``` function. The reason for disabling extrapolations from the fuzzy regression models lies in the non-negligible risk that the support boundaries for the TFN might intersect the central tendency. The predicted TFNs outside of the range of data might not be defined. The predicted values will replace the original variable values in the `fuzzylm` data structure in the element `y`. ```{r} pred2 <- predict(f2) pred2$y ``` # Comparing fuzzy regression models The choice of the method to estimate the parameters of the fuzzy regression model is data-driven. Following the application of the suitable methods, fuzzy regression models can be compared according to the sum of differences between membership values of the observed and predicted membership functions. In Figure \@ref(fig:TEF), the points represent central values of the observations and whiskers indicate their spreads. The shaded area shows the model predictions with the degree of membership greater than zero. We can compare the models numerically using the total error of fit $\sum{E}$ with the `TEF()` function: ```{r, TEF, echo=FALSE, eval=TRUE, fig.cap = 'Comparison of MOFLR and FLS model fits to the same data.', fig.width = 7, fig.height = 3.5, fig.align = 'center'} oldpar <- par() par(mfrow=c(1,2)) plot(f2, res=20, col.fuzzy="lightblue", main = "f2 - MOFLR") plot(f3, res=20, col.fuzzy = "lightblue", main = "f3 - FLS") ``` ```{r} TEF(f2) TEF(f3) ``` Lower values of $\sum{E}$ mean that the predicted TFNs fit better with the observed TFNs. # Interpreting the fuzzy regression model When comparing the fuzzy linear regression and a statistical linear regression models, we can observe that while the fuzzy linear regression shows something akin to a confidence interval, the interval differs from the confidence intervals derived from a statistical linear regression model (Figure \@ref(fig:regfig)). ```{r, regfig, eval=TRUE, echo=FALSE, warning = F, fig.cap = 'Fuzzy and statistical linear regression. Fuzzy regression displays the central value and the support boundaries determined by the left and right spread (PLRLS - left panel). Statistical least-squares regressionshows the confidence interval for the regression fit (LS - right panel).', fig.width = 7, fig.height = 3.5, fig.align = 'center'} par(mfrow=c(1,2)) f4 = fuzzylm(y~x, fuzzydat$lee, method="plrls") plot(f4, res=20, col.fuzzy="lightblue", main="PLRLS") f1 = lm(y~x, fuzzydat$lee) newx = seq(min(fuzzydat$lee$x), max(fuzzydat$lee$y), length.out = 500) conf1 = predict(f1, newdata=data.frame(x=newx), interval="confidence") plot(fuzzydat$lee$x, fuzzydat$lee$y, xlab = "x", ylab = "y", main="LS") abline(f1) lines(newx, conf1[,2], lty=2) lines(newx, conf1[,3], lty=2) par(oldpar) ``` The confidence interval from a statistical regression model shows the certainty that the modeled relationship fits within. We are 95\% certain that the true relationship between the variables is as displayed. On the other hand, the support of the fuzzy regression model prediction shows the range of possible values. The dependent variable can reach any value from the set, but the values more distant from the central tendency will have smaller degree of membership. We can imagine the values closer to the boundaries as vaguely disappearing from the set as if they bleached out (gradient towards white in the fuzzy regression model plots). # How to cite To cite `fuzzyreg`, include the reference to the software and the used method. > Škrabánek P. and Martínková N. 2021. Algorithm 1017: fuzzyreg: An R Package for > Fuzzy Linear Regression Models. *ACM Trans. Math. Softw.* **47:** 29. doi: 10.1145/3451389. The references for the specific methods are given in the above reference, through links in Table \@ref(tab:methods) or accessible through the method help, e.g. the default fuzzy linear regression method PLRLS: ```{r, eval=FALSE} ?plrls ```