Fuzzy regression provides an alternative to statistical regression when the model is indefinite, the relationships between model parameters are vague, the sample size is low, or the data are hierarchically structured. Fuzzy regression is thus applicable in cases where the data structure prevents statistical analysis.
Here, we explain the implementation of fuzzy linear regression methods in the R package fuzzyreg. The chapter Quick start guides the user through the steps necessary to obtain a fuzzy regression model from crisp (not fuzzy) data. The following chapter, Interpreting the fuzzy regression model, then shows how to infer and interpret the fuzzy regression model.
# Quick start

To install, load the package and access help, run the following R code:
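A minimal sketch of these three steps, assuming the package is installed from CRAN (the mirror URL is only an example):

```r
# Install fuzzyreg once, if it is not yet available (assumes CRAN access).
if (!requireNamespace("fuzzyreg", quietly = TRUE)) {
  install.packages("fuzzyreg", repos = "https://cloud.r-project.org")
}

# Load the package and open the package help index.
require(fuzzyreg)
help(package = "fuzzyreg")
```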
# Loading required package: fuzzyreg
Next, load the example data and run the fuzzy linear regression using the wrapper function fuzzylm, which mimics the established interface of other regression functions and uses the formula and data arguments:
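A sketch of this step, with the model call copied from the printed output of the fitted model and assuming that the example data are made available with data(fuzzydat):

```r
library(fuzzyreg)

# Load the example data sets shipped with the package.
data(fuzzydat)

# Fit a fuzzy linear model to crisp data with the default method (PLRLS).
f <- fuzzylm(formula = y ~ x, data = fuzzydat$lee)
```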
The result shows the coefficients of the fuzzy linear regression in the form of non-symmetric triangular fuzzy numbers.
print(f)
#
# Fuzzy linear model using the PLRLS method
#
# Call:
# fuzzylm(formula = y ~ x, data = fuzzydat$lee)
#
# Coefficients in form of non-symmetric triangular fuzzy numbers:
#
# center left.spread right.spread
# (Intercept) 17.761911 0.7619108 2.7380892
# x 2.746428 1.2464280 0.4202387
We can next plot the regression fit, using shading to indicate the degree of membership of the model predictions.
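A possible plotting call is sketched here. The arguments res (resolution of the shading) and col.fuzzy (shading colour) and their values are assumptions chosen for illustration, and the model is refitted so that the example is self-contained.

```r
library(fuzzyreg)
data(fuzzydat)
f <- fuzzylm(formula = y ~ x, data = fuzzydat$lee)

# res sets the number of shading levels, col.fuzzy the shading colour;
# both values are illustrative choices.
plot(f, res = 20, col.fuzzy = "lightblue")
```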
Shading visualizes the degree of membership to the triangular fuzzy number (TFN) with a gradient from lightblue to white, indicating the decrease of the degree of membership from 1 to 0 (Figure @ref(fig:plrls)). The central tendency (thick line) in combination with the left and the right spreads determines a support interval (dotted lines) of possible values, i.e. values with non-zero degrees of membership of the model predictions. The left and the right spreads determine the lower and upper boundary of the interval, respectively, where the degree of membership equals 0. We can display the model equations with the summary function.
summary(f)
#
# Central tendency of the fuzzy regression model:
# 17.7619 + 2.7464 * x
#
# Lower boundary of the model support interval:
# 17 + 1.5 * x
#
# Upper boundary of the model support interval:
# 20.5 + 3.1666 * x
#
# The total error of fit: 126248409
# The mean squared distance between response and prediction: 262.1
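The boundary equations can be related to the printed coefficients with a few lines of base R. This is a sketch under the assumption that the lower boundary subtracts the left spread from each coefficient and the upper boundary adds the right spread, which reproduces the equations above up to rounding for non-negative x.

```r
# Coefficients as printed by print(f): central value, left and right spread.
coefs <- rbind(
  "(Intercept)" = c(center = 17.761911, left.spread = 0.7619108,
                    right.spread = 2.7380892),
  x             = c(center = 2.746428, left.spread = 1.2464280,
                    right.spread = 0.4202387)
)

# Lower boundary: central value minus the left spread.
lower <- coefs[, "center"] - coefs[, "left.spread"]
# Upper boundary: central value plus the right spread.
upper <- coefs[, "center"] + coefs[, "right.spread"]

round(rbind(lower, upper), 4)
```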
The package FuzzyNumbers provides an excellent introduction to fuzzy numbers and offers great flexibility in designing them. Here, we implement a special case of fuzzy numbers, the triangular fuzzy numbers.
A fuzzy real number Ã is a fuzzy set defined on the set of real numbers. Each real number x belongs to the fuzzy set Ã with a degree of membership that can range from 0 to 1. The degrees of membership of x are defined by the membership function μÃ(x) : x → [0, 1], where μÃ(x*) = 0 means that the value x* is not included in the fuzzy number Ã, while μÃ(x*) = 1 means that x* is fully included in Ã (Figure @ref(fig:TFN)).
In FuzzyNumbers, the fuzzy number is defined using side functions. In fuzzyreg, we simplify the input of the TFNs as a vector of length 3. The first element of the vector specifies the central value xc, where the degree of membership is equal to 1, μ(xc) = 1. The second element is the left spread, which is the distance from the central value to a value xl where μ(xl) = 0 and xl < xc. The left spread is thus equal to xc − xl. The third element of the TFN is the right spread, i.e. the distance from the central value to a value xr where μ(xr) = 0 and xr > xc. The right spread is equal to xr − xc. The central value xc of the TFN is its core, and the interval (xl, xr) is the support of the TFN.
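To make the three-element notation concrete, the following base R sketch evaluates the degree of membership of a TFN with linear sides. The helper tfn_membership() is hypothetical and not part of fuzzyreg, and the TFN values are illustrative.

```r
# Degree of membership of values x in a TFN given as c(center, left, right).
# tfn_membership() is a hypothetical helper, not a fuzzyreg function.
tfn_membership <- function(x, tfn) {
  xc <- tfn[1]; l <- tfn[2]; r <- tfn[3]
  mu <- numeric(length(x))            # 0 outside the support
  left_side  <- x < xc & x > xc - l
  right_side <- x > xc & x < xc + r
  mu[left_side]  <- (x[left_side] - (xc - l)) / l
  mu[right_side] <- ((xc + r) - x[right_side]) / r
  mu[x == xc] <- 1                    # the core
  mu
}

tfn <- c(1.5, 0.8, 0.4)               # illustrative values
tfn_membership(c(0.7, 1.1, 1.5, 1.7, 1.9), tfn)
```

The membership rises linearly from the lower support boundary to the central value and falls linearly to the upper boundary. With zero spreads, only the central value itself has a non-zero degree of membership.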
The crisp number a = 0.5 can be written as a TFN A with spreads equal to 0:
The non-symmetric TFN B and the symmetric TFN C are then:
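In the vector notation, these TFNs can be sketched as follows. A follows directly from the text. The values of B are chosen to match the support [0.7, 1.9] and core 1.5 of the fuzzy number B1 printed later in this document. The values of C are illustrative assumptions only.

```r
A <- c(0.5, 0, 0)       # crisp number 0.5: both spreads are zero
B <- c(1.5, 0.8, 0.4)   # non-symmetric TFN with support (0.7, 1.9)
C <- c(2.5, 0.5, 0.5)   # symmetric TFN; values are illustrative only
```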
When the collected data do not contain spreads, we can directly apply the PLRLS method to model the relationship between the variables in the fuzzy set framework (Quick start). For other methods, the spreads must be imputed.
A <- rnorm(3)
fuzzify(x = A, method = "zero")
# Ac Al Ar y
# [1,] 0.4991426 0 0 1
# [2,] 0.6085837 0 0 1
# [3,] 0.2613014 0 0 1
User-defined spreads can be supplied with the err method, for example as small random values abs(runif(n) * 1e-6), where n is the number of the observations.
fuzzify(x = A, method = "err", err = abs(runif(2 * 3) * 1e-6))
# Ac Al Ar y
# 1 0.4991426 3.798374e-07 1.771781e-07 1
# 2 0.6085837 5.284693e-07 6.561654e-07 1
# 3 0.2613014 6.030391e-07 2.476471e-07 1
fuzzify(x = A, method = "err", err = 0.2)
# Ac Al Ar y
# 1 0.4991426 0.2 0.2 1
# 2 0.6085837 0.2 0.2 1
# 3 0.2613014 0.2 0.2 1
fuzzify(x = A, method = "mean")
# Ac Al Ar y
# 1 0.4563426 0.1775532 0.1775532 1
fuzzify(x = A, method = "median")
# Ac Al Ar y
# 1 0.4991426 0.1189206 0.05472053 1
The spreads must always be equal to or greater than zero. Note that for the statistics-based methods, the function fuzzify uses a grouping variable y that determines which observations are included.
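The results printed for the statistics-based methods are consistent with the following base R interpretation. It is inferred from the numbers above and is not taken from the package source, so treat it as an assumption: the mean method uses the mean as the central value and the standard deviation as both spreads, and the median method uses the median with the distances to the first and third quartiles as the left and right spreads.

```r
# Observations copied from the printed output above.
A <- c(0.4991426, 0.6085837, 0.2613014)

# "mean": central value = mean, both spreads = standard deviation.
tfn_mean <- c(Ac = mean(A), Al = sd(A), Ar = sd(A))

# "median": central value = median, spreads = distances to the quartiles.
q <- quantile(A, probs = c(0.25, 0.5, 0.75), names = FALSE)
tfn_median <- c(Ac = q[2], Al = q[2] - q[1], Ar = q[3] - q[2])

round(rbind(tfn_mean, tfn_median), 7)
```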
The conversion from an object of the class FuzzyNumber to the TFN used in fuzzyreg requires adjusting the core and the support values of the FuzzyNumber object to the central value and the spreads. For example, let's define a trapezoidal fuzzy number B1 that is identical to the TFN B displayed in Figure @ref(fig:TFN), but that is an object of class FuzzyNumber.
require(FuzzyNumbers)
# Loading required package: FuzzyNumbers
B1 <- FuzzyNumber(0.7, 1.5, 1.5, 1.9,
left = function(x) x,
right = function(x) 1 - x,
lower = function(a) a,
upper = function(a) 1 - a)
B1
# Fuzzy number with:
# support=[0.7,1.9],
# core=[1.5,1.5].
The core of B1 will be equal to the central value of the TFN if the FuzzyNumber object is a TFN. However, the FuzzyNumbers package considers TFNs as a special case of trapezoidal fuzzy numbers. The core of B1 thus represents the interval where μB1(x*) = 1, and the support is the interval where μB1(x*) > 0. We can use these values to construct the TFN B.
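A base R sketch of this construction, using the support and core values printed for B1. With FuzzyNumbers loaded, the same values are assumed to be available through supp(B1) and core(B1).

```r
# Support and core of B1 as printed above; with FuzzyNumbers loaded they
# are assumed to be returned by supp(B1) and core(B1).
support_B1 <- c(0.7, 1.9)
core_B1    <- c(1.5, 1.5)

xc <- mean(core_B1)              # central value (the core is a single point)
B  <- c(xc,                      # central value
        xc - support_B1[1],      # left spread
        support_B1[2] - xc)      # right spread
B
```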
When the trapezoidal fuzzy number has a core wider than one point, we need to approximate a TFN. The simplest method calculates the central value as the mean of the core, mean(core(B1)).
We can also defuzzify the fuzzy number and approximate the central value with the expectedValue() function. However, the expected value is a midpoint of the expected interval of a fuzzy number derived from integrating the side functions. The expected values will not have the degree of membership equal to 1 for non-symmetric fuzzy numbers. Constructing the mean of the core might be a more appropriate method to obtain the central value of the TFN for most applications.
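The difference can be checked numerically for B1 in base R. The sketch assumes the usual definition of the expected interval as the integrals of the lower and upper alpha-cut bounds over [0, 1]. It does not call expectedValue() itself.

```r
# Alpha-cut bounds of the fuzzy number B1 with linear side functions.
lower_cut <- function(alpha) 0.7 + alpha * (1.5 - 0.7)
upper_cut <- function(alpha) 1.9 - alpha * (1.9 - 1.5)

# Expected interval: integrals of the alpha-cut bounds over [0, 1].
ei <- c(integrate(lower_cut, 0, 1)$value,
        integrate(upper_cut, 0, 1)$value)
expected_value <- mean(ei)       # midpoint of the expected interval

# Degree of membership of the expected value on the left side of B1.
mu_expected <- (expected_value - 0.7) / (1.5 - 0.7)

c(expected_value = expected_value, membership = mu_expected)
```

Under these assumptions the expected value of B1 lies to the left of the core and its degree of membership is below 1, whereas the mean of the core returns 1.5.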
Fuzzy numbers with non-linear side functions may have large support intervals, for which the above conversion algorithm might skew the TFN. The function trapezoidalApproximation() can first provide a suitable approximation of the fuzzy number with non-linear side functions, for which the core and the support values will suitably reflect the central value and the spreads used in fuzzyreg.
Methods implemented in fuzzyreg 0.6 to fit fuzzy linear models include:
Method | m (max. number of explanatory variables) | x | y | ŷ (prediction) | Reference |
---|---|---|---|---|---|
PLRLS | ∞ | crisp | crisp | nsTFN | Lee & Tanaka 1999 |
PLR | ∞ | crisp | sTFN | sTFN | Tanaka et al. 1989 |
OPLR | ∞ | crisp | sTFN | sTFN | Hung & Yang 2006 |
FLS | 1 | crisp | nsTFN | nsTFN | Diamond 1988 |
MOFLR | ∞ | sTFN | sTFN | sTFN | Nasrabadi et al. 2005 |
BFRL | 1 | crisp | nsTFN | nsTFN | Škrabánek et al. 2021 |
Methods that require symmetric TFNs handle input specifying one spread, but in methods expecting non-symmetric TFN input, both spreads must be defined even in cases when the data contain symmetric TFNs.
Possibilistic linear regression (PLR) is the paradigm of a whole family of possibilistic-based fuzzy estimators. It was proposed for crisp observations of the explanatory variables and symmetric fuzzy observations of the response variable. fuzzyreg uses the min problem implementation that estimates the regression coefficients in such a way that the spreads for the model response are the minimum needed to include all observed data. Consequently, outliers in the data will increase the spreads in the estimated coefficients.
The possibilistic linear regression combined with the least squares (PLRLS) method fits the model prediction spreads and the central tendency with the possibilistic and the least squares approach, respectively. The input data represent crisp numbers and the model predicts the response in the form of a non-symmetric TFN. Local outliers in the data strongly influence the spreads, so a good practice is to remove them prior to the analysis.
The OPLR method expands PLR by adding an omission approach for detecting outliers. We implemented a version that identifies a single outlier in the data located outside Tukey's fences. The input data include crisp explanatory variables and the response variable in the form of a symmetric TFN.
The fuzzy least squares (FLS) method supports a simple FLR for a non-symmetric TFN explanatory as well as a response variable. This probabilistic-based method (FLS calculates the fuzzy regression coefficients using least squares) is relatively robust against outliers compared to the possibilistic-based methods.
A multi-objective fuzzy linear regression (MOFLR) method estimates the fuzzy regression coefficients with a possibilistic approach from symmetric TFN input data. Given a specific weight, the method determines a trade-off between outlier penalization and data fitting that enables the user to fine-tune outlier handling in the analysis.
The Boscovich fuzzy regression line (BFRL) fits a simple model that predicts non-symmetric triangular fuzzy numbers from a crisp descriptor. The intercept is a non-symmetric triangular fuzzy number and the slope is a crisp number.
The TFN definition used in fuzzyreg enables an easy setup of the regression model using the well-established syntax for regression analyses in R. The model is set up from a data.frame that contains all observations for the dependent variable and the independent variables. The data.frame must contain columns with the respective spreads for all variables that are TFNs.
The example data from Nasrabadi et al. (2005) contain symmetric TFNs. The spreads for the independent variable x are in the column xl and all values are equal to 0. The spreads for the dependent variable y are in the column yl.
fuzzydat$nas
# x xl y yl
# 1 1 0 6.4 2.2
# 2 2 0 8.0 1.8
# 3 3 0 16.5 2.6
# 4 4 0 11.5 2.6
# 5 5 0 13.0 2.4
Note that the data contain only one column with spreads per variable. This is an accepted format for symmetric TFNs, because the values can be recycled for the left and right spreads.
The formula argument used to invoke a fuzzy regression with the fuzzylm() function will relate y ∼ x. The columns x and y contain the central values of the variables. The spreads are not included in the formula. To calculate the fuzzy regression from TFNs, list the column names with the spreads as a character vector in the respective arguments of the fuzzylm function.
f2 <- fuzzylm(formula = y ~ x, data = fuzzydat$nas,
fuzzy.left.x = "xl",
fuzzy.left.y = "yl", method = "moflr")
# Warning in fuzzylm(formula = y ~ x, data = fuzzydat$nas, fuzzy.left.x = "xl", :
# fuzzy spreads detected - assuming same variable order as in formula
Calls to methods that analyse non-symmetric TFNs must include the arguments for both the left and the right spreads. The arguments specifying spreads for the dependent variable are fuzzy.left.y and fuzzy.right.y. However, if we wish to analyse symmetric TFNs using a method for non-symmetric TFNs, both arguments can refer to the same column with the values for the spreads.
f3 <- fuzzylm(y ~ x, data = fuzzydat$nas,
fuzzy.left.y = "yl",
fuzzy.right.y = "yl", method = "fls")
# Warning in fuzzylm(y ~ x, data = fuzzydat$nas, fuzzy.left.y = "yl",
# fuzzy.right.y = "yl", : fuzzy spreads detected - assuming same variable order
# as in formula
As the spreads are included in the model using the column names, the function cannot check whether the provided information is correct. The issue gains importance when developing a multiple fuzzy regression model. The user must ascertain that the order of the column names for the spreads corresponds to the order of the variables in the formula argument.
The fuzzy regression models can be used with the predict function to predict new data within the range of the data used to infer the model. Extrapolation from the fuzzy regression models is disabled because of the non-negligible risk that the support boundaries for the TFN might intersect the central tendency, so that the predicted TFNs outside of the range of the data might not be defined. The predicted values replace the original variable values in the element y of the fuzzylm data structure.
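The risk of intersecting boundaries can be illustrated with the model equations from the Quick start summary. The base R sketch below only solves for the crossing points of the printed lines. It is a geometric illustration and makes no claim about how predict works internally. The helper crossing() is hypothetical.

```r
# Model equations from the summary output (intercept, slope).
central <- c(17.7619, 2.7464)
lower   <- c(17,      1.5)
upper   <- c(20.5,    3.1666)

# x at which a boundary line crosses the central tendency line.
# crossing() is a hypothetical helper for this illustration.
crossing <- function(boundary, central) {
  (boundary[1] - central[1]) / (central[2] - boundary[2])
}

c(lower = crossing(lower, central), upper = crossing(upper, central))
```

Both crossing points are at negative x. Beyond them, a boundary line lies on the wrong side of the central tendency, so the three values no longer form a valid TFN.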
The choice of the method to estimate the parameters of the fuzzy regression model is data-driven. Following the application of the suitable methods, fuzzy regression models can be compared according to the sum of differences between membership values of the observed and predicted membership functions.
In Figure @ref(fig:TEF), the points represent central values of the observations and whiskers indicate their spreads. The shaded area shows the model predictions with the degree of membership greater than zero. We can compare the models numerically using the total error of fit ∑E with the TEF() function:
Lower values of ∑E mean that the predicted TFNs fit better with the observed TFNs.
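A sketch of such a comparison, reusing the two model calls shown earlier for the fuzzydat$nas data. The choice of these two models is an assumption for illustration and need not match the models in the figure.

```r
library(fuzzyreg)
data(fuzzydat)

# Two models fitted to the same symmetric TFN data (calls as used above);
# the warnings about the variable order are suppressed for brevity.
f2 <- suppressWarnings(fuzzylm(y ~ x, data = fuzzydat$nas,
                               fuzzy.left.x = "xl", fuzzy.left.y = "yl",
                               method = "moflr"))
f3 <- suppressWarnings(fuzzylm(y ~ x, data = fuzzydat$nas,
                               fuzzy.left.y = "yl", fuzzy.right.y = "yl",
                               method = "fls"))

# Total error of fit for each model; the lower value indicates a better fit.
c(moflr = TEF(f2), fls = TEF(f3))
```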
When comparing fuzzy linear regression and statistical linear regression models, we can observe that while the fuzzy linear regression shows something akin to a confidence interval, the interval differs from the confidence intervals derived from a statistical linear regression model (Figure @ref(fig:regfig)).
The confidence interval from a statistical regression model shows the range within which the modeled relationship lies with a given certainty. With a 95% confidence interval, we are 95% certain that the true relationship between the variables lies within the displayed interval.
On the other hand, the support of the fuzzy regression model prediction shows the range of possible values. The dependent variable can reach any value from the set, but the values more distant from the central tendency will have a smaller degree of membership. We can imagine the values closer to the boundaries as vaguely disappearing from the set as if they bleached out (gradient towards white in the fuzzy regression model plots).
To cite fuzzyreg, include the reference to the software and the method used.
Škrabánek P. and Martínková N. 2021. Algorithm 1017: fuzzyreg: An R Package for Fuzzy Linear Regression Models. ACM Trans. Math. Softw. 47: 29. doi: 10.1145/3451389.
The references for the specific methods are given in the above reference, through links in Table @ref(tab:methods) or accessible through the method help, e.g. the default fuzzy linear regression method PLRLS:
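A sketch of how the help page and the citation entry could be opened, assuming that the method function is exported under the name plrls:

```r
library(fuzzyreg)

# Open the help page of the default method, which lists its reference.
help("plrls", package = "fuzzyreg")

# Citation entry for the package itself.
citation("fuzzyreg")
```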