FSinR package

This is the official website of the FSinR package.

CRAN Package page`

A Short Introduction to the FSinR Package"

FSinR Package

The package FSinR contains functions for performing the feature selection task. More specifically, it contains a large number of filter and wrapper functions widely used in the literature that can be integrated into search methods, although they can also be executed individually. The FSinR package uses the functions for training classification and regression models available in the R caret package to generate wrapper measurements. This gives the package a great background of methods and functionalities. In addition, the package has been implemented in such a way that its use is as easy and intuitive as possible. This is why the calls to all search methods and all filter and wrapper functions follow the same structure.

The way to install the package from the CRAN repository is as follows:

install.packages("FSinR")

As mentioned above, the package contains numerous filter and wrapper methods that can be executed as evaluation measures within a search algorithm in order to find a subset of features. This subset of features is used to generate models that represent the data set in a better manner. Therefore, the best way to use the filter and wrapper methods is through the search functions (although they can also be run independently). The search functions present in the package are the following:

These search functions are the main functions on which the package works. The structure of all of them is the same and they contain the same following parameters:

It is important to note that the FSinR package does not split the data into training data and test data, but instead applies the feature selection over the entire data set passed to it as a parameter. Then, in a modeling process, the data should be separated by the user prior to the whole process of the training and test sets. Missing data must be processed prior to the use of the package. The following are some examples of using the package along with a brief description of how it works.

Wrapper example

To demonstrate in a simple manner how wrapper methods work, the iris dataset will be used in this example. The dataset consists in 150 instances of 4 variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) that determine the type of iris plant. The target variable, Species, has 3 possible classes (setosa, versicolor, virginica). For a correct use of the package, the data on which the feature selection is performed must be the training data. But the main objective of this vignette is to illustrate the use of the package, and not the complete modeling process, so in this case the whole dataset will be used without partitioning.

library(caret)
library(FSinR)

data(iris)

In the package, wrapper methods are passed as an evaluation measure to the search algorithms. The possible wrapper methods to use are the 238 models available in caret. In addition, the caret package offers the possibility of establishing a group of options to personalize the models (eg. resampling techniques, evaluation measurement, grid parameters, …) using the trainControl and train functions. In FSinR, the wrapperGenerator function is used to set all these parameters and use them to generate the wrapper model using as background the methods of caret. The wrapperGenerator function has as parameters:

In the example we use a knn model, since the iris problem is a classification problem. The FSinR package is able to detect automatically depending on the metric whether the objective of the problem is to maximize or minimize. To tune the model, the resampling method is established as a 10-fold crossvalidation, the dataset is centered and scaled, the accuracy is used as a metric, and a grid of the k parameter is performed.

resamplingParams <- list(method = "cv", number = 10)
fittingParams <- list(preProc = c("center", "scale"), metric="Accuracy", tuneGrid = expand.grid(k = c(1:20)))

wrapper <- wrapperGenerator("knn", resamplingParams, fittingParams)

For more details, the way in which caret train and tune a model can be seen here. The link contain tutorials on how to use the caret functions, and also show the parameters that accept the functions and the possible values they can take. A list of available models in caret can be found here.

The wrapper model is obtained as a result of the call to the wrapperGenerator function, and is passed as a parameter to the search function. The search algorithm used in this example is ts (taboo search). As mentioned earlier, the search methods have 3 parameters that are always present, which are the dataset, the class name, and the wrapper or filter method. And in addition, each algorithm has its own parameters for optimal modeling. An example of a call to the default search function is as follows:

result.search.fs <- ts(iris, "Species", wrapper)

But it is also possible to set some specific parameters of the search method. In this case, the number of algorithm iterations, the size of the taboo list, as well as an intensification phase and a diversification phase, both of 5 iterations each, are established. Most FSinR package search algorithms include a parameter called verbose, which if set to TRUE shows the development and information of the iterations of the algorithms per console.

result.search.fs <- ts(iris, 'Species', wrapper, tamTabuList = 4, iter = 5, intensification=2, iterIntensification=5, diversification=1, iterDiversification=5, verbose=FALSE)

The search algorithm call returns a list of the most important results and the most important details of the process.

result.search.fs$bestFeatures
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
result.search.fs$bestFitness
#> [1] 0.98

In the example, the output only shows the best subset of features chosen in the feature selection process and the accuracy measurement obtained. In this particular case, the taboo search result also returns the status of the taboo list in each iteration, as well as the subset of features chosen in each iteration. Although in this example is not shown to be extensive.

The wrapper method generated above can also be used directly, without being inside a search algorithm. To do this, the data set, the name of the variable to be predicted, and a vector with the names of the features to be taken into account are passed to the method as parameters.

wrapperMeasure <- wrapper(iris,"Species",c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
wrapperMeasure
#> [1] 0.9666667

This returns as a result the evaluation measure of the wrapper method on the data set with all variables passed as parameter. As it can be seen, this value is lower, and therefore worse, than the one obtained with the feature selection process.

The above example is a classification problem, but the package methods can also be applied to regression problems. For this, we use the dataset mtcars which contains 10 design variables from 32 automobiles. The variable to predict, mpg, is a numerical variable.

data(mtcars)

The wrapper function has to be generated again. In this case it can be seen how to use a repeated 10-fold cross validation and how the metric also changes to a metric for regression problems, RMSE. In addition, in this case a neural network will be used to fit the model, and therefore has its specific parameters, size and decay.

resamplingParams <- list(method="repeatedcv", number = 10, repeats = 3)
fittingParams <- list(preProc = c("center", "scale"), metric="RMSE", tuneGrid = expand.grid(size = seq(1,12,by=2), decay=0), trace=FALSE)

wrapper <- wrapperGenerator("nnet",resamplingParams, fittingParams)

The search algorithm is then executed. In this case a genetic algorithm is used using the ga function and some parameters are defined.

result.search.fs <- ga(mtcars, 'mpg', wrapper, popSize = 10, pcrossover = 0.8, pmutation = 0.1, maxiter=5, verbose=TRUE)

The search algorithm returns again a list with the best subset of features found and the evaluation measure obtained. In addition, this algorithm also returns the final population of individuals along with their evaluation measure.

result.search.fs
#> $bestFeatures
#>      cyl disp hp drat wt qsec vs am gear carb
#> [1,]   1    0  1    0  0    1  1  1    1    0
#> 
#> $bestFitness
#> [1] 19.59814
#> 
#> $population
#>       x1 x2 x3 x4 x5 x6 x7 x8 x9 x10  fitness
#>  [1,]  1  0  1  0  0  1  1  1  1   0 19.59814
#>  [2,]  1  0  1  0  0  1  0  0  1   0 19.79300
#>  [3,]  1  0  1  0  0  1  1  1  1   0 19.90668
#>  [4,]  0  1  0  0  0  1  0  0  1   0 19.77031
#>  [5,]  0  1  0  0  0  1  1  0  1   0 19.91983
#>  [6,]  1  0  1  0  0  1  1  1  0   1 19.96582
#>  [7,]  1  0  1  0  1  1  0  0  1   0 19.82054
#>  [8,]  1  0  1  0  0  1  0  1  1   0 19.69648
#>  [9,]  1  0  1  0  0  1  1  1  1   0 19.59814
#> [10,]  1  0  1  0  0  1  1  1  1   0 19.59814

Filter example

For the example of the filter method, the iris dataset is used again as in the previous example. Again, it is important to note that the purpose of this example is to show how the package works, rather than a complete modeling process, so variable selection methods are applied to the entire dataset without partitioning. Otherwise the dataset should be partitioned into training and test data, and the feature selection should be applied over the training data.

library(FSinR)

data(iris)

The following filter methods are implemented in the package:

These methods can be passed as parameters to the search algorithms. But unlike wrapper methods, filter methods are directly implemented in the package for use, and there is no need to generate a function prior to use in the search methods as was the case with wrapperGenerator in the wrapper example.

Therefore, the search method is executed directly with the name of the filter function as parameter. In this case the search algorithm used in the example is Sequential forward selection, sfs. This algorithm is simple and therefore will not set additional parameters to the required. And on the other hand, the chosen filter measure is the Gini Index.

result.search.fs <- sfs(iris, "Species", giniIndex)

This search algorithm returns a list with the best subset of features it has found, and the value of the measurement obtained.

result.search.fs
#> $bestFeatures
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> 
#> $bestValue
#> [1] 1

As with wrapper measures, filter measures can be used directly without the need to include them in a search algorithm. To do this, the data set, the name of the variable to be predicted, and a vector with the names of the features to be taken into account are passed to the method as parameters.

filterMeasure <- giniIndex(iris, "Species", c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))

Filter methods can also be applied to regression problems. For this as in the previous wrapper regression example the dataset mtcars is used. As a search method the same algorithm is used as in the previous example, sfs, and as a filter method also the same method giniIndex.

result.search.fs <- sfs(mtcars, "mpg", giniIndex)
result.search.
#> $bestFeatures
#>      cyl disp hp drat wt qsec vs am gear carb
#> [1,]   1    0  0    0  0    1  0  0    0    0
#> 
#> $bestValue
#> [1] 1

Evaluation and algorithms combination"

Evaluation and algorithms combination

FSinR evaluation measures and algorithms functions are designed following a common interface so that they are easier to use.

In this example a selection of evaluation measures (Inconsistent Examples, Inconsistent Examples Pairs, Mutual Information and Binary consistency) are being used with Sequential Feature Selection algorithm.

library(FSinR)
#> 
#> Attaching package: 'FSinR'
#> The following object is masked from 'package:stats':
#> 
#>     ts

sfs(iris, 'Species', IEConsistency)
#> $bestFeatures
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           0            1           1
#> 
#> $bestFitness
#> [1] 1
sfs(iris, 'Species', IEPConsistency)
#> $bestFeatures
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> 
#> $bestFitness
#> [1] 1
sfs(iris, 'Species', mutualInformation)
#> $bestFeatures
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> 
#> $bestFitness
#> [1] 1.584963
sfs(iris, 'Species', binaryConsistency)
#> $bestFeatures
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> 
#> $bestFitness
#> [1] 1

In order to save some lines of code, as these functions share the same inteface, they can be used as following:

measures <- list(IEConsistency, IEPConsistency, mutualInformation, binaryConsistency)
for (measure in measures) {
  result <- sfs(iris, 'Species', measure)
  print(attr(measure,'name'))
  print(result$bestFeatures)
}
#> [1] "Inconsistent Examples Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           0            1           1
#> [1] "Inconsistent Examples Pairs Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> [1] "Mutual Information"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> [1] "Binary Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0

Algorithms also share the same interface, so they can be combined as well. In this example Sequential Forward Selection, Genetic Algorithm and Las Vegas Wrapper are run with all the previous mentioned evaluation measures:

measures <- list(IEConsistency, IEPConsistency, mutualInformation, binaryConsistency)
algorithms <- list(sfs, ga, lvw)
for (algorithm in algorithms) {
  for (measure in measures) {
    result <- algorithm(iris, 'Species', measure)
    print(paste("Algorithm: ",attr(algorithm,'name')))
    print(paste("Evaluation measure: ", attr(measure,'name')))
    print(result$bestFeatures)
  }
}
#> [1] "Algorithm:  Sequential Forward Selection"
#> [1] "Evaluation measure:  Inconsistent Examples Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           0            1           1
#> [1] "Algorithm:  Sequential Forward Selection"
#> [1] "Evaluation measure:  Inconsistent Examples Pairs Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> [1] "Algorithm:  Sequential Forward Selection"
#> [1] "Evaluation measure:  Mutual Information"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> [1] "Algorithm:  Sequential Forward Selection"
#> [1] "Evaluation measure:  Binary Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> [1] "Algorithm:  Genetic Algorithm"
#> [1] "Evaluation measure:  Inconsistent Examples Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           1
#> [2,]            1           1            1           0
#> [3,]            1           0            1           1
#> [1] "Algorithm:  Genetic Algorithm"
#> [1] "Evaluation measure:  Inconsistent Examples Pairs Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
#> [2,]            1           1            1           1
#> [1] "Algorithm:  Genetic Algorithm"
#> [1] "Evaluation measure:  Mutual Information"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           1
#> [2,]            1           1            1           0
#> [1] "Algorithm:  Genetic Algorithm"
#> [1] "Evaluation measure:  Binary Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
#> [2,]            1           1            1           1
#> [3,]            1           1            0           1
#> [4,]            1           0            1           1
#> [1] "Algorithm:  Las Vegas Wrapper"
#> [1] "Evaluation measure:  Inconsistent Examples Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            0           1
#> [1] "Algorithm:  Las Vegas Wrapper"
#> [1] "Evaluation measure:  Inconsistent Examples Pairs Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> [1] "Algorithm:  Las Vegas Wrapper"
#> [1] "Evaluation measure:  Mutual Information"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
#> [1] "Algorithm:  Las Vegas Wrapper"
#> [1] "Evaluation measure:  Binary Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           0            1           1

Wrapper evaluation measures can also be combined. In the example a knn model is used, since the iris problem is a classification problem. The FSinR package is able to detect automatically depending on the metric whether the objective of the problem is to maximize or minimize. To tune the model, the resampling method is established as a 10-fold crossvalidation, the dataset is centered and scaled, the accuracy is used as a metric, and a grid of the k parameter is performed.
The wrapper evaluation measure function is created and added to the list of features:

resamplingParams <- list(method = "cv", number = 10)
fittingParams <- list(preProc = c("center", "scale"), metric="Accuracy", tuneGrid = expand.grid(k = c(1:20)))

wrapper <- wrapperGenerator("knn", resamplingParams, fittingParams)

measures <- list(IEConsistency, IEPConsistency, mutualInformation, binaryConsistency, wrapper)
algorithms <- list(sfs, ga, lvw)
for (algorithm in algorithms) {
  for (measure in measures) {
    result <- algorithm(iris, 'Species', measure)
    print(paste("Algorithm: ",attr(algorithm,'name')))
    print(paste("Evaluation measure: ", attr(measure,'name')))
    print(result$bestFeatures)
  }
}
#> [1] "Algorithm:  Sequential Forward Selection"
#> [1] "Evaluation measure:  Inconsistent Examples Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           0            1           1
#> [1] "Algorithm:  Sequential Forward Selection"
#> [1] "Evaluation measure:  Inconsistent Examples Pairs Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> [1] "Algorithm:  Sequential Forward Selection"
#> [1] "Evaluation measure:  Mutual Information"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> [1] "Algorithm:  Sequential Forward Selection"
#> [1] "Evaluation measure:  Binary Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> Loading required package: lattice
#> Loading required package: ggplot2
#> [1] "Algorithm:  Sequential Forward Selection"
#> [1] "Evaluation measure:  knn  Wrapper"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
#> [1] "Algorithm:  Genetic Algorithm"
#> [1] "Evaluation measure:  Inconsistent Examples Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           1
#> [2,]            1           0            1           1
#> [3,]            0           1            1           1
#> [1] "Algorithm:  Genetic Algorithm"
#> [1] "Evaluation measure:  Inconsistent Examples Pairs Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
#> [2,]            1           1            0           1
#> [3,]            1           1            1           0
#> [4,]            1           1            1           1
#> [5,]            1           0            1           1
#> [1] "Algorithm:  Genetic Algorithm"
#> [1] "Evaluation measure:  Mutual Information"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           1
#> [2,]            0           1            1           1
#> [1] "Algorithm:  Genetic Algorithm"
#> [1] "Evaluation measure:  Binary Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           1
#> [2,]            1           1            0           1
#> [1] "Algorithm:  Genetic Algorithm"
#> [1] "Evaluation measure:  knn  Wrapper"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
#> [1] "Algorithm:  Las Vegas Wrapper"
#> [1] "Evaluation measure:  Inconsistent Examples Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
#> [1] "Algorithm:  Las Vegas Wrapper"
#> [1] "Evaluation measure:  Inconsistent Examples Pairs Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            1           1            1           0
#> [1] "Algorithm:  Las Vegas Wrapper"
#> [1] "Evaluation measure:  Mutual Information"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
#> [1] "Algorithm:  Las Vegas Wrapper"
#> [1] "Evaluation measure:  Binary Consistency"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1
#> [1] "Algorithm:  Las Vegas Wrapper"
#> [1] "Evaluation measure:  knn  Wrapper"
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,]            0           1            1           1

References

Measure utilities

Function Measure Family Measured property Evaluation Cite
chiSquared Chi-squared Filter Dependency Individual [@Pearson1900]
cramer Cramer V Filter Dependency Individual [@Cramer1946]
fscore F-score Filter Distance Individual [@Wang2018]
relief Relief Filter Distance Individual [@Kira1992]
roughsetConsistency Rough Sets consistency Filter Consistency Set [@Pawlak1982]
binaryConsistency Binary consistency Filter Consistency Set [@AlmuallimDietterich1991]
IEConsistency Inconsistent Examples consistency Filter Consistency Set [@DashLiu2003]
IEPConsistency Inconsistent Examples Pairs consistency Filter Consistency Set [@Arauzo2007]
determinationCoefficient Determination Coefficient (R2R^2, to continous features) Filter Dependency Set [@rsquared]
mutualInformation Mutual information Filter Information Set [@QianShu2015]
gainRatio Gain ratio Filter Information Set [@Quinlan1986]
symmetricalUncertain Symmetrical uncertain Filter Information Set [@WittenFrank2005]
giniIndex Gini index Filter Information Set [@Ceriani2012]
Jd Jd evaluation Filter Information Set [@Narendra1977]
MDLC MDLC evaluation Filter Information Set [@Sheinvald1990]
RFSM RFSM evaluation Filter Distance Set [@Arauzo2004]
wrapperGenerator Wrapper function Wrapper Accuracy Set [@kohavi1997]

Search Algorithms

Function Algorithm Evaluation Family Cite
breadhFirstSearch Breadth First Search Set Exhaustive Search [@Kozen1992]
deepFirstSearch Deep First Search Set Exhaustive Search [@Kozen1992]
hc Hill-Climbing Set Heuristics [@Russell2009]
ts Tabu search Set Heuristics [@glover1986; @glover1990]
ga Genetic algorithm Set Heuristics [@yang1998feature]
woa Whale optimization algorithm Set Heuristics [@Kumar2018]
sa Simmulated annealing Set Heuristics [@KirkpatrickGelattVecchi1983]
aco Ant colony optimization Set Heuristics [@Kashef2015]
lvw Las Vegas wrapper Set Random search [@LiuSetiono1996]
sfs Sequential Forward Selection Set Sequential selection [@Whitney1971]
sffs Sequential Floating Forward Selection Set Sequential selection [@Pudil1994]
sbs Sequential Backward Selection Set Sequential selection [@MarillGreen1963]
sfbs Sequential Floating Backward Selection Set Sequential selection [@Pudil1994]
LCC Linear Consistency-Constrained Hybrid Sequential selection [@ShinXu2009]
selectKBest Select K best Individual Direct selection
selectPercentile Select Percentile Individual Direct selection
selectThreshold Select threshold Individual Direct selection
selectThresholdRange Select threshold range Individual Direct selection
selectDifference Select difference Individual Direct selection
selectSlope Select slope Individual Direct selection