As a first step, we install the package from CRAN. Next, we load the package with the library command.
We are now ready to start using the package. We load the Coffee dataset, which is a classification problem from the time series domain. The help page of this dataset can be consulted for details.
Now we need to simulate a semi-supervised setting. To do this, we partition the dataset into three subsets: a labeled set L, an unlabeled set U, and a test set T.
x <- coffee[, -287] # instances without classes
y <- coffee[, 287] # the classes
set.seed(1) # set seed
tra.idx <- sample(x = length(y), size = ceiling(length(y) * 0.5))
xtrain <- x[tra.idx,] # training instances
ytrain <- y[tra.idx] # related classes
tra.na.idx <- sample(x = length(tra.idx),
                     size = ceiling(length(tra.idx) * 0.7))
ytrain[tra.na.idx] <- NA # remove classes from 70% of instances
# tra.na.idx holds positions within the training set, not rows of x
xttest <- xtrain[tra.na.idx,] # unlabeled training instances
yttest <- y[tra.idx[tra.na.idx]] # real classes
tst.idx <- setdiff(1:length(y), tra.idx)
xitest <- x[tst.idx,] # testing instances
yitest <- y[tst.idx] # related classes
The training set (xtrain) contains 50% of all instances and the testing set (xitest) contains the rest. Only 30% of the instances in xtrain are labeled. This information is stored in the vector ytrain, where the positions with the value NA (Not Available) correspond to the unlabeled instances in xtrain. The labeled instances in xtrain were selected at random.
The variables xitest and xttest are two sets of instances used to test the prediction capabilities of the model. Specifically, xitest is used to test inductive prediction and xttest is used to test transductive prediction. The variables yitest and yttest hold the class information of the instances in xitest and xttest, respectively.
We can now train a semi-supervised model from the data. We call the function selfTraining to train this model:
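A partition of this kind can be sanity-checked by rebuilding it on toy data and verifying the subset sizes. The following is a minimal sketch in base R, assuming a small synthetic dataset (hypothetical, standing in for coffee). Because tra.na.idx holds positions within the training set, the original row numbers are recovered through tra.idx.

```r
# Toy stand-in for the coffee data (hypothetical: 20 instances, 3 features).
set.seed(1)
x <- matrix(rnorm(20 * 3), nrow = 20)
y <- factor(rep(c("a", "b"), each = 10))

# Same partition scheme as in the text: 50% training, 70% of it unlabeled.
tra.idx <- sample(x = length(y), size = ceiling(length(y) * 0.5))
xtrain <- x[tra.idx, ]
ytrain <- y[tra.idx]
tra.na.idx <- sample(x = length(tra.idx),
                     size = ceiling(length(tra.idx) * 0.7))
ytrain[tra.na.idx] <- NA
tst.idx <- setdiff(seq_along(y), tra.idx)

# Transductive test set: map positions in the training set back to rows of x.
xttest <- xtrain[tra.na.idx, ]
yttest <- y[tra.idx[tra.na.idx]]

# Sizes of the labeled set L, the unlabeled set U and the test set T.
n.labeled <- sum(!is.na(ytrain))
n.unlabeled <- sum(is.na(ytrain))
n.test <- length(tst.idx)
print(c(L = n.labeled, U = n.unlabeled, T = n.test))
```

The three counts should add up to the number of instances, and the training and testing indices should not overlap.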
m.selfT1 <- selfTraining(x = xtrain, y = ytrain, dist = "euclidean")
In this example we set the distance measure to “euclidean”. Any distance function available in the proxy package can be specified in the same way.
m.selfT1 is now a trained model, ready to classify instances. First, we classify the unlabeled instances that were used to train the classifier. To check the results we use the function confusionMatrix from the caret package. This classification process is considered transductive because it assigns labels to the unlabeled instances used during the training phase.
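To see which distance names are available, the registry of the proxy package can be queried. This is a minimal sketch that assumes proxy is installed; it only lists registry entries and does not check which of them selfTraining accepts.

```r
# Assumption: the proxy package is installed; skip quietly otherwise.
if (requireNamespace("proxy", quietly = TRUE)) {
  # pr_DB is the registry of proximity measures shipped with proxy.
  measures <- proxy::pr_DB$get_entry_names()
  print(head(measures))
  # Check that the Euclidean distance is registered.
  print(proxy::pr_DB$entry_exists("Euclidean"))
}
```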
library(caret) # provides confusionMatrix
p.selfT1 <- predict(m.selfT1, xttest)
resultT <- confusionMatrix(table(p.selfT1, yttest))$overall[1:2]
To test the inductive capabilities of m.selfT1 we classify unseen instances (instances that were not included in the training process). We call the function predict to classify the instances in the variable xitest.
p.selfT1 <- predict(m.selfT1, xitest)
resultI <- confusionMatrix(table(p.selfT1, yitest))$overall[1:2]
As a result we obtain the accuracy and the kappa statistic of each classification. Frequently, transductive classification outperforms inductive classification. This is because the training process has already obtained additional information about the unlabeled instances that are being classified.
To finish, we generate a bar plot with the results of the inductive and transductive classification.
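To make the two statistics concrete, they can be computed by hand from a confusion table. The sketch below uses hypothetical predictions and base R only. It computes accuracy as the observed agreement and the unweighted Cohen's kappa as the agreement corrected for chance; it is not the caret implementation.

```r
# Hypothetical true classes and predictions for ten instances.
truth <- factor(c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"))
pred  <- factor(c("a", "a", "a", "a", "b", "b", "b", "b", "a", "a"))

cm <- table(pred, truth)  # confusion table: rows = predicted, cols = true
n <- sum(cm)

# Accuracy: proportion of instances on the diagonal (observed agreement).
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: observed agreement corrected for chance agreement.
kappa.stat <- (accuracy - expected) / (1 - expected)

print(c(Accuracy = accuracy, Kappa = kappa.stat))
```

A kappa close to 0 means the classifier does little better than chance, even when the accuracy looks acceptable.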
barplot(cbind(resultI, resultT), beside = TRUE, col = c("lightblue", "blue"),
        legend.text = c("Accuracy", "Kappa"),
        main = "Classification of Coffee dataset",
        args.legend = list(x = 6, y = 0.4))