Mastering Machine Learning with R(Second Edition)

Data understanding and preparation

This dataset consists of tissue samples from 699 patients. It is in a data frame with 11 variables, as follows:

  • ID: Sample code number
  • V1: Clump thickness
  • V2: Uniformity of the cell size
  • V3: Uniformity of the cell shape
  • V4: Marginal adhesion
  • V5: Single epithelial cell size
  • V6: Bare nucleus (16 observations are missing)
  • V7: Bland chromatin
  • V8: Normal nucleolus
  • V9: Mitosis
  • class: Whether the tumor diagnosis is benign or malignant; this will be the outcome that we are trying to predict

The medical team has scored and coded each of the nine features on a scale of 1 to 10.

The data frame is available in the R MASS package under the name biopsy. To prepare this data, we will load the data frame, confirm the structure, rename the variables to something meaningful, and delete the missing observations. At that point, we can begin to explore the data visually. The following code gets us started: we first load the library and then the dataset, and then use the str() function to examine the underlying structure of the data:

    > library(MASS)
    > data(biopsy)
    > str(biopsy)
    'data.frame': 699 obs. of 11 variables:
     $ ID   : chr "1000025" "1002945" "1015425" "1016277" ...
     $ V1   : int 5 5 3 6 4 8 1 2 2 4 ...
     $ V2   : int 1 4 1 8 1 10 1 1 1 2 ...
     $ V3   : int 1 4 1 8 1 10 1 2 1 1 ...
     $ V4   : int 1 5 1 1 3 8 1 1 1 1 ...
     $ V5   : int 2 7 2 3 2 7 2 2 2 2 ...
     $ V6   : int 1 10 2 4 1 10 10 1 1 1 ...
     $ V7   : int 3 3 3 3 3 9 3 3 1 2 ...
     $ V8   : int 1 2 1 7 1 7 1 1 1 1 ...
     $ V9   : int 1 1 1 1 1 1 1 1 5 1 ...
     $ class: Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

An examination of the data structure shows that our features are integers and the outcome is a factor. No transformation of the data to a different structure is needed. 

We can now get rid of the ID column, as follows:

    > biopsy$ID <- NULL

Next, we will rename the variables and confirm that the code has worked as intended:

    > names(biopsy) <- c("thick", "u.size", "u.shape",
        "adhsn", "s.size", "nucl", "chrom", "n.nuc",
        "mit", "class")
    > names(biopsy)
     [1] "thick"   "u.size"  "u.shape" "adhsn"   "s.size"  "nucl"
     [7] "chrom"   "n.nuc"   "mit"     "class"

Now, we will delete the missing observations. As only 16 observations have missing data, roughly 2 percent of the total, it is safe to get rid of them. A thorough discussion of how to handle missing data is outside the scope of this chapter and is included in Appendix A, R Fundamentals, where I cover data manipulation. Deleting these observations creates a new working data frame. One line of code does the trick with the na.omit() function, which deletes every row that contains a missing value:

    > biopsy.v2 <- na.omit(biopsy)
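Before dropping rows, it can be worth confirming where the missing values sit and how many rows are lost. The following is a minimal sketch in base R. The names raw, cleaned, and missing_by_column are illustrative and do not appear in the original text; a fresh copy of the data is used, with its original column names, so that the renamed biopsy object is left alone:

```r
library(MASS)

# Work on a fresh copy so the renamed `biopsy` object is not overwritten
raw <- MASS::biopsy

# Count the missing values in each column; only V6 should be non-zero
missing_by_column <- colSums(is.na(raw))
print(missing_by_column)

# na.omit() keeps only the rows with no missing values
cleaned <- na.omit(raw)

# Rows before and after, and the percentage of rows dropped
n_before <- nrow(raw)
n_after <- nrow(cleaned)
print(c(before = n_before, after = n_after))
print(round(100 * (n_before - n_after) / n_before, 1))
```

The 16 missing values all sit in V6, so 699 rows become 683.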

Depending on the R package that you use to analyze the data, the outcome may need to be numeric, that is, 0 or 1. To accommodate that requirement, create the variable y, where benign is 0 and malignant is 1, using the ifelse() function. Note that y is built from biopsy.v2 so that it lines up with the rows that remain after deleting the missing observations:

    > y <- ifelse(biopsy.v2$class == "malignant", 1, 0)
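One way to confirm that the 0/1 coding matches the factor is to cross-tabulate the two. This is a hedged sketch rather than part of the original workflow; complete_rows and y_check are illustrative names, and the data is rebuilt from MASS so that the block runs on its own:

```r
library(MASS)

# Fresh copy with complete cases only (illustrative name, not from the text)
complete_rows <- na.omit(MASS::biopsy)

# Code malignant as 1 and benign as 0
y_check <- ifelse(complete_rows$class == "malignant", 1, 0)

# The coded outcome should have one entry per remaining row
print(length(y_check) == nrow(complete_rows))

# Cross-tabulate: every benign row should be 0 and every malignant row 1
print(table(outcome = complete_rows$class, coded = y_check))
```

If the outcome vector were built from the data frame that still contains the missing rows, its length would be 699 rather than 683 and it would no longer align with the features.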

There are a number of ways to understand the data visually in a classification problem, and I think a lot of it comes down to personal preference. One of the things that I like to do in these situations is to examine boxplots of the features, split by the classification outcome. This is an excellent way to begin understanding which features may be important to the algorithm, as boxplots show the distribution of the data at a glance. In my experience, they also provide an effective way to build the presentation story that you will deliver to your customers. The lattice and ggplot2 packages are both quite good at this task. I will use ggplot2 in this case, along with the additional package reshape2. After loading the packages, you will need to create a data frame using the melt() function. Melting the features into a long format allows us to create a matrix of boxplots for the following visual inspection:

    > library(reshape2)
    > library(ggplot2)

The following code melts the data so that each row holds a single feature value, with class kept as the identifier:

    > biop.m <- melt(biopsy.v2, id.vars = "class")
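To see what melting does, it may help to try it on a very small example. The sketch below uses a made-up three-row data frame, so toy and toy_long are illustrative names and the values are not taken from the biopsy data:

```r
library(reshape2)

# Toy data frame (made-up values) with two features and a class label
toy <- data.frame(
  thick = c(5, 3, 8),
  nucl  = c(1, 2, 10),
  class = factor(c("benign", "benign", "malignant"))
)

# Melt to long format: one row per (observation, feature) pair
toy_long <- melt(toy, id.vars = "class")
print(toy_long)
```

The result has three columns: class, variable (the name of the feature), and value. Three rows with two features become six rows; by the same logic, the 683 complete observations with nine features become 6,147 rows.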

Through the magic of ggplot2, we can create a 3x3 boxplot matrix, as follows:

    > ggplot(data = biop.m, aes(x = class, y = value)) +
        geom_boxplot() + facet_wrap(~ variable, ncol = 3)

The following is the output of the preceding code:

How do we interpret a boxplot? First of all, in the preceding plot, each white box spans the lower quartile to the upper quartile of the data; in other words, half of all the observations fall within the box. The dark line cutting across the box is the median value. The lines extending from the boxes, known as whiskers, reach the smallest and largest values that lie within 1.5 times the interquartile range of the box. The black dots beyond the whiskers are the outliers.
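The numbers behind a boxplot can also be computed directly. The following sketch uses the base R boxplot.stats() function on the bare nuclei score, split by class. The names complete_rows, box_numbers, and outlier_counts are illustrative, and the original column name V6 is used because the data is reloaded from MASS. Note that boxplot.stats() works with hinges, which are close to, but not always identical to, the quartiles that ggplot2 draws:

```r
library(MASS)

# Complete cases from a fresh copy; V6 is the bare nuclei score
complete_rows <- na.omit(MASS::biopsy)

# boxplot.stats()$stats returns five numbers: lower whisker end,
# lower hinge, median, upper hinge, upper whisker end
box_numbers <- tapply(complete_rows$V6, complete_rows$class,
                      function(x) boxplot.stats(x)$stats)
print(box_numbers)

# Points beyond the whiskers, which a boxplot draws as individual dots
outlier_counts <- tapply(complete_rows$V6, complete_rows$class,
                         function(x) length(boxplot.stats(x)$out))
print(outlier_counts)
```

Comparing the third number (the median) across the two classes gives a quick numeric read on how well a feature separates benign from malignant samples.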

By inspecting the plots and applying some judgment, it is difficult to determine which features will be important in our classification algorithm. However, I think it is safe to assume that the nuclei feature will be important, given the separation of the median values and corresponding distributions. Conversely, there appears to be little separation of the mitosis feature by class, and it will likely be an irrelevant feature. We shall see!

With all of our features quantitative, we can also do a correlation analysis as we did with linear regression. Collinearity with logistic regression can bias our estimates just as we discussed with linear regression. Let's load the corrplot package and examine the correlations as we did in the previous chapter, this time using a different type of correlation matrix, which has both shaded ovals and the correlation coefficients in the same plot, as follows:

    > library(corrplot)
    > bc <- cor(biopsy.v2[, 1:9]) #create an object of the features
    > corrplot.mixed(bc)

The following is the output of the preceding code:

The correlation coefficients indicate that we may have a problem with collinearity, in particular between the uniformity of cell shape and uniformity of cell size features. As part of the logistic regression modeling process, it will be necessary to incorporate the VIF analysis as we did with linear regression. The final task in the data preparation will be the creation of our train and test datasets. The purpose of splitting the original data in two is to estimate how accurately the model predicts data that it has not seen.
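As a numeric complement to the correlation plot, the most strongly correlated feature pairs can be listed directly. This is a sketch in base R under a few assumptions: the data is reloaded from MASS, so the original names V1 to V9 appear (V2 and V3 are the uniformity of cell size and cell shape), and complete_rows, feature_cor, and cor_pairs are illustrative names:

```r
library(MASS)

# Complete cases from a fresh copy, features only
complete_rows <- na.omit(MASS::biopsy)
features <- complete_rows[, paste0("V", 1:9)]
feature_cor <- cor(features)

# Keep each pair once by blanking the diagonal and the lower triangle
feature_cor[lower.tri(feature_cor, diag = TRUE)] <- NA

# Flatten to a table of pairs and sort by absolute correlation
cor_pairs <- as.data.frame(as.table(feature_cor), stringsAsFactors = FALSE)
names(cor_pairs) <- c("feature_1", "feature_2", "correlation")
cor_pairs <- cor_pairs[!is.na(cor_pairs$correlation), ]
cor_pairs <- cor_pairs[order(-abs(cor_pairs$correlation)), ]
print(head(cor_pairs, 3))
```

The pair at the top of this list is the first candidate to check when the VIF analysis is run on the fitted model.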

In essence, in machine learning, we should be less concerned with how well we can predict the current observations and more focused on how well we can predict the observations that were not used to create the algorithm. So, we build candidate algorithms on the training data and select the one that performs best on the test set. The models that we will build in this chapter will be evaluated by this criterion.

There are a number of ways to proportionally split our data into train and test sets: 50/50, 60/40, 70/30, 80/20, and so forth. The data split that you select should be based on your experience and judgment. For this exercise, I will use a 70/30 split, as follows:

    > set.seed(123) #random number generator
    > ind <- sample(2, nrow(biopsy.v2), replace = TRUE,
        prob = c(0.7, 0.3))
    > train <- biopsy.v2[ind==1, ] #the training data set
    > test <- biopsy.v2[ind==2, ] #the test data set
    > str(test) #confirm it worked
    'data.frame': 209 obs. of 10 variables:
     $ thick  : int 5 6 4 2 1 7 6 7 1 3 ...
     $ u.size : int 4 8 1 1 1 4 1 3 1 2 ...
     $ u.shape: int 4 8 1 2 1 6 1 2 1 1 ...
     $ adhsn  : int 5 1 3 1 1 4 1 10 1 1 ...
     $ s.size : int 7 3 2 2 1 6 2 5 2 1 ...
     $ nucl   : int 10 4 1 1 1 1 1 10 1 1 ...
     $ chrom  : int 3 3 3 3 3 4 3 5 3 2 ...
     $ n.nuc  : int 2 7 1 1 1 3 1 4 1 1 ...
     $ mit    : int 1 1 1 1 1 1 1 4 1 1 ...
     $ class  : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 2 1 1 ...
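Because sample() with the prob argument assigns each row to a group independently, the resulting split is only approximately 70/30 (here, 474 and 209 rows). If an exact proportion is preferred, one alternative is to sample row positions without replacement. This is a sketch of that different technique, not the method used in the rest of this chapter; complete_rows, train_rows, train_exact, and test_exact are illustrative names:

```r
library(MASS)

# Complete cases from a fresh copy
complete_rows <- na.omit(MASS::biopsy)
n <- nrow(complete_rows)

set.seed(123)
# Draw exactly 70 percent of the row positions without replacement
train_rows <- sample(seq_len(n), size = floor(0.7 * n))

# Rows that were drawn form the training set; the rest form the test set
train_exact <- complete_rows[train_rows, ]
test_exact <- complete_rows[-train_rows, ]

print(c(train = nrow(train_exact), test = nrow(test_exact)))
```

With 683 complete observations, this always yields 478 training rows and 205 test rows, whichever seed is used; only the membership of the two sets changes.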

To ensure that we have a well-balanced outcome variable between the two datasets, we will perform the following check:

    > table(train$class)
       benign malignant
          302       172
    > table(test$class)
       benign malignant
          142        67

This is an acceptable ratio of our outcomes in the two datasets; with this, we can begin the modeling and evaluation.
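The balance is easier to judge as proportions than as raw counts. The following sketch simply converts the counts reported above; the object names are illustrative:

```r
# Counts copied from the table() output above
train_counts <- c(benign = 302, malignant = 172)
test_counts <- c(benign = 142, malignant = 67)

# Convert counts to proportions within each dataset
train_share <- round(train_counts / sum(train_counts), 3)
test_share <- round(test_counts / sum(test_counts), 3)

print(rbind(train = train_share, test = test_share))
```

Malignant cases make up about 36 percent of the training set and about 32 percent of the test set.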