
Factors
When analyzing data, it's quite common to encounter categorical values. R represents them with factors, which are created using the factor() function and are integer vectors with an associated label for each integer. The different values that a factor can take are called levels. The levels() function shows all the levels of a factor, and the levels parameter of the factor() function can be used to explicitly define their order. If the order is not explicitly defined, the levels are sorted alphabetically.
Note that defining an explicit order can be important in linear modeling because the first level is used as the baseline level for functions like lm() (linear models), which we will use in Chapter 3, Predicting Votes with Linear Models.
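As a minimal sketch of the baseline behavior, the following fits lm() on a small made-up data frame. The toy data, the color and score column names, and the use of relevel() to change the first level are assumptions for illustration, not something taken from the later chapter:

```r
# Hypothetical toy data: two observations per color
toy <- data.frame(
    color = factor(c("Blue", "Red", "Black", "Blue", "Red", "Black")),
    score = c(1, 2, 3, 2, 3, 4)
)

# Default (alphabetical) levels: "Black" comes first, so it is the baseline
fit <- lm(score ~ color, data = toy)
names(coef(fit))
#> [1] "(Intercept)" "colorBlue"   "colorRed"

# Move "Red" to the first position so that it becomes the baseline instead
toy$color <- relevel(toy$color, ref = "Red")
fit_red <- lm(score ~ color, data = toy)
names(coef(fit_red))
#> [1] "(Intercept)" "colorBlack"  "colorBlue"
```

The baseline level gets no coefficient of its own. Its group mean is absorbed into the intercept, and the remaining coefficients are differences from it.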
Furthermore, printing a factor shows slightly different information than printing a character vector. In particular, note that the quotes are not shown and that the levels are explicitly printed in order afterwards:
x <- c("Blue", "Red", "Black", "Blue")
y <- factor(c("Blue", "Red", "Black", "Blue"))
z <- factor(c("Blue", "Red", "Black", "Blue"),
            levels=c("Red", "Black", "Blue"))
x
#> [1] "Blue" "Red" "Black" "Blue"
y
#> [1] Blue Red Black Blue
#> Levels: Black Blue Red
z
#> [1] Blue Red Black Blue
#> Levels: Red Black Blue
levels(y)
#> [1] "Black" "Blue" "Red"
levels(z)
#> [1] "Red" "Black" "Blue"
Factors can sometimes be tricky to work with because their types are interpreted differently depending on which function is used to operate on them. Remember the class() and typeof() functions we used before? When used on factors, they may produce unexpected results. As you can see below, the class() function will identify x and y as being character and factor, respectively. However, the typeof() function will let us know that they are character and integer, respectively. Confusing, isn't it? This happens because, as we mentioned, factors are stored internally as integers, and use a mechanism similar to look-up tables to retrieve the actual string associated with each one.
Technically, the way factors store the strings associated with their integer values is through attributes, which is a topic we will touch on in Chapter 8, Object-Oriented System to Track Cryptocurrencies.
class(x)
#> [1] "character"
class(y)
#> [1] "factor"
typeof(x)
#> [1] "character"
typeof(y)
#> [1] "integer"
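The look-up mechanism can be sketched by hand. The snippet below recreates the same y factor, and the codes variable name is only for illustration. It pulls out the integer codes and uses them to index into the levels, which are stored in the "levels" attribute:

```r
y <- factor(c("Blue", "Red", "Black", "Blue"))

# The underlying integer codes, one per element
codes <- as.integer(y)
codes
#> [1] 2 3 1 2

# The labels are stored once, as an attribute of the integer vector
attr(y, "levels")
#> [1] "Black" "Blue"  "Red"

# Indexing the levels by the codes recovers the original strings
levels(y)[codes]
#> [1] "Blue"  "Red"   "Black" "Blue"
```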
While factors look and often behave like character vectors, as we mentioned, they are actually integer vectors, so be careful when treating them like strings. Some string functions, like gsub() and grepl(), will coerce factors to characters, while others, like nchar(), will throw an error. Still others, like c(), used the underlying integer values in R versions before 4.1.0, as in the output shown here. Since R 4.1.0, c() applied to a factor returns a factor. For this reason, it's usually best to explicitly convert factors to the data type you need:
gsub("Black", "White", x)
#> [1] "Blue" "Red" "White" "Blue"
gsub("Black", "White", y)
#> [1] "Blue" "Red" "White" "Blue"
nchar(x)
#> [1] 4 3 5 4
nchar(y)
#> Error in nchar(y): 'nchar()' requires a character vector
c(x)
#> [1] "Blue" "Red" "Black" "Blue"
c(y)
#> [1] 2 3 1 2
In case you did not notice, nchar() was applied to each of the elements in the x character vector. The "Blue", "Red", and "Black" strings have 4, 3, and 5 characters, respectively. This is another example of the vectorized operations we mentioned in the vectors section earlier.
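The advice to convert factors explicitly can be sketched as follows. The y factor is recreated from the example above, while the n factor of number-like strings is a hypothetical addition used to show a common pitfall:

```r
y <- factor(c("Blue", "Red", "Black", "Blue"))

# Convert to character before using string functions such as nchar()
y_chr <- as.character(y)
nchar(y_chr)
#> [1] 4 3 5 4

# Convert to integer when the underlying codes are what you want
y_int <- as.integer(y)
y_int
#> [1] 2 3 1 2

# Hypothetical factor whose labels look like numbers
n <- factor(c("10", "20", "10"))

# as.integer() returns the codes, not the values shown in the labels
as.integer(n)
#> [1] 1 2 1

# Going through character first recovers the actual numbers
as.integer(as.character(n))
#> [1] 10 20 10
```

Converting through as.character() first is the safer habit whenever the labels themselves carry the values you need.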