R Programming By Example

Checking normality with histograms and quantile-quantile plots

We will check normality with two different techniques so that we can exemplify the usage of a technique known as the strategy pattern, which is part of a set of patterns from object-oriented programming. We will go deeper into these patterns in Chapter 8, Object-Oriented System to Track Cryptocurrencies.

For now, you can think of the strategy pattern as a technique that reuses code that would otherwise be duplicated and changes only the way one step is done, which is called the strategy. In the following code, we create a function called save_png() that contains the code that would otherwise be duplicated (saving PNG files). We will have two strategies, in the form of functions, to check data normality: histograms and quantile-quantile plots. These are passed through the argument conveniently named function_to_create_image. As you can see, save_png() receives some data, the variable that will be used for the graph, the file name for the image, and a function that will be used to create the graph. This last parameter should not be unfamiliar, as we saw in Chapter 1, Introduction to R, that we can pass functions as arguments and call them through their new name inside the function, function_to_create_image() in this case:

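save_png <- function(data, variable, save_to, function_to_create_image) {
    # Open a PNG device only when a file name was given
    if (not_empty(save_to)) png(save_to)
    # Delegate the drawing to the strategy passed in as an argument
    function_to_create_image(data, variable)
    # Close the PNG device so that the file is written to disk
    if (not_empty(save_to)) dev.off()
}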
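
The save_png() function above relies on a not_empty() helper that is not defined in this section. A minimal stand-in, assuming the helper only needs to test whether save_to is a non-empty string, might look like the following. The definition is an assumption made for illustration and is not necessarily the one used elsewhere in the book:

```r
# Hypothetical stand-in for the not_empty() helper used by save_png().
# Assumption: "empty" means the empty string "", the default for save_to.
not_empty <- function(text) {
    return(text != "")
}

not_empty("")                # FALSE: draw on the active graphics device
not_empty("histogram.png")   # TRUE: open a PNG file first
```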

Now we show the code that makes use of this save_png() function and encapsulates the knowledge of which function is used in each case. For histograms, the histogram() function shown in the following code simply wraps the hist() function used to create the graph in a common interface that is also used by the other strategies (here, the quantile_quantile() function). This common interface allows us to use these strategies as plugins that can be substituted easily, as we do in the corresponding variable_histogram() and variable_qqplot() functions (both make the same call, but with a different strategy). As you can see, other details that are not part of the common interface (for example, main and xlab) are handled within each strategy's code. We could add them as optional arguments if we wanted to, but it's not necessary for this example:

variable_histogram <- function(data, variable, save_to = "") {
    save_png(data, variable, save_to, histogram)
}

histogram <- function(data, variable) {
    hist(data[, variable], main = "Histogram", xlab = "Proportion")
}

variable_qqplot <- function(data, variable, save_to = "") {
    save_png(data, variable, save_to, quantile_quantile)
}

quantile_quantile <- function(data, variable) {
    qqnorm(data[, variable], main = "Normal QQ-Plot for Proportion")
    qqline(data[, variable])
}
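
The remark about optional arguments could be sketched roughly as follows. The names histogram_with_labels() and proportion_histogram() are hypothetical, as is the sample data. Because save_png() calls its strategy with exactly two arguments, any extra settings are fixed in advance by a small wrapper that keeps the common interface:

```r
# Hypothetical variant of histogram() with optional labels.
# The first two arguments keep the common (data, variable) interface.
histogram_with_labels <- function(data, variable,
                                  main = "Histogram", xlab = variable) {
    hist(data[, variable], main = main, xlab = xlab)
}

# A strategy with the labels fixed in advance. It still takes only
# (data, variable), so it could be passed wherever histogram() is.
proportion_histogram <- function(data, variable) {
    histogram_with_labels(data, variable,
                          main = "Histogram of Proportion",
                          xlab = "Proportion")
}

# Made-up data, only to show the call.
example_data <- data.frame(Proportion = c(0.31, 0.42, 0.47, 0.55, 0.58, 0.66))
proportion_histogram(example_data, "Proportion")
```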

[Figure: graph for checking Proportion normality]

If we wanted to share the code used to create the PNG images with a third strategy (or more), we could simply add a strategy wrapper for each new case without duplicating the code that creates the PNG images. It may seem that this is not a big deal, but imagine that the code used to create the PNG files was complex and you suddenly found a bug. What would you need to do to fix that bug? You would have to go to every place where you duplicated the code and fix it there, which is not very efficient. Now, what happens if you no longer want to save PNG files and want to save JPG files instead? Again, you would have to go everywhere you duplicated your code and change it. As you can see, this way of programming requires a little investment up front (creating the common interface and providing wrappers), but it pays for itself through the time saved when you do need to change the code, even if only once, and through simpler, more understandable code. This is a form of dependency management, and it is something you should learn to do to become a more efficient programmer.
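
To illustrate, a third strategy could be added with one strategy function and one wrapper, leaving the PNG code untouched. The sketch below uses a kernel density plot. The names density_plot() and variable_density() are hypothetical, the data is made up, and not_empty() is an assumed stand-in for the helper that save_png() calls:

```r
# Assumed stand-in for the helper used by save_png().
not_empty <- function(text) text != ""

# save_png() as defined earlier in this section.
save_png <- function(data, variable, save_to, function_to_create_image) {
    if (not_empty(save_to)) png(save_to)
    function_to_create_image(data, variable)
    if (not_empty(save_to)) dev.off()
}

# New strategy (hypothetical): same (data, variable) interface.
density_plot <- function(data, variable) {
    plot(density(data[, variable]), main = "Density", xlab = variable)
}

# New wrapper (hypothetical): same signature as variable_histogram().
variable_density <- function(data, variable, save_to = "") {
    save_png(data, variable, save_to, density_plot)
}

# Made-up data, only to show the call.
set.seed(1)
example_data <- data.frame(Proportion = rnorm(100, mean = 0.5, sd = 0.1))
variable_density(example_data, "Proportion")
```

In the same spirit, switching to JPG output would mean changing only the png() call inside save_png(), for example to the jpeg() device that base R also provides.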

You may have noticed that, in the previous code, we could have avoided one function call by having the user call the save_png() function directly. However, doing so would require the user to know two things: the save_png() function to save the image, and the quantile_quantile() or histogram() function to produce the plot, depending on what she was trying to plot. This extra burden on the user, although seemingly not problematic, could make things confusing for her, since not many users are used to passing functions as arguments, and she would have to know two function signatures instead of one.

Providing a wrapper whose signature is easily usable as we do with variable_histogram() and variable_qqplot() makes it easier on the user, and allows us to expand the way we want to show graphs in case we want to change that later without making the user learn a new function signature.

To actually produce the plots we're looking for, we use the following code:

variable_histogram(data = data, variable = "Proportion")
variable_qqplot(data = data, variable = "Proportion")
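
The calls above draw on the active graphics device because save_to defaults to an empty string. As a rough sketch of the other path, the following passes a file name so that the histogram is written to a PNG file. The data frame, the output path, and the not_empty() definition are assumptions made for illustration. The functions from this section are repeated so that the sketch runs on its own:

```r
# Assumed stand-in for the helper used by save_png().
not_empty <- function(text) text != ""

# Functions from this section, repeated unchanged.
save_png <- function(data, variable, save_to, function_to_create_image) {
    if (not_empty(save_to)) png(save_to)
    function_to_create_image(data, variable)
    if (not_empty(save_to)) dev.off()
}

histogram <- function(data, variable) {
    hist(data[, variable], main = "Histogram", xlab = "Proportion")
}

variable_histogram <- function(data, variable, save_to = "") {
    save_png(data, variable, save_to, histogram)
}

# Made-up stand-in for the data used in the chapter.
set.seed(42)
example_data <- data.frame(Proportion = rbeta(200, 20, 20))

# Hypothetical output path in a temporary directory.
output_file <- file.path(tempdir(), "proportion_histogram.png")

# png() needs PNG support in the R build, so check before writing.
if (capabilities("png")) {
    variable_histogram(example_data, "Proportion", save_to = output_file)
    file.exists(output_file)
}
```

Note that the user only supplies a file name here. Which plotting function is used, and how the file is opened and closed, stays hidden behind the wrapper's signature.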

As you can see, the histogram shows an approximately normal distribution slightly skewed towards the right, but we can easily accept it as being normal. The corresponding quantile-quantile plot shows the same information in a slightly different way. The line is the reference for a normal distribution, and the dots show the actual distribution of the data. The closer these dots are to the line, the closer the variable's distribution is to a normal distribution. As we can see, for the most part, Proportion is normally distributed, and it's only at the extremes that we can see a slight deviation, which probably comes from the fact that our Proportion variable has hard limits at 0 and 1. Even so, we can accept it as being normally distributed and proceed to the next assumption safely.
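
To make the reading of the quantile-quantile plot more concrete, the sketch below rebuilds its coordinates by hand on a made-up sample (the proportion vector is an assumption, not the chapter's data). The dots are the sorted observations plotted against the theoretical normal quantiles, and by default qqline() draws its reference line through the first and third quartiles:

```r
# Made-up sample standing in for the Proportion variable.
set.seed(123)
proportion <- rnorm(50, mean = 0.5, sd = 0.1)

# Dots: sorted observations against theoretical normal quantiles.
sample_quantiles <- sort(proportion)
theoretical_quantiles <- qnorm(ppoints(length(proportion)))

# qqnorm() computes the same coordinates; plot.it = FALSE skips drawing.
qq <- qqnorm(proportion, plot.it = FALSE)
all.equal(sort(qq$x), theoretical_quantiles)
all.equal(sort(qq$y), sample_quantiles)

# Draw the manual version and add the reference line.
plot(theoretical_quantiles, sample_quantiles,
     main = "Normal QQ-Plot (manual)",
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(proportion)
```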