247
$\begingroup$

I am working with a small dataset (21 observations) and have the following normal QQ plot in R:

enter image description here

Seeing that the plot does not support normality, what could I infer about the underlying distribution? It seems to me that a distribution more skewed to the right would be a better fit, is that right? Also, what other conclusions can we draw from the data?

$\endgroup$
7
  • 12
    $\begingroup$ You're correct that it indicates right skewness. I'll try to locate some of the posts on interpreting QQ plots. $\endgroup$
    – Glen_b
    Commented Jun 5, 2014 at 10:56
  • 4
    $\begingroup$ You don't have to conclude; you just need to decide what to try next. Here I would consider square rooting or logging the data. $\endgroup$
    – Nick Cox
    Commented Jun 5, 2014 at 12:56
  • 12
    $\begingroup$ Tukey's Three-Point Method works very well for using Q-Q plots to help you identify ways to re-express a variable in a way that makes it approximately normal. For instance, picking the penultimate points in the tails and the middle point in this graphic (which I estimate to be $(-1.5,2)$, $(1.5,220)$, and $(0,70)$), you will easily find that the square root comes close to linearizing them. Thus you can infer that the underlying distribution is approximately square root normal. $\endgroup$
    – whuber
    Commented Jun 5, 2014 at 13:09
  • 3
    $\begingroup$ @Glen_b The answer to my question has some information: stats.stackexchange.com/questions/71065/… and the link in the answer has another good source: stats.stackexchange.com/questions/52212/qq-plot-does-not-match-histogram $\endgroup$
    – tpg2114
    Commented Jun 5, 2014 at 16:26
  • 1
    $\begingroup$ If you do not know what transform to apply to retrieve normality, let an algorithm decide for you. Box-Cox transforms are a suitable algorithm for this task $\endgroup$
    – JacoSolari
    Commented Sep 9, 2020 at 10:34
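
The three-point check described in the comments above can be sketched numerically. This is a rough illustration in the spirit of that comment, not necessarily Tukey's exact procedure: the three points are the ones estimated there, while the helper names `power` and `slope_ratio` and the particular ladder of powers are assumptions made for the example.

```python
import math

# Points read off the plot in the comment above: (normal quantile, data value).
low, mid, high = (-1.5, 2.0), (0.0, 70.0), (1.5, 220.0)

def power(y, p):
    """Ladder-of-powers re-expression; p = 0 is taken to mean the logarithm."""
    return math.log(y) if p == 0 else y ** p

def slope_ratio(p):
    """Upper slope divided by lower slope after re-expressing y.

    A ratio of 1.0 means the three re-expressed points are collinear.
    """
    lower = (power(mid[1], p) - power(low[1], p)) / (mid[0] - low[0])
    upper = (power(high[1], p) - power(mid[1], p)) / (high[0] - mid[0])
    return upper / lower

# Hypothetical ladder of candidate powers (1 = raw data, 0 = log).
ladder = [1, 0.5, 1 / 3, 0, -1]
ratios = {p: slope_ratio(p) for p in ladder}

# Pick the power whose slope ratio is closest to 1 on a log scale.
best = min(ladder, key=lambda p: abs(math.log(ratios[p])))
print(ratios, best)
```

With only three hand-estimated points this is a guide for what to try next, not a fitted transformation.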

5 Answers

422
$\begingroup$

If the values lie along a line the distribution has the same shape (up to location and scale) as the theoretical distribution we have supposed.

Local behaviour: With sorted sample values on the y-axis and (approximate) expected quantiles on the x-axis, we can see how the values in some section of the plot differ locally from the overall linear trend by checking whether they are more or less concentrated than the theoretical distribution would suppose in that section of the plot:

sections out of four Q-Q plots

As we see, less concentrated points increase more rapidly, and more concentrated points less rapidly, than an overall linear relation would suggest. In the extreme cases these correspond to a gap in the density of the sample (which shows as a near-vertical jump) or a spike of constant values (values aligned horizontally). This allows us to spot a heavy tail or a light tail, and hence skewness greater or smaller than that of the theoretical distribution, and so on.
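
To make the "local slope" reading concrete, here is a minimal sketch that builds normal Q-Q points by hand. The function name `normal_qq_points` and the plotting positions $(i - 0.5)/n$ are assumptions (other conventions exist), and the "sample" is an idealized one made of exact exponential quantiles so that the pattern is free of sampling noise.

```python
import numpy as np
from scipy import stats

def normal_qq_points(sample):
    """Return (theoretical normal quantiles, sorted sample values)."""
    y = np.sort(np.asarray(sample, dtype=float))
    n = y.size
    # Plotting positions (i - 0.5) / n; one common convention among several.
    p = (np.arange(1, n + 1) - 0.5) / n
    x = stats.norm.ppf(p)
    return x, y

# Idealized right-skewed "sample": exact exponential quantiles, n = 21.
n = 21
p = (np.arange(1, n + 1) - 0.5) / n
sample = stats.expon.ppf(p)

x, y = normal_qq_points(sample)

# Slope between neighbouring points: a steep slope means the values are more
# spread out there than a normal would suggest, a shallow slope means they
# are more concentrated.
slopes = np.diff(y) / np.diff(x)
print(np.round(slopes, 2))
```

For this right-skewed example the slopes grow steadily from left to right: values are bunched in the left tail and stretched out in the right tail, which shows up as a plot that curves upward.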

Overall appearance:

Here's what QQ-plots look like (for particular choices of distribution) on average:

enter image description here

But randomness tends to obscure things, especially with small samples:

enter image description here

Note that at $n=21$ the results may be much more variable than shown there - I generated several such sets of six plots and chose a 'nice' set where you could kind of see the shape in all six plots at the same time. Sometimes straight relationships look curved, curved relationships look straight, heavy-tails just look skew, and so on - with such small samples, often the situation may be much less clear:

enter image description here

It's possible to discern more features than those (such as discreteness, for one example), but with $n=21$, even such basic features may be hard to spot; we shouldn't try to 'over-interpret' every little wiggle. As sample sizes become larger, generally speaking the plots 'stabilize' and the features become more clearly interpretable rather than representing noise. [With some very heavy-tailed distributions, the rare large outlier might prevent the picture stabilizing nicely even at quite large sample sizes.]
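
One way to get a feel for how much noise to expect at $n=21$ is to simulate truly normal samples and see how straight their Q-Q plots turn out. The sketch below is only an illustration: it summarizes straightness by the correlation between sorted values and normal quantiles, which is a choice made here rather than something used in the plots above, and `qq_correlation` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def qq_correlation(sample):
    """Correlation between sorted sample and normal quantiles (1.0 = straight line)."""
    y = np.sort(np.asarray(sample, dtype=float))
    n = y.size
    x = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(1)
n_sims = 2000

# Every sample below really is normal; any curvature is pure noise.
small = np.array([qq_correlation(rng.normal(size=21)) for _ in range(n_sims)])
large = np.array([qq_correlation(rng.normal(size=1000)) for _ in range(n_sims)])

print("n = 21:   5%, 50%, 95% ->", np.percentile(small, [5, 50, 95]))
print("n = 1000: 5%, 50%, 95% ->", np.percentile(large, [5, 50, 95]))
```

The spread of the small-sample correlations is much wider than the large-sample one, which is the numerical counterpart of "straight relationships look curved" at small $n$.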

You may also find the suggestion here useful when trying to decide how much you should worry about a particular amount of curvature or wiggliness.

A more suitable guide for interpretation in general would also include displays at smaller and larger sample sizes.

$\endgroup$
16
  • 39
    $\begingroup$ This is a very practical guide, thank you very much for gathering all that information. $\endgroup$
    – JohnK
    Commented Jun 5, 2014 at 12:57
  • 5
    $\begingroup$ I understand that it is the shape and type of deviation from linearity that matters here, but still it looks odd that both axes are labeled " ... quantiles " and one axis goes as 0.2 0.4 0.6 and the other goes as -2 -1 0 1 2. Again it looks ok that some data points are within middle 40% of a theoretical distribution, but how can they be distributed between 3% of their own distribution, as the y-axis on your lower-right-most plot suggests? $\endgroup$
    – Macond
    Commented Dec 2, 2014 at 15:27
  • 2
    $\begingroup$ @Macond The y-axis shows the raw values of the data, not their quantiles. I agree that standardizing the y-axis would make things much clearer, and I have no idea why R doesn't do this by default. Could someone shed some light on this? $\endgroup$ Commented Feb 22, 2015 at 19:50
  • 4
    $\begingroup$ @GordonGustafson in respect of your first comment to Macond there's a very good reason why you don't standardize the data -- because a QQ plot is a display of the data! It's designed to show information in the data you supply to the function (it would make as much sense to standardize the data you supply to a boxplot or a histogram). If you transform it, it's no longer a display of the data (though the shape in the plot may be similar, you no longer show the location or scale on the plot). I'm not sure what it is you think would be clearer in a standardized plot - can you clarify? $\endgroup$
    – Glen_b
    Commented Feb 23, 2015 at 4:23
  • 2
    $\begingroup$ @ZiyaoWei No, a uniform really has very light tails -- arguably, no tails at all. Everything is within 2 MADs of the center. The first paragraph of this answer gives a clear, general, way to think about what 'heavier-tailed' means. $\endgroup$
    – Glen_b
    Commented May 14, 2015 at 3:48
100
+100
$\begingroup$

I made a Shiny app to help interpret normal QQ plots. Try this link.

In this app, you can adjust the skewness, tailedness (kurtosis) and modality of the data and see how the histogram and QQ plot change. Conversely, you can start from a given QQ plot pattern and check what skewness, tailedness and modality produce it.

For further details, see the documentation therein.


I realized that I don't have enough free space to host this app online. As requested, I am providing all three code files here: sample.R, server.R and ui.R. Anyone interested in running the app can load these files into RStudio and run it on their own PC.

The sample.R file:

    # Compute the positive part of a real number x, 
    # which is $\max(x, 0)$.
    positive_part <- function(x) {ifelse(x > 0, x, 0)}
    
    # This function generates n data points from some 
    # unimodal population.
    # Input: ----------------------------------------------------
    # n: sample size;
    # mu: the mode of the population, default value is 0.
    # skewness: the parameter that reflects the skewness of the 
    # distribution, note it is not
    #           the exact skewness defined in statistics textbook, 
    # the default value is 0.
    # tailedness: the parameter that reflects the tailedness 
    # of the distribution, note it is
    #             not the exact kurtosis defined in textbook, 
    # the default value is 0.
    
    # When all arguments take their default values, the data will 
    # be generated from standard 
    # normal distribution.
    
    random_sample <- function(n, mu = 0, skewness = 0, 
                        tailedness = 0){
      sigma = 1
      
      # The sampling scheme resembles the rejection sampling. 
      # For each step, an initial data point
      # was proposed, and it will be rejected or accepted based on 
      # the weights determined by the
      # skewness and tailedness of input.  

      reject_skewness <- function(x){
          scale = 1
          # if `skewness` > 0 (means data are right-skewed), 
          # then small values of x will be rejected
          # with higher probability.
          l <- exp(-scale * skewness * x)
          l/(1 + l)
      }
      
      reject_tailedness <- function(x){
          scale = 1
          # if `tailedness` < 0 (means data are lightly-tailed), 
          # then big values of x will be rejected with
          # higher probability.
          l <- exp(-scale * tailedness * abs(x))
          l/(1 + l)
      }
      
      # w is another layer option to control the tailedness, the 
      # higher the w is, the data will be
      # more heavily-tailed. 
      w = positive_part((1 - exp(-0.5 * tailedness)))/(1 + 
                 exp(-0.5 * tailedness))
      
      filter <- function(x){
        # A proposed data point is accepted only if it 
        # satisfies the following condition, 
        # which is how we control the skewness and tailedness of 
        # the data. (For example, a 
        # proposed data point is rejected more frequently when the 
        # product of the two rejection
        # weights below is larger.)
        accept <- runif(length(x)) > reject_tailedness(x) * 
                    reject_skewness(x)
        x[accept]
      }
      
      result <- filter(mu + sigma * ((1 - w) * rnorm(n) + w * rt(n, 5)))
      # Keep generating data points until the length of data vector 
      #  reaches n.
      while (length(result) < n) {
        result <- c(result, filter(mu + sigma * ((1 - w) * rnorm(n) +
                        w * rt(n, 5))))
      }
      result[1:n]
    }
    
    multimodal <- function(n, Mu, skewness = 0, tailedness = 0) {
      # Deal with the bimodal case.
      mumu <- as.numeric(Mu %*% rmultinom(n, 1, rep(1, length(Mu))))
      mumu + random_sample(n, skewness = skewness, 
               tailedness = tailedness)
    }

The server.R file:

    library(shiny)
    # Need 'ggplot2' package to get a better aesthetic effect.
    library(ggplot2)
    
    # The 'sample.R' source code is used to generate data to be 
    # plotted, based on the input skewness,  
    # tailedness and modality. For more information, see the source 
    # code in 'sample.R' code.
    source("sample.R")
    
    shinyServer(function(input, output) {
      # We generate 10000 data points from the distribution which 
      # reflects the specification of skewness,
      # tailedness and modality. 
      n = 10000
      
      # 'scale' is a parameter that controls the skewness and 
      # tailedness.
      scale = 1000
      
      # The `reactive` function is a trick to accelerate the app, 
      # which enables us only generate the data
      # once to plot two plots. The generated sample was stored in 
      # the `data` object to be called later.
      data <- reactive({
        # For `Unimodal` choice, we fix the mode at 0.
        if (input$modality == "Unimodal") {mu = 0}
        
        # For `Bimodal` choice, we fix the two modes at -2 and 2.
        if (input$modality == "Bimodal") {mu = c(-2, 2)}
        
        # Details will be explained in `sample.R` file.
        sample1 <- multimodal(n, mu, skewness = scale * 
             input$skewness, tailedness = scale * input$kurtosis)
        data.frame(x = sample1)})
      
      output$histogram <- renderPlot({
        # Plot the histogram.
        ggplot(data(), aes(x = x)) + 
          geom_histogram(aes(y = ..density..), binwidth = .5, 
               colour = "black", fill = "white") + 
          xlim(-6, 6) +
          # Overlay the density curve.
          geom_density(alpha = .5, fill = "blue") + 
          ggtitle("Histogram of Data") + 
          theme(plot.title = element_text(lineheight = .8, 
           face = "bold"))
      })
      
      output$qqplot <- renderPlot({
        # Plot the QQ plot.
        ggplot(data(), aes(sample = x)) + stat_qq() + 
         ggtitle("QQplot of Data") + 
          theme(plot.title = element_text(lineheight=.8, 
             face = "bold"))
        })
    })

Finally, the ui.R file:

    library(shiny)
    
    # Define UI for application that helps students interpret the 
    # pattern of (normal) QQ plots. 
    # By using this app, we can show students the different patterns 
    # of QQ plots (and the histograms,
    # for completeness) for different type of data distributions. 
    # For example, left skewed heavy tailed
    # data, etc. 
    
    # This app can be (and is encouraged to be) used in a reversed 
    # way, namely, show the QQ plot to the 
    # students first, then tell them based on the pattern of the QQ 
    # plot, the data is right skewed, bimodal,
    # heavy-tailed, etc.
        
    shinyUI(fluidPage(
      # Application title
      titlePanel("Interpreting Normal QQ Plots"),
      
      sidebarLayout(
        sidebarPanel(
          # The first slider can control the skewness of input data. 
          # "-1" indicates the most left-skewed 
          # case while "1" indicates the most right-skewed case.
          sliderInput("skewness", "Skewness", min = -1, max = 1, 
             value = 0, step = 0.1, ticks = FALSE),
          
          # The second slider can control the tailedness of input data. 
          #  "-1" indicates the most light tail
          # case while "1" indicates the most heavy tail case.
          sliderInput("kurtosis", "Tailedness", min = -1, max = 1, 
               value = 0, step = 0.1, ticks = FALSE),
          
          # This selectbox allows user to choose the number of modes 
          # of data, two options are provided:
          # "Unimodal" and "Bimodal".
          selectInput("modality", label = "Modality", 
                      choices = c("Unimodal" = "Unimodal", 
                         "Bimodal" = "Bimodal"),
                      selected = "Unimodal"),
          br(),
          # The following helper information will be shown on the 
          # user interface to give necessary
          # information to help users understand sliders.
          helpText(p("The skewness of data is controlled by moving 
            the", strong("Skewness"), "slider,", 
                   "the left side means left skewed while the right 
                   side means right skewed."), 
                   p("The tailedness of data is controlled by moving 
                    the", strong("Tailedness"), "slider,", 
                     "the left side means light tailed while the 
                      right side means heavy tailed."),
                   p("The modality of data is controlled by selecting 
                      the modality from", strong("Modality"),
                     "select box.")
                   )
      ),
      
      # The main panel outputs two plots. One plot is the histogram 
      # of data (with the non-parametric density
      # curve overlaid), to get a better visualization, we restricted 
      # the range of x-axis to -6 to 6 so 
      # that part of the data will not be shown when heavy-tailed 
      # input is chosen. The other plot is the 
      # QQ plot of data, as convention, the x-axis is the theoretical 
      # quantiles for standard normal distribution
      # and the y-axis is the sample quantiles of data. 
      mainPanel(
        plotOutput("histogram"),
        plotOutput("qqplot")
      )
    )
    )
    )
$\endgroup$
2
  • $\begingroup$ Link is not available !!!! @Zhanxiong $\endgroup$ Commented Sep 1, 2017 at 15:45
  • $\begingroup$ It seems that the link fails to respond after a limited number of clicks every month. That's the reason I pasted the source code here (as requested by other users who encountered the same issue as you). You can paste the files into RStudio and run them on your own PC (after loading the required packages). $\endgroup$
    – Zhanxiong
    Commented Sep 1, 2017 at 19:24
25
$\begingroup$

A very helpful (and intuitive) explanation is given by prof. Philippe Rigollet in the MIT MOOC course: 18.650 Statistics for Applications, Fall 2016 - see video at 45 mins

https://www.youtube.com/watch?v=vMaKx9fmJHE

I have crudely copied his diagram which I keep in my notes as I find it very useful.

QQ plot sketch diagram

In example 1, the top-left diagram, we see that in the right tail the empirical (or sample) quantile is less than the theoretical quantile:

$Q_e < Q_t$

This can be interpreted using the probability density functions. For the same $\alpha$ value, the empirical quantile is to the left of the theoretical quantile, which means that the right tail of the empirical distribution is "lighter" than the right tail of the theoretical distribution, i.e. it falls faster to values close to zero.

enter image description here
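
The quantile comparison above can be checked numerically. The sketch below is an illustration under stated assumptions: the uniform and Student-$t$ distributions are stand-ins chosen here for "light" and "heavy" right tails (they are not from the lecture), both are rescaled to unit variance, and the $\alpha$ levels are deliberately far out in the tail.

```python
import numpy as np
from scipy import stats

# Far right-tail probability levels.
alpha = np.array([0.99, 0.995, 0.999])

# Theoretical (standard normal) quantiles Q_t.
q_normal = stats.norm.ppf(alpha)

# Light-tailed stand-in: uniform on [-sqrt(3), sqrt(3)], mean 0, variance 1.
half_width = np.sqrt(3.0)
q_light = stats.uniform(loc=-half_width, scale=2 * half_width).ppf(alpha)

# Heavy-tailed stand-in: Student t with 5 df, rescaled to variance 1
# (the variance of t_5 is 5/3).
q_heavy = stats.t(df=5, scale=np.sqrt(3.0 / 5.0)).ppf(alpha)

for a, qt, ql, qh in zip(alpha, q_normal, q_light, q_heavy):
    print(f"alpha={a}: normal={qt:.3f}  light={ql:.3f}  heavy={qh:.3f}")
```

At each of these levels the light-tailed quantile sits to the left of the normal one and the heavy-tailed quantile sits to the right, matching the $Q_e < Q_t$ reading of a lighter right tail.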

$\endgroup$
0
8
$\begingroup$

Since this thread has been deemed to be a definitive "how to interpret the normal q-q plot" StackExchange post, I would like to point readers to a nice, precise mathematical relationship between the normal q-q plot and the excess kurtosis statistic.

Here it is:

https://stats.stackexchange.com/a/354076/102879

A brief (and oversimplified) summary is given as follows (see the link for more precise mathematical statements): You can actually see excess kurtosis in the normal q-q plot as the average distance between the data quantiles and the corresponding theoretical normal quantiles, weighted by distance from data to the mean. Thus, when the absolute values in the tails of the q-q plot generally deviate from the expected normal values greatly in the extreme directions, you have positive excess kurtosis.

Because kurtosis is the average of these deviations weighted by distances from the mean, the values near the center of the q-q plot have little impact on kurtosis. Hence, excess kurtosis is not related to the center of the distribution, where the "peak" is. Rather, excess kurtosis is almost entirely determined by the comparison of the tails of the data distribution to the normal distribution.
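
As a rough numerical illustration of this point (not the precise result in the linked answer), the sketch below measures how much of the fourth-moment average comes from observations within one standard deviation of the mean. The heavy-tailed sample, the cutoff of one standard deviation, and the variable names are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.standard_t(df=6, size=100_000)  # heavy-tailed toy sample

# Standardize, then look at the fourth powers whose mean is the kurtosis.
z = (x - x.mean()) / x.std()
fourth = z ** 4
kurtosis = fourth.mean()  # equals 3 + excess kurtosis for this estimator

# Observations within one standard deviation of the mean.
central = np.abs(z) <= 1.0
share_of_points = central.mean()
share_of_kurtosis = fourth[central].sum() / fourth.sum()

print(f"excess kurtosis: {kurtosis - 3:.2f}")
print(f"central share of observations: {share_of_points:.1%}")
print(f"central share of kurtosis sum: {share_of_kurtosis:.1%}")
```

Most of the observations lie in the central region, yet they account for only a small share of the sum that defines kurtosis; the rest comes from the tails.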

$\endgroup$
0
3
$\begingroup$

In addition to the nice explanations above, here is a snippet that generates a gallery of (pretty much) self-explanatory examples, which I found useful when teaching university classes in statistics.

enter image description here

    from statsmodels.graphics.gofplots import qqplot
    from matplotlib import pyplot as plt
    import numpy as np
    from scipy import stats

    N = 1000

    # One frozen distribution per tail/skew pattern.
    dists = {
        'light-tailed': stats.truncnorm(-1.5, 1.5),
        'heavy-tailed': stats.laplace(),
        'right-skewed': stats.beta(2, 8),
        'left-skewed': stats.beta(8, 2),
    }

    fig, axs = plt.subplots(4, 2, figsize=(12, 12))
    for (name, dist), (ax1, ax2) in zip(dists.items(), axs):
        x = dist.rvs(N)
        # Left column: normal Q-Q plot with a line through the quartiles.
        qqplot(x, line='q', ax=ax1)
        # Right column: histogram with a moment-matched normal density.
        ax2.hist(x, density=True, bins='auto', histtype='stepfilled')
        normal_approx = stats.norm(loc=x.mean(), scale=x.std())
        xgrid = np.linspace(x.mean() - x.std() * 3, x.mean() + x.std() * 3, 50)
        ax2.plot(xgrid, normal_approx.pdf(xgrid))
        label = name + ":" + dist.dist.name + str(dist.args)
        ax1.set_title(label)
        ax2.set_title(label)

    plt.tight_layout()
    plt.show()
$\endgroup$
