Chapter 4 Descriptive statistics

4.1 Quick summary statistics

The first step of descriptive statistics is to get a quick (but thoughtful) overview of the data. Always ask yourself: “Is the output consistent with what I expected, given the type of each variable?”

R

To get a descriptive summary, the classical R function is summary():

summary(sleep[, 1:3])
     number            age           gender   
 Min.   :  1.00   Min.   :23.00   female: 19  
 1st Qu.: 36.25   1st Qu.:42.00   male  :111  
 Median : 71.50   Median :48.00               
 Mean   : 70.30   Mean   :47.66               
 3rd Qu.:103.75   3rd Qu.:54.00               
 Max.   :138.00   Max.   :74.00               

Note that in the example above we asked for the summary of only the first 3 columns of the dataset. Indexing a dataset in this matrix-like way is often useful and works much like playing Battleship: in dataset[row_index, column_index], the index before the comma selects rows and the index after the comma selects columns.
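
The row and column indexing described above can be tried on a small made-up data frame. The sketch below uses only base R; the `toy` data frame and its values are hypothetical and are not the sleep data used in this chapter.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  number = 1:5,
  age    = c(23, 42, 48, 54, 74),
  gender = factor(c("male", "female", "male", "male", "female"))
)

# Rows go before the comma, columns after it
first_two_rows <- toy[1:2, ]           # rows 1 and 2, all columns
age_and_gender <- toy[, 2:3]           # all rows, columns 2 and 3
one_cell       <- toy[3, "age"]        # row 3, column "age"
older_than_45  <- toy[toy$age > 45, ]  # rows selected by a logical condition

one_cell
dim(older_than_45)
```

Leaving one side of the comma empty means "keep everything" along that dimension, which is why summary(sleep[, 1:3]) summarizes all rows of the first three columns.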

Additional statistical summary functions can be found in other R packages, for example stat.desc() in the pastecs package.
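
Whichever summary function is used, its output depends on how each column is stored. As a minimal sketch of the type check suggested at the start of this section, the example below uses a hypothetical `toy` data frame (not the sleep data) to show how summary() treats the same variable as a character vector and as a factor.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  age    = c(23, 42, 48, 54, 74),
  gender = c("male", "female", "male", "male", "female"),
  stringsAsFactors = FALSE
)

# How is each column stored?
col_types <- sapply(toy, class)
col_types

# A character column is only summarized by its length and class
summary(toy$gender)

# Converting it to a factor makes summary() count each category
toy$gender <- factor(toy$gender)
gender_counts <- summary(toy$gender)
gender_counts
```

If a categorical variable shows up with only a length and a class, or a numeric variable shows up with category counts, the column type is probably not the one you expected.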

Stata

The codebook command returns the main characteristics of the data set, i.e. the number of observations, the number of variables, and the characteristics of each variable.

codebook

Warning: The output can be long as every variable is summarized.

A better approach could be to look at the data variable by variable.

summarize age, detail
                             age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           23             23
 5%           29             23
10%           32             23       Obs                 130
25%           42             27       Sum of wgt.         130

50%           48                      Mean           47.65891
                        Largest       Std. dev.       10.6905
75%           54             71
90%           61             71       Variance       114.2868
95%           62             72       Skewness      -.1218772
99%           72             74       Kurtosis        2.80462

4.2 Central and deviation parameters

Central and deviation (dispersion) parameters are essential to summarize and describe the distribution of your data. The best-known central parameter is the mean, but you should know that it is not the only one and not always the best one to use, especially when the distribution is skewed. These parameters should be looked at together, and alongside graphical representations, for a better description.
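
To see why the mean is not always the best choice for a skewed distribution, here is a minimal R sketch. The vector `x` is hypothetical and was chosen so that one value is much larger than the others.

```r
# Hypothetical right-skewed values, invented for illustration only
x <- c(1, 2, 3, 4, 100)

mean(x)    # pulled upwards by the single large value
median(x)  # stays in the middle of the bulk of the data

# Dispersion measures react in the same way
sd(x)      # inflated by the extreme value
IQR(x)     # based on the quartiles, so much less affected
```

Here the mean is 22 while the median is 3, so the two parameters tell very different stories about the same five values. This is the kind of situation in which a plot helps to decide which summary is appropriate.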

R

In R, think of the English term: the word for average is “mean”, and that is also the name of the function; the same goes for median and quantile. If you do not know the name of a function, search the help pages.

# average
mean(sleep$age)
# variance
var(sleep$age)
# standard deviation
sd(sleep$age)
# median
# When values are sorted in increasing order, 50% lie below and 50% above
median(sleep$age)
# interquartile range
IQR(sleep$age)

For quantile(), the probs= argument lets you specify which quantiles of the distribution you want to look at.

# 25%-75%
quantile(sleep$age, probs=c(0.25, 0.75))
25% 75% 
 42  54 
# quartiles
quantile(sleep$age, probs=seq(0, 1, 1/4))
  0%  25%  50%  75% 100% 
  23   42   48   54   74 
# terciles
quantile(sleep$age, probs = seq(0,1,1/3))
       0% 33.33333% 66.66667%      100% 
       23        44        53        74 

Interpretation: For age, the 1\(^{st}\) quartile indicates that 25% of the patients in our sample are 42 years old or younger, while the 3\(^{rd}\) quartile indicates that 75% are 54 years old or younger (equivalently, 25% are 54 years old or older).
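
This reading of the quartiles can be checked directly by counting the share of values at or below each one. The sketch below uses a hypothetical `ages` vector (not the sleep data) and the default quantile() method.

```r
# Hypothetical ages, invented for illustration only
ages <- c(23, 31, 38, 42, 47, 52, 58, 66)

q <- quantile(ages, probs = c(0.25, 0.75))
q

# Share of values at or below each quartile
share_below_q1 <- mean(ages <= q[["25%"]])
share_below_q3 <- mean(ages <= q[["75%"]])
c(share_below_q1, share_below_q3)
```

With these eight values the shares are exactly 0.25 and 0.75. On real data with ties or other sample sizes the shares are only approximately 25% and 75%, because quantile() interpolates between observations.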

Stata

In Stata, similar commands are available:

mean age
Mean estimation                            Number of obs = 130

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
         age |   47.65891   .9376188      45.80381    49.51402
--------------------------------------------------------------
tabstat age, stats(n mean median min max)
    Variable |         N      Mean       p50       Min       Max
-------------+--------------------------------------------------
         age |       130  47.65891        48        23        74
----------------------------------------------------------------

4.3 Confidence interval

The confidence interval is the interval in which the true value of the estimated parameter lies with a certain level of confidence. The level is often 95%: if the estimation were repeated 100 times, about 5 of the resulting intervals would be expected to miss the true value.

R

In R, the epiDisplay library is useful for many computations in epidemiology, notably the confidence interval of the mean. A simple call to the ci() function computes the 95% confidence interval. The confidence level is specified through its counterpart, the risk alpha; the default is alpha = 0.05, so there is no need to specify it for a 95% interval.

library(epiDisplay)
ci(sleep$age)
   n     mean      sd        se lower95ci upper95ci
 130 47.65891 10.6905 0.9376188  45.80381  49.51402
ci(sleep$age, alpha = 0.01)
   n     mean      sd        se lower99ci upper99ci
 130 47.65891 10.6905 0.9376188  45.20753   50.1103
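
The interval can also be reproduced by hand in base R from the mean, the standard deviation and the sample size. The sketch below assumes a confidence interval based on Student's t distribution, which appears consistent with the values printed above; `ci_mean()` is a hypothetical helper written for this illustration, not a function of epiDisplay.

```r
# Summary statistics copied from the ci() output above
n      <- 130
mean_x <- 47.65891
sd_x   <- 10.6905

# Standard error of the mean
se <- sd_x / sqrt(n)

# Hypothetical helper: t-based confidence interval for a mean
ci_mean <- function(mean_x, se, n, alpha = 0.05) {
  t_crit <- qt(1 - alpha / 2, df = n - 1)
  c(lower = mean_x - t_crit * se, upper = mean_x + t_crit * se)
}

ci95 <- ci_mean(mean_x, se, n)
ci99 <- ci_mean(mean_x, se, n, alpha = 0.01)
ci95
ci99
```

Lowering alpha from 0.05 to 0.01 increases the critical value, so the 99% interval is wider than the 95% one. Small differences in the last decimals are expected because the inputs are rounded.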

Stata

In Stata, the syntax is similar, but the parameter to estimate (here means) must be specified, and the confidence level is given directly as a percentage through the level() option rather than as alpha.

ci means age
ci means age, level(99)
    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
         age |        130    47.65891    .9376188        45.80381    49.51402

    Variable |        Obs        Mean    Std. err.       [99% conf. interval]
-------------+---------------------------------------------------------------
         age |        130    47.65891    .9376188        45.20753     50.1103

4.4 Cross table

Contingency tables (or cross-tabulated tables) are used to compute the absolute or relative frequencies (proportions) of categorical variables.

R

In R, the basic functions are table() and prop.table().

crosstab <- table("Gender"=sleep$gender, "Diabetes"=sleep$diabetes, 
                  useNA = "always")
addmargins(crosstab)
        Diabetes
Gender    No Yes <NA> Sum
  female  17   2    0  19
  male    98   7    6 111
  <NA>     0   0    0   0
  Sum    115   9    6 130

In prop.table(), the margin argument specifies the dimension (1 for rows, 2 for columns) along which the relative frequencies are computed.

# Percentages of the grand total (the whole table sums to 100%)
addmargins(round(prop.table(crosstab)*100,digits=2))
        Diabetes
Gender       No    Yes   <NA>    Sum
  female  13.08   1.54   0.00  14.62
  male    75.38   5.38   4.62  85.38
  <NA>     0.00   0.00   0.00   0.00
  Sum     88.46   6.92   4.62 100.00
# Percentage by row
round(prop.table(crosstab,margin=1)*100,digits=2)
        Diabetes
Gender      No   Yes  <NA>
  female 89.47 10.53  0.00
  male   88.29  6.31  5.41
  <NA>                    
# Percentage by column
round(prop.table(crosstab,2)*100,digits=2)
        Diabetes
Gender       No    Yes   <NA>
  female  14.78  22.22   0.00
  male    85.22  77.78 100.00
  <NA>     0.00   0.00   0.00

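A quick way to confirm which margin was used is to check which sums equal 1. The sketch below rebuilds a small table from the counts shown above, leaving the missing values out for simplicity; the object names are chosen for this illustration only.

```r
# Counts copied from the cross table above, missing values left out
tab <- as.table(matrix(
  c(17, 2,
    98, 7),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("female", "male"), Diabetes = c("No", "Yes"))
))

by_row <- prop.table(tab, margin = 1)  # each row sums to 1
by_col <- prop.table(tab, margin = 2)  # each column sums to 1

rowSums(by_row)
colSums(by_col)
```

With margin = 1 the proportions are read within each gender (for example, the share of women with diabetes), while with margin = 2 they are read within each diabetes status.
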
Stata

In Stata, the tabulate command produces a two-way table of frequency counts. The row and column options add row and column percentages, respectively.

tabulate gender diabetes
tabulate gender diabetes, row
tabulate gender diabetes, column
           |             diabetes
    gender |        NA         No        Yes |     Total
-----------+---------------------------------+----------
    female |         0         17          2 |        19 
      male |         6         98          7 |       111 
-----------+---------------------------------+----------
     Total |         6        115          9 |       130 


+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |             diabetes
    gender |        NA         No        Yes |     Total
-----------+---------------------------------+----------
    female |         0         17          2 |        19 
           |      0.00      89.47      10.53 |    100.00 
-----------+---------------------------------+----------
      male |         6         98          7 |       111 
           |      5.41      88.29       6.31 |    100.00 
-----------+---------------------------------+----------
     Total |         6        115          9 |       130 
           |      4.62      88.46       6.92 |    100.00 


+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |             diabetes
    gender |        NA         No        Yes |     Total
-----------+---------------------------------+----------
    female |         0         17          2 |        19 
           |      0.00      14.78      22.22 |     14.62 
-----------+---------------------------------+----------
      male |         6         98          7 |       111 
           |    100.00      85.22      77.78 |     85.38 
-----------+---------------------------------+----------
     Total |         6        115          9 |       130 
           |    100.00     100.00     100.00 |    100.00 

The tabulate command can be combined with various measures of association, including the common Pearson’s chi-square and Fisher’s exact tests (see the univariate tests for examples).