Chapter 4 Descriptive statistics
4.1 Quick summary statistics
The first step in descriptive statistics is to get a quick, sensible overview of the data. Always ask yourself: “Is the output consistent with what I expected, given the type of each variable?”
R
To get a descriptive summary, the classic R function is summary():
summary(sleep[, 1:3])
     number            age           gender
 Min.   :  1.00   Min.   :23.00   female: 19
 1st Qu.: 36.25   1st Qu.:42.00   male  :111
 Median : 71.50   Median :48.00
 Mean   : 70.30   Mean   :47.66
 3rd Qu.:103.75   3rd Qu.:54.00
 Max.   :138.00   Max.   :74.00
Note that the example above asks for a summary of the first three columns of the dataset only. Indexing a dataset in this matrix-like way is often useful and works much like the game Battleship: in dataset[row_index, column_index], whatever comes before the comma selects rows and whatever comes after it selects columns.
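The indexing rule can be tried on a small data frame. The sketch below uses a made-up data frame called toy (it is not the sleep dataset, and its values are invented for illustration only).

```r
# Hypothetical toy data frame, invented for illustration (not the sleep data)
toy <- data.frame(
  number = 1:5,
  age    = c(23, 42, 48, 54, 74),
  gender = factor(c("male", "female", "male", "male", "female"))
)

# Rows 1 to 2, all columns (nothing after the comma means "every column")
first_rows <- toy[1:2, ]

# All rows, columns 2 and 3 (nothing before the comma means "every row")
last_cols <- toy[, 2:3]

# A single cell: row 3, column 2
one_cell <- toy[3, 2]

# Rows can also be selected with a logical condition, columns by name
older <- toy[toy$age > 45, c("number", "age")]

print(first_rows)
print(last_cols)
print(one_cell)
print(older)
```

Leaving one side of the comma empty is what keeps all rows or all columns, which is why sleep[, 1:3] returns every row of the first three columns.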
Additional statistical summary functions can be found in other R packages, for example stat.desc() in the pastecs package.
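To come back to the question about variable types: what summary() reports depends on how each column is stored. The sketch below uses a made-up data frame (toy, with invented values) to show how the same column is summarised differently as character and as factor.

```r
# Hypothetical toy data frame, invented for illustration (not the sleep data)
toy <- data.frame(
  age    = c(23, 42, 48, 54, 74),
  gender = c("male", "female", "male", "male", "female"),
  stringsAsFactors = FALSE
)

# Check how R stored each column
col_types <- sapply(toy, class)
print(col_types)

# Stored as character, the column is only described by Length / Class / Mode
print(summary(toy$gender))

# Converted to a factor, the same column is summarised as counts per level
toy$gender <- factor(toy$gender)
gender_counts <- summary(toy$gender)
print(gender_counts)
```

If a categorical variable shows up with Length, Class and Mode instead of counts, that is usually a sign it was read as text and may need converting with factor().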
Stata
The codebook command returns the main characteristics of the dataset, i.e. the number of observations, the number of variables, and the characteristics of each variable.
codebook
Warning: The output can be long as every variable is summarized.
A better approach could be to look variable by variable.
summarize age, detail
                             age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           23             23
 5%           29             23
10%           32             23       Obs                 130
25%           42             27       Sum of wgt.         130

50%           48                      Mean           47.65891
                        Largest       Std. dev.       10.6905
75%           54             71
90%           61             71       Variance       114.2868
95%           62             72       Skewness      -.1218772
99%           72             74       Kurtosis        2.80462
4.2 Central and deviation parameters
Central tendency and dispersion parameters are essential for summarizing and describing the distribution of your data. The best-known central parameter is the mean, but it is not the only one and not always the best choice, especially when the distribution is skewed. These parameters should be examined together, alongside graphical representations, for a better description.
R
In R, think of the English term: the word for average is “mean”, and that is the name of the function; the same goes for median and quantile. If you do not know the name of a function, search the help pages.
# average
mean(sleep$age)
# variance
var(sleep$age)
# standard deviation
sd(sleep$age)
# median
# Once values are sorted in increasing order, 50% of them lie below and 50% above
median(sleep$age)
# interquartile range
IQR(sleep$age)
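To see why the mean is not always the best central parameter for a skewed distribution, here is a minimal sketch on invented values (the vector stay is hypothetical, imagined as lengths of stay in days with one extreme value).

```r
# Hypothetical right-skewed values, invented for illustration
stay <- c(2, 3, 3, 4, 4, 5, 5, 6, 7, 60)

stay_mean   <- mean(stay)    # pulled upwards by the single extreme value
stay_median <- median(stay)  # middle of the sorted values, barely affected

print(c(mean = stay_mean, median = stay_median))

# Same comparison for spread: sd reacts strongly to the outlier, IQR much less
print(c(sd = sd(stay), IQR = IQR(stay)))
```

Here the mean (9.9) is larger than nine of the ten values, while the median (4.5) still sits in the middle of the data, which is why the median and IQR are often reported for skewed variables.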
For the quantile() the probs= argument allows you to specify which quantile of the distribution you want to look at.
# 25%-75%
quantile(sleep$age, probs=c(0.25, 0.75))
25% 75%
42 54
# quartiles
quantile(sleep$age, probs=seq(0, 1, 1/4))
0% 25% 50% 75% 100%
23 42 48 54 74
# tercile
quantile(sleep$age, probs = seq(0,1,1/3))
0% 33.33333% 66.66667% 100%
23 44 53 74
Interpretation: For age, the 1\(^{st}\) quartile indicates that 25% of the patients in our sample are 42 years old or younger, while the 3\(^{rd}\) quartile indicates that 75% are 54 years old or younger (equivalently, 25% are 54 years old or older).
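This reading of the quartiles can be checked directly by computing the share of values at or below each cut-off. The sketch below uses an invented vector of ages (hypothetical values, not the sleep data) and the default quantile() method.

```r
# Hypothetical ages, invented for illustration
ages <- c(23, 27, 32, 42, 44, 48, 50, 54, 61, 62, 72, 74)

# First and third quartiles (default quantile algorithm)
q <- quantile(ages, probs = c(0.25, 0.75))
print(q)

# Proportion of values at or below each quartile
share_below_q1 <- mean(ages <= q[["25%"]])
share_below_q3 <- mean(ages <= q[["75%"]])

print(c(below_Q1 = share_below_q1, below_Q3 = share_below_q3))
```

With real data the shares are only approximately 25% and 75%, because of ties and because the sample size is rarely a multiple of four.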
Stata
In Stata, similar commands are available:
mean age
Mean estimation Number of obs = 130
--------------------------------------------------------------
| Mean Std. err. [95% conf. interval]
-------------+------------------------------------------------
age | 47.65891 .9376188 45.80381 49.51402
--------------------------------------------------------------
tabstat age, stats(n mean median min max)
Variable | N Mean p50 Min Max
-------------+--------------------------------------------------
age | 130 47.65891 48 23 74
----------------------------------------------------------------
4.3 Confidence interval
A confidence interval is a range of values that is expected to contain the true value of the estimated parameter with a given level of confidence. The level is often 95%: if the estimation were repeated 100 times on new samples, about 5 of the resulting intervals would be expected to miss the true value.
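For a mean, the interval is commonly built as follows. This is the standard t-based form, assuming approximately normal data or a large enough sample; the notation is introduced here for illustration.

\[
\bar{x} \;\pm\; t_{1-\alpha/2,\;n-1} \times \frac{s}{\sqrt{n}}
\]

Here \(\bar{x}\) is the sample mean, \(s\) the sample standard deviation, \(n\) the sample size, and \(t_{1-\alpha/2,\;n-1}\) the quantile of the Student distribution with \(n-1\) degrees of freedom; \(s/\sqrt{n}\) is the standard error of the mean.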
R
In R, the epiDisplay library is useful for many computations in epidemiology, notably the confidence interval of the mean. A simple call to the ci() function computes the 95% confidence interval. The confidence level is specified through its counterpart, the risk alpha; the default is alpha = 0.05, so there is no need to specify it for a 95% interval.
library(epiDisplay)
ci(sleep$age)
n mean sd se lower95ci upper95ci
130 47.65891 10.6905 0.9376188 45.80381 49.51402
ci(sleep$age, alpha = 0.01)
n mean sd se lower99ci upper99ci
130 47.65891 10.6905 0.9376188 45.20753 50.1103
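The 95% interval can also be recomputed by hand in base R from the n, mean and sd printed above. This is a sketch that assumes the usual Student t interval; the result agrees with the ci() output to the displayed precision.

```r
# Summary statistics for age, copied from the ci() output above
n     <- 130
x_bar <- 47.65891
s     <- 10.6905
alpha <- 0.05

# Standard error of the mean
se <- s / sqrt(n)

# Critical value of the Student t distribution with n - 1 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Lower and upper bounds of the (1 - alpha) confidence interval
lower <- x_bar - t_crit * se
upper <- x_bar + t_crit * se

print(round(c(se = se, lower = lower, upper = upper), 5))
```

Setting alpha <- 0.01 in the same code gives the 99% interval, which is wider because the critical value is larger.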
Stata
In Stata, the syntax is similar, but the parameter to estimate must be specified (here, means) and the confidence level is given directly as a percentage with level() rather than as alpha.
ci means age
ci means age, level(99)
Variable | Obs Mean Std. err. [95% conf. interval]
-------------+---------------------------------------------------------------
age | 130 47.65891 .9376188 45.80381 49.51402
Variable | Obs Mean Std. err. [99% conf. interval]
-------------+---------------------------------------------------------------
age | 130 47.65891 .9376188 45.20753 50.1103
4.4 Cross table
Contingency tables (cross tabulations) are used to compute the absolute or relative frequencies (proportions) of categorical variables.
R
In R, the basic functions are table() and prop.table().
crosstab <- table("Gender"=sleep$gender, "Diabetes"=sleep$diabetes,
                  useNA = "always")
addmargins(crosstab)
Diabetes
Gender No Yes <NA> Sum
female 17 2 0 19
male 98 7 6 111
<NA> 0 0 0 0
Sum 115 9 6 130
In prop.table(), the margin argument specifies the dimension along which the relative frequencies are computed: 1 for rows, 2 for columns. When margin is omitted, the percentages are relative to the grand total.
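The effect of margin can be checked on a small table rebuilt by hand from the Gender by Diabetes counts shown above. This is a self-contained sketch: the object names and the label "Missing" for the NA column are choices made here for illustration.

```r
# Counts copied from the Gender x Diabetes table above
# (the "Missing" label stands in for the <NA> column)
counts <- matrix(
  c(17, 98,   # Diabetes = No
     2,  7,   # Diabetes = Yes
     0,  6),  # Diabetes missing
  nrow = 2,
  dimnames = list(Gender   = c("female", "male"),
                  Diabetes = c("No", "Yes", "Missing"))
)
tab <- as.table(counts)

total_pct <- prop.table(tab) * 100              # 100% is the grand total
row_pct   <- prop.table(tab, margin = 1) * 100  # each row sums to 100%
col_pct   <- prop.table(tab, margin = 2) * 100  # each column sums to 100%

print(round(row_pct, 2))
print(rowSums(row_pct))
print(colSums(col_pct))
```

Checking which sums equal 100 is a quick way to confirm which margin was used before interpreting the percentages.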
# The 100% are on the grand total
addmargins(round(prop.table(crosstab)*100,digits=2))
Diabetes
Gender No Yes <NA> Sum
female 13.08 1.54 0.00 14.62
male 75.38 5.38 4.62 85.38
<NA> 0.00 0.00 0.00 0.00
Sum 88.46 6.92 4.62 100.00
# Percentage by row
round(prop.table(crosstab,margin=1)*100,digits=2)
Diabetes
Gender No Yes <NA>
female 89.47 10.53 0.00
male 88.29 6.31 5.41
<NA>
# Percentage by column
round(prop.table(crosstab,2)*100,digits=2)
Diabetes
Gender No Yes <NA>
female 14.78 22.22 0.00
male 85.22 77.78 100.00
<NA> 0.00 0.00 0.00
Stata
In Stata, the tabulate command produces a two-way table of frequency counts; the row and column options add row and column percentages.
tabulate gender diabetes
tabulate gender diabetes, row
tabulate gender diabetes, column
| diabetes
gender | NA No Yes | Total
-----------+---------------------------------+----------
female | 0 17 2 | 19
male | 6 98 7 | 111
-----------+---------------------------------+----------
Total | 6 115 9 | 130
+----------------+
| Key |
|----------------|
| frequency |
| row percentage |
+----------------+
| diabetes
gender | NA No Yes | Total
-----------+---------------------------------+----------
female | 0 17 2 | 19
| 0.00 89.47 10.53 | 100.00
-----------+---------------------------------+----------
male | 6 98 7 | 111
| 5.41 88.29 6.31 | 100.00
-----------+---------------------------------+----------
Total | 6 115 9 | 130
| 4.62 88.46 6.92 | 100.00
+-------------------+
| Key |
|-------------------|
| frequency |
| column percentage |
+-------------------+
| diabetes
gender | NA No Yes | Total
-----------+---------------------------------+----------
female | 0 17 2 | 19
| 0.00 14.78 22.22 | 14.62
-----------+---------------------------------+----------
male | 6 98 7 | 111
| 100.00 85.22 77.78 | 85.38
-----------+---------------------------------+----------
Total | 6 115 9 | 130
| 100.00 100.00 100.00 | 100.00
The tabulate command can also be used with various measures of association, including the common Pearson’s chi-square and Fisher’s exact tests (see univariate test for examples).