Chapter 3 Data management
3.1 Renaming variables
Variables names are not always explicit and you might want to rename some.
R
In R data.frame, variables are in column and to retrieve their names you can use the colnames() function. It will return a vector which values can be accessed and modify using indexing.
colnames(sleep)
colnames(sleep)[1] <- "identifier"
Note : rownames() does also exist if you need to access and manipulate the individuals
3.2 Missing values
Missing values are always problematic. First, because it is often difficult to know why they are missing: not answered, not attributable, lost in follow-up, miss recording…They can be largely missing for an individuals as well as for a specific variable.
R
In R, you can assess the number missing values with the summary() function or the is.na() conditional testing function.
is.na(sleep$age)
A basic and crude method is to replace missing value by a value such as the average value. For the variable age, let’s replace the missing value by the mean age of sample using the indexing technique.
# create a fake missing value
# for individual #1
$age[1] <- NA
sleep$age[1:5] sleep
[1] NA 59 54 53 51
<- is.na(sleep$age)
na_id $age[na_id] <- mean(sleep$age,
sleepna.rm=TRUE)
round(sleep$age[1:5],0)
[1] 48 59 54 53 51
Stata
In Stata an overview of the missing data can be done as follow:
summarize
misstable patterns misstable
(variables nonmissing or string)
(no missing values)
To replace one missing value by the mean of the variable, do
egen meanage = mean(age)
replace age = meanage if missing(age)
(0 real changes made)
Above a certain threshold that varies, from 4% to 20%, researchers try to do multiple imputation of missing values. Although there is no perfect solution, there are basically 3 scenario and associated imputation techniques. Data are missing completely at random (MCAR) or data are missing at random (MAR) or data are missing not at random (MNAR). For the latest, it would be difficult to perform a legitimate analysis. This advanced topic is not covered in MPH1 class but see below for references if needed.
For multiple imputation with R see the package Mice for more sophisticated missing values imputation
For multiple imputation with Stata see Rose Meideiros presentation.
3.3 Create new variables
3.3.1 Changing scale
In epidemiology, you may have to transform and change the scale of your variables. For instance, a logarithm transformation might help approximate a Normal distribution for some skewed continuous variable. You might also want to normalized or standardized (sometimes called z-score) your data. When you normalize, you bring all the values between 0 and 1, while preserving the distances between the values. When you standardize, you bring all the values between 0 and 1 with a mean \(\mu\) to 0 and a standard deviation \(\sigma\) to 1.
R
In R the transformation can only be some simple computation:
$age_log <- log(sleep$age)
sleep$age_sqr <- sleep$age^2 sleep
To normalize or standardize, you can do:
# normalization
<- sleep$age
minage <- sleep$age
maxage $age_norm<-
sleep$age-minage)/(maxage-minage)
(sleep# standardization (z-score)
$age_std<-
sleep$age-mean(sleep$age))/sd(sleep$age) (sleep
Note: be careful if you have missing values you should add na.rm=TRUE as argument in your function calls.
3.3.2 From continuous to categorical
In epidemiology and data sciences more globally, you often need to convert numerical variable into to groups, i.e. a categorical variable. Here is an example with age as a continuous variable that you split into age groups.
R
In R, you can use the function cut() to split into intervals (equal or not) a quantitative variable. The 3 main arguments of the function are:
- breaks : interval using number as boundaries
- include.lowest: whether or not you include the lowest value if the data series
- right : whether or not you include the upper-bound value of the interval in your group (inclusion is symbolized by
]
and exclusion by)
.
$age_group <- cut(sleep$age,
sleepbreaks = c(20,40,60,80),
include.lowest=TRUE,
right = FALSE)
table(sleep$age_group)
[20,40) [40,60) [60,80]
29 85 16
Stata
In Stata the syntax is quite similar
egen age_group = cut(age),at(20,40,60,80)
tabulate age_group
age_group | Freq. Percent Cum.
------------+-----------------------------------
20 | 29 22.31 22.31
40 | 85 65.38 87.69
60 | 16 12.31 100.00
------------+-----------------------------------
Total | 130 100.00
3.3.3 Dummy variables
A dummy variable is one that takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.
Dummy variable are often create as the result of a conditional expression. The table below presents the differents symbols avaible in R and Stata for conditional testing.
Sign | Definition | TRUE. | FALSE. |
---|---|---|---|
== | equality | 3 == 3 | 3 == 4 |
!= | not equality | 3 != 4 | 3 != 3 |
> | greater than | 4 > 2 | 3 > 4 |
< | less than | 3 < 4 | 4 < 2 |
>= | greater than or equal to | 4 >= 3 | 4 >= 5 |
<= | less than or equal to | 4 <= 5 | 4 <= 3 |
& | and (both statements are true) | (4 > 2) & (3 == 3)) | (4 > 2) & (1 > 2) |
| | or (either statement is true) | (3 == 2) | (1 <= 2) | (4 < 2) | (1 > 2) |
R
The combination of any of the above symbols with the ifelse() function useful for creating a dummy variable based on conditional testing as you can assign 0/1 values (or any specified values) to your variable depending on whether the element of test is TRUE or FALSE. As example, we create below
$obese <- ifelse(sleep$bmi > 30, 1, 0) sleep
Note that with R there is more than one way of doing things and if you adopt the use of tidyverse suite of packages or the data.table library you might use other functions. Here are examples:
# with dplyr
<- sleep
sleep %>% mutate("obese"= as.numeric(bmi>25))
#with data.table
<- data.table(sleep)
sleep <- sleep[ , obese:=ifelse(bmi > 25, 1, 0)] sleep
R
In R and the library tidyverse, you can transform a long variable to wide. You can return boolean data to create dummy variable or even use other values from another variable.
<- sleep %>% mutate(i=1)
sleep %>% spread(gender, i, fill=0)
Stata
In Stata the function tabulate with the generate() option will generate a whole sets of dummy variables.
If you type as below you will see a frequency table of how many times gender takes on each of those values.
tabulate gender
If you type as below you will see the table, and tabulate will create variable names male, female that take on values 1 and 0, male being 1 when gender==male, female being 1 when gender==female.
tabulate group, generate(g)
3.4 Merging data set
Merging or joining data set can be particularly useful when enriching your data from multiple sources.
In the example below, we merge the results of a questionnaire on daily concentration level for each patient (here to download) with the sleep apnea data .
R
In R there is more than way way of doing things. The base function is merge().
<- read.table("DayConcentration.csv", header=TRUE, sep="\t")
dayconcentration # the commun key do not have the same label so it need to be specified
<- merge(sleep, dayconcentration, by.x="number", by.y="id")
sleep 1:5, c("number","age", "concentration_mark")] sleep[
number age concentration_mark
1 1 48 5.0
2 2 58 1.5
3 3 48 4.0
4 4 68 1.5
5 5 55 3.5
You can use the all.x=TRUE or all.y=TRUE arguments to specify the merging mode. With tidyverse, you can also use different convenient functions left_join(), right_join(), and full_join()
# all values of the sleep data set are kept
# non match in dayconcentration are ignored
<- left_join(sleep, dayconcentration, by=c ("number"="id"))
sleep # all values of dayconcentration are kept
<-right_join(sleep, dayconcentration, by=c ("number"="id"))
sleep # all values are kept
<-full_join(sleep, dayconcentration, by=c ("number"="id")) sleep
Note that the common key can be composed of more than one variable. See help pages for details.
Stata
In Stata the merging is less trivial than in R. The two datasets cannot be opened in Stata at the same time and the common (key) variables must have the same name.
Make sure that:
- both datasets are sorted by all the ids and save as dta
- one dataset is loaded into Stata (in this case sleep),
- use the merge function but make sure to map where the using data is located (in this case for example “c:.dta”)
merge number using dayconcentration