Chapter 3 Data management

3.1 Renaming variables

Variables names are not always explicit and you might want to rename some.

R

In R data.frame, variables are in column and to retrieve their names you can use the colnames() function. It will return a vector which values can be accessed and modify using indexing.

colnames(sleep)
colnames(sleep)[1] <- "identifier"

Note : rownames() does also exist if you need to access and manipulate the individuals

Stata

In Stata, one way is to use the rename function.

rename number identifier

3.2 Missing values

Missing values are always problematic. First, because it is often difficult to know why they are missing: not answered, not attributable, lost in follow-up, miss recording…They can be largely missing for an individuals as well as for a specific variable.

R

In R, you can assess the number missing values with the summary() function or the is.na() conditional testing function.

is.na(sleep$age)

A basic and crude method is to replace missing value by a value such as the average value. For the variable age, let’s replace the missing value by the mean age of sample using the indexing technique.

# create a fake missing value
# for individual #1
sleep$age[1] <- NA
sleep$age[1:5]
[1] NA 59 54 53 51
na_id <- is.na(sleep$age)
sleep$age[na_id] <- mean(sleep$age,
                         na.rm=TRUE)
round(sleep$age[1:5],0)
[1] 48 59 54 53 51

Stata

In Stata an overview of the missing data can be done as follow:

misstable summarize
misstable patterns
(variables nonmissing or string)

(no missing values)

To replace one missing value by the mean of the variable, do

egen meanage = mean(age)
replace age = meanage if missing(age)
(0 real changes made)

Above a certain threshold that varies, from 4% to 20%, researchers try to do multiple imputation of missing values. Although there is no perfect solution, there are basically 3 scenario and associated imputation techniques. Data are missing completely at random (MCAR) or data are missing at random (MAR) or data are missing not at random (MNAR). For the latest, it would be difficult to perform a legitimate analysis. This advanced topic is not covered in MPH1 class but see below for references if needed.

For multiple imputation with R see the package Mice for more sophisticated missing values imputation

For multiple imputation with Stata see Rose Meideiros presentation.

3.3 Create new variables

3.3.1 Changing scale

In epidemiology, you may have to transform and change the scale of your variables. For instance, a logarithm transformation might help approximate a Normal distribution for some skewed continuous variable. You might also want to normalized or standardized (sometimes called z-score) your data. When you normalize, you bring all the values between 0 and 1, while preserving the distances between the values. When you standardize, you bring all the values between 0 and 1 with a mean \(\mu\) to 0 and a standard deviation \(\sigma\) to 1.

R

In R the transformation can only be some simple computation:

sleep$age_log <- log(sleep$age)
sleep$age_sqr <- sleep$age^2

To normalize or standardize, you can do:

# normalization
minage <- sleep$age 
maxage <- sleep$age
sleep$age_norm<- 
  (sleep$age-minage)/(maxage-minage)
# standardization (z-score)
sleep$age_std<- 
  (sleep$age-mean(sleep$age))/sd(sleep$age)

Note: be careful if you have missing values you should add na.rm=TRUE as argument in your function calls.

Stata

The syntax are really similar with Stata

generate age_log = ln(age)
generate age_sqr = age^2
// normalization
generate age_norm = (age-23)/74-23
// standardization
generate age_std = (age-47.69)/10.69

3.3.2 From continuous to categorical

In epidemiology and data sciences more globally, you often need to convert numerical variable into to groups, i.e. a categorical variable. Here is an example with age as a continuous variable that you split into age groups.

R

In R, you can use the function cut() to split into intervals (equal or not) a quantitative variable. The 3 main arguments of the function are:

  • breaks : interval using number as boundaries
  • include.lowest: whether or not you include the lowest value if the data series
  • right : whether or not you include the upper-bound value of the interval in your group (inclusion is symbolized by ] and exclusion by ).
sleep$age_group <- cut(sleep$age, 
                       breaks = c(20,40,60,80),
                       include.lowest=TRUE, 
                       right = FALSE)
table(sleep$age_group)

[20,40) [40,60) [60,80] 
     29      85      16 

Stata

In Stata the syntax is quite similar

egen age_group = cut(age),at(20,40,60,80)
tabulate age_group
  age_group |      Freq.     Percent        Cum.
------------+-----------------------------------
         20 |         29       22.31       22.31
         40 |         85       65.38       87.69
         60 |         16       12.31      100.00
------------+-----------------------------------
      Total |        130      100.00

3.3.3 Dummy variables

A dummy variable is one that takes only the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.

Dummy variable are often create as the result of a conditional expression. The table below presents the differents symbols avaible in R and Stata for conditional testing.

Sign Definition TRUE. FALSE.
== equality 3 == 3 3 == 4
!= not equality 3 != 4 3 != 3
> greater than 4 > 2 3 > 4
< less than 3 < 4 4 < 2
>= greater than or equal to 4 >= 3 4 >= 5
<= less than or equal to 4 <= 5 4 <= 3
& and (both statements are true) (4 > 2) & (3 == 3)) (4 > 2) & (1 > 2)
| or (either statement is true) (3 == 2) | (1 <= 2) (4 < 2) | (1 > 2)

R

The combination of any of the above symbols with the ifelse() function useful for creating a dummy variable based on conditional testing as you can assign 0/1 values (or any specified values) to your variable depending on whether the element of test is TRUE or FALSE. As example, we create below

sleep$obese <- ifelse(sleep$bmi > 30,  1, 0)

Note that with R there is more than one way of doing things and if you adopt the use of tidyverse suite of packages or the data.table library you might use other functions. Here are examples:

# with dplyr
sleep <- sleep 
%>% mutate("obese"= as.numeric(bmi>25))
#with data.table
sleep <- data.table(sleep) 
sleep <- sleep[ , obese:=ifelse(bmi > 25,  1, 0)]

Stata

In Stata the commands are similar:

generate obese = 0 
replace obese = 1 if bmi>30
//handling possible missing values
replace obese = . if missing(bmi)

Or you can do

gen obese = bmi>30 if !missing(bmi)

R

In R and the library tidyverse, you can transform a long variable to wide. You can return boolean data to create dummy variable or even use other values from another variable.

sleep <- sleep %>% mutate(i=1) 
      %>% spread(gender, i, fill=0)

Stata

In Stata the function tabulate with the generate() option will generate a whole sets of dummy variables.

If you type as below you will see a frequency table of how many times gender takes on each of those values.

tabulate gender

If you type as below you will see the table, and tabulate will create variable names male, female that take on values 1 and 0, male being 1 when gender==male, female being 1 when gender==female.

tabulate group, generate(g)

3.4 Merging data set

Merging or joining data set can be particularly useful when enriching your data from multiple sources.

In the example below, we merge the results of a questionnaire on daily concentration level for each patient (here to download) with the sleep apnea data .

R

In R there is more than way way of doing things. The base function is merge().

dayconcentration <- read.table("DayConcentration.csv", header=TRUE, sep="\t")
# the commun key do not have the same label so it need to be specified
sleep <- merge(sleep, dayconcentration, by.x="number", by.y="id")
sleep[1:5, c("number","age", "concentration_mark")]  
  number age concentration_mark
1      1  48                5.0
2      2  58                1.5
3      3  48                4.0
4      4  68                1.5
5      5  55                3.5

You can use the all.x=TRUE or all.y=TRUE arguments to specify the merging mode. With tidyverse, you can also use different convenient functions left_join(), right_join(), and full_join()

# all values of the sleep data set are kept 
# non match in dayconcentration are ignored
sleep <- left_join(sleep, dayconcentration, by=c ("number"="id"))
# all values of dayconcentration are kept 
sleep <-right_join(sleep, dayconcentration, by=c ("number"="id"))
# all values are kept 
sleep <-full_join(sleep, dayconcentration, by=c ("number"="id"))

Note that the common key can be composed of more than one variable. See help pages for details.

Stata

In Stata the merging is less trivial than in R. The two datasets cannot be opened in Stata at the same time and the common (key) variables must have the same name.

Make sure that:

  • both datasets are sorted by all the ids and save as dta
  • one dataset is loaded into Stata (in this case sleep),
  • use the merge function but make sure to map where the using data is located (in this case for example “c:.dta”)
merge  number using dayconcentration