Chapter 2 Data type and data format

2.1 Data type

Often when importing data, statistical software can mistake a numeric variable for a string variable or a categorical variable for a string variable. It might not be visible at first but attempts to run commands will return error. Do verify you data type before type of data analysis.

Using the sleep apnea dataset, you can pratice and verify the data type of each of the variables. As mentioned previously in R you can have a overview with the function str() and in Stata you can use describe. You can also check each variables with quick function call.

2.1.1 Numeric

R

In R the is.numeric() function will return a logical which indicates you whether or not the variable is a numeric.

is.numeric(sleep$age)

[1] TRUE

## if it is not the proper type
sleep$age <- as.numeric(sleep$age)

Stata

In Stata, if age is encoded as a string and was therefore all numbers, you could use the destring command. If you want to replace the existing variable, the command is simply

(15 vars, 130 obs)

destring age

This will replace the existing specified variable with the same data but now in a nonstring format.

If you prefer to retain the existing variable, you can generate a new variable that is a nonstring version of the existing variable.

generate age_num = real(age)

2.1.2 Logical

R

In R the is.logical() function will return a logical which indicates you whether or not the variable is a logic. For example, have a look at the binary variable diabetes coded 0 and 1 for absence or presence.

is.logical(sleep$diabetes)

If needed, you can convert the variable to a logical using the as.logical() function.

str(sleep$diabetes)
db_logic <-as.logical(sleep$diabetes)
str(db_logic)

A logical vector can then be useful to select/replace values from other variables with the indexing method that we will see below.

Stata

In Stata logical are boolean with 0 for FALSE and 1 for TRUE. As in R, they can be generate by conditional/logical expression (see the Create new variables section)

2.1.3 Date

Statistical software are really powerful in manipulating dates which is often useful is epidemiology. However it is not always easy to handle dates as there is different format and referential.

R

Dates and time variables are best read as strings. The default formats in R follow the rules of the ISO 8601 international standard which expresses a day as “2001-02-03.”

Below is a basic operation (subtraction) between 2 dates.

x <- as.Date("2022-06-21", 
             format="%Y-%m-%d")
y <- as.Date("2022-06-30", 
             format="%Y-%m-%d")
y - x

Time difference of 9 days

In R, dates are represented as the number of days since 1970-01-01, with negative values for earlier dates. If you import dates in numeric values from an other software with a different origin you will have to specify that origin to convert the numeric values into the correct dates. Please see as.Date() help page for more details.

The package lubridate can be particularly helpful for manipulating dates.

Stata

Dates and time variables are best read as strings. The numeric encoding that Stata uses is centered on the first millisecond of 01jan1960, that is, 01jan1960 00:00:00.000. The example below shows how to generate a variable names birthday which is a date.

gen birthday = date("21 Jan 1952", "DMY")
list
format birthday %td
list

2.1.4 Character

Character or string data often the imported data type for categorical variables.

R

In our sleep data set for instance, the gender variable was imported as character.

is.character(sleep$gender)

[1] TRUE

However, gender is more of a categorical variable with levels and labels. You will see below how to convert a character string to a categorical variable or (factor)

Stata

As seen above, you can convert string to numeric using destring or generate and vice versa. For instance, if gensder is imported as numeric you can convert it into string using:

generate gender_cat=string(gender)

The above will only work if all of the data is numeric. However, sometimes it’s not. In a case where your string variables are in fact strings (e.g., “female” instead of “1”) you have to tell Stata to encode [varname] the string data.Running this command will cause Stata to make a new numeric categorical variable wherein the data has labels that correspond to the old string values.(see below for factor type)

In Stata, the format function call is especially useful when rendering data in table of results for nice display.

format gender %6s

2.1.5 Factor

As written above, you often want to convert character or string data into categorical variables.

R

In R, the categorical variables are factor. Below, are example to convert or create a factor.

table(sleep$diabetes, sleep$gender)

   
    female male
  0     17   98
  1      2    7

sleep$gender <- as.factor(sleep$gender)
sleep$diabetes <- factor(sleep$diabetes, 
                         levels = c(0, 1), 
                         labels = c("No","Yes"))
table(sleep$diabetes, sleep$gender)

     
      female male
  No      17   98
  Yes      2    7

Stata

In Stata, similar conversions are sometimes needed.

tab diabetes gender

           |        gender
  diabetes |    female       male |     Total
-----------+----------------------+----------
         0 |        17         98 |       115 
         1 |         2          7 |         9 
        NA |         0          6 |         6 
-----------+----------------------+----------
     Total |        19        111 |       130

Here diabetes is a string variable with the values 1-0-NA. You can replace the variable diabetes to manipulate it as categorical using the destring function and the force options to ignore missing values.

destring diabetes, replace force
label define yesno_lbl 1 "yes" 0 "no"
// assign value label to variables
label values diabetes yesno_lbl

tab diabetes gender

diabetes: contains nonnumeric characters; replaced as byte
(6 missing values generated)

           |        gender
  diabetes |    female       male |     Total
-----------+----------------------+----------
        no |        17         98 |       115 
       yes |         2          7 |         9 
-----------+----------------------+----------
     Total |        19        105 |       124

2.2 Data format

2.2.1 Vector

R

A vector is the simplest R object. A vector combines values of the same type (all numeric or all character). In data science, it can correspond to a variable. The <- symbolize the fact that you assign values to an object(variable).

# c for concatenate values
vector.a <- c(1,14,3)
vector.a

[1]  1 14  3

# using sequence
vector.b <- c(1:3)
vector.b

[1] 1 2 3

# giving names(adress)
vector.c <- c("x" = 11.0, 
              "y" = 23.4, "z" = 53.0)
vector.c

   x    y    z 
11.0 23.4 53.0

With the concept of vector we can introduce the concept of indexing: the position/address of the value in the vector.

# value if vector.b at position 1
vector.c[1]

 x 
11

# value if vector.b at position 1 and 3
vector.c[c(1,3)]

 x  z 
11 53

# value if vector.b at address "y"
vector.c["y"]

   y 
23.4

Stata

matrix vector.a=(1.0, 2.0, 3.0)

2.2.2 Matrix

R

A matrix is a 2 dimensional vector with rows and columns.

matrix.b <- matrix(1:6, nrow=3)
matrix.b

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Similar to vector, one can access the values by indexation. Think Battle ship game !

#value at the third row and second column
matrix.b[ 3, 2]

[1] 6

Stata

In Stata, a matrice is created as follow:

matrix matrix.b=(1, 2, 3  \ /// 
                   4, 5, 6)

2.2.3 Data (set)

R

A data.frame is a rectangular data set with rows and columns of possibly different types.

dataset <- data.frame(vector.a, 
                      matrix.b, 
                      "code"=letters[1:3])
dataset

  vector.a X1 X2 code
1        1  1  4    a
2       14  2  5    b
3        3  3  6    c

The majority of the data set you will be manipulating will be of class data.frame or inherit from that class (e.g. tibble, data.table).

For accessing the variables, you can use indexation as for matrix or the name of the variable:

# Before the comma, you select individuals 
dataset[1, ]

  vector.a X1 X2 code
1        1  1  4    a

# After the comma, you select the variables
dataset[, 4]

[1] "a" "b" "c"

dataset[, "code"]

[1] "a" "b" "c"

# Combine both, you select individuals and variables
dataset[1:2, "code"]

[1] "a" "b"

Stata

2.2.4 List

R

A list is a primitive and complex element which can coerce different data type (format and length). It is the object return by many statistical function.

t1 <- t.test(x=rnorm(1:10,50),
             y=rnorm(2:11,50))
str(t1)

List of 10
 $ statistic  : Named num -2.18
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 15.3
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.0453
 $ conf.int   : num [1:2] -1.4317 -0.0171
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] 49.2 49.9
  ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ stderr     : num 0.332
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "rnorm(1:10, 50) and rnorm(2:11, 50)"
 - attr(*, "class")= chr "htest"