Chapter 2 Data type and data format
2.1 Data type
When importing data, statistical software can mistake a numeric or categorical variable for a string variable. The problem may not be visible at first, but commands run on that variable will return errors. Verify your data types before any data analysis.
Using the sleep apnea dataset, you can practice verifying the data type of each variable. As mentioned previously, you can get an overview with the function str() in R and with describe in Stata. You can also check each variable with a quick function call.
2.1.1 Numeric
R
In R, the is.numeric() function returns a logical value indicating whether or not the variable is numeric.
is.numeric(sleep$age)
[1] TRUE
## if it is not the proper type
sleep$age <- as.numeric(sleep$age)
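Conversion with as.numeric() only succeeds for values that look like numbers. The sketch below uses a small made-up character vector (age_chr is a hypothetical name, not a variable of the sleep data) to illustrate that entries which cannot be parsed become NA, so it can be worth counting missing values after converting.

```r
# Hypothetical character vector standing in for a badly imported age variable
age_chr <- c("34", "51", "unknown", "47")

is.numeric(age_chr)    # FALSE: the values are stored as character

# Entries that cannot be parsed become NA (R also issues a warning,
# silenced here only to keep the sketch quiet)
age_num <- suppressWarnings(as.numeric(age_chr))

is.numeric(age_num)    # TRUE
sum(is.na(age_num))    # number of values lost in the conversion
```

If the count of NA values is larger than expected, inspect the original strings before overwriting the variable.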
Stata
In Stata, if age is stored as a string but contains only numbers, you can use the destring command. If you want to replace the existing variable, the command is simply
destring age, replace
This will replace the existing specified variable with the same data but now in a nonstring format.
If you prefer to retain the existing variable, you can generate a new variable that is a nonstring version of the existing variable.
generate age_num = real(age)
2.1.2 Logical
R
In R, the is.logical() function returns a logical value indicating whether or not the variable is logical. For example, have a look at the binary variable diabetes, coded 0 and 1 for absence or presence.
is.logical(sleep$diabetes)
If needed, you can convert the variable to a logical using the as.logical() function.
str(sleep$diabetes)
db_logic <- as.logical(sleep$diabetes)
str(db_logic)
A logical vector can then be useful to select/replace values from other variables with the indexing method that we will see below.
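As a minimal sketch of that idea, the example below converts a 0/1 vector to logical and uses it to select and replace values in another vector. The vectors diabetes, age, and age_flag are toy objects with made-up values, not columns of the sleep data.

```r
# Toy vectors (hypothetical values, not taken from the sleep data)
diabetes <- c(0, 1, 0, 1, 0)
age      <- c(45, 62, 38, 57, 49)

# 0 becomes FALSE, any non-zero number becomes TRUE
has_diabetes <- as.logical(diabetes)
is.logical(has_diabetes)

# Select the ages of the individuals with diabetes
age[has_diabetes]

# Replace values where the condition holds
age_flag <- rep("no", length(age))
age_flag[has_diabetes] <- "yes"
age_flag
```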
Stata
In Stata, logical values are represented numerically, with 0 for FALSE and 1 for TRUE. As in R, they can be generated by conditional/logical expressions (see the Create new variables section).
2.1.3 Date
Statistical software is powerful at manipulating dates, which is often useful in epidemiology. However, dates are not always easy to handle because formats and reference dates (origins) differ.
R
Dates and time variables are best read as strings. The default formats in R follow the rules of the ISO 8601 international standard, which expresses a day as “2001-02-03”.
Below is a basic operation (subtraction) between 2 dates.
x <- as.Date("2022-06-21",
             format="%Y-%m-%d")
y <- as.Date("2022-06-30",
             format="%Y-%m-%d")
y - x
Time difference of 9 days
In R, dates are represented as the number of days since 1970-01-01, with negative values for earlier dates. If you import dates as numeric values from other software with a different origin, you will have to specify that origin to convert the numeric values into the correct dates. See the as.Date() help page for more details.
The package lubridate can be particularly helpful for manipulating dates.
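The role of the origin can be sketched with base R alone. The example below assumes, purely for illustration, that another program exported dates as day counts starting at 1960-01-01; the vector days_1960 and its values are made up.

```r
# Numeric dates as another program might export them.
# Assumption for this sketch: day counts with origin 1960-01-01.
days_1960 <- c(0, 365, -2902)

# Tell R which origin the numbers refer to
d <- as.Date(days_1960, origin = "1960-01-01")
d

# R's own internal representation counts days since 1970-01-01
as.numeric(as.Date("1970-01-01"))

# Extract the year of each date
format(d, "%Y")
```

Without the origin argument the same numbers would be interpreted relative to 1970-01-01 and every date would be shifted by ten years.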
Stata
Dates and time variables are best read as strings. The numeric encoding that Stata uses is centered on 01jan1960: daily dates are stored as the number of days since 01jan1960, and date-time values as the number of milliseconds since 01jan1960 00:00:00.000. The example below shows how to generate a variable named birthday which is a date.
gen birthday = date("21 Jan 1952", "DMY")
list
format birthday %td
list
2.1.4 Character
R
In our sleep data set for instance, the gender variable was imported as character.
is.character(sleep$gender)
[1] TRUE
However, gender is better treated as a categorical variable with levels and labels. You will see below how to convert a character string to a categorical variable (factor).
Stata
As seen above, you can convert a string to numeric using destring or generate, and vice versa. For instance, if gender is imported as numeric, you can convert it into a string using:
generate gender_cat=string(gender)
Converting a string to numeric as shown earlier only works if all of the values are numbers. Sometimes they are not. When your string variable holds true strings (e.g., “female” instead of “1”), you have to tell Stata to encode the string data with encode [varname], generate([newvar]). Running this command makes Stata create a new numeric categorical variable whose value labels correspond to the old string values (see below for the factor type).
In Stata, the format command is especially useful for displaying data neatly in tables of results.
format gender %6s
2.1.5 Factor
As written above, you often want to convert character or string data into categorical variables.
R
In R, categorical variables are factors. Below are examples of how to convert to or create a factor.
table(sleep$diabetes, sleep$gender)
    female male
  0     17   98
  1      2    7
sleep$gender <- as.factor(sleep$gender)
sleep$diabetes <- factor(sleep$diabetes,
                         levels = c(0, 1),
                         labels = c("No","Yes"))
table(sleep$diabetes, sleep$gender)
      female male
  No      17   98
  Yes      2    7
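To see what a factor stores, the sketch below builds one from a toy 0/1 vector (diab and diab_f are hypothetical names with made-up values) and inspects its labels and underlying integer codes.

```r
# Toy 0/1 variable (hypothetical values)
diab <- c(0, 1, 0, 0, 1)

diab_f <- factor(diab, levels = c(0, 1), labels = c("No", "Yes"))

levels(diab_f)       # the labels, in the order of the levels
table(diab_f)        # counts per category
as.integer(diab_f)   # underlying integer codes: 1 for "No", 2 for "Yes"

# Going back to the original 0/1 coding
as.integer(diab_f) - 1
```

Note that the integer codes start at 1, so they are not the original 0/1 values.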
Stata
In Stata, similar conversions are sometimes needed.
tab diabetes gender
           |        gender
  diabetes |    female       male |     Total
-----------+----------------------+----------
         0 |        17         98 |       115
         1 |         2          7 |         9
        NA |         0          6 |         6
-----------+----------------------+----------
     Total |        19        111 |       130
Here diabetes is a string variable with the values 1, 0, and NA. You can replace the variable diabetes and handle it as categorical using the destring command with the force option, which converts nonnumeric values (here NA) to missing.
destring diabetes, replace force
label define yesno_lbl 1 "yes" 0 "no"
// assign value label to variables
label values diabetes yesno_lbl
tab diabetes gender
diabetes: contains nonnumeric characters; replaced as byte
(6 missing values generated)
           |        gender
  diabetes |    female       male |     Total
-----------+----------------------+----------
        no |        17         98 |       115
       yes |         2          7 |         9
-----------+----------------------+----------
     Total |        19        105 |       124
2.2 Data format
R
A vector is the simplest R object.
A vector combines values of the same type (all numeric or all character). In data science, it can correspond to a variable. The <- symbol assigns values to an object (variable).
# c for concatenate values
vector.a <- c(1,14,3)
vector.a
[1]  1 14  3
# using sequence
vector.b <- c(1:3)
vector.b
[1] 1 2 3
# giving names (addresses)
vector.c <- c("x" = 11.0,
              "y" = 23.4, "z" = 53.0)
vector.c
   x    y    z
11.0 23.4 53.0
With the concept of vector we can introduce the concept of indexing: the position/address of the value in the vector.
# value of vector.c at position 1
vector.c[1]
 x
11
# values of vector.c at positions 1 and 3
vector.c[c(1,3)]
 x  z
11 53
# value of vector.c at address "y"
vector.c["y"]
   y
23.4
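Indexing is not limited to positions and names. As a short sketch using the same vector.c as above, a logical condition or a negative position can also go inside the brackets, and an indexed value can be replaced by assignment (big and rest are names introduced here for illustration).

```r
vector.c <- c("x" = 11.0, "y" = 23.4, "z" = 53.0)

# Logical indexing: keep the values greater than 20
big <- vector.c[vector.c > 20]
big

# Negative indexing: drop the first value
rest <- vector.c[-1]
rest

# Replace a value through its address
vector.c["y"] <- 0
vector.c
```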
R
A data.frame is a rectangular data set with rows and columns of possibly different types.
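# matrix.b is not defined in this section; a 3 x 2 matrix such as the one
# below is assumed here because it reproduces the output shown
matrix.b <- matrix(1:6, ncol = 2)
dataset <- data.frame(vector.a,
                      matrix.b, "code"=letters[1:3])
dataset
  vector.a X1 X2 code
1        1  1  4    a
2       14  2  5    b
3        3  3  6    c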
Most of the data sets you will manipulate will be of class data.frame or inherit from that class (e.g. tibble, data.table).
To access the variables, you can use indexing as for a matrix, or the name of the variable:
# Before the comma, you select individuals
dataset[1, ]
  vector.a X1 X2 code
1        1  1  4    a
# After the comma, you select the variables
dataset[, 4]
[1] "a" "b" "c"
dataset[, "code"]
[1] "a" "b" "c"
# Combine both, you select individuals and variables
dataset[1:2, "code"]
[1] "a" "b"
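Two other common ways of working with a data.frame are the $ operator and a logical condition on the rows. The sketch below rebuilds a smaller data.frame so that it runs on its own; big_rows and double.a are names introduced only for this illustration.

```r
# Rebuild a small data.frame similar to the one above
dataset <- data.frame(vector.a = c(1, 14, 3),
                      code = letters[1:3])

# The $ operator returns one variable as a vector
dataset$code

# A logical condition before the comma selects the matching rows
big_rows <- dataset[dataset$vector.a > 2, ]
big_rows

# Create a new variable by assignment
dataset$double.a <- dataset$vector.a * 2
dataset
```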
R
A list is a flexible object that can hold elements of different types and lengths. It is the object returned by many statistical functions.
t1 <- t.test(x=rnorm(1:10,50),
             y=rnorm(2:11,50))
str(t1)
List of 10
 $ statistic  : Named num -2.18
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 15.3
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.0453
 $ conf.int   : num [1:2] -1.4317 -0.0171
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] 49.2 49.9
  ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ stderr     : num 0.332
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "rnorm(1:10, 50) and rnorm(2:11, 50)"
 - attr(*, "class")= chr "htest"
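Once a result is stored as a list, its elements can be extracted by name. The sketch below repeats a t-test on simulated data (the seed is arbitrary and only makes the sketch reproducible) and also builds a small list by hand; my_list and its contents are hypothetical.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
t1 <- t.test(x = rnorm(10, 50), y = rnorm(10, 50))

# Elements of a list are accessed by name with $ or [[ ]]
t1$p.value
t1[["conf.int"]]

# names() lists the available elements
names(t1)

# A list built by hand can mix types and lengths
my_list <- list(id = 1:3, label = "sleep study", ok = TRUE)
length(my_list)
my_list$label
```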