Chapter 1 Programming environment

1.1 Installing R and Stata

R is a free, open-source statistical software environment and programming language, available from CRAN. RStudio is an enhanced graphical user interface (GUI) to R, with some commercial options. To install both, see the R-bloggers tutorial.

Stata is commercial software that requires a licence; installation instructions are provided upon purchase.

1.2 R packages

An R package is a collection of R functions, compiled code and sample data. By default, R installs a set of packages during installation. More packages are available and can be added later, when they are needed for a specific purpose. When you start the R console, only the default packages are available. Other packages that are already installed have to be loaded explicitly before you can use them in your current session.

There are two ways to add new R packages. One is to install directly from the CRAN repository; the other is to download the package file to your local system and install it manually.

To install an R package from CRAN, such as tidyverse, use:

install.packages("tidyverse")

To load a package that is already installed but not available by default in the current session, use:

library(tidyverse)
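The two steps above (install once, load in every session) can be combined in a small helper. The following is only a sketch: load_or_install() is a hypothetical name, not a function provided by R, and the package file name in the last comment is also made up for illustration.

```r
# Hypothetical helper (not part of base R): install a package only if it is
# missing, then attach it to the current session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # downloads from CRAN
  }
  library(pkg, character.only = TRUE)   # pkg is a character string here
}

# "stats" ships with R, so nothing is downloaded in this call
load_or_install("stats")

# For a CRAN package you would call, for example:
# load_or_install("tidyverse")

# Manual installation from a downloaded source file (hypothetical file name):
# install.packages("mypackage_0.1.0.tar.gz", repos = NULL, type = "source")
```

Calling search() afterwards lists the packages currently attached to the session.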

1.3 Working directory

R

To specify your working directory (the location of the files to use and save), use the following command:

setwd("C:/Rlab")

To check which folder R is currently working in, call the function getwd().
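As a minimal sketch of how setwd() and getwd() work together, the example below uses a throwaway folder inside R's temporary directory rather than "C:/Rlab", so that it runs on any computer. The folder name and the variables old_wd, lab_dir and current_wd are illustrative assumptions.

```r
# Remember where we started so we can come back later
old_wd <- getwd()

# Build a path in a portable way; tempdir() is R's per-session temporary folder
lab_dir <- file.path(tempdir(), "Rlab")
if (!dir.exists(lab_dir)) {
  dir.create(lab_dir)
}

# Move into the folder and confirm the change
setwd(lab_dir)
current_wd <- getwd()
basename(current_wd)

# Go back to the original working directory
setwd(old_wd)
```

Using file.path() instead of pasting strings avoids having to worry about the path separator of the operating system.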

Next, use the drop-down menu (File > New File > R Script…) to create a new script, and save it using the drop-down menu as a .R file named firstscript.R.

In RStudio you can also work with Projects, which help you create and customize a working environment for each project. See the RStudio website for details.

Stata

To specify your working directory, use the following command:

cd "C:/statalab"

Note that the path (“C:/…”) will vary depending on the computer being used. To check which folder Stata is currently working in, use the pwd command.

Next, use the drop-down menu (Stata > File > Do…) to create a “.do” file, and save it using the drop-down menu as firstscript.do.

1.4 Log files and history

You can create a .log file to save your output and command history. A log records all the results of your analyses in a file that can then be opened in a text editor. This is standard functionality in Stata but is not enabled by default in R.

R

R keeps a history file that saves all the commands run in the console, but to save the output from the console you need to sink it to a file, as demonstrated below.

# example with the iris dataset 
# (available by default in R)
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
# opening log file
sink(file = "lm_output.log") 
print(fit)    # print in log (explicit print() is needed if the script is run with source())
sink() # closing log file
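A slightly extended sketch of the same idea is given here, under a few assumptions: the log is written to R's temporary folder, the argument split = TRUE is used so that the output still appears in the console, and the file is read back with readLines() to check what was recorded. The names log_file and log_lines are illustrative.

```r
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Write the log to the session's temporary folder
log_file <- file.path(tempdir(), "lm_output.log")

# split = TRUE sends the output to the file AND to the console
sink(file = log_file, split = TRUE)
print(summary(fit))
sink()   # close the log file

# Read the log back in to check its content
log_lines <- readLines(log_file)
head(log_lines)
```

Note that sink() only captures printed output, not the commands themselves, which is one difference from a Stata log.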

You also have the option of using dedicated packages such as logger [URL], or of generating reports as you do your analysis with the rmarkdown package.

Stata

To create or open a saved log file in Stata, run:

log using "analysis_firstscript.log", replace

The replace option specifies that an existing file with that name will be overwritten. Without it, if the file already exists, an error message is issued and logging does not start.

Always close the log file when the lab is finished, so that Stata stops logging. Use the log close command to do this.

1.5 Importing Data

R and Stata can import data sets in text format (txt, csv…).

R

R can read (load) an arbitrary number of data sets at once, so each must be assigned a name. By default, R looks for the file in your current working directory. If the file is stored elsewhere on your computer, you need to specify the full path (see the ?read.csv help page).

# import data from a CSV file,
# with column names (header),
# comma as column separator, and "." as decimal mark
sleep <- read.csv("SleepApnea.csv",
                  header = TRUE, sep = ",",
                  dec = ".")
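If you do not have SleepApnea.csv at hand, the same call can be tried on a small file that you create yourself. The sketch below is an illustration only: the data frame toy and its values are made up and have nothing to do with the sleep apnea data.

```r
# Made-up toy data, for illustration only
toy <- data.frame(id  = 1:3,
                  age = c(45, 52, 38),
                  bmi = c(27.5, 31.2, 24.8))

# Write it to a CSV file in the session's temporary folder
csv_path <- file.path(tempdir(), "toy.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read it back with the same arguments as above
toy_in <- read.csv(csv_path,
                   header = TRUE, sep = ",",
                   dec = ".")

# Check the variables and their types
str(toy_in)
```

Because the full path is given in csv_path, this works whatever the current working directory is.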

Note: if you are given a Stata data set (.dta file), you can read it into R. The read_dta() function in the haven package, part of the tidyverse family of data science packages, is particularly useful for loading Stata files because it preserves Stata labels.

sleep <- haven::read_dta("SleepApnea.dta")

R can also read files in many other proprietary formats (Excel, SAS, SPSS, …).

Stata

Stata can only load one data set at a time, so the data are not assigned to an object. The command to import a CSV file is:

import delimited using SleepApnea
(15 vars, 130 obs)

This will load from Stata’s current working directory.

1.6 Saving and loading R and Stata files

R

You can save R objects (data.frame, list, …) to a file for future use.

save(sleep, file="sleepApnea.Rdata")

To import R objects stored in an R data file, use the load() function.

load("sleepApnea.Rdata")
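The round trip between save() and load() can be sketched on a small made-up object. The data frame toy and the file name are illustrative assumptions. One detail worth knowing is that load() restores the object under its original name and returns that name as a character vector.

```r
# Made-up toy object, for illustration only
toy <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))

# Save it to a file in the session's temporary folder
rdata_path <- file.path(tempdir(), "toy.Rdata")
save(toy, file = rdata_path)

# Remove it from the workspace, then restore it from the file
rm(toy)
loaded_names <- load(rdata_path)

loaded_names    # names of the objects that were restored
exists("toy")   # the object is back in the workspace
```

Several objects can be stored in the same file by listing them in save(), and load() brings all of them back at once.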

Stata

To save an unnamed dataset (or an old dataset under a new name):
1. select File > Save As…; or
2. type save filename in the Command window.

save SleepApnea.dta

To save a dataset that has been changed (overwriting the original data file),
1. select File > Save;
2. click on the Save button; or
3. type save, replace in the Command window

save SleepApnea.dta, replace

To load a dataset from Stata’s current working directory:

use SleepApnea.dta

1.7 Delete data from workspace

R

You can remove (delete) an R object from your current workspace using the rm() function (rm stands for remove). If you remove an object that was saved to your computer drive, it is deleted only from your R session, not from the drive. If the object was created in your workspace but never saved, it is lost.

rm(sleep)
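A minimal sketch of checking the workspace before and after a removal, using a throwaway object (tmp_obj is an illustrative name):

```r
# Create a throwaway object
tmp_obj <- 1:10

# exists() tells you whether an object with that name is available
before <- exists("tmp_obj")

# Remove it from the workspace
rm(tmp_obj)
after <- exists("tmp_obj")

before   # TRUE
after    # FALSE

# ls() lists the objects in the workspace. To remove all of them at once
# (comparable to starting from an empty workspace), you could run:
# rm(list = ls())
```

The last line is left as a comment on purpose, since it deletes every object in the workspace without asking for confirmation.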

Stata

The drop command removes variables or observations from the data set in memory. For instance, to drop the age variable:

drop age

If you want to clear the data in memory so that you start afresh, use the clear command.

clear all

1.8 Data quick overview

R

For both small and large data sets, str() is useful for seeing the variables and their types (str stands for structure). The output is equivalent to what is displayed in the RStudio “Environment” panel under the name of the R object.

str(sleep)

If your data are not high-dimensional, you can get a sense of the first and last rows of the data set with head() and tail().

View() displays the data in a spreadsheet-like table (note the capital V!).

head(sleep)
tail(sleep)
View(sleep)
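A few other base R functions give a quick overview comparable to the one above. The sketch below uses the iris data set, which ships with R, so that it runs without the sleep data. The object names iris_dim and first_rows are illustrative.

```r
# Number of rows (observations) and columns (variables)
iris_dim <- dim(iris)
iris_dim            # 150 rows, 5 columns

# Variable names
names(iris)

# head() and tail() accept the number of rows to show
first_rows <- head(iris, n = 3)
first_rows

# Summary statistics for one variable, or for the whole data set
summary(iris$Sepal.Length)
summary(iris)
```

View() is left out here because it opens a viewer window and is mainly useful in an interactive session.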

Stata

The describe command returns the main characteristics of the loaded data set, i.e. the number of observations, the number of variables, and the characteristics of the variables.

describe
Contains data
 Observations:           130                  
    Variables:            15                  
-------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------------------
number          int     %8.0g                 
age             byte    %8.0g                 
gender          str6    %9s                   
height          int     %8.0g                 
weight          int     %8.0g                 
bmi             float   %9.0g                 
diabetes        str2    %9s                   
cholesterol     float   %9.0g                 
triglyceride    float   %9.0g                 Triglyceride
glycemia        str17   %17s                  
creatinine      str16   %16s                  
sleepapnea      str16   %16s                  SleepApnea
cardiacfreq     str3    %9s                   CardiacFreq
triglyceride_~f byte    %8.0g                 Triglyceride_Cl_Ref
trigly_4classes byte    %8.0g                 Trigly_4Classes
-------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

1.9 Help pages

R

R help pages follow a standard layout: a description, the usage of the function with its main arguments and their default values, an explanation of each argument, and details on the computation. The Value section describes the output of the function, and may be followed by notes and the authors of the function. Last but not least, the examples are often a good way to understand a function: you can copy and paste them into the console to learn the use and output of the function.

?summary
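Besides the ? shortcut, a few related functions are worth knowing. In the sketch below, the calls that open help pages are wrapped in interactive(), on the assumption that you may also run the script non-interactively. The variable arg_names is illustrative.

```r
if (interactive()) {
  help("summary")               # same as ?summary
  help.search("linear model")   # search the help pages by keyword
  example("mean")               # run the examples from the help page of mean()
}

# Without opening the help page, you can list the arguments of a function
# in the order in which they are defined
arg_names <- names(formals(read.csv))
arg_names

# args() prints the arguments together with their default values
args(read.csv)
```

Knowing the order of the arguments matters as soon as you call a function without naming them.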

Stata

To access Stata’s help, you can either

  1. select Help from the menus, or
  2. use the help and search commands.

Regardless of the method you use, results will be shown in the Viewer or Results windows. Blue text indicates a hypertext link, so you can click to go to related entries.

help describe

1.10 Function call

In R and Stata, most functions have required arguments and optional arguments with default values. It is important to look at the help pages to understand how each function should be used.

R

In R, if the arguments are written in the order expected by the developer of the function (the default order), there is no need to specify their names. If you want to change an optional argument listed further along in the function definition, you need to specify its name; otherwise R matches the value by position, which may assign it to the wrong argument or produce an error.

# Here the expected arguments are, in order:
# file, header, sep, quote (so "." is matched to quote)
sleep<-read.csv("SleepApnea.csv", 
                TRUE, ",", 
                ".")
# what you really wanted was 
# "." for decimal (dec) instead of quote
sleep<-read.csv("SleepApnea.csv", 
                TRUE, ",", 
                dec=".")
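The same matching rules can be seen on a small function written for the occasion. This is only a sketch: describe_value() is a hypothetical function, not part of R, with one required argument (x) and two optional ones (digits and prefix).

```r
# Hypothetical function, for illustration only
describe_value <- function(x, digits = 2, prefix = "value:") {
  paste(prefix, round(x, digits))
}

# Only the required argument: the defaults are used
res_default <- describe_value(3.14159)

# Second argument given by position: 3 is matched to digits
res_position <- describe_value(3.14159, 3)

# Skipping digits requires naming the argument
res_named <- describe_value(3.14159, prefix = "pi:")

# Without the name, "pi:" is matched to digits and round() fails
res_wrong <- tryCatch(
  describe_value(3.14159, "pi:"),
  error = function(e) "error"
)

res_default    # "value: 3.14"
res_position   # "value: 3.142"
res_named      # "pi: 3.14"
res_wrong      # "error"
```

Here the mistake is caught because round() refuses a character value. In the read.csv() example above the misplaced "." is a valid value for quote, so R does not complain, which is why naming optional arguments is the safer habit.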

Stata

In Stata the ordering is also important. Help pages need to be consulted for details.