Chapter 1 Programming environment
1.1 Installing R and Stata
R is a free statistical software and programming language, available on CRAN. RStudio is an enhanced graphical user interface (GUI) for R, with some commercial options. To install both, see the R-bloggers tutorial.
Stata is commercial software that requires a licence; installation instructions are given upon purchase.
1.2 R packages
R packages are collections of R functions, compiled code and sample data. By default, R installs a set of packages during installation. More packages are available and can be added later, when they are needed for a specific purpose. When we start the R console, only the default packages are available. Other packages (which are already installed) have to be loaded explicitly before you can use them in your current session.
There are two ways to add new R packages. One is installing directly from the CRAN repository; the other is downloading the package to your local system and installing it manually.
To install an R package from CRAN, such as tidyverse, use:
install.packages("tidyverse")
To load a package that is already installed but not available by default in the current session, use:
library(tidyverse)
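Before calling install.packages() it can be handy to check whether a package is already installed. The sketch below is one possible way to do this with base R; the helper name is_installed is hypothetical (it is not part of R or of this lab), and the archive path in the last comment is a placeholder for a file you would have downloaded yourself.

```r
# Hypothetical helper: TRUE if the package `pkg` is installed, FALSE otherwise.
# requireNamespace() tries to load the package namespace without attaching it.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with every R installation
has_stats <- is_installed("stats")
has_stats

# Install from CRAN only when the package is missing (not run here):
# if (!is_installed("tidyverse")) install.packages("tidyverse")

# Manual installation from a downloaded source archive (placeholder path):
# install.packages("path/to/package.tar.gz", repos = NULL, type = "source")
```

The two install.packages() calls are left as comments because they need an internet connection or a local file.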
1.3 Working directory
R
To specify your working directory (the location of the files to use and save), use the following command:
setwd("C:/Rlab")
To check which folder R is currently working in, call the function getwd().
Next, use the drop-down menu (File > New File > R Script…) to create a new script, then save it from the same menu as a .R file named firstscript.R.
In R you can also work with an RProject, which helps you create and customize your working environment per project. See the RStudio website for details.
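As a small illustration of setwd() and getwd() together, the sketch below switches to a folder and back. It assumes a throwaway folder created under tempdir() (the folder name Rlab_demo is made up for this example) so that it runs on any computer regardless of whether C:/Rlab exists.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Create a temporary folder for the demonstration (hypothetical name)
lab_dir <- file.path(tempdir(), "Rlab_demo")
dir.create(lab_dir, showWarnings = FALSE)

# Change the working directory, then check where R is now
setwd(lab_dir)
getwd()

# Go back to the original working directory
setwd(old_wd)
```

Restoring the previous directory at the end avoids surprises when later commands look for files by relative path.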
Stata
To specify your working directory, use the following command:
cd "C:/statalab"
Note that the path (“C:/…”) will vary based on the computer being used.
To check which folder Stata is currently working in, use the pwd command.
Next, use the drop-down menu (Stata > File > Do…) to create a “.do” file and save it using the drop-down menu as firstscript.do.
1.4 Log files and history
You can create a .log file to save your output and command history. A log keeps all the results of your analyses in a file that can then be opened in a text editor. This is standard functionality in Stata but is not enabled by default in R.
R
In R, a history file saves all the commands run in the console, but to save the output from the console you need to sink it into a file, as demonstrated below.
# example with the iris dataset
# (available by default in R)
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
# opening log file
sink(file = "lm_output.log")
# print in log
fit
sink() # closing log file
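If you only want to log the output of a few commands, capture.output() from base R is a possible alternative to an open sink: it writes the printed result of an expression to a file and closes the file by itself. The sketch below is not part of the original lab; it writes to tempdir() (an assumption made so the example leaves no file in your working directory) and reads the log back to check its content.

```r
# Same model as above, fitted on the built-in iris dataset
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Log file in a temporary folder (illustrative location)
log_file <- file.path(tempdir(), "lm_output.log")

# Write the printed model summary to the log file
capture.output(summary(fit), file = log_file)

# Read the log back to verify what was written
log_lines <- readLines(log_file)
head(log_lines)
```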
You can also use dedicated packages such as logger [URL], or start generating reports as you do your analysis with the rmarkdown package.
Stata
To create or open a saved log file in Stata, run:
log using "analysis_firstscript.log", replace
The replace option specifies that an existing file with the same name will be overwritten. Without it, if the file already exists, an error message is issued and logging does not start.
Always close the log file when the lab is finished, using the command log close, so that logging stops.
1.5 Importing Data
R and Stata can import data sets in text formats (txt, csv…).
R
R can read (load) an arbitrary number of data sets at once, so each must be assigned a name.
In the example below, R loads the file from your current working directory. If the file is stored elsewhere on your computer, you need to specify the full path (see the ?read.csv help page).
# import data from csv file,
# with column names (header),
# columns separator as comma, and decimals as "."
sleep <- read.csv("SleepApnea.csv",
                  header = TRUE, sep = ",",
                  dec = ".")
Note:
Given a Stata data set (.dta file), we can read it into R.
The read_dta() function in the haven package, part of the tidyverse family of data science packages, is particularly useful for loading Stata files because it preserves Stata labels.
sleep <- haven::read_dta("SleepApnea.dta")
R can also read many other proprietary file formats (Excel, SAS, SPSS, …).
1.6 Saving and loading R and Stata files
R
You can save R objects (data.frame, list, …) to a file for future use.
save(sleep, file="sleepApnea.Rdata")
When you want to import R objects stored in an R data file, use the load function.
load("sleepApnea.Rdata")
Stata
To save an unnamed dataset (or an old dataset under a new name):
1. select File > Save As…; or
2. type save filename in the Command window.
save SleepApnea.dta
To save a dataset that has been changed (overwriting the original data file),
1. select File > Save;
2. click on the Save button; or
3. type save, replace in the Command window
save SleepApnea.dta, replace
To load a dataset from Stata’s current working directory:
use SleepApnea.dta
1.7 Delete data from workspace
R
You can remove (delete) an R object from your current workspace using the rm() function (rm for remove). If you delete an object that was saved on your computer drive, it is removed only from your R session, not from your computer. If the object was created in your workspace but never saved, it will be lost.
rm(sleep)
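To confirm that an object is really gone, you can combine rm() with exists(). A minimal sketch, using two throwaway objects whose names (tmp_numbers, tmp_text) are made up for this example:

```r
# Two throwaway objects
tmp_numbers <- 1:5
tmp_text <- "hello"

# Remove one object by name
rm(tmp_numbers)
numbers_still_there <- exists("tmp_numbers")
numbers_still_there

# Several objects can be removed at once through the `list` argument
rm(list = c("tmp_text"))
text_still_there <- exists("tmp_text")
text_still_there
```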
1.8 Data quick overview
R
For both small and large datasets, the str() function is useful to see the variables and their types (str stands for structure). The output is equivalent to the elements displayed in the RStudio “Environment” panel under the name of the R object.
str(sleep)
If your data are not high-dimensional, you can get a sense of the first and last rows of your dataset with head() and tail().
View() displays the data in a spreadsheet-like table (be careful with the capital letter!).
head(sleep)
tail(sleep)
View(sleep)
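The same overview functions can be tried on the built-in iris dataset when the sleep data are not loaded. This is only a sketch; the argument n = 3 is chosen here to keep the printed output short, and dim() is added as a quick way to get the number of rows and columns.

```r
# Structure: variable names, types and first values
str(iris)

# Number of rows and columns
iris_dim <- dim(iris)
iris_dim

# First and last three rows
first_rows <- head(iris, n = 3)
last_rows <- tail(iris, n = 3)
first_rows
last_rows
```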
Stata
The describe command returns the main characteristics of the loaded data set, i.e. the number of observations, the number of variables, and the characteristics of the variables.
describe
Contains data
 Observations:           130
    Variables:            15
-------------------------------------------------------------------------------------------
Variable         Storage   Display    Value
    name            type    format    label      Variable label
-------------------------------------------------------------------------------------------
number           int       %8.0g
age              byte      %8.0g
gender           str6      %9s
height           int       %8.0g
weight           int       %8.0g
bmi              float     %9.0g
diabetes         str2      %9s
cholesterol      float     %9.0g
triglyceride     float     %9.0g                 Triglyceride
glycemia         str17     %17s
creatinine       str16     %16s
sleepapnea       str16     %16s                  SleepApnea
cardiacfreq      str3      %9s                   CardiacFreq
triglyceride_~f  byte      %8.0g                 Triglyceride_Cl_Ref
trigly_4classes  byte      %8.0g                 Trigly_4Classes
-------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
1.9 Help pages
R
R help pages follow a standard layout. They start with a description, followed by the usage of the function with its main arguments and any default values. The arguments are then explained, followed by details on the computation. The value section presents the output of the function, sometimes with extra notes and the authors of the function. Last but not least, examples are often a good way to understand a function. You can copy and paste the examples into the console to learn the use and output of a function.
?summary
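When you only need a reminder of a function's arguments and their order, you can also query them from the console instead of opening the full help page. A minimal sketch using the base functions args() and formals() on read.csv:

```r
# Print the argument list of read.csv with its default values
args(read.csv)

# Extract the argument names, in the order the function expects them
arg_names <- names(formals(read.csv))
arg_names

# Default value of a single argument, here the decimal mark
default_dec <- formals(read.csv)$dec
default_dec
```

The order returned by formals() is the order used when arguments are passed without their names.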
1.10 Function call
In R and Stata, most functions have required arguments and optional arguments with default values. It is important to look at the help pages to understand how each function should be used.
R
In R, if the arguments are written in the order expected by the developer of the function (the default order), there is no need to specify their names. If you want to change an optional argument listed further along in the function, you need to specify its name; otherwise R matches the value to the next argument by position, which can give an error or an unintended result.
# Here the expected arguments are
# file, header, sep, quote
sleep <- read.csv("SleepApnea.csv",
                  TRUE, ",",
                  ".")
# what you really wanted was
# "." for decimal (dec) instead of quote
sleep <- read.csv("SleepApnea.csv",
                  TRUE, ",",
                  dec = ".")