Chapter 2 Data: Statistical units and Variables

The four W's, together with the types of the variables recorded, are what you should pay attention to in any data set:

  • Who
  • What
  • Where
  • When

Statistics, from the Latin status (the same root as State), is the art of counting and classifying. At first, statistics were used to describe countable information about populations. They quickly became essential to model and predict data from experiments, in order to foresee outcomes and support decision making.

2.1 Statistical units

“Statistics is all about variation.”

In Public Health, we are interested in populations. In biostatistics, we like to compare groups within a population. To that end, we need to identify our groups, which are composed of statistical units. For example, the statistical units could be:

  • patients, schoolchildren
  • health care services
  • countries

The Who is the statistical unit, the unitary element of interest (individual, hospital, country…).

2.2 Variables

The statistical units are characterized by one or more variables, which by definition vary between statistical units.

The What are variables, the recorded characteristics of the statistical units.

For example, the variables characterizing schoolchildren could be age, height, weight… (Figure 2.1). For health care establishments, the characteristics could be the legal status and the number of beds, nurses, doctors, patients… (Figure 2.2).

Figure 2.1: Individuals as statistical units

Figure 2.2: Hospitals as statistical units

2.3 Data storage

The information on each statistical unit can be stored and displayed in a data table. Typically, the Who is found in the leftmost column, with one statistical unit per row, and the What is stored in the remaining columns. Table 2.1 presents a snapshot of the HBSC data presented in the Introduction section.

Table 2.1: A table of the first 10 rows and first 8 columns of the HBSC data, France 2006.
ID    Grade.level  School.Status  Gender  Age    Weight
5617  7th grade    private        boy     13.58  50.0
4578  6th grade    public         boy     12.00  47.7
6512  6th grade    public         girl    11.42  28.7
5695  10th grade   private        girl    15.58  48.5
3906  8th grade    public         boy     13.75  48.5
6266  6th grade    public         boy     11.67  59.8
363   9th grade    public         girl    15.25  NA
1095  6th grade    public         boy     12.75  47.5
5388  8th grade    public         girl    13.50  52.0
6730  8th grade    public         girl    15.25  58.0

Try to guess what these data represent and what information is available.

Hint: Do not forget to read the title of the table.
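
As an illustration of this row-and-column layout, here is a minimal Python sketch that reads three rows copied from Table 2.1 and separates the Who from the What. It uses only the standard library. The function name `load_units` and the choice to represent the missing value NA as `None` are assumptions made for this example, not part of the HBSC data or of any particular software.

```python
import csv
import io

# Three rows copied from Table 2.1; "NA" marks a missing value.
RAW = """ID,Grade.level,School.Status,Gender,Age,Weight
5617,7th grade,private,boy,13.58,50.0
6512,6th grade,public,girl,11.42,28.7
363,9th grade,public,girl,15.25,NA
"""


def load_units(text):
    """Return one dict per statistical unit (hypothetical helper)."""
    units = []
    for row in csv.DictReader(io.StringIO(text)):
        units.append({
            "ID": int(row["ID"]),
            "Grade.level": row["Grade.level"],
            "School.Status": row["School.Status"],
            "Gender": row["Gender"],
            "Age": float(row["Age"]),
            # Assumption: a missing measurement is stored as None.
            "Weight": None if row["Weight"] == "NA" else float(row["Weight"]),
        })
    return units


units = load_units(RAW)

# The Who: one identifier per row (leftmost column).
who = [unit["ID"] for unit in units]

# The What: the remaining columns, i.e. the variables.
what = [name for name in units[0] if name != "ID"]

print("Who: ", who)
print("What:", what)
```

Each dictionary corresponds to one row of the table, so adding a statistical unit means adding a row, and adding a variable means adding a column.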

The Where and When are the context/location and time of the data collection.

For instance, in our HBSC data the When is the year 2006 and the Where is France. The scales of time and place are of great importance and need to be clearly defined. For example, the time can be a single time point or a period of several months or years. As for France, it could be metropolitan France (excluding overseas departments) or France with all its departments. We could also look at different geographical levels like the city, the county, the state…

This information has to be reported in the title of every table and plot you create from the data, along with the Who and the What. A table or a figure should be self-explanatory (self-contained).
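
One way to make this a habit is to build every title from the four W's, so that none can be forgotten. The sketch below is only an illustration: the helper `make_title`, its argument order, and the example plot ("Weight by gender") are hypothetical and not taken from the HBSC material.

```python
def make_title(what, who, where, when):
    """Compose a title from the four W's (hypothetical helper).

    Raises ValueError if any of the four elements is empty, so an
    incomplete title is caught before the table or plot is produced.
    """
    parts = {"what": what, "who": who, "where": where, "when": when}
    missing = [name for name, value in parts.items() if not str(value).strip()]
    if missing:
        raise ValueError("missing W's: " + ", ".join(missing))
    return f"{what}, {who}, {where} {when}"


# Example for an imagined plot built from the HBSC data.
title = make_title(
    what="Weight by gender",
    who="HBSC schoolchildren",
    where="France",
    when=2006,
)
print(title)
```

How precise the Where and When need to be (metropolitan France or all departments, a time point or a period) remains your decision; the helper only checks that something was provided.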

2.4 Variable types

Public Health data may come from various sources. For instance, they can be collected via interviews, surveys, or health information systems.

In qualitative sciences, interviews are often based on open questions where answers are free text. We will not discuss that case in this class.

In quantitative sciences, surveys or records from health information systems are based on short queries where short answers with a finite range of possibilities are expected. For instance, let’s say you are interested in tobacco consumption and plan a survey. You may ask the following questions with finite possibilities of answers:

Variables are of different types (Figure 2.3). When a variable can take only a limited number of values, or categories, and answers questions about how cases fall into those categories, we call it a categorical, or qualitative, variable. When a variable holds measured numerical values with units and tells us about the quantity of what is measured, we call it a quantitative variable (Sharpe, De Veaux, and Velleman 2012). The type of a variable determines the statistical method used to summarize and describe your population of interest (Chapter 3).

Figure 2.3: Different types of variables with example.

The categorical, or qualitative, variables come in two sub-types: nominal and ordinal. Nominal variables are, for instance, eye color or profession, which can have many categories. Nominal variables can also be binary, with only two categories, like smoking/non-smoking or boy/girl. Ordinal variables have an ordering between their possible categories. For instance, an ordinal variable could be a scale of spiciness: neutral, middle, hot, very hot! Note that this is subjective. You can come up with a ranking where the intervals between the categories are not of equal width.

The quantitative variables can be discrete or continuous. A quantitative discrete variable corresponds to numerical counts, like the number of kids per household. A quantitative continuous variable corresponds to measurements that can take decimal values, like the weight or height of pupils in the HBSC cohort.
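
The four types described above can be written down as a small "codebook" that records the type of each variable. The following Python sketch is one possible way to do so, using examples taken from the text. The names `VarType`, `CODEBOOK`, `SPICINESS_LEVELS`, `is_categorical` and `is_spicier` are assumptions made for this illustration.

```python
from enum import Enum


class VarType(Enum):
    NOMINAL = "categorical, nominal"
    ORDINAL = "categorical, ordinal"
    DISCRETE = "quantitative, discrete"
    CONTINUOUS = "quantitative, continuous"


# Hypothetical codebook: one example variable per type.
CODEBOOK = {
    "Gender": VarType.NOMINAL,
    "Spiciness": VarType.ORDINAL,
    "Kids.per.household": VarType.DISCRETE,
    "Weight": VarType.CONTINUOUS,
}

# An ordinal variable needs its categories listed in order.
# The order is a choice; it says nothing about the distance between levels.
SPICINESS_LEVELS = ["neutral", "middle", "hot", "very hot"]


def is_categorical(var_type):
    """True for nominal and ordinal variables."""
    return var_type in (VarType.NOMINAL, VarType.ORDINAL)


def is_spicier(a, b):
    """Compare two spiciness levels using their rank in the scale."""
    return SPICINESS_LEVELS.index(a) > SPICINESS_LEVELS.index(b)


for name, var_type in CODEBOOK.items():
    family = "categorical" if is_categorical(var_type) else "quantitative"
    print(f"{name}: {var_type.value} ({family})")
```

Note that only the rank of an ordinal category is used in the comparison: "very hot" ranks above "hot", but nothing is claimed about how much spicier it is.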

Propose a set of variables, one of each type.

References

Sharpe, Norean Radke, Richard D De Veaux, and Paul F Velleman. 2012. Business Statistics. Boston: Pearson Education.