Point Estimates and CIs

rm(list=ls(all=TRUE)) 
cat("\014")

The source qmd file can be found here

Download the data

If you haven’t done so already, create a folder named “POLI502” somewhere on your computer (could be on your desktop, documents folder, etc.)
Download all the files you need for today from Blackboard and save them in the folder (world.csv).
Load packages we use

library(ggplot2)
library(magrittr)
library(tidyverse) # manipulate data

1. Constructing a 95% confidence interval

Whenever we report the results of our statistical analysis, we should report our uncertainty estimates. That is, we report the amount of error that we think exists in our inference. Put differently, we report how much confidence we can have about our estimate.

As we learned in the lecture, we do this in a way that is probably not very intuitive to you. We do NOT say things like, “we are 30% confident about this estimate”, or “we are 95% confident about that estimate”. Instead, what we do is the following.

For a given level of confidence (conventionally, it is 95%), we construct and report an interval of values that we think contains the true & unknown population parameter. This interval is called a confidence interval. If we construct this interval for a 95% confidence level, we call it a 95% confidence interval. (You might wonder why 95%. As I said, there is no good reason. We use 95% just because it is the convention.)

So, in reporting our estimate of the mean of a variable, we report the sample mean (point estimate) as well as the associated 95% confidence interval of the mean (interval estimate).

To construct a 95% confidence interval (CI), we need two things:

We first need the point estimate. When estimating the population mean, the sample mean is our point estimate (our “best” guess). The point estimate will be used as the center of the confidence interval.
Second, we need a standard error (SE) of the estimate. With these two, we can compute the bounds of the interval:

The lower bound of the 95% CI is: point estimate - 2 * SE
The upper bound of the 95% CI is: point estimate + 2 * SE (We sometimes use 1.96 instead of 2 to be more precise.)

As we discussed in the lecture, if we were to repeat the sampling process many, many times and construct an interval this way each time, about 95% of those intervals would contain the true population parameter. This is how we interpret confidence intervals, and this is how we report our uncertainty.
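The repeated-sampling interpretation can be illustrated with a small simulation. This is only a sketch under assumed values: the population mean, the population SD, the sample size, and the number of repetitions (`true_mean`, `true_sd`, `n_sim`, `n_reps`) are made up for illustration and do not come from the world data set.

```r
# Simulation sketch: how often does "mean +/- 2 * SE" contain the true mean?
# All numbers below are illustrative assumptions, not estimates from world.csv.
set.seed(502)
true_mean <- 17    # assumed population mean
true_sd   <- 11    # assumed population standard deviation
n_sim     <- 180   # assumed sample size for each simulated sample
n_reps    <- 2000  # number of repeated samples

covered <- replicate(n_reps, {
  s    <- rnorm(n_sim, mean = true_mean, sd = true_sd)  # draw one sample
  pe_s <- mean(s)                                       # point estimate
  se_s <- stats::sd(s) / sqrt(n_sim)                    # standard error
  # TRUE if this sample's interval contains the true mean
  (pe_s - 2 * se_s <= true_mean) && (true_mean <= pe_s + 2 * se_s)
})

coverage <- mean(covered)  # share of intervals that contain the true mean
coverage
```

The resulting share should be close to 0.95, although the exact value depends on the random seed.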

Let’s try constructing a 95% confidence interval of the mean for a numerical variable in the “world” data set.

Load the world data set

world.data <- read.csv("world.csv")

We will analyze the women09 variable, which measures the percentage of female members in the lower house of parliament in 2009.

Numerical and graphical summaries

Numerical summaries

summary(world.data $ women09)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.70   15.55   17.18   22.95   56.30      11

world.data $ women09 %>% class

## [1] "numeric"

is.na(world.data$women09)

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

world.data $ women09 %>% na.omit %>% mean

## [1] 17.17722

sd(world.data $ women09, na.rm = TRUE)

## [1] 11.05299

Histogram:

g <- ggplot(world.data, aes(x = women09))
g <- g + geom_histogram()
g <- g + theme(axis.text.x = element_text(size = 12))
g <- g + xlab("Percent women in congress")
g <- g + ylab("Number of countries")
g

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Same plot, written as one chained expression (easier to read in your code)
g <- ggplot(world.data, aes(x = women09)) + 
  geom_histogram() + #geom_histogram(binwidth = 1)
  theme(axis.text.x = element_text(size = 12)) +
  xlab("Percent women in congress") + ylab("Number of countries") 
g

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_bin()`).

Density plot (smoothed histogram)

g <- ggplot(world.data, aes(x = women09))
g <- g + geom_density()
g <- g + theme(axis.text.x = element_text(size = 12))
g <- g + xlab("Percent women in congress")
g <- g + ylab("Density") # the y-axis of a density plot is a density, not a count
g

## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_density()`).

Constructing a 95% confidence interval

First, we need the point estimate (center of the 95% CI)

mean(world.data $ women09, na.rm = TRUE)

## [1] 17.17722

Let’s store this into an object, so we can use it later.

pe <- mean(world.data $ women09, na.rm = TRUE)

Second, we need to obtain the standard error (SE). Recall that the SE for mean is obtained by SD divided by sqrt(n), so we need to obtain SD (standard deviation) and n (sample size).

SD: we already did this above. Let’s store it into an object.

sd <- sd(world.data $ women09, na.rm = TRUE)

n: the sample size is the number of observations included in this sample. We can use the length function to find out, but be very careful: we have missing values.

So, if we do this

length(world.data $ women09)

## [1] 191

R tells us that there are 191 values. However, we don’t actually have 191 values because some of them are NAs. As we saw above,

summary(world.data $ women09)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.70   15.55   17.18   22.95   56.30      11

There are 11 NA’s in this variable, so we actually have 191-11 = 180 observations. One way to obtain the correct sample size (n = 180) is to combine the length function with the is.na function.

The is.na function returns a logical value, TRUE or FALSE. For example,

x <- c(4, NA, 1, -2, NA, 2)

is.na(x)

## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE

R says that the second and the fifth elements of this x vector are TRUE, meaning that they are NAs. Combining this is.na function and the square brackets [ ], we can do things like the following:

x[ is.na(x) == FALSE ]

## [1]  4  1 -2  2

Remember that square brackets [ ] let us extract a subset of a vector or a matrix that satisfies some conditions. In the example above, we extracted a subset of x that satisfies the condition is.na(x)==FALSE. (Note that we need two equal signs because we are specifying a condition.) Alternatively, we can write the following to get the same outcome.

x[ is.na(x) != TRUE ]

## [1]  4  1 -2  2

We can see that we extracted the non-missing values of the x vector. Then, we can find the number of non-missing elements in x, as follows:

length( x[ is.na(x) == FALSE ] )

## [1] 4

Or you can simply use the “!” (not) operator instead.

x[!is.na(x)]

## [1]  4  1 -2  2

x[!is.na(x)] %>% length

## [1] 4

which is different from

length(x)

## [1] 6
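As an alternative sketch (not used in the rest of this tutorial), the non-missing values can also be counted directly. Because R treats TRUE as 1 and FALSE as 0 in arithmetic, summing the logical vector gives the count. The name `n_valid` is just an illustrative choice.

```r
x <- c(4, NA, 1, -2, NA, 2)

# !is.na(x) is TRUE for each non-missing element; sum() counts the TRUEs
n_valid <- sum(!is.na(x))
n_valid   # 4
```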

Going back to the women09 variable, the sample size can be obtained by:

length( world.data $ women09[is.na(world.data $ women09) == FALSE] )

## [1] 180

length( world.data $ women09[!is.na(world.data $ women09)] )

## [1] 180

n <- length( world.data $ women09[is.na(world.data $ women09) == FALSE] )

Let’s store this into an object:

n <- world.data $ women09[is.na(world.data $ women09) == FALSE] %>% length()

# Then, the standard error (SE) is SD/sqrt(n), so

sd/sqrt(n)

## [1] 0.8238416

# we will store this into an object

se <- sd/sqrt(n)

To summarize, we need the following:

# 1: Point estimate
pe <- mean(world.data $ women09, na.rm = TRUE)

# 2: Standard error
sd <- sd(world.data $ women09, na.rm = TRUE)
n <- length( world.data $ women09[is.na(world.data $ women09) == FALSE] )
se <- sd/sqrt(n)


# Finally, we construct a 95% confidence interval by identifying the 
# lower and upper bounds

pe - 2 * se   # Lower bound

## [1] 15.52954

pe + 2 * se   # Upper bound

## [1] 18.82491

So, we report the results verbally, as follows: The mean of women09 is 17.18% with a 95% confidence interval of [15.53, 18.82]. Or we can report the bounds directly from R using inline code: [15.529539, 18.8249054].
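The steps above can be wrapped into a small helper. This is a sketch only: `ci_mean` and its `multiplier` argument are hypothetical names introduced here, and the `toy` vector contains made-up values rather than data from world.csv. It also shows the more precise multipliers: `qnorm(0.975)` is roughly 1.96, and using `qt()` with n - 1 degrees of freedom reproduces the interval that base R's `t.test()` reports.

```r
# Hypothetical helper (not part of the original lab): CI for a mean
ci_mean <- function(x, multiplier = 2) {
  x_obs  <- x[!is.na(x)]                     # drop missing values
  n_obs  <- length(x_obs)                    # sample size after dropping NAs
  est    <- mean(x_obs)                      # point estimate
  se_est <- stats::sd(x_obs) / sqrt(n_obs)   # standard error of the mean
  c(estimate = est,
    lower    = est - multiplier * se_est,
    upper    = est + multiplier * se_est)
}

# Made-up values for illustration only
toy <- c(12.5, 30.1, NA, 8.0, 22.4, 17.9, NA, 41.3, 5.6, 19.2)

ci_mean(toy)                                  # rule of thumb: +/- 2 * SE
ci_mean(toy, multiplier = qnorm(0.975))       # normal-based, about 1.96
ci_t <- ci_mean(toy, multiplier = qt(0.975, df = sum(!is.na(toy)) - 1))
ci_t                                          # t-based interval
t.test(toy)$conf.int                          # should match the t-based bounds
```

The helper calls `stats::sd` explicitly because this tutorial stores a number in an object that is also named `sd`.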

2. Create a subset of a data set

Sometimes it makes sense to create a smaller data set that contains a subset of observations. For example, we have seen above that missing values can make our calculation unnecessarily complicated. We can avoid this by working with a smaller data set that excludes observations with missing values.

We have learned in the past weeks how to create a subset of a data set using square brackets. For example,

world.data %>% dim

## [1] 191  62

world.data[ 2:5, 1:4 ]

##   country       colony confidence decentralization
## 2 Albania Soviet Union   49.33593             0.74
## 3 Algeria       France   52.05573               NA
## 4 Andorra        Spain         NA               NA
## 5  Angola     Portugal         NA               NA

The code above tells R to extract rows 2 through 5 (2:5) and columns 1 through 4 (1:4) of the world.data.

The numbers BEFORE the comma specify what rows you want to extract and the numbers AFTER the comma specify what columns you want to extract.
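The row-before-comma, column-after-comma rule can be tried on a tiny data frame. The `toy` data frame below is made up for illustration. The last line selects columns by name rather than by position, which is one possible way to keep code working if the column order changes.

```r
# Made-up data frame for illustration only
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  region  = c("North", "South", "North", "East"),
  score   = c(1.5, 2.5, 3.5, 4.5)
)

sub_rc   <- toy[2:3, 1:2]                 # rows 2-3, columns 1-2
sub_rows <- toy[2:3, ]                    # rows 2-3, ALL columns
sub_cols <- toy[, c("country", "score")]  # ALL rows, columns chosen by name

dim(sub_rc)
dim(sub_rows)
names(sub_cols)
```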

We can also do things like

world.data[ 1:2,  ] %>% head

##       country       colony confidence decentralization dem_other dem_other5
## 1 Afghanistan           UK         NA               NA      10.5        10%
## 2     Albania Soviet Union   49.33593             0.74      63.0 Approx 60%
##   democ_regime district_size3 durable effectiveness      enpp_3         eu
## 1           No  single member       4      13.71158             Not member
## 2          Yes                      3      35.46099 1-3 parties Not member
##   fhrate04_rev fhrate08_rev frac_eth frac_eth3 free_business free_corrupt
## 1          2.5            3   0.7693      High            NA           NA
## 2            5            8   0.2204       Low            68           34
##   free_finance free_fiscal free_govspend free_invest free_labor free_monetary
## 1           NA          NA            NA          NA         NA            NA
## 2           70        92.6          74.2          70       52.1          78.7
##   free_overall free_property free_trade gdp08 gdp_10_thou gdp_cap2 gdp_cap3
## 1           NA            NA         NA  30.6          NA                  
## 2           66            35       85.8  24.3      0.1535      Low   Middle
##   gdppcap08 gender_equal3 gini04 gini08  hi_gdp indy       oecd  old2006
## 1        NA                   NA     NA         1919 Not member       NA
## 2      7715                 28.2   31.1 Low GDP 1991 Not member 8.479821
##    old2003     pmat12_3   pop03 pop08    pop08_3          popcat3 pr_sys
## 1       NA                   NA  27.4 >=16.8 mil Moderate (1-29m)     No
## 2 7.278363 Low post-mat 3169064   3.1  <=4.3 mil Moderate (1-29m)     No
##   protact3        regime_type3      region sources typerel unions urban03
## 1                 Dictatorship Middle East      NA  Muslim     NA      NA
## 2 Moderate Parliamentary democ  C&E Europe      NA  Muslim     NA  44.239
##   urban06 vi_rel3 votevap00s women05 women09 womyear       womyear2  yng2003
## 1   23.28                 NA      NA    27.7      NA                      NA
## 2   46.14  20-50%      59.56     6.4    16.4    1920 1944 or before 27.34834
##    young06
## 1       NA
## 2 26.35428

# The command above extracts rows 1 through 2 for ALL columns.
world.data[ , 1:2 ] %>% head

##             country       colony
## 1       Afghanistan           UK
## 2           Albania Soviet Union
## 3           Algeria       France
## 4           Andorra        Spain
## 5            Angola     Portugal
## 6 Antigua & Barbuda           UK

# The command above extracts all rows for columns 1 and 2.

# We can provide some logical conditions instead of row/column numbers
# as follows:

world.data[ world.data $ region == "N. America" ,  1:5 ]

##           country colony confidence decentralization dem_other
## 31         Canada     UK   58.20387             2.45       100
## 112        Mexico  Spain   26.04039             2.04       100
## 181 United States     UK   61.78521             2.20       100

# Or you can use dplyr to achieve the same outcome:
world.data %>% 
  filter(region == "N. America")%>%
  select(1:5)

##         country colony confidence decentralization dem_other
## 1        Canada     UK   58.20387             2.45       100
## 2        Mexico  Spain   26.04039             2.04       100
## 3 United States     UK   61.78521             2.20       100

# This extracts all the rows (= all the countries) that satisfy the 
# condition: world.data $ region == "N. America"
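One caveat with logical conditions inside square brackets: if the variable used in the condition contains NAs, the comparison returns NA for those rows, and base R then returns a row filled with NAs for each of them. The sketch below uses a made-up `toy` data frame to show this, along with one possible workaround using `which()`, which keeps only the positions where the condition is TRUE.

```r
# Made-up data frame with a missing region, for illustration only
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  region  = c("N. America", NA, "Africa", "N. America"),
  stringsAsFactors = FALSE
)

toy$region == "N. America"   # TRUE NA FALSE TRUE

# The NA in the condition produces an extra row filled with NAs
with_na <- toy[toy$region == "N. America", ]
nrow(with_na)   # 3

# which() returns only the positions where the condition is TRUE
safe <- toy[which(toy$region == "N. America"), ]
nrow(safe)      # 2
```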

Combining what we have learned so far, we can create a smaller data set that excludes all countries for which women09 is missing, as follows:

wd.women09 <- world.data[ is.na(world.data $ women09) == FALSE , ]
head(wd.women09)

##             country       colony confidence decentralization dem_other
## 1       Afghanistan           UK         NA               NA      10.5
## 2           Albania Soviet Union   49.33593             0.74      63.0
## 3           Algeria       France   52.05573               NA      40.8
## 4           Andorra        Spain         NA               NA     100.0
## 5            Angola     Portugal         NA               NA      40.8
## 6 Antigua & Barbuda           UK         NA               NA      87.5
##   dem_other5 democ_regime    district_size3 durable effectiveness      enpp_3
## 1        10%           No     single member       4      13.71158            
## 2 Approx 60%          Yes                         3      35.46099 1-3 parties
## 3 Approx 40%           No 6 or more members       5      32.62411            
## 4       100%          Yes                        NA      78.72340            
## 5 Approx 40%           No                         3      19.14894            
## 6 Approx 90%          Yes     single member      NA      59.81088 1-3 parties
##           eu fhrate04_rev fhrate08_rev frac_eth frac_eth3 free_business
## 1 Not member          2.5            3   0.7693      High            NA
## 2 Not member            5            8   0.2204       Low          68.0
## 3 Not member          2.5            3   0.3394    Medium          71.2
## 4 Not member    Most free           12   0.7139      High            NA
## 5 Not member          2.5            3   0.7867      High          43.4
## 6 Not member            6           10   0.1643       Low            NA
##   free_corrupt free_finance free_fiscal free_govspend free_invest free_labor
## 1           NA           NA          NA            NA          NA         NA
## 2           34           70        92.6          74.2          70       52.1
## 3           32           30        83.5          73.4          45       56.4
## 4           NA           NA          NA            NA          NA         NA
## 5           19           40        85.1          62.8          35       45.2
## 6           NA           NA          NA            NA          NA         NA
##   free_monetary free_overall free_property free_trade gdp08 gdp_10_thou
## 1            NA           NA            NA         NA  30.6          NA
## 2          78.7         66.0            35       85.8  24.3      0.1535
## 3          77.2         56.9            30       70.7 276.0      0.1785
## 4            NA           NA            NA         NA    NA          NA
## 5          62.6         48.4            20       70.4 106.3      0.0857
## 6            NA           NA            NA         NA    NA      1.0449
##   gdp_cap2 gdp_cap3 gdppcap08 gender_equal3 gini04 gini08   hi_gdp indy
## 1                          NA                   NA     NA          1919
## 2      Low   Middle      7715                 28.2   31.1  Low GDP 1991
## 3      Low   Middle      8033                 35.3   35.3  Low GDP 1962
## 4                          NA                   NA     NA          1278
## 5      Low   Middle      5899                   NA     NA  Low GDP 1975
## 6     High     High        NA                   NA     NA High GDP 1981
##         oecd  old2006  old2003     pmat12_3    pop03 pop08    pop08_3
## 1 Not member       NA       NA                    NA  27.4 >=16.8 mil
## 2 Not member 8.479821 7.278363 Low post-mat  3169064   3.1  <=4.3 mil
## 3 Not member 4.578136 4.045199              31832610  34.4 >=16.8 mil
## 4 Not member       NA       NA                 66000    NA           
## 5 Not member 2.450295 2.930542              13522110  18.0 >=16.8 mil
## 6 Not member       NA 8.186610                 78580    NA           
##            popcat3 pr_sys protact3        regime_type3      region sources
## 1 Moderate (1-29m)     No                 Dictatorship Middle East      NA
## 2 Moderate (1-29m)     No Moderate Parliamentary democ  C&E Europe      NA
## 3 Moderate (1-29m)    Yes                 Dictatorship      Africa      NA
## 4 Small (under 1m)     No          Parliamentary democ   W. Europe      NA
## 5 Moderate (1-29m)    Yes                 Dictatorship      Africa      NA
## 6 Small (under 1m)     No          Parliamentary democ  S. America      NA
##          typerel unions urban03 urban06 vi_rel3 votevap00s women05 women09
## 1         Muslim     NA      NA   23.28                 NA      NA    27.7
## 2         Muslim     NA 44.2390   46.14  20-50%      59.56     6.4    16.4
## 3         Muslim     NA 58.8302   63.94    >50%         NA      NA     7.7
## 4 Roman Catholic     NA 91.7404   90.28              20.95    14.3    35.7
## 5 Roman Catholic     NA 36.1806   53.96                 NA      NA    37.3
## 6     Protestant     NA 37.7566   39.60              76.34    10.5    10.5
##   womyear       womyear2  yng2003  young06
## 1      NA                      NA       NA
## 2    1920 1944 or before 27.34834 26.35428
## 3    1962     After 1944 33.91887 28.94154
## 4    1973     After 1944       NA       NA
## 5    1975     After 1944 47.62524 46.32196
## 6    1951     After 1944 20.66509       NA

wd.women09 <- world.data %>% filter(!is.na(women09))
head(wd.women09)

##             country       colony confidence decentralization dem_other
## 1       Afghanistan           UK         NA               NA      10.5
## 2           Albania Soviet Union   49.33593             0.74      63.0
## 3           Algeria       France   52.05573               NA      40.8
## 4           Andorra        Spain         NA               NA     100.0
## 5            Angola     Portugal         NA               NA      40.8
## 6 Antigua & Barbuda           UK         NA               NA      87.5
##   dem_other5 democ_regime    district_size3 durable effectiveness      enpp_3
## 1        10%           No     single member       4      13.71158            
## 2 Approx 60%          Yes                         3      35.46099 1-3 parties
## 3 Approx 40%           No 6 or more members       5      32.62411            
## 4       100%          Yes                        NA      78.72340            
## 5 Approx 40%           No                         3      19.14894            
## 6 Approx 90%          Yes     single member      NA      59.81088 1-3 parties
##           eu fhrate04_rev fhrate08_rev frac_eth frac_eth3 free_business
## 1 Not member          2.5            3   0.7693      High            NA
## 2 Not member            5            8   0.2204       Low          68.0
## 3 Not member          2.5            3   0.3394    Medium          71.2
## 4 Not member    Most free           12   0.7139      High            NA
## 5 Not member          2.5            3   0.7867      High          43.4
## 6 Not member            6           10   0.1643       Low            NA
##   free_corrupt free_finance free_fiscal free_govspend free_invest free_labor
## 1           NA           NA          NA            NA          NA         NA
## 2           34           70        92.6          74.2          70       52.1
## 3           32           30        83.5          73.4          45       56.4
## 4           NA           NA          NA            NA          NA         NA
## 5           19           40        85.1          62.8          35       45.2
## 6           NA           NA          NA            NA          NA         NA
##   free_monetary free_overall free_property free_trade gdp08 gdp_10_thou
## 1            NA           NA            NA         NA  30.6          NA
## 2          78.7         66.0            35       85.8  24.3      0.1535
## 3          77.2         56.9            30       70.7 276.0      0.1785
## 4            NA           NA            NA         NA    NA          NA
## 5          62.6         48.4            20       70.4 106.3      0.0857
## 6            NA           NA            NA         NA    NA      1.0449
##   gdp_cap2 gdp_cap3 gdppcap08 gender_equal3 gini04 gini08   hi_gdp indy
## 1                          NA                   NA     NA          1919
## 2      Low   Middle      7715                 28.2   31.1  Low GDP 1991
## 3      Low   Middle      8033                 35.3   35.3  Low GDP 1962
## 4                          NA                   NA     NA          1278
## 5      Low   Middle      5899                   NA     NA  Low GDP 1975
## 6     High     High        NA                   NA     NA High GDP 1981
##         oecd  old2006  old2003     pmat12_3    pop03 pop08    pop08_3
## 1 Not member       NA       NA                    NA  27.4 >=16.8 mil
## 2 Not member 8.479821 7.278363 Low post-mat  3169064   3.1  <=4.3 mil
## 3 Not member 4.578136 4.045199              31832610  34.4 >=16.8 mil
## 4 Not member       NA       NA                 66000    NA           
## 5 Not member 2.450295 2.930542              13522110  18.0 >=16.8 mil
## 6 Not member       NA 8.186610                 78580    NA           
##            popcat3 pr_sys protact3        regime_type3      region sources
## 1 Moderate (1-29m)     No                 Dictatorship Middle East      NA
## 2 Moderate (1-29m)     No Moderate Parliamentary democ  C&E Europe      NA
## 3 Moderate (1-29m)    Yes                 Dictatorship      Africa      NA
## 4 Small (under 1m)     No          Parliamentary democ   W. Europe      NA
## 5 Moderate (1-29m)    Yes                 Dictatorship      Africa      NA
## 6 Small (under 1m)     No          Parliamentary democ  S. America      NA
##          typerel unions urban03 urban06 vi_rel3 votevap00s women05 women09
## 1         Muslim     NA      NA   23.28                 NA      NA    27.7
## 2         Muslim     NA 44.2390   46.14  20-50%      59.56     6.4    16.4
## 3         Muslim     NA 58.8302   63.94    >50%         NA      NA     7.7
## 4 Roman Catholic     NA 91.7404   90.28              20.95    14.3    35.7
## 5 Roman Catholic     NA 36.1806   53.96                 NA      NA    37.3
## 6     Protestant     NA 37.7566   39.60              76.34    10.5    10.5
##   womyear       womyear2  yng2003  young06
## 1      NA                      NA       NA
## 2    1920 1944 or before 27.34834 26.35428
## 3    1962     After 1944 33.91887 28.94154
## 4    1973     After 1944       NA       NA
## 5    1975     After 1944 47.62524 46.32196
## 6    1951     After 1944 20.66509       NA

# Then, the process of calculating the mean for women09 becomes a little 
# bit simpler. 

# 1: Point estimate
pe <- mean( wd.women09 $ women09 )

# 2: Standard error
sd <- sd( wd.women09 $ women09 )
n <- length( wd.women09 $ women09 )
se <- sd/sqrt(n)

pe          # Point estimate

## [1] 17.17722

pe - 2 * se # lower bound

## [1] 15.52954

pe + 2 * se # upper bound

## [1] 18.82491

# We get 17.18 with a 95% CI of [15.53, 18.82], just as before.
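A quick sanity check of this approach, sketched on a made-up `toy` data frame rather than on world.csv: after dropping the rows with missing values, the number of rows equals the number of non-missing values, and the mean no longer needs `na.rm = TRUE`.

```r
# Made-up data frame for illustration only
toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  women   = c(10, NA, 20, 30, NA)
)

# Keep only the rows where women is not missing
toy_complete <- toy[!is.na(toy$women), ]

nrow(toy_complete)         # 3
mean(toy_complete$women)   # 20, same as mean(toy$women, na.rm = TRUE)
```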

3. Re-labeling a factor variable

Let’s say we are interested in describing the gini08 variable for democracies and non-democracies using the democ_regime variable.

We have learned about the facet_wrap option previously. So, re-using the same commands, we can do

g <- ggplot(world.data, aes(x = gini08))
g <- g + geom_histogram()
g <- g + theme(axis.text.x = element_text(size = 12))
g <- g + xlab("Gini coefficient")
g <- g + ylab("Number of countries")
g <- g + facet_wrap( ~ democ_regime) # we can create separate histograms of gini08 for different regime types.
g

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 64 rows containing non-finite outside the scale range
## (`stat_bin()`).

What’s a bit annoying about this graph is that we have a blank panel on the far right. This happens because there are observations where the democ_regime variable is missing. To remedy this, we need to create a smaller data set that excludes observations with missing values, just like we did above.

Another undesirable thing is that the graph titles are not very informative. We just see “No”, “Yes”, and “NA”, but readers wouldn’t know what they mean. It would be better if we label them so that readers can easily see what they mean. To do so, we need to re-label the factor variable, democ_regime.

Construct a smaller data set first:

wd.dem <- world.data[ is.na(world.data $ democ_regime) == FALSE , ] # base R: remove NAs
wd.dem <- world.data %>% filter(!is.na(democ_regime)) # dplyr: same result, so either line is enough


# If we create the same histograms with this smaller data set, we will 
# no longer have a blank graph:

g <- ggplot(wd.dem, aes(x = gini08))
g <- g + geom_histogram()
g <- g + theme(axis.text.x = element_text(size = 12))
g <- g + xlab("Gini coefficient")
g <- g + ylab("Number of countries")
g <- g + facet_wrap( ~ democ_regime)
g

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 62 rows containing non-finite outside the scale range
## (`stat_bin()`).

Let’s now re-label the factor variable, democ_regime.

We will do so by creating a new variable that is equal to the original democ_regime variable, except for the labels. The new variable should be coded as “Democracy” (rather than “Yes”) for democratic countries and “Autocracy” (rather than “No”) for non-democratic countries.

To create a new factor variable, we use the factor function. Let’s call the new variable “democ”. We do the following.

wd.dem $ democ <- factor(wd.dem $ democ_regime, 
                         levels = c("No", "Yes"),
                         labels = c("Autocracy", "Democracy")
                         )

# Then, if we create the histogram using the newly created factor:

g <- ggplot(wd.dem, aes(x = gini08))
g <- g + geom_histogram()
g <- g + theme(axis.text.x = element_text(size = 12))
g <- g + xlab("Gini coefficient")
g <- g + ylab("Number of countries")
g <- g + facet_wrap( ~ democ)
g + theme_bw() # I like a black-white theme more

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 62 rows containing non-finite outside the scale range
## (`stat_bin()`).

Now the titles are more informative. Hooray!
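The re-labeling can be checked on a toy vector before applying it to real data. The sketch below uses made-up values in `regime`. The `levels` argument lists the existing codes and `labels` gives the new names in the same order. Any value that is not listed in `levels` becomes NA, so a cross-tabulation of the old and new variables is a useful check.

```r
# Made-up values for illustration only
regime <- c("Yes", "No", "No", "Yes", "Yes")

regime_f <- factor(regime,
                   levels = c("No", "Yes"),               # existing codes
                   labels = c("Autocracy", "Democracy"))  # new names, same order

levels(regime_f)
table(regime, regime_f)   # each old code should map to exactly one new label

# A value not listed in levels is turned into NA
unlisted <- factor("Maybe",
                   levels = c("No", "Yes"),
                   labels = c("Autocracy", "Democracy"))
is.na(unlisted)   # TRUE
```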

Last updated on October 24, 2024