Data management

Instructions

  1. Homework assignment is here with all questions commented out.
  2. Complete all of the coding tasks in the .qmd file.
  3. Upload your individual exercise to your GitHub repo by Wed 11:59pm.
  4. Remember to tidy your GitHub repo and sort homework submissions by week. Each week should have its own folder.
# ##-------------------------------------------------------------------------
# ## R code for POLI502 Lab, week 5: Individual Exercises
# ## Written by: Howard Liu
# ##-------------------------------------------------------------------------
# 
# 
# # 1. Exploring a data set -------------------------------------------------
# 
# # We have learned several functions to explore a data set, including
# 
# dim(world.data)
# head(world.data)
# tail(world.data)
# 
# # There are some other functions we can use. For example, the names
# # function tells us the names of all the variables included in a data frame
# # object. 
# 
# names(world.data)
# 
# # The colnames function gives us the same results as well. 
# 
# colnames(world.data)
# 
# 
# # We can also apply the summary function without specifying variable
# # names. Then, R will provide the summary of ALL the variables included
# # in a data frame object. 
# 
# summary(world.data)
# 
# 
# # The str() function tells us the structure of a data frame object, 
# # meaning that it tells us which variables are factors, which ones are
# # numerical, which ones are logical, etc.
# 
# str(world.data)
# 
# 
# # 2. Summarizing categorical variables ------------------------------------
# 
# 
# # The output from the str function above tells us that there are many factor
# # variables in the data set. For example, the democ_regime variable 
# # is a factor variable (nominal-level). Summarize the information contained
# # in this variable by creating a frequency table. 
# 
# 
# WRITE YOUR COMMAND HERE
# 
# 
# # The typerel variable is another factor variable. 
# # This variable measures predominant religion in a given country. 
# # Create a frequency table for this variable. 
# 
# WRITE YOUR COMMAND HERE
# 
# 
# # Make this frequency table vertical using the data.frame function
# 
# WRITE YOUR COMMAND HERE
# 
# 
# # We have seen in the lecture that we often report RELATIVE frequencies
# # as well as raw frequencies. Relative frequencies can be obtained by 
# # dividing each of the raw frequency values by the total number of 
# # observations. Let's see how we do this. 
# 
# 
# # To do so, it is better if we create a new object that stores the 
# # frequency table. Let's create an object called ft.colony that is
# # equal to the vertical frequency table for the colony variable, 
# # as follows. 
# 
# ft.colony <- data.frame( table(world.data $ colony) )
# 
# # To make sure we did this correctly, let's take a look
# 
# ft.colony
# 
# 
# # We can see that the first column, Var1, records all possible
# # values and the second column, Freq, records the raw frequency. 
# 
# 
# # To convert the raw frequencies into relative frequencies, we divide
# # the values by the sum of Freq.
# # As we learned before, we use the sum function to calculate the sum of
# # all the values, as follows. 
# 
# sum( ft.colony $ Freq )
# 
# # The relative frequencies are Freq divided by sum( ft.colony $ Freq )
# 
# ft.colony $ Freq / sum( ft.colony $ Freq )
# 
# # Alternatively, we can use the prop.table function to obtain the 
# # same results
# 
# prop.table(ft.colony $ Freq)
# 
# # We may want to convert these proportions into percentages. 
# # To turn a proportion into a percentage, we simply multiply it by 100
# 
# prop.table(ft.colony $ Freq) * 100
# 
# 
# # We may also want to round these numbers to simplify the presentation. 
# # As we learned two weeks ago, we use the round function to do that. 
# 
# round(prop.table(ft.colony $ Freq) * 100, digits = 2)
# 
# 
# # Finally, we want to insert these numbers into the frequency table 
# # we created and stored in ft.colony. 
# 
# ft.colony
# 
# # How do we do it? 
# # We do this by creating a new column in the ft.colony object. 
# # As we learned last week, we use the $ symbol to create a new column
# # in a data frame object, as follows
# 
# ft.colony $ Percent <- round(prop.table(ft.colony $ Freq) * 100, digits = 2)
# 
# 
# # Now, our frequency table contains three columns, as follows
# 
# ft.colony
# 
# # Finally, we may want to change the column name for the first column
# # from "Var1" to something more intuitive. 
# # To do so, we use the colnames function, as follows
# 
# colnames(ft.colony)[colnames(ft.colony) == "Var1"] <- "Colonizer"
# 
# ft.colony
# 
# 
# # We can see that about 33% of the countries in the world are former colonies
# # of the UK, about 15% of them are former colonies of France, about 10% of them
# # were never colonized, etc. 
# 
# 
# # To summarize the steps to create a frequency table:
# 
# ft.colony <- data.frame( table(world.data $ colony) )
# ft.colony $ Percent <- round(prop.table(ft.colony $ Freq) * 100, digits = 2)
# colnames(ft.colony)[colnames(ft.colony) == "Var1"] <- "Colonizer"
# ft.colony
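# 
# # The same three steps can be tried on a small made-up data frame.
# # Note: toy.data and its pet column are hypothetical (they are not part
# # of world.data); this is only a sketch of the recipe above.
# 
# toy.data <- data.frame(pet = c("cat", "dog", "cat", "bird", "cat"))
# ft.pet <- data.frame( table(toy.data $ pet) )
# ft.pet $ Percent <- round(prop.table(ft.pet $ Freq) * 100, digits = 2)
# colnames(ft.pet)[colnames(ft.pet) == "Var1"] <- "Pet"
# ft.pet   # one row per pet, with Freq and Percent columns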
# 
# 
# 
# # Create a frequency table for the typerel variable
# 
# WRITE YOUR COMMAND HERE
# 
# # Which religion is the most "popular" in the world? 
# # That is, what is the mode of the "typerel" variable?
# 
# WRITE YOUR ANSWER IN WORDS HERE
# 
# 
# # What percentage of countries are predominantly Muslim? 
# 
# WRITE YOUR ANSWER HERE
# 
# 
# # Create a frequency table for democ_regime
# 
# WRITE YOUR COMMAND HERE
# 
# # What percentage of countries have a democratic regime?
# 
# WRITE YOUR ANSWER IN WORDS HERE
# 
# 
# # Create a bar chart to summarize the typerel variable
# 
# WRITE YOUR COMMAND HERE
# 
# 
# # Create a bar chart to summarize the democ_regime variable
# 
# WRITE YOUR COMMAND HERE
# 
# 
# 
# 
# 
# # 3. Making ggplot graphs look nicer --------------------------------------
# 
# # We have seen how to create a graph using the ggplot function
# # (the ggplot2 package must be loaded first with library(ggplot2))
# 
# ggplot(world.data, aes(x = colony)) + geom_bar()
# 
# # The command above is the easiest way to produce a simple ggplot
# # graph, but we would want to modify some parts of the graph, such as
# # axis labels. For example, the graph above currently says 
# # "colony" on the x-axis and "count" on the y-axis. We may want to 
# # modify them so they can be more informative. 
# 
# 
# # When we want to modify a graph, we usually create a ggplot graph
# # and store it in an object. Then we gradually add features
# # to modify it. The above command can be rewritten as follows:
# 
# 
# g <- ggplot(world.data)  # Tells R which data frame contains the variable to plot
# g <- g + aes(x = colony) # Tells R which variable to plot
# g <- g + geom_bar()      # Tells R what type of graph we want
# g                        # Tells R to show the contents of the object g
# 
# # Now that we have stored the graph in an object called g, we can 
# # modify its appearance by adding more layers.
# 
# # To change the label for the x-axis, we add the xlab function, as follows
# 
# g <- g + xlab("Colony of Which Country?")
# g
# 
# # Similarly, we can modify the label for the y-axis
# 
# g <- g + ylab("Number of countries")
# g
# 
# # To change the text size of the x-axis tick labels, use the theme
# # function with axis.text.x (use axis.text.y for the y-axis)
# 
# g <- g + theme(axis.text.x = element_text(size = 12))
# g
# 
# 
# # We can save this graph as a PDF file using the ggsave function. 
# 
# ggsave(file = "colony_bar.pdf", width = 10, height = 8)
# 
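# # The file option (short for ggsave's filename argument) specifies 
# # the name of the PDF file you want to create. 
# # The width and height options control the width and height of the
# # PDF file, respectively (in inches by default). 
# # By default, ggsave saves the most recently displayed graph.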
# # Once you save a graph in a PDF, you can easily embed it in a 
# # Word document simply by drag & drop.
# 
# 
# # To summarize what we have done so far,
# 
# g <- ggplot(world.data)
# g <- g + aes(x = colony)
# g <- g + geom_bar()
# g <- g + xlab("Colony of Which Country?")
# g <- g + ylab("Number of countries")
# g <- g + theme(axis.text.x = element_text(size = 12))
# g
# ggsave(file = "colony_bar.pdf", width = 10, height = 8)
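# 
# # Two optional variations, shown as a sketch on R's built-in mtcars data
# # (the object name p and the file name "mtcars_bar.pdf" are made up for
# # illustration). The labs function sets both axis labels in one step,
# # and the plot argument of ggsave names the graph to save explicitly
# # instead of relying on the most recently displayed one.
# 
# library(ggplot2)
# p <- ggplot(mtcars)
# p <- p + aes(x = factor(cyl))
# p <- p + geom_bar()
# p <- p + labs(x = "Number of cylinders", y = "Number of cars")
# ggsave(file = "mtcars_bar.pdf", plot = p, width = 10, height = 8)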
# 
# 
# 
# 
# # Create a bar chart for the typerel variable, and save it as a PDF file
# 
# WRITE YOUR COMMANDS HERE
# 
# 
# 
# 
# 
# 
# # 4. Summarizing numerical variables --------------------------------------
# 
# # There are two variables in the data set, gini04 and gini08, that measure 
# # the level of economic inequality in a country numerically. 
# # These are Gini coefficients (also called the Gini index or Gini ratio), which
# # take values between 0 and 1 (or 0% and 100%). A value of 0 corresponds to
# # the "perfect equality" case, where everyone in a country earns the same 
# # amount of money, whereas a value of 1 (100%) corresponds to the maximal 
# # inequality case, where one person earns ALL the money in a country and
# # everyone else earns nothing. The gini04 variable is from the year 2004,
# # whereas the gini08 variable is from the year 2008. 
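# 
# # To make the two extreme cases concrete, here is a minimal sketch that
# # computes a Gini coefficient for made-up incomes. The function name
# # toy.gini and the income vectors are hypothetical; the formula used is
# # the average absolute difference between all pairs of incomes divided
# # by twice the mean income. The data set already contains the
# # coefficients, so you do not need this for the exercises.
# 
# toy.gini <- function(income) {
#   sum(abs(outer(income, income, "-"))) / (2 * length(income)^2 * mean(income))
# }
# toy.gini(c(10, 10, 10, 10))   # everyone earns the same: 0
# toy.gini(c(0, 0, 0, 40))      # one person earns everything: 0.75
# 
# # With n people, the one-earner case gives (n - 1) / n, which gets
# # closer to 1 as the population grows.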
# 
# 
# # Numerically summarize the gini04 variable. 
# # That is, calculate and present measures of central tendency and 
# # measures of dispersion. 
# 
# WRITE YOUR COMMANDS HERE
# 
# 
# # Numerically summarize the gini08 variable. 
# # That is, calculate and present measures of central tendency and 
# # measures of dispersion. 
# 
# WRITE YOUR COMMANDS HERE
# 
# 
# # Compare the distributions of gini04 and gini08. Do you think that the 
# # level of economic inequality is getting worse, getting better, or neither?
# # Why or why not? 
# 
# # Note: I'm not asking why it is getting better, worse, or neither; 
# #       I'm asking on what basis you can conclude that it's getting better or worse. 
# 
# 
# 
# WRITE YOUR ANSWER HERE
# 
# 
# 
# # Create a histogram of gini04
# # Modify the axis labels accordingly to make them informative and intuitive.
# # Save the graph as a PDF file.
# 
# WRITE YOUR COMMANDS HERE
# 
# 
# 
# # Create a histogram of gini08
# # Modify the axis labels accordingly to make them informative and intuitive.
# # Save the graph as a PDF file.
# 
# WRITE YOUR COMMANDS HERE
# 
# 
# 
# # Compare the distributions of gini04 and gini08 graphically by 
# # placing the two PDF files you just created side by side. 
# # Does the graphical comparison confirm the conclusion you reached earlier? 
# 
# 
# WRITE YOUR ANSWER HERE
# 
# 
# 
# 
# # As we saw in the lecture, we sometimes create histograms for different
# # values of a nominal-level variable. For example, we may want to create
# # separate histograms of gini04 for countries in different regions.
# 
# 
# # To do so, we add the facet_wrap function, as follows.
# 
# 
# COPY & PASTE THE CODE YOU WROTE TO PRODUCE HISTOGRAM FOR gini04 HERE
# 
# g <- g + facet_wrap( ~ region)
# g
# 
# 
# # We can see that Scandinavian countries and Western European countries
# # have, on average, lower Gini coefficients (= they are more egalitarian),
# # whereas countries in South America have relatively high values. 
# 
# 
# 
# 
# # Create separate histograms of women09 for countries in different regions.
# 
# 
# WRITE YOUR COMMANDS HERE
# 
# 
# 
# 
# # We may want to do the same using numerical methods.
# # That is, we may want to obtain central tendencies and dispersions for
# # a numerical variable for different groups. 
# 
# # To do so, we use the by function. 
# # The by function takes the following form
# # 
# # by( VARIABLE_YOU_WANT_TO_ANALYZE, GROUP, FUNCTION )
# # 
# # That is, you provide 
# # (1) an interval-level variable you want to summarize first, 
# # (2) a comma
# # (3) a nominal variable that separates observations into groups
# # (4) a comma
# # (5) a function you want to apply (such as summary, mean, median, sd, etc.)
# 
# # For example, to obtain numerical summaries of gini04 for different regions, 
# # we write
# 
# by(world.data $ gini04, world.data $ region, summary)
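# 
# # The same pattern works with any grouping variable and function. As a
# # sketch on R's built-in iris data (not part of world.data), the mean
# # sepal length for each species is
# 
# by(iris $ Sepal.Length, iris $ Species, mean)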
# 
# 
# 
# # Calculate the standard deviation of gini04 for different regions using
# # the by function
# # Hint: we still need to take care of the missing value problem. 
# #       Use the na.rm = TRUE option. 
# 
# 
# WRITE YOUR COMMAND HERE
# 
# 
# # Which region has the smallest dispersion? 
# 
# 
# WRITE YOUR ANSWER HERE
# 
# 
# # End of file