I’ve decided to start learning statistical computing ahead of the harder stats classes I’ll be taking this fall (my subfield within the political science major is Empirical Theory and Quantitative Methods). As a first little project to teach myself the basics of the R language and environment, I decided to compare the consumer price index in small cities (population under 50,000) with large cities (population over 1,500,000). To do that, I needed to get the data, format it in an R-friendly way, and then present it in a way that makes sense. Since many of the R tutorials out there aren’t very clear on some things, I documented my steps as I figured out what worked.
The Bureau of Labor Statistics gives anyone access to their consumer price index database, and lets you see the information for specific regions. The two pieces of data I chose were Size Class A (over 1,500,000) and Size Class D (under 50,000) for 1993 to 2012. Retrieving the data as tables, I pasted each into a separate Numbers spreadsheet (this is on my MacBook Air) and exported them to my Downloads as “cpibig19932012.csv” and “cpi19932012.csv”, respectively.
Getting it into R
Working in RStudio, I clicked on the Files tab in the bottom right window, clicked Home, clicked Downloads (or wherever you decided to save the .csv files), clicked More, then Set As Working Directory. This lets us access the .csv files in the R environment.
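The same thing can be done from a script instead of by clicking, which helps if you rerun the analysis later. Here is a minimal sketch, assuming the files were saved to ~/Downloads (adjust the path to wherever yours are); the variable name data_dir is my own:

```r
# Assumed location of the exported .csv files; change this to match your machine.
data_dir <- path.expand("~/Downloads")

# Only switch directories if the folder actually exists.
if (dir.exists(data_dir)) {
  setwd(data_dir)
}

# Confirm where R is looking, and whether it can see the two files.
print(getwd())
print(file.exists(c("cpi19932012.csv", "cpibig19932012.csv")))
```

If the last line prints FALSE for either file, R is looking in the wrong folder or the file name doesn't match.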
In a new script in the top left window, I import the data into variables cpi and cpiBig for the small cities and big cities, respectively:
cpi <- read.csv(file="cpi19932012.csv", header=TRUE, sep=",")
cpiBig <- read.csv(file="cpibig19932012.csv", header=TRUE, sep=",")
Making a graph
I decided that the best way to represent the data over time would be a line chart showing both data sets on the same graph. I start by deciding on a heading, “Consumer Price Index in small vs. large cities 1993-2012”:
heading = "Consumer Price Index in small vs. large cities 1993-2012"
Next, I had to set up the axes of the graph:
plot(cpi$Year, cpi$Annual, type="n",
xlab = "Year",
ylab = "Average Annual CPI")
- cpi$Year sets the x-axis as the years from the small cities dataset,
- cpi$Annual sets the y-axis as the Average Annual consumer price index from the small cities dataset,
- type="n" tells R not to also show the data points as a scatter plot on the graph,
- xlab = "Year" labels the x-axis as Year,
- ylab = "Average Annual CPI" labels the y-axis as Average Annual CPI
Note that to see all of the columns in a dataset that you could assign to an axis, you can type the following into the Console in the bottom left window:
names(cpi)
You can replace “cpi” with whatever variable you’re interested in.
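A few Console commands are handy for looking inside a data frame before plotting it. This is a sketch on a toy data frame of my own, with placeholder numbers rather than real CPI values, and it assumes columns named Year and Annual as in the plotting code:

```r
# Toy stand-in for the imported data; these values are placeholders, not real CPI figures.
toy <- data.frame(Year = c(1993, 1994, 1995),
                  Annual = c(100, 102, 104))

names(toy)    # column names, i.e. what can follow the $ sign
str(toy)      # each column's type plus its first few values
head(toy, 2)  # the first two rows
```

Swapping toy for cpi or cpiBig shows the same information for the real data.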
Then we graph the data as lines, with small cities colored red and large cities colored blue:
lines(cpi$Year, cpi$Annual, type="l", col="red")
lines(cpiBig$Year, cpiBig$Annual, type="l", col="blue")
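One thing to watch: plot() sizes the axes from the data it is given, and lines() does not rescale them afterwards, so a series that runs outside the small-city range would get cut off. Here is a sketch of one way around that, computing the limits from both series with range(). The two data frames are made-up placeholders standing in for cpi and cpiBig, and the names small, big, x_range and y_range are my own:

```r
# Placeholder data frames standing in for cpi and cpiBig (not real CPI values).
small <- data.frame(Year = 1993:1995, Annual = c(110, 113, 116))
big   <- data.frame(Year = 1993:1995, Annual = c(100, 102, 104))

# Axis limits that cover both series.
x_range <- range(small$Year, big$Year)
y_range <- range(small$Annual, big$Annual)

# Empty plot with the combined limits, then one line per series.
plot(small$Year, small$Annual, type = "n",
     xlim = x_range, ylim = y_range,
     xlab = "Year", ylab = "Average Annual CPI")
lines(small$Year, small$Annual, col = "red")
lines(big$Year, big$Annual, col = "blue")
```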
Finally, we give the chart a legend:
legend("topleft", title="City Size", cex=0.75, pch=16,
col=c("red", "blue"), legend=c("Pop. < 50,000", "Pop. > 1,500,000"), ncol=2)
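This tells R to put the legend in the top left of the chart, title it City Size, give each entry a marker in the matching color, and label each one. The other arguments are cosmetic: cex=0.75 shrinks the text, pch=16 draws a filled circle as each entry’s marker, and ncol=2 puts the two entries side by side.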
To see the output of your script, click Source and then Run in the top left window. You should have something like this show up in the bottom right window:
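If you want the chart as a file and not only in the bottom right window, base R can draw to a graphics device instead. This is a sketch using pdf() and dev.off(); the output path and the placeholder series are my own assumptions, not part of the original script:

```r
# Hypothetical output location; a temp folder is used so nothing gets overwritten.
out_file <- file.path(tempdir(), "cpi_plot.pdf")

# Placeholder series, not real CPI values.
years  <- 1993:1995
annual <- c(100, 102, 104)

pdf(out_file, width = 7, height = 5)  # open a PDF device (size in inches)
plot(years, annual, type = "l", col = "red",
     xlab = "Year", ylab = "Average Annual CPI")
dev.off()                             # close the device so the file is written

file.exists(out_file)
```

Everything drawn between pdf() and dev.off() goes to the file, so the lines() and legend() calls would sit between those two lines as well.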
So what’s happening?
The line for small cities is consistently higher than the line for big cities. How does that make sense? Aren’t small towns full of poor rednecks, and cities full of wealthy-ish hipster urbanites?
I asked my friend Jason Zeng, an economic analyst here in Berkeley, about it, and he gave the following explanation: it comes down to rich suburbanites and urban squalor. The poor in big cities can’t buy the quality goods that the wealthier commuters in suburbs do, so the prices they pay are lower. There are more poor people in the cities than in the suburbs, so the CPI for cities is dragged lower than the CPI for suburbs.