
Locations

We are now going to plot where these jobs are located.

cities <- us_cities()

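# Drop Alaska and Hawaii, strip everything from the first lowercase word onward
# (a suffix such as " city"), and build a "Town ST" key to join on
cities %<>%
  filter(state != "AK", state != "HI") %>%
  mutate(town = str_replace_all(name_2010, "\\s[a-z].+", ""),
         town_st = paste(town, state))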

# Load each scraped result set, build the same "Town ST" key, and tag the job type
# (mutate recycles a single string across all rows, so rep() is not needed)
data_sci <- read_csv('data_sci.csv') %>%
  mutate(name = Town, town_st = paste(name, State), job = "Data Science")

bus_intel <- read_csv('intelligence.csv') %>%
  mutate(name = Town, town_st = paste(name, State), job = "Business Intelligence")

sbdata <- rbind(data_sci, bus_intel)

combined <- left_join(sbdata, cities, by = "town_st")

data_cities <- combined %>%
  filter(job == "Data Science")

bus_cities <- combined %>%
  filter(job == "Business Intelligence")

icons <- awesomeIcons(icon = 'users',
                      markerColor = ifelse(combined$job == 'Data Science','blue','red'),
                      library = 'fa',
                      iconColor = 'black')

leaflet(combined) %>% addTiles() %>%
  addAwesomeMarkers(~lon, ~lat, icon=icons, label=~as.character(job))

From this plot, it seems that the Business Intelligence jobs are somewhat more spread out across the country, while the Data Science jobs are concentrated on the East and West Coasts.
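
One caveat when reading the map: a left join keeps every job row, but a town with no match in the cities table ends up with missing coordinates and cannot be drawn. The sketch below shows one way to count those rows. It uses base R's merge() on small invented data frames (the towns, coordinates, and the names jobs and city_lookup are illustrative assumptions, not the scraped data).

```r
# Hypothetical job listings; "Smallville KS" is deliberately absent from the lookup
jobs <- data.frame(
  town_st = c("Boston MA", "Austin TX", "Smallville KS"),
  job = c("Data Science", "Business Intelligence", "Data Science"),
  stringsAsFactors = FALSE
)

# Hypothetical, incomplete city lookup with approximate coordinates
city_lookup <- data.frame(
  town_st = c("Boston MA", "Austin TX"),
  lat = c(42.36, 30.27),
  lon = c(-71.06, -97.74),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of jobs, which mirrors a left join
joined <- merge(jobs, city_lookup, by = "town_st", all.x = TRUE)

# Rows with NA latitude found no match and would be missing from the map
unmatched <- joined[is.na(joined$lat), ]
n_unmatched <- nrow(unmatched)
print(unmatched$town_st)
```

With the real data, the same check would be sum(is.na(combined$lat)); dplyr's anti_join(sbdata, cities, by = "town_st") would list the unmatched towns directly.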

There are many problems with this data. While scraping and cleaning the data in Python, some of it didn’t transfer over correctly, which made it unusable for this analysis. I also didn’t use a dataset that had spatial data for every town in the US, so some points are missing from the map. Also, not every instance of ‘R’ is being counted. One of the functions that I am using doesn’t recognize single letters as words, and my attempt to capture ‘R’ only finds the ones with a space before and after. Another problem is that I am only scraping dice.com, and not any other job search site. It is only the first four pages as well, so the sample is fairly small. Even with all of these problems, though, we were able to find some insights in the data.
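
To make the ‘R’ problem concrete, here is a minimal sketch comparing a space-delimited match with a word-boundary match. The example strings are invented, and the word-boundary pattern (\\bR\\b) is only one possible alternative, not what the scraper in this post used.

```r
# Hypothetical job-description snippets, invented for illustration
descriptions <- c(
  "Experience with R, Python and SQL",
  "Proficiency in Python or R.",
  "Strong R skills required",
  "Work with the R&D team",
  "Reporting and ETL experience"
)

# Space-delimited match: only finds an R with a space on both sides
space_match <- grepl(" R ", descriptions, fixed = TRUE)

# Word-boundary match: also finds an R followed by punctuation or end of text
boundary_match <- grepl("\\bR\\b", descriptions, perl = TRUE)

print(data.frame(descriptions, space_match, boundary_match))
```

The space-delimited version only catches the third string, while the word-boundary version catches the first three. It also flags "R&D", so a word boundary alone would still need some filtering for false positives.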

This post is licensed under CC BY 4.0 by the author.