Scraping SeeThroughNY Data using R

Here is the R code I am using to scrape SeeThroughNY.net and download wage data for state and local government employees.

library(tidyverse)
library(RSelenium)
library(netstat)
library(rvest)

# Load the Selenium browser. This should automatically open a Firefox window
# from R, downloading the latest GeckoDriver if necessary.
#
# If this doesn't work, try deleting the LICENSE.chromedriver file, which
# sometimes prevents RSelenium from loading:
## find ~/.local/share/ -name LICENSE.chromedriver | xargs -r rm

rs <- rsDriver(
  remoteServerAddr = "localhost",
  port = free_port(random = TRUE),
  browser = "firefox",
  verbose = FALSE
)

rsc <- rs$client

# STOP !!!
# While you could automate this step, you should now manually choose your
# search criteria in the SeeThroughNY browser window that has opened (if the
# window is blank, load the site first with rsc$navigate()). Then execute the
# following lines.

# Next, load all of the results. We limit this to 30 attempts, which covers
# most reasonably sized queries. Too large a query could crash your browser
# by using too much memory.

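for (i in seq_len(30)) {
  # Click the #data_loader element to load the next batch of results
  rsc$findElement(using = 'css', '#data_loader')$clickElement()

  # Stop as soon as the page hides #data_loader
  if (rsc$findElement(using = 'css', '#data_loader')$getElementAttribute('style')[[1]] == 'display: none;')
    break

  # Pause before the next attempt so the page has time to update
  Sys.sleep(2)
}
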
# Next, pull and clean the HTML table that contains the data
rsc$getPageSource() %>%
  unlist() %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() -> employees

# Some of the data sits in the (+) tab, but this is just a table cell
# on every other row, which we split into the appropriate field values

employees %>%
  filter(row_number() %% 2 == 0) %>%
  select(name) %>%
  separate(name, sep = '\n',
           into = c(NA, 'subagency', NA, NA, NA, 'title', NA, NA, NA, 'rateofpay',
                    NA, NA, NA, 'payyear', NA, NA, NA, 'paybasis', NA, NA, NA, 'branch')) %>%
  cbind(employees %>% filter(row_number() %% 2 != 0), .) %>%
  mutate(across(everything(), str_trim),
         total_pay = parse_number(total_pay)) %>%
  select(-x, -x_2, -subagency_type) -> employees

# Then you can pipe this data into ggplot or any other program,
# or export it to a CSV or Excel file
employees %>% write_csv('/tmp/employee_data.csv')
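
If you want to see what the odd/even row split is doing without opening a browser, here is a minimal sketch on a made-up two-employee table. The names, pay values, and the simplified four-piece detail layout are all assumptions for illustration. The real page packs more fields into each detail row, and bind_cols() stands in here for the cbind() call used above.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical stand-in for the scraped table: odd rows are the visible
# summary rows, even rows hold the (+) details as one newline-separated blob.
raw <- tibble(
  name = c(
    "Doe, Jane", "SubAgency\nParks\nTitle\nClerk",
    "Roe, Rick", "SubAgency\nHighway\nTitle\nEngineer"
  ),
  total_pay = c("$50,000", "", "$72,500", "")
)

# Even rows: split the blob on newlines. An NA in `into` drops that piece,
# which is how the label pieces get discarded and only the values are kept.
details <- raw %>%
  filter(row_number() %% 2 == 0) %>%
  select(name) %>%
  separate(name, sep = "\n", into = c(NA, "subagency", NA, "title"))

# Odd rows: keep the summary fields, attach the details side by side,
# and turn the formatted pay string into a number.
toy <- raw %>%
  filter(row_number() %% 2 != 0) %>%
  bind_cols(details) %>%
  mutate(total_pay = parse_number(total_pay))

print(toy)
```

The side-by-side bind only lines up because every summary row is followed by exactly one detail row, so it is worth checking that the scraped table has an even number of rows before trusting the result.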
