Scraping SeeThroughNY Data using R

Here is the R code I am using to scrape SeeThroughNY.net to download state and local government employment wage data.

library(tidyverse)
library(RSelenium)
library(netstat)
library(rvest)

# Load Selenium browser. This code should automatically open a Firefox window
# from r, downloading the latest GeckoDriver if neccessary.
#
# If this doesn't work, you should delete the LICENSE.chromedriver which
# sometimes causes rSelenium to not load.
## find ~/.local/share/ -name LICENSE.chromedriver | xargs -r rm

rs <- rsDriver(
  remoteServerAddr = "localhost",
  port = free_port(random = T),
  browser = "firefox",
  verbose = F
)

rsc <- rs$client
rsc$navigate("https://seethroughny.net/payrolls")

# STOP !!!
# While you could automate this step, you should now manually choose your 
# search items on SeeThroughNY browser window that has opened. Then 
# you should execute the following lines.

# Next you want to load all of the results. We limit it to 30 attempts,
# which will pull most reasonably sized queries. Too big and you could crash
# your browser due to excessive memory needed.

for (i in seq(1,30)) {
  rsc$findElement(using='css', '#data_loader')$clickElement()
  
  if (rsc$findElement(using='css', '#data_loader')$getElementAttribute('style')[[1]] == 'display: none;')
    break;
 
  Sys.sleep(2)
}

# Next you need to pull and clean the HTML table that
# contains the data
rsc$getPageSource() %>%
  unlist() %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>% 
  janitor::clean_names() -> employees

# Some of the data is located in the (+) tab, but this is just a
# table field located every other row, which split up into the appropiate
# field values

employees %>%
  filter(row_number() %% 2 == 0) %>%
  select(name) %>%
  separate(name, sep='\n', into=c(NA,'subagency',NA,NA,NA,'title',NA,NA,NA,'rateofpay',NA,NA,NA,'payyear',NA,NA,NA,'paybasis',NA,NA,NA,'branch') ) %>%
  cbind(employees %>% filter(row_number() %% 2 != 0), .) %>%
  mutate(across(everything(), str_trim),
         total_pay = parse_number(total_pay)) %>%
  select(-x, -x_2, -subagency_type) -> employees

### Then you can pipe this data into ggplot or any other program.
### Or export it to CSV or Excel file
employees %>% write_csv('/tmp/employee_data.csv')

Leave a Reply

Your email address will not be published. Required fields are marked *