Scraping SeeThroughNY Data using R
Here is the R code I use to scrape SeeThroughNY.net and download state and local government employee wage data.
library(tidyverse)
library(RSelenium)
library(netstat)
library(rvest)
# Load the Selenium browser. This code should automatically open a Firefox window
# from R, downloading the latest GeckoDriver if necessary.
#
# If this doesn't work, delete the LICENSE.chromedriver file, which
# sometimes prevents RSelenium from starting:
## find ~/.local/share/ -name LICENSE.chromedriver | xargs -r rm
rs <- rsDriver(
  remoteServerAddr = "localhost",
  port = free_port(random = TRUE),
  browser = "firefox",
  verbose = FALSE
)
rsc <- rs$client
rsc$navigate("https://seethroughny.net/payrolls")
# STOP !!!
# While you could automate this step (a rough sketch follows below), at this
# point you should manually choose your search criteria in the SeeThroughNY
# browser window that has opened. Then execute the following lines.
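# Optional: a rough sketch of what automating the search might look like.
# NOTE: the CSS selectors below ('#search_term', '.submit-button') and the
# example search text are placeholders I have assumed, not the site's real
# element IDs -- inspect the page and replace them before enabling this.
if (FALSE) {
  search_box <- rsc$findElement(using = 'css', '#search_term')
  search_box$sendKeysToElement(list("Albany"))
  rsc$findElement(using = 'css', '.submit-button')$clickElement()
  Sys.sleep(2)  # give the results time to load
}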
# Next, load all of the results by repeatedly clicking the loader button.
# We limit it to 30 attempts, which is enough for most reasonably sized queries.
# If the query is too large, the browser can crash from excessive memory use.
for (i in seq(1, 30)) {
  rsc$findElement(using = 'css', '#data_loader')$clickElement()
  if (rsc$findElement(using = 'css', '#data_loader')$getElementAttribute('style')[[1]] == 'display: none;') {
    break
  }
  Sys.sleep(2)
}
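# Optional sanity check: count how many result rows are now on the page.
# (This assumes the results are rendered as rows of an HTML table; the
# 'table tbody tr' selector is a guess you may need to adjust.)
length(rsc$findElements(using = 'css', 'table tbody tr'))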
# Next you need to pull and clean the HTML table that
# contains the data
rsc$getPageSource() %>%
  unlist() %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() -> employees
# Some of the data is hidden under the (+) tab, but it is simply stored in a
# table cell on every other row; we split that cell into the appropriate
# field values.
employees %>%
  filter(row_number() %% 2 == 0) %>%
  select(name) %>%
  separate(name, sep = '\n',
           into = c(NA, 'subagency', NA, NA, NA, 'title', NA, NA, NA, 'rateofpay',
                    NA, NA, NA, 'payyear', NA, NA, NA, 'paybasis', NA, NA, NA, 'branch')) %>%
  cbind(employees %>% filter(row_number() %% 2 != 0), .) %>%
  mutate(across(everything(), str_trim),
         total_pay = parse_number(total_pay)) %>%
  select(-x, -x_2, -subagency_type) -> employees
### Then you can pipe this data into ggplot or any other program,
### or export it to a CSV or Excel file.
employees %>% write_csv('/tmp/employee_data.csv')
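# For example, a quick look at the pay distribution with ggplot (total_pay was
# parsed to a number above; adjust the number of bins to suit your query).
employees %>%
  ggplot(aes(x = total_pay)) +
  geom_histogram(bins = 50)

# If you prefer Excel and have the writexl package installed:
# writexl::write_xlsx(employees, '/tmp/employee_data.xlsx')

# When you're done, shut down the Selenium session to free the browser.
rsc$close()
rs$server$stop()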