Scraping SeeThroughNY Data using R

Here is the R code I am using to scrape SeeThroughNY.net to download state and local government employment wage data.

library(tidyverse)
library(RSelenium)
library(netstat)
library(rvest)

# Load the Selenium browser. This code should automatically open a Firefox window
# from R, downloading the latest GeckoDriver if necessary.
#
# If this doesn't work, delete the LICENSE.chromedriver file, which
# sometimes causes RSelenium to fail to load:
## find ~/.local/share/ -name LICENSE.chromedriver | xargs -r rm

rs <- rsDriver(
  remoteServerAddr = "localhost",
  port = free_port(random = TRUE),
  browser = "firefox",
  verbose = FALSE
)

rsc <- rs$client
rsc$navigate("https://seethroughny.net/payrolls")

# STOP !!!
# While you could automate this step, you should now manually choose your
# search items in the SeeThroughNY browser window that has opened. Then
# execute the following lines.

# Next, load all of the results. We limit this to 30 attempts, which will
# pull most reasonably sized queries. Too big a query and you could crash
# your browser due to excessive memory use.

for (i in seq(1, 30)) {
  rsc$findElement(using = 'css', '#data_loader')$clickElement()

  # The "load more" button is hidden once every result has been loaded
  if (rsc$findElement(using = 'css', '#data_loader')$getElementAttribute('style')[[1]] == 'display: none;')
    break

  Sys.sleep(2)
}

# Next, pull and clean the HTML table that contains the data
rsc$getPageSource() %>%
  unlist() %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  janitor::clean_names() -> employees

# Some of the data is located in the (+) tab, but this is just a table
# field located in every other row, which we split into the appropriate
# field values

employees %>%
  filter(row_number() %% 2 == 0) %>%
  select(name) %>%
  separate(name, sep = '\n',
           into = c(NA, 'subagency', NA, NA, NA, 'title', NA, NA, NA, 'rateofpay',
                    NA, NA, NA, 'payyear', NA, NA, NA, 'paybasis', NA, NA, NA, 'branch')) %>%
  cbind(employees %>% filter(row_number() %% 2 != 0), .) %>%
  mutate(across(everything(), str_trim),
         total_pay = parse_number(total_pay)) %>%
  select(-x, -x_2, -subagency_type) -> employees

### Then you can pipe this data into ggplot or any other program,
### or export it to a CSV or Excel file.
employees %>% write_csv('/tmp/employee_data.csv')
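The last comment mentions piping the data into ggplot or exporting it to Excel, but only the CSV export is shown. As a minimal sketch, assuming the cleaned employees data frame from above and that the writexl package is installed (it is not used elsewhere in this post), an Excel export and a quick look at the total_pay column might look like this:

library(writexl)   # assumed; any Excel writer such as openxlsx would work the same way
library(ggplot2)   # already attached via tidyverse above

# Write the same cleaned table to an Excel workbook
employees %>% write_xlsx('/tmp/employee_data.xlsx')

# Quick histogram of total pay for whatever query you ran
employees %>%
  ggplot(aes(x = total_pay)) +
  geom_histogram(bins = 50) +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = 'Total pay', y = 'Number of employees')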