Scraping SeeThroughNY Data using R

Here is the R code I am using to scrape and download state and local government employment wage data from SeeThroughNY.


# Load the packages used below. free_port() comes from netstat;
# read_html() and html_table() come from rvest.
library(RSelenium)
library(netstat)
library(tidyverse)
library(rvest)

# Load the Selenium browser. This code should automatically open a Firefox
# window from R, downloading the latest GeckoDriver if necessary.
# If this doesn't work, delete the LICENSE.chromedriver file, which
# sometimes prevents RSelenium from loading.
## find ~/.local/share/ -name LICENSE.chromedriver | xargs -r rm

rs <- rsDriver(
  remoteServerAddr = "localhost",
  port = free_port(random = TRUE),
  browser = "firefox",
  verbose = FALSE
)

rsc <- rs$client

# STOP !!!
# While you could automate this step, you should now manually choose your
# search items in the SeeThroughNY browser window that has opened. Then
# execute the following lines.

# Next you want to load all of the results. We limit it to 30 attempts,
# which will pull most reasonably sized queries. A query that is too big
# could crash your browser because of the memory it needs.

for (i in seq(1,30)) {
  rsc$findElement(using='css', '#data_loader')$clickElement()
  # Stop early once the loader element is hidden, meaning nothing is left to load
  if (rsc$findElement(using='css', '#data_loader')$getElementAttribute('style')[[1]] == 'display: none;') break
}

# Next you need to pull and clean the HTML table that
# contains the data
rsc$getPageSource() %>%
  unlist() %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>% 
  janitor::clean_names() -> employees

# Some of the data is located in the (+) tab, but this is just a
# table field located on every other row, which we split up into the
# appropriate field values

employees %>%
  filter(row_number() %% 2 == 0) %>%
  select(name) %>%
  separate(name, sep='\n', into=c(NA,'subagency',NA,NA,NA,'title',NA,NA,NA,'rateofpay',NA,NA,NA,'payyear',NA,NA,NA,'paybasis',NA,NA,NA,'branch') ) %>%
  cbind(employees %>% filter(row_number() %% 2 != 0), .) %>%
  mutate(across(everything(), str_trim),
         total_pay = parse_number(total_pay)) %>%
  select(-x, -x_2, -subagency_type) -> employees
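
The every-other-row split above can be hard to picture without the live page, so here is a minimal sketch of the same idea on made-up data. The `toy` table, its values, and the shortened `into` vector are all hypothetical; the real SeeThroughNY table has more fields and extra blank lines between them.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical stand-in for the scraped table: odd rows are employees,
# even rows hold the "(+)" details as newline-separated text in `name`.
toy <- tibble(
  name = c("Doe, Jane", "Title\nClerk\nPay Year\n2023",
           "Roe, Rich", "Title\nAnalyst\nPay Year\n2022"),
  total_pay = c("$50,000", NA, "$72,500", NA)
)

# Even rows: split the detail text into columns, dropping the labels (NA).
details <- toy %>%
  filter(row_number() %% 2 == 0) %>%
  select(name) %>%
  separate(name, sep = "\n", into = c(NA, "title", NA, "payyear"))

# Odd rows: the employee records, glued side by side with their details.
toy_clean <- toy %>%
  filter(row_number() %% 2 != 0) %>%
  bind_cols(details) %>%
  mutate(total_pay = parse_number(total_pay))

print(toy_clean)
```

The split only works because each detail row directly follows its employee row, so it is worth checking that both halves have the same number of rows before binding them.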

### Then you can pipe this data into ggplot or any other program,
### or export it to a CSV or Excel file.
employees %>% write_csv('/tmp/employee_data.csv')
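
As one possible next step, the cleaned table could be summarized before plotting. This is only a sketch on hypothetical rows (`employees_demo` and its values are made up); it assumes the `title` and `total_pay` columns produced by the code above.

```r
library(dplyr)

# Hypothetical rows shaped like the cleaned `employees` table.
employees_demo <- tibble(
  title = c("Clerk", "Clerk", "Analyst"),
  total_pay = c(50000, 54000, 72500)
)

# Head count and median pay per title, highest-paid titles first.
pay_by_title <- employees_demo %>%
  group_by(title) %>%
  summarise(n = n(), median_pay = median(total_pay), .groups = "drop") %>%
  arrange(desc(median_pay))

print(pay_by_title)
```

Swapping `employees_demo` for the real `employees` table should give the same kind of summary, which can then be piped into ggplot.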

It's wonderful…

To be walking down the street and to run into random people who say, you’ve lost a lot of weight. Literal strangers but also long-term acquaintances who are noticing.

That said, what really feels wonderful is how much better I feel these days and how I’ve learned to eat much healthier and more diverse foods, things that are interesting but not overcooked and loaded with fat, salt and sugar.

There is always more to do but I think I am making permanent changes in my life. But probably the hardest thing remains friends, colleagues and family – when you find a good way to live – others want to pull you back as they don’t understand your new way of living.

I’m reminded of these lyrics which ring true with doing so much of the right thing in your life.

My buddies tell me that I should have waited
They say I’m missing a whole world of fun
But I am happy and I sing with pride
I like the christian life

I won’t lose a friend by heeding God’s call
For what is a friend who’d want me to fall
Others find pleasure in things I despise
I like the christian life

My buddies shun me since I turned to Jesus
But I am happy though it burdens my soul
And I’ll try to lead them to walk in the light
I like the christian life