May 24, 2025 | Andy Arthur.org

I like to describe myself as a data scientist, at least on the blog. I think it’s an accurate term for what I do professionally and as a hobbyist – I put data together, tease insights out of it, and use it to create outputs. I link names and addresses together from various government records, clean addresses and data, do spatial calculations, and render the results as Excel files, CSV files, and database updates.

A data scientist is not a programmer or a database administrator. He or she doesn’t fix computers. If anything, I break them sometimes by pushing them a bit too hard. Instead, I work to get insights out of data, taking one form of data and transforming it into another. You might say a big portion of my work – outside of data cleaning, both manual and automated – is extract, transform and load. Often I’ll pull data out of the Db2 database, work on it and join it in R, and then upload it using a different program that was custom written for my needs.
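To make the extract, transform and load idea a bit more concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the actual workflow: an in-memory SQLite database stands in for the real database, the `contacts` and `parcels` tables and their rows are made up, `normalize_address` is a hypothetical helper with a deliberately tiny rule set, and the "load" step just produces CSV text rather than calling a custom upload program.

```python
import csv
import io
import sqlite3


def normalize_address(raw: str) -> str:
    """Upper-case, drop periods, collapse whitespace, abbreviate a few suffixes."""
    # Hypothetical, deliberately tiny rule set; real address cleaning needs far more.
    suffixes = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}
    words = raw.upper().replace(".", "").split()
    return " ".join(suffixes.get(word, word) for word in words)


def run_etl() -> str:
    # Extract: an in-memory SQLite database stands in for the real source.
    # Table names and rows are invented for this example.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contacts (name TEXT, address TEXT)")
    conn.execute("CREATE TABLE parcels (address TEXT, town TEXT)")
    conn.executemany(
        "INSERT INTO contacts VALUES (?, ?)",
        [("Pat Doe", "12 Main Street"), ("Sam Roe", "  4 elm   avenue ")],
    )
    conn.executemany(
        "INSERT INTO parcels VALUES (?, ?)",
        [("12 MAIN ST", "Albany"), ("4 ELM AVE", "Delmar")],
    )
    contacts = conn.execute("SELECT name, address FROM contacts").fetchall()
    parcels = conn.execute("SELECT address, town FROM parcels").fetchall()
    conn.close()

    # Transform: clean the addresses, then join the two record sets on them.
    town_by_address = {normalize_address(addr): town for addr, town in parcels}
    joined = []
    for name, addr in contacts:
        clean = normalize_address(addr)
        joined.append((name, clean, town_by_address.get(clean, "")))

    # Load: write the joined rows out as CSV text.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "address", "town"])
    writer.writerows(joined)
    return out.getvalue()


if __name__ == "__main__":
    print(run_etl())
```

The point of the sketch is the shape of the work: the join only succeeds because both sides are cleaned to the same form first, which is why so much of the job ends up being data cleaning.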

Sometimes I wish I was a computer programmer by training – most of what I know was learned by reading and practical use, outside of a few classes I took twenty years ago in college on Data Structures and Statistics. But I don’t really need it, in the sense that I don’t write lengthy C/C++ programs, nor do I worry about user-facing interfaces. Instead, I just extract value from data using common tools like SQL, R and some Bash and Python scripts. While I use some AWK, I don’t use it nearly as much as my predecessor did. AWK is good for simple things, but it doesn’t hold a candle to modern Python and R.

Data science is an interesting field, and one that is surprisingly accessible thanks to powerful and relatively easy-to-use tools like R and Python. And it’s actually a lot of fun, as you’re not getting into the weeds of computer programming, memory allocation and the like. A lot of the work is relatively simple but clever scripts, teasing out value from what’s out there but may not be obvious until you join the data together.

It was only in 2021 that I really got interested in Python, after a friend suggested I give it a second look for doing data processing for GIS. I had also gotten tired of the sometimes clumsy and slow processing in QGIS, and while I had used some Python to automate things in QGIS, I became quite interested in pandas and Python for working with data. I got every book I could get my hands on about writing Python code, with a particular focus on data science. Later that year, on Labor Day actually, I stumbled upon the R programming language, tidyverse and ggplot – and with its strong graphics capacity and ability to quickly process geospatial data, I was hooked.

Since then I’ve been using RStudio every day. That’s not to say that I don’t occasionally use Python or other languages, or mapping tools like QGIS. But R has such a rich universe of data manipulation tools, and it is so powerful and quick for processing data, manipulating spatial data, and querying and exporting Census data. RStudio is the tool I use the most at work, for the blog and for many other purposes. And it was all something I taught myself, at first just by watching a few YouTube videos while lying in a hammock, drinking a beer at the Perkins Clearing Conservation Easement in the Adirondacks.

Maybe it was just dumb luck that the Data Services position opened up when the former director retired and I was a good fit for it. But I really love being able to clean, process and manipulate data every day using powerful tools, and to generate new insights that help move government forward.
