These chapters are auto-generated

0:00 Intro
0:37 What is Apache Arrow? A multi-language toolbox for accelerated data interchange and in-memory processing
1:21 Accelerating data interchange
2:39 Efficient in-memory processing
4:07 Where does R fit (in Arrow)?
5:24 Where does Arrow fit (in R)?
7:55 dplyr connects to an Arrow backend
9:26 Get data
9:41 The NYC taxi data: the data set
10:30 Downloading the data
12:29 Opening a dataset
14:13 Using dplyr verbs: Select and filter
16:52 Find airports
17:05 A secondary table
18:23 Extract airport zones
18:42 Wrangle data
18:47 Let's find the airport pickups...
19:47 We need to use database joins
21:47 Wait... why does the join throw an error?
22:36 Solution: use schemas
24:33 Summarize
24:43 Count rides by dropoff zone
25:43 Data visualization
26:34 Making connections with data! • Apache Arrow is cool

Doing More with Data: An Introduction to Arrow for R Users
Jun 23, 2022
Speaker: Danielle Navarro, Developer Advocate at Voltron Data

As datasets become larger and more complex, the boundaries between data engineering and data science are becoming blurred. Data analysis pipelines with larger-than-memory data are becoming commonplace, creating a gap that needs to be bridged: between engineering tools designed to work with very large datasets on the one hand, and data science tools that provide the analysis capabilities used in data workflows on the other. One way to build this bridge is with Apache Arrow, a multi-language toolbox for working with larger-than-memory tabular data. Arrow is designed to improve performance and efficiency, and places emphasis on standardization and interoperability among workflow components, programming languages, and systems.

This talk gives an introduction to the Arrow package in R, a mature interface to Apache Arrow that provides an appealing solution for data scientists working with large data in R. It introduces the core concepts behind Apache Arrow and the Arrow package in R, provides a walkthrough of a sample data analysis using a large tabular data set (containing about 1.7 billion rows), and highlights possible pain points for an R user new to the Arrow ecosystem.

Voltron Data