The R User Conference 2016

June 27 - June 30 2016
Stanford University, Stanford, California

Ninja Moves with data.table; Learn By Doing in a Cookbook Style Workshop

Matt Dowle (data.table creator) and Arun Srinivasan (data.table co-author)

Post-tutorial notes

The materials used in the tutorial are not yet available since Matt's session was a live demo with question and answer sessions. Matt says he is preparing a markdown document that he will send along as soon as he can.

Tutorial Description

data.table is known for its speed on large data in RAM (e.g. 100GB), but it also has a consistent and flexible syntax for more advanced data manipulation tasks on small data. First released to CRAN in 2006, it continues to grow in popularity: 180 CRAN and Bioconductor packages now import or depend on data.table. Its StackOverflow tag has attracted 4,000 questions from users in many fields, making it one of the three most asked-about R packages. It is the 7th most starred R package on GitHub.

This three-hour tutorial will guide complete beginners from basic queries through to advanced topics, via examples you will run on your laptop. There is a short learning curve to data.table, but once it clicks, it sticks.

Tutorial Outline

  • fread() - basic to advanced usage and its convenience features
  • General form of data.table queries:
    • for those familiar with data.frame: DT[i, j, by]
    • for those familiar with SQL: DT[where, select|update|do, group by]
  • The speed and syntax pros and cons of setting a primary key: setkey()
  • Secondary and automatic indexes: DT[==,] and DT[,on=]
  • Joining forwards, backwards and limiting staleness: roll=TRUE|+n|-n
  • Outer, inner and not joins; joining is the same as subsetting
  • The convenience symbols: .SD, .N and .I
  • Using base R and other R packages from data.table queries
  • The ease and freedom of being able to use for() loops again! := and set()
  • The power of by=.EACHI
  • Chaining queries: DT[...][...]
  • Why it’s all inside DT[...]
  • Using data.table programmatically; e.g. doing the same task by many different groupings
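
To give a flavour of the outline above, here is a minimal sketch that touches several of the listed topics in order. It is not taken from the tutorial materials: the toy data and the names DT, lookup and quotes are hypothetical, invented purely for illustration, and it assumes a recent CRAN release of data.table.

```r
library(data.table)

# Hypothetical toy data (not from the tutorial); fread() treats a string
# containing newlines as the data itself rather than a file name.
DT <- fread("id,day,value\na,1,10\na,3,20\nb,1,30\nb,4,40\nc,2,50")

# General form DT[i, j, by]: filter rows, compute, group.
totals <- DT[day >= 2L, .(total = sum(value)), by = id]

# Convenience symbols: .N is the row count per group, .SD the group's columns.
counts <- DT[, .N, by = id]
means  <- DT[, lapply(.SD, mean), by = id, .SDcols = "value"]

# := adds a column by reference; set() updates it cheaply inside a for() loop.
DT[, running := 0L]
for (r in seq_len(nrow(DT))) {
  set(DT, i = r, j = "running", value = sum(DT$value[seq_len(r)]))
}

# Subsetting with == (eligible for automatic indexing) and ad hoc joins via on=.
only_b <- DT[id == "b"]
lookup <- data.table(id = c("a", "b"), label = c("alpha", "beta"))
outer  <- lookup[DT, on = "id"]                  # keeps every row of DT; c gets NA
inner  <- DT[lookup, on = "id", nomatch = NULL]  # only ids present in both
anti   <- DT[!lookup, on = "id"]                 # not join: ids absent from lookup

# by = .EACHI: evaluate j once for each row of the table being joined in.
per_lookup <- DT[lookup, on = "id", .(total = sum(value)), by = .EACHI]

# Rolling joins: carry the last observation forward, optionally limiting
# staleness (here to at most 3 days).
quotes  <- data.table(id = "a", day = c(2L, 10L))
rolled  <- DT[quotes, on = c("id", "day"), roll = TRUE]
limited <- DT[quotes, on = c("id", "day"), roll = 3]

# Chaining queries: aggregate, order, then take the first row.
top <- DT[, .(total = sum(value)), by = id][order(-total)][1L]

# Primary key: setkey() sorts by reference and enables keyed lookups.
setkey(DT, id, day)
keyed <- DT[.("b")]
```

Each step stores its result in a separate object so it can be printed and inspected on its own. In the rolling join, the quote on day 10 is matched under roll = TRUE but returns NA under roll = 3, because the most recent observation for that id is 7 days old.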

Background Knowledge

Familiarity with base R and/or SQL is an advantage but not required.


Participants should bring a laptop with R and the latest CRAN release of data.table installed.



