ADS Drinks and Data: CIDR meets ADS – Amsterdam Data Science

This meetup is themed around the Conference on Innovative Data Systems Research (CIDR), a systems-oriented conference that emphasizes the systems architecture perspective. Its mission complements that of mainstream database conferences such as SIGMOD and VLDB.

Taking advantage of the presence of prominent data systems researchers who are visiting CIDR, ADS and CIDR are organizing a free meetup at the conference venue, closing the last day of the conference. The list of speakers includes world-class researchers in this field. This meetup is open for anyone to join.

Programme

14:00 Walk-in

14:30 Introduction

14:35 Talk #1 Scalable Input Data Processing for Resource-Efficient ML

15:20 Talk #2 Natural Language Meets Query Processing

16:05 Break

16:15 Talk #3 The end of “Big Data” and what the duck you can do about it

17:00 Drinks & Networking

17:30 End

Chair
Peter Boncz, professor of Large-Scale Analytical Data Management at Vrije Universiteit Amsterdam and senior researcher in the Database Architectures Group at CWI.

***

Talk #1 Ana Klimovic (ETH Zurich)

Title of the keynote: Scalable Input Data Processing for Resource-Efficient ML

Abstract: Processing input data plays a vital role in ML training, impacting accuracy, throughput, and cost. This talk will discuss the characteristics of ML input pipelines, which have motivated the design of a new system architecture in which we disaggregate input data processing from model training. I will present Cachew, a fully managed service for ML data processing, built on top of TensorFlow’s data loading framework, tf.data. Cachew’s autoscaling and autocaching policies reduce end-to-end training time by up to 4.1x and total cost by up to 3.8x compared to scaling data processing resources with a traditional Kubernetes Horizontal Pod Autoscaler.

Bio of Ana Klimovic: Ana Klimovic is an Assistant Professor in the Systems Group of the Computer Science Department at ETH Zurich. Her research interests span operating systems, computer architecture, and their intersection with machine learning. Ana’s work focuses on computer system design for large-scale applications such as cloud computing services, data analytics, and machine learning. Before joining ETH in August 2020, Ana was a Research Scientist at Google Brain and completed her Ph.D. in Electrical Engineering at Stanford University.

***

Talk #2 Matei Zaharia (Databricks, Stanford University)

Title of the keynote: Natural Language Meets Query Processing

Bio of Matei Zaharia: Matei Zaharia is an Associate Professor of Computer Science at Stanford and a cofounder and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley, and has worked on other widely used open source data analytics and machine learning software, including MLflow, Delta Lake and Delta Sharing. His current research covers database systems, natural language processing (NLP), and information retrieval. Matei’s research work was recognized through the 2014 ACM Doctoral Dissertation Award and the US PECASE award.

***

Talk #3 Jordan Tigani (MotherDuck)

Title of the keynote: The end of “Big Data” and what the duck you can do about it

Abstract: The exponential growth of data sizes has been used to justify dramatically different techniques for handling data and novel architectures. However, 15 years into the “Big Data” craze, the vast majority of people don’t have giant data sets, and the few that do tend to not process more than a modest amount of that data at a time. Moreover, advances in hardware mean that the threshold for what you’d consider “Big Data” in the first place has been increasing steadily. This means that over time, fewer workloads will need complex distributed architectures to handle them.

This talk will go through some of the assumptions in the modern data ecosystem and how these may not hold true in a world where data size is not a factor for most users. We will explore ways in which data can be a liability, and argue that organizations should consider constraining the amount of data collected and retained. Finally, the talk will discuss what kinds of things become possible with moderate data sizes, and how to keep them that way.

Bio of Jordan Tigani: Jordan spent more than a decade building systems to handle big data before realizing that they were solving the wrong problem. He was one of the founding engineers on Google BigQuery, wrote two books about it, and held various leadership roles on the team. Jordan was chief product officer at SingleStore before starting MotherDuck to build serverless DuckDB and spread duck puns around the world. Previously, Jordan worked on the Windows Kernel team on Driver Verifier and in Microsoft Research on runtime software analysis. He has an undergraduate degree from Harvard and a Master’s from the University of Washington.

Registration is free, but you must register in advance through our Meetup page. The event will be in English.

***

We will be taking photos of the event and posting them on the ADS website and social media channels. If you have any questions or concerns, please send an email to info[at]amsterdamdatascience.nl