Research Theme

The goal of our research is to build high-performance, scalable, and easy-to-use data systems.

To achieve this goal, we adopt and develop novel machine learning and mathematical optimization techniques.


Embedded Data System for Shared Data

Subprojects: Airphant, AirKV, AirIndex

The AirDB project aims to enable extremely elastic, high-performance data systems that manage shared data (e.g., data stored in cloud storage such as AWS S3) as embedded systems, without relying on servers (i.e., more like SQLite than PostgreSQL). If this project is successful, its users will be able to manage significantly larger volumes of data in an extremely cost-efficient way. Unfortunately, naive approaches can compromise performance or data integrity. This idea has also been studied by Hyder and Meld, but no known work has addressed the fundamental problem that the amount of data to read can grow linearly with the number of updates, even if those updates involve the same key.
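The read-amplification problem can be illustrated with a toy model. The sketch below assumes one naive design, an append-only log of update records with no compaction or index. The class name AppendOnlyLog and its methods are hypothetical and are not part of AirDB, Hyder, or Meld.

```python
from typing import List, Optional, Tuple


class AppendOnlyLog:
    """Toy stand-in for a log of update records kept in shared storage.

    Hypothetical illustration only: writers append records, and a reader
    has no index, so it must replay the whole log to find the latest value.
    """

    def __init__(self) -> None:
        self.records: List[Tuple[str, str]] = []

    def put(self, key: str, value: str) -> None:
        # Every update is appended, even when the key already exists.
        self.records.append((key, value))

    def get(self, key: str) -> Tuple[Optional[str], int]:
        """Return (latest value for key, number of records read)."""
        value: Optional[str] = None
        records_read = 0
        for k, v in self.records:
            records_read += 1
            if k == key:
                value = v  # later records overwrite earlier ones
        return value, records_read


log = AppendOnlyLog()
for i in range(1000):
    log.put("user:1", f"v{i}")  # 1000 updates to the same key

latest, cost = log.get("user:1")
print(latest, cost)
```

In this model the store holds one logical key, yet a single lookup reads as many records as there were updates.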

To overcome this challenge, we are developing (1) new mechanisms for managing shared data more effectively (AirKV project) and (2) intelligent indexing techniques for significantly faster lookup performance (AirIndex project). Our early work (Airphant, ICDE'22) proposes a statistical indexing technique for document data stored in cloud storage.


Deep Online Aggregation Engine

Subprojects: DeepOLA, DistOLA

The AirEgg project aims to reshape large-scale data aggregation with statistical methods by enabling progressive/online aggregation (OLA) for arbitrary analytical queries. OLA has long been attractive for its extremely user-friendly mode of computation: answers are continuously updated as more data is processed, in contrast to the "wait until everything is processed" approach. Unfortunately, providing reliable, accurate answers while continuously updating them has been limited to extremely simple queries (e.g., computing a simple count or average over a few joined data sources); thus, OLA has hardly been adopted by a broad audience (with VerdictDB being a rare exception).
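For the simple case mentioned above (a single average), online aggregation can be sketched as a running mean with an error bound that is reported after every row. This is a minimal illustration, not the DeepOLA method: it assumes rows arrive in random order, uses Welford's update for the variance, and uses a normal-approximation 95% interval. The function name online_mean is hypothetical.

```python
import math
from typing import Iterable, Iterator, Tuple


def online_mean(
    values: Iterable[float], z: float = 1.96
) -> Iterator[Tuple[int, float, float]]:
    """Yield (rows_seen, running_mean, half_width) after every row.

    half_width is z times the estimated standard error of the mean; it is
    only meaningful if rows arrive in random order (an assumption here).
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            sample_variance = m2 / (n - 1)
            stderr = math.sqrt(sample_variance / n)
        else:
            stderr = math.inf  # no spread estimate from a single row
        yield n, mean, z * stderr


for rows_seen, estimate, half_width in online_mean([4.0, 6.0, 5.0, 7.0, 3.0]):
    print(rows_seen, estimate, half_width)
```

A caller can stop consuming the generator as soon as the interval is tight enough, which is the user-facing benefit of OLA. Extending such guarantees to joins and nested queries is the hard part that the paragraph above describes.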

AirEgg overcomes this limitation with a series of formal investigations (1) to enable online aggregation for arbitrarily deep queries (thus, DeepOLA), (2) to offer parallel/distributed computing opportunities for DeepOLA, and (3) to leverage new hardware (e.g., GPU) for extremely fast computation.


Stateful Data Science Platform

Data science platforms like Jupyter (and many related commercial platforms) have been reshaping the way we analyze data by enabling users to connect to remote servers via the web and run large-scale analytics that leverage the power of large server machines. The existing platforms share a common limitation: if the server has to restart (to save cost, to migrate tasks to different machines, or to recover from system errors), all the data produced by running "cells" is erased. We started this project to overcome this limitation by automatically capturing all intermediate, user-produced data and saving it frequently to persistent storage (like checkpointing in data systems). Unfortunately, naive approaches can be extremely inefficient, especially when users are working with a large amount of data.

AirNote aims to tackle this problem by intelligently discovering the most efficient way to save all the data (stored in variables) by exploiting their lineage, captured in a graph. That is, if a variable A can be produced from another variable B by applying a simple data transformation operation, the system can persist only the variable B and dynamically recompute the variable A as part of the restoration process. Our project generalizes this idea to an arbitrarily complex graph.
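The A-from-B example can be sketched as follows. This is a minimal illustration under stated assumptions, not AirNote's implementation: the lineage is a hypothetical mapping from each derived variable to a single parent and a transform, every derived variable is always recomputed (a real system would weigh recomputation cost against storage cost), and the lineage itself is assumed to be available again at restore time. The names checkpoint, restore, and Lineage are made up for this example.

```python
import pickle
from typing import Any, Callable, Dict, Tuple

# Hypothetical lineage format: derived variable -> (parent variable, transform).
Lineage = Dict[str, Tuple[str, Callable[[Any], Any]]]


def checkpoint(variables: Dict[str, Any], lineage: Lineage) -> bytes:
    """Serialize only the variables that cannot be re-derived from another one."""
    to_persist = {
        name: value for name, value in variables.items() if name not in lineage
    }
    return pickle.dumps(to_persist)


def restore(blob: bytes, lineage: Lineage) -> Dict[str, Any]:
    """Load persisted variables, then recompute derived ones from their parents."""
    variables: Dict[str, Any] = pickle.loads(blob)
    pending = dict(lineage)
    while pending:
        progressed = False
        for name, (parent, transform) in list(pending.items()):
            if parent in variables:  # parent is ready, so the child can be rebuilt
                variables[name] = transform(variables[parent])
                del pending[name]
                progressed = True
        if not progressed:
            raise ValueError("lineage refers to a variable that was not persisted")
    return variables


def double_all(xs):
    return [2 * x for x in xs]


B = list(range(5))
A = double_all(B)
lineage = {"A": ("B", double_all)}

blob = checkpoint({"A": A, "B": B}, lineage)  # only B is written
restored = restore(blob, lineage)             # A is recomputed from B
print(restored["A"] == A)
```

The loop in restore also handles chains (for example, C derived from A derived from B), but it still models each variable as having one parent; general lineage graphs need multi-parent dependencies and a cost-based choice of what to persist.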

Past Projects

  • VerdictDB: Universal Approximate Query Processing (AQP) system that offers AQP on top of existing SQL engines.
  • Database Learning: Data system that becomes smarter over time.
  • VAS: Visualization-Aware Sampling.
Yongjoo Park: