Research Theme
The goal of our research is to build high-performance, scalable, and easy-to-use data systems.
To achieve this goal, we adopt and develop novel machine learning and mathematical optimization techniques.
CARE
System for Scalable Causal Inference
Advances in machine learning, coupled with advances in scalable data processing, have resulted in highly accurate predictions of quantities of interest. Yet, despite the advances in machine learning and data systems, many data practitioners cannot easily answer causal inference questions in observational data settings. This project will build a novel computer system to allow businesses, academics, and the public to perform effective and intuitive causal exploration. The main novelty of this project will be an end-to-end, causal data exploration system that allows users to ask direct, causal questions and visually experience cause-effect relationships, while the system automatically optimizes for real-time interactions. Using the system, an office employee can ask "What would have been the effect on sales last year had we increased advertising expenditure targeted at women?"; an academic can ask "Did improved educational attainment cause a wage increase?"; a member of the public can ask "Did lack of exercise cause my gain in weight?"
This project will develop a scalable, CAusal-RElational (CARE) data system for end-to-end causal data exploration. CARE will let users experience causality by allowing explicit, real-time interventions with causal data modeling, do-calculus querying, and intervention-centric visualization. CARE will accelerate this broad range of tasks simultaneously by optimizing the underlying data layout and using emerging hardware (e.g., GPU, TPU) in consideration of user-specific data access and computational patterns. This project will address three fundamental research challenges: (1) data modeling - designing a causality-driven data model for effortless causal modeling and systems optimization, (2) efficient query processing - rapidly estimating accurate causal treatment effects for large datasets, (3) declarative querying and interactive visualization - assisting users in easily expressing their causal queries and intuitively understanding causality. These research thrusts will be evaluated by measuring system performance and asking human evaluators about their experiences. This research effort will enable interactive, low-effort causal inference, making this crucial analysis tool accessible to all.
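To make the do-calculus querying above concrete, here is a minimal Python sketch, not CARE's implementation, of one standard estimator such a system could run under the hood: backdoor adjustment on synthetic observational data. The data-generating process and all names are illustrative; a confounder biases the naive comparison, while the adjusted estimate recovers the true treatment effect.

```python
import random

random.seed(0)

# Hypothetical observational data: a binary confounder z (e.g., region),
# a binary treatment t (e.g., targeted advertising), and an outcome y (sales).
# z raises both the odds of treatment and the outcome, so a naive
# comparison of y across t is biased.
def make_row():
    z = random.random() < 0.5
    t = random.random() < (0.8 if z else 0.2)      # confounder raises treatment odds
    y = 2.0 * t + 3.0 * z + random.gauss(0, 0.1)   # true treatment effect is 2.0
    return z, t, y

data = [make_row() for _ in range(20000)]

def mean(xs):
    return sum(xs) / len(xs)

# Backdoor adjustment: ATE = sum_z P(z) * (E[y | t=1, z] - E[y | t=0, z]).
def ate_adjusted(rows):
    total = 0.0
    for z in (False, True):
        stratum = [r for r in rows if r[0] == z]
        treated = [y for _, t, y in stratum if t]
        control = [y for _, t, y in stratum if not t]
        total += (len(stratum) / len(rows)) * (mean(treated) - mean(control))
    return total

# Naive difference in means is confounded upward; the adjusted
# estimate is close to the true effect of 2.0.
naive = mean([y for _, t, y in data if t]) - mean([y for _, t, y in data if not t])
adjusted = ate_adjusted(data)
print(round(naive, 2), round(adjusted, 2))
```

A system like CARE would compile a declarative causal query down to an estimator of this kind automatically, after selecting a valid adjustment set from the causal data model.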
Kishu
Checkpointable Data Science
Data science platforms like Jupyter (and many related commercial platforms) have been reshaping the way we analyze data by letting users connect to remote servers via the web and run large-scale analytics that leverage the power of large server machines. These platforms share a common limitation: if the server has to restart (to save cost, to migrate tasks to different machines, or to recover from system errors), all the data produced by running "cells" is lost. We started this project to overcome this limitation by automatically capturing all intermediate, user-produced data and frequently saving it to persistent storage (akin to checkpointing in data systems). Unfortunately, naive approaches can be extremely inefficient, especially when users work with large amounts of data.
Kishu tackles this problem by intelligently discovering the optimal way to save all the data (stored in variables), exploiting their lineage captured in a graph. That is, if a variable A can be produced from another variable B by applying a simple data transformation, the system can persist only variable B and recompute variable A dynamically as part of the restoration process. Our project generalizes this idea to arbitrarily complex graphs.
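The lineage idea can be sketched as follows; this is an illustrative toy, not Kishu's actual design. Only "source" variables with no recorded lineage are persisted at checkpoint time, and derived variables are replayed from their recorded transforms during restoration.

```python
# Lineage graph: derived variable name -> (input variable names, transform fn).
lineage = {}

def derive(name, inputs, fn, env):
    """Compute a derived variable and record its lineage."""
    env[name] = fn(*[env[i] for i in inputs])
    lineage[name] = (inputs, fn)

def checkpoint(env):
    """Persist only variables with no recorded lineage (the sources)."""
    return {k: v for k, v in env.items() if k not in lineage}

def restore(saved):
    """Rebuild the full environment by replaying transforms in creation order."""
    env = dict(saved)
    for name, (inputs, fn) in lineage.items():  # dict preserves insertion order
        env[name] = fn(*[env[i] for i in inputs])
    return env

env = {"B": list(range(10))}                           # source variable: persisted
derive("A", ["B"], lambda b: [x * 2 for x in b], env)  # derived: recomputed

saved = checkpoint(env)       # only B is written to storage
restored = restore(saved)     # A is rebuilt from B on restart
```

The interesting systems question, which this toy sidesteps, is deciding *which* variables to persist when recomputation is expensive for some edges of the graph and storage is expensive for others.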
AirDB
Embedded Databases that can Share Data
Subprojects: Airphant, AirKV, AirIndex
The AirDB project aims to enable extremely elastic, high-performance data systems that manage shared data (e.g., stored in cloud storage like AWS S3) as embedded systems, without relying on servers (i.e., more like SQLite than PostgreSQL). If this project is successful, its users will be able to manage significantly larger volumes of data in an extremely cost-efficient way. Unfortunately, naive approaches can compromise performance or data integrity. This idea has also been studied by Hyder and Meld, but no known work has addressed the fundamental problem that the amount of data to read can grow linearly with the number of updates, even when those updates involve the same key.
To overcome this challenge, we are developing (1) new mechanisms for managing shared data more effectively (the AirKV project) and (2) intelligent indexing techniques for significantly faster lookup performance (the AirIndex project). Our early work, Airphant (ICDE '22), proposes a statistical indexing technique for document data stored in cloud storage.
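The read-amplification problem described above can be illustrated with a toy append-only log (hypothetical names; not AirKV's actual mechanism): without compaction, reading a key requires scanning every update record, even when all updates target the same key.

```python
log = []  # append-only update log, e.g., a sequence of objects in cloud storage

def put(key, value):
    log.append((key, value))

def naive_get(key):
    """Scan the whole log; cost grows linearly with the number of updates."""
    records_read = 0
    result = None
    for k, v in log:
        records_read += 1
        if k == key:
            result = v          # later records win: last write is current
    return result, records_read

def compact():
    """Keep only the latest value per key, bounding future read cost."""
    latest = {}
    for k, v in log:
        latest[k] = v
    log[:] = list(latest.items())

for i in range(1000):
    put("counter", i)           # 1000 updates, all to the same key

_, cost_before = naive_get("counter")   # must read all 1000 records
compact()
_, cost_after = naive_get("counter")    # reads a single record
print(cost_before, cost_after)
```

In an embedded, serverless setting, compaction itself must be coordinated safely among independent clients sharing the storage, which is part of what makes the problem hard.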
DeepOLA
Interactive Analytics for Real-World Data Science
The DeepOLA project aims to reshape large-scale data aggregation with statistical methods by enabling progressive/online aggregation (OLA) for arbitrary analytical queries. OLA has long been attractive for its user-friendly mode of computation (continuously updating answers as more data is processed), compared to the "wait until everything is processed" approach. Unfortunately, providing reliable, accurate answers while continuously updating them has been limited to extremely simple queries (e.g., computing a simple count or average over a few joined data sources); thus, OLA has hardly been adopted by a broad audience (with VerdictDB being a rare exception).
DeepOLA overcomes this limitation through a series of formal investigations: (1) enabling online aggregation for arbitrarily deep queries (hence, DeepOLA), (2) offering parallel/distributed computing opportunities for DeepOLA, and (3) leveraging new hardware (e.g., GPUs) for extremely fast computation.
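For intuition, here is a minimal sketch of the OLA mode of computation for the simplest case, a running average with a CLT-based confidence interval refined after every chunk; the hard part, and DeepOLA's focus, is extending such guarantees to arbitrarily deep queries rather than this toy aggregate.

```python
import math
import random

random.seed(7)

def online_avg(chunks, z=1.96):
    """Yield (running mean, half-width of a ~95% CI) after each chunk."""
    n, total, total_sq = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            n += 1
            total += x
            total_sq += x * x
        mean = total / n
        var = max(total_sq / n - mean * mean, 0.0)  # population variance estimate
        yield mean, z * math.sqrt(var / n)

# Synthetic data split into chunks, standing in for a chunked table scan.
data = [random.gauss(100.0, 15.0) for _ in range(100000)]
chunks = [data[i:i + 10000] for i in range(0, len(data), 10000)]

estimates = list(online_avg(chunks))
first_mean, first_half = estimates[0]
final_mean, final_half = estimates[-1]
# The interval shrinks as more chunks are processed, and the final
# estimate equals the exact average over the full dataset.
print(f"after 1 chunk:  {first_mean:.2f} +/- {first_half:.2f}")
print(f"after all data: {final_mean:.2f} +/- {final_half:.2f}")
```

The user sees a usable answer after the first chunk and watches it converge, instead of waiting for the full scan to complete.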
Past Projects
- VerdictDB: Universal Approximate Query Processing (AQP) system that offers AQP on top of existing SQL engines. Read more.
- Database Learning: Data system that becomes smarter over time. Read more.
- VAS: Visualization-Aware Sampling. Read more.