Arnold: Declarative Crowd-Machine Data Integration

Published in CIDR, 2013

Citation: Jeffery, S. R., Sun, L., DeLand, M., Pendar, N., Barber, R., & Galdi, A. (2013). Arnold: Declarative Crowd-Machine Data Integration. In CIDR. http://www.ramb.ethz.ch/CDstore/www2011/companion/p145.pdf

The availability of rich data from sources such as the World Wide Web, social media, and sensor streams is giving rise to a range of applications that rely on a clean, consistent, and integrated database built over these sources. Human input, or crowd-sourcing, is an effective tool to help produce such high-quality data. It is infeasible, however, to involve humans at every step of the data cleaning process for all data. We have developed a declarative approach to data cleaning and integration that balances when and where to apply crowd-sourcing and machine computation using a new type of data independence that we term Labor Independence. Labor Independence divides the logical operations that should be performed on each record from the physical implementations of those operations. Using this layer of independence, the data cleaning process can choose the physical operator for each logical operation that yields the highest quality for the lowest cost. We introduce Arnold, a data cleaning and integration architecture that utilizes Labor Independence to efficiently clean and integrate large amounts of data.
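
The separation between logical operations and physical operators described in the abstract can be illustrated with a small Python sketch. It is not Arnold's actual optimizer: the operator names, the quality and cost figures, and the selection rule (take the cheapest operator whose estimated quality meets a target, otherwise fall back to the highest-quality one) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalOperator:
    name: str       # hypothetical label, e.g. "machine_rules" or "crowd_review"
    quality: float  # assumed estimate of output quality in [0, 1]
    cost: float     # assumed cost per record, in arbitrary units


def choose_operator(candidates, min_quality):
    """Pick the cheapest candidate whose quality meets min_quality.

    If no candidate qualifies, fall back to the highest-quality one.
    This rule is an illustrative assumption, not the paper's algorithm.
    """
    if not candidates:
        raise ValueError("no physical operators available")
    good_enough = [op for op in candidates if op.quality >= min_quality]
    if good_enough:
        # Cheapest first; break cost ties in favor of higher quality.
        return min(good_enough, key=lambda op: (op.cost, -op.quality))
    # Highest quality first; break quality ties in favor of lower cost.
    return max(candidates, key=lambda op: (op.quality, -op.cost))


# Hypothetical catalog mapping each logical operation to the physical
# implementations (machine or crowd) that could carry it out.
CATALOG = {
    "deduplicate": [
        PhysicalOperator("machine_similarity", quality=0.80, cost=0.001),
        PhysicalOperator("crowd_review", quality=0.95, cost=0.05),
    ],
    "normalize_address": [
        PhysicalOperator("machine_rules", quality=0.92, cost=0.001),
        PhysicalOperator("crowd_review", quality=0.97, cost=0.05),
    ],
}


def plan(logical_ops, min_quality):
    """Map each logical operation to the name of its chosen physical operator."""
    return {op: choose_operator(CATALOG[op], min_quality).name for op in logical_ops}


if __name__ == "__main__":
    print(plan(["deduplicate", "normalize_address"], min_quality=0.9))
```

Under these made-up numbers and a quality target of 0.9, deduplication is routed to the crowd because the machine operator falls short of the target, while address normalization stays with the machine because it is good enough and far cheaper. The caller only names logical operations; the choice between crowd and machine is made beneath that layer.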
