Chairholders
Profile
Ihab Ilyas
David R. Cheriton School of Computer Science
University of Waterloo
Chair title
Thomson Reuters-NSERC Industrial Research Chair in Data Cleaning
Chair program
Industrial Research Chairs program
Role
Senior Chairholder since 2018
Summary
The economy and science are now driven by data. Examples include evidence-based medicine, big data analytics that drive spending and decision-making in economic sectors, and data science infrastructure to speed up discoveries in astronomy, chemistry and many other scientific domains. Hence, enterprises in all vertical markets have been aggressively collecting data from a variety of sources to build the ultimate data asset, often referred to as the “data lake,” which will allow data scientists to find key insights and analytics that drive business. However, because of the variety of data-collection methods, their imperfections, and the integration of data from various sources with different schemas, units, and languages, this data asset is often “dirty” (containing inaccurate, incomplete or inconsistent data) and cannot be used as intended. Consequently, data cleaning is the most time-consuming task performed by data scientists (according to an article in Forbes magazine in 2016), and a major hurdle to effective data science (according to an article in the New York Times in 2014). Current data-cleaning approaches suffer from fundamental pragmatic problems when applied to real-world dirty data, which hinder the deployment of these solutions in industry and business settings. The main objective of this research is to develop new technologies that enable data-quality solutions, allowing high-quality analytics and retrieval on large-scale, inconsistent, and dirty databases.
This proposal is motivated by the University of Waterloo and Thomson Reuters’ strong commitment to data science and big data research, leveraging both parties’ world-class reputations and track records of innovation involving data. The Chair has been a tenured Full Professor since 2014, has a strong international reputation in data management, and is recognized internationally as one of the research leaders in data quality and cleaning. He has also co-developed one of the most prominent commercial products in data quality and integration.
On the commercial applications front, the techniques developed under the Chair will enable large enterprises across economic sectors, including Thomson Reuters, to gain access to and leverage their data assets. On the scientific applications front, disciplines such as astronomy, chemistry, scientific imagery and pharmaceutical research have been transformed, becoming massive data-centric applications that collect very large (but often dirty and incomplete) data sets from a variety of sources. The data cleaning and integration activities proposed in this research will directly accelerate research findings and significantly shorten the data science life cycle by enhancing the quality of the underlying data and increasing its value.
The proposal acknowledges the challenges in achieving the academic and commercial goals, but at the same time builds on many research results from the past decade on business rule mining, data integration, information exchange, and large-scale data management. Hence, on the one hand, the proposal will help consolidate previous results to produce new techniques that have better adoption and application properties. On the other hand, it will tackle new research challenges often ignored when addressing small-scale and constrained scenarios. We anticipate ground-breaking results with respect to the scale of application, the diversity of the data addressed, and the constraints related to privacy and compliance.
Partner
- Thomson Reuters
Contact information
David R. Cheriton School of Computer Science
University of Waterloo
Website:
https://cs.uwaterloo.ca/~ilyas/