Web Data Management
Dr. Daniel Deutch
The course discusses various algorithmic aspects of web data management. Web data has distinctive features: it contains both structured (tables), semi-structured (XML and linked data), and unstructured data (text); it is large-scale, contains contradicting pieces of data, and is uncertain.
To be able to deal with this ocean of information, there is a need for new models as well as efficient retrieval, recommendation, and ranking techniques.
The goal of this course is to teach state-of-the-art models and algorithmic techniques for dealing with web data, in light of these challenging features.
In particular, we will start with an overview of models for structured data (in particular the relational and the nested relational models), semi-structured data (XML, tree automata, Web Graphs..), and unstructured data (Context-Free and Context-sensitive Languages). We will highlight some key issues in the design and usage of these models, in the context of web data. For each model, we will further introduce a probabilistic variant that allows to model uncertainty (probabilistic databases, probabilistic XML, HMMs, PCFGs, ..). We will introduce analysis tasks (and query languages that capture them where appropriate) for all cases (relational algebra, XQuery and XPath, Text analysis with HMMs and PCFGs, Ranking web-pages) and state-of-the-art algorithms for performing these tasks in the context of the probabilistic models that were studied. The algorithms will include techniques for evaluating relational algebra on probabilistic databases; Google PageRank algorithm; algorithms for computing Part-of-speech tagging in text analysis; algorithms for recommendations, and many others.
Bibliography:
Books
1. Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart,
Web Data Management, Cambridge University Press 2011
2. Dan Suciu, Dan Olteanu, Christopher Ré , Christoph Koch
Probabilistic databases, Morgan and Claypool 2011
Papers
1. Brin, Page
The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems 30 (7), 1998
2. Fagin,
Kumar, Sivakumar, COMPARING TOP k LISTS SIAM J. Discrete Math 17 (1), 2003
3. Fagin, Lotem, Naor,
Optimal aggregation algorithms for middleware, in Proc. of SIGMOD ‘01
4. Sarwar, Karypis, Konstan, Reidl,
Item-based collaborative filtering recommendation algorithms in proc. of WWW’01
5. Wang, De Vries, Reinders,
Unifying user-based and item-based Collaborative Filtering in proc. of SIGIR ‘06
6. Kleinberg,
Authoritative sources in a hyperlinked environment, in JACM 46 (5), 1999
7. Serge Abiteboul, Benny Kimelfeld, Yehoshua Sagiv, Pierre Senellart:
On the expressiveness of probabilistic XML models. VLDB J. 18(5), 2009