Hylien: A hybrid approach to general list extraction on the web

Published in WWW, 2011

Citation: Fumarola, F., Weninger, T., Barber, R., Malerba, D., & Han, J. (2011, March). Hylien: A hybrid approach to general list extraction on the web. In Proceedings of the 20th international conference companion on World wide web (pp. 35-36). ACM. http://wwwconference.org/proceedings/www2011/companion/p35.pdf

We consider the problem of automatically extracting general lists from the Web. Existing approaches depend mostly on either the underlying HTML markup or the visual structure of the Web page. We present HyLiEn, an unsupervised, Hybrid approach for automatic List discovery and Extraction on the Web. It relies on general assumptions about the visual rendering of lists and the structural representation of the items they contain. We show that our method significantly outperforms existing methods.

Download paper here