HyLiEn: A hybrid approach to general list extraction on the web
Published in WWW, 2011
Citation: Fumarola, F., Weninger, T., Barber, R., Malerba, D., & Han, J. (2011, March). HyLiEn: A hybrid approach to general list extraction on the web. In Proceedings of the 20th International Conference Companion on World Wide Web (pp. 35-36). ACM. http://wwwconference.org/proceedings/www2011/companion/p35.pdf
We consider the problem of automatically extracting general lists from the web. Existing approaches depend mostly on either the underlying HTML markup or the visual structure of the Web page. We present HyLiEn, an unsupervised, Hybrid approach for automatic List discovery and Extraction on the Web. It employs general assumptions about the visual rendering of lists and the structural representation of the items contained in them. We show that our method significantly outperforms existing methods.
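To illustrate what combining a visual cue with a structural cue might look like, here is a minimal Python sketch. It is not the HyLiEn algorithm from the paper. It assumes a hypothetical `Box` record holding an element's text, a tag-path string used as a crude structural signature, and its rendered top-left pixel position. Elements are grouped as a candidate list when they share an edge (the visual assumption) and a tag path (the structural assumption). The names `Box`, `candidate_lists`, and `min_items` are illustrative only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a DOM element plus its rendered position."""
    text: str
    tag_path: str  # crude structural signature, e.g. "ul/li/a"
    x: int         # left edge in pixels
    y: int         # top edge in pixels


def candidate_lists(boxes, min_items=2):
    """Group boxes sharing an edge (visual cue) and a tag path (structural cue)."""
    groups = defaultdict(list)
    for b in boxes:
        groups[("vertical", b.x, b.tag_path)].append(b)    # same left edge
        groups[("horizontal", b.y, b.tag_path)].append(b)  # same top edge

    lists = []
    for (orientation, _, _), members in groups.items():
        if len(members) < min_items:
            continue
        # Order items top-to-bottom for vertical lists, left-to-right otherwise.
        if orientation == "vertical":
            members = sorted(members, key=lambda b: b.y)
        else:
            members = sorted(members, key=lambda b: b.x)
        lists.append([b.text for b in members])
    return lists


# Toy page: a vertical list, a horizontal navigation bar, and a stray paragraph.
boxes = [
    Box("A", "ul/li/a", 10, 0),
    Box("B", "ul/li/a", 10, 20),
    Box("C", "ul/li/a", 10, 40),
    Box("Intro", "p", 10, 60),
    Box("Home", "nav/a", 0, 100),
    Box("About", "nav/a", 50, 100),
]
result = candidate_lists(boxes)
```

On the toy input, the paragraph is excluded even though it shares a left edge with the list items, because its tag path differs. A purely visual grouping would have merged it into the list, which is the kind of error a hybrid approach aims to avoid.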