Parsing HTML with Perl

The Perl Journal has a good tutorial on parsing HTML with HTML::Parser, by Ken MacFarlane.

It's worth noting that HTML::Parser is already sub-classed (and re-used) by lots of other modules on CPAN. That said, I think the common approach is either to (a) use regexps to pull out just what you need or (b) use some black-box tool to convert the HTML to well-formed XML, and parse that.
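For reference, here is a minimal sketch of what sub-classing HTML::Parser can look like. The LinkParser package and the extract_links helper are made-up names for illustration, not anything from the tutorial, and it assumes HTML::Parser 3.x from CPAN is installed. It relies on the version 2 compatible behaviour, where a parser created with no arguments calls the start, end and text methods on the object.

```perl
use strict;
use warnings;

{
    # Hypothetical subclass: collects the href of every <a> tag it sees.
    package LinkParser;
    use parent 'HTML::Parser';

    # Called for each start tag when the parser is created with no
    # arguments (version 2 compatible mode). Tag and attribute names
    # arrive lower-cased by default.
    sub start {
        my ($self, $tag, $attr) = @_;
        push @{ $self->{links} }, $attr->{href}
            if $tag eq 'a' && defined $attr->{href};
    }
}

# Hypothetical helper: parse a string of HTML and return the links found.
sub extract_links {
    my ($html) = @_;
    my $parser = LinkParser->new;   # no arguments, so methods are used as handlers
    $parser->{links} = [];
    $parser->parse($html);
    $parser->eof;                   # flush any buffered text
    return @{ $parser->{links} };
}

print "$_\n" for extract_links('<p><a href="a.html">A</a> <A HREF="b.html">B</A></p>');
```

The links are stored in the parser object itself, which is an ordinary hash reference; other state could be kept the same way.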

The HTML::Tree classes (the tree builder and tree node classes) are another interface to HTML::Parser, as is HTML::TokeParser. I think I’ll stick with hacking on the plain parser for now. Bah. It doesn’t work: there is some obscure bug when calling the “parse” method, which doesn’t appear in the Perl source (compiled code, perhaps?). In any case, I’m using regexps….
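A rough sketch of the kind of regexp I mean, pulling href values out of anchor tags. The extract_hrefs name is just for illustration, and this assumes quoted attribute values; it will trip over unquoted attributes, tags inside comments, and other things a real parser handles.

```perl
use strict;
use warnings;

# Hypothetical helper: grab quoted href values from <a> tags with a regexp.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;
    # /g walks through every match, /i ignores case, /s lets . span newlines.
    # $1 is the opening quote character, $2 is the attribute value.
    while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gis) {
        push @hrefs, $2;
    }
    return @hrefs;
}

print "$_\n" for extract_hrefs(qq{<a href="x.html">X</a> <A class='c' HREF='y.html'>Y</A>});
```

It only needs core Perl, which is most of the appeal.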