NAME
    HTML::Untemplate - web scraping assistant

VERSION
    version 0.014

DESCRIPTION
    Suppose you have a set of HTML documents generated by populating the
    same template with data from some kind of database. HTML::Untemplate
    is a set of command-line tools ("xpathify", "untemplate") and modules
    (HTML::Linear and its dependencies) which assist in retrieving the
    original data. This process is also known as wrapper induction.

    To achieve this goal, HTML tree nodes are represented as XPath/content
    pairs. HTML documents linearized this way can be easily inspected
    manually or with a diff tool. Please refer to "EXAMPLES".

    Despite being named similarly to HTML::Template, this distribution is
    not directly related to it. Instead, it attempts to reverse the
    templating action, whatever templating engine was used.

  Why?
    Suppose you have a CMS. A typical CMS works roughly like this (data
    flows top-down):

     RDBMS
     scripting language
     HTML
     HTTP server
     (...)
     HTTP agent
     layout engine
     screen
     user

    Consider the first 3 steps: "RDBMS => scripting language => HTML".
    This is "applying template".

    Now, consider this: "HTML => scripting language => RDBMS". I would
    call that "un-applying template", or "untemplate" ":)"

    The practical application of this set of tools is to assist in the
    creation of web scrapers.

    A similar (however completely unrelated) approach is described in the
    paper XPath-Wrapper Induction for Data Extraction.

  Human-readability
    Consider the following HTML node address representations:

    *   0.1.3.0.0.4.0.0.0.2 (HTML::TreeBuilder internal address
        representation);

    *   "/html/body/div[4]/div/div[1]/table[2]/tr/td/ul/li[3]"
        (HTML::Linear, strict);

    *   "//td[1]/ul[1]/li[3]" (HTML::Linear, strict, shrink);

    *   "/html/body[@class='section_home']/div[@id='content_holder'][1]/div[@id='content']/div[@id='main']/table[@class='content_table'][2]/tr/td/ul/li[@class='rss_content rss_content_col'][2]"
        (HTML::Linear, non-strict);

    *   "//li[@class='rss_content rss_content_col'][2]" (HTML::Linear,
        non-strict, shrink).

    They all point to the same node; however, their verbosity and
    readability vary. The *strict* mode specifies tag names and positions
    only. Disabling *strict* additionally uses data from CSS selectors
    (the "id" and "class" attributes seen above). *Shrink* mode attempts
    to find the shortest XPath that is unique for every node
    ("/html/body" is shared among almost all nodes, and thus is likely to
    be irrelevant).

EXAMPLES
  xpathify
    The xpathify tool flattens the HTML tree into a key/value list:

     Hello HTML

     Hello World!

     This is a sample HTML

     Beware!

     HTML is not XML!

     Have a nice day.

    Becomes:

    *(HTML block)*

    The keys are in XPath format, while the values are the respective
    content from the HTML tree. In theory, the HTML tree could be
    reassembled from the flat key/value list this tool generates.

  untemplate
    The untemplate tool flattens a set of HTML documents using the
    algorithm from xpathify. Then, it strips the key/value pairs shared
    by the documents. The "rest" is composed of the original values fed
    into the template engine.

    And this is what the result actually looks like with some simple
    real-world examples (quotes 1839 and 2486 from bash.org):

    *(HTML block)*

MODULES
    The modules may be used to serialize/flatten HTML documents on your
    own:

    *   HTML::Linear - represent HTML::Tree as a flat list

    *   HTML::Linear::Element - represent elements to populate
        HTML::Linear

    *   HTML::Linear::Path - represent paths inside HTML::Tree

SEE ALSO
    *   Wrapper (data mining)

    *   XPath-Wrapper Induction for Data Extraction

    *   HTML::TreeBuilder

    *   HTML::Similarity

    *   XML::DifferenceMarkup

AUTHOR
    Stanislaw Pusep

COPYRIGHT AND LICENSE
    This software is copyright (c) 2012 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.
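
As a rough illustration of the stripping step described under "untemplate" above, the following Python sketch removes the XPath/content pairs that occur in every linearized document and keeps the rest. It is not the distribution's implementation (which is written in Perl); the function name `strip_shared` and the sample XPath/content pairs are hypothetical and chosen only for this example.

```python
from collections import Counter


def strip_shared(documents):
    """Drop every (xpath, content) pair that occurs in all documents.

    `documents` is a list of linearized documents, each one a list of
    (xpath, content) tuples. Returns one list of remaining pairs per
    document, in the original order.
    """
    # Count in how many documents each pair appears (at most once per document).
    seen_in = Counter()
    for doc in documents:
        seen_in.update(set(doc))

    # A pair is "shared" when every document contains it.
    shared = {pair for pair, count in seen_in.items() if count == len(documents)}

    return [[pair for pair in doc if pair not in shared] for doc in documents]


# Hypothetical linearized pages produced from the same template.
page_a = [
    ("/html/head/title/text()", "Quotes"),
    ("/html/body/h1/text()", "Quote #1"),
    ("/html/body/p/text()", "first quote text"),
]
page_b = [
    ("/html/head/title/text()", "Quotes"),
    ("/html/body/h1/text()", "Quote #2"),
    ("/html/body/p/text()", "second quote text"),
]

rest_a, rest_b = strip_shared([page_a, page_b])
for xpath, content in rest_a:
    print(xpath, content)
```

In this sketch the shared title pair is treated as template text and dropped, while the heading and paragraph pairs, which differ between the two pages, are kept as candidate data values.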