NAME

HTML::Untemplate - web scraping assistant

VERSION

version 0.005

DESCRIPTION

Suppose you have a set of HTML documents generated by populating the same template with data from some kind of database. HTML::Untemplate is a set of command-line tools ("xpathify", "untemplate") and modules (HTML::Linear and its dependencies) that assist in retrieving the original data.

To achieve this goal, HTML tree nodes are presented as XPath/content pairs. HTML documents linearized this way can be easily inspected manually or with a diff tool. Please refer to "EXAMPLES".

Despite being named similarly to HTML::Template, this distribution is not directly related to it. Instead, it attempts to reverse the templating action, whatever templating agent was used.

Why?

Suppose you have a CMS. A typical CMS works roughly like this (data flows top-down):

* RDBMS
* scripting language
* HTML
* HTTP server
* (...)
* HTTP agent
* layout engine
* screen
* user

Consider the first 3 steps: "RDBMS => scripting language => HTML". This is "applying a template".

Now, consider this: "HTML => scripting language => RDBMS". I would call that "un-applying a template", or "untemplate" :)

The practical application of this set of tools is to assist in the creation of web scrapers.

EXAMPLES

xpathify

The xpathify tool flattens the HTML tree into a key/value list:

Hello HTML

Hello World!

This is a sample HTML

Beware!

HTML is not XML!

Have a nice day.

Becomes:

*(HTML block)*

The keys are in XPath format, while the values are the respective content from the HTML tree. In theory, the HTML tree could be reassembled from the flat key/value list this tool generates.

untemplate

The untemplate tool flattens a set of HTML documents using the same algorithm as xpathify, then strips the key/value pairs shared across the documents. What remains are the original values that were fed into the template engine.

This is what the result actually looks like with some simple real-world examples (quotes 1839 and 2486 from bash.org):

*(HTML block)*

MODULES

These may be used to serialize/flatten HTML documents on your own:

* HTML::Linear - represent HTML::Tree as a flat list
* HTML::Linear::Element - represent elements to populate HTML::Linear
* HTML::Linear::Path - represent paths inside HTML::Tree

SEE ALSO

* HTML::TreeBuilder
* HTML::Similarity
* XML::DifferenceMarkup

AUTHOR

Stanislaw Pusep

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
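
The two ideas described under EXAMPLES (flatten each document into XPath/content pairs, then drop the pairs the documents share) can be illustrated with a short, self-contained sketch. This is not the HTML::Linear implementation and does not reproduce the exact output of xpathify or untemplate; it is a rough Python approximation using only the standard library. The names Flattener, flatten and untemplate_pairs are hypothetical, and the sketch assumes well-nested input (no tree repair is attempted).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Flattener(HTMLParser):
    """Collect (path, value) pairs with XPath-like keys (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []      # path steps of the currently open elements
        self.counts = [{}]   # per open element: tag name -> children seen so far
        self.pairs = []

    def _path(self, *extra):
        return "/" + "/".join(self.stack + list(extra))

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        step = f"{tag}[{seen[tag]}]"
        # Attributes become pairs of their own, keyed as .../@name.
        for name, value in attrs:
            self.pairs.append((self._path(step, f"@{name}"), value or ""))
        if tag not in VOID:
            self.stack.append(step)
            self.counts.append({})

    def handle_endtag(self, tag):
        # Assumption: input is well nested; mismatched end tags are ignored.
        if self.stack and self.stack[-1].split("[")[0] == tag:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text nodes
            self.pairs.append((self._path("text()"), text))


def flatten(html):
    """Return the document as an ordered list of (path, value) pairs."""
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return parser.pairs


def untemplate_pairs(documents):
    """For each document, keep only the pairs not shared by every document."""
    flat = [flatten(doc) for doc in documents]
    if not flat:
        return []
    shared = set(flat[0]).intersection(*map(set, flat[1:]))
    return [[pair for pair in pairs if pair not in shared] for pairs in flat]
```

Calling flatten() on one page yields its key/value list. Calling untemplate_pairs() on two pages built from the same template leaves only the pairs whose values differ, which approximates the data originally fed to the template.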