-------------------------------------------------------------------------
OpenOffice::OODoc - Open Office Document Connector
http://www.genicorp.com/devel/oodoc

Quickstart notice (Feb. 2004)
-------------------------------------------------------------------------

This short introductory notice is intended to let anybody evaluate some
basic features of the OODoc modules. A full reference manual of the API is
available as an OpenOffice.org (SXW) document (please see the project page
above for download).

I - Overview

The main goal of the Perl Open Office Document Connector (OODoc) is to
allow quick application development in 2 areas:

- replacement of old-style, proprietary, client-based macros for intensive
  and non-interactive document processing;
- direct read/write operations by enterprise software on office documents,
  and/or document-driven applications.

OODoc provides an abstraction of the document objects and isolates the
programmer from low level XML navigation, UTF8 encoding and compressed
file management details.

The OODoc module set is organized in 3 logical layers.

The first layer consists of the OpenOffice::OODoc::File class (defined in
the File.pm module). This class is responsible for read/write operations
on the OpenOffice.org physical files. It does all I/O and
compression/decompression processing. It's mainly an easy-to-use,
OpenOffice-oriented wrapper for the standard Archive::Zip Perl module (but
it could be extended to encapsulate any other physical storage method for
the OpenOffice.org documents).

The second layer is made of the OpenOffice::OODoc::XPath class (XPath.pm),
which is an OpenOffice/XML-aware class. OpenOffice::OODoc::XPath is an
object-oriented Perl representation of an XML member of an OpenOffice.org
document (e.g. content.xml, meta.xml, styles.xml, etc.), using the
XML::XPath Perl API to access individual XML elements.
If you want to deal at the same time with several XML components of the
same document, you have to create several OpenOffice::OODoc::XPath objects
against the document (e.g. one OpenOffice::OODoc::XPath will be associated
with 'meta.xml' to represent the metadata, another one will be associated
with 'content.xml' to give access to the content).
OpenOffice::OODoc::XPath accepts and provides only XML strings from/to the
application; but it's able to connect with an OpenOffice::OODoc::File
object for file I/O operations, so you can use it without explicit file
management coding.

For example, if you want to get access to the content of any OO file (say
'foo.sxw'), you have to write something like:

    use OpenOffice::OODoc;
    my $doc = OpenOffice::OODoc::XPath->new
            (
            file    => 'foo.sxw',
            member  => 'content'
            );

Then $doc becomes an abstraction of the 'content.xml' member (i.e. the
text and automatic styles) of the 'foo.sxw' file, which can be used to
get/set any content through simple methods like:

    print $doc->getText('//text:p', 2);

The last instruction outputs the content of the 3rd paragraph as flat,
editable text (because '//text:p' is the logical path to any paragraph,
and the paragraphs are numbered from zero). You could also put your own
text in the same paragraph with:

    $doc->setText('//text:p', 2, 'My text');

The line above deletes any preceding content in the paragraph and replaces
it with 'My text'. But, for the moment, the paragraph is only changed in
memory; to commit the change and make it persistent in the OO file, you
just have to do a

    $doc->save;

In order to avoid obscure coding for Perl/object beginners, the
OpenOffice::OODoc::XPath->new initialization call can be replaced by
"ooXPath", with the same arguments.

OpenOffice::OODoc::XPath allows some quick element manipulation and
exchange, and can operate on several documents in the same session.
For example:

    my $doc1 = ooXPath(file => 'file1.sxw', member => 'content');
    my $doc2 = ooXPath(file => 'file2.sxw', member => 'content');
    my $paragraph = $doc1->getElement('//text:p', 15);
    $doc2->insertElement('//text:h', 0, $paragraph, position => 'after');

This sequence takes an arbitrary paragraph (the 16th one) of a document
and inserts it just after an arbitrary header (the first one) in another
document.

Here, we used the 'insertElement' method to directly transfer an existing
text element, but the same method (with different arguments) can create a
new element according to application data, or from a well-formed XML
string describing any document element in regular OpenOffice syntax.
Example:

    # a program: export the XML description of a paragraph
    my $doc = ooXPath(file => 'file1.sxw', member => 'content');
    open(my $out, '>', 'transfer.xml')
        or die "Can't write transfer.xml: $!";
    print $out $doc->exportXMLElement('//text:p', 15);
    close $out;

    # another program: read the XML description back and insert it
    my $doc2 = ooXPath(file => 'file2.sxw', member => 'content');
    open(my $in, '<', 'transfer.xml')
        or die "Can't read transfer.xml: $!";
    my $xml = do { local $/; <$in> };   # read the whole file as one string
    close $in;
    $doc2->insertElement('//text:h', 0, $xml, position => 'after');

These last two short programs produce the same effect as the preceding
one, but the target file can be processed later than the source one and in
a different location, because there is no direct link between the two
documents. The first program exports an XML description of the selected
element, then the second program uses this description to create and
insert a new element that is an exact replica of the exported one. In the
meantime, the intermediate XML file can be checked, processed and
transmitted with any language and protocol.

But it's just a beginning, because, in the real world, you have to do much
more sophisticated processing, and you don't have a lot of time to learn
the XML path of every kind of document element (paragraph, header, item
list, style, ...). So we began to develop a third, more user-friendly
layer.
The third layer is designed as a set of application-oriented classes,
inherited from OpenOffice::OODoc::XPath. In this layer, the basic
principle is "allow the user to forget XML". Each document element is
considered from the user's point of view, and the XML path to get it is
hidden. This approach works only if a specialized OpenOffice::OODoc::XPath
class is defined for each kind of content. So, we ultimately need the
following classes:

    OpenOffice::OODoc::Text for the textual content of any document;
    OpenOffice::OODoc::Image to deal with the graphic objects;
    OpenOffice::OODoc::Calc to manage the specific addressing of cells
        in spreadsheet documents;
    OpenOffice::OODoc::Meta for the metadata (meta.xml);
    OpenOffice::OODoc::Styles for page/style definitions.

For the moment, in the 3rd layer, the Calc module is not implemented yet.
However, OO Calc documents can be processed by the XPath module and
(partly) by the Text module, because the latter provides a few table
processing methods, and because the text content of a Calc cell can be
managed as a text paragraph.

To illustrate the differences between the layers, with OODoc::Text (if you
know your document is really an OpenOffice.org Writer one), you get the
same paragraph as in the previous example with:

    print $doc->getParagraphText(2);

The difference looks tiny, but in fact OODoc::Text contains much more
sophisticated text-aware methods that avoid a lot of coding and probably a
lot of XML path errors. For example, the following code puts the content
of an ordinary Perl list (@mydata) in an OpenOffice document as an
ordinary numbered item list:

    my $list = $doc->appendItemList
            (
            type    => 'ordered',
            style   => 'Text body'
            );
    $doc->setText($list, @mydata);

The first instruction creates an empty list at the end of the document
body (here an ordered one with a given style, but these parameters are
optional). The second one populates the new list with the content of an
application-provided table.
The setText method automatically modifies its behaviour according to the
functional type of its first argument (which is not the same for a
paragraph as for an item list or a table cell).

The same layer provides some global processing methods such as:

    my $result = $doc->selectTextContent($filter, \&myFunction);

that produces a double effect:

1) it scans the whole document body and extracts the content of every
   text element matching a given filter expression (that is an exact
   string or a conventional Perl regular expression);
2) it automatically triggers an application-provided function each time
   a matching content is found; the called function can execute any
   on-the-fly search/replace/delete operation on the current content and
   get data from any external database or communication channel; the
   return value of the function automatically replaces the matching
   string.

So such a method can be used in sophisticated conditional
fusion-transformation scripts. But you can use the same method to get a
flat ASCII export of the whole document, without other processing, if you
provide neither filter nor action:

    print $doc->selectTextContent;

Of course, OODoc can process presentation and not only content. Example:

    my $filter = 'Dear valued customer';
    foreach my $element ($doc->selectElementsByContent($filter))
            {
            $doc->setStyle($element, 'Welcome')
                    if $element->isParagraph;
            }

After this last code sequence, every paragraph containing the string
'Dear valued customer' has the 'Welcome' style (assuming 'Welcome' is a
paragraph style, already defined or to be defined in the document).

A style (like any other document element) can be completely created by
program, or imported (directly or through an XML string) from another
document.
The second way is generally the better one, because you need a lot of
parameters to build a completely new style by program, but the creation of
a simple style is not a headache with the OODoc::Styles module, provided
that you have an OpenOffice.org attributes glossary at hand. The following
example shows the way to build the "Welcome" style. This piece of code
declares "Welcome" as a paragraph style, with "Text body" as parent style,
and with some private properties (Times 16 bold font, yellow background
and blue foreground).

    $doc->createStyle
            (
            "Welcome",
            family      => 'paragraph',
            parent      => 'Text body',
            properties  =>
                    {
                    'style:font-name'               => 'Times',
                    'fo:font-size'                  => '16pt',
                    'fo:font-weight'                => 'bold',
                    'style:text-background-color'   => '#ffff00',
                    'fo:color'                      => '#000080'
                    }
            );

According to the application logic, each newly created style can be
registered either as a "named" style (i.e. visible and reusable for the
OpenOffice.org suite end-user) or as an "automatic" style.

For an ordinary application that needs the best processing facility for
any kind of content and presentation element, the OODoc::Document module
is the best choice. This module defines a special class that inherits from
the Text, Image and Styles classes. It allows the programmer, for example,
to simply insert a new paragraph, create an image object, anchor the image
to the paragraph, then create the styles needed to control the
presentation of both the paragraph and the image, all that in the same
sequence and in any order.

II - Some practical examples

While the OODoc modules can read and modify any document, they can't
operate without an existing document. If you want to generate the entire
content and presentation of a document, you must provide an empty (or non
empty) document as a template, then your program will be able to
create/remove/change any element in it.

To begin playing with the modules, you should first of all look at the
self-documented sample scripts provided in the package.
These scripts do nothing really useful, but they show the way to use the
modules.

You should directly load the full library with a single
"use OpenOffice::OODoc" at the beginning of your scripts. Then, to start
with, you should use the Document and/or Meta classes only.

We encourage you, at first, to avoid any explicit OODoc::XPath basic
method invocation, and to deal only with the available "intelligent"
modules (Text, Image, Styles, via Document, and Meta), in order to get
immediate results with minimal effort. And, if you use this stuff for
evangelization purposes, you can show the code to prove that the
OpenOffice.org XML format allows a lot of things with a few lines.

You can avoid the heavy object oriented notation such as:

    my $meta = OpenOffice::OODoc::Meta->new(file => "xxx.sxc");

and use shortcuts like:

    my $meta = ooMeta(file => "xxx.sxc");

The first thing you have to do with a document is to create an object
focused on the member you want to work with, and "feed" it with regular
OpenOffice.org XML. The most straightforward way to do that is to create
the object in association with an OpenOffice.org file.

Example 1 - Dealing with metadata

We need metadata access, so we use OODoc::Meta:

    use OpenOffice::OODoc;
    my $doc = ooMeta(file => 'myfile.sxw');
    my $title = $doc->title;
    if ($title)
            { print "The title is $title"; }
    else
            { print "There is no title"; }

Here, because the constructor of OODoc::Meta is called with a 'file'
parameter, OODoc::Meta knows it needs file access, so it dynamically
requires the OODoc::File module, instantiates a corresponding object using
the file name, connects to it, and asks it for the 'meta.xml' member of
the file. All that annoying processing is hidden from the programmer. We
just have to query for the useful object, the title.

We could get more complex metadata structures, such as the user defined
fields:

    my %ud = $doc->user_defined;
    foreach my $name (keys %ud)
            { print $name . '->' . $ud{$name} . "\n"; }

This code captures the user defined fields (names and values) in a hash
table, which is then displayed in a "name->value" form. You can see the
way to update the user defined fields in the 'set_fields' script.

The most usual metadata accessors have a symmetrical behaviour. To update
the title, for example, you have to call the 'title' method with a string
argument:

    $doc->title("New title");

You can proceed in the same way with subject, description and keywords.
The 'keywords' method is an example of polymorphic behaviour (which is
quite common for many OODoc methods):

    my $keywords = $doc->keywords;
    my @keywords = $doc->keywords;

In the first form, the keywords are returned concatenated and
comma-separated in a single editable text line. In the second one, we get
the keywords as a list. But if 'keywords' is called to add new keywords,
these must be provided as a list:

    $doc->keywords("kw1", "kw2", "kw3");
    $doc->keywords(@my_keywords);

The program is automatically prevented from introducing redundancy in the
keywords list (the 'keywords' method deletes duplicates). Since 'keywords'
can only add new keywords, you have to call removeKeyword to delete an
existing keyword. If you want to destroy the entire list of keywords in a
single call, you just have to write:

    $doc->removeKeywords;

Well, we have done some updates in the metadata, but these updates apply
only in memory. To make them persistent in the file, we just have to
issue a:

    $doc->save;

I said OODoc::Meta (which is an OODoc::XPath) did not know anything about
files and data compression. But in my example, the object has been created
with a 'file' argument and associated with an implicit OODoc::File object.
So, the 'save' method of OODoc::XPath is only a stub method which sends a
'save' command to the connected OODoc::File object. With an object created
with an 'xml' parameter (providing the metadata through an XML string,
without reference to a file), a 'save' call generates a 'No archive'
error.
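The delegation described above can be pictured with a minimal,
self-contained sketch in plain Perl. It does not reproduce the OODoc source
code: the Sketch::Archive and Sketch::Member classes and the 'register'
method are hypothetical names invented for this illustration, and no file
is actually written.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the file layer: it knows its connected members.
package Sketch::Archive;
sub new {
    my ($class, $filename) = @_;
    return bless { filename => $filename, members => [] }, $class;
}
sub register {
    my ($self, $member) = @_;
    push @{ $self->{members} }, $member;
    return $self;
}
sub save {
    my ($self, $target) = @_;
    $target = $self->{filename} unless defined $target;
    # A real file layer would write a compressed file here.
    return 'saved ' . scalar(@{ $self->{members} }) . " member(s) to $target";
}

# Hypothetical stand-in for an XML member object (metadata, content, ...).
package Sketch::Member;
sub new {
    my ($class, %opt) = @_;
    my $self = bless { xml => $opt{xml} }, $class;
    if (defined $opt{file}) {
        # Created with 'file': connect to an implicit archive object.
        $self->{archive} = Sketch::Archive->new($opt{file});
        $self->{archive}->register($self);
    }
    return $self;
}
sub save {
    my ($self, @args) = @_;
    # The member's 'save' is only a stub that forwards to the archive.
    die "No archive\n" unless $self->{archive};
    return $self->{archive}->save(@args);
}

package main;

my $from_file = Sketch::Member->new(file => 'myfile.sxw');
print $from_file->save, "\n";
print $from_file->save('copy.sxw'), "\n";

my $from_xml = Sketch::Member->new(xml => '<meta/>');
my $ok = eval { $from_xml->save; 1 };
print "error: $@" unless $ok;
```

In this sketch, an object built from an XML string has nothing to forward
the call to, which is why 'save' fails for it, while an object built from a
file name simply relays the request (with or without a target file name).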
If you prefer to keep the original file unchanged, you can issue a

    $doc->save('my_other_file.sxw');

that produces the same thing as 'File/SaveAs' in your favorite office
software: if called with an argument, 'save' creates a new file containing
all the changed and unchanged members of the original one.

Example 2 - Manipulating text

Here we must read and update some elements of an OpenOffice.org Writer
document. Our program begins with something like this:

    use OpenOffice::OODoc;
    my $doc = ooText(file => 'myfile.sxw');

To give a very high level summary, we can say that OODoc::Text provides 2
kinds of read access methods:

- the 'get' methods that return data referred to by unconditional
  addressing, like getParagraph(4);
- the 'select' methods that return data selected against a given filter,
  related to a text content or an attribute value, like
  selectParagraphsByStyle('Text body').

Some 'get' or 'select' methods return lists while others return individual
elements or values. Returned data may be elements or texts. Text data can
be exported or displayed, but the application needs elements to do any
read/write operation on the content. For example:

    my $text = $doc->getTextContent;

extracts the whole content of the document as a flat, editable text in the
local character set, for immediate use (or display on a dumb terminal).

Of course, there is more than one way to do the same thing, so you can get
the same result with a 'select' method as with a 'get' one if you use a
"non-filtering filter". So:

    my $text = $doc->selectTextContent('.*');

will also return the whole text content. But this last method, with some
additional arguments and an appropriate filter, is much more powerful,
because it can do 'on-the-fly' processing in each text element matching
the filter (for example, insert values extracted from an enterprise
database or resulting from complex calculations).
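To make the filter-and-callback idea more concrete, here is a minimal
sketch in plain Perl. It does not call OODoc at all: the
select_and_replace function, the @elements list and the %data table are
hypothetical stand-ins for the document body and for an external data
source, and the sketch only assumes the behaviour described in this notice
(matching elements are extracted, and the callback's return value replaces
each matching string).

```perl
use strict;
use warnings;

# Hypothetical stand-in for a document body: a plain list of text elements.
my @elements = (
    'Dear customer, your balance is {BALANCE}.',
    'No placeholder here.',
    'Reference: {REF}',
);

# Hypothetical lookup table standing in for an enterprise database.
my %data = (BALANCE => '42.00 EUR', REF => 'A-1001');

# For every element matching $filter, call $callback on each matching
# string and substitute its return value; return the matching elements
# joined as flat text.
sub select_and_replace {
    my ($elements, $filter, $callback) = @_;
    my @matched;
    for my $text (@$elements) {
        next unless $text =~ /$filter/;
        $text =~ s/($filter)/$callback->($1)/ge;
        push @matched, $text;
    }
    return join "\n", @matched;
}

my $result = select_and_replace(
    \@elements,
    qr/\{\w+\}/,
    sub {
        my ($match) = @_;
        my ($key) = $match =~ /\{(\w+)\}/;
        # Unknown placeholders are left untouched.
        return exists $data{$key} ? $data{$key} : $match;
    },
);
print "$result\n";
```

The callback is the only place where application logic lives, so the same
scanning function can serve for a simple export, a search/replace, or a
data merge, depending on what the callback returns.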
The output of getTextContent can be tagged according to the type of each
text element, so the application can easily use this method to export the
text in an alternative (simple) markup language.

To do some intelligent processing in the text, we need to deal with
individual text objects such as paragraphs, headers, list items or table
cells.

For example, to export the content of the 5th paragraph (paragraph
numbering beginning with 0), we could directly get the text with:

    my $text = $doc->getParagraphText(4);

But in order to update the same paragraph, or change its style, I need the
paragraph element, not only its text content:

    my $para = $doc->getParagraph(4);
    # text processing takes place here
    $doc->setText($para, $other_text);
    $doc->setStyle($para, $my_style);

Some methods can dynamically adapt to the type of text element they have
to process. For example, the getText method (exporting the text content of
a given text element) can return the content of many kinds of elements
(paragraphs, headers, table cells, item lists or individual list items).
In addition, any text content extracted with a high-level OODoc method is
transcoded to the local character set (UTF8 issues are, we hope, hidden
from the application). Optionally, the text output can be instrumented
with application-provided begin and end tags according to the element type
(so it's possible to export the text in an alternative, simple XML
dialect, or in LaTeX, or in an application-specific markup language).

In order to facilitate some kinds of massive document processing
operations, OODoc::Text provides a few high level methods that do
iterative processing upon whole sets of text elements. One example is
selectElementsByContent: this method looks for any paragraph, header or
list item element matching a given pattern (string or regular expression)
and, each time an element is selected, it executes an application-provided
callback function.
An example of use is provided in the 'search' demo script, which selects
any text element in a document matching a given expression, and appends
the selected content as a sequence of paragraphs in another document.

The more usual methods have explicit names, and can be used without
documentation (or by just reading the headers of the French documentation)
provided that the programmer has a good understanding of the general
philosophy.

Header and paragraph manipulations are quite simple. The situation is more
complex with other text content such as item lists, tables and graphics.
To get an individual list item, you must point to it from a previously
obtained list element:

    my $item_list = $doc->getOrderedList(2);
    my $item = $doc->getListItem(4);

Here, $item contains the 5th item of the 3rd ordered (i.e. numbered) list
of the document (the content of the item could then be exported by a
generic method such as getText).

Because the need for data capture within table structures is more evident,
there is a direct accessor to get any individual table cell:

    my $value = $doc->getCellValue($table, $line, $col);

For example:

    my $value = $doc->getCellValue(0, 12, 0);

returns the value of the 1st cell of the 13th row of the 1st table in the
document. Note that the 'cell value' is simply the text content if the
cell type is string; but if the cell type is any numeric type,
getCellValue returns the content of the value attribute and ignores the
text.

You can also change the content of a cell:

    $doc->updateCell($table, $line, $col, $value);
    $doc->updateCell($table, $line, $col, $value, $string);
    $doc->updateCell($cell, $value);
    $doc->updateCell($cell, $value, $string);

The first form puts the $value in the target cell, assuming either that
it's a string cell or, if it's a numeric one, that you choose to use the
same content as both the value and the displayable string.
The second form (assuming the target cell is numeric) provides independent
content for the value and the string (the programmer must know what he or
she is doing, for example in the case of a money or date cell). The 3rd
and 4th forms do respectively the same things, but use a previously
obtained cell element in place of 3D coordinates (in order to avoid
unnecessary low-level XPath recalculation).

OODoc::Text allows the program to create a new table, using the
appendTable or insertTable method. The following example appends a new
table with 8 lines and 5 columns to the document.

    my $table = $doc->appendTable("MyTable", 8, 5);

But this new table is (by default) a pure text table. It's possible to
build very sophisticated table structures, with an appropriate data type
and a special presentation for each cell. But, to complete this task, the
application must provide a lot of parameters. So, it's recommended to
avoid purely programmatic table construction, and to reuse existing table
structures and styles in template documents previously created with the
OpenOffice.org software.

And, as with OODoc::Meta, don't forget to issue a 'save' call if you want
to make your changes persistent.

Example 3 - Dealing with text AND metadata

In this last example, we must access both the text content and the
metadata. So, we need 2 OODoc::XPath objects: one OODoc::Text and one
OODoc::Meta. And to avoid ugly and inefficient I/O operations, we need to
connect the 2 objects to the same OODoc::File "server".

    use OpenOffice::OODoc;
    my $archive = ooFile('myfile.sxw');
    my $content = ooText(archive => $archive);
    my $meta = ooMeta(archive => $archive);
    # process content and metadata
    $archive->save;

In this case, the OODoc::Text and OODoc::Meta objects are created with an
'archive' parameter, so they are required to connect to an existing
OODoc::File object.
After processing, a 'save' call directly addressed to the OODoc::File is
sufficient to do the physical file update, because this object "knows" the
list of the OODoc::XPath objects connected to it, and "asks" each of them
for the XML content it's responsible for (the other XML members of the
file remain unchanged).

There is an example of simultaneous access to content and metadata in the
script 'set_title' (where some text content is used to generate a piece of
metadata).

Example 4 - Manipulating graphics

The OODoc::Image module brings some features that can be used against any
OO document.

The following code (combining the capabilities of OODoc::Text and
OODoc::Image) selects the first paragraph containing the string
"OpenOffice" and attaches an imported image to it.

    my $p = $doc->selectElementByContent("OpenOffice");
    die "Paragraph not found" unless $p;
    $doc->createImageElement
            (
            "Paris landscape",
            description => "Montmartre in winter",
            attachment  => $p,
            # single quotes keep the backslashes of the path literal
            import      => 'C:\MyDocuments\montmartre.jpg',
            size        => "5cm, 3.5cm",
            style       => "graphics2"
            );

In this example, the image is physically imported. But I could replace the
"import" parameter by a "link" one, in order to use the image as an
external link (cf. the "link" option when you insert an image in
OpenOffice.org).

My new image needs a style (called "graphics2" in my example) to be
presented. This style could be an existing one, but my program could
create it if needed, using an OODoc::Styles method (see below).

Any characteristic of an existing image can be read or updated using
simple methods. For example, it's easy to change the size and the position
of my image:

    $doc->imageSize("Paris landscape", "10cm, 7cm");
    $doc->imagePosition("Paris landscape", "3cm, 0cm");

The logical name of the image (here "Paris landscape") is the best way to
retrieve an image object, so it's a mandatory argument of the
createImageElement method.
With OpenOffice.org Writer, each image is created with a unique name (that
is "Image1", "Image2", etc. if the user doesn't provide a more significant
one). But with OpenOffice.org Impress, the images are unnamed by default.
We recommend that you give a significant name to each object that you want
to process later by program, knowing that if an object can be easily
caught by program, it's potentially reusable.

An image can be selected by its description (i.e. the text the end-user
can edit in the image properties dialog in OpenOffice.org). So, the
following sequence provides the list of images where the description
contains the string "Montmartre":

    my @images = $doc->selectImageElementsByDescription("Montmartre");

If you have to store and process a graphical content outside of the
OpenOffice.org software, you can export it as an ordinary file:

    $doc->exportImage("Paris landscape", "/home/pictures/montmartre.jpg");

And you can use a symmetric importImage method to change the content of an
image element.

Example 5 - Managing styles

The OODoc::Styles module allows the programmer to get any style
definition, to change it and, if really needed, to create new styles.

In the first part of this document, you can see an example of paragraph
style creation. Unfortunately, createStyle could lead you to heavy coding
efforts, because a very sophisticated style definition needs a lot of
parameters and requires the knowledge of a lot of OpenOffice.org attribute
names. So we recommend that you systematically reuse existing styles
(stored in OO template documents used as "style repositories" or in XML
databases). The createStyle method supports a "prototype" parameter that
allows you to clone an existing style, contained in the same document or
in another one.
The next code sequence selects the "Text body" style of a document, and
uses it as a template to create a "My Text Body" style in another
document, changing the font size only:

    my $template = $doc1->getStyleElement("Text body");
    $doc2->createStyle
            (
            "My Text Body",
            family      => "paragraph",
            prototype   => $template,
            properties  =>
                    {
                    "fo:font-size" => "12pt"
                    }
            );

Because a style is required for each image in a document, the
OODoc::Document module brings a more user-friendly createImageStyle
method. This method allows you to create an image style without any
mandatory parameter (except the name). So, the "graphics2" style I invoked
in the previous createImageElement example could be simply created by:

    $doc->createImageStyle("graphics2");

Without other indication, the module automatically creates a style with
"reasonable" values, so the image is really visible in the document. Of
course, the application could provide explicit values for some parameters
if needed. The following call, for example, provides specific values for
contrast, luminance and gamma correction:

    $doc->createImageStyle
            (
            "graphics2",
            properties =>
                    {
                    'draw:contrast'     => '2%',
                    'draw:luminance'    => '-3%',
                    'draw:gamma'        => '1.1'
                    }
            );

Styles are not made only to control the presentation of individual
elements. There are special styles for page layout. Since these styles are
described with very specific data structures, the OODoc::Styles module
contains some methods dedicated to page styling.

III - Conclusion

OODoc is a work in progress, so there are probably some bugs to fix, a lot
of new functionality to add, and maybe some basic design to reconsider.
But it works reasonably well, and has been used in practice in small
real-world projects. So any constructive remark is welcome.

(JMG // oodoc@genicorp.com)