perl6-PDF-Tools
===============

## Overview

perl6-PDF-Tools is an experimental low-level tool-kit for reading and manipulating data from PDF files.

It presents a seamless view of the data in PDF or FDF documents, handling compression, encryption, fetching of indirect objects and unpacking of object streams behind the scenes. It is capable of reading, editing, creating and incrementally updating PDF files.

This module is primarily intended as a base for higher-level modules. It can also be used to explore or patch data in PDF or FDF files.

It does not understand logical PDF document structure. It is, however, possible to construct simple documents and perform simple edits by direct manipulation of PDF data. You will need some knowledge of how PDF documents are structured. Please see the 'The Basics' and 'Recommended Reading' sections below.

PDF::DOM and PDF::FDF are both under construction for high-level manipulation of PDF and FDF documents.

Classes/roles in this tool-kit include:

- `PDF::Reader` - for indexed random access to PDFs
- `PDF::Storage::Filter` - a collection of standard PDF decoding and encoding tools for PDF data streams
- `PDF::Storage::IndObj` - base class for indirect objects
- `PDF::Storage::Serializer` - data marshalling utilities for the preparation of full or incremental updates
- `PDF::Storage::Crypt` - decryption / encryption (V 2 & 3 RC4 only at this stage)
- `PDF::Writer` - for the creation or update of PDFs
- `PDF::DAO` - an intermediate Data Access and Object representation layer (DAO) to PDF data structures. It provides the base classes for PDF::DOM

## Example Usage

To create a one-page PDF that displays 'Hello, world!':
```
#!/usr/bin/env perl6
# creates t/example.pdf
use v6;
use PDF::DAO;
use PDF::DAO::Doc;

# shorthand: /'Foo' coerces the string 'Foo' to a PDF name object
sub prefix:</>($name){ PDF::DAO.coerce(:$name) };

my @MediaBox  = 0, 0, 420, 595;
my %Resources = :Procset[ /'PDF', /'Text'],
                :Font{
                    :F1{
                        :Type(/'Font'),
                        :Subtype(/'Type1'),
                        :BaseFont(/'Helvetica'),
                        :Encoding(/'MacRomanEncoding'),
                    },
                };

# build the document catalog, outlines and page tree
my $doc = PDF::DAO::Doc.new;
my $root     = $doc.Root       = { :Type(/'Catalog') };
my $outlines = $root<Outlines> = { :Type(/'Outlines'), :Count(0) };
my $pages    = $root<Pages>    = { :Type(/'Pages'), :@MediaBox, :%Resources, :Kids[], :Count(0), };

# add a single page whose content stream draws the text
my $Contents = PDF::DAO.coerce( :stream{ :decoded("BT /F1 24 Tf 100 250 Td (Hello, world!) Tj ET" ) });
$pages<Kids>.push: { :Type(/'Page'), :Parent($pages), :$Contents };
$pages<Count>++;

my $info = $doc.Info = {};
$info.CreationDate = DateTime.now;
$info.Producer = 'PDF-Tools';

$doc.save-as: 't/example.pdf';
```

Then to update the PDF, adding another page:

```
use v6;
use PDF::DAO;
use PDF::DAO::Doc;

my $doc = PDF::DAO::Doc.open: 't/example.pdf';
my $catalog = $doc<Root>;
my $Parent = $catalog<Pages>;

# append a second page to the existing page tree
my $Contents = PDF::DAO.coerce( :stream{ :decoded("BT /F1 16 Tf 90 250 Td (Goodbye for now!) Tj ET" ) } );
$Parent<Kids>.push: { :Type( :name<Page> ), :$Parent, :$Contents };
$Parent<Count>++;

my $info = $doc.Info //= {};
$info.ModDate = DateTime.now;

# write the changes as an incremental update
$doc.update;
```

## Description

A PDF file consists of data structures, including dictionaries (hashes), arrays, numbers and strings, plus streams for holding data such as images, fonts and general content.

PDF files are also indexed for random access and may have filters for stream compression, and for encryption of streams and strings.

They have a reasonably well specified structure. The document structure starts from the `Root` entry in the trailer dictionary, which is the main entry point into a PDF.

This module is based on the PDF Reference version 1.7 specification. It implements syntax, basic data-types, serialization and encryption rules as described in the first four chapters of the specification.
Read and write access to data structures is via direct manipulation of tied arrays and hashes.

`PDF::DAO` provides a set of class builder utilities to enable higher-level classes for general application development.

This is put to work in the companion module PDF::DOM (under construction), which contains a much more detailed set of classes to implement much of the remainder of the PDF specification.

## The Basics

PDF files are serialized as numbered indirect objects. The `t/example.pdf` file that we just wrote contains:

```
%PDF-1.3
%...(control characters)
1 0 obj <<
  /CreationDate (D:20151225000000Z00'00')
  /Producer (PDF-Tools)
>>
endobj
2 0 obj <<
  /Type /Catalog
  /Outlines 3 0 R
  /Pages 4 0 R
>>
endobj
3 0 obj <<
  /Type /Outlines
  /Count 0
>>
endobj
4 0 obj <<
  /Type /Pages
  /Count 1
  /Kids [ 5 0 R ]
  /MediaBox [ 0 0 420 595 ]
  /Resources <<
    /Font <<
      /F1 7 0 R
    >>
    /Procset [ /PDF /Text ]
  >>
>>
endobj
5 0 obj <<
  /Type /Page
  /Contents 6 0 R
  /Parent 4 0 R
>>
endobj
6 0 obj <<
  /Length 46
>>
stream
BT /F1 24 Tf 100 250 Td (Hello, world!) Tj ET
endstream
endobj
7 0 obj <<
  /Type /Font
  /Subtype /Type1
  /BaseFont /Helvetica
  /Encoding /MacRomanEncoding
>>
endobj
xref
0 8
0000000000 65535 f
0000000014 00000 n
0000000101 00000 n
0000000172 00000 n
0000000222 00000 n
0000000400 00000 n
0000000469 00000 n
0000000567 00000 n
trailer
<<
  /ID [ <4386dc7bc3489e418b44434e3a168843> <4386dc7bc3489e418b44434e3a168843> ]
  /Info 1 0 R
  /Root 2 0 R
  /Size 8
>>
startxref
673
%%EOF
```

The PDF is composed of a series of indirect objects. For example, the first object is:

```
1 0 obj <<
  /CreationDate (D:20151225000000Z00'00')
  /Producer (PDF-Tools)
>>
endobj
```

It's an indirect object with object number `1` and generation number `0`, with a `<<` ... `>>` delimited dictionary containing the producer and the date that the document was created.
This PDF dictionary is roughly equivalent to a Perl 6 hash:

```
{ :CreationDate("D:20151225000000Z00'00'"), :Producer("PDF-Tools"), }
```

The bottom of the PDF contains:

```
trailer
<<
  /ID [ <4386dc7bc3489e418b44434e3a168843> <4386dc7bc3489e418b44434e3a168843> ]
  /Info 1 0 R
  /Root 2 0 R
  /Size 8
>>
startxref
673
%%EOF
```

The `<<` ... `>>` delimited section is the trailer dictionary and the main entry point into the document. The entry `/Info 1 0 R` is an indirect reference to the first object (object number 1, generation 0) described above.

We can quickly put PDF Tools to work using a Perl 6 REPL, to better explore the document:

```
snoopy: ~/git/perl6-PDF-Tools $ perl6 -MPDF::DAO::Doc
> my $doc = PDF::DAO::Doc.open: "t/example.pdf"
ID => [CÜ{ÃHADCN:C CÜ{ÃHADCN:C], Info => ind-ref => [1 0], Root => ind-ref => [2 0]
> $doc.keys
(Root Info ID)
```

This is the root of the PDF, loaded from the trailer dictionary.

```
> $doc<Info>
CreationDate => D:20151225000000Z00'00', Producer => PDF-Tools;
```

That's the document information entry, commonly used to store basic meta-data about the document.

(PDF Tools has conveniently fetched indirect object 1 from the PDF when we dereferenced this entry.)

```
> $doc<Root>
Outlines => ind-ref => [3 0], Pages => ind-ref => [4 0], Type => Catalog
```

The trailer `Root` entry references the document catalog, which contains the actual PDF content. Exploring further, the catalog potentially contains a number of pages, each with content.

```
> $doc<Root><Pages>
Count => 1, Kids => [ind-ref => [5 0]], MediaBox => [0 0 420 595], Resources => Font => F1 => ind-ref => [7 0], Type => Pages
> $doc<Root><Pages><Kids>[0]
Contents => ind-ref => [6 0], Parent => ind-ref => [4 0], Procset => [PDF Text], Type => Page
> $doc<Root><Pages><Kids>[0]<Contents>
Length => 46
> $doc<Root><Pages><Kids>[0]<Contents>.decoded
BT /F1 24 Tf 100 250 Td (Hello, world!) Tj ET
>
```

The page `/Contents` entry is a PDF stream which contains graphical instructions. In the above example, these output the text `Hello, world!` at coordinates 100, 250.
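To make the content stream easier to read, here is a minimal Perl 6 sketch that groups its tokens into operators and their preceding operands. The `content-ops` helper is hypothetical and not part of PDF-Tools; it assumes a simplified stream containing only names, numbers, alphabetic operators and literal strings without nested or escaped parentheses.

```raku
use v6;

# Hypothetical helper (not part of PDF-Tools): split a decoded content
# stream into operator => operands pairs. Simplifying assumptions: only
# names, numbers, alphabetic operators and literal strings without nested
# or escaped parentheses.
sub content-ops(Str $content) {
    # a literal string "(...)" is one token; anything else is split on whitespace
    my @tokens = $content.comb(/ '(' <-[\)]>* ')' || \S+ /);
    my @ops;
    my @operands;
    for @tokens -> $token {
        if $token ~~ /^ <[A..Z a..z]>+ $/ {
            # a purely alphabetic token is an operator; it takes the pending operands
            @ops.push: $token => [@operands];
            @operands = ();
        }
        else {
            @operands.push: $token;
        }
    }
    return @ops;
}

my @ops = content-ops("BT /F1 24 Tf 100 250 Td (Hello, world!) Tj ET");
for @ops -> $op {
    say $op.key, ' <- ', $op.value.join(' ');
}
```

Read this way, `BT` and `ET` bracket a text object, `Tf` selects font `/F1` at size 24, `Td` moves the text position to 100, 250 and `Tj` shows the string. A real content stream parser has to handle many more cases than this sketch does.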
## Datatypes and Coercion

The `PDF::DAO` namespace provides roles and classes for the representation and manipulation of PDF objects.

```
use PDF::DAO::Stream;
my %dict = :Filter( :name );
my $obj-num = 123;
my $gen-num = 4;
my $decoded = "100 100 Td (Hello, world!) Tj";
my $stream-obj = PDF::DAO::Stream.new( :$obj-num, :$gen-num, :%dict, :$decoded );
say $stream-obj.encoded;
```

`PDF::DAO.coerce` is a method for the construction of objects. It is used internally to build objects from parsed AST data, e.g.:

```
use v6;
use PDF::Grammar::Doc;
use PDF::Grammar::Doc::Actions;
use PDF::DAO;
my $actions = PDF::Grammar::Doc::Actions.new;
PDF::Grammar::Doc.parse("<< /Type /Pages /Count 1 /Kids [ 4 0 R ] >>", :rule