Apache/Dynagzip version 0.03
============================

=head1 Apache::Dynagzip - mod_perl extension for C. Version 0.04

=head1 ABSTRACT

This Apache handler provides dynamic gzip compression within a chunked
outgoing stream. The implementation of this handler can compress outgoing
HTML content by 3 to 20 times, while controlling the size of the outgoing
chunks and the lifetime of the locally cached copy of the transferred
document.

The handler can work within the Apache::Filter chain as well as perform as a
standalone handler for the content generation phase. The standalone
implementation of this handler can be especially helpful for transferring a
huge static HTML file over a slow connection to a modern browser. A major
benefit is that the browser can begin to decompress the first part of the
HTML file while the server is still compressing the remainder of the
document.

This handler is particularly useful for compressing outgoing Web content
which is dynamically generated on the fly (using templates, DB data, XML,
etc.), when at the time of the request it is impossible to determine the
length of the document to be transmitted. Support for Perl, Java, and C
source generators is provided.

Besides the benefits of reduced file size, this approach gains efficiency
from being able to overlap the various phases of generation, compression,
transmission, and decompression. In fact, the browser can start to
decompress a document which has not yet been completely generated. The
handler uses an internal buffer to accumulate a full chunk of data before
transmission begins.

=head1 INTRODUCTION

From a historical point of view this package was developed mainly to
compress the output of a proprietary CGI binary written in C that was widely
used by Outlook Technologies, Inc. to deliver dynamic HTML content over the
Internet using HTTP/1.0 since the mid-'90s.

We were then presented with the challenge of using the compression features
of HTTP/1.1 on busy production servers, especially those serving heavy
traffic on the virtual hosts of popular American broadcasting clients. Our
very first attempts to apply the static gzip approach to the dynamic content
helped us to scale the bandwidth of the BBC backend effectively, at the cost
of significantly increased latency of content delivery. Actually, the total
delay of the content's download (up to the moment when the page is able to
run the onLoad() JavaScript) did not increase even on fast connections, and
it decreased significantly on dial-ups. Nevertheless, the BBC editors were
not too happy to wait up to a minute in front of a sleeping screen while the
backend updated some hundreds of kilobytes of the local content...

That was why I came up with the idea of using chunked transmission of the
gzipped content, overlapping server-side data creation/compression, data
transmission, and client-side data decompression/presentation in real time,
and providing the end users with partially displayed content as soon as
possible under the particular conditions of the user's connection.

At the time we decided to go for dynamic compression there was no
appropriate software on the market which could be customized to target our
goals effectively. Even later, in February 2002, Nicholas Oxhøj wrote to the
mod_perl mailing list about his experience of trying to find an Apache
gzipper for a streaming outgoing content:

I<"... I have been experimenting with all the different Apache compression
modules I have been able to find, but have not been able to get the desired
result. I have tried Apache::GzipChain, Apache::Compress, mod_gzip and
mod_deflate, with different results. One I cannot get to work at all. Most
work, but seem to collect all the output before compressing it and sending
it to the browser...>

I<... Wouldn't it be nice to have some option to specify that the handler
should flush and send the currently compressed output every time it had
received a certain amount of input or every time it had generated a certain
amount of output?..>

I<... So I am basically looking for anyone who has had any success in
achieving this kind of "streaming" compression, who could direct me at an
appropriate Apache module.">

Unfortunately, Apache::Dynagzip was not yet publicly available at that
time...

Since its release, this handler has been most useful when you need to
compress outgoing Web content which is dynamically generated on the fly
(using templates, DB data, XML, etc.), and when at the moment of the request
it is impossible to determine the length of the document you have to
transmit. You may benefit additionally from the fact that the handler begins
the transmission of the compressed data as soon as the very first portion of
outgoing data arrives from the main data source, at a moment when the
(possibly big) source HTML document has probably not yet been generated in
full. Thus, the transmission partly overlaps the creation of the document.
On the other hand, the internal buffer within the handler prevents Apache
from creating chunks that are too short.

=head1 DESCRIPTION

The main purpose of this package is to serve the Content Generation Phase
within a mod_perl enabled Apache server, providing dynamic on-the-fly
compression of web content. This is done with the use of the zlib library
via the Compress::Zlib perl interface, to serve the requests of those
browsers which understand the gzip format and can decompress this type of
data on the fly. In fact, this handler mainly serves as a kind of
customizable filter of the HTML content for Apache.

It is supposed to be used in the Apache::Filter chain mostly to serve
outgoing content dynamically generated on the fly by Perl and/or Java. It is
also able to serve regular CGI binaries (written in C, for example) as a
standalone handler outside the Apache::Filter chain. As an extra option,
this handler can be used to dynamically compress huge static files and to
transfer the gzipped content in the form of a chunked stream back to the
client browser. For the latter purpose the Apache::Dynagzip handler should
be used as a standalone handler outside the Apache::Filter chain too.

In order to better serve older web clients (and known bugs within the modern
ones) the "extra light" compression is provided independently to remove
leading blank spaces and/or blank lines from the outgoing web content. This
"extra light" compression can be combined with the main gzip compression,
when necessary.

The list of the features of this approach includes:

· Control over the size of content chunks generated and compressed on the
  fly.

· Support for any Perl, Java, or C/C++ CGI application to provide on-the-fly
  dynamic compression of outbound content.

· Optional control over the duration of the content's life in the client's
  local cache.

· Controllable "extra light" compression for all browsers, including older
  ones that cannot decompress gzipped content.

· Optional support for server-side caching of the dynamically generated
  content.
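
For orientation, here is a minimal httpd.conf sketch of the two typical
setups mentioned above (the Apache::Filter chain and the standalone
handler). It is only an illustration assembled from the directives described
in this document, not the module's reference configuration; the file
pattern, the /cgi-bin/report location, and the Apache::RegistryFilter stage
are assumptions made for the example:

  # Sketch 1 (illustrative): Apache::Dynagzip as the last filter in an
  # Apache::Filter chain, compressing content generated by a Perl script.
  <Files ~ "\.pl$">
      SetHandler  perl-script
      PerlSetVar  Filter On
      PerlHandler Apache::RegistryFilter Apache::Dynagzip
      PerlSetVar  LightCompression On
  </Files>

  # Sketch 2 (illustrative): standalone handler serving a CGI binary
  # outside the Apache::Filter chain.
  <Location /cgi-bin/report>
      SetHandler  perl-script
      PerlHandler Apache::Dynagzip
      PerlSetVar  BinaryCGI On
  </Location>

The other PerlSetVar directives described below (minChunkSizeSource,
minChunkSize, minChunkSizePP, pageLifeTime) can be added to either container
in the same way. See the POD of the main module for the authoritative
configuration examples.
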
=head2 Chunking Features

This handler overrides the default Apache behavior and keeps its own control
over the chunk size whenever possible. In fact, the handler provides only
soft control over the chunk size: it never cuts the incoming string in order
to create a chunk of a particular size.

In the case of gzipped output the minimum size of the chunk is under the
control of the internal variable

  minChunkSize

In the case of uncompressed output, or of the "extra light" compression
only, the minimum size of the chunk is under the control of the internal
variable

  minChunkSizePP

In this version, for your convenience, the handler provides the defaults:

  minChunkSize = 8
  minChunkSizePP = 8192

You may overwrite the default values of these variables in your httpd.conf
if necessary.

Note: The internal variable minChunkSize should be treated carefully
together with the minChunkSizeSource (see Compression Features).

This handler does not keep control over the chunk size when it serves an
internally redirected request. An appropriate warning is placed in the
error_log in this case.

=head2 Compression Features

There are two types of compression which can be applied to the outgoing
content by this handler, in any appropriate combination:

  - "extra light" compression
  - gzip compression

The "extra light" compression is provided to remove leading blank spaces
and/or blank lines from the outgoing web content. It is Off by default. It
can be turned On with the statement

  PerlSetVar LightCompression On

in your httpd.conf. Any other value turns the "extra light" compression Off.

The gzip format is described in rfc1951 and rfc1952. This type of
compression is applied when the client is recognized as being able to
decompress the gzip format on the fly. In this version the decision is based
on whether or not the client sends the Accept-Encoding: gzip HTTP header.
(Please let me know if you have a better idea about that...)

Usually, when the gzip compression is in effect, the handler keeps control
over the size of the chunks and over the compression ratio using two
internal variables which can be set in your httpd.conf:

  minChunkSizeSource
  minChunkSize

The minChunkSizeSource defines the minimum length of the source stream which
zlib may accumulate in its internal buffer.

Note: The compression ratio depends on the length of the data accumulated in
that buffer; the more data we keep over there, the better the ratio that
will be achieved...

When the length defined by the minChunkSizeSource is exceeded, the handler
flushes the internal buffer of zlib and transfers the accumulated portion of
the compressed data to its own internal buffer in order to create
appropriate chunk(s). This buffer is not necessarily transferred to Apache
immediately. The decision is under the control of the minChunkSize internal
variable: when the size of the buffer exceeds the value of minChunkSize, the
handler chunks the internal buffer and transfers the accumulated data to the
client. This approach helps to achieve effective compression combined with
limited latency.
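
To make the interplay of these two thresholds concrete, here is a minimal
Perl sketch of the buffering idea. It is my own illustration under stated
assumptions, not the handler's actual code: it uses the Compress::Zlib
deflate interface, a hypothetical $send_chunk callback stands in for handing
a chunk to Apache, and the real handler additionally wraps the deflate
stream with the gzip header and trailer:

  use Compress::Zlib;

  my $minChunkSizeSource = 32768;  # source bytes to see before flushing zlib
  my $minChunkSize       = 8;      # compressed bytes to hold before sending

  my ($d, $status) = deflateInit();   # zlib deflation stream
  my $source_seen = 0;                # source bytes seen since the last flush
  my $chunk_buf   = '';               # the handler's own outgoing buffer

  # Called for every portion of content received from the source generator.
  # $send_chunk is a hypothetical callback that emits one HTTP chunk.
  sub take_source_portion {
      my ($data, $send_chunk) = @_;

      my ($out, $st) = $d->deflate($data);   # zlib may keep data buffered
      $chunk_buf   .= $out;
      $source_seen += length $data;

      if ($source_seen >= $minChunkSizeSource) {
          # Enough source accumulated: force zlib to emit what it holds.
          ($out, $st) = $d->flush(Z_SYNC_FLUSH);
          $chunk_buf  .= $out;
          $source_seen = 0;
      }
      if (length($chunk_buf) >= $minChunkSize) {
          # Enough compressed bytes accumulated: send them as one chunk.
          $send_chunk->($chunk_buf);
          $chunk_buf = '';
      }
  }

This also explains the log below: the 10-byte gzip header alone already
satisfies a small minChunkSize and goes out as the first chunk, while the
document data is still waiting in zlib's internal buffer.
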
For example, when I use

  PerlSetVar minChunkSizeSource 16000
  PerlSetVar minChunkSize 8

in my httpd.conf to compress dynamically generated content of about 54,000
bytes, the client side log

  C05 --> S06 GET /pipe/pp-pipe.pl/big.html?try=chunkOneMoreTime HTTP/1.1
  C05 --> S06 Accept: */*
  C05 --> S06 Accept-Language: en-us
  C05 --> S06 Accept-Encoding: gzip, deflate
  C05 --> S06 User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)
  C05 --> S06 Host: devl4.outlook.net
  C05 --> S06 Accept-Charset: ISO-8859-1
  == Body was 0 bytes ==
  ## Sockets 6 of 4,5,6 need checking ##
  C05 <-- S06 HTTP/1.1 200 OK
  C05 <-- S06 Date: Thu, 21 Feb 2002 20:01:47 GMT
  C05 <-- S06 Server: Apache/1.3.22 (Unix) Debian GNU/Linux mod_perl/1.26
  C05 <-- S06 Transfer-Encoding: chunked
  C05 <-- S06 Vary: Accept-Encoding
  C05 <-- S06 Content-Type: text/html; charset=iso-8859-1
  C05 <-- S06 Content-Encoding: gzip
  C05 <-- S06
  == Incoming Body was 6034 bytes ==
  == Transmission: text gzip chunked ==
  == Chunk Log ==
  a (hex) = 10 (dec)
  949 (hex) = 2377 (dec)
  5e6 (hex) = 1510 (dec)
  5c5 (hex) = 1477 (dec)
  26e (hex) = 622 (dec)
  0 (hex) = 0 (dec)
  == Latency = 0.990 seconds, Extra Delay = 0.110 seconds
  == Restored Body was 54655 bytes ==

shows that the first chunk consists of the gzip header only (10 bytes). This
chunk was sent as soon as the handler received the first portion of the data
generated by the foreign CGI script. The data itself at that moment was
still stored in zlib's internal buffer, because the minChunkSizeSource is
big enough.

Note: The longer we allow zlib to keep its internal buffer, the better the
compression ratio it makes for us... In this example we obtained a
compression ratio of about 9 times (54,655 bytes restored from 6,034 bytes
transferred).

In this version the handler provides the defaults:

  minChunkSizeSource = 32768
  minChunkSize = 8

for your convenience.

=head2 Filter Chain Features

As a member of the Apache::Filter chain, the Apache::Dynagzip handler is
supposed to be the last filter in the chain because of the nature of its
functions: it produces the full set of required HTTP headers followed by the
gzipped content within the chunked stream.

No other handler in the Apache::Filter chain is allowed to issue

  $r->send_http_header();

or

  $r->send_cgi_header();

The only acceptable HTTP information from old CGI applications is the
Content-Type CGI header, which should be the first line, followed by an
empty line. It is optional in accordance with the CGI specification, and
many known old scripts ignore this option, which should default to
text/html. CGI/1.1 (see:
http://cgi-spec.golux.com/draft-coar-cgi-v11-03-clean.html ) makes life even
more complicated for system administrators. This handler is partially
CGI/1.1 compatible, except for the internal redirect option, which is not
guaranteed.

=head2 POST Request Features

I have to serve the POST request option for the regular CGI binary only,
because in this case the handler is standing alone to serve the data flow in
both directions at the moment when the C is tied into Apache and cannot be
exposed to the CGI binary transparently. To solve the problem I replace POST
with GET internally, doing the required transformations of the incoming
data. This could cause a problem when you have a huge incoming stream from
your client (more than 4K bytes).
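
A minimal sketch of such an internal POST-to-GET alteration, assuming the
mod_perl 1.x request API and an urlencoded form body, might look like this;
it is an illustration of the technique, not the handler's actual code:

  use Apache::Constants qw(M_GET);

  # Hypothetical helper: turn the current POST request into an internal GET,
  # moving the form data from the request body into the query string.
  sub post_to_get {
      my $r = shift;
      my $content = $r->content;               # slurp the urlencoded POST body
      $r->method('GET');                       # rewrite the request method...
      $r->method_number(M_GET);
      $r->headers_in->unset('Content-length'); # ...and drop the body header
      $r->args($content);                      # former body becomes the query string
      return $r;
  }

Because the former request body ends up in the query string, a very large
incoming stream (the more-than-4K case mentioned above) may not survive this
transformation intact.
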
=head2 Control over the Client Cache

(see rfc2068): The Expires entity-header field gives the date/time after
which the response should be considered stale. A stale cache entry may not
normally be returned by a cache (either a proxy cache or a user agent cache)
unless it is first validated with the origin server (or with an intermediate
cache that has a fresh copy of the entity). The format is an absolute date
and time as defined by HTTP-date in section 3.3; it MUST be in rfc1123-date
format.

This handler creates the Expires HTTP header, adding the pageLifeTime to the
date-time of the request. The internal variable pageLifeTime has the default
value

  pageLifeTime = 300 # sec.

which can be overwritten in httpd.conf.

=head2 Support for the Server-Side Cache

To support the Server-Side Cache I place the reference to the dynamically
generated document into the C when the Server-Side Cache Support is ordered.
The referenced document may already be compressed with C, if that was
ordered for the current request. The effective C compression is supposed to
take place within the C stage of the request processing.

From the historical point of view, the development of this handler was a
stage of a wider project, named C, which is supposed to provide content
caching capabilities to a wide range of arbitrary sites whose content is
generated on the fly for some reason. In that project the Apache::Dynagzip
handler is used in a dynamically generated chain of Apache handlers for the
various phases of request processing, to filter the Content Generation Phase
of the appropriate request. To be compatible with the C flow chart, the
Apache::Dynagzip handler recognizes the optional reference in the C, named
C. When the C is defined within the C table, the Apache::Dynagzip handler
creates one more reference, named C, within the C to reference the full body
of the uncompressed incoming document for the Post Request Processing Phase.
You usually should not care about this feature of the Apache::Dynagzip
handler unless you use it in your own chain of handlers for the various
phases of request processing.

=head1 INSTALLATION

The installation consists of two steps:

  - Installation to your Perl Library
  - Installation to your Apache Server

=head2 Installation to your Perl Library

Use the regular procedure to install this module to your Perl Library. When
you have your local copy of the package, type the following:

  perl Makefile.PL
  make
  make test
  make install

Note: You should be root to succeed with the last step...

To install the package from CPAN try to run

  perl -MCPAN -e "install Apache::Dynagzip"

on your UNIX machine.

=head2 Installation to your Apache Server

Edit your httpd.conf using the recommendations and examples from the POD of
the main module.

=head1 CUSTOMIZATION

Do your best to avoid the use of this handler for internally redirected
requests. It does not help much in this case. Read your error_log carefully
to find the appropriate warnings.

Tune your httpd.conf carefully to take the most from the opportunities
offered by this handler.

To select the type of the content's source, follow these rules:

- Use the Apache::Filter Chain to serve any Perl or Java generated content.
  When your source is a very old CGI application which fails to provide the
  Content-Type CGI header, use PerlSetVar UseCGIHeadersFromScript Off in
  your httpd.conf to overwrite the Document Content-Type to the default
  text/html.

- You may use the Apache::Filter Chain to serve other sources, when you know
  what you are doing. You might wish to write your own handler and include
  it into the Apache::Filter Chain, emulating the CGI outgoing stream.

- Use the directive PerlSetVar BinaryCGI On to indicate that the
  source-generator is supposed to be a CGI binary. Don't use the
  Apache::Filter Chain in this case.
  Support for CGI/1.1 headers is always On for this type of source.

- Plain file transfer will be assumed when you use the standalone handler
  with no BinaryCGI directive. The Document Content-Type is determined by
  Apache in this case.

To control the compression ratio and the minimum size of the chunk for
gzipped content you can optionally use the directives

  PerlSetVar minChunkSizeSource
  PerlSetVar minChunkSize

For example, you can try

  PerlSetVar minChunkSizeSource 32768
  PerlSetVar minChunkSize 8

which are the defaults in this version. Indeed, you can use your own values,
when you know what you are doing...

Note: You can improve the compression ratio when you increase the value of
minChunkSizeSource. You can control the _minimum_ size of the chunk with the
minChunkSize. Try to play with these values to find your best combination!

To control the minimum size of the chunk for uncompressed content you can
optionally use the directive

  PerlSetVar minChunkSizePP

To control the "extra light" compression you can optionally use the
directive

  PerlSetVar LightCompression

To turn the "extra light" compression On you can use the directive

  PerlSetVar LightCompression On

Any other value turns the "extra light" compression Off (default).

To control the document's lifetime in the client's local cache you can
optionally use the directive

  PerlSetVar pageLifeTime

where the value stands for the life length in seconds.

  PerlSetVar pageLifeTime 300

is the default in this version.

=head1 DEPENDENCIES

This module requires these other modules and libraries:

  Apache::Constants;
  Apache::File;
  Apache::Filter 1.019;
  Apache::Log;
  Apache::URI;
  Apache::Util;
  Fcntl;
  FileHandle;
  Compress::Zlib 1.16;

Note: Compress::Zlib 1.16 requires the Info-zip zlib 1.0.2 or better (it is
NOT compatible with versions of zlib <= 1.0.1). The zlib compression library
is available at http://www.gzip.org/zlib/

I did not test this handler with previous versions of Apache::Filter.
Please let me know if you have a chance to do that...

=head1 AUTHOR

Slava Bizyayev <slava@cpan.org> - Freelance Software Developer & Consultant.

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2002 Slava Bizyayev. All rights reserved.

This package is free software. You can use it, redistribute it, and/or
modify it under the same terms as Perl itself.

The latest version of this module can be found on CPAN.