=pod
=head1 NAME
Spreadsheet::XLSX::Reader::LibXML - Read xlsx spreadsheet files with LibXML
=begin html
=end html
=head1 SYNOPSIS
The following uses the 'TestBook.xlsx' file found in the t/test_files/ folder of the package
#!/usr/bin/env perl
use strict;
use warnings;
use Spreadsheet::XLSX::Reader::LibXML;
my $parser = Spreadsheet::XLSX::Reader::LibXML->new();
my $workbook = $parser->parse( 'TestBook.xlsx' );
if ( !defined $workbook ) {
die $parser->error(), "\n";
}
for my $worksheet ( $workbook->worksheets() ) {
my ( $row_min, $row_max ) = $worksheet->row_range();
my ( $col_min, $col_max ) = $worksheet->col_range();
for my $row ( $row_min .. $row_max ) {
for my $col ( $col_min .. $col_max ) {
my $cell = $worksheet->get_cell( $row, $col );
next unless $cell;
print "Row, Col = ($row, $col)\n";
print "Value = ", $cell->value(), "\n";
print "Unformatted = ", $cell->unformatted(), "\n";
print "\n";
}
}
last;# In order not to read all sheets
}
###########################
# SYNOPSIS Screen Output
# 01: Row, Col = (0, 0)
# 02: Value = Category
# 03: Unformatted = Category
# 04:
# 05: Row, Col = (0, 1)
# 06: Value = Total
# 07: Unformatted = Total
# 08:
# 09: Row, Col = (0, 2)
# 10: Value = Date
# 11: Unformatted = Date
# 12:
# 13: Row, Col = (1, 0)
# 14: Value = Red
# 16: Unformatted = Red
# 17:
# 18: Row, Col = (1, 1)
# 19: Value = 5
# 20: Unformatted = 5
# 21:
# 22: Row, Col = (1, 2)
# 23: Value = 2017-2-14 #(shows as 2/14/2017 in the sheet)
# 24: Unformatted = 41318
# 25:
# More intermediate rows ...
# 82:
# 83: Row, Col = (6, 2)
# 84: Value = 2016-2-6 #(shows as 2/6/2016 in the sheet)
# 85: Unformatted = 40944
###########################
=head1 DEPRECATION
I. The L with L wound up being insurmountable.
I can only infer for all cases but release v0.38.22 will not build on one of my windows machines
and on an AWS server used at $work. Netbsd doesn't have a single build since release. I still
can't distil the failure mode down to a case that I can submit but the new package with L
is running where this one didn't. I opened the package in PAUSE to co-maintainership and I would
not be concerned if someone wanted to take this over but for now this is an orphan package.
I plan on pulling it from CPAN altogether after a while. The new package does run slower for
large files since it uses are pure perl parser. On the other hand a fair amount of the TODO
and bugs for this package got fixed in the new release. I invite you to try it if you liked
this package.>
=head1 DESCRIPTION
This is an object oriented just in time Excel spreadsheet reader package that should
parse all excel files with the extentions .xlsx, .xlsm, .xml I (SpreadsheetML)> that
can be opened in Excel 2007+ applications. The quick-start example provided in the
SYNOPSIS attempts to follow the example from L (.xls binary
file reader) as close as possible. There are additional methods and other approaches
that can be used by this package for spreadsheet reading but the basic access to data
from newer xml based Excel files can be as simple as above.
The intent is to fully document all public functions but you may need to go to sub
modules to find more detailed documentation. This package operates on the Excel file
with three primary tiers of classes. Each level provides object methods to access the
next level down.
=over
Workbook level (This doc)
* General attribute settings that affect parsing of the file in general
* The place to L
* Object methods to retreive document level metadata
* Object methods to return specific Worksheet instances for data retrieval
* The place to L
=over
L
* Object methods to return specific cell instances/L
* Access to some worksheet level format information (more access pending)
* The place to L
data output formats targeting specific cell ranges
=over
L
* Access to the cell contents
* Access to the cell formats (more access pending)
=back
=back
=back
There are some differences from the L package. For instance
in the L the '$parser' and the '$workbook' are actually the same
class for this package. You could therefore combine both steps by calling ->new with
the 'file' attribute called out. The test for load success would then rely on the
method L. Afterward it is still possible to call ->error
on the instance. Another difference is the data formatter and specifically date
handling. This package allows for a simple pluggable custom output format that is
very flexible as well as handling dates older than 1-January-1900. I leveraged
coercions from L to do this but anything that follows
that general format will work here. Additionally, this is a L based package
and possible I also use interfaces for each of the sub-roles/classes used in parsing.
This should alow you to change only the part you want to perform differently if you
have the desire to tinker with the guts. Read the full documentation for all
opportunities!
In the realm of extensibility this package uses L which has multiple ways
to read an XML file but this release only has an L parser option.
Future iterations could include a DOM parser option but that is a very low priority.
Currently this package does not provide the same access to the visual format elements
provided in L. That is on the longish and incomplete TODO list.
To skip the why and nitty gritty of design and jump to implementation details go to the
L section.
=head2 Architecture Choices
This is yet another package for parsing Excel xml or 2007+ workbooks. The goals of this
package are five fold. First, as close as possible produce the same output as is visible
in an excel spreadsheet with exposure to underlying settings from Excel. Second, adhere
as close as is reasonable to the L API (where it doesn't conflict
with the first objective) so that less work would be needed to integrate ParseExcel and
this package. An addendum to the second goal is this package will not expose elements of
the object hash for use by the consuming program. This package will either return an
unblessed hash with the equivalent elements to the Spreadsheet::ParseExcel output (instead
of a class instance) or it will provide methods to provide these sets of data. The third
goal is to provide an XLSX sheet parser that is built on L. The other two
primary options for XLSX parsing on CPAN use either a one-off XML parser (L)
or L (L). In general if either of them already work for
you without issue then there is no reason to switch to this package. Fourth, excel files
get abused in the wild. They get abused by humans and they get abused by scripts. In
general the Excel application handles this mangling gracefully. The goal here is to be able
to read any xml based spreadsheet Excel can read. Please L where this is not true
to my github repo so I can work to improve this package. If you don't want your test case
included with the distribution I will use it to improve the package without publishing it.
Fifth (and finally), the design of this package is targeted at handling as large of an Excel
file as possible. In general this means that design decisions will generally sacrifice speed
to keep RAM consumption low. Specifically this spreadsheet parser does not read the file into
memory completely when it is opened. Since the data in the sheet is parsed just in time the
information that is not contained in the primary meta-data headers will not be available for
review L.
In cases where the parser has made choices that prioritize speed over RAM savings there
will generally be an L.
All in all this package solves many of the issues I found parsing Excel in the wild. I
hope it solves some of yours as well.
=head2 Warnings
B<1.>This package uses L. Not all versions of Archive::Zip work for everyone.
I have tested this with Archive::Zip 1.30. Please let me know if this does not work with a
sucessfully installed (read passed the full test suit) version of Archive::Zip newer than that.
B<2.> Not all workbook sheets (tabs) are created equal! Some Excel sheet tabs are only a
chart. These tabs are 'chartsheets'. The methods with 'worksheet' in the name only act on
the sub set of tabs that are worksheets. Future methods with 'chartsheet' in the name will
focus on the subset of sheets that are chartsheets. Methods with just 'sheet' in the name
have the potential to act on both. The documentation for the chartsheet level class is found
in L (still under construction). All chartsheet
classes do not provide access to cells.
B<3.> This package supports reading xlsm files (Macro enabled Excel 2007+ workbooks).
xlsm files allow for binaries to be embedded that may contain malicious code. However, other
than unzipping the excel file no work is done by this package with the sub-file 'vbaProject.bin'
containing the binaries. This package does not provide an API to that sub-file and I have no
intention of doing so. Therefore my research indicates there should be no risk of virus activation
while parsing even an infected xlsm file with this package but I encourage you to use your own
judgement in this area. B>
B<4.> This package will read some files with 'broken' xml. In general this should be
transparent but in the case of the maximum row value and the maximum column value for a
worksheet it can cause some surprising problems. This includes the possibility that the maximum
values are initially stored as 'undef' if the sheet does not provide them in the metadata as
expected. The answer to the methods L and
L will then change as more of the sheet is
parsed. The parser improves these values as information is available based on the dimensional
scope of the users cell parsing. These values are generally never available in Excel 2003 xml files.
The primary cause of these broken XML elements in Excel 2007+ files are non-XML applications writing
to the excel spreadsheet. You can use the attribute L or
the methods L or
L as alternates for pre-testing
for boundaries when iterating.
B<5.> Version v0.40.2 changes the way file caching is turned on and off. It also changes the
way it is set when starting an instance of this package. If you did not turn off caching
explicitly before this release there should no be a problem with this change. The goal is to
automatically differentiate large files and small files and L
in a targeted manner in response to larger file sizes. This should allow larger spreadsheets that
may have exceeded available RAM to run (slowly) when they didn't run at all before without forcing
small sheets to run too much slower. However, if you do have caching turned off in your code using
the old Boolean setting this package will now see it, fix it upon load, and emit a warning. I will
still be tweaking this setting over the next few releases. This warning will stay till 3/1/2017
and then the old callout will no longer be supported.
B<6.> Version v0.40.2 introduces the L attribute and will start the deprication of the
L and L attributes as well as the following methods:
L, L, L,
and L. This change is intended to remove an overly complex
set of dependancies that was causing trouble for garbage collection on cleanup. Please use
the L attribute and the L methods as replacements moving
forward. Support for backwards compatible use of the old attributes and methods will be removed
after 3/1/2017.
B<7.> Version v0.40.2 introduces support for L
(Excel 2003) .xml extention documents. These documents should include
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
somewhere in the header to indicate their intended format. This change does introduce a lot of
behind the scenes re-plumbing but the top level tests all stayed the same. This means that for
.xlsx and .xlsm extentions there should not be any obvious changes or (hopefully) significant
new bugs. I. However, to get this release out and rolling I don't have a
full set of tests for the .xml extention paths and Microsofts documentation for that format is
spotty in some relevant areas (I still don't know what I don't know) so please L any cases that appear to behave
differently than expected for .xml extention files that are readable by the Excel application.
I am also interested in cases where an out of memory error occurs with an .xml extension file. This
warning will stay till 3/1/2017.
=head2 Attributes
Data passed to new when creating an instance. For modification of these attributes see the
listed 'attribute methods'. For general information on attributes see
L. For ways to manage the workbook when opened see the
L. For additional lesser used workbook options
see L.
B
$workbook_instance = Spreadsheet::XLSX::Reader::LibXML->new( %attributes )
I$file_handle, $formatter )> method before the rest of the package
can be used.>
=head3 file_name
=over
B This attribute holds the full file name and path for the xlsx|xlsm file to be
parsed.
B no default - either this or a L must be provided to
read a file
B any unencrypted xlsx|xlsm file that can be opened in Microsoft Excel.
B Methods provided to adjust this attribute
=over
B
=over
B change the file name value in the attribute (this will reboot
the workbook instance)
=back
B
=over
B this is used to see if the workbook loaded correctly using the
file_name option to open an Excel .xlsx file.
=back
=back
=back
=head3 file_handle
=over
B This attribute holds a copy of the passed file handle reference.
B no default - either this or a L must be provided to read
a file
B any unencrypted xlsx file handle that can be opened in Microsoft Excel
B Methods provided to adjust this attribute
=over
B
=over
B change the set file handle (this will reboot the workbook instance)
=back
B
=over
B this is used to see if the workbook loaded correctly when using the
file_handle option to open an Excel .xlsx file.
=back
=back
=back
=head3 error_inst
=over
B This attribute holds an 'error' object instance. It should have several
methods for managing errors. Currently no error codes or error language translation
options are available but this should make implementation of that easier.
B a L instance with the attributes set
as;
( should_warn => 0 )
B The minimum list of methods to implement for your own instance is;
error set_error clear_error set_warnings if_warn
The error instance must be able to extract the error string from a passed error
object as well. For now the current implementation will attempt ->as_string first
and then ->message if an object is passed.
B Methods provided to adjust this attribute
=over
B
=over
B returns this instance
=back
B
=over
B delegated method from the class used to get the most recently
logged error string
=back
B
=over
B delegated method from the class used to set a new error string
(or pass an error object for extraction of the error string)
=back
B
=over
B delegated method from the class used to clear the current error
string
=back
B
=over
B delegated method from the class used to turn on or off real time
warnings when errors are set
=back
B
=over
B delegated method from the class used to extend this package and
see if warnings should be emitted.
=back
B
=over
B delegated method from the class used to turn on or off the L
'longmess'for error messages
=back
B
=over
B delegated method from the class used to understand the current state
the longmess concatenation for error messages
=back
=back
=back
=head3 sheet_parser
=over
B This sets the way the .xlsx file is parsed. For now the only
choice is 'reader'.
B 'reader'
B 'reader'
B Methods provided to adjust this attribute
=over
B
=over
B the way to change the parser type
=back
B
=over
B returns the currently set parser type
=back
=back
=back
=head3 count_from_zero
=over
B Excel spreadsheets count from 1. L
counts from zero. This allows you to choose either way.
B 1
B 1 = counting from zero like Spreadsheet::ParseExcel,
0 = Counting from 1 like Excel
B Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
B
=over
B a way to change the current attribute setting
=back
=back
=back
=head3 file_boundary_flags
=over
B When you request data to the right of the last column or below
the last row of the data this package can return 'EOR' or 'EOF' to indicate that
state. This is especially helpful in 'while' loops. The other option is to
return 'undef'. This is problematic if some cells in your table are empty which
also returns undef. What is determined to be the last column and row is determined
by the attribute L.
B 1
B 1 = return 'EOR' or 'EOF' flags as appropriate, 0 = return undef when
requesting a position that is out of bounds
B Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
B
=over
B a way to change the current attribute setting
=back
=back
=back
=head3 empty_is_end
=over
B The excel convention is to read the table left to right and top
to bottom. Some tables have an uneven number of columns with real data from row
to row. This allows the several methods that excersize a 'next' function to wrap
after the last element with data rather than going to the max column. This also
triggers 'EOR' flags after the last data element and before the sheet max column
when not implementing 'next' functionality.
B 0
B 0 = treat all columns short of the max column for the sheet as being in
the table, 1 = treat all cells after the last cell with data as past the end of
the row. This will be most visible when
L or next functionality is
used in the context of the L attribute is on.
B Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
B
=over
B a way to set the current attribute setting
=back
=back
=back
=head3 values_only
=over
B Excel will store information about a cell even if it only contains
formatting data. In many cases you only want to see cells that actually have
values. This attribute will change the package behaviour regarding cells that have
formatting stored against that cell but no actual value.
B 0
B 1 = skip cells with formatting only and treat them as completely empty,
0 = return informat about cells that only contain formatting
B Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
B
=over
B a way to set the current attribute setting
=back
=back
=back
=head3 from_the_edge
=over
B Some data tables start in the top left corner. Others do not. I
don't reccomend that practice but when aquiring data in the wild it is often good
to adapt. This attribute sets whether the file reads from the top left edge or from
the top row with data and starting from the leftmost column with data.
B 1
B 1 = treat the top left corner of the sheet as the beginning of rows and
columns even if there is no data in the top row or leftmost column, 0 = Set the
minimum row and minimum columns to be the first row and first column with data
B Methods provided to adjust this attribute
=over
B
=over
B a way to set the current attribute setting
=back
=back
=back
=head3 cache_positions
=over
B This parse can be slow. It does this by trading processing and
file storage for RAM usage but that is probably not the average users choice.
Currently four of the files can implement selective caching. The setting for
this attribute takes a hash ref with the file indicators as keys and the max
file size in bytes as the value. When the sub file handle exceeds that size
then caching for that subfile is turned off. The default setting shows an
example with the four available cached size.
B This behaviour changed with v0.40.2. Prior to that this setting
accepted a boolean value that turned all caching on or off universally. If
a boolean value is passed a deprication warning will be issued and the input
will be changed to this format. 'On' will be converted to the default caching
levels. A boolean 'Off' is passed then the package will set all maximum caching
levels to 0.
B
{
sharedStrings => 5242880,# 5 MB
styles => 5242880,# 5 MB
worksheet_interface => 5242880,# 5 MB
chartsheet_interface => 5242880,# 5 MB
}
B Methods provided to adjust this attribute
=over
B
=over
B read the attribute
=back
B
=over
B return the max file size allowed to cache for the indicated $target_file
=back
B $max_file_size )>
=over
B set the $max_file_size to be cached for the indicated $target_file
=back
B
=over
B returns true if one of the four allowed files is passed at $target_file
=back
=back
=back
=head3 formatter_inst
=over
B This is the attribute containing the formatter class. In general the
default value is sufficient. However, If you want to tweak this a bit then review the
L. It does include
a role that interprets the excel L
into a L coercion.
B An instance build from MooseX::ShortCut::BuildInstance with the following
arguments
{
superclasses => ['Spreadsheet::XLSX::Reader::LibXML::FmtDefault'],
add_roles_in_sequence =>[qw(
Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings
Spreadsheet::XLSX::Reader::LibXML::FormatInterface
)],
package => 'FormatInstance',
}
B Methods provided to adjust this attribute
=over
B
=over
B a way to set the current attribute instance
=back
B
=over
B a way to get the current attribute setting
=back
=back
B
delegated_to => link_delegated_from
=over
get_formatter_region => L
get_target_encoding => L
set_target_encoding => L
has_target_encoding => L
change_output_encoding => L
set_defined_excel_formats => L
get_defined_conversion => L
parse_excel_format_string => L
set_date_behavior => L
set_european_first => L
set_formatter_cache_behavior => L
=back
=back
=head3 group_return_type
=over
B Traditionally ParseExcel returns a cell object with lots of methods
to reveal information about the cell. In reality the extra information is not used very
much (witness the popularity of L). Because many users don't need or
want the extra cell formatting information it is possible to get either the raw xml value,
the raw visible cell value (seen in the Excel format bar), or the formatted cell value
returned either the way the Excel file specified or the way you specify instead of a Cell
instance with all the data. . See
L to insert custom targeted
formats for use with the parser. All empty cells return undef no matter what.
B instance
B instance = returns a populated L instance,
unformatted = returns just the raw visible value of the cell shown in the Excel formula bar,
value = returns just the formatted value stored in the excel cell, xml_value = the raw value
for the cell as stored in the sub-xml files
B Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
B
=over
B a way to set the current attribute setting
=back
=back
=back
=head3 empty_return_type
=over
B Traditionally L returns an empty string for cells
with unique formatting but no stored value. It may be that the more accurate way of returning
undef works better for you. This will turn that behaviour on. I
B empty_string
B
empty_string = populates the unformatted value with '' even if it is set to undef
undef_string = if excel stores undef for an unformatted value it will return undef
B Methods provided to adjust this attribute
=over
B
=over
B a way to check the current attribute setting
=back
B
=over
B a way to set the current attribute setting
=back
=back
=back
=head2 Primary Methods
These are the primary ways to use this class. They can be used to open an .xlsx workbook.
They are also ways to investigate information at the workbook level. For information on
how to retrieve data from the worksheets see the
L and
L documentation. For additional workbook
options see the L
and the L sections. The attributes section specifically contains
all the methods used to adjust the attributes of this class.
All methods are object methods and should be implemented on the object instance.
B
my @worksheet_array = $workbook_instance->worksheets;
=head3 parse( $file_name|$file_handle, $formatter )
=over
B This is a convenience method to match L.
It only works if the L or L attribute was not
set with ->new. It is one way to set the 'file_name' or 'file_handle' attribute [and the
L attribute]. I
B
$file = a valid xlsx file [or a valid xlsx file handle] (required)
[$formatter] = see the default_format_list attribute for valid options (optional)
B itself when passing with the xlsx file loaded to the workbook level or
undef for failure.
=back
=head3 worksheets
=over
B This method will return an array (I)
containing a list of references to all worksheets in the workbook. This is not
a reccomended method. It is provided for compatibility to Spreadsheet::ParseExcel.
For alternatives see the L method and the
L methods. B
B nothing
B an array ref of L
objects for all worksheets in the workbook.
=back
=head3 worksheet( $name )
=over
B This method will return an object to read values in the worksheet.
If no value is passed to $name then the 'next' worksheet in physical order is
returned. I<'next' will NOT wrap> It also only iterates through the 'worksheets'
in the workbook (but not the 'chartsheets').
B the $name string representing the name of the worksheet object you
want to open. This name is the word visible on the tab when opening the spreadsheet
in Excel. (not the underlying zip member file name - which can be different. It will
not accept chart tab names.)
B a L object with the
ability to read the worksheet of that name. It returns undef and sets the error attribute
if a 'chartsheet' is requested. Or in 'next' mode it returns undef if past the last sheet.
B using the implied 'next' worksheet;
while( my $worksheet = $workbook->worksheet ){
print "Reading: " . $worksheet->name . "\n";
# get the data needed from this worksheet
}
=back
=head3 in_the_list
=over
B This is a predicate method that indicates if the 'next'
L function has been implemented at least once.
Bnothing
B true = 1, false = 0
once
=back
=head3 start_at_the_beginning
=over
B This restarts the 'next' worksheet at the first worksheet. This
method is only useful in the context of the L
function.
B nothing
B nothing
=back
=head3 worksheet_count
=over
B This method returns the count of worksheets (excluding charts) in
the workbook.
Bnothing
B an integer
=back
=head3 get_worksheet_names
=over
B This method returns an array ref of all the worksheet names in the
workbook. (It excludes chartsheets.)
B nothing
B an array ref
B Another way to parse a workbook without building all the sheets at
once is;
for $sheet_name ( @{$workbook->worksheet_names} ){
my $worksheet = $workbook->worksheet( $sheet_name );
# Read the worksheet here
}
=back
=head3 get_sheet_names
=over
B This method returns an array ref of all the sheet names (tabs) in the
workbook. (It includes chartsheets.)
B nothing
B an array ref
=back
=head3 get_chartheet_names
=over
B This method returns an array ref of all the chartsheet names in the
workbook. (It excludes worksheets.)
B nothing
B an array ref
=back
=head3 sheet_name( $Int )
=over
B This method returns the sheet name for a given physical position
in the workbook from left to right. It counts from zero even if the workbook is in
'count_from_one' mode. B(It will return chart names but chart tab names cannot currently
be converted to worksheets). You may actually want L
instead of this function.
B integers
B the sheet name (both workbook and worksheet)
B To return only worksheet positions 2 through 4
for $x (2..4){
my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
# Read the worksheet here
}
=back
=head3 sheet_count
=over
B This method returns the count of all sheets in the workbook (worksheets
and chartsheets).
B nothing
B a count of all sheets
=back
=head3 worksheet_name( $Int )
=over
B | |