dtd.pl is a Perl library that parses an SGML document type definition
(DTD) and creates Perl data structures containing the content
of the DTD.
I assume the reader knows about the scope of packages and how to
access variables/subroutines defined in packages. If not, refer to
perl(1) or any book on Perl.
The reader should also have a working knowledge of SGML.
Unless stated or implied otherwise, all variables mentioned are
within the scope of package dtd.
Once installed, the following statement can be used to
access the dtd.pl routines:
require "dtd.pl";
All the public routines available are defined within the scope
of package main. Hence, if you require dtd.pl in a package other
than main, you must use package qualification when calling a routine.
Example:
&main'DTDread_dtd(DTD);
or,
&'DTDread_dtd(DTD);
The following routines are available in dtd.pl:
The following routines are only applicable after DTDread_dtd has been called.
The following routines deal with the parsing of an SGML DTD.
&'DTDread_dtd(FILEHANDLE);
DTDread_dtd parses the SGML DTD specified by FILEHANDLE.
Package-qualify FILEHANDLE when passing it to DTDread_dtd. Otherwise, FILEHANDLE will
be interpreted under the scope of package dtd.
Parsing of the DTD stops once the end of the file is reached. Any external entity
references will be parsed if an entity to filename mapping exists (see
DTDread_mapfile).
DTDread_dtd makes the following assumptions when parsing a DTD:
The reference concrete syntax is assumed. However, various
variables in dtd.pl can be redefined to try to accommodate an
alternate syntax. There are some dependencies in the parser on how
certain delimiters are defined. See the Perl source for more
information.
The SGML DTD is syntactically correct. This library is not intended
as a validator. Use sgmls, or another SGML validator, for such
purposes.
The SGML declaration statement is ignored if it exists.
Tag and entity names can only contain the characters "A-Za-z_.-".
However, this can be changed by setting the variable $namechars.
There is no size limit on name length.
Tag names are case-insensitive, but entity names are case-sensitive. Tag names are converted to and stored in lowercase.
Multiple contiguous whitespace characters in entity identifiers are treated as one whitespace character.
After DTDread_dtd is finished, the following variables are
filled (Note: all the variables are within the scope of package dtd):
@ParEntities
@GenEntities
@Elements
%ParEntity
%PubParEntity
%SysParEntity
%GenEntity
%StartTagEntity
%EndTagEntity
%MSEntity
%MDEntity
%PIEntity
%CDataEntity
%SDataEntity
%ElemCont
%ElemInc
%ElemExc
%ElemTag
%Attribute
To access the data in %Attribute, it is best to use DTDget_elem_attr.
%PubNotation
%SysNotation
All entities are expanded when data is stored in the
%ElemCont, %ElemInc, %ElemExc, %ElemTag, and %Attribute arrays.
To avoid maintenance problems with programs directly accessing
the variables set by DTDread_dtd, dtd.pl defines routines
to access the data contained in the variables.
If you use dtd.pl, try to use the data access routines
whenever possible.
External PUBLIC and SYSTEM general and data entities are ignored.
<!DOCTYPE is recognized, but its external reference to a file is not implemented.
Concurrent DTDs are not distinguished and may cause loss of data.
LINKTYPE, SHORTREF, USEMAP declarations are ignored.
DTDread_dtd's performance is not the best.
DTDread_dtd makes frequent use of Perl's getc function.
If SGML did not have such screwy grammar rules, I
could have easily avoided getc.
DTDread_dtd is meant to process DTDs in separate files. If a document
instance is in the file DTDread_dtd is parsing, behavior is undefined.
&'DTDread_catalog_files(@files);
DTDread_catalog_files reads all catalog entry files (aka map files)
specified by @files and by the SGML_CATALOG_FILES envariable.
See DTDread_mapfile for more information on catalog entry files.
SGML_CATALOG_FILES is a colon-separated (semicolon-separated for MSDOS users)
list of catalog files to read.
The files listed in @files are read before any files specified by
SGML_CATALOG_FILES. If a file in the list is not an absolute path, the
file is searched for in the paths listed in the envariables
P_SGML_PATH and SGML_SEARCH_PATH.
&'DTDread_mapfile($filename);
DTDread_mapfile parses the map file specified by $filename.
The map file format was originally specific to dtd.pl. However, since
version 2.2.0, the "map file" format has changed to follow conventions
similar to those of SGML catalogs (as defined in
SGML Open Draft Technical Resolution 9401:1994).
Therefore, the terms "map file" and "catalog" mean the same thing in the
context of this document.
The map file, or catalog, lets you map public identifiers to system identifiers (files), or entity names to system identifiers.
A catalog contains a sequence of the following types of entries:
PUBLIC public_id system_id
This maps public_id to system_id.
ENTITY name system_id
This maps a general entity whose name is name to system_id.
ENTITY % name system_id
This maps a parameter entity whose name is name to system_id.
A system_id string cannot contain any spaces. The system_id is treated as the pathname of a file.
Any line in a catalog file that does not match one of the entry types above is ignored.
In case of duplicate entries, the first entry defined is used.
Example catalog file:
-- ISO public identifiers --
PUBLIC "ISO 8879-1986//ENTITIES General Technical//EN" iso-tech.ent
PUBLIC "ISO 8879-1986//ENTITIES Publishing//EN" iso-pub.ent
PUBLIC "ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN" iso-num.ent
PUBLIC "ISO 8879-1986//ENTITIES Greek Letters//EN" iso-grk1.ent
PUBLIC "ISO 8879-1986//ENTITIES Diacritical Marks//EN" iso-dia.ent
PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN" iso-lat1.ent
PUBLIC "ISO 8879-1986//ENTITIES Greek Symbols//EN" iso-grk3.ent
PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN" ISOlat2
PUBLIC "ISO 8879-1986//ENTITIES Added Math Symbols: Ordinary//EN" ISOamso

-- HTML public identifiers and entities --
PUBLIC "-//IETF//DTD HTML//EN" html.dtd
PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML" ISOlat1.ent
ENTITY "%html-0" html-0.dtd
ENTITY "%html-1" html-1.dtd
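The entry rules above (three entry types, unmatched lines ignored, first duplicate wins) can be sketched in a few lines of Perl. This is only an illustration of the format, not the parser used by dtd.pl; the routine name parse_catalog and the sample entries other.dtd and logo.ent are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of dtd.pl): parse catalog text into two
# lookup tables, one keyed by public identifier, one by entity name.
sub parse_catalog {
    my ($text) = @_;
    my (%public, %entity);
    foreach my $line (split(/\n/, $text)) {
        if ($line =~ /^\s*PUBLIC\s+"([^"]*)"\s+(\S+)/) {
            # First entry wins in case of duplicates.
            $public{$1} = $2 unless exists $public{$1};
        } elsif ($line =~ /^\s*ENTITY\s+"?(%?)\s*([^"\s]+)"?\s+(\S+)/) {
            # Parameter entities keep their leading "%" in the key.
            my ($name, $sysid) = ("$1$2", $3);
            $entity{$name} = $sysid unless exists $entity{$name};
        }
        # Any other line (comments, blank lines) is ignored.
    }
    return (\%public, \%entity);
}

my $catalog = <<'EOF';
-- HTML public identifiers and entities --
PUBLIC "-//IETF//DTD HTML//EN" html.dtd
PUBLIC "-//IETF//DTD HTML//EN" other.dtd
ENTITY "%html-0" html-0.dtd
ENTITY % html-1 html-1.dtd
ENTITY logo logo.ent
EOF

my ($public, $entity) = parse_catalog($catalog);
print "PUBLIC $_ => $public->{$_}\n" foreach sort keys %$public;
print "ENTITY $_ => $entity->{$_}\n" foreach sort keys %$entity;
```

The second PUBLIC line is dropped because an entry for the same public identifier was already seen.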
dtd.pl also supports envariables (i.e., environment
variables) to aid in resolving external entities. The following
envariables are used by dtd.pl:
P_SGML_PATH
This is a colon-separated (semicolon-separated for MSDOS users) list of paths for finding catalog files or system identifiers. For example, if a system identifier is not an absolute pathname, then the paths listed in P_SGML_PATH are used to find the file.
SGML_SEARCH_PATH
This is a colon-separated (semicolon-separated for MSDOS users) list of paths for finding catalog files or system identifiers. This envariable serves the same function as P_SGML_PATH. If both are defined, paths listed in P_SGML_PATH are searched before any paths in SGML_SEARCH_PATH.
The use of P_SGML_PATH is for compatibility with earlier versions
of dtd.pl.
SGML_CATALOG_FILES and SGML_SEARCH_PATH
are supported for compatibility with James Clark's nsgmls(1).
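The search order described above can be sketched as follows. This is a rough illustration under stated assumptions, not the lookup code in dtd.pl: the routine names search_dirs and resolve_sysid are hypothetical, and the environment is passed in as a hash so the example does not depend on the real %ENV.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: build the ordered directory list, with
# P_SGML_PATH entries before SGML_SEARCH_PATH entries.
sub search_dirs {
    my ($env, $sep) = @_;
    my @dirs;
    foreach my $name ('P_SGML_PATH', 'SGML_SEARCH_PATH') {
        next unless defined $env->{$name} && length $env->{$name};
        push(@dirs, grep { length } split(/\Q$sep\E/, $env->{$name}));
    }
    return @dirs;
}

# Hypothetical helper: absolute system identifiers are used as-is;
# relative ones are looked up in each directory in order.
sub resolve_sysid {
    my ($sysid, @dirs) = @_;
    return $sysid if File::Spec->file_name_is_absolute($sysid);
    foreach my $dir (@dirs) {
        my $path = File::Spec->catfile($dir, $sysid);
        return $path if -e $path;
    }
    return;    # not found
}

# Sample values; a real program would pass \%ENV and use ";" on MSDOS.
my %env = (
    'P_SGML_PATH'      => '/old/dtds',
    'SGML_SEARCH_PATH' => '/usr/lib/sgml:/opt/sgml',
);
my @dirs = search_dirs(\%env, ':');
print "$_\n" foreach @dirs;
```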
&'DTDreset();
DTDreset clears all data associated with the DTD read via DTDread_dtd.
This routine is useful if multiple DTDs need to be processed.
&'DTDset_comment_callback($callback);
DTDset_comment_callback sets the function, $callback, to be called
when a comment declaration is read during DTDread_dtd.
$callback is called as follows:
&$callback(*comment_text);
*comment_text is a pointer to the string containing all
the text within the SGML comment declaration (excluding the open and close
delimiters).
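A callback written for this calling convention might look like the sketch below. It assumes the callback is passed as a subroutine name string, which the text above does not state; the names collect_comment and @comments are hypothetical, and the final lines only simulate the call that DTDread_dtd is documented to make.

```perl
use strict;
use warnings;

our @comments;        # hypothetical store for collected comment text
our $comment_text;    # stands in for the parser's own variable
our $text;            # aliased to the caller's scalar inside the callback

# Hypothetical callback: the typeglob passed in is assigned to *text,
# so $text names the same scalar that holds the comment text.
sub collect_comment {
    local(*text) = @_;
    push(@comments, $text);
}

# Simulated call, in the form given above. Assumption: the callback is
# named by a string (Perl 4 code would write "main'collect_comment").
my $callback = "main::collect_comment";
$comment_text = " Content model for lists ";
{
    no strict 'refs';    # calling a subroutine by name
    &$callback(*comment_text);
}

print "Collected:$comments[0]\n";
```

Because the argument is a typeglob rather than a copy, the callback reads the text without duplicating a possibly long string.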
&'DTDset_pi_callback($callback);
DTDset_pi_callback sets the function, $callback, to be called when a
processing instruction is read during DTDread_dtd.
$callback is called as follows:
&$callback(*pi_text);
*pi_text is a pointer to the string containing all the text within the
processing instruction (excluding the open and close delimiters).
&'DTDset_verbosity($value);
DTDset_verbosity sets the verbosity flag for DTDread_dtd.
If $value is non-zero, then DTDread_dtd
outputs status messages as it parses a DTD. This function is
used for debugging purposes.
The following routines access the data
extracted from an SGML DTD via DTDread_dtd.
@elements = &'DTDget_elements();
@elements = &'DTDget_elements($nosortflag);
DTDget_elements retrieves an array of all elements defined in
the DTD.
An optional flag argument can be passed to the routine to
determine whether the elements returned are sorted: 0 => sorted,
1 => not sorted.
@top_elements = &'DTDget_top_elements();
DTDget_top_elements retrieves a sorted array of all top-most elements
defined in the DTD. Top-most elements are those elements that cannot
be contained within another element or can only be contained within
themselves.
%attribute = &'DTDget_elem_attr($elem);
DTDget_elem_attr returns an associative array containing the
attributes of $elem.
The keys of the array are the attribute names,
and the array values are $; separated strings of the possible values
for the attributes. Example of extracting an attribute's values:
@values = split(/$;/, $attribute{'alignment'});
The first array value of the $; split array is the default value
for the attribute (which may be an SGML reserved word). If the default
value equals "#FIXED", then the next array value is the #FIXED value.
The other array values are all possible values for the attribute.
$; is assumed to be the default value assigned by Perl: "\034".
If $; is changed, unpredictable results may occur.
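The layout of the $; separated string can be unpacked as in the sketch below. The %attribute contents here are made-up sample data in the documented shape (a real program would get the hash from DTDget_elem_attr), and the routine name unpack_attr is hypothetical.

```perl
use strict;
use warnings;

# Made-up sample data: default value first, then the #FIXED value if
# the default is "#FIXED", then the remaining possible values.
my %attribute = (
    'alignment' => join($;, 'center', 'left', 'center', 'right'),
    'version'   => join($;, '#FIXED', '1.0', 'CDATA'),
);

# Hypothetical helper: split one packed string into its parts.
sub unpack_attr {
    my ($packed) = @_;
    my $sep     = $;;                          # "\034" unless changed
    my @values  = split(/\Q$sep\E/, $packed);
    my $default = shift(@values);
    my $fixed;
    $fixed = shift(@values) if defined $default && $default eq '#FIXED';
    return ($default, $fixed, @values);
}

foreach my $name (sort keys %attribute) {
    my ($default, $fixed, @others) = unpack_attr($attribute{$name});
    print "$name: default=$default";
    print " fixed=$fixed" if defined $fixed;
    print " others=@others\n";
}
```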
@parent_elements = &'DTDget_parents($elem);
DTDget_parents returns an array of all elements that may be a parent
of $elem.
@base_children = &'DTDget_base_children($elem, $andcon);
DTDget_base_children returns an array of the elements in the base
model group of $elem.
$andcon is a flag that determines whether the connector characters
are included in the returned array: 0 => no connectors, 1 (non-zero)
=> connectors.
Example:
<!ELEMENT foo (x | y | z) +(a | b) -(m | n)>
The call
&'DTDget_base_children('foo')
will return
('x', 'y', 'z')
The call
&'DTDget_base_children('foo', 1)
will return
('(', 'x', '|', 'y', '|', 'z', ')')
One may use DTDis_tag_name to distinguish
elements from the connectors.
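Separating elements from connectors can be sketched as below. The routine is_tag_name is a local stand-in for DTDis_tag_name, written from the documented default name characters "A-Za-z_.-"; it is an assumption about that routine's behavior, not its actual source.

```perl
use strict;
use warnings;

# Documented default set of legal name characters.
my $namechars = 'A-Za-z_.-';

# Stand-in for DTDis_tag_name: 1 if every character is a name character.
sub is_tag_name {
    my ($string) = @_;
    return $string =~ /\A[$namechars]+\z/ ? 1 : 0;
}

# The list documented for &'DTDget_base_children('foo', 1).
my @with_connectors = ('(', 'x', '|', 'y', '|', 'z', ')');

# Keep only the entries that are tag names.
my @elements = grep { is_tag_name($_) } @with_connectors;
print "@elements\n";    # prints "x y z"
```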
@exc_children = &'DTDget_exc_children($elem, $andcon);
DTDget_exc_children returns an array of the elements in the exclusion
model group of $elem.
$andcon is a flag that determines whether the connector characters
are included in the returned array: 0 => no connectors, 1 (non-zero)
=> connectors.
Example:
<!ELEMENT foo (x | y | z) +(a | b) -(m | n)>
The call
&'DTDget_exc_children('foo')
will return
('m', 'n')
@generalents = &'DTDget_gen_ents();
@generalents = &'DTDget_gen_ents($nosort);
DTDget_gen_ents returns an array of general entities.
An optional flag argument can be passed to the routine to
determine whether the entities returned are sorted: 0 => sorted,
1 => not sorted.
@gendataents = &'DTDget_gen_data_ents();
DTDget_gen_data_ents returns an array of general data
entities defined in the DTD. Data entities cover the
following: PCDATA, CDATA, SDATA, PI.
@inc_children = &'DTDget_inc_children($elem, $andcon);
DTDget_inc_children returns an array of the elements in the inclusion
model group of $elem.
$andcon is a flag that determines whether the connector characters
are included in the returned array: 0 => no connectors, 1 (non-zero)
=> connectors.
Example:
<!ELEMENT foo (x | y | z) +(a | b) -(m | n)>
The call
&'DTDget_inc_children('foo')
will return
('a', 'b')
&'DTDis_element($element);
DTDis_element returns 1 if $element is defined in the DTD. Otherwise,
0 is returned.
The following are general utility routines.
&'DTDis_attr_keyword($word);
DTDis_attr_keyword returns 1 if $word is an attribute content reserved
value; otherwise, it returns 0. In the reference concrete syntax, the
following values of $word will return 1:
Character case is ignored.
&'DTDis_elem_keyword($word);
DTDis_elem_keyword returns 1 if $word is an element content reserved
value; otherwise, it returns 0. In the reference concrete syntax, the
following values of $word will return 1:
Character case is ignored.
&'DTDis_group_connector($char);
DTDis_group_connector returns 1 if $char is a group connector;
otherwise, it returns 0. The following values of $char will return 1:
&'DTDis_occur_indicator($char);
DTDis_occur_indicator returns 1 if $char is an occurrence indicator;
otherwise, it returns 0. The following values of $char will return 1:
&'DTDis_tag_name($string);
DTDis_tag_name returns 1 if $string is a legal tag name; otherwise, it
returns 0. Legal characters in a tag name are defined by the
$namechars variable. By default, a tag name may only contain the
characters "A-Za-z_.-".
&'DTDprint_tree($elem, $depth, FILEHANDLE);
DTDprint_tree prints the content hierarchy of a single element,
$elem, to a maximum depth of $depth to the file specified by
FILEHANDLE.
If FILEHANDLE is not specified, output goes to standard out. A depth of 5
is used if $depth is not specified. The root of the tree has a depth
of 1.
The output generated by DTDprint_tree is as follows:
Elements that already exist at a higher (or equal) level are pruned, as are elements below the maximum depth. The string "..." is appended to an element if it has been pruned due to pre-existence at a higher (or equal) level. The content of the pruned element can be determined by searching for the complete tree of the element (i.e., the occurrence without "...").
Here's an example of what the output will look like due to pruning of recursive element contents:
htmlplus
 |
 |_body
 |  |
 |  |_address
 |  |  |
 |  |  |_p ...
 |  |
 |  |_div1
 |  |  |
 |  |  |_address ...
 |  |  |_div2 ...
 |  |  |_div3 ...
 |  |  |_div4 ...
 |  |  |_div5 ...
 |  |  |_div6 ...
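The pruning rule can be sketched as follows. This is one plausible reading of the rule, not the code in dtd.pl: a breadth-first pass records the shallowest level of each element, and the tree walk then expands an element only at its first occurrence on that level. The content model in %children, the routine names, and the simplified line layout (no spacer lines) are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical content model: element name => list of child elements.
my %children = (
    'doc'     => ['body'],
    'body'    => ['address', 'div1'],
    'address' => ['p'],
    'div1'    => ['address', 'p'],
    'p'       => ['em'],
    'em'      => [],
);

# Breadth-first pass: shallowest depth of each element (root is 1).
sub min_depths {
    my ($root, $children) = @_;
    my %depth = ($root => 1);
    my @queue = ($root);
    while (@queue) {
        my $elem = shift(@queue);
        foreach my $child (@{ $children->{$elem} || [] }) {
            next if exists $depth{$child};
            $depth{$child} = $depth{$elem} + 1;
            push(@queue, $child);
        }
    }
    return \%depth;
}

# Depth-first walk that collects the output lines in $st->{lines}.
sub walk {
    my ($elem, $depth, $prefix, $st) = @_;
    my @kids  = @{ $st->{children}{$elem} || [] };
    my $label = $depth == 1 ? $elem : "$prefix |_$elem";

    # Prune an element that has content if it sits below its shallowest
    # level or was already expanded on that level.
    if (@kids && ($depth > $st->{min}{$elem} || $st->{expanded}{$elem})) {
        push(@{ $st->{lines} }, "$label ...");
        return;
    }
    push(@{ $st->{lines} }, $label);
    $st->{expanded}{$elem} = 1;
    return if $depth >= $st->{max};    # depth limit: no "..." marker

    my $next = $depth == 1 ? '' : "$prefix | ";
    foreach my $kid (@kids) {
        walk($kid, $depth + 1, $next, $st);
    }
}

sub tree_lines {
    my ($root, $maxdepth, $children) = @_;
    my %st = (
        children => $children,
        min      => min_depths($root, $children),
        max      => $maxdepth,
        expanded => {},
        lines    => [],
    );
    walk($root, 1, '', \%st);
    return @{ $st{lines} };
}

my @lines = tree_lines('doc', 5, \%children);
print "$_\n" foreach @lines;
```

In this sample, address and p are expanded once under body and appear with "..." under div1, where they sit at a lower or equal level.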
Since the tree output is static, the inclusion and exclusion sets of elements are treated specially. Inclusion and exclusion elements inherited from ancestors are not propagated down to determine which elements are printed, but special markup is presented at a given element if inclusion or exclusion elements from ancestors apply to it. Inclusion and exclusion elements are not propagated down because of the pruning: an element without "..." may be the only place to see the content hierarchy of that element, yet the element may occur in multiple contexts and have different ancestral inclusion and exclusion elements applied to it.
Have I lost you? Maybe an example may help:
OPENBOOK
 |
 |_d1
 |  |  (I): idx needbegin needend newline
 |  |
 |  |_abbrev
 |  |  |  (Ia): idx needbegin needend newline
 |  |  |  (X): needbegin needend
 |  |  |
 |  |  |_#PCDATA
 |  |  |_acro
 |  |  |  |  (Ia): idx needbegin needend newline
 |  |  |  |  (Xa): needbegin needend
 |  |  |  |
 |  |  |  |_#PCDATA
 |  |  |  |_sub ...
 |  |  |  |_super ...
 |  |  |
Ignoring the lines starting with ()'s, one gets the content hierarchy
of an element as defined by the DTD without concern for where it
may occur in the overall structure. The ()'s lines give additional
information regarding the element with respect to its existence
within a specific context. For example, when an acro element
occurs within openbook/d1/abbrev, along with its normal
content, it can contain idx and newline elements due to
inclusions from ancestors. However, it cannot contain needbegin or
needend regardless of its defined content since an ancestor
excludes them. needbegin and needend are excluded from acro.
Explanation of ()'s keys:
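(I) Inclusion elements defined by the element.
(Ia) Inclusion elements inherited from ancestors.
(X) Exclusion elements defined by the element.
(Xa) Exclusion elements inherited from ancestors.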
This program is part of the perlSGML package; see <URL:http://www.oac.uci.edu/indiv/ehood/perlSGML.html>