dtd.pl

dtd.pl is a Perl library that parses an SGML document type definition (DTD) and creates Perl data structures containing the content of the DTD.


Audience

I assume the reader knows about the scope of packages and how to access variables/subroutines defined in packages. If not, refer to perl(1) or any book on Perl. The reader should also have a working knowledge of SGML.

Unless stated, or implied, otherwise, all variables mentioned are within the scope of package dtd.


Usage

Once installed, the following statement can be used to access the dtd routines:

    require "dtd.pl";

All the public routines available are defined within the scope of package main. Hence, if you require dtd.pl in a package other than main, you must use package qualification when calling a routine.

Example:

    &main'DTDread_dtd(DTD);

or,

    &'DTDread_dtd(DTD);

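The following routines are available in dtd.pl:

Parsing Routines

    DTDread_dtd
    DTDread_catalog_files
    DTDread_mapfile
    DTDreset
    DTDset_comment_callback
    DTDset_pi_callback
    DTDset_verbosity

The following routines are only applicable after DTDread_dtd has been called.

Data Access Routines

    DTDget_elements
    DTDget_top_elements
    DTDget_elem_attr
    DTDget_parents
    DTDget_base_children
    DTDget_exc_children
    DTDget_gen_ents
    DTDget_gen_data_ents
    DTDget_inc_children
    DTDis_element

Utility Routines

    DTDis_attr_keyword
    DTDis_elem_keyword
    DTDis_group_connector
    DTDis_occur_indicator
    DTDis_tag_name
    DTDprint_tree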


Parsing Routines

The following routines deal with the parsing of an SGML DTD.


DTDread_dtd

Usage

    &'DTDread_dtd(FILEHANDLE);

Description

DTDread_dtd parses the SGML DTD specified by FILEHANDLE.

Note
Make sure to package qualify FILEHANDLE when calling DTDread_dtd. Otherwise, FILEHANDLE will be interpreted under the scope of package dtd.

Parsing of the DTD stops once the end of the file is reached. Any external entity references will be parsed if an entity to filename mapping exists (see DTDread_mapfile).

DTDread_dtd makes the following assumptions when parsing a DTD:

After DTDread_dtd is finished, the following variables are filled (Note: all the variables are within the scope of package dtd):

@ParEntities
Parameter entities in order processed
@GenEntities
General entities in order processed
@Elements
Elements in order processed
%ParEntity
Keys: Non-external parameter entities.
Values: Replacement value.
%PubParEntity
Keys: External public parameter entities (PUBLIC).
Values: Entity identifier, if defined.
%SysParEntity
Keys: External system parameter entities (SYSTEM).
Values: Entity identifier, if defined.
%GenEntity
Keys: Regular general entities.
Values: Entity value.
%StartTagEntity
Keys: STARTTAG general entities.
Values: Entity value.
%EndTagEntity
Keys: ENDTAG general entities.
Values: Entity value.
%MSEntity
Keys: MS general entities.
Values: Entity value.
%MDEntity
Keys: MD general entities.
Values: Entity value.
%PIEntity
Keys: PI general entities.
Values: Entity value.
%CDataEntity
Keys: CDATA general entities.
Values: Entity value.
%SDataEntity
Keys: SDATA general entities.
Values: Entity value.
%ElemCont
Keys: Element names.
Values: Base content of declaration of elements.
%ElemInc
Keys: Element names.
Values: Inclusion set declarations.
%ElemExc
Keys: Element names.
Values: Exclusion set declarations.
%ElemTag
Keys: Element names.
Values: Omitted tag minimization.
%Attribute
Keys: Element names.
Values: Attributes for elements. To access the data stored in %Attribute, it is best to use DTDget_elem_attr.
%PubNotation
Keys: PUBLIC Notation names.
Values: Notation identifier.
%SysNotation
Keys: SYSTEM Notation names.
Values: Notation identifier.

All entities are expanded when data is stored in the %ElemCont, %ElemInc, %ElemExc, %ElemTag, and %Attribute arrays.

To avoid maintenance problems with programs directly accessing the variables set by DTDread_dtd, dtd.pl defines routines to access the data contained in the variables. If you use dtd.pl, try to use the data access routines when at all possible.

Notes


DTDread_catalog_files

Usage

    &'DTDread_catalog_files(@files);

Description

DTDread_catalog_files reads all catalog entry files (aka map files) specified by @files and by the SGML_CATALOG_FILES envariable.

See DTDread_mapfile for more information on catalog entry files.

Environment Variables

SGML_CATALOG_FILES

This envariable is a colon (semi-colon for MSDOS users) separated list of catalog files to read. The files listed in @files are read before any files specified by SGML_CATALOG_FILES. If a file in the list is not an absolute path, then the file is searched for in the paths listed in the envariables P_SGML_PATH and SGML_SEARCH_PATH.
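
The read order can be pictured with a small sketch. The routine catalog_read_order below is hypothetical and is not part of dtd.pl; it only shows the explicit file arguments coming before the entries of SGML_CATALOG_FILES. The separator is passed in (':' here, ';' for MSDOS), and the file name local.cat is invented for the example.

```perl
use strict;

# Hypothetical helper, not part of dtd.pl: return catalog files in the
# order they would be read (explicit files first, then the entries of
# the SGML_CATALOG_FILES value).
sub catalog_read_order {
    my ($envvalue, $sep, @files) = @_;
    my @order = @files;
    if (defined $envvalue) {
        # Skip empty entries such as the one produced by "a::b".
        push(@order, grep { length($_) } split(/\Q$sep\E/, $envvalue));
    }
    return @order;
}

my @order = catalog_read_order($ENV{'SGML_CATALOG_FILES'}, ':', 'local.cat');
print join("\n", @order), "\n";
```

Relative names in the resulting list would still have to be located through P_SGML_PATH and SGML_SEARCH_PATH.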


DTDread_mapfile

Usage

    &'DTDread_mapfile($filename);

Description

DTDread_mapfile parses a map file specified by $filename.

Note
The term "map file" was introduced by the first version of dtd.pl. However, since version 2.2.0, the "map file" format follows conventions similar to those of SGML catalogs (as defined in SGML Open Draft Technical Resolution 9401:1994). Therefore, the terms "map file" and "catalog" mean the same thing in the context of this document.

The map file, or catalog, lets you map public identifiers to system identifiers (files) and map entity names to system identifiers.

Catalog Syntax

A catalog contains a sequence of the following types of entries:

PUBLIC public_id system_id

This maps public_id to system_id.

ENTITY name system_id

This maps a general entity whose name is name to system_id.

ENTITY %name system_id

This maps a parameter entity whose name is name to system_id.

Syntax Notes

Example catalog file:

        -- ISO public identifiers --
PUBLIC "ISO 8879-1986//ENTITIES General Technical//EN"            iso-tech.ent
PUBLIC "ISO 8879-1986//ENTITIES Publishing//EN"                   iso-pub.ent
PUBLIC "ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN"  iso-num.ent
PUBLIC "ISO 8879-1986//ENTITIES Greek Letters//EN"                iso-grk1.ent
PUBLIC "ISO 8879-1986//ENTITIES Diacritical Marks//EN"            iso-dia.ent
PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN"                iso-lat1.ent
PUBLIC "ISO 8879-1986//ENTITIES Greek Symbols//EN"                iso-grk3.ent 
PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN"                ISOlat2
PUBLIC "ISO 8879-1986//ENTITIES Added Math Symbols: Ordinary//EN" ISOamso

        -- HTML public identifiers and entities --
PUBLIC "-//IETF//DTD HTML//EN"                                    html.dtd
PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML"          ISOlat1.ent
ENTITY "%html-0"                                                  html-0.dtd
ENTITY "%html-1"                                                  html-1.dtd
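
To make the entry syntax concrete, here is a minimal sketch of reading such a catalog. It is not dtd.pl's own parser. The routine parse_catalog and the two hashes it returns are assumptions made for illustration, and it handles only the PUBLIC and ENTITY entries and the -- comments -- shown above.

```perl
use strict;

# Hypothetical sketch, not dtd.pl's own parser.  Returns two hash
# references: public identifier => system identifier, and entity name
# => system identifier (parameter entities keep their leading "%").
sub parse_catalog {
    my ($text) = @_;
    my (%public, %entity, @tokens);

    # A token is a "quoted" or 'quoted' string or a bare word.  Text
    # between -- and -- is a comment and yields no token.
    while ($text =~ /--.*?--|"([^"]*)"|'([^']*)'|(\S+)/gs) {
        my ($token) = grep { defined($_) } ($1, $2, $3);
        push(@tokens, $token) if defined $token;
    }

    # Each entry handled here is a keyword followed by two arguments;
    # entries with any other keyword are skipped.
    while (@tokens >= 3) {
        my ($keyword, $name, $sysid) = splice(@tokens, 0, 3);
        if    ($keyword eq 'PUBLIC') { $public{$name} = $sysid; }
        elsif ($keyword eq 'ENTITY') { $entity{$name} = $sysid; }
    }
    return (\%public, \%entity);
}

my ($public, $entity) = parse_catalog(<<'EOC');
        -- HTML public identifiers and entities --
PUBLIC "-//IETF//DTD HTML//EN"    html.dtd
ENTITY "%html-0"                  html-0.dtd
EOC

print $public->{'-//IETF//DTD HTML//EN'}, "\n";
print $entity->{'%html-0'}, "\n";
```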

Environment Variables

dtd.pl also supports envariables (i.e., environment variables) to aid in resolving external entities. The following envariables are used by dtd.pl:

P_SGML_PATH

This is a colon (semi-colon for MSDOS users) separated list of paths for finding catalog files or system identifiers. For example, if a system identifier is not an absolute pathname, then the paths listed in P_SGML_PATH are used to find the file.

SGML_SEARCH_PATH

This is a colon (semi-colon for MSDOS users) separated list of paths for finding catalog files or system identifiers. This envariable serves the same function as P_SGML_PATH. If both are defined, paths listed in P_SGML_PATH are searched before any paths in SGML_SEARCH_PATH.

The use of P_SGML_PATH is for compatibility with earlier versions of dtd.pl. SGML_CATALOG_FILES and SGML_SEARCH_PATH are supported for compatibility with James Clark's nsgmls(1).

Note
When searching for a file via the P_SGML_PATH and/or SGML_SEARCH_PATH, if the file is not found in any of the paths, then the current working directory is searched.
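
The lookup order described above can be sketched as follows. The routine find_sgml_file is hypothetical and is not part of dtd.pl. It assumes Unix-style absolute paths (a leading "/") and takes the list separator as an optional argument.

```perl
use strict;

# Hypothetical sketch of the lookup order; not part of dtd.pl.
sub find_sgml_file {
    my ($file, $sep) = @_;
    $sep = ':' unless defined $sep;      # pass ';' on MSDOS
    return $file if $file =~ m|^/|;      # absolute path: used as is

    my @dirs;
    # P_SGML_PATH entries come before SGML_SEARCH_PATH entries.
    foreach my $var ('P_SGML_PATH', 'SGML_SEARCH_PATH') {
        push(@dirs, split(/\Q$sep\E/, $ENV{$var})) if defined $ENV{$var};
    }
    push(@dirs, '.');                    # current directory is tried last

    foreach my $dir (@dirs) {
        return "$dir/$file" if -e "$dir/$file";
    }
    return undef;                        # not found anywhere
}

my $path = find_sgml_file('iso-lat1.ent');
print defined($path) ? "found: $path\n" : "not found\n";
```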

DTDreset

Usage

    &'DTDreset();

Description

DTDreset clears all data associated with the DTD read via DTDread_dtd. This routine is useful if multiple DTDs need to be processed.


DTDset_comment_callback

Usage

    &'DTDset_comment_callback($callback);

Description

DTDset_comment_callback sets the function, $callback, to be called when a comment declaration is read during DTDread_dtd. $callback is called as follows:

    &$callback(*comment_text);

*comment_text is a pointer to the string containing all the text within the SGML comment declaration (excluding the open and close delimiters).
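
Because the argument is a typeglob rather than a plain string, the callback has to alias it before reading the text. The sketch below shows one way to write such a callback, together with a stand-in for the call DTDread_dtd makes. The names save_comment and @comments are invented for the example. The stand-in call uses a Perl 5 code reference; this document does not say whether $callback should be a routine name or a reference, so treat that part as an assumption.

```perl
use strict;

our @comments = ();      # hypothetical place to collect comment text
our $text;               # aliased to the caller's variable on each call

sub save_comment {
    local(*text) = @_;   # the argument is a typeglob, e.g. *comment_text
    push(@comments, $text);
}

# Stand-in for the call made while a DTD is being read.
our $comment_text = " ISO public identifiers ";
my $callback = \&save_comment;
&$callback(*comment_text);

print scalar(@comments), " comment(s) saved\n";
```

A callback for DTDset_pi_callback can be written the same way, since it is also passed a typeglob.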


DTDset_pi_callback

Usage

    &'DTDset_pi_callback($callback);

Description

DTDset_pi_callback sets the function, $callback, to be called when a processing instruction is read during DTDread_dtd. $callback is called as follows:

    &$callback(*pi_text);

*pi_text is a pointer to the string containing all the text within the processing instruction (excluding the open and close delimiters).


DTDset_verbosity

Usage

    &'DTDset_verbosity($value);

Description

DTDset_verbosity sets the verbosity flag for DTDread_dtd. If $value is non-zero, then DTDread_dtd outputs status messages as it parses a DTD. This function is used for debugging purposes.


Data Access Routines

The following routines access the data extracted from an SGML DTD via DTDread_dtd.


DTDget_elements

Usage

    @elements = &'DTDget_elements();
    @elements = &'DTDget_elements($nosortflag);

Description

DTDget_elements retrieves an array of all elements defined in the DTD. An optional flag argument can be passed to the routine to determine whether the elements returned are sorted: 0 => sorted, 1 => not sorted.


DTDget_top_elements

Usage

    @top_elements = &'DTDget_top_elements();

Description

DTDget_top_elements retrieves a sorted array of all top-most elements defined in the DTD. Top-most elements are those elements that cannot be contained within another element or can only be contained within themselves.
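
The definition of a top-most element can be illustrated with a short sketch. The routine top_elements and the %children table are hypothetical and do not reflect how dtd.pl stores or computes this; the element names are invented.

```perl
use strict;

# Hypothetical helper, not part of dtd.pl.  %children maps each element
# to an array reference of the elements it may contain.
sub top_elements {
    my (%children) = @_;
    my %contained;                 # elements that have some other parent
    foreach my $parent (keys %children) {
        foreach my $child (@{ $children{$parent} }) {
            $contained{$child} = 1 if $child ne $parent;
        }
    }
    my @top = grep { !$contained{$_} } keys %children;
    return sort(@top);
}

my @top = top_elements(
    'html' => ['body'],
    'body' => ['p', 'list'],
    'list' => ['list', 'p'],
    'p'    => [],
    'memo' => ['memo', 'p'],
);
print "@top\n";
```

Here html has no parent at all and memo can only occur within itself, so both count as top-most; list does not, because body can also contain it.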


DTDget_elem_attr

Usage

    %attribute = &'DTDget_elem_attr($elem);

Description

DTDget_elem_attr returns an associative array containing the attributes of $elem. The keys of the array are the attribute names, and the array values are $; separated strings of the possible values for the attributes. Example of extracting an attribute's values:

    @values = split(/$;/, $attribute{'alignment'});

The first value of the array split on $; is the default value for the attribute (which may be an SGML reserved word). If the default value equals "#FIXED", then the next array value is the #FIXED value. The other array values are all possible values for the attribute.

Note
$; is assumed to be the default value assigned by Perl: "\034". If $; is changed, unpredictable results may occur.
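
The layout of each value can be unpacked with a few lines of Perl. The helper unpack_attr below is hypothetical and is not part of dtd.pl, and the sample string is invented to stand in for an entry of the array returned by DTDget_elem_attr.

```perl
use strict;

# Hypothetical helper, not part of dtd.pl: unpack one value of the
# associative array returned by DTDget_elem_attr.
sub unpack_attr {
    my ($packed) = @_;
    my $sep = $;;                       # "\034" unless $; was changed
    my @values  = split(/\Q$sep\E/, $packed);
    my $default = shift(@values);
    my $fixed;                          # stays undefined unless #FIXED
    $fixed = shift(@values) if defined($default) && $default eq '#FIXED';
    return ($default, $fixed, @values);
}

# Invented sample data standing in for $attribute{'version'}.
my $sample = join($;, '#FIXED', '1.0', 'CDATA');
my ($default, $fixed, @rest) = unpack_attr($sample);
print "default=$default fixed=$fixed rest=@rest\n";
```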

DTDget_parents

Usage

    @parent_elements = &'DTDget_parents($elem);

Description

DTDget_parents returns an array of all elements that may be a parent of $elem.


DTDget_base_children

Usage

    @base_children = &'DTDget_base_children($elem, $andcon);

Description

DTDget_base_children returns an array of the elements in the base model group of $elem. $andcon is a flag that determines whether the connector characters are included in the returned array: 0 => no connectors, 1 (non-zero) => connectors.

Example:

    <!ELEMENT foo (x | y | z) +(a | b) -(m | n)>

The call

    &'DTDget_base_children('foo')

will return

    ('x', 'y', 'z')

The call

    &'DTDget_base_children('foo', 1)

will return

    ('(', 'x', '|', 'y', '|', 'z', ')')

One may use DTDis_tag_name to distinguish elements from the connectors.
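
As a sketch of that filtering step, the code below pulls the element names out of the array shown above. So that it runs without dtd.pl, it uses a stand-in named is_tag_name built from the default name characters quoted in the DTDis_tag_name section; with dtd.pl loaded, DTDis_tag_name itself would take its place.

```perl
use strict;

# Stand-in for DTDis_tag_name so the sketch runs without dtd.pl; it
# uses the default name characters "A-Za-z_.-".
sub is_tag_name {
    my ($string) = @_;
    return ($string =~ /^[A-Za-z_.\-]+$/) ? 1 : 0;
}

# The array documented above for the call with $andcon set to 1.
my @group = ('(', 'x', '|', 'y', '|', 'z', ')');
my @names = grep { is_tag_name($_) } @group;
print "@names\n";
```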


DTDget_exc_children

Usage

    @exc_children = &'DTDget_exc_children($elem, $andcon);

Description

DTDget_exc_children returns an array of the elements in the exclusion model group of $elem. $andcon is a flag that determines whether the connector characters are included in the returned array: 0 => no connectors, 1 (non-zero) => connectors.

Example:

    <!ELEMENT foo (x | y | z) +(a | b) -(m | n)>

The call

    &'DTDget_exc_children('foo')

will return

    ('m', 'n')

DTDget_gen_ents

Usage

    @generalents = &'DTDget_gen_ents();
    @generalents = &'DTDget_gen_ents($nosort);

Description

DTDget_gen_ents returns an array of general entities. An optional flag argument can be passed to the routine to determine whether the entities returned are sorted: 0 => sorted, 1 => not sorted.


DTDget_gen_data_ents

Usage

    @gendataents = &'DTDget_gen_data_ents();

Description

DTDget_gen_data_ents returns an array of general data entities defined in the DTD. Data entities cover the following: PCDATA, CDATA, SDATA, PI.


DTDget_inc_children

Usage

    @inc_children = &'DTDget_inc_children($elem, $andcon);

Description

DTDget_inc_children returns an array of the elements in the inclusion model group of $elem. $andcon is a flag that determines whether the connector characters are included in the returned array: 0 => no connectors, 1 (non-zero) => connectors.

Example:

    <!ELEMENT foo (x | y | z) +(a | b) -(m | n)>

The call

    &'DTDget_inc_children('foo')

will return

    ('a', 'b')

DTDis_element

Usage

    &'DTDis_element($element);

Description

DTDis_element returns 1 if $element is defined in the DTD. Otherwise, 0 is returned.


Utility Routines

The following are general utility routines.


DTDis_attr_keyword

Usage

    &'DTDis_attr_keyword($word);

Description

DTDis_attr_keyword returns 1 if $word is an attribute content reserved value, otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:

Character case is ignored.


DTDis_elem_keyword

Usage

    &'DTDis_elem_keyword($word);

Description

DTDis_elem_keyword returns 1 if $word is an element content reserved value, otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:

Character case is ignored.


DTDis_group_connector

Usage

    &'DTDis_group_connector($char);

Description

DTDis_group_connector returns 1 if $char is a group connector, otherwise, it returns 0. The following values of $char will return 1:


DTDis_occur_indicator

Usage

    &'DTDis_occur_indicator($char);

Description

DTDis_occur_indicator returns 1 if $char is an occurrence indicator, otherwise, it returns 0. The following values of $char will return 1:


DTDis_tag_name

Usage

    &'DTDis_tag_name($string);

Description

DTDis_tag_name returns 1 if $string is a legal tag name, otherwise, it returns 0. Legal characters in a tag name are defined by the $namechars variable. By default, a tag name may only contain the characters "A-Za-z_.-".


DTDprint_tree

Usage

    &'DTDprint_tree($elem, $depth, FILEHANDLE);

Description

DTDprint_tree prints the content hierarchy of a single element, $elem, to a maximum depth of $depth to the file specified by FILEHANDLE. If FILEHANDLE is not specified, then output goes to standard output. A depth of 5 is used if $depth is not specified. The root of the tree has a depth of 1.

The output generated by DTDprint_tree is as follows:

An element is pruned if it already exists at a higher (or equal) level, or if the maximum depth has been reached. The string "..." is appended to an element if it has been pruned due to pre-existence at a higher (or equal) level. The content of the pruned element can be determined by searching for the complete tree of the element (i.e., elements without "...").

Here's an example of what the output will look like due to pruning of recursive element contents:

    htmlplus
    |
    |_body
    |  |
    |  |_address
    |  |  |
    |  |  |_p ...
    |  |
    |  |_div1
    |  |  |
    |  |  |_address ...
    |  |  |_div2 ...
    |  |  |_div3 ...
    |  |  |_div4 ...
    |  |  |_div5 ...
    |  |  |_div6 ...

Since the tree output is static, the inclusion and exclusion sets of elements are treated specially. Inclusion and exclusion elements inherited from ancestors are not propagated down to determine what elements are printed, but special markup is presented at a given element if there exist inclusion and exclusion elements from ancestors. The reason inclusion and exclusion elements are not propagated down is the pruning that is done. An element without "..." may be the only place of reference to see the content hierarchy of that element. However, the element may occur in multiple contents and have different ancestral inclusion and exclusion elements applied to it.

Have I lost you? Perhaps an example will help:

     OPENBOOK
     |
     |_d1
     |  | (I): idx needbegin needend newline
     |  |
     |  |_abbrev
     |  |  | (Ia): idx needbegin needend newline
     |  |  | (X): needbegin needend
     |  |  |
     |  |  |_#PCDATA
     |  |  |_acro
     |  |  |  | (Ia): idx needbegin needend newline
     |  |  |  | (Xa): needbegin needend
     |  |  |  |
     |  |  |  |_#PCDATA
     |  |  |  |_sub ...
     |  |  |  |_super ...
     |  |  | 

Ignoring the lines starting with ()'s, one gets the content hierarchy of an element as defined by the DTD without concern for where it may occur in the overall structure. The ()'s lines give additional information regarding the element with respect to its existence within a specific context. For example, when an acro element occurs within openbook/d1/abbrev, along with its normal content, it can contain idx and newline elements due to inclusions from ancestors. However, it cannot contain needbegin or needend, regardless of its defined content, since one or more ancestors exclude them.

Note
Exclusions override inclusions. If an element occurs in an inclusion set and an exclusion set, the exclusion takes precedence. Therefore, in the above example, needbegin, needend are excluded from acro.

Explanation of ()'s keys:

(I)
The list of inclusion elements defined by the current element. Since this is part of the content model of the element, the inclusion elements are printed as part of the content hierarchy of the current element.
(Ia)
The list of inclusion elements due to ancestors. This is listed as reference to determine the content of an element within a given context. None of the ancestral inclusion elements are printed as part of the content hierarchy of the element.
(X)
The list of exclusion elements defined by the current element. Since this is part of the content model of the element, the exclusion elements prevent elements defined in the base content and inclusion sets from being printed.
(Xa)
The list of exclusion elements due to ancestors. This is listed as reference to determine the content of an element within a given context. None of the ancestral exclusion elements have any effect on the printing of the content hierarchy of the current element.
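
The rule that exclusions override inclusions can be sketched in a few lines, as a way to work out an element's content within a given context from the ()'s lines. The routine content_in_context is hypothetical and is not part of dtd.pl. The values passed to it are taken from the acro entry in the example tree.

```perl
use strict;

# Hypothetical helper, not part of dtd.pl.  Arguments are array
# references: base content, inclusions, exclusions.
sub content_in_context {
    my ($base, $inc, $exc) = @_;
    my %excluded = map { ($_ => 1) } @$exc;
    my %seen;
    # Keep base and included elements once each, minus any exclusion.
    return grep { !$excluded{$_} && !$seen{$_}++ } (@$base, @$inc);
}

# acro within openbook/d1/abbrev, using the values from the tree above.
my @content = content_in_context(
    ['#PCDATA', 'sub', 'super'],
    ['idx', 'needbegin', 'needend', 'newline'],
    ['needbegin', 'needend'],
);
print "@content\n";
```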

Availability

This program is part of the perlSGML package; see <URL:http://www.oac.uci.edu/indiv/ehood/perlSGML.html>


Author

Earl Hood <ehood@convex.com>
CONVEX Computer Corporation
3000 Waterview Parkway
P.O. Box 833851
Richardson, TX 75083-3851

Phone: (214) 497-4387
FAX: (214) 497-4500

dtd.pl 2.2.0