# GNU Source-highlight 2.5

## GNU Source-highlight

GNU Source-highlight, given a source file, produces a document with syntax highlighting.

This is Edition 2.5 of the Source-highlight manual.

This manual documents GNU Source-highlight version 2.5 (5 October 2006).

## 1 Introduction

GNU Source-highlight, given a source file, produces a document with syntax highlighting. The colors and the styles (bold, italics, underline) can be specified by means of a configuration file, and some other options can be specified on the command line.

The program already recognizes many programming languages (e.g., C++, Java, Perl, etc.) and file formats (e.g., log files, ChangeLog, etc.), and some output formats (e.g., HTML, ANSI color escape sequences, LaTeX, etc.). Since version 2.0, it allows you to specify your own input source language via a simple syntax described later in this manual (Language Definitions). Since version 2.1, it allows you to specify your own output format language via a simple syntax described later in this manual (Output Language Definitions). Since version 2.2, it is able to generate cross references (e.g., to variable names, field names, etc.) by relying on the program ctags, http://ctags.sourceforge.net (Generating References).

### 1.1 Supported languages

The complete list of languages (indeed, file extensions) natively supported by this version of Source-highlight (2.5), as reported by --lang-list, is the following:

     Supported languages (file extensions)
     and associated language definition files

     C = cpp.lang
     H = cpp.lang
     bib = bib.lang
     bison = bison.lang
     c = cpp.lang
     caml = caml.lang
     cc = cpp.lang
     changelog = changelog.lang
     cls = latex.lang
     cpp = cpp.lang
     cs = csharp.lang
     csharp = csharp.lang
     diff = diff.lang
     docbook = xml.lang
     dtx = latex.lang
     eps = postscript.lang
     flex = flex.lang
     fortran = fortran.lang
     h = cpp.lang
     hh = cpp.lang
     hpp = cpp.lang
     htm = html.lang
     html = html.lang
     java = java.lang
     javascript = javascript.lang
     js = javascript.lang
     l = flex.lang
     lang = langdef.lang
     langdef = langdef.lang
     latex = latex.lang
     lex = flex.lang
     lgt = logtalk.lang
     ll = flex.lang
     log = syslog.lang
     logtalk = logtalk.lang
     lua = lua.lang
     ml = caml.lang
     mli = caml.lang
     outlang = outlang.lang
     pas = pascal.lang
     pascal = pascal.lang
     patch = diff.lang
     perl = perl.lang
     php = php.lang
     php3 = php.lang
     pl = prolog.lang
     pm = perl.lang
     postscript = postscript.lang
     prolog = prolog.lang
     ps = postscript.lang
     py = python.lang
     python = python.lang
     rb = ruby.lang
     ruby = ruby.lang
     sh = sh.lang
     shell = sh.lang
     sig = sml.lang
     sml = sml.lang
     sql = sql.lang
     sty = latex.lang
     style = style.lang
     syslog = syslog.lang
     tcl = tcl.lang
     tex = latex.lang
     tk = tcl.lang
     txt = nohilite.lang
     xhtml = xml.lang
     xml = xml.lang
     y = bison.lang
     yacc = bison.lang
     yy = bison.lang


The complete list of output formats natively supported by this version of Source-highlight (2.5), as reported by --outlang-list, is the following:

     Supported output languages
     and associated language definition files

     docbook = docbook.outlang
     esc = esc.outlang
     esc-doc = esc.outlang
     html = html.outlang
     html-css = css_common.outlang
     html-css-doc = cssdoc.outlang
     html-doc = htmldoc.outlang
     latex = latex.outlang
     latex-doc = latexdoc.outlang
     latexcolor = latexcolor.outlang
     latexcolor-doc = latexcolordoc.outlang
     texinfo = texinfo.outlang
     xhtml = xhtml.outlang
     xhtml-css = xhtmlcss.outlang
     xhtml-css-doc = xhtmldoc.outlang
     xhtml-doc = xhtmldoc.outlang


The meaning of the suffixes -doc, -css and -css-doc is explained in Output Language map.

Please keep in mind that I haven't personally tested all these language definitions: I checked that each definition file is correct (with the command line option --check-lang, see Invoking source-highlight), but I cannot be sure that each definition actually respects the syntax of its language (e.g., I put together some language definitions by searching for information on the Internet, without ever having programmed in that language). So, if you find that a language definition is not precise, please let me know. Moreover, if you have a program example in a language that's not included in the tests directory, please send it to me so that I can include it in the test suite.

### 1.2 Using source-highlight as a simple formatter

You can also use source-highlight as a simple formatter of an input file, i.e., without performing any highlighting.

You can achieve this by specifying the file nohilite.lang as the language definition file for the input, with the command line option --lang-def (Invoking source-highlight). Since that language definition is empty, no highlighting will be performed; however, source-highlight will transform the input file into the output format. Notice, in the input language associations in Supported languages, that nohilite.lang is also associated with txt files.

This, for instance, makes source-highlight useful when you want to transform a text file into HTML or LaTeX. During the output, in fact, source-highlight will correctly generate the characters that have a specific meaning in the output format.

For instance, in this Texinfo manual, if I want to insert a @ or a { I have to “escape” them to make them appear literally since they have a special meaning in Texinfo. The same holds, e.g., for <, > or & in HTML. If you use source-highlight, it will take care of this, automatically for you. This is the Texinfo source of the above sentence:

     For instance, in this Texinfo manual,
     if I want to insert a @@ or a @{
     I have to ``escape'' them to make them appear literally
     since they have a special meaning in Texinfo.
     The same holds, e.g.,
     for @code{<}, @code{>} or @code{&} in HTML.
     If you use source-highlight,
     it will take care of this, automatically for you.


This was processed by source-highlight as a simple text file, without any highlighting; however, since it was formatted in Texinfo, all the necessary escaping was automatically performed. This way, it is very easy to insert, in the same document, a piece of code and its result (as in this example).

This is actually the formatting performed by source-highlight; except for the comment, this is basically what you should have written yourself to do all the escaping stuff manually:

     @c Generator: GNU source-highlight, by Lorenzo Bettini, http://www.gnu.org/software/src-highlite
     @example
     For instance, in this Texinfo manual,
     if I want to insert a @@@@ or a @@@{
     I have to ``escape'' them to make them appear literally
     since they have a special meaning in Texinfo.
     The same holds, e.g.,
     for @@code@{<@}, @@code@{>@} or @@code@{&@} in HTML.
     If you use source-highlight,
     it will take care of this, automatically for you.
     @end example
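
The escaping visible in the output above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `escape_texinfo` is hypothetical, and it is not how source-highlight implements the translation (the program relies on its output language definitions). The sketch assumes that only `@`, `{` and `}` need escaping, which is all the example above exercises.

```python
def escape_texinfo(text: str) -> str:
    """Escape the characters that are special in Texinfo: @, { and }."""
    # Prefix each special character with '@' in a single pass, so that
    # the '@' signs added by the escaping are not escaped a second time.
    return "".join("@" + ch if ch in "@{}" else ch for ch in text)


source_line = "for @code{<}, @code{>} or @code{&} in HTML."
print(escape_texinfo(source_line))
# for @@code@{<@}, @@code@{>@} or @@code@{&@} in HTML.
```

A single pass matters here: replacing `@` first and the braces afterwards with chained string replacements would also work, but replacing the braces first would escape the freshly inserted `@` signs again.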


In case source-highlight does not handle a specific input language, you can still use the option --failsafe (Invoking source-highlight); in that case, too, no highlighting will be performed, but source-highlight will transform the input file into the output format.

Notice, however, that if the input language cannot be established, the default.lang will be used: an empty language definition file which you might want to customize.

## 2 Installation

See the file INSTALL for detailed building and installation instructions; anyway if you're used to compiling Linux software that comes with sources you may simply follow the usual procedure, i.e., untar the file you downloaded in a directory and then:

     cd <source code main directory>
     ./configure
     make
     make install


However, before you do this, please check that you have everything that is needed to build source-highlight (see What you need to build source-highlight).

Note: unless you specify a different install directory by --prefix option of configure (e.g. ./configure --prefix=<your home>), you must be root to run make install.

Files will be installed in the following directories:

- Executables: /prefix/bin
- Docs and samples: /prefix/share/doc/source-highlight
- Conf files: /prefix/share/source-highlight

Default value for prefix is /usr/local but you may change it with --prefix option to configure.

NOTICE: Originally, instead of Source-highlight, there were two separate programs, namely GNU java2html and GNU cpp2html. There are two shell scripts with the same name that will be installed together with Source-highlight in order to facilitate the migration (however their use is not advised and it is deprecated).

You can download Source-highlight from GNU's ftp site: ftp://ftp.gnu.org/gnu/src-highlite or from one of its mirrors (see http://www.gnu.org/prep/ftp.html).

I no longer distribute Windows binaries, since they can easily be built with the Cygnus C/C++ compiler, available at http://www.cygwin.com. However, if you don't feel like downloading that compiler, you can request the binaries from me by e-mail (you can find my e-mail address at my home page) and I can send them to you. An MS-Windows port of Source-highlight is also available from http://gnuwin32.sourceforge.net.

Archives are digitally signed by me (Lorenzo Bettini) with GNU gpg (http://www.gnupg.org). My GPG public key can be found at my home page (http://www.lorenzobettini.it).

You can also get the patches, if they are available for a particular release (see below for patching from a previous version).

### 2.2 Anonymous CVS Access

This project's CVS repository can be checked out through anonymous (pserver) CVS with the following instruction:

     cvs -z3 -d:pserver:anonymous@cvs.savannah.gnu.org:/sources/src-highlite co src-highlite


Further instructions can be found at the address:

Please notice that this way you will get the latest development sources of Source-highlight, which may also be unstable. This solution is the best if you intend to correct/extend this program: you should send me patches against the latest cvs repository sources.

If, on the contrary, you want to get the sources of a given release through cvs (say, version X.Y.Z), you must specify the tag rel_X_Y_Z when you run the cvs command or the cvs update command.
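
As a small illustration of the tag convention, the following Python sketch builds the tag name and a checkout command line from a version string. The helper names `release_tag` and `checkout_command` are hypothetical, the version number used is only an example, and the sketch assumes that the tag is passed with the standard `-r` option of `cvs checkout`; it only builds the command string and does not run it.

```python
CVSROOT = ":pserver:anonymous@cvs.savannah.gnu.org:/sources/src-highlite"


def release_tag(version: str) -> str:
    """Map a dotted version such as "2.1.2" to the tag form rel_X_Y_Z."""
    parts = version.split(".")
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"not a dotted numeric version: {version!r}")
    return "rel_" + "_".join(parts)


def checkout_command(version: str) -> str:
    """Build (but do not execute) a cvs checkout command for a release tag."""
    return f"cvs -z3 -d{CVSROOT} co -r {release_tag(version)} src-highlite"


print(release_tag("2.1.2"))       # rel_2_1_2
print(checkout_command("2.1.2"))
```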

When you compile the sources that you get through the cvs repository, before running the configure and make commands, you should, at least the first time, run the command:

     sh reconf


This will run the autotools commands in the correct order, and also copy possibly missing files. You should have installed recent versions of automake and autoconf in order for this to succeed. You will also need flex and bison.

NOTICE: This convention holds since release 2.1.

### 2.3 What you need to build source-highlight

Since version 2.0 Source-highlight relies on regular expressions as provided by boost (http://www.boost.org), so you need to install at least the regex library from boost.

Most GNU/Linux distributions provide this library already in a compiled form. If you use your distribution packages, please be sure to install also the development package of the boost libraries.

If you experience problems in installing Boost Regex library, or in compiling source-highlight because of this library, please take a look at Tips on installing Boost Regex library.

If you want to use a specific version of the Boost regex library (because you have many versions of it), you can use the configure option --with-boost-regex to specify a particular suffix. For instance,

     ./configure --with-boost-regex=boost_regex-gcc-1_31


Source-highlight has been developed under GNU/Linux, using gcc (C++), bison (yacc) and flex (lex), and ported to Win32 with the Cygnus C/C++ compiler, available at http://www.cygwin.com. I used the excellent GNU Autoconf and GNU Automake. I also used Autotools (ftp://ftp.ugcs.caltech.edu/pub/elef/autotools), which creates a starting source tree (according to GNU standards) with the initial autoconf and automake files. Finally, I used GNU gengetopt (http://www.gnu.org/software/gengetopt) for command line parsing.

I have also started to use doublecpp (http://www.lorenzobettini.it/software/doublecpp), which makes it possible to achieve dynamic overloading.

Actually, apart from the boost regex library, you don't need the other tools above to build source-highlight (indeed I provide the output sources generated by the above mentioned tools), unless you want to develop source-highlight.

However, if you obtained sources through CVS, you need some other tools, see Anonymous CVS Access.

### 2.4 Tips on installing Boost Regex library

I created this section because many users reported problems after installing the Boost Regex library from sources; other users had problems compiling source-highlight even though this library was already correctly installed (especially Windows users using Cygwin). I hope this section sheds some light on installing and using the Boost Regex library. Please notice that this section does not explain how to compile the Boost libraries (the documentation you'll find on http://www.boost.org is well done); it explains how to tweak things if you have problems compiling source-highlight even after a successful installation of the Boost libraries.

If you experience no problem in compiling source-highlight, you can happily skip this section :-)

First of all, if your distribution provides packages for the Boost regex library, please be sure to install also the development package of the boost libraries, i.e., those providing also the header files needed to compile a program using these libraries. For instance, on my Debian system I had to install the package libboost-regex-dev, besides the package libboost-regex.

If your distribution does not provide these packages then you have to download the sources of Boost libraries from http://www.boost.org and follow the instructions for compilation and installation. However, I suggest you specify /usr as prefix for installation, instead of relying on the default prefix /usr/local (unless /usr/local/include is already in the inclusion path of your C++ compiler), since this will make things easier when compiling source-highlight. I suggest this, since /usr/include is usually the place where C++ searches for header files during compilation.

If you successfully compiled and installed the Boost Regex library, or you installed the package from your distribution, but you STILL experience problems in compiling source-highlight, then you simply have to adjust some things as described in the following.

If the ./configure command of source-highlight reports this error:

     ERROR! Boost::regex library not installed.


then, the compiler cannot find the header files for this library. In this case, check that the directory /usr/include/boost actually exists; if it does not, then probably you'll find a similar directory, e.g., /usr/include/boost-1_33/boost, depending on the version of the library you have installed. Then, all you have to do is to create a symbolic link as follows:

     ln -s /usr/include/boost-1_33/boost /usr/include/boost


Alternatively, you might run source-highlight's configure as follows:

     ./configure CXXFLAGS=-I/usr/include/boost-1_33/


If the ./configure command of source-highlight then reports this other error:

     ERROR! Boost::regex library is installed, but you
     must specify the suffix with --with-boost-regex at configure
     for instance, --with-boost-regex=boost_regex-gcc-1_31


then, there's still another thing to fix: you must find out the exact names of the files of your installed Boost Regex libraries; you can do this by using the command:

### 7.5 Variable definitions

     vardef FUNCTION = '(?:[[:alpha:]]|_)[[:word:]]*[[:blank:]]*(?=\()'
     function = $FUNCTION

The capital letters are used only for readability. It is also possible to concatenate variables and expressions, and to reuse variables inside further variable definitions:

     vardef basic_time = '[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}'
     vardef time = '\<' + $basic_time + '\>'

### 7.6 File inclusion

It is possible to include other language definition files into another file. The inclusion physically inserts the contents of the included file into the current file during parsing, at the exact point of inclusion (just like #include in C/C++). This is useful for re-using definitions in many files. For instance, C++ comment definitions are given in the file c_comment.lang, and this file is included in the Java and C++ definition files. The same happens for numbers and functions. For instance, the file java.lang contains the following include instructions:

     include "c_comment.lang"
     include "number.lang"

     keywords ...

     include "function.lang"

Notice that the order of inclusion is crucial since the order of definition is crucial. If the function definition were included before the keyword definitions, then the sentence if (exp) would be highlighted as a function invocation.

### 7.7 State/Environment Definitions

Sometimes you want some source elements to be highlighted only if they are surrounded by other elements. Source-highlight language definitions also provide this feature.

     state|environment <standard definition> begin
       <other definitions>
     end

This structure is recursive (so other state/environment definitions can be given within a state/environment). The meaning of a state/environment is that the definitions within begin ... end are matched only if the definitions that define the state/environment have been matched.
When entering a state/environment, however, the definitions given outside the state/environment are not matched. The difference between state and environment is that in the latter, normal parts of the source language (i.e., those that do not match any definition) are highlighted according to the style of the definition that defines the environment.

As an example, the following defines the multiline nested C comment, and highlights URLs and e-mail addresses only when they appear inside a comment (notice that this uses file inclusion):

     environment comment delim "/*" "*/" multiline nested begin
       include "url.lang"
     end

Notice that we used environment because everything else inside a comment has to be formatted according to the comment style.

While for programming language definitions states/environments can be avoided (although they allow you to highlight some parts only inside a specific environment, e.g., URLs inside comments, or documentation tags in Javadoc comments), they are pretty important for highlighting files such as logs and ChangeLog files, since elements have to be highlighted when they appear in a specific position. For instance, for ChangeLog (see changelog.lang), we use a state for highlighting the date, name, e-mail or URL (taken from url.lang):

     state date start '[[:digit:]]{2,4}-?[[:digit:]]{2}-?[[:digit:]]{2}' begin
       include "url.lang"
       name = '([[:word:]]|[[:punct:]])+'
     end

Notice that definitions that appear inside a state/environment have the same scope as the expressions that define the environment. While this makes sense for start and delim definitions, it may make less sense for simple definitions (i.e., those that simply list all possible expressions): in fact, in this case, such expressions do not define a scope. For such definitions, the semantics of state/environment is that the state/environment starts after matching one of the alternatives. And where will it end? In this case you must explicitly exit the environment.
For instance, you can say that, when inside a state/environment, a specific language definition, when encountered, also exits the environment (with the keyword exit). You can even exit all the environments with exitall. For instance, the following definition highlights a non-empty string following a web method:

     vardef non_empty = '[^[:blank:]]+'

     state webmethod = "OPTIONS|GET|HEAD|POST|PUT|DELETE",
       "TRACE|CONNECT|PROPFIND|MKCOL|COPY|MOVE|LOCK|UNLOCK" begin
       string = $non_empty exit
     end

If you ever need such advanced features, you may want to take a look at the log.lang definition file, which defines highlighting for several log files (access logs, Apache logs, etc.).

### 7.8 Redefinitions and Substitutions

These two features are useful when you want to define a language by re-using an existing language definition with some changes. Typically you include another language definition file and you redefine/substitute some elements.

When you use redef you erase all the previous definitions of that language element and replace them with the new one. The new language element definition will be placed exactly at the point of the new definition. We use this feature, for instance, when we define the sml language by re-using the caml one: they differ only in the keywords. In fact, the contents of sml.lang can be summarized as follows:

     include "caml.lang"

     redef keyword = "abstraction|abstype|and|andalso..."

     redef type = "int|byte|boolean|char|long|float|double|short|void"

Since the new language element definition appears at the exact point of the redefinition, such a regular expression will be matched only if all the previous ones (the ones of the included file) cannot be matched. This may lead to unwanted results in some cases (not in the sml case though).
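
The kind of unwanted result mentioned here can be made concrete with a small Python model. It is a deliberately simplified sketch and not source-highlight's implementation: definitions are kept in an ordered list and tried one after the other at a single input position, `redef` is modelled as "drop every earlier definition of that element, then append the new one", and the regular expressions are rewritten in Python's `re` dialect. The names `first_match` and `redef`, and the sample keyword lists, are assumptions made for the illustration.

```python
import re


def first_match(definitions, text):
    """Return the element name of the first definition matching at the start of text."""
    for name, pattern in definitions:
        if re.match(pattern, text):
            return name
    return "normal"


def redef(definitions, name, pattern):
    """Model of redef: erase all earlier definitions of name, append the new one."""
    kept = [(n, p) for n, p in definitions if n != name]
    return kept + [(name, pattern)]


# An identifier followed by an open parenthesis that is left in the input.
FUNCTION = r"[A-Za-z_]\w*[ \t]*(?=\()"

# Hypothetical included file: keywords first, then functions.
included = [
    ("keyword", r"\b(?:class|while)\b"),
    ("function", FUNCTION),
]

# Redefining the keywords moves them after the function definition.
redefined = redef(included, "keyword", r"\b(?:if|else)\b")

print([name for name, _ in redefined])      # ['function', 'keyword']
print(first_match(redefined, "if (exp)"))   # function
```

With the keyword definition placed before the function definition, the same input is classified as a keyword, which is why the position of a redefinition matters.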
In other words, the following code

     keyword = "foo"
     keyword = "bar"
     type = "int"
     redef keyword = "myfoo"

is equivalent to the following one

     type = "int"
     keyword = "myfoo"

If this is not what you want, you can use subst, which is similar to redef except that it replaces the first previous definition of that language element at the exact point of that first definition (all other possible definitions are simply erased). That is to say, the following code

     keyword = "foo"
     keyword = "bar"
     type = "int"
     subst keyword = "myfoo"

is equivalent to the following one

     keyword = "myfoo"
     type = "int"

It is up to you to decide which one best fits your needs. We use this feature to define javascript in terms of java:

     include "java.lang"

     subst keyword = "abstract|break|case|catch|class|if..."

Here using redef would have led to the unwanted behavior that if (exp) would have been highlighted as a function call, since the function element definition would have come before (and thus been matched before) the redefinition of if as a keyword. Another example is the language definition for C#, which reuses the one for C/C++ (see Highlighting C/C++ and C#).

### 7.9 Notes on regular expressions

Although we refer to the Boost documentation for such syntax, we want to provide here some explanations of some forms of regular expressions that might be unknown but that are pretty useful in language definitions.

Typically, when you need to group sub-expressions with parentheses, but you don't want the parentheses to spit out another marked sub-expression, you can use non-marking parentheses (?:expression). This is not necessary in the language definition syntax: even if you use standard parentheses, source-highlight will transform them into non-marking parentheses.
A useful regular expression form is the forward lookahead assert, which comes in two forms, one positive and one negative:

- (?=abc) matches zero characters only if they are followed by the expression “abc”.
- (?!abc) matches zero characters only if they are not followed by the expression “abc”.

For instance, in the definition of a function we use the following regular expression:

     ([[:alpha:]]|_)[[:word:]]*[[:blank:]]*(?=\()

Thus after the name of a function we test, with the regular expression (?=\(), whether an open parenthesis ( can be matched. If it can be matched, however, we leave that part in the input (so that the parenthesis will not be formatted the same way as a function name).

Please be careful when using such regular expression forms: since part of the input is not actually removed, you may end up always scanning the same part of the input (thus looping) if you do not write the regular expressions well. For instance, consider this language definition

     state foo = '(?=foo)' begin
       foo = '(?=foo)'
     end

and the following input file:

     hello foo bar

As soon as we match the word foo we leave it in the input and we enter a state where we try to match the word foo, still leaving it in the input. As you might have guessed, this will make source-highlight loop forever. Probably one might have wanted to write this language definition:

     state foo = '(?=foo)' begin
       foo = 'foo'
     end

but a cut-and-paste error had its way ;-)

You can also use lookbehind asserts:

- (?<=pattern) consumes zero characters, only if pattern could be matched against the characters preceding the current position (pattern must be of fixed length).
- (?<!pattern) consumes zero characters, only if pattern could not be matched against the characters preceding the current position (pattern must be of fixed length).
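
The behavior of these asserts can be tried out quickly in Python, whose `re` module supports the same four forms. This is only a sketch for experimenting: source-highlight uses Boost regular expressions, not Python's, and since `re` has no POSIX classes such as `[[:alpha:]]` the function-name expression is rewritten with plain character classes. The sample input strings are made up for the illustration.

```python
import re

# Function name followed by an open parenthesis that is NOT consumed.
FUNCTION = re.compile(r"[A-Za-z_]\w*[ \t]*(?=\()")

match = FUNCTION.search("result = compute (x)")
print(repr(match.group()))        # 'compute '
print("result = compute (x)"[match.end()])   # (

# A pattern made only of a lookahead matches zero characters: the scanner
# makes no progress, which is the source of the endless loop described above.
zero_width = re.match(r"(?=foo)", "foo bar")
print(zero_width.span())          # (0, 0)

# Negative lookahead: "foo" only when it is not followed by "bar".
print(re.findall(r"\bfoo(?!bar)", "foobar foo"))     # ['foo']

# Positive lookbehind: a word only when it is preceded by "$".
print(re.findall(r"(?<=\$)\w+", "cost $x and y"))    # ['x']
```

The second print shows that the parenthesis is still the next character in the input after the match, so a later definition (for instance one for symbols) can format it with its own style.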
### 7.10 Listing Language Elements

In order for language definitions to be really useful they must be used in proper combination with formatting styles (see Output format style). However, these different files might not be developed by the same person, or someone may simply want to customize one of them. In order to define good output formatting style files you should be aware of each language element defined by a language definition file. Instead of having to look inside the language definition file itself (and recursively in each included file) you can use the command line option --show-lang-elements, which simply prints to the standard output all the language elements that can be highlighted with a specific language definition file. For instance, for cpp.lang you get:

     cbracket
     comment
     function
     keyword
     number
     preproc
     specialchar
     string
     symbol
     todo
     type
     url

while for log.lang you get:

     cbracket
     comment
     date
     function
     ip
     normal
     number
     port
     string
     symbol
     time
     twonumbers
     webmethod

### 7.11 Concluding Remarks

By mixing all these features you can unleash your imagination and define highlighting for complex source languages such as Flex and Bison by writing a few lines of code and re-using existing ones. For instance, Flex and Bison have their own syntax and let you write C/C++ code in specific parts of the source language, e.g., the code between the outermost brackets, in the following example, is C++ code, and should be highlighted following C++ language definitions (apart from variables that are prefixed with $):

     globaltags : options { if (...) { setTags( $1 ); } }

This is easy to do (taken from flex.lang):

     state cbracket delim "{" "}" multiline nested begin
       variable = '\$.'
       include "cpp.lang"
     end

Notice that, since we used nested, we can be sure that the C++ language definitions are not considered anymore when we match the last closing }.

### 7.12 Debugging

When writing a language definition file, it is quite useful to be able to debug it (with complex regular expressions one may experience unwanted behaviors). Since version 2.1 the command line option --debug-lang is available. When using this option, some additional information is printed to the standard output. Since version 2.5 this option also accepts a sub-specification (see Invoking source-highlight). When using dump (the default), all the additional information explained below will be dumped without interaction with the user. When using interactive, for each formatted string the program will stop, waiting for a command from the user. In this very primordial version of interactive debugging, the user only has to press ENTER to make the program continue until the next formatted string. This way, the programmer has the chance to step through the highlighting of each part of the input file. Moreover, when debugging is enabled, no buffering is performed by the program, thus each formatted element is immediately available in the output. For instance, you can use the command tail -f to see the modifications to the output file on the fly.

When using this command line option the additional information produced has the following format:

     <.lang filename>:<line number>: <matched subexpression>
     formatting: <source file string to be formatted>
     entering: <next state's regular expression>
     exiting:
     exitingall:

The lines starting with entering, exiting and exitingall are related to entering a new state/environment and exiting one and all states/environments.
The first line shows a link to the .lang definition file and the line number, together with the sub-expression that matched; the line starting with formatting shows the source file string that matched that expression. If a line starting with formatting is not preceded by a line with the link to the sub-expression, it means that no particular regular expression has matched, and thus the style normal will be used to format that string.

Consider the following (simplified) Java source file:

     /*
       This is to demonstrate --debug-lang
       http://www.lorenzobettini.it
     */

     package hello;

     public class Hello {
       // just some greetings ;-)  /*
       int i = 10;
       System.out.println("Hello World!");
     }

Now you can debug the java.lang file by using the --debug-lang command line option. The output is as follows:

     c_comment.lang:15: (/\*)
     formatting: "/*" as comment
     entering: (\*/)|(/\*)|...
     formatting: "" as comment
     formatting: " This is to demonstrate --debug-lang" as comment
     formatting: " " as comment
     url.lang:2: ((?:(?:[[:word:]]+://(?:[[:word:]]+[\./\-_]?)+)))
     formatting: "http://www.lorenzobettini.it" as url
     formatting: "" as comment
     c_comment.lang:15: (\*/)
     formatting: "*/" as comment
     exiting 1 level(s): (\<(?:import|package)\>)|(//)|...
     formatting: "" as normal
     formatting: "" as normal
     java.lang:1: (\<(?:import|package)\>)
     formatting: "package" as preproc
     formatting: " hello" as normal
     symbols.lang:1: ((?:~|!|%|\^|\*|\(|\)...
     formatting: ";" as symbol
     ... omissis ...
     c_comment.lang:2: (//)
     formatting: "//" as comment
     entering: (\z)
     formatting: " just some greetings ;-)  /*" as comment
     c_comment.lang:2: (\z)
     formatting: "" as comment
     exiting 1 level(s): (\<(?:import|package)\>)|(//)|...


This should provide enough information to understand how the regular expressions are used and how the states/environments are entered and exited. Please notice that the sub-expressions that are shown may differ from the original ones specified in the .lang file. This is due to the preprocessing that is performed by Source-highlight. Moreover, some sub-expressions are not defined at all in the .lang file: for instance, this is the case for line wide definitions, i.e., those that are defined with the keyword start (see Line wide definitions). The last lines above, showing entering: (\z), mean that we wait to reach the end of a line.
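
When the debug output gets long, it can help to summarize it with a small script. The following Python sketch counts how many strings were formatted with each style. It assumes that the lines of interest have exactly the shape `formatting: "<string>" as <style>` shown above and that style names consist of word characters only; the function name `count_styles` is hypothetical and the sample is a short excerpt of the lines shown above.

```python
import re
from collections import Counter

# One "formatting" line: the quoted source string, then the style name.
FORMATTING = re.compile(r'^formatting: "(.*)" as (\w+)$')


def count_styles(debug_output: str) -> Counter:
    """Count how many strings were formatted with each style."""
    counts = Counter()
    for line in debug_output.splitlines():
        match = FORMATTING.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


sample = r'''
c_comment.lang:2: (//)
formatting: "//" as comment
entering: (\z)
formatting: " just some greetings ;-)  /*" as comment
java.lang:1: (\<(?:import|package)\>)
formatting: "package" as preproc
'''

print(count_styles(sample))   # Counter({'comment': 2, 'preproc': 1})
```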

Another useful feature in debugging is the option --show-regex that shows, on the standard output, the regular expression automaton that source-highlight creates.

For instance, consider this language definition (comment-show.lang):

     vardef TODO = '(TODO|FIXME)([:]?)'

     environment comment delim "/**" "*/" multiline nested begin
       type = '@[[:alpha:]]+'
       todo = $TODO
     end

     string delim "<" ">"
     string2 delim "<<" ">>" multiline

If you now execute the following command:

     source-highlight --show-regex=comment-show.lang

you will get, on the standard output, the following output:

     STATE 1
       0: normal (exit level: 0, exit_all: 0, next: none)
       1: comment (/\*\*) (exit level: 0, exit_all: 0, next: 2)
         STATE 2
           0: comment (exit level: 0, exit_all: 0, next: none)
           1: comment (\*/) (exit level: 1, exit_all: 0, next: none)
           2: comment (/\*\*) (exit level: 0, exit_all: 0, next: 2)
           3: type ((?:@[[:alpha:]]+)) (exit level: 0, exit_all: 0, next: none)
           4: todo ((?:(?:TODO|FIXME)(?:[:]?))) (exit level: 0, exit_all: 0, next: none)
       2: string (<(?:[^<>])*>) (exit level: 0, exit_all: 0, next: none)
       3: string2 (<<) (exit level: 0, exit_all: 0, next: 3)
         STATE 3
           0: string2 (exit level: 0, exit_all: 0, next: none)
           1: string2 (>>) (exit level: 1, exit_all: 0, next: none)

This shows the states of the regular expression automaton that source-highlight creates and will use to format an input source. Each state is associated with a unique number that identifies it. Then, for each state, it shows the regular expressions associated with each element. The first element (the one numbered 0) of each state is always the default style for that state, i.e., the style applied if no regular expression is matched (in fact it does not have an associated regular expression). For instance, in the initial state the default style is normal. Then, we can see that if we match a /** (it is shown as a string with escaped special characters, /\*\*) we enter a new state, in this case state 2 (next: 2). This corresponds to the delimited element defining a new environment. The fact that it is actually an environment and not a state can be seen from the fact that the default style is the same as that of the environment itself. If we match a */, i.e., the end of the delimited element, we exit one level (exit level: 1), meaning that we go back to state 1.
Since the delimited element is defined as nested, we can notice that in the state 2 we have that if we match /** we simply enter a new instance of state 2 itself. The string and string2 show the difference implied by the multiline option: since source-highlight handles a line of input separately, the first delimited definition can be handled with a single regular expression while the multiline version cannot. Previous: Debugging, Up: Language Definitions ### 7.13 Tutorials on Language Definitions Now we provide some examples of language definitions. In the previous sections we have already provided some code snippets, while here we provide complete examples of language definitions that are included in the source-highlight distribution itself. In particular we will first show the language definition for the language definition syntax itself (file langdef.lang). This will be used to highlight the examples of language definitions that we will show in this section (the highlighting will not be visible if you are viewing this manual with the info command). Of course, this example is highlighted itself.  # this is the language definition for the # language definition syntax itself comment start "#" preproc = "include" string delim "\"" "\"" escape "\\" multiline string delim "'" "'" escape "\\" multiline keyword = "state|environment|begin|end|delim|escape|start", "multiline|nested|vardef|exitall|exit", "redef|subst|nonsensitive" symbol = "=|+|," vardef ID = '[[:word:]]+' variable = '\$' + $ID variable =$ID



The style that is used to highlight these examples in Texinfo is texinfo.style that is shown in Output format style. The language definition for the style syntax (file style.lang) is even simpler:

     # this is the language definition for the
# style definition syntax
comment start "//"

string delim "\"" "\"" escape "\\"

keyword = "purple|orange|brightorange|brightgreen|darkgreen",
"green|darkred|red|brown|pink|yellow|cyan",
"black|teal|gray|darkblue|blue",
"normal|linenum",
"noref|nf|f|u|i|b"

symbol = ",|;"

variable = '[[:word:]]+'



Notice that this definition is pretty simple, since the style definition syntax is simple. In the next examples we will see how to use more complex features to highlight more complex language syntaxes.

#### 7.13.1 Highlighting C/C++ and C#

This is the language definition for C/C++, included in the file cpp.lang:

     # definitions for C/C++
include "c_comment.lang"

state preproc start '^[[:blank:]]*#(?:[[:blank:]]*include)' begin
string delim "<" ">"
string delim "\"" "\"" escape "\\"
include "c_comment.lang"
end

preproc = '^[[:blank:]]*#([[:blank:]]*[[:word:]]*)'

include "number.lang"

include "c_string.lang"

keyword = "__asm|__cdecl|__declspec|__export|__far16",
"__fastcall|__fortran|__import",
"__pascal|__rtti|__stdcall|_asm|_cdecl",
"__except|_export|_far16|_fastcall",
"break|case|catch|cdecl|class|const|const_cast|continue|default|delete",
"do|dynamic_cast|else|enum|explicit|extern|false|for|friend|goto",
"if|inline|mutable|namespace|new|operator|pascal|private|protected",
"public|register|reinterpret_cast|return|sizeof|static|static_cast",
"struct|switch|template|this|throw|true",
"try|typedef|typeid|typename|union",
"using|virtual|volatile|while"

type = "bool|char|double|float|int|long",
"short|signed|unsigned|void|wchar_t"

include "symbols.lang"

cbracket = "{|}"

include "function.lang"



Notice that this makes use of lots of includes, since these parts are reused in other language definitions (e.g., Java has many parts in common with C/C++, so we wrote these parts in separate files). In particular, these are the comment definitions:

     # c_comment.lang

vardef TODO = '(TODO|FIXME)([:]?)'

environment comment start "///" begin
include "url.lang"
include "html.lang"
type = '@[[:alpha:]]+'
todo = $TODO
end

comment start "//"

# comments with documentation tags
environment comment delim "/**" "*/" multiline nested begin
include "url.lang"
include "html.lang"
type = '@[[:alpha:]]+'
todo = $TODO
end

environment comment delim "/*" "*/" multiline nested begin
include "url.lang"
todo = $TODO
end

Here we have the definitions for line-wide comments (//) and for multi-line comments, where we also highlight URL addresses and e-mail addresses (defined in the file url.lang, not shown here). Moreover, for comments that are used by automatic documentation generation tools (such as Doxygen or Javadoc), i.e., those that start with /** or ///, we also highlight the complete HTML syntax (defined in the file html.lang, not shown here).

Going back to cpp.lang, we see that for the preprocessor directive #include we use a state definition, since in this case the file included with the <file> syntax must be formatted as a string (and only in this context must <> be considered string delimiters; anywhere else they are operators). Since a state erases the definitions defined outside the state, we must include c_comment.lang again in order to highlight comments in this context too (see footnote 15). Then we have a definition of preproc that catches all the other preprocessor directives. The included file number.lang defines the regular expression that catches number constants (not shown here); then we include the file c_string.lang, which defines strings (again shared with Java):

     vardef SPECIALCHAR = '\\.'

     environment string delim "\"" "\"" begin
     specialchar = $SPECIALCHAR
     end

     environment string delim "'" "'" begin
     specialchar = $SPECIALCHAR
     end

Inside a string we want to highlight special characters (e.g., \n, \t) and escaped characters in general in a different way; these are matched by the regular expression '\\.'.

The included file symbols.lang defines all the symbols (shared also by other languages):

     symbol = "~","!","%","^","*","(",")","-","+","=","[",
     "]","\\",":",";",",",".","/","?","&","<",">","\|"

There is nothing interesting here, except that it shows that the characters \ and | have to be escaped.

The included file function.lang defines the regular expression to match a function definition or invocation:

     vardef FUNCTION = '([[:alpha:]]|_)[[:word:]]*[[:blank:]]*(?=\()'
     function = $FUNCTION


This shows an example of a lookahead assertion for the opening parenthesis (see Notes on regular expressions). As noted in File inclusion, it is crucial that this file is included after the keyword definition.
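
The effect of the lookahead, and of the inclusion order, can be tried outside source-highlight. The following Python sketch is only an illustration, not the way source-highlight is implemented: it transliterates the FUNCTION expression into the syntax of Python's re module (an approximation of the regular expression syntax used in language definitions), and the classify helper and its very short keyword list are hypothetical.

```python
import re

# Rough Python transliteration of FUNCTION from function.lang:
# an identifier, optional blanks, then a lookahead for "(".
# The lookahead checks for the parenthesis without consuming it.
FUNCTION = re.compile(r"[A-Za-z_]\w*[ \t]*(?=\()")

# Hypothetical, heavily shortened keyword list (cpp.lang has many more).
KEYWORD = re.compile(r"\b(?:if|while|for|return|sizeof)\b")


def classify(source, rules):
    """Return (element, text) pairs; at each position the first matching rule wins."""
    found = []
    pos = 0
    while pos < len(source):
        for element, regex in rules:
            match = regex.match(source, pos)
            if match:
                found.append((element, match.group()))
                pos = match.end()
                break
        else:
            pos += 1  # nothing matched here: skip one character
    return found


code = "if (ready) run (x);"

# Keywords tried first, as in cpp.lang: "if" stays a keyword.
keywords_first = classify(code, [("keyword", KEYWORD), ("function", FUNCTION)])

# Function rule tried first: "if (" is mistaken for a function call.
function_first = classify(code, [("function", FUNCTION), ("keyword", KEYWORD)])

print(keywords_first)
print(function_first)
```

In both runs the opening parenthesis is never part of the matched text, so it remains available to be highlighted by another definition (e.g., as a symbol); only the order of the rules decides whether if is treated as a keyword or as a function.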

Now that we have written the language definition for C/C++, writing the one for C# is straightforward, since we only need to add the keyword using as a preprocessor element, and redefine (or better, “substitute”, see Redefinitions and Substitutions) the keywords and types:

     # definitions for C-sharp
# by S. HEMMI, updated by L. Bettini.
preproc = "using"

number = '\<[+-]?((0x[[:xdigit:]]+)|(([[:digit:]]*\.)?[[:digit:]]+([eE][+-]?[[:digit:]]+)?))([FfDdMmUulL]+)?\>'

include "cpp.lang"

subst keyword = "abstract|event|new|struct ",
"as|explicit|null|switch",
"base|extern|this",
"false|operator|throw",
"break|finally|out|true",
"fixed|override|try",
"case|params|typeof",
"catch|for|private",
"foreach|protected",
"checked|goto|public|unchecked",
"const|implicit|ref",
"continue|in|return",
"virtual",
"default|interface|sealed|volatile",
"delegate|internal",
"do|is|sizeof|while",
"lock|stackalloc",
"else|static",
"enum|namespace",
"get|partial|set",
"value|where|yield"

subst type = "bool|byte|sbyte|char|decimal|double",
"float|int|uint|long|ulong|object",
"short|ushort|string|void"



#### 7.13.2 Highlighting Diff files

Now we want to highlight files that are generated by diff (typically used to create patches). This program can generate output in three different formats (at least to the best of my knowledge).

With the option -u|--unified the differences among files are shown in the same context, for instance (the examples of the diff files shown here are manually modified so that they can fit in the page width):

     diff -ruP source-highlight-2.1.1/source-highlight.spec ...
--- source-highlight-2.1.1/source-highlight.spec ...
+++ source-highlight-2.1.2/source-highlight.spec ...
@@ -6,8 +6,8 @@

Summary:   syntax highlighting for source documents
Name:      source-highlight
-Version:   2.1.1
-Release:   2.1.1
+Version:   2.1.2
+Release:   2.1.2
Group:     Utilities/Console
Source:    ftp://ftp.gnu.org/gnu/source-highlight/%{name}-%{version}.tar.gz



With the option -c|--context the differences are shown in two different parts:

     diff -rc2P source-highlight-2.1.1/source-highlight.spec ...
*** source-highlight-2.1.1/source-highlight.spec ...
--- source-highlight-2.1.2/source-highlight.spec ...
***************
*** 7,12 ****
Summary:   syntax highlighting for source documents
Name:      source-highlight
! Version:   2.1.1
! Release:   2.1.1
Group:     Utilities/Console
--- 7,12 ----
Summary:   syntax highlighting for source documents
Name:      source-highlight
! Version:   2.1.2
! Release:   2.1.2
Group:     Utilities/Console
diff -rc2P source-highlight-2.1.1/src/latex.outlang ...
*** source-highlight-2.1.1/src/latex.outlang ...
--- source-highlight-2.1.2/src/latex.outlang ...
***************
*** 35,37 ****
--- 35,38 ----
"--" "-\\/-"
"---" "-\\/-\\/-"
+ "\"" "\"{}" # avoids problems with some inputenc
end



Without options it generates only the essential difference information, without any additional context lines:

     diff -rP source-highlight-2.1.1/source-highlight.spec ...
9,10c9,10
< Version:   2.1.1
< Release:   2.1.1
---
> Version:   2.1.2
> Release:   2.1.2


Summarizing, we would like to be able to handle all these three different syntaxes; notice that the first format and the second format have something conflicting: the first one uses the --- to indicate the new version of a file while the second format uses it to indicate the old version of a file. Since we want to highlight differently the old parts and the new parts (this is not visible in the Texinfo highlighting due to the lack of enhanced formatting features, but it is visible for instance in HTML output where we use two different colors), this behavior adds some difficulties. Of course, we could define three different language definitions, one for each diff output format. However, we prefer to handle them all in the same file!

This is the language definition for diff files:

     # language definition for files created with 'diff'

# diff created with -u option
state oldfile = '(?=^[-]{3})' begin
oldfile start '^[-]{3}'
oldfile start '^[-]'
newfile start '^[+]'
difflines start '^@@'
end

# diff created with -c option
state oldfile = '(?=^[*]{3})' begin
environment oldfile = '^[*]{3}[[:blank:]]+[[:digit:]]' begin
normal start '^[[:space:]]'
newfile = '(?=^[-]{3})' exit
end
oldfile start '^[*]{3}'

environment newfile = '^[-]{3}[[:blank:]]+[[:digit:]]' begin
normal start '^[[:space:]]'
newfile = '(?=^[*]{3})' exit
normal start '^diff' exit
end
newfile start '^[-]{3}'
end

# otherwise, created without options
state difflines = '(?=^[[:digit:]])' begin
difflines start '^[[:digit:]]'
oldfile start '^[<]'
newfile start '^[>]'
end



Since we can safely assume that a diff file contains only information created with the same diff command line switch, we define three different states that correspond to the three diff output formats. Notice that these states are entered with a simple definition; as noted in State/Environment Definitions, this means that no automatic exit is provided, and since no explicit exit condition is specified, once one of these states is entered it will never be exited. This is consistent with our goal. Of course, the expression that makes us enter a state must be defined correctly; in particular, we first search for an initial --- sequence, since this is the first difference specification produced by the -u|--unified option, and thus a distinguishing feature for inferring which diff format we are processing.

Another interesting point is that we use lookahead assertions (see Notes on regular expressions) to enter the states, since at that point we only want to find out which file format we are processing. Once we have entered the right state, we can define the regular expressions for the elements of the specific diff file format.

For the files created with the option -c|--context we define two inner environments, one for the new file part and one for the old file part (these are delimited by a --- or *** and line number information). Notice that these are environments, so anything that is not matched by any expression is formatted according to the style of the element that defines the environment. Thus, we provide an expression for text that must be formatted as normal. For diff files this corresponds to a line that starts with a space or with diff (take a look at the examples above). In particular, the latter case can take place only during the new file part. In both environments we must define the exit conditions, which correspond to the beginning of the complementary part; here too we use lookahead assertions, since we use them only to exit the environment. The outer definitions for oldfile and newfile are used to match the lines with source file information.

The third state, corresponding to the normal diff output format, should be straightforward by now.
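
To make the format detection concrete, here is a small Python sketch of the same idea: the first line that starts with one of the three distinguishing sequences decides the format, and the decision is never revisited. It is only an illustration under simplifying assumptions; the function name detect_diff_format and the format labels are hypothetical, and the regular expressions are simplified Python counterparts of the ones that enter the three states.

```python
import re

# One (format, regex) pair per state in the diff language definition,
# in the same order. Each regex only peeks at the start of a line.
FORMAT_MARKERS = [
    ("unified", re.compile(r"^-{3}")),   # diff -u: first marker is ---
    ("context", re.compile(r"^\*{3}")),  # diff -c: first marker is ***
    ("normal", re.compile(r"^[0-9]")),   # plain diff: e.g. 9,10c9,10
]


def detect_diff_format(text):
    """Return the format named by the first line that matches a marker, or None."""
    for line in text.splitlines():
        for name, marker in FORMAT_MARKERS:
            if marker.match(line):
                return name
    return None


unified = "diff -ruP a b\n--- a\n+++ b\n@@ -6,8 +6,8 @@\n"
context = "diff -rc2P a b\n*** a\n--- b\n***************\n"
normal = "diff -rP a b\n9,10c9,10\n< old\n---\n> new\n"

print(detect_diff_format(unified))
print(detect_diff_format(context))
print(detect_diff_format(normal))
```

Note that the --- lines in the context and normal samples do not cause a wrong answer, because in those formats a *** line or a line starting with a digit always comes first.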

#### 7.13.3 Pseudo semantic analysis

Source-highlight, by means of regular expressions, can only perform a lexical analysis of the input source. In particular, it is based on the assumption that the input source is syntactically correct with respect to the input language. However, by using the language definition syntax and by writing the right regular expressions, it is possible to simulate some sort of semantic analysis of the input source.

For instance, consider the following C (or C++) source file:

     // test special #if 0 treatment

int main() {
#if 0 // equivalent to a comment
int i = 10;
printf("this should never be executed\n");
return 1;
#else
printf("Hello world!\n");
return 0;
#endif

printf("never reach here!\n");
}



It is easy to verify that the code between #if 0 and #else will never be executed (indeed, it will not even be compiled). Thus, we might want to format it as a comment.

We then write another language definition file, based on the file cpp.lang:

     environment comment start '^[[:blank:]]*#if[[:blank:]]+0' begin
comment start '^[[:blank:]]*#(else|endif)' exit
end

include "cpp.lang"



We intentionally included an error in this first version: we used the start element to start the environment, but such an element has the scope of a single line; thus, it does not have the desired behavior:

     // test special #if 0 treatment

int main() {
#if 0 // equivalent to a comment
int i = 10;
printf("this should never be executed\n");
return 1;
#else
printf("Hello world!\n");
return 0;
#endif

printf("never reach here!\n");
}



A better solution is the following one:

     environment comment = '^[[:blank:]]*#[[:blank:]]*if[[:blank:]]+0' begin
comment start '^[[:blank:]]*#[[:blank:]]*(else|endif)' exit
end

include "cpp.lang"



Here we enter the comment environment not by using a delimited element, but simply with the regular expression that matches #if 0. Then we exit the environment when we match either an #else or an #endif. This seems to work:

     // test special #if 0 treatment

int main() {
#if 0 // equivalent to a comment
int i = 10;
printf("this should never be executed\n");
return 1;
#else
printf("Hello world!\n");
return 0;
#endif

printf("never reach here!\n");
}



However, it does not work if we consider nested #if...#else; for instance consider the following code, formatted with the previous language definition:

     // test special #if 0 treatment

int main() {
#if 0 // equivalent to a comment
int i = 10;
printf("this should never be executed\n");
#  ifdef FOO
printf("foo\n");
#     ifndef BAR
printf("no bar\n");
#     else
#     endif
#  else
printf("no foo\n");
#  endif // FOO
return 1;
#else
printf("Hello world!\n");
return 0;
#endif

printf("never reach here!\n");
}



The problem is that the previous language definition does not consider nested #if and thus, the first time it matches a #else or an #endif it exits the comment environment.

We must then take into account possible nested occurrences. This can be done by using a delimited element with the nested option (Delimited definitions):

     # treat the preprocess statement
#  #if 0
#    ...
#  #else
# as a comment

environment comment = '^[[:blank:]]*#[[:blank:]]*if[[:blank:]]+0' begin
comment start '^[[:blank:]]*#[[:blank:]]*else' exit
comment delim '^[[:blank:]]*#[[:blank:]]*if'
'^[[:blank:]]*#[[:blank:]]*endif' multiline nested

end

include "cpp.lang"



This time the right block of code is correctly formatted as a comment:

     // test special #if 0 treatment

int main() {
#if 0 // equivalent to a comment
int i = 10;
printf("this should never be executed\n");
#  ifdef FOO
printf("foo\n");
#     ifndef BAR
printf("no bar\n");
#     else
#     endif
#  else
printf("no foo\n");
#  endif // FOO
return 1;
#else
printf("Hello world!\n");
return 0;
#endif

printf("never reach here!\n");
}



Notice that it is crucial to exit the environment even when we match an #else (not only an #endif), since, this way, we can match another #if 0 again; consider, for instance, the following code:

     // test special #if 0 treatment

int main() {
#if 0 // equivalent to a comment
int i = 10;
printf("this should never be executed\n");
return 1;
#else
printf("Hello world!\n");
#   if 0 // another one
return 1;
#   else
return 0;
#   endif
#endif

printf("never reach here!\n");
}
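
The nesting logic discussed in this section can also be written as a short Python sketch with an explicit depth counter. This is not how source-highlight works internally, and it is not a literal translation of the language definition above: the function name comment_lines and the sample are hypothetical, and, as an assumption of this sketch, a top-level #endif also ends the commented part.

```python
import re

IF_ZERO = re.compile(r"^[ \t]*#[ \t]*if[ \t]+0\b")
ANY_IF = re.compile(r"^[ \t]*#[ \t]*if")  # also matches #ifdef and #ifndef
ELSE = re.compile(r"^[ \t]*#[ \t]*else")
ENDIF = re.compile(r"^[ \t]*#[ \t]*endif")


def comment_lines(source):
    """Return the 1-based numbers of the lines to be formatted as comments."""
    commented = []
    in_comment = False
    depth = 0  # number of nested #if blocks opened inside the "#if 0" part
    for number, line in enumerate(source.splitlines(), start=1):
        if not in_comment:
            if IF_ZERO.match(line):
                in_comment = True
                depth = 0
                commented.append(number)
            continue
        commented.append(number)
        if ANY_IF.match(line):
            depth += 1
        elif ENDIF.match(line):
            if depth == 0:
                in_comment = False  # assumption: a top-level #endif ends the part
            else:
                depth -= 1
        elif ELSE.match(line) and depth == 0:
            in_comment = False  # exit on #else, so a later "#if 0" can match again
    return commented


SAMPLE = """\
int main() {
#if 0 // equivalent to a comment
  int i = 10;
#  ifdef FOO
  i = 20;
#  else
  i = 30;
#  endif
  return 1;
#else
#   if 0 // another one
  return 2;
#   else
  return 0;
#   endif
#endif
}
"""

print(comment_lines(SAMPLE))
```

In the sample, the nested #else on line 6 does not end the commented part because depth is 1 at that point, while the #else on line 10 does; this is what allows the second #if 0 on line 11 to be recognized.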



## 8 Output Language Definitions

Since version 2.1 source-highlight uses a specific syntax to specify output formats (e.g., how to format in HTML, LaTeX, etc.). Before version 2.1, in order to add a new output format, many C++ classes had to be written. This had the drawback that a new output format could not be added “dynamically”: you had to recompile the whole source-highlight program.

Instead, now, an output format is specified in a file, loaded dynamically, through a (hopefully) simple syntax. Then, these definitions are used internally to create, on-the-fly, text formatters.

Here, we see this syntax in detail, by relying on many examples. This allows a user to easily modify an existing output format definition and to create a new one. These files typically have the extension .outlang.

Each definition basically associates a text style (e.g., bold, italics, colors) with the representation of that style in the output format (e.g., <b>$text</b> in HTML). The representation is given in " and you can use the classic escape character \ to use " inside the definition. If you want to specify a character by its ASCII code, write the numeric code in hexadecimal notation preceded by \x; for an example, see Style template.

If no definition is given for a specific style, e.g., bold, then when that style is requested during formatting the text is left as it is, i.e., a style without a definition is simply ignored.

Comments start with #; the rest of the line is considered a comment. Files can be included in the same way as for language definitions (see File inclusion). In any case, if a definition for a style is given more than once, the last definition replaces all the others.

### 8.1 File extension

With the line:

     extension "<file extension>"

you define the default file extension (without the .) used to generate files formatted according to this output format. This is used when no output file name is specified; if the file extension is not defined in the .outlang file and no output file name is specified, an error will occur. For instance, this is used in html_common.outlang:

     extension "html"

### 8.2 Text styles

These are the text styles that one can define:

     bold
     italics
     underline
     notfixed
     fixed

These, of course, correspond to the ones used to specify the output format style (see Output format style). These definitions, for instance, are from the HTML format definition:

     bold "<b>$text</b>"
     italics "<i>$text</i>"
     underline "<u>$text</u>"


Inside a definition you use the special variable $text to specify where the actual text to be formatted has to be inserted. For instance, the definition of bold above says that if you need to format the keyword class in bold in HTML, the following text will be generated: <b>class</b>. This variable is also used when mixing more than one style recursively. In particular, if you want to format in bold and italics (i.e., first bold and then italics, or, in other words, the sequence i, b is used in the output format style file, see Output format style), then first the text class is substituted for $text in <b>$text</b> and then the text <b>class</b> is substituted for $text in <i>$text</i>, thus obtaining <i><b>class</b></i>.

### 8.3 Colors

The definition for using colors during formatting requires the definition for the color style:

     color "..."

For instance, for HTML we have:

     color "<font color=\"$style\">$text</font>"

Apart from the variable $text that we already saw, we also have the variable $style, which will be replaced with the actual color. Source-highlight recognizes a number of color constants, see Output format style. You must then associate each color constant with a color representation in the output format, through the colormap definition:

     colormap
     "color constant" "color representation"
     "color constant" "color representation"
     ...
     default "default color representation"
     end

The default row (notice the absence of ") defines the color to be used in case a color constant is used during formatting, but it is not defined in the output format.

For instance, for HTML we have:

     colormap
     "green" "#33CC00"
     "red" "#FF0000"
     "darkred" "#990000"
     "blue" "#0000FF"
     "brown" "#9A1900"
     "pink" "#CC33CC"
     "yellow" "#FFCC00"
     "cyan" "#66FFFF"
     "purple" "#993399"
     "orange" "#FF6600"
     "brightorange" "#FF9900"
     "brightgreen" "#33FF33"
     "darkgreen" "#009900"
     "black" "#000000"
     "teal" "#008080"
     "gray" "#808080"
     "darkblue" "#000080"
     default "#000000"
     end

If your output format does not handle colors, you can simply omit the definitions of color and colormap, and Source-highlight will ignore colors.

The color is applied before the other styles, e.g., bold, italics, etc. Thus, continuing the example of the previous section, suppose you defined the following output style for keywords:

     keyword blue i, b;

then the text class will be substituted for $text and the value #0000FF for $style in the color definition <font color="$style">$text</font>, obtaining <font color="#0000FF">class</font>, which will then be substituted for $text in <b>$text</b>, and so on for italics, finally obtaining <i><b><font color="#0000FF">class</font></b></i>.

### 8.4 Anchors and References

When using the command line option --line-number-ref (Invoking source-highlight), an anchor is generated in the output file for each line number. The style of the anchor is defined by the definition anchor. If this is not defined, the option --line-number-ref has no effect. The $linenum variable will be replaced with the line number, and the $text variable with the actual text. For instance, for HTML we have

     anchor "<a name=\"$linenum\">$text</a>"

Since version 2.2 source-highlight can also generate references to several elements (e.g., variables, class definitions, etc.), see Generating References. Also in this case the definition anchor is used; furthermore, the definition of reference is required.

In the definitions of anchor and reference, apart from the variable $linenum, we also have the variables $infile (the name of the original input file) and $infilename (the name of the original input file without the path), and in the definition of reference we also have the variable $outfile (the name of the file where the anchor is). One can decide how to define an anchor and a reference by using these variables. For instance, for HTML we have

     reference "<a href=\"$outfile#$linenum\">$text</a>"


Notice that in this case we use $outfile, since we actually generate a link to another (or possibly the same) output file. On the contrary, for LaTeX, since we do not generate a “clickable” reference, we refer to the original input file (we use both $infilename and $linenum in both definitions of anchor and reference):

     anchor "\label{$infilename:$linenum}$text"
     reference "{\hfill $text$\rightarrow$$infile:$linenum, \ page~\pageref{$infilename:$linenum}}"

In particular, we use $infilename for generating the \label, and not $infile, because the path symbol would “disturb” LaTeX (while we use the complete file path in the textual information of the reference). This will generate a right-aligned reference. Notice that it is assumed that when generating references in LaTeX one uses --gen-references=postline or --gen-references=postdoc and not --gen-references=inline (Generating References), since it makes no sense to generate an inline reference (or at least I would not know how to generate a nice looking one :-).

Furthermore, for Texinfo:

     anchor "@anchor{$infilename:$linenum}$text"
     reference "@flushright
     @xref{$infilename:$linenum,$text,$text $infile:$linenum}.
     @end flushright"


Notice that using both $infilename (and not $infile, for the same reasons) and $linenum in the definition of anchor also helps ensure that there are no duplicate anchors; this is done for LaTeX and Texinfo but not for HTML, because it is assumed that the generated .tex and .texinfo files are included directly in a master file, as is done in this manual (while, for instance, it is assumed that a separate HTML file is generated for each source and kept separate). If this is not your case, you can change the definitions of anchor and reference as you see fit. Some examples of outputs with references in Texinfo are shown in Examples.

Indeed, one can use three more definitions for reference, which correspond to the three arguments that can be passed to the --gen-references command line option (Generating References): inline_reference, postline_reference and postdoc_reference. If one of these is not defined, then the definition of reference is used instead. Having the possibility of specifying different definitions is useful, for instance, in the case of HTML: the style that suits an inline reference is pretty ugly when used for a postline or postdoc reference:

     postline_reference "<a href=\"$outfile#$linenum\">$text -> $infile:$linenum</a>"
     postdoc_reference "<a href=\"$outfile#$linenum\">$text -> $infile:$linenum</a>"
     reference "<a href=\"$outfile#$linenum\">$text</a>"
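
The variable substitution and the fallback to reference described in this section can be mimicked with a few lines of Python. This is just a sketch that uses Python's string.Template as a stand-in for the substitution performed by source-highlight; the expand function and the file names a.cpp.html and src/a.cpp are hypothetical, and inline_reference is deliberately left undefined to show the fallback.

```python
from string import Template

# Definitions taken from the HTML examples in this section.
DEFINITIONS = {
    "anchor": '<a name="$linenum">$text</a>',
    "reference": '<a href="$outfile#$linenum">$text</a>',
    "postline_reference": '<a href="$outfile#$linenum">$text -> $infile:$linenum</a>',
    "postdoc_reference": '<a href="$outfile#$linenum">$text -> $infile:$linenum</a>',
}


def expand(name, **variables):
    """Fill in the definition called name.

    The *_reference variants fall back to reference when they are not
    defined; with no usable definition the text is returned unchanged.
    """
    template = DEFINITIONS.get(name)
    if template is None and name.endswith("_reference"):
        template = DEFINITIONS.get("reference")
    if template is None:
        return variables["text"]
    return Template(template).substitute(variables)


values = dict(outfile="a.cpp.html", infile="src/a.cpp", linenum=12, text="foo")

print(expand("anchor", linenum=12, text="int foo;"))
print(expand("inline_reference", **values))    # falls back to reference
print(expand("postline_reference", **values))
```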


### 8.5 One style

If the output format you are defining does not have specific styles for bold, italics, etc., and for colors, you can simply use the definition onestyle, in which you can use both $style and $text. This will be used for any style (indeed, any other definition, such as bold, italics, or color, will be ignored). In this case, it is assumed that the style of each source element is defined in a file with its own syntax, i.e., not with a syntax defined by Source-highlight. (This is the case, for instance, of HTML using CSS style sheets.) Moreover, since the output format style is not used, during formatting the variable $style will be replaced with the name of the element to highlight (e.g., keyword, comment, etc.). For instance, for HTML CSS, we simply have:

     onestyle "<span class=\"$style\">$text</span>"

In fact, HTML CSS relies on style definitions provided in a separate file (the .css file). Thus, when formatting a keyword, e.g., abstract, we will obtain:

     <span class="keyword">abstract</span>

Of course, the style for keyword must be defined in the .css file.

### 8.6 Style template

Some output formats are based on a single template into which the other styles are composed; during composition the styles can be separated with a specific separator:

     styletemplate "..."
     styleseparator "..."

This is used, for instance, for the ANSI color escape sequence output format (esc.outlang):

     styletemplate "\x1b[$stylem$text\x1b[m"
     styleseparator ";"
     bold "01$style"
     underline "04$style"
     italics "$style"
     color "$style"

Notice that, since more than one style can be mixed into the style template, bold, underline, ... explicitly use the variable $style.


### 8.7 Line prefix

This feature allows you to generate a string as the prefix of each generated line that corresponds to an input line (i.e., this prefix is not generated for other generated output elements, e.g., the lines in the header, footer, etc.).

We use this feature in the LaTeX output (LaTeX output):

     lineprefix "\mbox{}"


This way each line in the LaTeX output is prefixed with \mbox{} (see footnote 16).

Another interesting example that uses lineprefix is the javadoc output, see Generating HTML output.


### 8.8 String translation

Some character sequences that are in the source file may have a special meaning in an output format, so they need some preprocessing (e.g., escaping them). You can specify the translation table with:

     translations
"original sequence" "transformed sequence"
'regex' "transformed sequence"
...
end


The difference between "original sequence" and 'regex' (see footnote 17) is that with the former you specify a character sequence that will be matched literally, apart from special characters such as \ (which must be escaped if you need to insert it), \n (new line) and \t (tab character). With the latter, instead, you can specify a regular expression (this is basically the same difference as between " and ' in language definitions, see Simple definitions).

For instance, for HTML, we have the following translation table:

     translations
"&" "&amp;"
"<" "&lt;"
">" "&gt;"
end


For LaTeX, the translation table is a little bit bigger; here we show only a little part, that shows how to escape special characters (such as \), to translate a new line character and tab character:

     translations
"<" "$<$"
">" "$>$"
"&" "\\&"
"\\" "\\textbackslash{}"
"\n" " \\\\\n"
" " "\\ "
"\t" "\\ \\ \\ \\ \\ \\ \\ \\ "
end


Notice that, since a new line character must be translated in LaTeX with \\, we have to escape two \ (i.e., \\\\), and then we want to actually insert a new line in the output file (\n).

For HTML with a non-fixed font by default, html_notfixed.outlang (see HTML output), we need to translate a sequence of two spaces (since in HTML adjacent spaces are rendered as a single space, see footnote 18, while we want them as they are), and we also need to translate a space that starts a new line in the source (thus we use the regular expression ^ , enclosed in '); thus we have:

     translations
"\n" "<br>\n"
"  " "&nbsp; "
'^ ' "&nbsp;" # a space at the beginning of a line
"\t" "&nbsp; &nbsp; &nbsp; &nbsp; "
end
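
To see how literal and regular expression entries of a translation table can coexist, consider the following Python sketch. It is only one possible way to apply such a table (a single pass over the text, so that already translated output is never translated again), which is an assumption of this sketch and not a description of source-highlight's internals; build_translator and the table layout are hypothetical.

```python
import re


def build_translator(table):
    """table is a list of (kind, source, replacement); kind is "literal" or "regex"."""
    parts = []
    replacements = {}
    for index, (kind, source, replacement) in enumerate(table):
        # Literal entries are escaped; regex entries are used as they are.
        pattern = re.escape(source) if kind == "literal" else source
        name = "t%d" % index
        parts.append("(?P<%s>%s)" % (name, pattern))
        replacements[name] = replacement
    combined = re.compile("|".join(parts), re.MULTILINE)

    def translate(text):
        # One pass over the input: replaced text is never translated again.
        return combined.sub(lambda m: replacements[m.lastgroup], text)

    return translate


html = build_translator([
    ("literal", "&", "&amp;"),
    ("literal", "<", "&lt;"),
    ("literal", ">", "&gt;"),
])

html_notfixed = build_translator([
    ("literal", "\n", "<br>\n"),
    ("literal", "  ", "&nbsp; "),
    ("regex", "^ ", "&nbsp;"),  # a space at the beginning of a line
    ("literal", "\t", "&nbsp; &nbsp; &nbsp; &nbsp; "),
])

print(html("a < b && c > d"))
print(html_notfixed(" x  y\n z"))
```

The single pass matters for the HTML table: if the entries were applied one after the other, the & introduced by &lt; could be translated a second time, depending on the order of the entries.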



### 8.9 Document template

You can define the document template, i.e., the beginning and the end of an output file, with

     doctemplate
"...beginning..."
"...end..."
end


For instance, for HTML we have

     doctemplate
"<!-- Generator: $additional -->$header<pre><tt>"
"</tt></pre>$footer
"
end


Notice that in the end part there is an explicit new line.

In the definition of the doctemplate the following variables can be used and will be replaced during the output generation:

$title
the value of the title for the output file (e.g., the one passed with the --title command line option);
$header
the contents of the file specified with the command line option --header;
$footer
the contents of the file specified with the command line option --footer;
$css
the value passed with the command line option --css;
$additional
other additional information. Source-highlight replaces this with its name and its version.
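
As a rough illustration of how such a template is filled in, the following Python sketch uses string.Template, whose $name syntax happens to coincide with the variables above. The function name render_document and the sample values are hypothetical, and leaving unknown variables untouched is an assumption of the sketch rather than documented Source-highlight behavior.

```python
from string import Template


def render_document(begin, end, body, variables):
    """Wrap the highlighted body between the two doctemplate parts."""
    # safe_substitute leaves any variable without a value as it is.
    start_part = Template(begin).safe_substitute(variables)
    end_part = Template(end).safe_substitute(variables)
    return start_part + body + end_part


# The two parts of the HTML doctemplate shown above; the end part
# finishes with the explicit new line.
page = render_document(
    "<!-- Generator: $additional -->$header<pre><tt>",
    "</tt></pre>$footer\n",
    "int x;",
    # Made-up values: no header or footer file, invented generator text.
    {"additional": "GNU Source-highlight 2.5", "header": "", "footer": ""},
)
```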

For instance, for an HTML document with css, (file cssdoc.outlang) we have:

     doctemplate
"<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0//EN\"
\"http://www.w3.org/TR/REC-html40/strict.dtd\">
<html>
<head>
<meta http-equiv=\"Content-Type\"
content=\"text/html; charset=iso-8859-1\">
<meta name=\"GENERATOR\" content=\"$additional\">
<title>$title</title>
<link rel=\"stylesheet\" href=\"$css\" type=\"text/css\">
</head>
<body>$header<pre><tt>"
"</tt></pre>
$footer</body>
</html>
"
end


### 8.10 Generating HTML output

As a complete example we show the file html_common.outlang which contains the common definitions for the various HTML output formats (html.outlang, htmldoc.outlang, etc.):

     extension "html"

bold "<b>$text</b>"
italics "<i>$text</i>"
underline "<u>$text</u>"
color "<font color=\"$style\">$text</font>"
anchor "<a name=\"$linenum\">$text</a>"
postline_reference "<a href=\"$outfile#$linenum\">$text ->$infile:$linenum</a>"
postdoc_reference "<a href=\"$outfile#$linenum\">$text -> $infile:$linenum</a>"
reference "<a href=\"$outfile#$linenum\">\$text</a>"

colormap
"green" "#33CC00"
"red" "#FF0000"
"darkred" "#990000"
"blue" "#0000FF"
"brown" "#9A1900"
"pink" "#CC33CC"
"yellow" "#FFCC00"
"cyan" "#66FFFF"
"purple" "#993399"
"orange" "#FF6600"
"brightorange" "#FF9900"
"brightgreen" "#33FF33"
"darkgreen" "#009900"
"black" "#000000"
"teal" "#008080"
"gray" "#808080"
"darkblue" "#000080"
default "#000000"
end

translations
"&" "&amp;"
"<" "&lt;"
">" "&gt;"
end



Moreover, this file is also used for generating javadoc output:

     include "html_common.outlang"

doctemplate
" * <!-- Generated by Source-highlight -->
* <pre><tt>
"
" * </tt></pre>
"
end

lineprefix " * "

translations
"*/" "&#42;/" # this avoids the */ to be interpreted as
# the end of a comment inside a javadoc comment
end



The javadoc output format is useful for formatting code snippets that have to be included inside a javadoc comment of another Java file[19]. Apart from being formatted nicely in the generated HTML documentation, this also relieves the programmer of escaping specific characters in the code snippet (i.e., &, < and >). Notice that it also prevents the sequence */ from being interpreted as the closing of the (javadoc) comment. For instance, if you write this code:

     /**
* This is an example of usage
*
* <pre><tt>
* System.out.println("*/");
* </tt></pre>
*/


The resulting Java code contains a syntax error, since the */ inside the string literal closes the javadoc comment prematurely. If you use source-highlight to format the code to insert in a javadoc comment, you will avoid these problems.

An example of a javadoc generated HTML page containing a code snippet formatted with source-highlight can be found in the file SimpleClass-doc.html in the documentation directory.
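
The effect of the javadoc definitions above on a snippet can be approximated with the following Python sketch. It only mimics the translations and the lineprefix, without any highlighting, and it simplifies the doctemplate by omitting the generator comment; the function name to_javadoc_snippet is hypothetical.

```python
import re

# Entries taken from the translations blocks shown in this section.
JAVADOC_TRANSLATIONS = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    "*/": "&#42;/",
}

# One pattern for all entries, so the text is scanned only once.
_SPECIAL = re.compile(r"\*/|[&<>]")


def to_javadoc_snippet(code):
    """Escape a code snippet and prefix each line for a javadoc comment."""
    escaped = _SPECIAL.sub(lambda m: JAVADOC_TRANSLATIONS[m.group(0)], code)
    lines = [" * " + line for line in escaped.splitlines()]
    return "\n".join([" * <pre><tt>"] + lines + [" * </tt></pre>"]) + "\n"


snippet = to_javadoc_snippet('System.out.println("*/");')
```

The returned text no longer contains the sequence */, so it can be pasted between /** and */ without closing the comment early.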

## 9 Generating References

Since version 2.2 Source-highlight also produces references to fields, variables, etc. In order to do this it relies on the program Exuberant Ctags, by Darren Hiebert, available at http://ctags.sourceforge.net. Thus, you must install this program if you want Source-highlight to provide this feature.

The ctags program generates an index (or “tag”) file for a variety of language objects found in file(s). This allows these items to be quickly and easily located by a text editor or other utility (as in this case for Source-highlight). A “tag” signifies a language object for which an index entry is available (or, alternatively, the index entry created for that object)[20].

This means that Source-highlight is able to generate references for a specific source language if and only if ctags handles that language. See the ctags command line options --list-maps and --list-languages to find out the associations between file extensions and supported languages.

Reference generation is enabled by using the command line option --gen-references (Invoking source-highlight). This option takes an argument that determines how references will be generated:

inline
a reference pointer will be generated exactly in the same place as the specific element. This is useful in output formats that naturally support links, such as HTML, while it is useless for output formats that do not support inline links, such as LaTeX.
postline
if a line of the input source contains elements for which we found references, the list of references will be generated right after the line (see the examples, Examples).
postdoc
All the references will be generated after the whole input file has been generated.

There is an exception: when an element has more than one reference (because a variable is defined in many sources or because a method is overloaded) then if inline is specified, the generation switches to postline for that occurrence.

When --gen-references is specified, Source-highlight first invokes ctags. The user can customize this call by using the command line option --ctags (Invoking source-highlight). In particular, if you do not want ctags to be invoked by Source-highlight (e.g., because the tags file has already been generated), then --ctags must be passed an empty string, "". In this case, or when the specified ctags command line generates an alternative output tag file (the default generated file is tags), you can specify the exact tag file with the command line option --ctags-file.
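
The combinations described above can be summarized by a small Python helper that only assembles an argument list (it does not run anything). The helper name build_command and all file names are hypothetical; the options themselves are the ones named in this manual.

```python
def build_command(source, output, mode, ctags_command=None, ctags_file=None):
    """Assemble a source-highlight argument list (hypothetical helper)."""
    if mode not in ("inline", "postline", "postdoc"):
        raise ValueError("unknown --gen-references mode: " + mode)
    command = [
        "source-highlight", "-f", "html",
        "-i", source, "-o", output,
        "--gen-references=" + mode,
    ]
    if ctags_command is not None:
        # An empty string asks Source-highlight not to invoke ctags.
        command += ["--ctags", ctags_command]
    if ctags_file is not None:
        # Needed when the tag file is not the default one, tags.
        command += ["--ctags-file", ctags_file]
    return command


# Reuse a tag file generated beforehand (my_tags is a made-up name).
reuse_existing = build_command(
    "test.h", "test.h.html", "postline",
    ctags_command="", ctags_file="my_tags",
)
```

Such a list can be handed to subprocess.run as it is, which also avoids having to quote the empty string for a shell.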

Once the tag file is generated, Source-highlight relies on the library readtags provided by the ctags distribution, and included in the Source-highlight sources.
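
To give an idea of what a tag file contains, the following Python sketch builds an index from tag lines and applies the inline exception described above. It is not how Source-highlight works internally (it uses the readtags C library); it assumes the usual ctags line layout, that is, name, file and address separated by tab characters, with optional extension fields after ;". The sample lines are invented for the illustration, loosely following the test.h file shown in the Examples.

```python
from collections import defaultdict


def load_tags(lines):
    """Map each tag name to the list of (file, address) pairs found."""
    index = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Lines starting with !_TAG_ are metadata, not real tags.
        if not line or line.startswith("!_TAG_"):
            continue
        name, path, rest = line.split("\t", 2)
        # Simplification: everything before ;" is taken as the address.
        address = rest.split(';"', 1)[0]
        index[name].append((path, address))
    return index


def effective_mode(requested, reference_count):
    """inline falls back to postline when an element has many references."""
    if requested == "inline" and reference_count > 1:
        return "postline"
    return requested


# Invented sample using line-number addresses.
SAMPLE = [
    "!_TAG_FILE_SORTED\t1\t/0=unsorted, 1=sorted, 2=foldcase/",
    'generate\ttest.h\t37;"\tf',
    'generate\ttest.h\t38;"\tf',
    'mysum\ttest.h\t29;"\td',
]

tags = load_tags(SAMPLE)
mode_for_generate = effective_mode("inline", len(tags["generate"]))
mode_for_mysum = effective_mode("inline", len(tags["mysum"]))
```

In this sample, generate has two entries because the method is overloaded, so an inline request for it falls back to postline, while mysum keeps inline.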

Notice that if a program element is formatted according to a style that has the option noref (see Output format style), then this element is not considered a tag and no reference is generated. This is the case, for instance, for comment elements: since the comment style is declared with the option noref, no string generated with that style is considered a tag (see Examples).

## 10 Examples

Here we provide some examples of sources formatted with Source-highlight using the -f texinfo command line option. Please keep in mind that the highlighting will not be visible in the Info file, but only in the printed manual and in the HTML output (well, at least line numbers are visible everywhere :-).

The first example is produced by using the command:

     source-highlight -f texinfo -i test.java -o test.java.texinfo -n


and here's the result

     01: /*
02:   This is a classical Hello program
03:   to test source-highlight with Java programs.
04:
05:   to have an html translation type
06:
07:         source-highlight -s java -f html --input Hello.java --output Hello.html
08:         source-highlight -s java -f html < Hello.java > Hello.html
09:
10:   or type source-highlight --help for the list of options
11:
12:   written by
13:   Lorenzo Bettini
14:   http://www.lorenzobettini.it
15:   http://www.gnu.org/software/src-highlite
16: */
17:
18: package hello;
19:
20: import java.io.* ;
21:
22: /**
23:  * <p>
24:  * A simple Hello World class, used to demonstrate some
25:  * features of Java source highlighting.
26:  * </p>
27:  * TODO: nothing, just to show an highlighted TODO or FIXME
28:  *
29:  * @author Lorenzo Bettini
30:  * @version 2.0
31:  */
32: public class Hello {
33:     int foo = 1998 ;
34:     int hex_foo = 0xCAFEBABE;
35:     boolean b = false;
36:     Integer i = null ;
37:     char c = '\'', d = 'n', e = '\\' ;
38:     String xml = "<tag attr=\"value\">&auml;</tag>", foo2 = "\\" ;
39:
40:     public static void main( String args[] ) {
41:         // just some greetings ;-)  /*
42:         System.out.println( "Hello from java2html :-)" ) ;
43:         System.out.println( "\tby Lorenzo Bettini" ) ;
44:         System.out.println( "\thttp://www.lorenzobettini.it" ) ;
45:         if (argc > 0)
46:             String param = argc[0];
47:         //System.out.println( "bye bye... :-D" ) ; // see you soon
48:     }
49: }



The second example shows the use of --gen-references functionality. In particular, the following output is generated with the command:

     source-highlight -f texinfo -i test.h -o test_ref.h.texinfo -n \
--gen-references=postline


and here's the result (notice how the comment line containing the string mysum does not contain references: it is a comment element, and this element has the option noref in texinfo.style, see Output format style; the same holds for the _TEXTGEN_H comment in the last comment line):

     01: /*
02: ** Copyright (C) 1999, 2000, 2001 Lorenzo Bettini
03: **
04: ** This program is free software; you can redistribute it and/or modify
05: ** it under the terms of the GNU General Public License as published by
06: ** the Free Software Foundation; either version 2 of the License, or
07: ** (at your option) any later version.
08: **
09: ** This program is distributed in the hope that it will be useful,
10: ** but WITHOUT ANY WARRANTY; without even the implied warranty of
11: ** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
12: ** GNU General Public License for more details.
13: **
14: ** You should have received a copy of the GNU General Public License
15: ** along with this program; if not, write to the Free Software
16: ** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
17: **
18: */
19:
20: // this file also contains the definition of mysum as a #define
21:
22: // textgenerator.h : Text Generator class &&
23:
24: #ifndef _TEXTGEN_H
See _TEXTGEN_H.

25: #define _TEXTGEN_H
26:
27: #define foo(x) (x + 1)
28:
29: #define mysum myfunbody
30:
31: #include <iostream.h> // for cerr
32:
33: #include "genfun.h" /* for generating functions */
34:
35: class TextGenerator {
36:   public :
37:     virtual void generate( const char *s ) const { (*sout) << s ; }
38:     virtual void generate( const char *s, int start, int end ) const
39:       {
40:         for ( int i = start ; i <= end ; ++i )
41:           (*sout) << s[i] ;
42:         return a<p->b ? a : 3;
43:       }
44:     virtual void generateln( const char *s ) const
45:         {
46:             generate( s ) ;
See generate.

See generate.

47:             (*sout) << endl ;
48:         }
49:     virtual void generateEntire( const char *s ) const
50:         {
51:             startTextGeneration() ;
See startTextGeneration.

See startTextGeneration.

52:             generate(s) ;
See generate.

See generate.

53:             endTextGeneration() ;
See endTextGeneration.

See endTextGeneration.

54:         }
55:     virtual void startTextGeneration() const {}
56:     virtual void endTextGeneration() const {}
57:     virtual void beginText( const char *s ) const
58:         {
59:             startTextGeneration() ;
See startTextGeneration.

See startTextGeneration.

60:             if ( s )
61:                 generate( s ) ;
See generate.

See generate.

62:         }
63:     virtual void endText( const char *s ) const
64:         {
65:             if ( s )
66:                 generate( s ) ;
See generate.

See generate.

67:             endTextGeneration() ;
See endTextGeneration.

See endTextGeneration.

68:         }
69: } ;
70:
71: // Decorator
72: class TextDecorator : public TextGenerator {
See TextGenerator.

73:   protected :
74:     TextGenerator *decorated ;
See TextGenerator.

75:
76:   public :
77:     TextDecorator( TextGenerator *t ) : decorated( t ) {}
See TextGenerator.

See decorated.

78:
79:     virtual void startTextGeneration() const
80:     {
81:         startDecorate() ;
82:         if ( decorated )
See decorated.

83:             decorated->startTextGeneration() ;
See startTextGeneration.

See decorated.

See startTextGeneration.

84:     }
85:     virtual void endTextGeneration() const
86:     {
87:         if ( decorated )
See decorated.

88:             decorated->endTextGeneration() ;
See endTextGeneration.

See decorated.

See endTextGeneration.

89:         endDecorate() ;
90:         mysum;
See mysum.

91:     }
92:
93:     // pure virtual functions
94:     virtual void startDecorate() const = 0 ;
95:     virtual void endDecorate() const = 0 ;
96: } ;
97:
98: #endif // _TEXTGEN_H



## 11 Reporting Bugs

If you find a bug in source-highlight, please send electronic mail to

bug-source-highlight at gnu dot org

Include the version number, which you can find by running 'source-highlight --version'. Also include in your message the output that the program produced and the output you expected.

If you have other questions, comments or suggestions about source-highlight, contact the author via electronic mail (find the address at http://www.lorenzobettini.it). The author will try to help you out, although he may not have time to fix your problems.

## 12 Mailing Lists

The following mailing lists are available:

help-source-highlight at gnu dot org

for generic discussions about the program and for asking for help about it (open mailing list), http://mail.gnu.org/mailman/listinfo/help-source-highlight

info-source-highlight at gnu dot org

for receiving information about new releases and features (read-only mailing list), http://mail.gnu.org/mailman/listinfo/info-source-highlight.

If you want to subscribe to a mailing list just go to the URL and follow the instructions, or send me an e-mail and I'll subscribe you.

#### Footnotes

[1] Although this might have been achieved with previous versions, it is an officially supported feature since version 2.5.

[2] Command lines that are too long are split into multiple indented lines separated by a \. Of course these commands are to be given in one line only, anyway.

[3] Command lines that are too long are split into multiple indented lines separated by a \. Of course these commands are to be given in one line only, anyway.

[4] Before version 2.1, this file was called tags.j2h which used to be a very obscure name. I hope this name convention is a better one :-).

[5] Before version 2.1, this command line option was called --tags-file which used to be a very obscure name. I hope this name convention is a better one :-).

[6] You can see these colors in HTML in the file colors.html.

[7] Notice that, since version 2.2, you must use double quotes.

[8] This is the main difference introduced in version 2.0 with respect to the previous version.

[9] This is the main difference introduced in version 2.1 with respect to the previous version.

[10] As explained before, Source-highlight was originally conceived mainly for generating HTML output; this is why the term css is used for style sheets.

[11] At least, to the best of my knowledge :-)

[13] Since version 2.4.

[14] Please notice that this concept of state is different from the concept of “state” of an automaton.

[15] As a future extension we might think of providing a way, in the language definition syntax, to define a state/environment that extends the outer contexts instead of overriding them.

[16] This is a sort of trick to insert spaces at the beginning of a line without using a tabular environment; without the leading \mbox{} these spaces would be ignored. This is the only way I found to achieve this, if you have suggestions, please let me know!

[17] Since version 2.4.

[18] Unless they are inside a <tt>...</tt>.

[19] Although I haven't tested it, I think this will work also for Doxygen comments.

[20] This description is taken from the ctags man page.