Information for GNU grep developers


1  Generic GNU information

A good start is to read the “GNU coding standards” and the “Information for maintainers of GNU software” documents.

2  Mailing lists

GNU grep's mailing lists are hosted on lists.gnu.org.

2.1  The bug-grep mailing list

To report bugs, suggest features, ask questions, or help in the development of GNU grep, please consider joining the bug-grep mailing list. Bug fixes and patches are better posted using the Savannah tools described below, rather than attaching them in email messages sent to this mailing list. To subscribe to this mailing list, send an email message to bug-grep-request@gnu.org with "subscribe" (without the quotation marks) in the subject header field (or in the body) of the email message, or visit the web page of the mailing list. Its archives are also available.

The list messages can be filtered by matching the following header field:

X-BeenThere: bug-grep@gnu.org

The list also automatically receives messages from the Savannah trackers that can be filtered by matching the following additional header fields:

X-Savane-Project: grep
X-Savane-Tracker: bugs

or:

X-Savane-Project: grep
X-Savane-Tracker: patch

or:

X-Savane-Project: grep
X-Savane-Tracker: support
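
As a minimal sketch of such a filter, the following Python function sorts a raw message into one of a few labels using only the header fields listed above. The function name classify and the returned labels are illustrative choices, not part of any GNU or Savannah tool.

```python
from email import message_from_string

def classify(raw_message):
    """Return a label for a message based on the bug-grep header fields."""
    msg = message_from_string(raw_message)
    # Only messages delivered through the bug-grep list are of interest here.
    if (msg.get("X-BeenThere") or "").strip() != "bug-grep@gnu.org":
        return "other"
    # Tracker notifications carry both X-Savane-* header fields.
    if (msg.get("X-Savane-Project") or "").strip() == "grep":
        tracker = (msg.get("X-Savane-Tracker") or "").strip()
        if tracker in ("bugs", "patch", "support"):
            return "savannah-" + tracker
    return "list"

sample = (
    "X-BeenThere: bug-grep@gnu.org\n"
    "X-Savane-Project: grep\n"
    "X-Savane-Tracker: patch\n"
    "Subject: example\n"
    "\n"
    "body\n"
)
print(classify(sample))
```

The same checks translate directly into the rule syntax of most mail filters; the Python form is used here only because it is easy to run and test.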

2.2  The grep-commit mailing list

To follow development more closely, there is also the grep-commit mailing list. More details about what email messages are sent there can be found in the CVS repository section below. This is a read-only mailing list; subscribers cannot post directly to it. To subscribe to this mailing list, send an email message to grep-commit-request@gnu.org with "subscribe" (without the quotation marks) in the subject header field (or in the body) of the email message, or visit the web page of the mailing list. Its archives are also available.

The list messages can be filtered by matching the following header field:

X-BeenThere: grep-commit@gnu.org

2.3  Other deprecated mailing lists

Older GNU grep releases directed users to the bug-gnu-utils mailing list. As a consequence, some still post their bug reports and questions there. For this reason, it is a good idea for GNU grep developers to monitor this mailing list and follow up on related threads started there by redirecting them to the bug-grep mailing list. New threads about GNU grep must not be intentionally started there. To subscribe to this mailing list, send an email message to bug-gnu-utils-request@gnu.org with "subscribe" (without the quotation marks) in the subject header field (or in the body) of the email message, or visit the web page of the mailing list. Its archives are also available.

The list messages can be filtered by matching the following header field:

X-BeenThere: bug-gnu-utils@gnu.org

3  Project page on Savannah

The Savannah project page for GNU grep features a bug report area, a patch submission area, and other development-related tools.

If you wish to post bug reports or patches on Savannah, please create an account there and log in before posting, so that other developers know who you are and can follow up on your posting with that in mind.

The Free Software Foundation (FSF) requires that you sign copyright assignment papers before contributing significant changes to GNU grep. If you have not already done so and are not willing or able to, it may be better to just describe bugs or proposed features rather than post actual code (or documentation), as your contribution would then have to be rewritten anyway.

Please keep these areas clean by posting only information that is directly related to the bug or patch at hand. Ask basic questions on the bug-grep mailing list.

The identity of the current maintainers is also available there.

4  CVS repository

Generic instructions can be found on GNU grep's Savannah web page about CVS.

4.1  Source code

The contents of GNU grep's source code are stored in the following CVS repository:

CVS_RSH=ssh cvs -z3 -d:ext:anoncvs@savannah.gnu.org:/cvsroot/grep co grep

This repository is also available from its web interface.

Each time a commit is made to this tree, a message is sent to the grep-commit mailing list which can be filtered by matching the following header field:

To: grep-cvs-logs@gnu.org

Additionally, each time a file is modified in this tree, a message is sent to the grep-commit mailing list which can be filtered by matching the following header field:

To: grep-cvs-diffs@gnu.org

Daily snapshots of GNU grep's source code CVS repository are made available by Tony Abou-Assaleh. They have the advantage of containing the files generated by the GNU autotools (files which are not stored in CVS), just as a regular release would.

4.2  Web site

The contents of GNU grep's web site at http://www.gnu.org/software/grep/ are automatically extracted from the following CVS repository:

CVS_RSH=ssh cvs -z3 -d:ext:anoncvs@savannah.gnu.org:/webcvs/grep co grep

This repository is also available from its web interface.

Each time a commit is made to this tree, a message is sent to the grep-commit mailing list which can be filtered by matching the following header field:

To: grep-webcvs-logs@gnu.org

Additionally, each time a file is modified in this tree, a message is sent to the grep-commit mailing list which can be filtered by matching the following header field:

To: grep-webcvs-diffs@gnu.org

(The grep-commit mailing list functionality for this tree should now work thanks to Savannah sr #103962.)
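
The four To: addresses used for the source and web trees can be told apart with a small lookup table. This is only a sketch: the name commit_kind and the (tree, kind) labels are made up for illustration, and only the addresses themselves come from this page.

```python
from email import message_from_string
from email.utils import getaddresses

# Address -> (tree, kind), using the addresses listed in sections 4.1 and 4.2.
COMMIT_ADDRESSES = {
    "grep-cvs-logs@gnu.org": ("source", "log"),
    "grep-cvs-diffs@gnu.org": ("source", "diff"),
    "grep-webcvs-logs@gnu.org": ("web", "log"),
    "grep-webcvs-diffs@gnu.org": ("web", "diff"),
}

def commit_kind(raw_message):
    """Return (tree, kind) for a grep-commit message, or None if unrecognized."""
    msg = message_from_string(raw_message)
    # A To: field may hold several addresses; check each of them.
    for _name, address in getaddresses(msg.get_all("To", [])):
        hit = COMMIT_ADDRESSES.get(address.lower())
        if hit is not None:
            return hit
    return None

print(commit_kind("To: grep-webcvs-diffs@gnu.org\n\nexample body\n"))
```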

4.3  Tools

Information about CVS itself is available from its web site. Information about SSH is available from the OpenSSH web site or from the lsh web site.

Developers with write access to the CVS trees will need to create an account on Savannah and upload their SSH public identity information there.

People who can't access a CVS repository through its usual interface (for example, because they sit behind a restrictive firewall) can download individual files from the repository's web interface, when one is available. This process can be automated with a client program such as CVSGrab.

5  Roadmap

The latest stable release of GNU grep is "2.5.1a".

The current roadmap for GNU grep has been laid out in a 2005-03-08 post by Stepan Kasal on the bug-grep mailing list entitled “Plan for grep”:

2.5.2
=====
Our main goal for grep 2.5.2 is to get sane performance with utf-8.
That can be achieved by the patches written by Tim Waugh for Red Hat.

Besides that, I can do some changes in the infrastructure, so that
I can "breathe":

1) rewrite the configure.in script, perhaps also Makefile.am
2) set up for gnulib-tool --import
3) improve the test infrastructure

I'm afraid I have to do 1) myself, and it is closely tied with 2),
so they probably have to be done together.

If someone likes awk and wanted to help with 3), it could help.
In short, there should be only one awk script for .test-->.script
rule.  The header of each .test file should state some details,
like which command to run, e.g. "grep -E".  We also have to invent
a way to collect the test cases for non-C locales; either by
running the whole set twice, or by creating a separate .test files.
The "make check" goal should run this, if the computer has a locale
like en_US.utf8 installed.

After completing these, we can:
4) check in the patches for the sync of dfa.c with GNU awk
5) other small patches which wait for a test case
6) process the RedHat patches

After 6), I should repeat Tim's measurements and see whether the utf8
performance improved.

Independently, I'd like to see
7) some _minimal_ cleanup of the grep(), grepdir(), recursion
   (the "main loop") and fix --directories=read
8) mark the -P option clearly as "experimental";

Well, that'll be perhaps enough for a release.

2.5.3
=====
Fix the combinations:
 * -i -o
 * --colour -i
 * -o -b
 * -o and zero-width matches
Go through the bug list in my mailbox and fix fixable.
Fix bugs reported with 2.5.2.

2.6.x
=====
The following should go here:
 - upgrade to current regex.c from glibc,
 - new functionality,
 - fixes for -P,
 - heavy refactoring.

6  Release procedure

A number of tasks must be performed before every release.

6.1  Source code compatibility with GNU awk

Drop dfa.[ch] into a copy of gawk and run “make check”.

6.2  Internationalization (i18n) and localization (l10n)

The grep.pot file must be sent to the Translation Project to get fresh po files.

The ABOUT-NLS file must be updated by getting a fresh copy from GNU gettext's CVS with

cvs -d:pserver:anoncvs@sources.redhat.com:/cvs/gettext co gettext/gettext-runtime/ABOUT-NLS

with password “anonymous” or the following line in $HOME/.cvspass:

/1 :pserver:anoncvs@sources.redhat.com:2401/cvs/gettext Ay=0=a%0bZ

(Shouldn't this be automated by “make dist” instead of keeping a redundant copy in GNU grep's CVS?)

6.3  Significant new features

The NEWS file must be updated to document significant new features in GNU grep.

6.4  Known limitations and failures

Some regression tests may be known to fail for the impending release. Such tests should either state in their output that the failure is known, expected, and to be ignored, or simply be disabled in the release (but kept enabled in CVS afterwards). This limits the number of redundant bug reports.

7  To do

The source code for GNU grep includes a TODO file which contains various ideas and issues that may be worth exploring.

7.1  Other implementations

See this list of grep implementations.

Take a look at these and consider opportunities for merging or cloning:

7.2  POSIX

In general, interesting things to check in POSIX/OpenGroup include:

7.2.1  POSIX and --ignore-case

For this issue, interesting things to check in POSIX include:

In particular, consider the following with POSIX's approach to case folding in mind. Assume a non-Turkic locale with a character repertoire reduced to the following forms of “LATIN LETTER I”:

0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049

First note the differing UTF-8 octet lengths of U+0049 (0x49) and U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This implies that whole UTF-8 strings cannot be case-converted in place, using the same memory buffer, and that the needed octet-size of the new buffer cannot merely be guessed.
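
The octet lengths quoted above can be checked with a few lines of Python. This is only a quick sanity check of the encodings; it does not model how grep itself handles buffers.

```python
# Code points from the "LATIN LETTER I" repertoire above.
for ch in ("I", "i", "\u0130", "\u0131"):
    encoded = ch.encode("utf-8")
    print("U+%04X" % ord(ch), len(encoded), encoded.hex())

# Upper-casing dotless i (two octets) yields plain I (one octet),
# so a case-converted string can need a different number of octets.
before = "\u0131".encode("utf-8")
after = "I".encode("utf-8")
print(len(before), "->", len(after))
```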

We have

lc(I) = i, uc(I) = I
lc(i) = i, uc(i) = I
lc(İ) = i, uc(İ) = İ
lc(ı) = ı, uc(ı) = I

where lc() and uc() denote lower-case and upper-case conversions.
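
The lc() and uc() values above follow mechanically from the four UnicodeData records quoted earlier. The sketch below parses those records and reads the simple upper-case and lower-case mapping fields; the helper names uc and lc mirror the notation used here and are not functions of any library.

```python
UNICODE_DATA = """\
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
"""

UPPER, LOWER = {}, {}
for line in UNICODE_DATA.splitlines():
    fields = line.split(";")
    code = chr(int(fields[0], 16))
    # Fields 12 and 13 hold the simple upper-case and lower-case mappings;
    # an empty field means the character maps to itself.
    UPPER[code] = chr(int(fields[12], 16)) if fields[12] else code
    LOWER[code] = chr(int(fields[13], 16)) if fields[13] else code

def uc(ch):
    return UPPER.get(ch, ch)

def lc(ch):
    return LOWER.get(ch, ch)

for ch in "Ii\u0130\u0131":
    print("lc(%s) = %s, uc(%s) = %s" % (ch, lc(ch), ch, uc(ch)))
```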

There are several candidate --ignore-case logics (including the one mandated by POSIX):

Any optimization in the implementation of each logic must not change its basic semantics.

7.3  Unicode

In general, interesting things to check in Unicode include:

7.3.1  Unicode and --ignore-case

For this issue, interesting things to check in Unicode include:

Unicode uses the

if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))

logic for caseless matching. Let's consider the “LATIN LETTER I” example mentioned above. In a non-Turkic locale, simple case folding yields

toCasefold_simple(U+0049) = U+0069
toCasefold_simple(U+0069) = U+0069
toCasefold_simple(U+0130) = U+0130
toCasefold_simple(U+0131) = U+0131

which leads to the following matches:

  \in  I  i  İ  ı
pat\   ----------
"I" |  Y  Y  n  n
"i" |  Y  Y  n  n
"İ" |  n  n  Y  n
"ı" |  n  n  n  Y

This is different from anything so far!
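
The table above can be reproduced from the simple case folding values alone. In this sketch the mapping is written out by hand for the four-letter repertoire, and matches_simple is an illustrative name rather than an existing function.

```python
# Simple case folding for the four-letter repertoire, as listed above.
SIMPLE_FOLD = {"I": "i", "i": "i", "\u0130": "\u0130", "\u0131": "\u0131"}
LETTERS = ["I", "i", "\u0130", "\u0131"]

def matches_simple(pattern, text):
    """Caseless comparison of two single letters by simple case folding."""
    return SIMPLE_FOLD[pattern] == SIMPLE_FOLD[text]

# One row per pattern letter, one column per input letter.
for pattern in LETTERS:
    row = ["Y" if matches_simple(pattern, text) else "n" for text in LETTERS]
    print(pattern, " ".join(row))
```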

In a non-Turkic locale, full case folding yields

toCasefold_full(U+0049) = U+0069
toCasefold_full(U+0069) = U+0069
toCasefold_full(U+0130) = <U+0069, U+0307>
toCasefold_full(U+0131) = U+0131

with

0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;

which leads to the following matches:

  \in  I  i  İ  ı
pat\   ----------
"I" |  Y  Y  *  n
"i" |  Y  Y  *  n
"İ" |  n  n  Y  n
"ı" |  n  n  n  Y

This is just sad!
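
The full case folding table can be reproduced in the same way, under one assumption about its notation: that "*" marks a pattern found inside the folded input without covering all of it (here, "i" found in <U+0069, U+0307>). The mapping is written out by hand from the values above, and classify_full is an illustrative name.

```python
# Full case folding for the four-letter repertoire, as listed above.
FULL_FOLD = {"I": "i", "i": "i", "\u0130": "i\u0307", "\u0131": "\u0131"}
LETTERS = ["I", "i", "\u0130", "\u0131"]

def classify_full(pattern, text):
    """Return 'Y', '*', or 'n' for one pattern letter against one input letter."""
    folded_pattern = FULL_FOLD[pattern]
    folded_text = FULL_FOLD[text]
    if folded_pattern == folded_text:
        return "Y"
    # Assumption: '*' marks a hit that covers only part of the folded input.
    if folded_pattern in folded_text:
        return "*"
    return "n"

for pattern in LETTERS:
    print(pattern, " ".join(classify_full(pattern, text) for text in LETTERS))
```

Python's built-in str.casefold() is documented as performing full case folding; the explicit table is used here so that the sketch depends only on the values quoted in this section.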

Note that having toCasefold(U+0131), simple or full, map to itself instead of to U+0069 contradicts the rules of Section 5.18 of the Unicode Standard, since toUpperCase(U+0131) is U+0049. The same holds for toCasefold_simple(U+0130), since toLowerCase(U+0130) is U+0069. The justification for the weird toCasefold_full(U+0130) mapping is unknown; it doesn't even make sense to add a dot (U+0307) to a letter that already has one (U+0069). It would have been so simple to put them all in the same equivalence class!

Also consider the following problem with Unicode's approach to case folding in mind. Assume that we want to perform

echo 'AßBC' | grep -i 'Sb'

which corresponds to

input:    U+0041 U+00DF U+0042 U+0043 U+000A
pattern:  U+0053 U+0062

Following “CaseFolding-4.1.0.txt”, applying the toCasefold() transformation to these yields

input:    U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
pattern:                U+0073 U+0062

so, according to this approach, the input should match the pattern. As long as the original input line is to be reported to the user as a whole, there is no problem (from the user's point-of-view; implementation is complicated by this).
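
The folding step can be tried out in Python, whose str.casefold() method is documented as applying full case folding. This is only a sketch of the toCasefold() comparison described above, not of anything grep currently does.

```python
line = "A\u00dfBC"   # the input line, without its trailing newline
pattern = "Sb"

folded_line = line.casefold()
folded_pattern = pattern.casefold()

# Substring search on the folded forms finds the pattern.
print(folded_line, folded_pattern in folded_line)

# Plain lower-casing leaves U+00DF alone, so the same search fails.
print(line.lower(), pattern.lower() in line.lower())
```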

However, consider both these GNU extensions:

echo 'AßBC' | grep -i --only-matching 'Sb'
echo 'AßBC' | grep -i --color=always  'Sb'

What is to be reported in these cases, since the match begins in the middle of the original input character 'ß'?
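
To make the difficulty concrete, the sketch below folds the input one character at a time and remembers where each folded character came from. The helper fold_with_origins is hypothetical, and Python's str.casefold() stands in for toCasefold(); the sketch only locates the match and does not decide what ought to be reported.

```python
def fold_with_origins(text):
    """Fold text, recording for each folded character the index of the
    original character it came from and its position in that expansion."""
    folded, origins = [], []
    for index, ch in enumerate(text):
        for offset, out in enumerate(ch.casefold()):
            folded.append(out)
            origins.append((index, offset))
    return "".join(folded), origins

line, pattern = "A\u00dfBC", "Sb"
folded, origins = fold_with_origins(line)
folded_pattern = pattern.casefold()
start = folded.find(folded_pattern)
end = start + len(folded_pattern)

first_index, first_offset = origins[start]
last_index, _ = origins[end - 1]
print("match touches original characters", first_index, "to", last_index)
print("match starts inside an expansion:", first_offset != 0)
```

With this bookkeeping an implementation can at least detect that a match boundary falls inside an expanded character, which is the case that --only-matching and --color would have to handle explicitly.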

Note that Unicode's toCasefold() cannot be implemented in terms of POSIX's towctrans(), since that function can return only a single wint_t value per input wint_t value.

7.4  Miscellaneous

8  Distributors

The purpose of this listing is to help GNU grep maintainers track down bug fixes and improvements made by distributors so they can be integrated back into the upstream releases from GNU, if appropriate.

Users should not use this listing to find an alternative place to send their bug reports. These are still best sent upstream, to the GNU grep team, through the bug-grep@gnu.org mailing list or the GNU grep project page on Savannah.

This listing is not exhaustive; priority is given to listing distributors who actually maintain patches to the upstream package from GNU.

Please keep this listing sorted by entry. Each field type may appear more than once if appropriate, the field order being significant.

Debian GNU/Linux
  Web site: http://www.debian.org/
  Package database entry: Old stable http://packages.debian.org/oldstable/base/grep
  Maintainer: Robert van der Meulen <rvdm at debian.org>
  Package database entry: Stable http://packages.debian.org/stable/base/grep
  Maintainer: Ryan M. Golbeck <rmgolbeck at debian.org>
  Maintainer: Jeff Bailey <jbailey at nisa.net>
  Package database entry: Testing http://packages.debian.org/testing/base/grep
  Package database entry: Unstable http://packages.debian.org/unstable/base/grep
  Maintainer: Anibal Monsalve Salazar <anibal at debian.org>
  Maintainer: Santiago Ruano Rincon <santiago at unicauca.edu.co>
  Bug tracking: http://bugs.debian.org/grep
  Source package name: grep
  Binary package name: grep
  Entry updated: 2005-11-08

Fedora Core/Red Hat
  Web site: http://fedora.redhat.com/
  Web site: http://www.redhat.com/
  Maintainer: Tim Waugh <twaugh at redhat.com>
  Bug tracking: Red Hat Bugzilla http://bugzilla.redhat.com/
  Managed repository: cvs -d:pserver:anonymous@cvs.fedora.redhat.com:/cvs/dist co devel/grep
  Managed repository: http://cvs.fedora.redhat.com/viewcvs/devel/grep/
  Source package name: grep
  Binary package name: grep
  Entry updated: 2005-05-05

FreeBSD
  Web site: http://www.freebsd.org/
  Bug tracking: http://www.freebsd.org/cgi/query-pr-summary.cgi?query
  Managed repository: CVS_RSH=ssh cvs -d:ext:freebsdanoncvs@anoncvs.FreeBSD.org:/home/ncvs co src/gnu/usr.bin/grep
  Managed repository: http://www.freebsd.org/cgi/cvsweb.cgi/src/gnu/usr.bin/grep/
  Entry updated: 2005-05-05

Gentoo Linux
  Web site: http://www.gentoo.org/
  Package database entry: http://packages.gentoo.org/packages/?category=sys-apps;name=grep
  Bug tracking: Gentoo Bugzilla http://bugs.gentoo.org/
  Managed repository: http://www.gentoo.org/cgi-bin/viewcvs.cgi/sys-apps/grep/
  Source package name: grep
  Binary package name: grep
  Entry updated: 2005-05-05

Mandriva Linux
  Web site: http://www.mandrivalinux.com/
  Bug tracking: Mandriva Bugzilla http://qa.mandriva.com/
  Source package name: grep
  Binary package name: grep
  Entry updated: 2005-05-05

NetBSD
  Web site: http://www.netbsd.org/
  Package database entry: ftp://ftp.netbsd.org/pub/NetBSD/packages/pkgsrc/textproc/grep/README.html
  Bug tracking: http://www.netbsd.org/Misc/query-pr.html
  Managed repository: cvs -d:pserver:anoncvs@anoncvs.NetBSD.org:/cvsroot co pkgsrc/textproc/grep
  Managed repository: http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/textproc/grep/
  Source package name: grep
  Binary package name: grep
  Entry updated: 2005-05-05

OpenBSD
  Web site: http://www.openbsd.org/
  Package database entry: http://www.openbsd.org/3.8_packages/i386/ggrep-2.5.1p1.tgz-long.html
  Maintainer: Christian Weisgerber <naddy at openbsd.org>
  Bug tracking: http://www.openbsd.org/query-pr.html
  Managed repository: cvs -d:pserver:anoncvs@anoncvs1.ca.openbsd.org:/cvs co ports/sysutils/ggrep
  Managed repository: http://www.openbsd.org/cgi-bin/cvsweb/ports/sysutils/ggrep/
  Source package name: ggrep
  Binary package name: ggrep
  Entry updated: 2005-11-08

OpenPKG
  Web site: http://www.openpkg.org/
  Maintainer: Ralf S. Engelschall <rse at openpkg.org>
  Managed repository: cvs -d :pserver:anonymous@cvs.openpkg.org:/v/openpkg/cvs co openpkg-src/grep
  Managed repository: rsync -av rsync://rsync.openpkg.org/openpkg-cvs/openpkg-src/grep/ .
  Managed repository: http://cvs.openpkg.org/dir?d=openpkg-src/grep
  Source package name: grep
  Binary package name: grep
  Entry updated: 2005-06-19

SuSE Linux
  Web site: http://www.novell.com/linux/suse/
  Maintainer: Andreas Schwab <schwab at suse.de>
  Package database entry: Professional http://www.novell.com/products/linuxpackages/professional/grep.html
  Source package name: grep
  Binary package name: grep
  Entry updated: 2005-06-19
