Fsdb(3)               User Contributed Perl Documentation              Fsdb(3)

NAME
Fsdb - a flat-text database for shell scripting

SYNOPSIS
Fsdb, the flatfile streaming database, is a package of commands for manipulating flat-ASCII databases from shell scripts. Fsdb is useful for processing medium amounts of data (with very little data you'd do it by hand; with megabytes you might want a real database). Fsdb was known as Jdb from 1991 to Oct. 2008.

Fsdb is very good at doing things like:

o   extracting measurements from experimental output
o   examining data to address different hypotheses
o   joining data from different experiments
o   eliminating/detecting outliers
o   computing statistics on data (mean, confidence intervals, correlations, histograms)
o   reformatting data for graphing programs

Fsdb is built around the idea of a flat text file as a database. Fsdb files (by convention, with the extension .fsdb) have a header documenting the schema (what the columns mean), and then each line represents a database record (or row). For example:

    #fsdb experiment duration
    ufs_mab_sys 37.2
    ufs_mab_sys 37.3
    ufs_rcp_real 264.5
    ufs_rcp_real 277.9

is a simple file with four experiments (the rows), each with a description and a run time in the first and second columns.

Rather than hand-code scripts to do each special case, Fsdb provides higher-level functions. Although it's often easy to throw together a custom script to do any single task, I believe that there are several advantages to using this library:

o   these programs provide a higher level interface than plain Perl, so

    ** Fewer lines of simpler code:

        dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration

    picks out just one type of experiment and computes statistics on it, rather than:

        while (<>) { split; $sum+=$F[1]; $ss+=$F[1]**2; $n++; }
        $mean = $sum / $n; $std_dev = ...

    in dozens of places.

o   the library uses names for columns, so

    ** No more $F[1], use "_duration".

    ** New or different order columns?
No changes to your scripts! Thus if your experiment gets more complicated with a size parameter, so your log changes to: #fsdb experiment size duration ufs_mab_sys 1024 37.2 ufs_mab_sys 1024 37.3 ufs_rcp_real 1024 264.5 ufs_rcp_real 1024 277.9 ufs_mab_sys 2048 45.3 ufs_mab_sys 2048 44.2 Then the previous scripts still work, even though duration is now the third column, not the second. +o A series of actions are self-documenting (each program records what it does). ** No more wondering what hacks were used to compute the final data, just look at the comments at the end of the output. For example, the commands dbrow '_experiment eq "ufs_mab_sys"' | dbcolstats duration add to the end of the output the lines # | dbrow _experiment eq "ufs_mab_sys" # | dbcolstats duration +o The library is mature, supporting large datasets, corner cases, error handling, backed by an automated test suite. ** No more puzzling about bad output because your custom script skimped on error checking. ** No more memory thrashing when you try to sort ten million records. +o Fsdb-2.x supports Perl scripting (in addition to shell scripting), with libraries to do Fsdb input and output, and easy support for pipelines. The shell script dbcol name test1 | dbroweval '_test1 += 5;' can be written in perl as: dbpipeline(dbcol(qw(name test1)), dbroweval('_test1 += 5;')); (The disadvantage is that you need to learn what functions Fsdb provides.) Fsdb is built on flat-ASCII databases. By storing data in simple text files and processing it with pipelines it is easy to experiment (in the shell) and look at the output. To the best of my knowledge, the original implementation of this idea was "/rdb", a commercial product described in the book UNIX relational database management: application development in the UNIX environment by Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web page ). Fsdb is an incompatible re-implementation of their idea without any accelerated indexing or forms support. 
(But it's free, and probably has better statistics!)

Fsdb-2.x supports threading and will exploit multiple processors or cores, and provides Perl-level support for input, output, and threaded pipelines.

Installation instructions follow at the end of this document. Fsdb-2.x requires Perl 5.8 to run. All commands have manual pages and provide usage with the "--help" option. All commands are backed by an automated test suite.

The most recent version of Fsdb is available on the web at .

WHAT'S NEW
2.50, 2014-05-27 a quick release for spec tweaks

ENHANCEMENT
    In dbroweval, the "-N" (no output, even comments) option now implies "-n", and it now suppresses the header and trailer.
BUG FIX
    A few more tweaks to the perl-Fsdb.spec from Petr Šabata.
BUG FIX
    Fixed 3 uses of "use v5.10" in test suites that were causing test failures (due to warnings, not real failures) on some platforms.

README CONTENTS
    executive summary
    what's new
    README CONTENTS
    installation
    basic data format
    basic data manipulation
    list of commands
    another example
    a gradebook example
    a password example
    history
    related work
    release notes
    copyright
    comments

INSTALLATION
Fsdb now uses the standard Perl build and installation from ExtUtils::MakeMaker(3), so the quick answer to installation is to type:

    perl Makefile.PL
    make
    make test
    make install

Or, if you want to install it somewhere else, change the first line to

    perl Makefile.PL PREFIX=$HOME

and it will go in your home directory's bin, etc. (See ExtUtils::MakeMaker(3) for more details.)

Fsdb requires perl 5.8 or later and uses ithreads.

A test suite is available; run it with

    make test

A FreeBSD port to Fsdb is available, see .

A Fink (MacOS X) port is available, see . (Thanks to Lars Eggert for maintaining this port.)

BASIC DATA FORMAT
These programs are based on the idea of storing data in simple ASCII files. A database is a file with one header line and then data or comment lines.
For example:

    #fsdb account passwd uid gid fullname homedir shell
    johnh * 2274 134 John_Heidemann /home/johnh /bin/bash
    greg * 2275 134 Greg_Johnson /home/greg /bin/bash
    root * 0 0 Root /root /bin/bash
    # this is a simple database

The header line must be first and begins with "#fsdb". There are rows (records) and columns (fields), just like in a normal database. Comment lines begin with "#". Column names are any string not containing spaces or single quotes (although it is prudent to keep them alphanumeric with underscore).

By default, columns are delimited by whitespace. With this default configuration, the contents of a field cannot contain whitespace. However, this limitation can be relaxed by changing the field separator as described below.

The big advantage of simple flat-text databases is that it is usually easy to massage data into this format, and it's reasonably easy to take data out of this format into other (text-based) programs, like gnuplot, jgraph, and LaTeX. Think Unix. Think pipes. (Or even output to Excel and HTML if you prefer.)

Since no-whitespace in columns was a problem for some applications, there's an option which relaxes this rule. You can specify the field separator in the table header with "-F x" where "x" is a code for the new field separator. A full list of codes is at dbfilealter(1), but two common special values are "-F t", which is a separator of a single tab character, and "-F S", a separator of two spaces. Both allow (single) spaces in fields. An example:

    #fsdb -F S account passwd uid gid fullname homedir shell
    johnh  *  2274  134  John Heidemann  /home/johnh  /bin/bash
    greg  *  2275  134  Greg Johnson  /home/greg  /bin/bash
    root  *  0  0  Root  /root  /bin/bash
    # this is a simple database

See dbfilealter(1) for more details. Regardless of what the column separator is for the body of the data, it's always whitespace in the header.

There's also a third format: a "list".
Because it's often hard to see what's in columns past the first two, in list format each "column" is on a separate line. The programs dblistize and dbcolize convert to and from this format, and all programs work with either format. The command

    dbfilealter -R C < DATA/passwd.fsdb

outputs:

    #fsdb -R C account passwd uid gid fullname homedir shell
    account: johnh
    passwd: *
    uid: 2274
    gid: 134
    fullname: John_Heidemann
    homedir: /home/johnh
    shell: /bin/bash

    account: greg
    passwd: *
    uid: 2275
    gid: 134
    fullname: Greg_Johnson
    homedir: /home/greg
    shell: /bin/bash

    account: root
    passwd: *
    uid: 0
    gid: 0
    fullname: Root
    homedir: /root
    shell: /bin/bash

    # this is a simple database
    # | dblistize

See dbfilealter(1) for more details.

BASIC DATA MANIPULATION
A number of programs exist to manipulate databases. Complex functions can be made by stringing together commands with shell pipelines. For example, to print the home directories of everyone with "john" in their names, you would do:

    cat DATA/passwd | dbrow '_fullname =~ /John/' | dbcol homedir

The output might be:

    #fsdb homedir
    /home/johnh
    /home/greg
    # this is a simple database
    # | dbrow _fullname =~ /John/
    # | dbcol homedir

(Notice that comments are appended to the output listing each command, providing an automatic audit log.)

In addition to typical database functions (select, join, etc.) there are also a number of statistical functions.

The real power of Fsdb is that one can apply arbitrary code to rows to do powerful things.

    cat DATA/passwd | dbroweval '_fullname =~ s/(\w+)_(\w+)/$2,_$1/'

converts "John_Heidemann" into "Heidemann,_John". With not much more work, one could split fullname into firstname and lastname fields.

TALKING ABOUT COLUMNS
An advantage of Fsdb is that you can talk about columns by name (symbolically) rather than simply by their positions. So in the above example, "dbcol homedir" pulled out the home directory column, and "dbrow '_fullname =~ /John/'" matched against column fullname.
In general, you can use the name of the column listed on the "#fsdb" line to identify it in most programs, and _name to identify it in code.

Some alternatives for flexibility:

o   Numeric values identify columns positionally, numbering from 0. So 0 or _0 is the first column, 1 is the second, etc.
o   In code, _last_columnname gets the value from columnname's previous row.

See dbroweval(1) for more details about writing code.

LIST OF COMMANDS
Enough said. I'll summarize the commands, and then you can experiment. For a summary of each command, run it with the argument "--help" (or "-?" if you prefer). Full manual pages can be found by running the command with the argument "--man", or running the Unix command "man dbcol" or whatever program you want.

TABLE CREATION
    dbcolcreate
        add columns to a database
    dbcoldefine
        set the column headings for a non-Fsdb file

TABLE MANIPULATION
    dbcol
        select columns from a table
    dbrow
        select rows from a table
    dbsort
        sort rows based on a set of columns
    dbjoin
        compute the natural join of two tables
    dbcolrename
        rename a column
    dbcolmerge
        merge two columns into one
    dbcolsplittocols
        split one column into two or more columns
    dbcolsplittorows
        split one column into multiple rows
    dbfilepivot
        "pivots" a file, converting multiple rows corresponding to the same entity into a single row with multiple columns
    dbfilevalidate
        check that db file doesn't have some common errors

COMPUTATION AND STATISTICS
    dbcolstats
        compute statistics over a column (mean, etc., optionally median)
    dbmultistats
        group rows by some key value, then compute stats (mean, etc.)
        over each group (equivalent to dbmapreduce with dbcolstats as the reducer)
    dbmapreduce
        group rows (map) and then apply an arbitrary function to each group (reduce)
    dbrvstatdiff
        compare two sample distributions (mean/conf interval/T-test)
    dbcolmovingstats
        compute moving statistics over a column of data
    dbcolstatscores
        compute Z-scores and T-scores over one column of data
    dbcolpercentile
        compute the rank or percentile of a column
    dbcolhisto
        compute histograms over a column of data
    dbcolscorrelate
        compute the coefficient of correlation over several columns
    dbcolsregression
        compute linear regression and correlation for two columns
    dbrowaccumulate
        compute a running sum over a column of data
    dbrowcount
        count the number of rows (a subset of dbstats)
    dbrowdiff
        compute differences between columns in each row of a table
    dbrowenumerate
        number each row
    dbroweval
        run arbitrary Perl code on each row
    dbrowuniq
        count/eliminate identical rows (like Unix uniq(1))
    dbfilediff
        compare fields on rows of a file (something like Unix diff(1))

OUTPUT CONTROL
    dbcolneaten
        pretty-print columns
    dbfilealter
        convert between column or list format, or change the column separator
    dbfilestripcomments
        remove comments from a table
    dbformmail
        generate a script that sends form mail based on each row

CONVERSIONS
(These programs convert data into fsdb. See their web pages for details.)

    cgi_to_db
    combined_log_format_to_db
    html_table_to_db
        HTML tables to fsdb (assuming they're reasonably formatted)
    kitrace_to_db
    ns_to_db
    tabdelim_to_db
        spreadsheet tab-delimited files to db
    tcpdump_to_db
        (see man tcpdump(8) on any reasonable system)
    xml_to_db
        XML input to fsdb, assuming the input is very regular

(And out of fsdb:)

    db_to_csv
        Comma-separated-value format from fsdb
    db_to_html_table
        simple conversion of Fsdb to html tables

STANDARD OPTIONS
Many programs have common options:

-? or --help
    Show basic usage.
-N or --new-name
    When a command creates a new column like dbrowaccumulate's "accum", this option lets one override the default name of that new column.
-T TmpDir
    Where to put temporary files. Also uses environment variable TMPDIR, if -T is not specified. Default is /tmp.
-c FRACTION or --confidence FRACTION
    Specify confidence interval FRACTION (dbcolstats, dbmultistats, etc.)
-C S or --element-separator S
    Specify column separator S (dbcolsplittocols, dbcolmerge).
-d or --debug
    Enable debugging (may be repeated for greater effect in some cases).
-a or --include-non-numeric
    Compute stats over all data (treating non-numbers as zeros). (By default, things that can't be treated as numbers are ignored for stats purposes.)
-S or --pre-sorted
    Assume the data is pre-sorted. May be repeated to disable verification (saving a small amount of work).
-e E or --empty E
    Give value E as the value for empty (null) records.
-i I or --input I
    Input data from file I.
-o O or --output O
    Write data out to file O.
--nolog
    Skip logging the program in a trailing comment.

When giving Perl code (in dbrow and dbroweval), column names can be embedded if preceded by underscores. Look at dbrow(1) or dbroweval(1) for examples.

Most programs run in constant memory and use temporary files if necessary. Exceptions are dbcolneaten, dbcolpercentile, dbmapreduce, dbmultistats, dbrowsplituniq.
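The underscore convention above relies on mapping names from the "#fsdb" header to field positions. As a rough illustration only, the sketch below does a by-name column lookup with plain awk. It is not Fsdb code and not how Fsdb is implemented; the function name select_col is made up for this example, and header options such as "-F S" are deliberately not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of Fsdb): print one column, chosen by name,
# from a whitespace-separated table whose first line is a "#fsdb" header.
select_col() {
    awk -v want="$1" '
        NR == 1 {
            # Header: field 1 is "#fsdb"; the remaining fields are column names.
            # Assumption: no header options (such as "-F S") are present.
            for (i = 2; i <= NF; i++) if ($i == want) col = i - 1
            if (!col) { print "no such column: " want > "/dev/stderr"; exit 1 }
            next
        }
        /^#/ { next }          # skip comment lines
        { print $col }         # data rows: emit the requested field
    '
}

# Usage: the script names the column, so its position in the file does not matter.
printf '%s\n' \
    '#fsdb experiment size duration' \
    'ufs_mab_sys 1024 37.2' \
    'ufs_rcp_real 1024 264.5' \
    '# a comment' |
    select_col duration
```

Because the lookup happens once on the header line, reordering or adding columns in the data file leaves the calling script unchanged, which is the property the named-column convention is meant to give.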
ANOTHER EXAMPLE
Take the raw data in "DATA/http_bandwidth", put a header on it ("dbcoldefine size bw"), take statistics of each category ("dbmultistats -k size bw"), pick out the relevant fields ("dbcol size mean stddev pct_rsd"), and you get:

    #fsdb size mean stddev pct_rsd
    1024 1.4962e+06 2.8497e+05 19.047
    10240 5.0286e+06 6.0103e+05 11.952
    102400 4.9216e+06 3.0939e+05 6.2863
    # | dbcoldefine size bw
    # | /home/johnh/BIN/DB/dbmultistats -k size bw
    # | /home/johnh/BIN/DB/dbcol size mean stddev pct_rsd

(The whole command was:

    cat DATA/http_bandwidth | dbcoldefine size bw | dbmultistats -k size bw | dbcol size mean stddev pct_rsd

all on one line.)

Then post-process them to get rid of the exponential notation by adding this to the end of the pipeline:

    dbroweval '_mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev);'

(Actually, this step is no longer required since dbcolstats now uses a different default format.)

giving:

    #fsdb size mean stddev pct_rsd
    1024 1496200 284970 19.047
    10240 5028600 601030 11.952
    102400 4921600 309390 6.2863
    # | dbcoldefine size bw
    # | dbmultistats -k size bw
    # | dbcol size mean stddev pct_rsd
    # | dbroweval { _mean = sprintf("%8.0f", _mean); _stddev = sprintf("%8.0f", _stddev); }

In a few lines, raw data is transformed to processed output.

Suppose you suspect an odd distribution of results for one data point. Fsdb can easily produce a CDF (cumulative distribution function) of the data, suitable for graphing:

    cat DB/DATA/http_bandwidth | \
        dbcoldefine size bw | \
        dbrow '_size == 102400' | \
        dbcol bw | \
        dbsort -n bw | \
        dbrowenumerate | \
        dbcolpercentile count | \
        dbcol bw percentile | \
        xgraph

The steps, roughly:
1. get the raw input data and turn it into fsdb format,
2. pick out just the relevant column (for efficiency) and sort it,
3. for each data point, assign a CDF percentage to it,
4.
pick out the two columns to graph and show them A GRADEBOOK EXAMPLE The first commercial program I wrote was a gradebook, so here's how to do it with Fsdb. Format your data like DATA/grades. #fsdb name email id test1 a a@ucla.example.edu 1 80 b b@usc.example.edu 2 70 c c@isi.example.edu 3 65 d d@lmu.example.edu 4 90 e e@caltech.example.edu 5 70 f f@oxy.example.edu 6 90 Or if your students have spaces in their names, use "-F S" and two spaces to separate each column: #fsdb -F S name email id test1 alfred aho a@ucla.example.edu 1 80 butler lampson b@usc.example.edu 2 70 david clark c@isi.example.edu 3 65 constantine drovolis d@lmu.example.edu 4 90 debrorah estrin e@caltech.example.edu 5 70 sally floyd f@oxy.example.edu 6 90 To compute statistics on an exam, do cat DATA/grades | dbstats test1 |dblistize giving #fsdb -R C ... mean: 77.5 stddev: 10.84 pct_rsd: 13.987 conf_range: 11.377 conf_low: 66.123 conf_high: 88.877 conf_pct: 0.95 sum: 465 sum_squared: 36625 min: 65 max: 90 n: 6 ... To do a histogram: cat DATA/grades | dbcolhisto -n 5 -g test1 giving #fsdb low histogram 65 * 70 ** 75 80 * 85 90 ** # | /home/johnh/BIN/DB/dbhistogram -n 5 -g test1 Now you want to send out grades to the students by e-mail. Create a form-letter (in the file test1.txt): To: _email (_name) From: J. Random Professor Subject: test1 scores _name, your score on test1 was _test1. 86+ A 75-85 B 70-74 C 0-69 F Generate the shell script that will send the mail out: cat DATA/grades | dbformmail test1.txt > test1.sh And run it: sh passwd.fsdb To convert the group file cat /etc/group | sed 's/:/ /g' | \ dbcoldefine -F S group password gid members \ >group.fsdb To show the names of the groups that div7-members are in (assuming DIV7 is in the gecos field): cat passwd.fsdb | dbrow '_gecos =~ /DIV7/' | dbcol login gid | \ dbjoin -i - -i group.fsdb gid | dbcol login group SHORT EXAMPLES Which Fsdb programs are the most complicated (based on number of test cases)? 
    ls TEST/*.cmd | \
        dbcoldefine test | \
        dbroweval '_test =~ s@^TEST/([^_]+).*$@$1@' | \
        dbrowuniq -c | \
        dbsort -nr count | \
        dbcolneaten

(Answer: dbmapreduce, then dbcolstats, dbfilealter and dbjoin.)

Stats on an exam (in $FILE, where $COLUMN is the name of the exam)?

    cat $FILE | dbcolstats -q 4 $COLUMN | dblistize | dbstripcomments

    cat $FILE | dbcolhisto -g -n 20 $COLUMN | dbcolneaten | dbstripcomments

Merging the hw1 column from file hw1.fsdb into grades.fsdb, assuming there's a common student id in column "id":

    dbcol id hw1 <hw1.fsdb >t.fsdb

    dbjoin -a -e - grades.fsdb t.fsdb id | \
        dbsort name | \
        dbcolneaten >new_grades.fsdb

Merging two fsdb files with the same rows:

    cat file1.fsdb file2.fsdb >output.fsdb

or if you want to clean things up a bit

    cat file1.fsdb file2.fsdb | dbstripextraheaders >output.fsdb

or if you want to know where the data came from

    for i in 1 2
    do
        dbcolcreate source $i < file$i.fsdb
    done >output.fsdb

(assumes you're using a Bourne-shell compatible shell, not csh).

WARNINGS
As with any tool, one should (which means must) understand the limits of the tool.

All Fsdb tools should run in constant memory. In some cases (such as dbcolstats with quartiles, where the whole input must be re-read), programs will spool data to disk if necessary.

Most tools buffer one or a few lines of data, so memory will scale with the size of each line. (So lines with many columns, or columns holding lots of data, may cause large memory consumption.)

All Fsdb tools should run in constant or at worst "n log n" time.

All Fsdb tools use normal Perl math routines for computation. Although I make every attempt to choose numerically stable algorithms (and I welcome feedback and suggestions for improvement), normal rounding due to computer floating point approximations can result in inaccuracies when data spans a large range of precisions. (See for example the dbcolstats_extrema test cases.)
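To make the floating-point caveat concrete, the sketch below contrasts a naive sum-of-squares variance with Welford's online update, on values that are large relative to their spread. It uses plain awk, not any Fsdb command, and the function names are hypothetical. Welford's method appears here as a standard numerically stable technique, not as a description of what dbcolstats does internally.

```shell
#!/bin/sh
# Hypothetical helpers (plain awk, not Fsdb): sample variance of the numbers on stdin.

# Naive one-pass formula: (sum of squares - sum^2/n) / (n - 1).
# Subtracting two huge, nearly equal numbers can cancel away the significant digits.
naive_var() {
    awk '{ s += $1; ss += $1 * $1; n++ }
         END { printf "%.6g\n", (ss - s * s / n) / (n - 1) }'
}

# Welford's online update: keeps a running mean and a running sum of
# squared deviations (m2), so no large intermediate values are subtracted.
welford_var() {
    awk '{ n++; d = $1 - mean; mean += d / n; m2 += d * ($1 - mean) }
         END { printf "%.6g\n", m2 / (n - 1) }'
}

# Three values near 1e9 that differ by 1; the true sample variance is 1.
data='1000000000
1000000001
1000000002'

echo "naive:   $(printf '%s\n' "$data" | naive_var)"
echo "welford: $(printf '%s\n' "$data" | welford_var)"
```

With double-precision arithmetic the Welford version recovers the true variance of 1 for this input, while the naive formula is prone to losing it to rounding. On small, well-scaled data the two agree.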
Any requirements and limitations of each Fsdb tool are documented on its manual page.

If any Fsdb program violates these assumptions, that is a bug that should be documented on the tool's manual page or ideally fixed.

Fsdb does depend on Perl's correctness, and Perl (and Fsdb) have some bugs. Fsdb should work on perl from version 5.10 onward, but its use of threads gives bogus warnings in some versions of perl:

o   perl-5.10 and 5.12 generate warnings "unbalanced string table refcount" and "scalars leaked" in dbmapreduce
o   perl-5.10 generates warning "Attempt to free unreferenced scalar" in dbmultistats.

To my knowledge these do not affect the correctness of the output, other than cluttering it up with warnings.

HISTORY
There have been three versions of Fsdb; fsdb 1.0 is a complete re-write of the pre-1995 versions, and was distributed from 1995 to 2007. Fsdb 2.0 is a significant re-write of the 1.x versions for reasons described below.

Fsdb (in its various forms) has been used extensively by its author since 1991. Since 1995 it's been used by two other researchers at UCLA and several at ISI. In February 1998 it was announced to the Internet. Since then it has found a few users, some outside where I work.

Fsdb 2.0 Rationale
I've thought about fsdb-2.0 for many years, but it was started in earnest in 2007. Fsdb-2.0 has the following goals:

in-one-process processing
    While fsdb is great on the Unix command line as a pipeline between programs, it should also be possible to set it up to run in a single process. And if it does so, it should be able to avoid serializing and deserializing (converting to and from text) data between each module. (Accomplished in fsdb-2.0: see dbpipeline, although still needs tuning.)
clean IO API
    Fsdb's roots go back to perl4 and 1991, so the fsdb-1.x library is very, very crufty.
    More than just being ugly (but it was that too), this made things like reading from one format file and writing to another the application's job, when it should be the library's. (Accomplished in fsdb-1.15 and improved in 2.0: see Fsdb::IO.)
normalized module APIs
    Because fsdb modules were added as needed over 10 years, sometimes the module APIs became inconsistent. (For example, the 1.x "dbcolcreate" required an empty value following the name of the new column, but other programs specify empty values with the "-e" argument.) We should smooth over these inconsistencies. (Accomplished as each module was ported in 2.0 through 2.7.)
everyone handles all input formats
    Given a clean IO API, the distinction between "colized" and "listized" fsdb files should go away. Any program should be able to read and write files in any format. (Accomplished in fsdb-2.1.)

Fsdb-2.0 preserves backwards compatibility where possible, but breaks it where necessary to accomplish the above goals. In August 2008, fsdb-2.7 was declared preferred over the 1.x versions.

Contributors
Fsdb includes code ported from Geoff Kuenning ("Fsdb::Support::TDistribution").

Fsdb contributors: Ashvin Goel goel@cse.oge.edu, Geoff Kuenning geoff@fmg.cs.ucla.edu, Vikram Visweswariah visweswa@isi.edu, Kannan Varadahan kannan@isi.edu, Lars Eggert larse@isi.edu, Arkadi Gelfond arkadig@dyna.com, David Graff graff@ldc.upenn.edu, Haobo Yu haoboy@packetdesign.com, Pavlin Radoslavov pavlin@catarina.usc.edu, Graham Phillips, Yuri Pradkin, Alefiya Hussain, Ya Xu, Michael Schwendt, Fabio Silva fabio@isi.edu, Jerry Zhao zhaoy@isi.edu, Ning Xu nxu@aludra.usc.edu, Martin Lukac mlukac@lecs.cs.ucla.edu, Xue Cai, Michael McQuaid, Christopher Meng, Calvin Ardi.

Fsdb includes datasets contributed from NIST (DATA/nist_zarr13.fsdb), from , the NIST/SEMATECH e-Handbook of Statistical Methods, section 1.4.2.8.1. Background and Data. The source is public domain, and reproduced with permission.
RELATED WORK As stated in the introduction, Fsdb is an incompatible reimplementation of the ideas found in "/rdb". By storing data in simple text files and processing it with pipelines it is easy to experiment (in the shell) and look at the output. The original implementation of this idea was /rdb, a commercial product described in the book UNIX relational database management: application development in the UNIX environment by Rod Manis, Evan Schaffer, and Robert Jorgensen (and also at the web page ). While Fsdb is inspired by Rdb, it includes no code from it, and Fsdb makes several different design choices. In particular: rdb attempts to be closer to a "real" database, with provision for locking, file indexing. Fsdb focuses on single user use and so eschews these choices. Rdb also has some support for interactive editing. Fsdb leaves editing to text editors like emacs or vi. In August, 2002 I found out Carlo Strozzi extended RDB with his package NoSQL . According to Mr. Strozzi, he implemented NoSQL in awk to avoid the Perl start-up of RDB. Although I haven't found Perl startup overhead to be a big problem on my platforms (from old Sparcstation IPCs to 2GHz Pentium-4s), you may want to evaluate his system. The Linux Journal has a description of NoSQL at . It seems quite similar to Fsdb. Like /rdb, NoSQL supports indexing (not present in Fsdb). Fsdb appears to have richer support for statistics, and, as of Fsdb-2.x, its support for Perl threading may support faster performance (one-process, less serialization and deserialization). RELEASE NOTES Versions prior to 1.0 were released informally on my web page but were not announced. 0.0 1991 started for my own research use 0.1 26-May-94 first check-in to RCS 0.2 15-Mar-95 parts now require perl5 1.0, 22-Jul-97 adds autoconf support and a test script. 
1.1, 20-Jan-98
    support for double space field separators, better tests
1.2, 11-Feb-98
    minor changes and release on comp.lang.perl.announce
1.3, 17-Mar-98
    o   adds median and quartile options to dbstats
    o   adds dmalloc_to_db converter
    o   fixes some warnings
    o   dbjoin now can run on unsorted input
    o   fixes a dbjoin bug
    o   some more tests in the test suite
1.4, 27-Mar-98
    o   improves error messages (all should now report the program that makes the error)
    o   fixed a bug in dbstats output when the mean is zero
1.5, 25-Jun-98
    BUG FIX dbcolhisto, dbcolpercentile now handle non-numeric values like dbstats
    NEW dbcolstats computes zscores and tscores over a column
    NEW dbcolscorrelate computes correlation coefficients between two columns
    INTERNAL ficus_getopt.pl has been replaced by DbGetopt.pm
    BUG FIX all tests are now "portable" (previously some tests ran only on my system)
    BUG FIX you no longer need to have the db programs in your path (fix arose from a discussion with Arkadi Gelfond)
    BUG FIX installation no longer uses cp -f (to work on SunOS 4)
1.6, 24-May-99
    NEW dbsort, dbstats, dbmultistats now run in constant memory (using tmp files if necessary)
    NEW dbcolmovingstats does moving means over a series of data
    NEW dbcol has a -v option to get all columns except those listed
    NEW dbmultistats does quartiles and medians
    NEW dbstripextraheaders now also cleans up bogus comments before the first header
    BUG FIX dbcolneaten works better with double-space-separated data
1.7, 5-Jan-00
    NEW dbcolize now detects and rejects lines that contain embedded copies of the field separator
    NEW configure tries harder to prevent people from improperly configuring/installing fsdb
    NEW tcpdump_to_db converter (incomplete)
    NEW tabdelim_to_db converter: from spreadsheet tab-delimited files to db
    NEW mailing lists for fsdb are "fsdb-announce@heidemann.la.ca.us" and "fsdb-talk@heidemann.la.ca.us". To subscribe to either, send mail to "fsdb-announce-request@heidemann.la.ca.us" or
"fsdb-talk-request@heidemann.la.ca.us" with "subscribe" in the BODY of the message. BUG FIX dbjoin used to produce incorrect output if there were extra, unmatched values in the 2nd table. Thanks to Graham Phillips for providing a test case. BUG FIX the sample commands in the usage strings now all should explicitly include the source of data (typically from "cat foo.fsdb |"). Thanks to Ya Xu for pointing out this documentation deficiency. BUG FIX (DOCUMENTATION) dbcolmovingstats had incorrect sample output. 1.8, 28-Jun-00 BUG FIX header options are now preserved when writing with dblistize NEW dbrowuniq now optionally checks for uniqueness only on certain fields NEW dbrowsplituniq makes one pass through a file and splits it into separate files based on the given fields NEW converter for "crl" format network traces NEW anywhere you use arbitrary code (like dbroweval), _last_foo now maps to the last row's value for field _foo. OPTIMIZATION comment processing slightly changed so that dbmultistats now is much faster on files with lots of comments (for example, ~100k lines of comments and 700 lines of data!) (Thanks to Graham Phillips for pointing out this performance problem.) BUG FIX dbstats with median/quartiles now correctly handles singleton data points. 1.9, 6-Nov-00 NEW dbfilesplit, split a single input file into multiple output files (based on code contributed by Pavlin Radoslavov). 
BUG FIX dbsort now works with perl-5.6 1.10, 10-Apr-01 BUG FIX dbstats now handles the case where there are more n-tiles than data NEW dbstats now includes a -S option to optimize work on pre-sorted data (inspired by code contributed by Haobo Yu) BUG FIX dbsort now has a better estimate of memory usage when run on data with very short records (problem detected by Haobo Yu) BUG FIX cleanup of temporary files is slightly better 1.11, 2-Nov-01 BUG FIX dbcolneaten now runs in constant memory NEW dbcolneaten now supports "field specifiers" that allow some control over how wide columns should be OPTIMIZATION dbsort now tries hard to be filesystem cache-friendly (inspired by "Information and Control in Gray-box Systems" by the Arpaci-Dusseau's at SOSP 2001) INTERNAL t_distr now ported to perl5 module DbTDistr 1.12, 30-Oct-02 BUG FIX dbmultistats documentation typo fixed NEW dbcolmultiscale NEW dbcol has -r option for "relaxed error checking" NEW dbcolneaten has new -e option to strip end-of-line spaces NEW dbrow finally has a -v option to negate the test BUG FIX math bug in dbcoldiff fixed by Ashvin Goel (need to check Scheaffer test cases) BUG FIX some patches to run with Perl 5.8. Note: some programs (dbcolmultiscale, dbmultistats, dbrowsplituniq) generate warnings like: "Use of uninitialized value in concatenation (.)" or "string at /usr/lib/perl5/5.8.0/FileCache.pm line 98, line 2". Please ignore this until I figure out how to suppress it. (Thanks to Jerry Zhao for noticing perl-5.8 problems.) BUG FIX fixed an autoconf problem where configure would fail to find a reasonable prefix (thanks to Fabio Silva for reporting the problem) NEW db_to_html_table: simple conversion to html tables (NO fancy stuff) NEW dblib now has a function dblib_text2html() that will do simple conversion of iso-8859-1 to HTML 1.13, 4-Feb-04 NEW fsdb added to the freebsd ports tree . Maintainer: "larse@isi.edu" BUG FIX properly handle trailing spaces when data must be numeric (ex. 
dbstats with -FS, see test dbstats_trailing_spaces). Fix from Ning Xu "nxu@aludra.usc.edu". NEW dbcolize error message improved (bug report from Terrence Brannon), and list format documented in the README. NEW cgi_to_db converts CGI.pm-format storage to fsdb list format BUG FIX handle numeric synonyms for column names in dbcol properly ENHANCEMENT "talking about columns" section added to README. Lack of documentation pointed out by Lars Eggert. CHANGE dbformmail now defaults to using Mail ("Berkeley Mail") to send mail, rather than sendmail (sendmail is still an option, but mail doesn't require running as root) NEW on platforms that support it (i.e., with perl 5.8), fsdb works fine with unicode NEW dbfilevalidate: check a db file for some common errors 1.14, 24-Aug-06 ENHANCEMENT README cleanup INCOMPATIBLE CHANGE dbcolsplit renamed dbcolsplittocols NEW dbcolsplittorows split one column into multiple rows NEW dbcolsregression compute linear regression and correlation for two columns ENHANCEMENT cvs_to_db: better error handling, normalize field names, skip blank lines ENHANCEMENT dbjoin now detects (and fails) if non-joined files have duplicate names BUG FIX minor bug fixed in calculation of Student t-distributions (doesn't change any test output, but may have caused small errors) 1.15, 12-Nov-07 NEW fsdb-1.14 added to the MacOS Fink system . (Thanks to Lars Eggert for maintaining this port.) NEW Fsdb::IO::Reader and Fsdb::IO::Writer now provide reasonably clean OO I/O interfaces to Fsdb files. Highly recommended if you use fsdb directly from perl. In the fullness of time I expect to reimplement the entire thing using these APIs to replace the current dblib.pl which is still hobbled by its roots in perl4. NEW dbmapreduce now implements a Google-style map/reduce abstraction, generalizing dbmultistats. ENHANCEMENT fsdb now uses the Perl build system (Makefile.PL, etc.), instead of autoconf. 
This change paves the way to better perl-5-style modularization, proper manual pages, input of both listize and colize format for every program, and world peace. ENHANCEMENT dblib.pl is now moved to Fsdb::Old.pm. BUG FIX dbmultistats now propagates its format argument (-f). Bug and fix from Martin Lukac (thanks!). ENHANCEMENT dbformmail documentation now is clearer that it doesn't send the mail; you have to run the shell script it writes. (Problem observed by Unkyu Park.) ENHANCEMENT adapted to autoconf-2.61 (and then these changes were discarded in favor of The Perl Way). BUG FIX dbmultistats memory usage corrected (O(# tags), not O(1)) ENHANCEMENT dbmultistats can now optionally run with pre-grouped input in O(1) memory ENHANCEMENT dbroweval -N was finally implemented (eat comments) 2.0, 25-Jan-08 --- a quiet 2.0 release (gearing up towards complete) ENHANCEMENT: shifting old programs to Perl modules, with the front-end program as just a wrapper. In the short-term, this change just means programs have real man pages. In the long-run, it will mean that one can run a pipeline in a single Perl program. So far: dbcol, dbroweval, the new dbrowcount, dbsort, the new dbmerge, the old "dbstats" (renamed dbcolstats), dbcolrename, dbcolcreate. NEW: Fsdb::Filter::dbpipeline is an internal-only module that lets one use fsdb commands from within perl (via threads). It also provides perl function aliases for the internal modules, so a string of fsdb commands in perl is nearly as terse as in the shell: use Fsdb::Filter::dbpipeline qw(:all); dbpipeline( dbcol(qw(name test1)), dbroweval('_test1 += 5;') ); INCOMPATIBLE CHANGE: The old dbcolstats has been renamed dbcolstatscores. The new dbcolstats does the same thing as the old dbstats. This incompatibility is unfortunate but normalizes program names.
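The dbpipeline note in the 2.0 entry describes chaining filters inside one program instead of through shell pipes. As a rough illustration of that idea only, here is a minimal sketch in plain Python (not Fsdb code, and not how Fsdb implements it). The names read_rows, select_columns, add_to_column, and pipeline are hypothetical, and the reader assumes a simplified "#fsdb col1 col2 ..." header with space-separated fields and no header options.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]
Stage = Callable[[Iterable[Row]], Iterator[Row]]


def read_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Yield rows from a simplified '#fsdb col1 col2 ...' table (no header options)."""
    lines = iter(lines)
    header = next(lines, "").split()
    if not header or header[0] != "#fsdb":
        raise ValueError("expected a '#fsdb' header line")
    columns = header[1:]
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank lines
        yield dict(zip(columns, line.split()))


def select_columns(names: List[str]) -> Stage:
    """Hypothetical stand-in for a column-selection filter: keep only the named columns."""
    def stage(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            yield {name: row[name] for name in names}
    return stage


def add_to_column(name: str, amount: float) -> Stage:
    """Hypothetical stand-in for a per-row expression such as '_test1 += 5;'."""
    def stage(rows: Iterable[Row]) -> Iterator[Row]:
        for row in rows:
            out = dict(row)
            out[name] = str(float(row[name]) + amount)
            yield out
    return stage


def pipeline(rows: Iterable[Row], *stages: Stage) -> Iterable[Row]:
    """Feed the output of each stage into the next, like a shell pipe."""
    for stage in stages:
        rows = stage(rows)
    return rows


data = ["#fsdb name id test1", "alice 1 80", "bob 2 91.5"]
result = list(pipeline(read_rows(data),
                       select_columns(["name", "test1"]),
                       add_to_column("test1", 5)))
```

Because columns are addressed by name, the stages keep working if the input gains or reorders columns, which is the same property the library's named columns provide.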
CHANGE: The new dbcolstats program always outputs "-" (the default empty value) for statistics it cannot compute (for example, standard deviation if there is only one row), instead of the old mix of "-" and "na". INCOMPATIBLE CHANGE: The old dbcolstats program, now called dbcolstatscores, also has different arguments. The "-t mean,stddev" option is now "--tmean mean --tstddev stddev". See dbcolstatscores for details. INCOMPATIBLE CHANGE: dbcolcreate now assumes all new columns get the default value rather than requiring each column to have an initial constant value. To change the initial value, use the new "-e" option. NEW: dbrowcount counts rows, an almost-subset of dbcolstats's "n" output (except without differentiating numeric/non-numeric input), or the equivalent of "dbstripcomments | wc -l". NEW: dbmerge merges two sorted files. This functionality was previously embedded in dbsort. INCOMPATIBLE CHANGE: dbjoin's "-i" option to include non-matches is now renamed "-a", so as to not conflict with the new standard option "-i" for input file. 2.1, 6-Apr-08 --- another alpha 2.0, but now all converted programs understand both listize and colize format ENHANCEMENT: shifting more old programs to Perl modules. New in 2.1: dbcolneaten, dbcoldefine, dbcolhisto, dblistize, dbcolize, dbrecolize ENHANCEMENT dbmerge now handles an arbitrary number of input files, not just exactly two. NEW dbmerge2 is an internal routine that handles merging exactly two files. INCOMPATIBLE CHANGE dbjoin now specifies inputs like dbmerge2, rather than assuming the first two arguments were tables (as in fsdb-1). The old dbjoin argument "-i" is now "-a" or "--type=outer". A minor change: comments in the source files for dbjoin are now intermixed with output rather than being delayed until the end. ENHANCEMENT dbsort now no longer produces warnings when null values are passed to numeric comparisons. BUG FIX dbroweval now once again works with code that lacks a trailing semicolon.
(This fix repairs a regression from 1.15.) INCOMPATIBLE CHANGE dbcolneaten's old "-e" option (to avoid end-of-line spaces) is now "-E" to avoid conflicts with the standard empty field argument. INCOMPATIBLE CHANGE dbcolhisto's old "-e" option is now "-E" to avoid conflicts. And its "-n", "-s", and "-w" are now "-N", "-S", and "-W" to correspond. NEW dbfilealter replaces dbrecolize, dblistize, and dbcolize, but with different options. ENHANCEMENT The library routines "Fsdb::IO" now understand both list-format and column-format data, so all converted programs can now automatically read either format. This capability was one of the milestone goals for 2.0, so yea! 2.2, 23-May-08 Release 2.2 is another 2.x alpha release. Now most of the commands are ported, but a few remain, and I plan one last incompatible change (to the file header) before 2.x final. ENHANCEMENT shifting more old programs to Perl modules. New in 2.2: dbrowaccumulate, dbformmail, dbcolmovingstats, dbrowuniq, dbrowdiff, dbcolmerge, dbcolsplittocols, dbcolsplittorows, dbmapreduce, dbmultistats, dbrvstatdiff. Also dbrowenumerate exists only as a front-end (command-line) program. INCOMPATIBLE CHANGE The following programs have been dropped from fsdb-2.x: dbcoltighten, dbfilesplit, dbstripextraheaders, dbstripleadingspace. NEW combined_log_format_to_db to convert Apache logfiles INCOMPATIBLE CHANGE Options to dbrowdiff are now -B and -I, not -a and -i. INCOMPATIBLE CHANGE dbstripcomments is now dbfilestripcomments. BUG FIXES dbcolneaten better handles empty columns; dbcolhisto warning suppressed (actually a bug in high-bucket handling). INCOMPATIBLE CHANGE dbmultistats now requires a "-k" option in front of the key (tag) field, or if none is given, it will group by the first field (both like dbmapreduce). KNOWN BUG dbmultistats with quantile option doesn't work currently. INCOMPATIBLE CHANGE dbcoldiff is renamed dbrvstatdiff. BUG FIXES dbformmail was leaving its log message as a command, not a comment.
Oops. No longer. 2.3, 27-May-08 (alpha) Another alpha release, this one just to fix the critical dbjoin bug listed below (that happens to have blocked my MP3 jukebox :-). BUG FIX Dbsort no longer hangs if given an input file with no rows. BUG FIX Dbjoin now works with unsorted input coming from a pipeline (like stdin). Perl-5.8.8 has a bug (?) that was making this case fail---opening stdin in one thread, reading some, then reading more in a different thread caused an lseek which works on files, but fails on pipes like stdin. Go figure. BUG FIX / KNOWN BUG The dbjoin fix also fixed dbmultistats -q (it now gives the right answer). Although a new bug appeared, messages like: Attempt to free unreferenced scalar: SV 0xa9dd0c4, Perl interpreter: 0xa8350b8 during global destruction. So the dbmultistats_quartile test is still disabled. 2.4, 18-Jun-08 Another alpha release, mostly to fix minor usability problems in dbmapreduce and client functions. ENHANCEMENT dbrow now defaults to running user supplied code without warnings (as with fsdb-1.x). Use "--warnings" or "-w" to turn them back on. ENHANCEMENT dbroweval can now write different format output than the input, using the "-m" option. KNOWN BUG dbmapreduce emits warnings on perl 5.10.0 about "Unbalanced string table refcount" and "Scalars leaked" when run with an external program as a reducer. dbmultistats emits the warning "Attempt to free unreferenced scalar" when run with quartiles. In each case the output is correct. I believe these can be ignored. CHANGE dbmapreduce no longer logs a line for each reducer that is invoked. 2.5, 24-Jun-08 Another alpha release, fixing more minor bugs in "dbmapreduce" and lossage in "Fsdb::IO". ENHANCEMENT dbmapreduce can now tolerate non-map-aware reducers that pass back the key column in put. It also passes the current key as the last argument to external reducers. BUG FIX Fsdb::IO::Reader, correctly handle "-header" option again. (Broken since fsdb-2.3.) 
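The dbmapreduce entries above, together with the 1.15 note that it generalizes dbmultistats, describe grouping rows by a key column and handing each group to a reducer. A minimal sketch of that pattern in plain Python follows. It is an illustration under stated assumptions, not Fsdb's implementation: map_reduce and mean_duration are hypothetical names, and passing the key as the reducer's last argument merely mirrors the 2.5 note about external reducers.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def map_reduce(rows: Iterable[Row], key_column: str,
               reducer: Callable[[List[Row], str], float]) -> Dict[str, float]:
    """Group rows by key_column, then call reducer(group, key) once per key."""
    groups: Dict[str, List[Row]] = {}
    for row in rows:
        groups.setdefault(row[key_column], []).append(row)
    # The key is passed as the last argument (an assumption made for this sketch).
    return {key: reducer(group, key) for key, group in groups.items()}


def mean_duration(group: List[Row], key: str) -> float:
    """A dbmultistats-like reducer: the mean of the 'duration' column for one key."""
    values = [float(row["duration"]) for row in group]
    return sum(values) / len(values)


# Sample rows taken from the example table at the top of this manual page.
rows = [
    {"experiment": "ufs_mab_sys", "duration": "37.2"},
    {"experiment": "ufs_mab_sys", "duration": "37.3"},
    {"experiment": "ufs_rcp_real", "duration": "264.5"},
    {"experiment": "ufs_rcp_real", "duration": "277.9"},
]
means = map_reduce(rows, "experiment", mean_duration)
```

This version holds every group in memory at once; the 1.15 notes on dbmultistats point out that pre-grouped input allows constant memory instead, since each group can be reduced as soon as its key changes.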
2.6, 11-Jul-08 Another alpha release, needed to fix DaGronk. One new port, small bug fixes, and an important fix to dbmapreduce. ENHANCEMENT shifting more old programs to Perl modules. New in 2.6: dbcolpercentile. INCOMPATIBLE CHANGE and ENHANCEMENTS dbcolpercentile arguments changed, use "--rank" to require ranking instead of "-r". Also, "--ascending" and "--descending" can now be specified separately, both for "--percentile" and "--rank". BUG FIX Sigh, the sense of the --warnings option in dbrow was inverted. No longer. BUG FIX I found and fixed the string leaks (errors like "Unbalanced string table refcount" and "Scalars leaked") in dbmapreduce and dbmultistats. (All "IO::Handle"s in threads must be manually destroyed.) BUG FIX The "-C" option to specify the column separator in dbcolsplittorows now works again (broken since it was ported). 2.7, 30-Jul-08 beta The beta release of fsdb-2.x. Finally, all programs are ported. As statistics, the number of lines of non-library code doubled from 7.5k to 15.5k. The libraries are much more complete, going from 866 to 5164 lines. The overall number of programs is about the same, although 19 were dropped and 11 were added. The number of test cases has grown from 116 to 175. All programs are now in perl-5, no more shell scripts or perl-4. All programs now have manual pages. Although this is a major step forward, I still expect to rename "jdb" to "fsdb". ENHANCEMENT shifting more old programs to Perl modules. New in 2.7: dbcolscorrelate, dbcolsregression, cgi_to_db, dbfilevalidate, db_to_csv, csv_to_db, db_to_html_table, kitrace_to_db, tcpdump_to_db, tabdelim_to_db, ns_to_db. INCOMPATIBLE CHANGE The following programs have been dropped from fsdb-2.x: db2dcliff, dbcolmultiscale, crl_to_db, ipchain_logs_to_db. They may come back, but seemed overly specialized. The program dbrowsplituniq was dropped because it is superseded by dbmapreduce. dmalloc_to_db was dropped pending test cases and examples.
ENHANCEMENT dbfilevalidate now has a "-c" option to correct errors. NEW html_table_to_db provides the inverse of db_to_html_table. 2.8, 5-Aug-08 Change header format, preserving forwards compatibility. BUG FIX Complete editing pass over the manual, making sure it aligns with fsdb-2.x. SEMI-COMPATIBLE CHANGE The header of fsdb files has changed; it is now #fsdb, not #h (or #L) and parsing of -F and -R are also different. See dbfilealter for the new specification. The v1 file format will be read, compatibly, but not written. BUG FIX dbmapreduce now tolerates comments that precede the first key, instead of failing with an error message. 2.9, 6-Aug-08 Still in beta; just a quick bug-fix for dbmapreduce. ENHANCEMENT dbmapreduce now generates plausible output when given no rows of input. 2.10, 23-Sep-08 Still in beta, but picking up some bug fixes. ENHANCEMENT dbmapreduce now generates plausible output when given no rows of input. ENHANCEMENT dbroweval: the warnings option was backwards; now corrected. As a result, warnings in user code now default off (like in fsdb-1.x). BUG FIX dbcolpercentile now defaults to assuming the target column is numeric. The new option "-N" allows selection of a non-numeric target. BUG FIX dbcolscorrelate now includes "--sample" and "--nosample" options to compute the sample or full population correlation coefficients. Thanks to Xue Cai for finding this bug. 2.11, 14-Oct-08 Still in beta, but picking up some bug fixes. ENHANCEMENT html_table_to_db is now more aggressive about filling in empty cells with the official empty value, rather than leaving them blank or as whitespace. ENHANCEMENT dbpipeline now catches failures during pipeline element setup and exits reasonably gracefully. BUG FIX dbsubprocess now reaps child processes, thus avoiding running out of processes when used a lot. 2.12, 16-Oct-08 Finally, a full (non-beta) 2.x release! INCOMPATIBLE CHANGE Jdb has been renamed Fsdb, the flatfile-streaming database.
This change affects all internal Perl APIs, but no shell command-level APIs. While Jdb served well for more than ten years, it is easily confused with the Java debugger (even though Jdb was there first!). It also is too generic to work well in web search engines. Finally, Jdb stands for ``John's database'', and we're a bit beyond that. (However, some call me the ``file-system guy'', so one could argue it retains that meaning.) If you just used the shell commands, this change should not affect you. If you used the Perl-level libraries directly in your code, you should be able to rename "Jdb" to "Fsdb" to move to 2.12. The jdb-announce list has not yet been renamed, but it will be shortly. With this release I've accomplished everything I wanted to in fsdb-2.x. I therefore expect to return to boring, bugfix releases. 2.13, 30-Oct-08 BUG FIX dbrowaccumulate now treats non-numeric data as zero by default. BUG FIX Fixed a perl-5.10ism in dbmapreduce that breaks that program under 5.8. Thanks to Martin Lukac for reporting the bug. 2.14, 26-Nov-08 BUG FIX Improved documentation for dbmapreduce's "-f" option. ENHANCEMENT dbcolmovingstats now computes a moving standard deviation in addition to a moving mean. 2.15, 13-Apr-09 BUG FIX Fix a make install bug reported by Shalindra Fernando. 2.16, 14-Apr-09 BUG FIX Another minor release bug: on some systems programize_module loses executable permissions. Again reported by Shalindra Fernando. 2.17, 25-Jun-09 TYPO FIXES Typo in the dbroweval manual fixed. IMPROVEMENT There is no longer a comment line to label columns in dbcolneaten; instead the header line is tweaked to line up. This change restores the Jdb-1.x behavior, and means that repeated runs of dbcolneaten no longer add comment lines each time. BUG FIX It turns out dbcolneaten was not correctly handling trailing spaces when given the "-E" option to suppress them. This regression is now fixed.
EXTENSION dbroweval(1) can now handle direct references to the last row via $lfref, a dubious but now documented feature. BUG FIXES Separators set with "-C" in dbcolmerge and dbcolsplittocols were not properly setting the heading, and null fields were not recognized. The first bug was reported by Martin Lukac. 2.18, 1-Jul-09 A minor release IMPROVEMENT Documentation for Fsdb::IO::Reader has been improved. IMPROVEMENT The package should now be PGP-signed. 2.19, 10-Jul-09 BUG FIX Internal improvements to debugging output and robustness of dbmapreduce and dbpipeline. TEST/dbpipeline_first_fails.cmd re-enabled. 2.20, 30-Nov-09 (A collection of minor bugfixes, plus a build against Fedora 12.) BUG FIX Logging for dbmapreduce with code refs is now stable (it no longer includes a hex pointer to the code reference). BUG FIX Better handling of mixed blank lines in Fsdb::IO::Reader (see test case dbcolize_blank_lines.cmd). BUG FIX html_table_to_db now handles multi-line input better, and handles tables with COLSPAN. BUG FIX dbpipeline now cleans up threads in an "eval" to prevent "cannot detach a joined thread" errors that popped up in perl-5.10. Hopefully this prevents a race condition that causes the test suites to hang about 20% of the time (in dbpipeline_first_fails). IMPROVEMENT dbmapreduce now detects and correctly fails when the input and reducer have incompatible field separators. IMPROVEMENT dbcolstats, dbcolhisto, dbcolscorrelate, dbcolsregression, and dbrowcount now all take an "-F" option to let one specify the output field separator (so they work better with dbmapreduce). BUG FIX An omitted "-k" from the manual page of dbmultistats is now there. Bug reported by Unkyu Park. 2.21, 17-Apr-10 bug fix release BUG FIX Fsdb::IO::Writer now no longer fails with -outputheader => never (an obscure bug).
IMPROVEMENT Fsdb (in the warnings section) and dbcolstats now more carefully document how they handle (and do not handle) numerical precision problems, and other general limits. Thanks to Yuri Pradkin for prompting this documentation. IMPROVEMENT "Fsdb::Support::fullname_to_sortkey" is now restored from "Jdb". IMPROVEMENT Documentation for multiple styles of input approaches (including performance description) added to Fsdb::IO. 2.22, 2010-10-31 One new tool dbcolcopylast and several bug fixes for Perl 5.10. BUG FIX dbmerge now correctly handles n-way merges. Bug reported by Yuri Pradkin. INCOMPATIBLE CHANGE dbcolneaten now defaults to not padding the last column. ADDITION dbrowenumerate now takes -N NewColumn to give the new column a name other than "count". Feature requested by Mike Rouch in January 2005. ADDITION New program dbcolcopylast copies the last value of a column into a new column copylast_column of the next row. New program requested by Fabio Silva; useful for converting dbmultistats output into dbrvstatdiff input. BUG FIX Several tools (particularly dbmapreduce and dbmultistats) would report errors like "Unbalanced string table refcount: (1) for "STDOUT" during global destruction" on exit, at least on certain versions of Perl (for me on 5.10.1), but similar errors have been off-and-on for several Perl releases. Although I think my code looked OK, I worked around this problem with a different way of handling standard IO redirection. 2.23, 2011-03-10 Several small portability bugfixes; improved dbcolstats for large datasets IMPROVEMENT Documentation to dbrvstatdiff was changed to use "sd" to refer to standard deviation, not "ss" (which might be confused with sum-of- squares). BUG FIX This documentation about dbmultistats was missing the -k option in some cases.
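The 2.22 entry describes dbcolcopylast as copying the last value of a column into a new column copylast_column of the next row. The sketch below shows that transformation in plain Python as an illustration only. The function name copy_last is hypothetical, and filling the first row with "-" (the default empty value mentioned in the 2.0 notes) is an assumption of this sketch rather than documented behavior.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def copy_last(rows: Iterable[Row], column: str, empty: str = "-") -> List[Row]:
    """Add 'copylast_<column>' holding the previous row's value of <column>."""
    previous = empty  # assumption: the first row has no predecessor, so use the empty value
    out: List[Row] = []
    for row in rows:
        new_row = dict(row)
        new_row["copylast_" + column] = previous
        previous = row[column]
        out.append(new_row)
    return out


rows = [{"mean": "10"}, {"mean": "12"}, {"mean": "15"}]
paired = copy_last(rows, "mean")
```

Each output row then carries both its own value and its predecessor's, which is the side-by-side layout a row-wise comparison needs.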
BUG FIX dbmapreduce was failing on MacOS-10.6.3 for some tests with the error dbmapreduce: cannot run external dbmapreduce reduce program (perl TEST/dbmapreduce_external_with_key.pl) The problem seemed to be only in the error, not in operation. On MacOS, the error is now suppressed. Thanks to Alefiya Hussain for providing access to a Mac system that allowed debugging of this problem. IMPROVEMENT The csv_to_db command requires an external Perl library (Text::CSV_XS). On computers that lack this optional library, previously Fsdb would configure with a warning and then test cases would fail. Now those test cases are skipped with an additional warning. BUG FIX The test suite now supports alternative valid output, as a hack to account for last-digit floating point differences. (Not very satisfying :-( ) BUG FIX dbcolstats output for confidence intervals on very large datasets has changed. Previously it failed for more than 2^31-1 records, and handling of T-Distributions with thousands of rows was a bit dubious. Now datasets with more than 10000 records are considered infinitely large and hopefully correctly handled. 2.24, 2011-04-15 Improvements to fix an old bug in dbmapreduce with different field separators IMPROVEMENT The dbfilealter command had a "--correct" option to work around incompatible field separators, but it did nothing. Now it does the correct but sad, data-losing thing. IMPROVEMENT The dbmultistats command previously failed with an error message when invoked on input with a non-default field separator. The root cause was the underlying dbmapreduce that did not handle the case of reducers that generated output with a different field separator than the input. We now detect and repair incompatible field separators. This change corrects a problem originally documented and detected in Fsdb-2.20. Bug re-reported by Unkyu Park. 2.25, 2011-08-07 Two new tools, xml_to_db and dbfilepivot, and a bugfix for two people.
IMPROVEMENT kitrace_to_db now supports a --utc option, which also fixes this test case for users outside of the Pacific time zone. Bug reported by David Graff, and also by Peter Desnoyers (within a week of each other :-) NEW xml_to_db can convert simple, very regular XML files into Fsdb. NEW dbfilepivot "pivots" a file, converting multiple rows corresponding to the same entity into a single row with multiple columns. 2.26, 2011-12-12 Bug fixes, particularly for perl-5.14.2. BUG FIX Bugs fixed in Fsdb::IO::Reader(3) manual page. BUG FIX Fixed problems where dbcolstats was truncating floating point numbers when sorting. This strange behavior happens as of perl-5.14.2 and it seems like a Perl bug. I've worked around it for the test suites, but I'm a bit nervous. 2.27, 2012-11-15 Accumulated bug fixes. IMPROVEMENT csv_to_db now reports errors in CSV input with real diagnostics. IMPROVEMENT dbcolmovingstats can now compute median, when given the "-m" option. BUG FIX dbcolmovingstats non-numeric handling (the "-a" option) now works properly. DOCUMENTATION The internal t/test_command.t test framework is now documented. BUG FIX dbrowuniq now correctly handles the case where there is no input (previously it output a blank line, which is a malformed fsdb file). Thanks to Yuri Pradkin for reporting this bug. 2.28, 2012-11-15 A quick release to fix most rpmlint errors. BUG FIX Fixed a number of minor release problems (wrong permissions, old FSF address, etc.) found by rpmlint. 2.29, 2012-11-20 a quick release for CPAN testing IMPROVEMENT Tweaked the RPM spec. IMPROVEMENT Modified Makefile.PL to fail gracefully on Perl installations that lack threads. (Without this fix, I get massive failures in the non-ithreads test system.) 2.30, 2012-11-25 improvements to perl portability BUG FIX Removed unicode character in documentation of dbcolscorrelate so pod tests will pass.
(Sigh, that should work :-( ) BUG FIX Fixed test suite failures on 5 tests (dbcolcreate_double_creation was the first) due to Carp's addition of a period. This problem was breaking Fsdb on perl-5.17. Thanks to Michael McQuaid for helping diagnose this problem. IMPROVEMENT The test suite now prints out the names of tests it tries. 2.31, 2012-11-28 A release with actual improvements to dbfilepivot and dbrowuniq. BUG FIX Documentation fixes: typos in dbcolscorrelate, bugs in dbfilepivot, clarification for comment handling in Fsdb::IO::Reader. IMPROVEMENT Previously dbfilepivot assumed the input was grouped by keys and didn't verify that pre-condition. Now there is no pre-condition (it will sort the input by default), and it checks if the invariant is violated. BUG FIX Previously dbfilepivot failed if the input had comments (oops :-); no longer. IMPROVEMENT Now dbrowuniq has the "-L" option to preserve the last unique row (instead of the first), a common idiom. 2.32, 2012-12-21 Test suites should now be more numerically robust. NEW New dbfilediff does fsdb-aware file differencing. It does not do smart intuition of add/removes like Unix diff(1), but it does know about columns, and with "-E", it does numeric-aware differences. IMPROVEMENT Test suites that are numeric now use dbfilediff to do numeric-aware comparisons, so the test suite should now be robust to slightly different computers and operating systems and compilers than exactly what I use. 2.33, 2012-12-23 Minor fixes to some test cases. IMPROVEMENT dbfilediff and dbrowuniq now support the "-N" option to give the new column a different name. (And test cases where this duplication mattered have been fixed.) IMPROVEMENT dbrvstatdiff now shows the t-test breakpoint with a reasonable number of floating point digits. BUG FIX Fixed a numerical stability problem in the dbroweval_last test case. WHAT'S NEW 2.34, 2013-02-10 Parallelism in dbmerge. IMPROVEMENT Documentation for dbjoin now includes resource requirements.
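The 2.31 entry mentions that dbrowuniq can preserve the last row of a run of matching rows instead of the first. A minimal sketch of that first-versus-last choice in plain Python follows, as an illustration only. The name uniq_runs and its keep parameter are hypothetical, and comparing rows by a caller-supplied key is an assumption of this sketch.

```python
from itertools import groupby
from typing import Callable, Hashable, Iterable, List, Tuple

Row = Tuple[str, int]


def uniq_runs(rows: Iterable[Row], key: Callable[[Row], Hashable],
              keep: str = "first") -> List[Row]:
    """Collapse each run of adjacent rows sharing a key to its first or last row."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    out: List[Row] = []
    for _, run in groupby(rows, key=key):
        run_rows = list(run)
        out.append(run_rows[0] if keep == "first" else run_rows[-1])
    return out


rows = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]
firsts = uniq_runs(rows, key=lambda r: r[0], keep="first")
lasts = uniq_runs(rows, key=lambda r: r[0], keep="last")
```

Only adjacent rows are merged, so the final ("a", 4) stays separate from the earlier "a" run; unsorted input would need sorting first if global uniqueness were wanted.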
IMPROVEMENT Default memory usage for dbsort is now about 256MB. (The world keeps moving forward.) IMPROVEMENT dbmerge now does merging in parallel. As a side-effect, dbsort should be faster when input overflows memory. The level of parallelism can be limited with the "--parallelism" option. (There is more work to do here, but we're off to a start.) 2.35, 2013-02-23 Improvements to dbmerge parallelism BUG FIX Fsdb temporary files are now created more securely (with File::Temp). IMPROVEMENT Programs that sort or merge on fields (dbmerge2, dbmerge, dbsort, dbjoin) now report an error if no fields on which to join or merge are given. IMPROVEMENT Parallelism in dbmerge should now be more consistent, with less starting and stopping. IMPROVEMENT In dbmerge, the "--xargs" option lets one give input filenames on standard input, rather than the command line. This feature paves the way for faster dbsort for large inputs (by pipelining sorting and merging), expected in the next release. 2.36, 2013-02-25 dbsort pipelines with dbmerge IMPROVEMENT For large inputs, dbsort now pipelines sorting and merging, allowing earlier processing. BUG FIX Since 2.35, dbmerge delayed cleanup of intermediate files, thereby requiring extra disk space. 2.37, 2013-02-26 quick bugfix to support parallel sort and merge from recent releases BUG FIX Since 2.35, dbmerge delayed removal of input files given by "--xargs". This problem is now fixed. 2.38, 2013-04-29 minor bug fixes CLARIFICATION Configure now rejects Windows since tests seem to hang on some versions of Windows. (I would love help from a Windows developer to get this problem fixed, but I cannot do it.) See https://rt.cpan.org/Ticket/Display.html?id=84201. IMPROVEMENT All programs that use temporary files (dbcolpercentile, dbcolscorrelate, dbcolstats, dbcolstatscores) now take the "-T" option and set the temporary directory consistently. In addition, error messages are better when the temporary directory has problems.
Problem reported by Liang Zhu. BUG FIX dbmapreduce was failing with external, map-reduce aware reducers (when invoked with -M and an external program). (Sigh, did this case ever work?) This case should now work. Thanks to Yuri Pradkin for reporting this bug (in 2011). BUG FIX Fixed perl-5.10 problem with dbmerge. Thanks to Yuri Pradkin for reporting this bug (in 2013). 2.39, 2013-05-31 quick release for the dbrowuniq extension BUG FIX Actually in 2.38, the Fedora .spec got cleaner dependencies. Suggestion from Christopher Meng via . ENHANCEMENT Fsdb files are now explicitly set into UTF-8 encoding, unless one specifies "-encoding" to "Fsdb::IO". ENHANCEMENT dbrowuniq now supports "-I" for incremental counting. 2.40, 2013-07-13 small bug fixes BUG FIX dbsort now has more respect for a user-given temporary directory; it no longer is ignored for merging. IMPROVEMENT dbrowuniq now has options to output the first, last, and both first and last rows of a run ("-F", "-L", and "-B"). BUG FIX dbrowuniq now correctly handles "-N". Sigh, it didn't work before. 2.41, 2013-07-29 small bug and packaging fixes ENHANCEMENT Documentation to dbrvstatdiff improved (inspired by questions from Qian Kun). BUG FIX dbrowuniq no longer duplicates singleton unique lines when outputting both (with "-B"). BUG FIX Add missing "XML::Simple" dependency to Makefile.PL. ENHANCEMENT Tests now show the diff of the failing output if run with "make test TEST_VERBOSE=1". ENHANCEMENT dbroweval now includes documentation for how to output extra rows. Suggestion from Yuri Pradkin. BUG FIX Several improvements to the Fedora package from Michael Schwendt via , and from the harsh master that is rpmlint. (I am stymied at teaching it that "outliers" is spelled correctly. Maybe I should send it Schneier's book. And an unresolvable invalid-spec-name lurks in the SRPM.) 2.42, 2013-07-31 A bug fix and packaging release. ENHANCEMENT Documentation to dbjoin improved to better describe memory usage.
(Based on problem report by Lin Quan.) BUG FIX The .spec is now perl-Fsdb.spec to satisfy rpmlint. Thanks to Christopher Meng for a specific bug report. BUG FIX Test dbroweval_last.cmd no longer has a column that caused failures because of numerical instability. BUG FIX Some tests now better handle bugs in old versions of perl (5.10, 5.12). Thanks to Calvin Ardi for help debugging this on a Mac with perl-5.12, but the fix should affect other platforms. 2.43, 2013-08-27 Adds in-file compression. BUG FIX Changed the sort on TEST/dbsort_merge.cmd to strings (from numerics) so we're less susceptible to false test-failures due to floating point IO differences. EXPERIMENTAL ENHANCEMENT Yet more parallelism in dbmerge: new "endgame-mode" builds a merge tree of processes at the end of large merge tasks to get maximal parallelism. Currently this feature is off by default because it can hang for some inputs. Enable this experimental feature with "--endgame". ENHANCEMENT "Fsdb::IO" now handles being given "IO::Pipe" objects (as exercised by dbmerge). BUG FIX Handling of NamedTmpfiles now supports concurrency. This fix will hopefully fix occasional "Use of uninitialized value $_ in string ne at ...NamedTmpfile.pm line 93." errors. BUG FIX Fsdb now requires perl 5.10. This is a bug fix because some test cases used to require it, but this fact was not properly documented. (Back-porting to 5.008 would require removing all "//" operators.) ENHANCEMENT Fsdb now handles automatic compression of file contents. Enable compression with "dbfilealter -Z xz" (or "gz" or "bz2"). All programs should operate on compressed files and leave the output with the same level of compression. "xz" is recommended as fastest and most efficient. "gz" produces unrepeatable output (and so has no output test); it seems to insist on adding a timestamp. 2.44, 2013-10-02 A major change--all threads are gone. ENHANCEMENT Fsdb is now thread free and only uses processes for parallelism.
This change is a big change--the entire motivation for Fsdb-2 was to exploit parallelism via threading. Parallelism--good, but perl threading--bad for performance. Horribly bad for performance. About 20x worse than pipes on my box. (See perl bug #119445 for the discussion.) NEW "Fsdb::Support::Freds" provides a thread-like abstraction over forking, with some nice support for callbacks in the parent upon child termination. ENHANCEMENT Details about removing threads: "dbpipeline" is thread free, and new tests verify each of its parts. The easy cases are "dbcolpercentile", "dbcolstats", "dbfilepivot", "dbjoin", and "dbcolstatscores", each of which use it in simple ways (2013-09-09). "dbmerge" is now thread free (2013-09-13), but was a significant rewrite, which brought "dbsort" along. "dbmapreduce" is partly thread free (2013-09-21), again as a rewrite, and it brings "dbmultistats" along. Full "dbmapreduce" support took much longer (2013-10-02). BUG FIX When running with user-only output ("-n"), dbroweval now resets the output vector $ofref after it has been output. NEW dbcolcreate will create all columns at the head of each row with the "--first" option. NEW dbfilecat will concatenate two files, verifying that they have the same schema. ENHANCEMENT dbmapreduce now passes comments through, rather than eating them as before. Also, dbmapreduce now supports a "--" option to prevent misinterpreting sub-program parameters as for dbmapreduce. INCOMPATIBLE CHANGE dbmapreduce no longer figures out if it needs to add the key to the output. For multi-key-aware reducers, it never does (and cannot). For non-multi-key-aware reducers, it defaults to add the key and will now fail if the reducer adds the key (with error "dbcolcreate: attempt to create pre-existing column..."). In such cases, one must disable adding the key with the new option "--no-prepend-key". INCOMPATIBLE CHANGE dbmapreduce no longer copies the input field separator by default.
For multi-key-aware reducers, it never does (and cannot). For non-multi-key-aware reducers, it defaults to not copying the field separator, but it will copy it (the old default) with the "--copy-fs" option. 2.45, 2013-10-07 cleanup from de-thread-ification BUG FIX Corrected a fast busy-wait in dbmerge. ENHANCEMENT Endgame mode enabled in dbmerge; it (and also large cases of dbsort) should now exploit greater parallelism. BUG FIX Test case with "Fsdb::BoundedQueue" (gone since 2.44) now removed. 2.46, 2013-10-08 continuing cleanup of our no-threads version BUG FIX Fixed some packaging details. (Really, threads are no longer required, missing tests in the MANIFEST.) IMPROVEMENT dbsort now better communicates with the merge process to avoid bursty parallelism. Fsdb::IO::Writer now can take "-autoflush => 1" for line-buffered IO. 2.47, 2013-10-12 test suite cleanup for non-threaded perls BUG FIX Removed some stray "use threads" in some test cases. We didn't need them, and these were breaking non-threaded perls. BUG FIX Better handling of Fred cleanup; should fix intermittent dbmapreduce failures on BSD. ENHANCEMENT Improved test framework to show output when tests fail. (This time, for real.) 2.48, 2014-01-03 small bugfixes and improved release engineering ENHANCEMENT Test suites now skip tests for libraries that are missing. (Patch for missing "IO::Compress::Xz" contributed by Calvin Ardi.) ENHANCEMENT Removed references to Jdb in the package specification. Since the name was changed in 2008, there's no longer a huge need for backwards compatibility. (Suggestion from Petr Xabata.) ENHANCEMENT Test suites now invoke the perl using the path from $Config{perlpath}. Hopefully this helps testing in environments where there are multiple installed perls and the default perl is not the same as the perl-under-test (as happens in cpantesters.org). BUG FIX Added specific encoding to this manpage to account for Unicode. Required to build correctly against perl-5.18.
2.49, 2014-01-04 bugfix to unicode handling in Fsdb IO (plus minor packaging fixes) BUG FIX Restored a line in the .spec to chmod g-s. BUG FIX Unicode decoding is now handled correctly for programs that read from standard input. (Also: New test scripts cover unicode input and output.) BUG FIX Fix to Fsdb documentation encoding line. Addresses test failure in perl-5.16 and earlier. (Who knew "encoding" had to be followed by a blank line.) AUTHOR John Heidemann, "johnh@isi.edu" See "Contributors" for the many people who have contributed bug reports and fixes. COPYRIGHT Fsdb is Copyright (C) 1991-2013 by John Heidemann . This program is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. A copy of the GNU General Public License can be found in the file ``COPYING''. COMMENTS and BUG REPORTS Any comments about these programs should be sent to John Heidemann "johnh@isi.edu". perl v5.18.2 2014-05-27 Fsdb(3)