Quick and Dirty Data Extraction in AWK
By Phil Hughes

CC: Quick and Dirty Data Extraction in AWK

Many years ago, probably close to 20, there was a regular point made on the comp Usenet newsgroups about using the minimum tool to get the job done. That is, someone would ask for a quick and dirty way to do something and the followups could include a C solution followed by an AWK solution, followed by a sed solution and so on.

Today, I still try to use this philosophy when addressing a problem. In this particular case, I picked AWK, but if any of you old-timers are reading this, I expect you will come up with a sed-based solution.

The Problem: Extracting Data from E-mail Messages

I signed up for a daily summary of currency exchange rates. It's free and you can subscribe too at http://www.xe.com/cus/ (the address given in each message). Most days I take a quick look at how the $ is doing against the Euro and then save the e-mail. Some days I just save it. I have always thought that, someday, I would write a program to show me the trend, but it has always been low priority.

Yesterday, as I was looking at a few of the saved mail messages, I realized that while writing a fancy graphing program was low priority, writing a quick and dirty hack would take less time than the random sampling I was doing. What I wanted was dates and numbers along with a minimalist graphical display of the trend.

First step was to look at the data. Here is an extract of part of a message.

>From list@en.ucc.xe.net  Wed Sep 10 12:22:53 2003

XE.com's Currency Update Service writes:

Here is today's Currency Update, a service of XE.com. Please read the
copyright, terms of use agreement, and information sections at the
end of this message.  CUS5D0B3D5C16D9

If you find our free currency e-mail updates useful, please forward this
message to a friend! Subscribe for free at: http://www.xe.com/cus/

Rates as of 2003.09.09 20:46:35 UTC (GMT). Base currency is EUR.

Currency Unit                          EUR per Unit         Units per EUR
================================   ===================   ===================
USD United States Dollars                 0.890585              1.12286     
EUR Euro                                  1.00000               1.00000     
GBP United Kingdom Pounds                 1.41659               0.705920    
CAD Canada Dollars                        0.651411              1.53513     


For help reading this mailout, refer to: http://www.xe.com/cus/sample.htm

The ... lines just indicate that I tossed a lot of uninteresting lines.

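There are three things I use to produce the report: the "Rates as of" line, which carries the date; the line starting with USD, which carries the rate; and the </PRE> line that closes the table, which tells me a message's data is complete.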

The Solution

The numeric part of the solution is really easy. Just grab the date info and the rate info. When I get the </PRE> line, print it out.
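As a rough illustration of the "grab" step, the sketch below feeds two lines copied from the sample message through AWK one at a time. The shell variable names (date_line, usd_line, msg_date, usd_per_eur) are made up for this example. It relies only on AWK's default behavior of splitting each line into fields on runs of whitespace.

```shell
# Two lines copied from the sample message above.
date_line='Rates as of 2003.09.09 20:46:35 UTC (GMT). Base currency is EUR.'
usd_line='USD United States Dollars                 0.890585              1.12286     '

# awk splits on runs of whitespace by default, so fields are just counted words:
# the date is word 4 of the first line, "Units per EUR" is word 6 of the second.
msg_date=$(printf '%s\n' "$date_line" | awk '/Rates as of/ { print $4 }')
usd_per_eur=$(printf '%s\n' "$usd_line" | awk '/^USD/ { print $6 }')

echo "$msg_date $usd_per_eur"
```

Counting words works here only because the currency name "United States Dollars" is always three words; a currency with a longer or shorter name would put its rate in a different field.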

The graphical part is just done by printing a number of plus signs that corresponds to the rate. To get decent resolution I would either need a very wide printout or some sort of offset. I went for the offset assuming the Euro will not drop below $.90 which is pretty safe considering the direction it is going.
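To make the offset arithmetic concrete, here is a small sketch that computes the bar for a single rate. The function name bar is hypothetical; the .895 offset and the factor of 100 are the ones used in the script in this article.

```shell
# bar RATE: print roughly one '+' per 0.01 that RATE lies above the .895 offset.
bar() {
    awk -v rate="$1" 'BEGIN {
        rc = (rate - .895) * 100                # e.g. 1.12286 gives 22.786
        for (i = 0; i < rc; i++) printf "+"     # loops while i < rc, so 23 times for 22.786
        printf "\n"
    }'
}

bar 1.12286
bar 1.036
```

Because the loop runs while i is less than rc, any fractional part rounds the bar up by one plus sign, and a rate at or below .895 prints an empty bar.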

Finally, I wanted a heading. Using AWK's BEGIN block, I put in a couple of print statements. Not liking to count characters, I defined the variable over to be the number of spaces that needed to be placed before the title info to align everything. This just meant that I had to run the program, see how far I was off and adjust the variable.

Here is the code.

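BEGIN		{
		# over is the left padding that lines the title up with the bars
		over = "                 "
		print over, " Cost of Euros in $ by date"
		print over, ".9       1.0       1.1       1.2       1.3"
		print over, "|         |         |         |         |"
		}
/Rates as of/	{ date = $4 }	# e.g. 2003.09.09
/^USD/		{ rate = $6 }	# the Units per EUR column
/^<\/PRE>/	{		# end of the rate table: print one report line
		printf "%s %6.3f ", date, rate
		rc = (rate - .895) * 100	# bar length, measured from the offset
		for (i=0; i < rc; i++) printf "+"
		printf "\n"
		date = "xxx"	# reset so a message with missing data stands out
		rate = 0
		}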

Just running the program with the mail file as input prints all the result lines but the order is that of the data in the mail file. The sort program to the rescue. The first field in the output is the date and some careful choice of the first character of the title lines means everything sorts just right with no options. Thus, to run, use:

    awk -f cc.as messages | sort 
and you get your fancy report. Pipe the result through more if you have a lot of lines to look at.

Here is a sample of the output:

                   Cost of Euros in $ by date
                  .9       1.0       1.1       1.2       1.3
                  |         |         |         |         |
2003.01.02  1.036 +++++++++++++++
2003.08.28  1.087 ++++++++++++++++++++
2003.08.29  1.098 +++++++++++++++++++++
2003.08.31  1.099 +++++++++++++++++++++
2003.09.01  1.097 +++++++++++++++++++++
2003.09.02  1.081 +++++++++++++++++++
2003.09.04  1.094 ++++++++++++++++++++
2003.09.05  1.110 ++++++++++++++++++++++
2003.09.07  1.110 ++++++++++++++++++++++
2003.09.08  1.107 ++++++++++++++++++++++
2003.09.09  1.123 +++++++++++++++++++++++
2003.09.10  1.121 +++++++++++++++++++++++
2003.09.11  1.120 +++++++++++++++++++++++
2003.09.12  1.129 ++++++++++++++++++++++++
2003.09.14  1.127 ++++++++++++++++++++++++
2003.09.15  1.128 ++++++++++++++++++++++++
2003.09.16  1.117 +++++++++++++++++++++++
2003.09.17  1.129 ++++++++++++++++++++++++
2003.09.18  1.124 +++++++++++++++++++++++
2003.09.19  1.138 +++++++++++++++++++++++++
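The ordering above depends on how sort compares the first characters of each line. The sketch below checks that on a few lines taken from the sample output, fed in deliberately out of order. The function name report is made up for the example, and the LC_ALL=C setting is an addition of mine, not something the original command uses: it forces plain byte-order comparison, whereas some other locales may collate leading spaces and punctuation differently.

```shell
# Three title lines plus two data lines, deliberately out of order.
report() {
    printf '%s\n' \
        '2003.09.09  1.123 +++++++++++++++++++++++' \
        '                  |         |         |         |         |' \
        '2003.01.02  1.036 +++++++++++++++' \
        '                  .9       1.0       1.1       1.2       1.3' \
        '                   Cost of Euros in $ by date'
}

# In byte order: space < '.' < digits < '|'.
# The title lines share 18 leading spaces and differ at the 19th character
# (space, '.', '|'), and every title line sorts before a line starting with a digit.
sorted=$(report | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

If the title lines come out in the wrong order on your system, prefixing the sort command with LC_ALL=C as shown is one thing to try.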

OK, sed experts, have at it.


Phil Hughes is the publisher of Linux Journal, and thereby Linux Gazette. He dreams of permanently tele-commuting from his home on the Pacific coast of the Olympic Peninsula. As an employer, he is "Vicious, Evil, Mean, & Nasty, but kind of mellow" as a boss should be.

Copyright © 2003, Phil Hughes. Copying license http://www.linuxgazette.net/copying.html
Published in Issue 95 of Linux Gazette, October 2003
