2-Cent Tips
2-cent tip: Understand the file system hierarchy right from the man pages
Mulyadi Santosa [mulyadi.santosa at gmail.com]
Fri, 23 Jul 2010 14:28:18 +0700
Probably one of my shortest tips so far:
Confused about what /proc, /sys, /dev, /boot, etc. really mean and
why on Earth they are there? Simply type "man hier" in your shell and
hopefully you'll understand.
-- regards,
Mulyadi Santosa Freelance Linux trainer and consultant
blog: the-hydra.blogspot.com training: mulyaditraining.blogspot.com
[ Thread continues here (4 messages/3.80kB) ]
2-cent tip: De-Microsofting text files
Ben Okopnik [ben at linuxgazette.net]
Fri, 23 Jul 2010 14:21:02 -0400
I was doing some PDF to HTML conversions today, and noticed some really ugly, borken content in the resulting files; the content had obviously been created via some Microsoft program (probably Word):
Just say ?<80><98>hello, world!?<80><99>?<80><9d>
I had a few dozen docs to fix, and didn't have a mapping of the characters with which I wanted to replace these ugly clumps of hex. That is, I could see what I wanted, but expressing it in code would take a bit more than that.
Then, I got hit by an idea. After I got up, rubbed the bruise, and took an aspirin, I wrote the following:
#!/usr/bin/perl -w
# Created by Ben Okopnik on Fri Jul 23 12:05:05 EDT 2010
use encoding qw/utf-8/;
 
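# Slurp every file named on the command line into one string; in list
# context, <> returns one element per file, so join them together
my %seen;
my $s = do { local $/; join '', <> };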
# Delete all "normal" characters
$s =~ s/[\011\012\015\040-\176]//g;
print "#!/usr/bin/perl -i~ -wp\n\n";
for (split //, $s){ next if $seen{$_}++; print "s/$_//g;\n"; }
When this script is given a list of all the text files as arguments, it collects a unique list of the UTF-8 versions of all the "weird" characters and outputs a second Perl script which you can now edit to define the replacements:
#!/usr/bin/perl -i~ -wp

s/\xFE\xFF//g;
s/?//g;
s/?//g;
s/?//g;
s/?//g;
s/?//g;
s/?//g;
s/?//g;
s/?//g;
Note that the second half of each substitution is empty; that's where you put in your replacements, like so:
#!/usr/bin/perl -i~ -wp

s/\xFE\xFF//g;    # We'll get rid of the 'BOM' marker
s/?/"/g;
s/?/-/g;
s/?/'/g;
s/?/"/g;
s/?/-/g;
s/?/.../g;
s/?/'/g;
s/?/&copy;/g;     # We'll make an HTML entity out of this one
Now, just make this script executable, feed it a list of all your text files, and live happily ever after. Note that the original versions will be preserved with a '~' appended to their filenames, just in case.
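If it is hard to tell from the generated script which character each line refers to, a small helper along these lines may make the choice of replacements easier. This is only a sketch, not part of the original tip: the name describe_odd_chars is made up for illustration, the input is assumed to be UTF-8, and the sample bytes are a reconstruction of the example above on the assumption that the lead byte lost in display is 0xE2.

```perl
#!/usr/bin/perl
# Sketch: list each distinct non-ASCII character with its code point and name.
use strict;
use warnings;
use charnames ();          # provides charnames::viacode()
use Encode qw(decode);

# Hypothetical helper: returns one "U+XXXX NAME" string per distinct
# non-ASCII character in $bytes, in order of first appearance.
# Assumes $bytes holds UTF-8 encoded text.
sub describe_odd_chars {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    my (%seen, @report);
    for my $ch ($text =~ /([^\x00-\x7F])/g) {
        next if $seen{$ch}++;
        my $name = charnames::viacode(ord $ch) // 'UNKNOWN';
        push @report, sprintf('U+%04X %s', ord $ch, $name);
    }
    return @report;
}

# Assumed sample: the "hello, world" line with 0xE2 as the lead byte.
my $sample = "Just say \xE2\x80\x98hello, world!\xE2\x80\x99\xE2\x80\x9D";
print "$_\n" for describe_odd_chars($sample);
# Prints:
#   U+2018 LEFT SINGLE QUOTATION MARK
#   U+2019 RIGHT SINGLE QUOTATION MARK
#   U+201D RIGHT DOUBLE QUOTATION MARK
```

With the names in hand, it should be easier to decide which line of the generated script gets a plain quote, which gets a dash, and so on.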
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
[ Thread continues here (5 messages/7.54kB) ]