...making Linux just a little more fun!

2-cent tip: De-Microsofting text files

Ben Okopnik [ben at linuxgazette.net]


Fri, 23 Jul 2010 14:21:02 -0400

I was doing some PDF to HTML conversions today, and noticed some really ugly, borken content in the resulting files; the content had obviously been created via some Microsoft program (probably Word):

Just say â<80><98>hello, world!â<80><99>â<80><9d>

I had a few dozen docs to fix, and didn't have a mapping of the characters with which I wanted to replace these ugly clumps of hex. That is, I could see what I wanted, but expressing it in code would take a bit more than that.

Then, I got hit by an idea. After I got up, rubbed the bruise, and took an aspirin, I wrote the following:

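#!/usr/bin/perl -w
# Created by Ben Okopnik on Fri Jul 23 12:05:05 EDT 2010
# Treat input files and STDOUT as UTF-8 (replaces the deprecated 'encoding' pragma)
use open qw/:std :encoding(UTF-8)/;
 
# Slurp every file named on the command line into one string; a bare <> in
# list context would hand back one element per file
my $s = do { local $/; join '', <> };
my %seen;
# Delete all "normal" characters
$s =~ s/[\011\012\015\040-\176]//g;
print "#!/usr/bin/perl -i~ -wp\n\n";
# Emit one empty substitution per distinct leftover character
for (split //, $s){ next if $seen{$_}++; print "s/$_//g;\n"; }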

When this script is given a list of all the text files as arguments, it collects a unique list of the UTF-8 versions of all the "weird" characters and outputs a second Perl script which you can now edit to define the replacements:

#!/usr/bin/perl -i~ -wp
 
s/\xEF\xBB\xBF//g;
s/“//g;
s/–//g;
s/‘//g;
s/”//g;
s/−//g;
s/…//g;
s/’//g;
s/©//g;

Note that the second half of each substitution is empty; that's where you put in your replacements, like so:

#!/usr/bin/perl -i~ -wp
 
s/\xEF\xBB\xBF//g;	# We'll get rid of the 'BOM' marker
s/“/"/g;
s/–/-/g;
s/‘/'/g;
s/”/"/g;
s/−/-/g;
s/…/.../g;
s/’/'/g;
s/©/&copy;/g;	# We'll make an HTML entity out of this one

Now, just make this script executable, feed it a list of all your text files, and live happily ever after. Note that the original versions will be preserved with a '~' appended to their filenames, just in case.
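One possible refinement, offered only as a sketch: many of these characters look alike in a terminal (or show up as '?'), so the generator could tag each line with the character's code point and Unicode name. The helper name substitution_lines is made up for this illustration and is not part of the script above; it assumes the text has already been decoded into Perl characters. charnames::viacode comes from the core charnames module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();    # core module; provides charnames::viacode()

# Hypothetical helper (not part of the original script): takes decoded text
# and returns one annotated substitution line per distinct "weird" character.
sub substitution_lines {
    my ($text) = @_;
    my (%seen, @lines);
    # Drop tab, LF, CR, and printable ASCII, as the original script does
    $text =~ s/[\011\012\015\040-\176]//g;
    for my $c (split //, $text) {
        next if $seen{$c}++;
        # viacode() returns undef for code points without a name
        my $name = charnames::viacode(ord($c)) // 'unnamed';
        push @lines, sprintf("s/%s//g;\t# U+%04X %s\n", $c, ord($c), $name);
    }
    return @lines;
}

binmode STDOUT, ':encoding(UTF-8)';
print substitution_lines("Just say \x{2018}hello, world!\x{2019}\x{201D}\n");
```

For the sample string, this should list U+2018 LEFT SINGLE QUOTATION MARK, U+2019 RIGHT SINGLE QUOTATION MARK, and U+201D RIGHT DOUBLE QUOTATION MARK, each as a comment after its empty substitution, which makes it easier to decide what to put in the second half.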

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *



Mulyadi Santosa [mulyadi.santosa at gmail.com]


Sat, 24 Jul 2010 02:38:03 +0700

On Sat, Jul 24, 2010 at 01:21, Ben Okopnik <ben at linuxgazette.net> wrote:

> Then, I got hit by an idea. After I got up, rubbed the bruise, and took
> an aspirin, I wrote the following:

Whenever I see Ben writing Perl scripts, I always wonder where the hell those ideas come from. :) Ben does this the way I chew gum.....

Pretty scary to imagine if Ben was Cobb in "Inception" :)

-- 
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com



Ben Okopnik [ben at linuxgazette.net]


Fri, 23 Jul 2010 16:20:48 -0400

On Sat, Jul 24, 2010 at 02:38:03AM +0700, Mulyadi Santosa wrote:

> On Sat, Jul 24, 2010 at 01:21, Ben Okopnik <ben at linuxgazette.net> wrote:
> > Then, I got hit by an idea. After I got up, rubbed the bruise, and took
> > an aspirin, I wrote the following:
> 
> Whenever I see Ben writing Perl scripts, I always wonder where the hell
> those ideas come from. :) Ben does this the way I chew gum.....

It's a mix of things. Between client work, trying to get things done for myself, and simple intellectual curiosity[1], I often come up against challenges that push the limits of the available tools. At that point, I have to create my own - and a lot of times, this involves non-linear thinking, which I enjoy. It's like writing recursive functions: if you don't understand the basic principle, there's a whole class of problems that'll cause you to struggle for days or weeks or even give up because the task is "impossible". If you do understand it, those problems get solved in just a few moments.

# Fibonacci sequence
sub fib { $_[0] <= 1 ? $_[0] : fib($_[0]-1) + fib($_[0]-2) }
print join(" ", map { fib($_) } 0..10), "\n";

Love that stuff. :) Although you do have to be careful about the run time: a naive recursion like this one is exponential, which is a good deal worse than O(n^2)...
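For what it's worth, that run-time problem has a standard cure: cache each result so that it's computed only once. What follows is a minimal sketch of that idea rather than a change to the one-liner above; the names fib_memo and %cache are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %cache;    # hypothetical cache: maps n to fib(n) once it has been computed

# Same recursion as the naive version, but each value is computed only once,
# so the number of additions grows linearly with n instead of exponentially.
sub fib_memo {
    my ($n) = @_;
    return $n if $n <= 1;
    return $cache{$n} //= fib_memo($n - 1) + fib_memo($n - 2);
}

print join(' ', map { fib_memo($_) } 0 .. 10), "\n";
# prints: 0 1 1 2 3 5 8 13 21 34 55
```

The core Memoize module gets the same effect without touching the function body: after "use Memoize;", calling memoize('fib') wraps the existing sub in a cache.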

> Pretty scary to imagine if Ben was Cobb in "Inception" :)

Hadn't heard of it, just looked it up. Yeah, I could dream up some interesting tools and methods. :)

[1] That's the bugger that gets me in more trouble than everything else combined. I truly believe that the world is going to end not because of some power-mad dictator pushing The Big Red Button but because some chemistry/physics/bio/genetics/whatever geek says, "gosh, I wonder what'll happen if I do *this?*" [shrug] Life would be boring otherwise, I guess.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *



Breen Mullins [bpm at sdf.org]


Fri, 23 Jul 2010 19:17:03 -0700

* Ben Okopnik <ben at linuxgazette.net> [2010-07-23 16:20 -0400]:

>I truly believe that the world is going to end not
>because of some power-mad dictator pushing The Big Red Button but
>because some chemistry/physics/bio/genetics/whatever geek says, "gosh, I
>wonder what'll happen if I do *this?*"

I hate that question. It's a good one, of course, and leads to great discoveries.

But I also recall a grizzled vet frobbing a literal BRB and tripping a breaker that was on the other side of a wall that had been sealed up since the circuit was installed. We had to knock on the neighbor's door and ask "may we go rummaging through your panels until we find our breaker?"

-- 
Breen Mullins
bpm at sdf.org



Ben Okopnik [ben at linuxgazette.net]


Fri, 23 Jul 2010 23:06:21 -0400

On Fri, Jul 23, 2010 at 07:17:03PM -0700, Breen Mullins wrote:

> * Ben Okopnik <ben at linuxgazette.net> [2010-07-23 16:20 -0400]:
> 
> >I truly believe that the world is going to end not
> >because of some power-mad dictator pushing The Big Red Button but
> >because some chemistry/physics/bio/genetics/whatever geek says, "gosh, I
> >wonder what'll happen if I do *this?*"
> 
> I hate that question. It's a good one, of course, and leads to
> great discoveries.

The Southern version, of course, is "Hey, y'all - watch *this!*"

That phrase is the leading cause of death in that part of the country. :)

> But I also recall a grizzled vet frobbing a literal BRB and tripping
> a breaker that was on the other side of a wall that had been sealed
> up since the circuit was installed. We had to knock on the
> neighbor's door and ask "may we go rummaging through your panels
> until we find our breaker?"

[laugh] That has a very familiar flavor to it.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *

