...making Linux just a little more fun!
By Ben Okopnik
"Foolish Things" is a now-and-again compilation we run based on our readers' input; once we have several of these stories assembled in one place, we get to share them with all of you. If you enjoy reading these cautionary tales of woe, proud stories of triumph, and just plain weird and fun things that happen between humans and silicon, that's great; if you have some to share so that others may enjoy them, even better. Please send them to .
[ You can even tell us that it happened to A Friend of Yours, and we'll believe you. ]
We have a nice SCSI HP ScanJet that's been sitting around for years; once in a while, we use it to scan X-ray films with a full A4 backlight. After some PC rearranging, it got moved and stopped working. The PC to which it was attached had SCSI cables that were too long (which had worked fine before), but, after being shifted, the PC's SCSI disk would only run at the lowest async speed. We changed just about everything on the chain, including the SCSI adapter in the stripped-down PC, but nothing helped. After giving up on connecting the scanner, I put the network card back in: the AGP graphics card stopped working, no screen output whatsoever. Pull the network card out: screen acts OK. Hm.
Dismissing the PC as too flaky and ancient to care about (PII 400), I decided to plug the HP scanner into my Linux laptop just to check it — and it worked perfectly. I then plugged it into a mini-tower similar to the discarded one, and it worked fine. So all is fine, right? Not so. After moving it around on the table and arranging the cables nicely, it wouldn't work. Then I realised that the scanner wasn't making the "normal noises" on power up: it would move the sledge inside a short way — and then nothing. When it was working, it would move the scan head back and forth for a few seconds, and then switch off the light. That was when we decided the scanner had a real problem after all, and sent it to a repair shop — but they sent it back saying the board was defective and they couldn't get a replacement, since it's too ancient.
There it was, sitting on the desk — a huge thing, sturdy beyond anything, and the board is "gone" just like that? By moving it a bit? I decided to open it up and look for a bad contact somewhere, maybe close to the SCSI connector. After I got the case open (tricky to find the last two screws), I saw a bit of the circuit board and the flat cables. Looks flawless to me — fine mechanics, cable guides everywhere. So, I switch it on, the head moves a little and stops with the light on — so it's still not OK. But the light is blinding me, so I put a sheet of paper on it — at which point, the head moves back and forth, goes to the 'park' position and switches off the light. (??!) It was reproducible, too. I turn around the top cover, and there is the white strip used for calibrating the scan sensors before each scan. I rub my finger over it, and it comes up black; I clean the glass plate from the inside, put it on, and — the scanner works perfectly. Damaged board indeed.... These fellows didn't even open it up.
This is a story that I'm telling on myself rather than exposing someone else's foolishness; as they say, "it was in a far country, and besides, the wench is dead" — and I'm smarter (I hope!) than I was then. Many, many years ago — those of you who were working with computers then will have a good idea of when, once you read this — when I was just a PFY and still learning the computer repair trade, I was working on a relatively new machine that I had brought home from a client's place. The problem was that the motherboard manufacturer (long since out of business — and with good reason) used tiny rivets to connect the layers on the opposite sides of the board instead of properly plating everything all the way through. Well, as time went on, and thermal expansion flexed the board microscopically, these rivets came loose — and so did the connections. Randomly. Sometimes yes and sometimes no — and sometimes maybe; depending on the weather, the humidity, how recently the computer had been pounded with a fist, etc.
Expensive as it was back then — and it was very expensive indeed — I had convinced this client to replace all these boards. He agreed, through gritted teeth, but got me to promise that I would try my best to keep his old boards working as long as possible..., so I spent a lot of time soldering these rivets, using a contact-cleaning file around their edges to get at least some kind of connection, or trying to solder hair-fine wires to the traces. (This worked very rarely, since the traces were so thin that they'd burn up as soon as you touched them with a soldering iron.)
Back then, the standard bit of advice was to leave the computer plugged in but turned off while you worked on it; this provided a local ground that you were always touching and decreased the chance of blowing the chips — which were very sensitive to static back then. The board that I was working on was fairly new, and had just started showing that random behavior. As usual, I plugged it in, made sure it was turned off, and was twiddling my tiny file while quietly cursing to myself — when suddenly, the chip next to the rivet, which I just happened to touch with the file, EXPLODED. I mean, literally — blew out a chunk of its ceramic packaging so I could see the silicon underneath.
I was completely flabbergasted. I had made absolutely sure that I switched it off; I knew that the capacitors in the PC power supply were designed to discharge in a fraction of a second after shutdown... What happened? I grabbed my voltmeter and started poking around in the machine, and figured it out after about two hours of measurements:
(Yep... it was that long ago.)
I felt really awful, but... I did indeed follow the agreement I had with that client: that board couldn't be repaired anymore, and needed to be replaced. I forgot to mention that I was the one that contributed to its slightly earlier demise.
P.S. I never signed up for that kind of serfdom, er, anything like that again, and I have always used a grounding strap instead of a plugged-in chassis since that day. :)
The time having come to concoct a replacement for my server-grade 486DX2 EISA/VLB-bus Linux host, I decided one day to build a new machine from best-of-breed parts (well, within limits set by my being a cheapskate).
I happened to have an excellent tower case and PC Power & Cooling power supply handy, a pair of Adaptec PCI SCSI host adapters, a SoundBlaster Pro, Matrox G200 video, a couple of large, fast IBM SCSI hard drives, a Plextor SCSI CDR, and a pair of 3Com 3C509B 10 megabit ethernet cards. (Hey, all it had to do was potentially max out a 1.54 megabit T-1 line, OK?) That left the matter of motherboard and parts attendant thereto.
After some checking, I bought a nice little FIC PA-2007 Socket-7-type motherboard. Then, from my favourite vendors (SA Technology), I acquired 128 MB of Crucial Technology SDRAM capable of running at CAS2 operation at 100MHz, an AMD K6/233 CPU, and a big-kahuna heat sink with a fan on top that uses ball bearings instead of the standard, noisy, failure-prone sleeve bearings. The ensemble was designed to be highly reliable, and at the same time hackable if I ever decided to join the then-pervasive overclocking madness.
Anyhow, I banged everything together. It seemed to work fine. The big tower case, low-heat-output CPU, and major cooling capacity meant that the thing ran very cool and reasonably quiet, even with the two very fast IBM 10,000 RPM SCSI hard drives.
As was my custom, I started compiling a kernel after a bit. That compile errored out with a SIG11. Hmm. Tried again. Errored out even faster, and at a different point in the compile. Odd. Do I have bad RAM?
Shut down and re-seated the RAM. Went downstairs to the CoffeeNet (a recently reborn 1990s Linux-based Internet cafe I'd helped build) for a caffeine infusion. Came back, powered up, ran the compile. No problems. Ran the compile again: SIG11. Ran it again: SIG11 at a different place. Ran it with only half the RAM: Same symptoms. Ran it with the other half: Same. Didn't seem to be a RAM problem(?).
Slept on the problem. Woke up, fired up the machine, ran a compile: No problem. Ran it again: No problem. Ran it a third time: SIG11. Ran it again: Same error, different spot.
Pondered the problem for a bit: It seemed as if the error was kicking in only after the system reached heat equilibrium, but not during the initial 30 minutes of operation when the system was still stone-cold. But that didn't make much sense: I opened up the case and re-verified that the system really was an engineer's dream of conservative design, and that even the 10,000 RPM drives were running cool.
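The pattern described here (a job that fails at a different spot on each run, and only once the machine has warmed up) usually points at hardware rather than software, since a deterministic job should fail the same way every time. The same check can be sketched without a kernel tree: repeat a deterministic CPU- and memory-heavy computation and confirm that every pass produces the same digest. The function names, buffer size, and pass count below are illustrative assumptions, not part of the original setup.

```python
import hashlib


def workload_digest(size_mb: int = 4, seed: bytes = b"foolish-things") -> str:
    """Build a deterministic buffer by chaining SHA-256 blocks, then hash it."""
    block = hashlib.sha256(seed).digest()
    chunks = []
    # Each iteration derives the next 32-byte block from the previous one,
    # so the whole buffer depends only on the seed and the size.
    for _ in range(size_mb * 1024 * 1024 // len(block)):
        block = hashlib.sha256(block).digest()
        chunks.append(block)
    data = b"".join(chunks)
    return hashlib.sha256(data).hexdigest()


def soak_test(passes: int = 5, size_mb: int = 4) -> list[int]:
    """Repeat the workload; return the pass numbers whose digest differs from pass 1."""
    reference = workload_digest(size_mb)
    mismatches = []
    for n in range(2, passes + 1):
        if workload_digest(size_mb) != reference:
            mismatches.append(n)
    return mismatches


if __name__ == "__main__":
    bad = soak_test(passes=20)
    print("mismatched passes:", bad if bad else "none")
```

On healthy hardware every pass should match. Mismatches that show up only after the machine has been running for a while would be consistent with a heat-related fault. This sketch is far lighter than a kernel compile, so a clean result does not prove the hardware is sound, and it assumes the first pass itself was computed correctly.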
I drove down from San Francisco to the Palo Alto Fry's and bought both the heat-conductive pads you can sandwich between CPUs and their heat sinks and the thermal paste you can use instead. I was going to make double-sure that I had good contact, there, before taking the CPU back to SA Technology and looking like an idiot if it turned out to be perfectly OK.
I took the heat sink off the CPU, cleaned both off, put paste on them and pressed them together. For some reason, perhaps on account of some bell ringing in my unconscious mind, a few minutes later I pulled them apart again to look at the two pieces more closely.
The Socket-7 motherboard socket for the K6 has a square outline, and the pins have a square pattern, with one corner of the CPU's pin-layout being different so you won't destroy it by putting it in the wrong way. When you put the CPU into the socket, if you aren't looking too closely (cue ominous music), you think: square socket, square CPU with a keyed feature to keep you from screwing up, square heatsink/fan assembly. 1, 2, 3 — done. Foolproof. (Well, no.)
The top surface of the CPU turned out to have a heatsink-contact surface that extended only across maybe 70% of the lateral distance, and the other 30% was sunk down lower.
---------------------
|             |     |
|             |     |
|             |     |
|             |     |
|   contact   |     |
|   surface   |     |
|             |     |
|             |     |
---------------------
I'd accidentally rotated the heatsink 180 degrees, so that its contact surface was staggered over to the right-hand-side:
---------------------
|     |             |
|     |             |
|     |             |
|     |             |
|     |   contact   |
|     |   surface   |
|     |             |
|     |             |
---------------------
Only a little strip in the middle was actually touching, so all the other thermal paste was still protruding up into the air, un-squashed. Most of the CPU top surface was getting zero help with cooling, and instead was radiating out into a nice warm, insulating air pocket under the heatsink.
If this had been an Athlon or Pentium III Coppermine, the CPU probably would have committed seppuku, but the K6 was completely undamaged and has been happily cranking away ever since.
But that experience confirmed me in my prejudice that a cool CPU (like a cool system generally) is much, much, much to be preferred over one that runs hot and needs heroic measures like huge amounts of forced air flow to stave off disaster. Those other ones may be faster — but usually (for most machine roles) don't even manifest that speed in ways you especially care about.
Ben is the Editor-in-Chief for Linux Gazette and a member of The Answer Gang.
Ben was born in Moscow, Russia in 1962. He became interested in electricity at the tender age of six, promptly demonstrated it by sticking a fork into a socket and starting a fire, and has been falling down technological mineshafts ever since. He has been working with computers since the Elder Days, when they had to be built by soldering parts onto printed circuit boards and programs had to fit into 4k of memory. He would gladly pay good money to any psychologist who can cure him of the recurrent nightmares.
His subsequent experiences include creating software in nearly a dozen languages, network and database maintenance during the approach of a hurricane, and writing articles for publications ranging from sailing magazines to technological journals. After a seven-year Atlantic/Caribbean cruise under sail and passages up and down the East coast of the US, he is currently anchored in St. Augustine, Florida. He works as a technical instructor for Sun Microsystems and a private Open Source consultant/Web developer. His current set of hobbies includes flying, yoga, martial arts, motorcycles, writing, and Roman history; his Palm Pilot is crammed full of alarms, many of which contain exclamation points.
He has been working with Linux since 1997, and credits it with his complete loss of interest in waging nuclear warfare on parts of the Pacific Northwest.