Since when has the "stability" of a computer been an analog quantity anyway? A computer is either stable, or it's not. But in today's consumer PC culture, you often hear things like "for maximum stability, use only matched RAM" and such.
I guess it has to do with today's culture of Windows machines infested with hundreds of little snippets of mismatched software - drivers for hardware gadgets, browser toolbars, spyware from the internet and so on. In a software system of this complexity, the occasional interaction bug is unavoidable and then the system becomes unstable, right? So flakey hardware is only another contributing factor to the occasional crash or weird behaviour that forces a reboot.
But suppose you really know your computer and tightly control what junk gets installed on it and what doesn't. Or you run Linux, where less of the junk runs with kernel-level privileges, so it is prone to crash itself but not the whole computer. And anyway, you have a certain eye for what is a harmless application bug and what is hardware-induced.
Then it becomes an issue if, out of the 8,589,934,592 bits of memory in your brand new computer (that's one gigabyte), one single lousy, measly bit has a tendency to sometimes flip from 1 to 0 without being asked to. For example, you've just copied 250 gigabytes of data from one hard disk to another (that kind of data accumulates fast if you capture video) and you check the copy against the original and find 33 corruptions. Pretty stable, huh? That's only one error in roughly 7.6 gigabytes after all. That's good enough for surfing the web, writing email and playing shoot-everything-that-moves video games.
But it's not "stable" in the old digital sense, you know, either stable or not stable.
Which brings me to my beef with cheap computer memory.
My bad memory was from a well-respected consumer-oriented brand, let's call it Acme brand memory. The little printed circuit board says Acme on it, and so do the actual chips. But if you happen to work in the digital hardware design business and sometimes need RAM for your products, you've never heard of Acme. You've only heard of Micron, Samsung, Infineon and their ilk, big companies that run monstrously expensive IC foundries.
One of those obviously made the little rectangles of silicon that ultimately make the Acme brand work. So why didn't they actually sell it themselves under their own name? Why can Acme sell it cheaper? Here's my theory.
Testing RAM is very expensive. Why? Because a great number of patterns of ones and zeroes need to be written into the RAM and read back before you can trust the thing. The machine that does this testing is expensive and it goes obsolete with time, therefore time on the machine is expensive. In fact, it is a significant chunk of the cost of a tested RAM chip.
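To make the "patterns of ones and zeroes" idea concrete, here is a toy sketch in Python. It cannot test real RAM from user space, so it runs against a simulated memory with one bit stuck at 0. The class `StuckBitMemory`, the function `pattern_test` and the choice of patterns are all made up for illustration; real testers and Memtest86+ use far more elaborate sequences.

```python
class StuckBitMemory:
    """Hypothetical simulated memory in which one bit always reads back as 0."""

    def __init__(self, size, bad_offset, bad_bit):
        self.cells = bytearray(size)
        self.bad_offset = bad_offset
        self.bad_mask = 1 << bad_bit

    def write(self, offset, value):
        self.cells[offset] = value

    def read(self, offset):
        value = self.cells[offset]
        if offset == self.bad_offset:
            # Simulate the fault: the bad bit reads as 0 whatever was written.
            value &= ~self.bad_mask & 0xFF
        return value


def pattern_test(memory, size, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern everywhere, read it back, and list the mismatches."""
    errors = []
    for pattern in patterns:
        for offset in range(size):
            memory.write(offset, pattern)
        for offset in range(size):
            got = memory.read(offset)
            if got != pattern:
                errors.append((offset, pattern, got))
    return errors


if __name__ == "__main__":
    mem = StuckBitMemory(16, bad_offset=5, bad_bit=2)
    for offset, expected, got in pattern_test(mem, 16):
        print(f"offset {offset}: wrote {expected:#04x}, read {got:#04x}")
```

Note that the all-zeroes pattern and 0xAA never notice this fault, because they happen to write a 0 into the bad bit. That is the simple case. A bit that only flips when its neighbours hold a particular pattern, or only after sitting idle for a while, needs many more patterns and much more time, which is where the expense comes from.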
So my guess is that Acme comes in there and offers to buy untested chips from the manufacturer cheap. To make those nice shiny round wafers with RAM chips on them takes, at minimum, a billion dollars in capital investment. To actually saw the wafers up and test the chips can be done with much less overhead. And then Acme can run a "consumer grade" test on the devices. You know, instead of half an hour on the million dollar machine, maybe only 15 minutes on a $10,000 machine. That test is still far more thorough than, say, Memtest86+. The RAMs will pass every test a consumer can throw at them. So off to packaging (in the process, Acme's logo ends up on the packaged ICs) and mounting on the little SIMM boards. Heck, maybe the "yield" is so high that they can just buy them packaged, untested from the manufacturer, solder them on the SIMMs and then test them and just throw out the SIMMs that fail.
And then one of those SIMMs ends up in my computer, and I notice that pesky little bit that, all tests notwithstanding, flips from 1 to 0 when its neighbouring bits are just such and such a pattern and if you leave it alone for just such a length of time or whatever, and my data gets corrupted.
This has happened to me twice now, with two different kinds of "brand X" memory SIMMs, brand-new, in new computers, 7 years apart.
Sadly, another thing that has gone in the quest for cheapness at the cost of reliability is that extra 1/8 of memory for ECC (error-correcting code). If for every 64 bits of data, you store an extra 8, you can correct a single flipped bit in the resulting 72 bits. Not only that, but you can tell with good confidence whether no bits have been flipped, a single bit has been flipped that can be flipped back, or multiple bits have flipped. The last case is an uncorrectable error, and it can at least generate an error message to let you know your memory is bad and your data has taken a hit.
But that means an extra 9th RAM chip, increasing the cost of memory by 12.5%. And guess what, virtually no consumer is interested in paying extra, and so everyone buys "non-ECC-capable" RAM modules with only 8 chips, and the chipset makers say, oh, then we'll introduce consumer-oriented chipsets that don't even have the ECC logic. And so now, when your computer is acting weird, maybe the memory is bad, but you can't say for sure. All you can do is run Memtest86+ and hope that it will find something.
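For the curious, the 64-plus-8 scheme can be sketched in Python as a textbook extended Hamming code: seven check bits locate a single flipped bit, and an eighth overall parity bit tells a single flip from a double one. The bit layout and the names `encode`, `decode` and `syndrome` are my own assumptions for illustration. Actual memory controllers do this in hardware and may arrange the check bits quite differently.

```python
CODE_BITS = 72                       # bit positions 0..71 of the stored word
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
# The 64 remaining positions (position 0 is the overall parity bit).
DATA_POSITIONS = [p for p in range(1, CODE_BITS) if p not in PARITY_POSITIONS]


def syndrome(word):
    """XOR together the position numbers (1..71) of every bit that is set."""
    s = 0
    for pos in range(1, CODE_BITS):
        if (word >> pos) & 1:
            s ^= pos
    return s


def encode(data):
    """Spread 64 data bits over a 72-bit word and fill in the 8 check bits."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    # Setting the check bit at position p toggles bit p of the syndrome,
    # so this loop drives the syndrome of the finished word to zero.
    s = syndrome(word)
    for p in PARITY_POSITIONS:
        if s & p:
            word |= 1 << p
    # Position 0: overall parity, making the total count of 1 bits even.
    if bin(word).count("1") % 2:
        word |= 1
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    s = syndrome(word)
    odd = bin(word).count("1") % 2 == 1
    if odd:
        # Odd overall parity: assume one flip, at position s (0 = parity bit).
        if s >= CODE_BITS:
            return None, "uncorrectable"
        word ^= 1 << s
        status = "corrected"
    elif s != 0:
        # Even parity but a nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status


if __name__ == "__main__":
    stored = encode(0xDEADBEEFCAFEF00D)
    print(decode(stored)[1])                          # no flips
    print(decode(stored ^ (1 << 40))[1])              # one flipped bit
    print(decode(stored ^ (1 << 3) ^ (1 << 40))[1])   # two flipped bits
```

Three or more flips can fool a code like this, which is why the detection is "with good confidence" and not a guarantee.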
Or, if you're the average computer user, just share those slightly corrupted MP3s and DivX files (the music/movie industry won't mind the corruptions, that's for sure), and if the computer acts a little wonky, reboot, and if reboot doesn't fix it, sell it at a garage sale and buy a new one that is "more stable".
Update, November 10 2008
Recently I put a new hard disk into my main machine, and went through the exercise of copying hundreds of gigabytes of data onto it. Bits were corrupted.
After swapping memory modules experimentally and running Memtest86+ overnight several times, I now see a 100% correlation between the test failing (overnight run minimum! It can take hours between error messages) and data being corrupted in large file copies in Linux. And I'm up to four memory modules which have actually caused me data corruption. They are from two different "off-brands". Interestingly, I haven't yet seen a marginal 256 megabyte PC2100 module, but three different 512 megabyte PC3200 modules and, way back in 1999, one 128 megabyte PC100 module. I think as a given memory size/technology "matures" i.e. has been manufactured for longer, even the junk brand memories get better. But the fact is, the four bad memories were all sold as good, two of them under a very respected "consumer" brand, and they won't even pass a freely available software memory test.
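Checking a copy against the original needs nothing fancy. Here is a rough Python sketch of the kind of comparison I mean; the function names `compare_files` and `flipped_bits` are made up for this example, and any byte-for-byte compare tool does the same job. It reads both files in chunks and reports each differing byte, and the helper shows which bits changed and in which direction.

```python
def compare_files(path_a, path_b, chunk_size=1 << 20):
    """Return (offset, byte_in_a, byte_in_b) for every byte that differs."""
    diffs = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if len(a) != len(b):
                raise ValueError("files differ in length")
            if not a:
                break
            if a != b:
                # Only walk the chunk byte by byte when it actually differs.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diffs.append((offset + i, x, y))
            offset += len(a)
    return diffs


def flipped_bits(x, y):
    """List (bit_number, value_in_x, value_in_y) for each bit that differs."""
    return [
        (bit, (x >> bit) & 1, (y >> bit) & 1)
        for bit in range(8)
        if ((x ^ y) >> bit) & 1
    ]
```

Keep in mind that on a machine with bad memory the comparison itself runs through the same flaky RAM, so it is worth running twice and seeing whether the reported offsets agree.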
I now consider it a minimum requirement that any memory modification on a computer that will handle real data (i.e. not just mom's web mail machine) be followed by an overnight run of the latest version of Memtest86+. And if I ever buy "consumer grade" memory again, I will insist on a refund guarantee if the memory fails that test.
Memtest86+ is what I did my tests with. It is a fork of Memtest86. Both are under active development. Perhaps the most paranoid test would be an overnight run of each.
Memtest86+ comes as a boot option on Fedora install disks so it's always handy. These install disks also have a boot option to read through and MD5 check the entire disk. For someone who hates data corruption, this is a thing of beauty.
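The same kind of check is easy to do on your own files. This is only a sketch of the idea using Python's standard `hashlib` module, not how the Fedora media check is implemented; the function name `md5_of_file` is my own. Record the digest while the data is known to be good, then recompute it after every copy and compare.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 and return the hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that huge files never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A single flipped bit anywhere in the file gives a completely different digest, so a mismatch tells you that something is wrong, though not where.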