13 January 2007

Bad RAM, Bad RAM Tester Design

This long post covers Vista's mOS, MemTest86 and Microsoft's stand-alone RAM testing utility.

How bad RAM presents

If RAM was originally OK, then goes bad, you'd start to see random errors, crashes, lockups, reports of corrupted registry or other files and operations, and perhaps some spontaneous resets. This random pattern may develop some reproducible errors, where the contents of the hard drive have been corrupted, either from bad RAM per se or from recurrent bad exits.

RAM crashes at full speed, so you won't notice any slowdown of the system. This contrasts with bad sectors on the hard drive, which slow the system due to attempts to retry the operation, and/or copy contents of failing sectors to spare sectors. On most consumer PCs, there's no attempt to detect RAM errors after the BIOS boot phase; where such detection is present, the system will usually halt as soon as a RAM error is detected.

Why bad RAM matters

RAM errors can corrupt not only what is written to disk, but also where it is written to disk, at a level beneath that of the file system. A sector meant to hold the contents of a file may instead be written over some core file system structure, e.g. if a high-order bit in the raw sector address is flipped from 1 to 0.
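To make the address case concrete, here is a minimal Python sketch. The 28-bit address width and the sample sector number are assumptions picked for illustration, not values from any particular disk.

```python
ADDRESS_BITS = 28  # assumed width of a raw sector address (illustrative only)

def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, as a single-bit RAM error would."""
    return value ^ (1 << bit)

# Hypothetical sector deep in the data area: high-order bit set, plus an offset.
intended_sector = (1 << (ADDRESS_BITS - 1)) + 2048

# The high-order bit flips from 1 to 0 while the address is held in RAM.
actual_sector = flip_bit(intended_sector, ADDRESS_BITS - 1)

print(f"intended sector: {intended_sector}")
print(f"actual sector:   {actual_sector}")
```

One flipped bit moves the write from far into the volume to within a few thousand sectors of its start, which is where core file system structures tend to sit.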

RAM errors can corrupt code, causing crashes, but a greater risk may arise where the code does not crash. Many disk operation calls may use a status byte in a register to differentiate between read and write operations, so a bit-flip there could cause a write instead of a read. This is why no disk access can be considered safe; any disk access starts with reading crucial areas of the file system, and if those reads become writes, the disk contents could be trashed.
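As one example of how close read and write can be: in the legacy BIOS INT 13h disk services, function 02h reads sectors and function 03h writes them, with the function number passed in register AH. The sketch below only compares the two numbers; it assumes nothing about how any particular OS issues its disk calls.

```python
# Legacy BIOS INT 13h function numbers, passed in register AH.
READ_SECTORS = 0x02
WRITE_SECTORS = 0x03

def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# A single flip of the lowest bit, as a RAM error might cause.
corrupted = READ_SECTORS ^ 0x01

print("bits separating read from write:", hamming_distance(READ_SECTORS, WRITE_SECTORS))
print("flipped read request is now a write:", corrupted == WRITE_SECTORS)
```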

Why bad RAM may be tough to exclude

When I started out with PCs in the era of DOS 3.3, 286 processors etc. I wondered why there were so many RAM testing utilities around. Surely you would just copy data to a spare register, write it into RAM, read it back from RAM, and compare with the spare register?

I had found that in practice, several testers would pass RAM as "OK" even though swap testing would clearly demonstrate that problems would clear up on the suspect PC with good RAM and appear on a known-good PC with the suspect RAM added.

So I thought a bit more about how RAM can fail; not just by returning different data compared to what was written to it, but altering data in other addresses when certain addresses are accessed, or behaving differently according to whether the RAM is read for instructions vs. data, or whether it is being accessed by the processor, AGP, or some other device via DMA.
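The "altering data in other addresses" case is easy to simulate. The sketch below is a toy model, not how any real tester works: `AliasedRam` is a hypothetical fault model in which one address line is stuck at 0, so two addresses share one cell. The naive write, read back, compare test passes this RAM, while a test that fills every cell before verifying any of them does not.

```python
class AliasedRam:
    """Simulated RAM with one address line stuck at 0 (hypothetical fault model)."""

    def __init__(self, size: int, stuck_bit: int):
        self.cells = bytearray(size)
        self.mask = ~(1 << stuck_bit)  # clears the stuck address bit

    def write(self, addr: int, value: int) -> None:
        self.cells[addr & self.mask] = value

    def read(self, addr: int) -> int:
        return self.cells[addr & self.mask]


def naive_test(ram: AliasedRam, size: int) -> list:
    """Write each cell and read it straight back; return failing addresses."""
    bad = []
    for addr in range(size):
        for pattern in (0x55, 0xAA):
            ram.write(addr, pattern)
            if ram.read(addr) != pattern:
                bad.append(addr)
                break
    return bad


def fill_then_verify_test(ram: AliasedRam, size: int) -> list:
    """Write a distinct value to every cell first, then verify them all."""
    for addr in range(size):
        ram.write(addr, addr & 0xFF)
    return [addr for addr in range(size) if ram.read(addr) != (addr & 0xFF)]


SIZE = 64
print("naive test failures:", len(naive_test(AliasedRam(SIZE, stuck_bit=4), SIZE)))
print("fill-then-verify failures:",
      len(fill_then_verify_test(AliasedRam(SIZE, stuck_bit=4), SIZE)))
```

The naive test never notices because each value is read back before anything else can overwrite it. The second test reports the addresses whose contents were overwritten through the alias.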

Also, some failures can crash, lock up or reset the system, instead of being presented as a nice list of bad addresses. If the RAM testing boot disk is left in the system during the test, a spontaneous reset may be missed, unless you happen to notice the test has been running for fewer hours than have elapsed since you started the test.

For a long time, I gave up on RAM testing utilities, and just did swap testing as above. My faith in RAM testers started to return with SIMM Tester, and grew stronger with MemTest86 and MemTest86+. But I find that even with these tools, either one of the two MemTest86 projects may detect errors where SIMM Tester does not, and 8 hours of MemTest86 may pass, only to throw errors somewhere in the next 12 hours of testing.

How to design a RAM testing utility

This isn't about test sequence and data intended to provoke errors due to local power starvation or whatever. Instead, it's about how this core of test routines should be wrapped into a safe and usable utility - as illuminated by issues raised earlier in this post.

Microsoft have a free stand-alone RAM tester that is called the "Windows Memory Diagnostic". But why is "Windows" in the tool's name, given this is a tool that should run at the system level, before any OS has booted up or is left running in the background?

I used this stand-alone form of the tool, and noticed something rather nasty about it - when set to repeat the test sequence, it clears the results of all previous test passes! It also does not indicate elapsed clock time, so if the tester disk is left in the boot drive, the test will restart and look exactly the same as if it had been running without any interruptions.

Any RAM failure is significant, even if it shows up only once in 24 hours of testing. If you use MemTest86 and one such error occurs, you will see it listed when you return after an overnight unattended test - whereas even if Microsoft's tester flagged it at the time, you will only see the "OK" result of the last test pass when you return in the morning.

There's no point in doing 24 hours of testing, if only the last pass (possibly the last 20 minutes of testing) is reported! Who is going to sit and watch an "unattended" RAM test loop for 24 hours, just in case one pass fleetingly shows an error on screen?
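What I'd want instead is a result log that only ever grows, and that shows elapsed clock time. A minimal sketch follows; `ResultLog` and its fields are names I've made up for illustration, and the injectable clock is there only so the example can run without waiting 20 minutes per pass.

```python
import time


class ResultLog:
    """Cumulative test results: errors from earlier passes are never cleared."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.started = clock()
        self.passes = 0
        self.errors = []  # (pass number, seconds since start, address)

    def record_pass(self, bad_addresses):
        self.passes += 1
        elapsed = self.clock() - self.started
        for addr in bad_addresses:
            self.errors.append((self.passes, elapsed, addr))

    def summary(self) -> str:
        elapsed = int(self.clock() - self.started)
        hours, rem = divmod(elapsed, 3600)
        verdict = "FAILED" if self.errors else "no errors so far"
        lines = [f"passes: {self.passes}  elapsed: {hours}h{rem // 60:02d}m  {verdict}"]
        for n, t, addr in self.errors:
            lines.append(f"  pass {n} at {int(t) // 60} min: bad address {addr:#010x}")
        return "\n".join(lines)


# Simulated overnight run: four 20-minute passes, one error in pass 2.
fake_now = [0.0]
log = ResultLog(clock=lambda: fake_now[0])
for pass_errors in ([], [0x1F3A20], [], []):
    fake_now[0] += 1200
    log.record_pass(pass_errors)
print(log.summary())
```

The single error from pass 2 is still on screen after passes 3 and 4 come back clean. The elapsed time also gives you something to compare against the wall clock, which is how you'd spot a spontaneous reset that restarted the test.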

How to integrate RAM testing with a mOS

I'd love to include RAM testing within my maintenance OS (mOS), but I can't see a way to fully automate this. The mOS boot disk should not boot past the RAM testing component into loading the full mOS, because that involves a lot of disk operations that may be unsafe when RAM is bad. There's no safe and standard way for the RAM tester to set a flag that it is in session, that will persist after a spontaneous reset. The best I can think of would be to boot the mOS to a menu that defaults to testing RAM, but that does not time out and will wait forever for a keypress.

So I can't see a safe way to incorporate RAM testing into a wizard-driven mOS intended for unskilled use. It would be lovely to have a boot disk that would do x hours of RAM testing, then test the hard drive for physical errors, then test and possibly fix file system logical errors, before commencing with formal scanning for malware. But without a safe way for the mOS boot to detect whether RAM had been recently (define "recently") tested without errors, the best I could design would be a mOS that booted to a 3-item menu (test RAM, continue with the wizard, or display help) and stayed there until a selection was made.
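The control flow of that 3-item menu is simple enough to sketch. Python is used here only to show the logic; a real boot menu would live in the boot loader, and the `read_key` parameter is a stand-in I've assumed so the example can be driven by scripted keypresses.

```python
MENU = {
    "1": "test RAM",
    "2": "continue with the wizard",
    "3": "display help",
}


def choose(read_key=input) -> str:
    """Block until a valid selection is made; there is deliberately no timeout."""
    while True:
        for key, label in MENU.items():
            print(f"{key}. {label}")
        choice = read_key("Select 1-3: ").strip()
        if choice in MENU:
            return MENU[choice]
        # Anything else, including a bare Enter, redisplays the menu.


# Scripted keypresses standing in for a user: two invalid entries, then "1".
keys = iter(["", "x", "1"])
print("selected:", choose(lambda prompt: next(keys)))
```

The point is what's missing: no countdown and no default action, so after a spontaneous reset the system sits at the menu rather than falling through to anything that touches the hard drive.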

How to get all this sooo wrong

The good news is that the Vista DVD has RAM testing incorporated into the mOS. The bad news is that Microsoft made just about every mOS design mistake possible:
  1. mOS boot will fall through to hard drive boot unless key is pressed
  2. mOS runs a lot of code before the UI from which RAM can be tested
  3. mOS looks for a Vista installation on hard drive before anything else
  4. mOS drops RAM tester on hard drive, then reboots to run it
  5. RAM tester does one pass only, unless this is overridden by user
  6. RAM tester displays no results on screen
  7. RAM tester writes results to Vista installation's logs on hard drive

OK, let's walk through what happens when you test a system that may have bad RAM. Microsoft seems to expect this RAM to be so bad that a single test pass will catch it, even though we know from experience that you may only see one error in 24 hours of testing (mistake 5).

If RAM is so bad that one test pass will always catch it, then it is surely too dangerous to run large complex GUI code (mistake 2), or to read into the logic of a Vista installation on the hard drive (mistake 3). If BIOS standard practice is to halt a system whenever bad RAM is detected, irrespective of what the OS was doing at the time, then surely it is foolhardy to boot up a complex OS from the hard drive (mistake 1), write material to hard drive (mistake 4), especially if the RAM has been proven to be bad (mistake 7)?

What happens if the nature of the defective hardware causes the system to reset, rather than lock up or carry on running so the tester can flag the error? Well, the Vista DVD will chain into the Vista installation on the hard drive and boot that (mistake 1), which is about the worst possible thing one can do - and this will happen even if you had explicitly excluded the hard drive from BIOS bootability, because the DVD boot chains directly into it irrespective of any BIOS-level settings you may have applied.

If the RAM did test bad, how would you know? It seems as if the only way would be by booting Vista from the hard drive and scratching around in Event Viewer. If the process of writing those results into Vista's logs didn't corrupt the contents of the hard drive, then booting Vista (with all the attendant paging, temp files and registry updates this may imply) to reach Event Viewer may well do so.

This is a bit like being a driving license tester faced with a pupil who immediately tries to mash down pedestrians a la Carmageddon at the start of the test. It's nice to see Microsoft (at last!) taking an interest in maintaining sick systems, but the lack of insight displayed is scary.

3 comments:

Anonymous said...

Great article.

Anonymous said...

Really informative. I'm reading this one year later. I have a question: since you (and I too) are not very convinced by software-level memory testers, do you plan to write about hardware-level ones? Thanks, and keep up the great writing.

Chris Quirke said...

I haven't tried any hardware-based testers, and doubt I will go for those, for reasons of cost.

So far, my SOP screening practice (24-hour MemTest, HD Tune HD check before formal malware scans etc.) has missed only twice, AFAIK.

In one case, MemTest picked up the first (only) RAM error beyond 24 hours of testing.

In the other case, a HD that had passed HD Tune SMART, temperature and surface scrutiny, went on to fail catastrophically ("click of death", ?seek failures or disrupted mechanics) during the formal malware scans.

These events highlight the difference between "best practice" and "perfect results", but haven't been sufficiently compelling for me to look beyond MemTest.

In particular, when PCs have remained flaky after passing MemTest, I've yet to see a positive result when swap-testing RAM.

This contrasts with my experience in the XT/286/386 era, when swap testing would regularly point to bad RAM that had passed the RAM testers of the time (BIOS, HiMem, Norton Diagnostics, Checkit and Checkit Pro, PC Probe).

My main problem with "bad RAM tester design" isn't about failure to detect bad RAM, but about whether the testers perform operations that are unsafe in the presence of bad RAM, e.g. writing to hard drive, or booting the hard drive in the event of a spontaneous reset.