14 January 2006

Bad File System or Incompetent OS?

"Use NTFS instead of FAT32, it's a better file system", goes the knee-jerk. NTFS is a better file system, but not in a sense that every norm in FAT32 has been improved; depending on how you use your PC and what infrastructure you have, FATxx may still be a better choice. All that is discussed here.

The assertion is often made that NTFS is "more robust" than FAT32, and that FAT32 "always has errors and gets corrupted" in XP. There are two apparent aspects to this: NTFS's transaction rollback capability, and inherent file system robustness. But there's a third, hidden factor as well.

Transaction Rollback

A blind spot in file system design is the assumption that the only thing that goes wrong with file systems is the interruption of sane write operations. All of the strategies and defaults in Scandisk and ChkDsk/AutoChk (and the automated handling of "dirty" file system states) are based on this assumption.

When sane file system writes are interrupted in FATxx, you are either left with a length mismatch between the FAT chaining and the directory entry (in which case the file data will be truncated), or a FAT chain that has no directory entry (in which case the file data may be recovered as a "lost cluster chain"). It's very rare that the two FATs will fail to match (the benign "mismatched FAT" case, and the only case where blindly copying one FAT over the other is safe). After repair, you are left with a sane file system; the data you were writing is flagged and logged as damaged (and therefore repaired), and you know to treat that data with suspicion.
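To make that length check concrete, here's a minimal sketch in Python; it works on a toy in-memory FAT rather than a real on-disk volume, and every name and value in it is invented for illustration.

    # Walk a FAT32 cluster chain and compare its length against the file size
    # recorded in the directory entry. Toy in-memory FAT, not an on-disk parser.
    EOC = 0x0FFFFFF8  # entry values >= this mark end-of-chain in FAT32

    def chain_length(fat, first_cluster):
        count, cluster, seen = 0, first_cluster, set()
        while cluster < EOC:
            if cluster in seen:
                raise ValueError("circular FAT chain")  # itself a corruption
            seen.add(cluster)
            count += 1
            cluster = fat[cluster] & 0x0FFFFFFF  # top 4 bits are reserved
        return count

    def entry_matches_chain(fat, first_cluster, file_size, bytes_per_cluster):
        clusters_needed = -(-file_size // bytes_per_cluster)  # ceiling division
        return chain_length(fat, first_cluster) == clusters_needed

    # Toy FAT where cluster 2 chains to 3, then ends: a two-cluster file.
    fat = [0, 0, 3, 0x0FFFFFFF]
    print(entry_matches_chain(fat, 2, file_size=5000, bytes_per_cluster=4096))  # True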

When sane file system writes are interrupted in NTFS, transaction rollback "undoes" the operation. This assures file system sanity without having to "repair" it (in essence, the repair is automated and hidden from you). It also means that all data that was being written is smoothly and seamlessly lost. The small print in the articles on Transaction Rollback makes it clear that only the metadata is preserved; the "user data" (i.e. the actual content of the file) is not.
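A crude sketch of the idea, in Python, shows why rollback buys sanity at the price of the data being written. This is nothing like NTFS's actual on-disk log format; every name below is invented.

    # Metadata-only write-ahead logging: a "begin" record is logged before the
    # data write, a "commit" after it. Recovery rolls back any uncommitted
    # operation, so the structure is sane but the interrupted file is gone.
    class ToyJournalFS:
        def __init__(self):
            self.log = []      # survives the simulated crash
            self.files = {}    # name -> content

        def write_file(self, name, data, crash=False):
            self.log.append(("begin", name))
            if crash:
                raise RuntimeError("power failure")  # interrupt mid-operation
            self.files[name] = data                  # user data, not journaled
            self.log.append(("commit", name))

        def recover(self):
            committed = {n for op, n in self.log if op == "commit"}
            for op, n in self.log:
                if op == "begin" and n not in committed:
                    self.files.pop(n, None)          # roll the operation back

    fs = ToyJournalFS()
    try:
        fs.write_file("report.txt", b"hours of work", crash=True)
    except RuntimeError:
        fs.recover()
    print(fs.files)  # {} -- sane metadata, and the data is seamlessly lost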

Inherent Robustness

What happens when other things cause file system corruption, such as insane writes to disk structures, arbitrary sectors written to the wrong addresses, physically unrecoverable bad sectors, or malicious malware payloads a la Witty? That is the true test of file system robustness, and survivability pivots on four things: redundant information, documentation, OS accessibility, and data recovery tools.

FATxx redundancy includes the comparison of file data length as defined in the directory entry vs. the FAT cluster chaining, and the dual FATs that protect the chaining information, which cannot be deduced from anything else should it be lost. Redundancy is required not only to guide repair, but to detect errors in the first place - each cluster address should appear only once within the FAT and the collected directory entries, i.e. each cluster should be part of the chain of one file or the start of the data of one file, so it is easy to detect anomalies such as cross-links and lost cluster chains.
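That each-cluster-referenced-once rule is easy to express in code. Here's a sketch, again against a toy in-memory FAT (a real checker would also have to guard against circular chains), that flags both anomalies:

    # Every cluster should be referenced exactly once. Counting references
    # from all directory entries' chains exposes cross-links (count > 1) and
    # lost chains (allocated in the FAT, but reachable from no entry).
    from collections import Counter

    EOC = 0x0FFFFFF8

    def clusters_of(fat, first):
        c = first
        while 0 < c < EOC:
            yield c
            c = fat[c] & 0x0FFFFFFF

    def audit(fat, dir_first_clusters):
        refs = Counter()
        for first in dir_first_clusters:
            for c in clusters_of(fat, first):
                refs[c] += 1
        cross_linked = sorted(c for c, n in refs.items() if n > 1)
        allocated = {c for c in range(2, len(fat)) if fat[c] != 0}
        lost = allocated - set(refs)
        return cross_linked, lost

    # File A chains 2 -> 3; file B also starts at 3 (a cross-link);
    # cluster 5 is allocated but unreachable (a lost cluster chain).
    fat = [0, 0, 3, 0x0FFFFFFF, 0, 0x0FFFFFFF]
    print(audit(fat, [2, 3]))  # ([3], {5})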

NTFS redundancy isn't quite as clear-cut, extending as it does to duplication of the first 5 records of the Master File Table (MFT). It's not clear what redundancy there is for anything else, nor are there tools that can harness it in a user-controlled way.

FATxx is a well-documented standard, and there are plenty of repair tools available for it. It can be read from a large number of OSs, many of which are safe for at-risk volumes, i.e. they will not initiate writes to the at-risk volume of their own accord. Many OSs will tolerate an utterly deranged FATxx volume, simply because unless you initiate an action on that volume, the OS will ignore it. Such OSs are a safe platform for your recovery tools, which include interactively-controllable file system repair tools such as Scandisk.

NTFS is undocumented at the raw bytes level because it is proprietary and subject to change. This is an unavoidable side-effect of deploying OS features and security down into the file system (essential if such security is to be effective), but it does make it hard for tools vendors. There is no interactive NTFS repair tool such as Scandisk, and what data recovery tools there are, are mainly of the "trust me, I'll do it for you" kind. There's no equivalent of Norton DiskEdit, i.e. a raw sector editor with an understanding of NTFS structure.

More to the point, accessibility is fragile with NTFS. Almost all OSs depend on NTFS.SYS to access NTFS, whether that be XP (including Safe Mode, Command Prompt Only), the bootable XP CD (including the Recovery Console), the Bart PE CDR, MS WinPE, Linux that uses the "capture" approach of shelling NTFS.SYS, or Sysinternals' "Pro" (writable) feeware NTFS drivers for DOS mode and the Win9x GUI.

This came to light when a particular NTFS volume started crashing NTFS.SYS with STOP 0x24 errors in every context tested (I didn't test Linux or feeware DOS/Win9x drivers). For starters, that makes ChkDsk impossible to run, washing out MS's advice to "run ChkDsk /F" to fix the issue, possible causes of which are sanguinely described as including "too many files" and "too much file system fragmentation".

The only access I could get was via BING (www.bootitng.com), which tested the file system as a side-effect of imaging it off and resizing it (it passed with no errors), and two DOS mode tactics: the LFN-unaware ReadNTFS utility, which allows files and subtrees to be copied off one at a time, and full LFN access gained by loading first an LFN TSR, then the freeware (read-only) NTFS TSR. Unfortunately, XCopy doesn't see LFNs via the LFN TSR, and Odi's LFN Tools don't work through drivers such as the NTFS TSR, so files had to be copied one directory level at a time.

These tools are described and linked to from here.

FATxx concentrates all "raw" file system structure at the front of the disk, making it possible to back up this structure, and to drop in variations of it, while leaving file contents undisturbed. For example, if the FATs are botched, you can drop in alternate FATs (i.e. embodying different repair strategies) and copy off the data under each. It also means the state of the file system can be snapshotted in quite a small footprint.
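As a sketch of just how small that footprint is, the following Python reads the standard FAT32 BPB fields and saves everything from the boot sector through the last FAT copy. It assumes a raw, unmounted FAT32 volume or image file; the paths are hypothetical.

    # Save the reserved region (boot sector, FSInfo) plus all copies of the
    # FAT -- the entire file system structure -- into one small snapshot file.
    import struct

    def snapshot_fat_region(volume_path, out_path):
        with open(volume_path, "rb") as v:
            boot = v.read(512)
            bytes_per_sector = struct.unpack_from("<H", boot, 11)[0]
            reserved_sectors = struct.unpack_from("<H", boot, 14)[0]
            num_fats         = boot[16]
            sectors_per_fat  = struct.unpack_from("<I", boot, 36)[0]  # FAT32 field
            size = (reserved_sectors + num_fats * sectors_per_fat) * bytes_per_sector
            v.seek(0)
            data = v.read(size)
        with open(out_path, "wb") as out:
            out.write(data)  # restoring a variant is the same copy in reverse
        return size          # typically a few MB, vs. raw-imaging the volume

    # snapshot_fat_region("fat32-volume.img", "structure-snapshot.bin")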

In contrast, NTFS sprawls its file system structure all over the place, mixed in with the data space. This may remove the performance impact of "back to base" head travel, but it means the whole volume has to be raw-imaged off to preserve the file system state. This is one of several compelling arguments in favor of small volumes, if planning for survivability.

OS Competence

From reading the above, one wonders whether NTFS really is more survivable or robust than FATxx. One also wonders why NTFS advocates have such bad mileage with FATxx, given there's little in the file system's structural design to account for this. The answer may lie here.

We know XP is incompetent in managing FAT32 volumes over 32G in size, in that it is unable to format them. If you do trick XP into formatting a volume larger than 32G as FAT32, it fails in the dirtiest, most destructive way possible: it begins the format (thus irreversibly clobbering whatever was there before), grinds away for ages, and then dies with an error when it gets to 32G. Note that the 32G cutoff is a policy decision, not a structural limit; FAT32 itself can address volumes up to 2T. This standard of coding is so bad as to look like a deliberate attempt to create the impression that FATxx is inherently "bad".

But try this on a FATxx volume; run ChkDsk on it from an XP command prompt and see how long it takes, then right-click the volume and go Properties, Tools and "check the file system for errors" and note how long that takes. Yep, the second process is magically quick; so quick, it may not even have time to recalculate free space (count all FAT entries of zero) and compare that to the free space value cached in the FAT32 boot record.
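The recount itself is trivial, as this sketch shows; it pulls the cached count from the FSInfo sector and tallies the zero entries in the FAT for comparison. Again, a raw FAT32 image is assumed and the path is hypothetical.

    # Compare the FAT32 FSInfo sector's cached free-cluster count against an
    # actual count of zero (free) FAT entries. A stale cache is exactly the
    # kind of cheap check a "quick" scan could do, but appears not to.
    import struct

    def free_space_check(volume_path):
        with open(volume_path, "rb") as v:
            boot = v.read(512)
            bps       = struct.unpack_from("<H", boot, 11)[0]  # bytes/sector
            reserved  = struct.unpack_from("<H", boot, 14)[0]
            spf       = struct.unpack_from("<I", boot, 36)[0]  # sectors per FAT
            fsinfo_at = struct.unpack_from("<H", boot, 48)[0]

            v.seek(fsinfo_at * bps)
            cached_free = struct.unpack_from("<I", v.read(512), 488)[0]

            v.seek(reserved * bps)
            fat = v.read(spf * bps)
        entries = struct.unpack("<%dI" % (len(fat) // 4), fat)
        counted_free = sum(1 for e in entries[2:] if e & 0x0FFFFFFF == 0)
        return cached_free, counted_free  # real tools clamp to cluster count

    # print(free_space_check("fat32-volume.img"))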

Now test what this implies: deliberately hand-craft errors in a FATxx file system, do the right-click "check for errors", note that it finds none, then drop to DOS mode, run Scandisk, and see what that finds. Riiight... perhaps the reason FATxx "always has errors" in XP is that XP's tools are too brain-dead to fix them?
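Hand-crafting such an error takes one write to a scratch image (never a live volume!). Here's a hypothetical helper that stamps a bogus value into a FAT entry, after which each checker can be run against the image to see what it notices:

    # Deliberately corrupt one FAT32 entry in a scratch image so the behaviour
    # of different checkers can be compared on a known-bad file system.
    import struct

    def corrupt_fat_entry(image_path, cluster, bogus=0x0BADF00D):
        with open(image_path, "r+b") as v:
            boot = v.read(512)
            bps      = struct.unpack_from("<H", boot, 11)[0]
            reserved = struct.unpack_from("<H", boot, 14)[0]
            v.seek(reserved * bps + cluster * 4)  # FAT32 entries are 4 bytes
            v.write(struct.pack("<I", bogus))     # an out-of-range chain link

    # corrupt_fat_entry("scratch.img", cluster=3)  # then run each checker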

My strategy has always been to build on FATxx rather than NTFS, and retain a Win9x DOS mode as an alternate boot via Boot.ini - so when I want to check and fix file system errors, I use DOS mode Scandisk, rather than XP's AutoChk/ChkDsk (I suppress AutoChk). Maybe that's why I'm not seeing the "FATxx always has errors" problem? Unfortunately, DOS mode and Scandisk can't be trusted beyond 137G (the 28-bit LBA barrier: 2^28 sectors x 512 bytes is about 137 billion bytes), so there's one more reason to prefer small volumes.
