15 September 2006

How To Design a mOS

A maintenance OS (mOS) is one that you can use when you daren't trust your system to boot into the OS that is installed on it. Through the DOS and Win9x years, we were used to diskette-booted DOS in this role - but NTFS, > 137G hard drives, USB etc. make this less useful in XP.

As at September 2006, Microsoft provide no mOS for modern Windows, but you can build one for yourself by using Bart PE Builder (and perhaps you should!). Out of the box, Bart CDR meets the criteria for a safe mOS, but you can botch this when "enhancing" it.

There are all sorts of jobs one can do from a mOS, but mainly, it's:
  • Diagnostics
  • Data recovery
  • Malware management
You may need to do all three, when approaching carried-in PCs that "don't work".

Re-establishing safe functioning

Running a PC assumes various levels of functionality work perfectly. When a PC "doesn't work", one has to re-establish each of these in turn, before one can stand on each to reach the next. At each stage, one has to not use what cannot yet be trusted.

Is it safe to plug into the mains?

PCs with metallic rattles when shaken may not be - a loose metal object could short out circuitry and burn it out. It's best to check inside the case for loose objects; salty wet dust; metal objects, flakes or rinds; and power connectors dangling onto pins on circuit boards. Also check that the power supply is set to the correct mains voltage, and that rain didn't fall into the case and power supply while the PC was being carried in.

Is the hardware logic safe?

This mainly goes about RAM, but implicit in a 12-hour RAM test is a test to see whether the PC can stay running that long, or will spontaneously reset or hang. The ideal RAM checker would also display processor and motherboard temperatures, and possibly operating voltages, best served with latched lowest and highest detected values.
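
As a rough illustration of what "latched lowest and highest detected values" means, here is a minimal Python sketch. It is not taken from any real RAM tester; the class name, the field names and the temperature samples are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LatchedReading:
    """Holds the latest value plus the lowest and highest values seen so far."""
    name: str
    current: Optional[float] = None
    lowest: Optional[float] = None
    highest: Optional[float] = None

    def update(self, value: float) -> None:
        # Record the new sample and widen the latched range if needed.
        self.current = value
        if self.lowest is None or value < self.lowest:
            self.lowest = value
        if self.highest is None or value > self.highest:
            self.highest = value

    def summary(self) -> str:
        return f"{self.name}: now {self.current}, min {self.lowest}, max {self.highest}"


# Hypothetical samples; a real tester would read these from hardware sensors.
cpu_temp = LatchedReading("CPU temp (C)")
for sample in [41.0, 55.5, 48.0, 39.5, 61.0]:
    cpu_temp.update(sample)
print(cpu_temp.summary())
```

The point of latching is that a brief overnight spike or sag is still on screen in the morning, even if the current value looks normal again.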

Is the hard drive safe to use?

That goes about the physical condition of the hard drive, and is tested retrospectively by looking at the S.M.A.R.T. details, and also by test-reading every sector on the drive. It's important not to beat the drive to death; ideally, the surface test should avoid getting stuck in retry loops when a failing sector is encountered, and should abort when the first bad sector is found. The testing process should not attempt to "fix" anything!
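
A minimal Python sketch of that read-only, stop-at-first-failure policy follows. It does not describe how any particular tool works; the 512-byte sector size, the chunk size and the function name are assumptions for illustration, and the drive's own firmware may still retry internally before the read call reports an error.

```python
import os
from typing import Optional

SECTOR_SIZE = 512      # assumed sector size in bytes
CHUNK_SECTORS = 128    # assumed number of sectors per read


def first_unreadable_offset(path: str, chunk_sectors: int = CHUNK_SECTORS) -> Optional[int]:
    """Read the target sequentially and read-only.

    Returns the byte offset of the first chunk that fails to read,
    or None if every chunk could be read. Nothing is ever written.
    """
    chunk_bytes = SECTOR_SIZE * chunk_sectors
    offset = 0
    # "rb" opens read-only; buffering=0 avoids Python-level read-ahead.
    with open(path, "rb", buffering=0) as dev:
        while True:
            try:
                data = dev.read(chunk_bytes)
            except OSError:
                # Abort at the first failure: no retries, no "fixing".
                return offset
            if not data:
                return None    # reached the end without errors
            offset += len(data)
```

The path would be a raw device or an image file, whatever the platform exposes. The result locates the failing chunk rather than the exact sector, which is enough to decide that the drive should be evacuated rather than tested further.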

Is the hard drive safe to write to?

Certain contexts (e.g. requests to recover deleted data) define the hard drive as being unsafe to write to, because material outside the file system's mapped space is not protected from overwrites. Otherwise, the drive may be considered safe for writes if the file system contains no physical errors, plus the hardware and physical hard drive must pass their tests.

Is the hard drive installation safe to run?

In addition to all of the above, this requires the presence of active malware to be excluded - and in practice, this may form the bulk of your mOS use. There are many challenges here, given that even a combination of anti-malware scanners is likely to miss some things that you'd have to look for and manage by hand.

Is it safe to network?

This goes about what's on the rest of the network (i.e. are all other computers on the LAN clean, and is WiFi allowing arbitrary computers to join this network?) and whether your system is adequately separated (NAT, firewall, patching of network edge code) from the 'net. The latter question has to be asked twice; for the mOS (if you are networking from it) and for the hard drive installation when this is finally booted again.
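
The sequence of questions above can be thought of as an ordered checklist, where nothing past the first failed stage is trusted or even attempted. A small Python sketch of that idea follows; the stage names paraphrase the questions above, and the pass/fail results are placeholders rather than real tests.

```python
from typing import Callable, List, Optional, Tuple

# Each stage is a name plus a function that returns True if the stage passed.
Check = Tuple[str, Callable[[], bool]]


def first_untrusted_stage(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the name of the first one that fails.

    Later checks are not run at all, because they would be standing on
    something that has not yet been re-established as safe.
    """
    for name, passed in checks:
        if not passed():
            return name
    return None


# Placeholder results for illustration only.
ladder = [
    ("safe to plug into the mains", lambda: True),
    ("hardware logic", lambda: True),
    ("physical hard drive", lambda: False),
    ("safe to write to the hard drive", lambda: True),
    ("safe to run the installation", lambda: True),
    ("safe to network", lambda: True),
]
print(first_untrusted_stage(ladder))
```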

Boot safety

Many boot CDs are not safe, because they will automatically chain into booting the hard drive unless a key is pressed within a short time-out period. This is particularly dangerous, given that the chaining process ignores CMOS settings that would otherwise define what hard drives are visible, what device should boot next, or whether the hard drive should boot at all.

Every bootable Windows installation disk from Microsoft fails this test. Standard Bart PE is safe here, but has a plugin setting that can select the same automatic chaining to hard drive behavior. The Bart-based Avast! antivirus scanning CD enables this, and thus fails the test, as may other Bart-based boot disk projects.

Many mOS tasks take a lot of unattended clock time to run, starting with RAM testing, then hard drive surface testing, then virus scanning or searches for data to recover. If anything should cause the system to reset (remember, this is a sick PC being maintained) then it will fall through to boot the hard drive, thus running possibly infected code in possibly bad RAM that writes to an at-risk hard drive and file system. Disaster!

Even if you have tested RAM, hard drive etc. and now consider the hardware to be trustworthy, an unexpected reset will usually dispel that trust. The only safe thing for a mOS boot disk to do under such circumstances is to stop and wait for a keypress (with no time-out fall-through).

It's tempting to have a mOS disk boot straight into a RAM check, as that's generally what one should do after unexpected lockups or resets, but that can make it easy to miss spontaneous resets during an overnight RAM test. You'd wake up, see the test still running and no errors found, but for all you know it may have reset and restarted the test a dozen times.

Testing RAM

At the time one tests RAM and perhaps core motherboard and processor logic, one can assume nothing to be safe. So the mOS and the programs you run from it should not write to the hard drive, or even read it (as a bad-RAM bit-flip can change a "read disk" to a "write disk").

I haven't figured out how to integrate RAM testers such as MemTest86, MemTest86+, SIMMTester etc. into the same CDR as Bart, so I use a separate CDR for this. I then remove the CDR after it's booted and swap it for another that will boot but not access the hard drive, such as a different RAM tester or a DOS boot CDR.

I'd love a RAM tester that showed system temperatures, but I haven't seen one that does.

Hardware compatibility

One would prefer a mOS that works on any hardware without having to have "special" drivers added to it, and Bart generally passes this test, unless oddball add-on hard drive cards or RAID are in use. Even S-ATA hard drives on the current i945 chipsets will work from Bart.

Bart will detect USB storage devices at boot time, but won't detect changes to these thereafter. So you'd have to insert a USB stick before boot, and not pull it out, swap it, add others, or add the same one back after changing the contents elsewhere. However, Bart treats card reader devices as containing removable "disks", so you can add and swap SD cards etc. quite happily. For this and other reasons, I generally use SD cards instead of USB sticks.

You cannot remove the Bart disk during a Bart session, and that means no burning to CDRs from most PCs.

Memory management

A mOS has to take no risks that are not initiated by the user, and on a sick PC, everything is a risk until testing and management re-establishes it as safe.

So a mOS should not make assumptions about the hard drive contents; automatically access, "grope" material or run code from the hard drive, or commence networking. That also means not using the hard drive for swapping to virtual memory or temp file workspace - and that makes memory management a challenge, especially when some of the available RAM is already used as a RAM drive.

A standard Bart CDR will create a small RAM drive and locate Temp files there, and will prompt before commencing networking. I've modified mine to leave networking inactive, and added on-demand facilities to change RAM drive size, relocate Temp location, create a page file on a selected hard drive volume, and start networking if required.

My usual SOP is then to divert Temp to a newly-created location on the hard drive, once I've tested the physical hard drive and logical file system. If RAM is low, I shrink the RAM disk and create a page file on the hard drive, before starting programs that will need Temp workspace (e.g. anti-malware scanners that extract archives to scan the contents).

Testing hard drive

The usual advice is to use hard drive vendors' tools, or ChkDsk /R. Neither is really acceptable, but for different reasons.

Hard drive vendor tools tend to display a summary S.M.A.R.T. report, which can be "OK" even when S.M.A.R.T. detail shows multiple failed sectors have been detected and "fixed". The surface scan may be useful, as long as it doesn't "fix" anything. Then there may be "deeper" tests that are data-destructive, such as "write zeros to disk" or a pseudo-"low level format".

ChkDsk /R is unacceptable because it's orientated to "fixing" things without prompting you for permission. First it tests the file system logic and "fixes" it, so that when it tests the surface of the disk, it can "fix" bad clusters by re-writing the contents elsewhere in the file system. All of which is unacceptably destructive if you'd rather have recovered data first.

Instead of these, I use HD Tune for Windows, which will run from Bart CDR just fine. It ignores the contents of the hard drive entirely, reports S.M.A.R.T. detail that is updated in real time even during other tests, can test hard drives over USB and memory cards (neither will show S.M.A.R.T.), and displays the hard drive's operating temperature (again, updated in real time) no matter which test is currently in progress.

Testing file system and data recovery

I haven't any good tools for NTFS, alas, so I use ChkDsk without any parameters that would cause it to "fix" anything. If the file system is FATxx and hard drive is < 137G, I prefer to use DOS mode Scandisk, as that allows interactive repair, and DiskEdit for when I'd rather do such repairs manually.

If data is to be recovered, I have a few semi-automatic tools in my Bart that are sometimes effective - but before using them, I prefer to copy off files and do a BING image backup of any NT-family partition that is to remain bootable.

I usually keep core user data on a 2G FAT16 volume, so if that requires data recovery, it's small enough to peel off as raw CDR-sized slabs using DiskEdit. I can then reformat the stricken data volume and get the PC back into the field, while I operate on the volume as pasted onto a different and working hard drive. FAT16's large data clusters mean any files that can fit in a single cluster, can be recovered intact even if the FATs are trashed.
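
To put a number on "large data clusters", here is a small Python sketch of the arithmetic. It assumes 512-byte sectors and a FAT16 limit of 65524 data clusters (my understanding of Microsoft's FAT specification), and it ignores the space taken by the boot sector, the FATs and the root directory. The function names are made up for the example.

```python
SECTOR = 512                  # assumed sector size in bytes
MAX_FAT16_CLUSTERS = 65524    # assumed upper limit on FAT16 data clusters


def smallest_fat16_cluster_size(data_bytes: int) -> int:
    """Smallest power-of-two cluster size that keeps the cluster count within FAT16's limit."""
    size = SECTOR
    # -(-a // b) is ceiling division: clusters needed to cover data_bytes.
    while -(-data_bytes // size) > MAX_FAT16_CLUSTERS:
        size *= 2
    return size


def fits_in_one_cluster(file_bytes: int, cluster_bytes: int) -> bool:
    """A file this small occupies a single cluster, so no FAT chain is needed to read it."""
    return file_bytes <= cluster_bytes


# A "2G" FAT16 volume is in practice slightly under 2 GiB; 2047 MiB is used here.
cluster = smallest_fat16_cluster_size(2047 * 1024 * 1024)
print(cluster)                                 # 32768
print(fits_in_one_cluster(20_000, cluster))    # True
print(fits_in_one_cluster(40_000, cluster))    # False
```

So on a volume of that size, any file up to 32 KiB lives entirely in the cluster its directory entry points to, which is why it survives trashed FATs as long as the directory entry itself is intact.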

Malware management

A mOS will often have to work on infected systems, so it must never run code from them unless the user explicitly initiates this. That requirement goes beyond not booting from the hard drive, to not including the hard drive in the Path, and not handling material on the hard drive in a "rich" enough way to expose exploitable surfaces.

A mOS should not "grope" the hard drive for other reasons, e.g. in case some of the material includes bad sectors that would bog the mOS down in retry loops, or cause it to crash on deranged file system logic. When your file manager of choice lists files, you want no scratching in file content for icons or metadata.

Standard Bart is safe in this regard. There's no "desktop" in the hard drive file system sense, and the file managers that are included do not grope metadata when they "list" files. However, many Bart projects use XPE or similar to improve the UI by using Explorer.exe as the shell; I prefer not to do this, because doing so may expose exploitable surfaces.

A mOS should perform no automatic disk access - thus no indexing service, no System Restore, no resident antivirus and no thumbnailling.

Many malware scanners and integration checkers require registry access, and that is complicated when you have booted from a different OS installation. If simply used as-is, these tools would report results based on the Bart CDR's registry, not the one on the hard drive.

The solution for Bart is the RunScanner plugin. This redirects registry access to the hard drive installation for the tool that is run through it, but not child processes that this tool may launch. There are parameters to specify which hives to use, and to delay the switch from Bart to hard drive hives so that the tool can initialize itself according to the former before use on the latter.

Any tests that rely on run-time behavior (such as LSPFix, some driver and service managers, and most rootkit scanners) will not return meaningful results during a mOS session (unless you wish to test the behavior of the mOS). In particular, drivers and services may list a mixture of "live" and registry-derived results, thus blending these from the mOS and hard drive. Interpret such results with care.

Any changes you make from mOS will not be monitored by the hard drive installation. This is generally desirable, as it prevents malware intervention, or Windows itself updating registry references so that malware may remain integrated. But it also means no System Restore undoability, and the quarantine material from various scanners may be lost, and/or not work when attempts are made to restore these later.

For this reason, I usually scan to kill when dealing with intrafile code infectors and other hard-core malware, but scan to detect only, when it comes to commercial malware that I expect to pose more problems due to botched removal than malicious persistence. I defer clean-up of those to a later Safe Mode Cmd Only boot, so that undoability is maintained.

When it comes to rootkits, these are exposed to normal scanning just like any other inert file. Tools that aim to detect rootkit behavior will not have any such behavior to detect, unless the mOS has triggered the malware into action. It can also help to save integration checks (such as HiJackThis or Nirsoft utility logs) as redirected by RunScanner and compare these with logs saved from Safe Mode or normal Windows. Unexplained differences may suggest rootkit activity during your "Safe" or normal Windows sessions, unless the mOS tests were done based on the mOS's registry rather than the hard drive's hives.
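
The log comparison can be done by eye, but a rough Python sketch of automating it follows. The function name and the sample entries are invented for illustration (they only imitate the look of a HiJackThis line), and real logs would first need their headers, timestamps and running-process lists stripped, since those differ between sessions anyway.

```python
from typing import Iterable, List, Set, Tuple


def _normalise(lines: Iterable[str]) -> Set[str]:
    # Trim whitespace, drop blank lines, and ignore case (Windows paths are case-insensitive).
    return {line.strip().casefold() for line in lines if line.strip()}


def compare_logs(mos_lines: Iterable[str], live_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (entries only in the mOS log, entries only in the live Windows log)."""
    mos = _normalise(mos_lines)
    live = _normalise(live_lines)
    return sorted(mos - live), sorted(live - mos)


# Invented sample entries, not taken from any real log.
mos_log = [
    r"O4 - HKLM\..\Run: [Updater] C:\WINDOWS\upd.exe",
    r"O4 - HKLM\..\Run: [Hidden] C:\WINDOWS\system32\hid.exe",
]
live_log = [
    r"O4 - HKLM\..\Run: [Updater] C:\WINDOWS\upd.exe",
]

only_mos, only_live = compare_logs(mos_log, live_log)
print("Seen from mOS only:", only_mos)
print("Seen from live Windows only:", only_live)
```

An entry that shows up from the mOS but not from the running installation is the interesting case, as something in the live session may be hiding it.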

Beyond the mOS session

A mOS disk can be useful even when not being used as a mOS. For example, it can Autorun to provide tools for use from Windows, be used as storage space for updates and installables, and can operate as a diskette builder for tasks the mOS cannot do from itself.

As an example of the last, my own Bart CDR can spawn bootable diskettes for BING, RAM testers, and various DOS boot disks containing various tools. The DOS boot diskettes can then access the Bart CDR and thus extend the range of available tools via an appropriate Path.

I also set up my Bart so that I can test the menu UI against the output build, even before it is committed to disk, and the installation of some tools can double up to be run from both the host system and from Bart CDRs built from it. This is accomplished mainly by careful use of base-relative paths within the nu2menu (the native shell for standard Bart) and batch file logic.

I've found nu2menu to be useful in its own right, and use a stand-alone menu to manage the entire Bart-building process - updating the scanners, selecting wallpaper and UI button graphics, editing and testing the nu2menus, accessing Bart forums and plugin documentation, and building the CDRs themselves.
