Friday, January 25, 2019

the x7b error (stop 0x0000007b, inaccessible boot device) is something that came up a lot when i did vista support, and what i learned is that the training documents don't really reflect what you see in real life. what it means is that something is interfering with the boot process, but it could really be anything. the error codes refer to specific symptoms, but they say much less about actual causes. consequently, there isn't any way to really know what causes an x7b other than to look at the situation critically and take a bunch of educated guesses.

file system corruption was a good guess, because i think i rebooted during a chkdsk that i had forgotten i scheduled, and i've experienced some issues with the drive in the past. and, it did fix a couple of errors. if the laptop boots right up tonight, i may conclude that the chkdsk had some effect, or that the disk was otherwise corrupted in some way (i took the opportunity to reformat the drive, and that itself may have helped), but i'm broadly going to be left with a mystery.

and, the more mysterious it seems, the more that the boot sector virus idea opens itself up, as it would necessarily exist behind a wall of abstraction i can't get at. some kind of stuxnet-like worm could have jumped from the boot sector on the drive to the bios in the pc. and, you'll recall that my bios died on me, apparently randomly. but, a bios virus needs electricity to exist, and if one existed then whatever it was has been drained from the capacitors. the reformats also cleared out the boot sector on the drive, so i think i can state with some confidence that the disk is clear, at this point. is there still something hiding in the bios in the laptop? we'll find out. at the least, it doesn't seem interested in attacking my actual data.

but, i ruled out the boot sector as best as i could, using the tools that existed, as far as i could remember them - both the automated fix (which never works) and the command line tools. there was no sign of anything wrong at this stage.

the next thing to check is a "driver error", but that's an incredibly vague term that necessitates a huge amount of trial and error. i've seen x7bs connected to things like raid setups, sata drivers, video cards, chipsets - essentially anything at all that throws an error before windows loads is going to throw out an x7b. boot logging rarely works properly, and can lead to wrong answers if you take it too seriously. so, what you normally do in this kind of situation is try to figure out what changed and take a good guess that this is the cause. in this scenario, i was taking a hard drive out of a system that wouldn't boot and putting it into a known, working system where windows had access to all of the correct drivers. so, if it isn't booting due to a driver error, it has to be because the driver itself is corrupt or because something is corrupted in the registry - which could have been anything at all. and, the aborted chkdsk produced a potential cause.

so, i backed up the broken install, reinstalled to the questionable partition (to verify that a clean install was capable of booting out of the box - that i wasn't missing a driver, just in case i was wrong), then replaced chunks of the new install with the broken one, waiting for the x7b to come back. this allowed me to check for corruption in the actual drivers by copying them back; it continued to boot, indicating the drivers themselves were not compromised - i didn't have a broken driver, and the error must be in the registry.
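the copy-back routine above can be sketched roughly as follows. this is only an illustration of the idea, not what i actually ran: the chunk names and the `boots` check are hypothetical stand-ins (in real life, "checking" meant rebooting the laptop and watching for the x7b).

```python
from typing import Callable, Iterable, Optional, Set


def find_breaking_chunk(
    chunks: Iterable[str],
    boots: Callable[[Set[str]], bool],
) -> Optional[str]:
    """Restore chunks of the broken install one at a time onto a clean install.

    `chunks` are hypothetical labels for pieces of the old install
    (driver files, registry hives, and so on). `boots` is a stand-in
    for "reboot and see whether the machine comes up"; it receives the
    set of chunks restored so far.

    Returns the first chunk whose restoration makes the boot fail,
    or None if everything was restored and the system still boots.
    """
    restored: Set[str] = set()
    for chunk in chunks:
        restored.add(chunk)
        if not boots(restored):
            return chunk
    return None


# example with made-up chunk names; here the simulated fault lives in the system hive
chunks = ["driver files", "software hive", "system hive"]
culprit = find_breaking_chunk(chunks, lambda restored: "system hive" not in restored)
print(culprit)
```

the point of starting from a clean install is that every step has a known-good baseline, so the first failure points at the last thing copied back.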

normally, a tech agent would tell you to run a system file check (sfc) rather than back up, reformat and copy back to find the problem via trial and error. but, that wouldn't tell you what the problem is - and i wanted to know what was causing the issue. on top of that, an sfc would restore all of the system files i had altered or deleted, possibly to old versions from before an update. it's a brute force solution that might fix your computer quickly, but i would advise avoiding it, because it could cause further problems over time.

via a few sneaky tricks to get the right comparisons over trial and error, i was then able to determine that the issue was specifically in the system hive. but, that is narrowing the issue down to a wormhole in a haystack - i found the right file on the pc, only to have that file be a 30 mb database with thousands of entries, any one of which could be the problem. last known goods weren't working; both control sets seemed fucked. so, it's down to trial and error to pinpoint the problem, yet again. i was eventually able to find the problem in the services key.
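one way to shrink that haystack is to compare the services entries from a working system hive against the broken one and only look at what differs. the sketch below assumes the two sets of entries have already been exported into plain dictionaries somehow (that export step is not shown, and it is not how i did it at the time). the service names are made up; `Start` and `ErrorControl` are the usual per-service values, where a `Start` of 0 means boot-start and 4 means disabled.

```python
from typing import Dict, List, Optional, Tuple

# service name -> {value name -> numeric data}; a simplified, hypothetical
# stand-in for what sits under the services key of a system hive
Services = Dict[str, Dict[str, int]]
Difference = Tuple[str, str, Optional[int], Optional[int]]


def diff_services(working: Services, broken: Services) -> List[Difference]:
    """List every (service, value, working_data, broken_data) that differs.

    A value missing on one side is reported as None on that side.
    """
    differences: List[Difference] = []
    for name in sorted(set(working) | set(broken)):
        w = working.get(name, {})
        b = broken.get(name, {})
        for value in sorted(set(w) | set(b)):
            if w.get(value) != b.get(value):
                differences.append((name, value, w.get(value), b.get(value)))
    return differences


# made-up sample data: a boot-critical storage driver that has been disabled
working = {
    "storage_driver": {"Start": 0, "ErrorControl": 1},
    "video_driver": {"Start": 1, "ErrorControl": 0},
}
broken = {
    "storage_driver": {"Start": 4, "ErrorControl": 1},
    "video_driver": {"Start": 1, "ErrorControl": 0},
}
for difference in diff_services(working, broken):
    print(difference)
```

a diff like this only narrows the candidates; each difference still has to be tested by putting it back and rebooting.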

now, you can argue that this was obvious, and be right in some way, but miss the point altogether. it happens to be that the end fix was simple enough, and didn't require sorting through all of the more exotic registry keys. but, this is kind of just luck. if i had gone directly to the services key and checked by trial and error, and found it was, say, a video card driver i needed to get vga, i would have then needed to check through all of the subkeys that the vga driver calls, which would have created a complicated tree. this is more of a question of approach than anything else. trying to pinpoint the cause using logic may have led me on a time-consuming wild goose chase that would have ended with an offline driver install through importing registry keys - a crazily complicated thing to reverse engineer from scratch. approaching the issue with a gauche trial and error probably actually saved me a lot of time. call it the monte carlo approach to finding registry corruption. but, if you've studied search algorithms, you'll recognize this - a sequential or randomized algorithm can often find something much faster than a sophisticated data-driven model.

in the end, what i found out was that the registry wasn't corrupt at all - that nothing was broken and the computer was doing exactly what it was told to do. there was no way i could have guessed that, or at least not effectively. i mean, i could have set every single driver to ignore on a lark, but then i'd just have to work backwards, anyways. i wasn't looking for a corruption of data, i was looking for careless programming...

again: i still don't know why the laptop didn't boot. it's almost 7:00, almost time to find out. if it was a physical disk problem, or something in the boot sector, it's gone and i'll never know. if it's something else, i'll learn soon enough.