I have posted here recently about problems I had reconfiguring initramfs and getting VMware Workstation Pro to launch/compile, and now I think my next step is to reboot in emergency mode and run fsck interactively on my file systems. I have 20 TB of disk, but haven't chunked it up since I first installed Ubuntu MATE, and that may be an oversight on my part.
My latest attempt was to completely remove all vestiges of VMware Workstation and do a fresh re-install. I found directions on how to do this, and have done it successfully before. Not this time.
VMware Workstation Pro is delivered as a .bundle file, which is simply a script to be executed with elevated privileges. So, I cheerily went forth and issued sudo ./VMware-Workstation-blah-blah-blah.bundle, and off it went.
However, it's now stalled at 51%. I've been waiting for the progress bar to continue, but it doesn't move. Just like the other processes I ran, it simply stalls without throwing any errors.
So, even though booting my server is an automatic process, I believe I can enter "emergency boot mode" at the startup screen by pressing Escape, then 'e' to edit the boot command. Once at the GRUB script, I should just add systemd.unit=emergency.target. Then, if all goes well, I should be able to run fsck on my file systems and see what's what.
You mention that you have "20Tb of disk"?! That's a lot! Is that a single disk (given the huge size, I'm assuming it's an HDD and not an SSD) or is it some kind of RAID setup with several disks?
Before running "fsck" (filesystem check), I would first:
1 - Check the /var/log/syslog and search for disk errors there (messages like "DRDY ERR" and/or "Unrecovered read error" and/or "I/O error" are likely indicators of disk problems and/or a bad cable).
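A minimal sketch of that log search (the error strings are the ones mentioned above; the sample log line is invented for demonstration, and on a real system you'd grep /var/log/syslog itself):

```shell
# Disk-error signatures mentioned above, as one extended regex.
pattern='DRDY ERR|Unrecovered read error|I/O error'

# On a live system (needs read access to the log):
#   grep -E "$pattern" /var/log/syslog

# Demonstration against an invented sample log line:
sample='kernel: blk_update_request: I/O error, dev sda, sector 12345'
matches=$(printf '%s\n' "$sample" | grep -Ec "$pattern")
echo "disk-error lines found: $matches"
```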
2 - Do you have a good backup of your data? If not, I suggest you create a backup first, before you run a filesystem check, especially if that filesystem check will try to repair things. The article "Repair a damaged filesystem" - https://help.ubuntu.com/stable/ubuntu-help/disk-repair.html.en - includes, among other useful information, the following warning:
"(...) Possible data loss when repairing
If the filesystem structure is damaged it can affect the files stored in it. In some cases these files can not be brought into a valid form again and will be deleted or moved to a special directory. It is normally the lost+found folder in the top level directory of the filesystem where these recovered file parts can be found.
If the data is too valuable to be lost during this process, you are advised to back it up by saving an image of the volume before repairing.
This image can be then processed with forensic analysis tools like sleuthkit to further recover missing files and data parts which were not restored during the repair, and also previously removed files. (...)"
Hard disks have a built-in health-check tool called SMART (Self-Monitoring, Analysis, and Reporting Technology), which continually checks the disk for potential problems. SMART also warns you if the disk is about to fail, helping you avoid loss of important data.
Although SMART runs automatically, you can also check your disk’s health by running the Disks application:
Check your disk’s health using the Disks application
1. Open Disks from the Activities overview.
2. Select the disk you want to check from the list of storage devices on the left. Information and status of the disk will be shown.
3. Click the menu button and select SMART Data & Self-Tests…. The Overall Assessment should say "Disk is OK".
4. See more information under SMART Attributes, or click the Start Self-test button to run a self-test. (...)"
(Alternatively, if you're familiar with the "smartctl" command that comes with the "smartmontools" package, you can also use that for checking the "SMART" Data and doing the "SMART Self-tests")
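For reference, a sketch of how that could look with "smartctl" (assuming the smartmontools package is installed; /dev/sda is a placeholder device name, and the sample output line below is invented for demonstration):

```shell
# Typical smartctl invocations (placeholders; need root and a real device):
#   sudo smartctl -H /dev/sda          # overall health assessment
#   sudo smartctl -t short /dev/sda    # start a short self-test
#   sudo smartctl -a /dev/sda          # full SMART attributes and logs

# Demonstration: extracting the health verdict from a sample output line.
sample='SMART overall-health self-assessment test result: PASSED'
verdict=$(printf '%s\n' "$sample" | awk -F': ' '/overall-health/ {print $2}')
echo "SMART verdict: $verdict"
```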
4 - If you don't already have a good backup of your data, you may consider using "ddrescue" (I've never used it) - Ddrescue - GNU Project - Free Software Foundation (FSF) - to create a disk image before running "fsck" in a mode that may try to repair things. HOWEVER, creating a disk image can take a LOT of time, and using "ddrescue" on a 20 TB disk like yours will very likely take a REALLY LONG time! There's the following post here in the "Ubuntu MATE Community" regarding "ddrescue", posted on 10th April 2018 by @andyp6 :
Other than that: your approach looks good to me (entering "emergency boot mode" at the startup screen by pressing Escape, then 'e' to edit the boot command, and adding systemd.unit=emergency.target at the GRUB script), as described, for instance, in the article How to Boot Ubuntu 20.04 LTS in Rescue / Emergency Mode, particularly in the section "Booting Ubuntu 20.04 LTS in Emergency Mode" of that article.
Having said that, in Emergency Mode I would first run "fsck" in its "dry run" mode by using the -N switch (uppercase N) that, according to "man fsck", means "Don't execute, just show what would be done." I would also add the "-V" switch (uppercase V) to get verbose output, check the results, and only then run the "fsck" command again without the -N switch.
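A small sketch of that sequence (/dev/sdXY is a placeholder partition name; the command is only assembled and printed here, not executed):

```shell
# Placeholder partition; replace with the real one (e.g. from `lsblk`).
device='/dev/sdXY'

# Dry run first: -N = "Don't execute, just show what would be done",
# -V = verbose output (both per `man fsck`).
fsck_dry_run="fsck -N -V $device"
echo "Would run: sudo $fsck_dry_run"

# Only after reviewing the dry-run output, run the real check (drop -N):
#   sudo fsck -V /dev/sdXY
```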
@ricmarques, I appreciate your taking the time to help. I've just spent a frustrating three hours trying to boot into recovery/emergency mode with no success. In fact, I can't even get to the GRUB menu; I suspect it doesn't exist, as my system is a Dell server running in UEFI mode.
When I boot up, the system goes through its hardware (onboard) diagnostics (memory, disk, network, remote management access) and then immediately launches into MATE. If I press at any point, the boot process simply halts and I'm left with a blank screen and a blinking cursor, which does not respond to keyboard or mouse; at that point it's back to Ctrl-Alt-Del to try another restart.
At this point, I think my next approach is to make a bootable USB drive with an Ubuntu MATE 20.04 LTS .iso on it and attempt to install on top of my system (unless there's a repair option).
Thanks for your appreciation of my help. I'm sorry that you are having such a hard time with that Dell server. You mention that you suspect you may not even have a GRUB menu because your system is running in UEFI mode. I don't think that is the case: my laptop runs in UEFI mode and I have a GRUB menu. I think it's more likely that the GRUB menu in your installation is hidden.
First, I like your idea of making a bootable USB drive with Ubuntu MATE 20.04. If I were you, I would start by doing that, so that you have a way to boot the machine if other problems arise.
Then - assuming that you already have a good backup - my suggestion would be to edit the GRUB configuration so that you can see the menu. However, as you're using Ubuntu MATE 20.04, I've searched and found the following topic here in the "Ubuntu MATE Community", started by @tomec on 14th February 2020, which suggests that you may be hitting a bug when you try to do that:
In that discussion (which had several replies), @franksmcb (Bill) wrote the following reply that mentions a Bug:
Some comments in that bug point to a workaround described in "Get grub menu back after installing Ubuntu 20.04 alongside Windows | by Jerry | Medium", which basically amounts to editing the "/etc/default/grub" file (I suggest that you make a backup of that file before editing it) and, in that file, changing the "GRUB_TIMEOUT_STYLE" line so its value is menu instead of hidden:
GRUB_TIMEOUT_STYLE=menu
... and uncomment the following line:
GRUB_TERMINAL=console
Also check/confirm that GRUB_TIMEOUT in that file is set to a positive number:
GRUB_TIMEOUT=10
Finally, run sudo update-grub to update the GRUB configuration and then reboot.
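The three edits above could be scripted with sed along these lines. This is a sketch run against a sample copy of the file (invented minimal content) so nothing on the system is touched; on the real machine you'd back up and edit /etc/default/grub, then run sudo update-grub:

```shell
# Sample stand-in for /etc/default/grub (invented minimal content).
cat > grub.sample <<'EOF'
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
#GRUB_TERMINAL=console
EOF

# Apply the three changes described above.
sed -i \
    -e 's/^GRUB_TIMEOUT_STYLE=.*/GRUB_TIMEOUT_STYLE=menu/' \
    -e 's/^GRUB_TIMEOUT=.*/GRUB_TIMEOUT=10/' \
    -e 's/^#GRUB_TERMINAL=console/GRUB_TERMINAL=console/' \
    grub.sample

cat grub.sample
# On the real file, finish with:  sudo update-grub
```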
Hi @ricmarques. I was a few steps ahead of you and realized GRUB was present, just hidden. So, I went and changed the /etc/default/grub file and ran the sudo update-grub command. But this only made things worse. Now the boot process winds up at a black screen with a white cursor in the upper-left corner. No keyboard or mouse input is recognized.
I have now spent the better part of last night and today attempting a variety of solutions. I have burned to bootable USB drives Boot Repair Disk, Super Grub Disk, Rescatux and a full .iso of Ubuntu-MATE 20.04.5 LTS. I have been through the UEFI boot process so many times I can pick out the appropriate F key when needed, in the dark!
Every one of these "rescue" disks at least loads. Super Grub Disk even let me boot into Linux, but even after re-installing GRUB, I couldn't get a clean boot. At least fsck didn't present me with any disk errors. I confess, I'm baffled.
I have been trying to avoid re-installing Ubuntu-MATE, but it looks like that may be my last resort. I do not have 12 TB of backup storage, so I'll likely be looking at preserving what is most important (and least re-creatable). I'll probably slice the large disk into partitions and copy over what I can.
What's most frustrating about this is that when Windows or Mac collapses, everything becomes a dust heap. Here, I have Linux, and can boot into it, but there isn't much I can do with it at this level.
That is a very tough situation, to be sure. There's an additional change in GRUB that you can also make (if you haven't already) which is to remove the "quiet splash" part of the GRUB_CMDLINE_LINUX_DEFAULT value in the "/etc/default/grub" file, in order to see the Linux boot messages. So, if your GRUB_CMDLINE_LINUX_DEFAULT line currently reads as follows:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
... then change it so it becomes simply this:
GRUB_CMDLINE_LINUX_DEFAULT=""
... and then run sudo update-grub and reboot.
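That edit could also be done with sed; here's a sketch against a sample copy of the line (on the real system the target is /etc/default/grub, followed by sudo update-grub):

```shell
# Sample stand-in for the relevant /etc/default/grub line.
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > grub.sample

# Empty out the kernel command-line defaults to see boot messages.
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT=.*/GRUB_CMDLINE_LINUX_DEFAULT=""/' grub.sample

cat grub.sample
```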
In a related note: have you also checked the output of sudo dmesg after booting, to see if any other errors / relevant information appear there?
Like you said: "At least fsck didn't present me with any disk errors". That's good news! It's still a very good idea to back up all the important stuff you have now, before making any other changes, but at least you may now feel a bit more relieved about the condition of the hard disk.
In your original post, you wrote that the VMware Workstation Pro installer, "like the other processes I ran, just stalls and doesn't throw any errors". Assuming this is a hardware problem, maybe the problem is with some memory (RAM) module (DIMM)? Have you run some memory checks using, for instance, "MemTest86" - https://www.memtest86.com/ - or "Memtest86+" - https://www.memtest.org/ ?
A somewhat related question about memory: does that Dell server of yours (is it some PowerEdge model?) have "ECC Memory" (Error Correction Code memory), which should detect and correct some types of data corruption in memory? According, for instance, to https://www.dell.com/community/PowerEdge-Hardware-General/How-to-verify-the-installed-memory-type-ECC-or-no/td-p/5061163 you can run sudo dmidecode -t 16 and check the "Error Correction Type" line. In the case of the laptop I'm using now (to write this reply), as you'd expect, it does NOT have ECC memory (that kind of memory is more common in servers, as far as I know), so I have "Error Correction Type: None" in the output of sudo dmidecode -t 16:
$ sudo dmidecode -t 16
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.
Handle 0x0025, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 16 GB
Error Information Handle: No Error
Number Of Devices: 2
As you may already know, you get a bit more information if you run sudo dmidecode -t memory (instead of sudo dmidecode -t 16) specifically you'll get some information for each installed memory module.
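A quick sketch of pulling just that line out of the output (the sample line is copied from the output above; on a live system you'd pipe sudo dmidecode -t 16 into the awk instead):

```shell
# On a live system:
#   sudo dmidecode -t 16 | awk -F': ' '/Error Correction Type/ {print $2}'

# Demonstration on the sample line quoted above:
sample='Error Correction Type: None'
ecc=$(printf '%s\n' "$sample" | awk -F': ' '/Error Correction Type/ {print $2}')
echo "Error Correction Type: $ecc"
```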
It's almost like having you in my back pocket, @ricmarques. I like that! At times I feel I'm one step ahead of you, but having another "set of eyes" goes a long way toward making me comfortable with this mess, even if I'm not getting solutions.
I too, thought maybe memory might be an issue. When my server boots, the first thing it reports is "Configuring memory..." which takes nearly a minute to complete. But I have 18 slots on the motherboard, and 16 of those are filled with 4GB DIMM modules, so I can expect this to take a little time. My sudo dmidecode -t 16 and sudo dmidecode -t memory don't reveal any anomalies. I also ran through the entire dmesg output looking for issues there. Nothing out of the ordinary.
Here's a sample of my dmidecode -t memory output (first 32 lines, via head -32):
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.6 present.
Handle 0x1000, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 288 GB
Error Information Handle: Not Provided
Number Of Devices: 18
Handle 0x1100, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: DIMM
Set: 1
Locator: DIMM_A1
Bank Locator: Not Specified
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MT/s
Manufacturer: 00CE00B380CE
Serial Number: 86220626
Asset Tag: 02111161
Part Number: M393B5273CH0-YH9
Rank: 2
Yep: according to that "sudo dmidecode -t memory" output, your Memory Array is indeed ECC, specifically Multi-bit ECC (as we can see in the part that reads Error Correction Type: Multi-bit ECC). The fact that, for the first memory module of your server, the "Total Width" value (Total Width: 72 bits) is greater than the "Data Width" value (Data Width: 64 bits) is another likely indicator that that specific memory module (DIMM) is an ECC module (and I assume the same holds for the other memory modules in your system).
"(...) Having looked at a number of our servers, my conclusion is that if dmidecode reports a 'Total Width' larger than the 'Data Width' (typically 72 and 64), you can definitely conclude that you have ECC DIMMs. If it also reports that ECC is enabled in the 'Physical Memory Array' section, ECC is almost certainly on. Otherwise, who knows short of the kernel complaining about ECC problems. (...)"
All indications suggest a healthy system, yet I cannot boot into the system without using a bootable thumb drive. (Note the kernel I booted into: 5.13.0-52-generic; I had actually upgraded the kernel to 5.13.0-53-generic, which is approximately when I started having issues.)
Hmmm.... I've found the following Bug report in Launchpad for the "5.13.0-53-generic" Linux kernel, reported by Johan van Dijk on 16th November 2022, but it's for Ubuntu 22.04 (Jammy Jellyfish) and not for Ubuntu 20.04 (Focal Fossa). Judging by that bug report and the comments on the Bug page, it seems to be (mostly?) applicable to people who have AMD Graphics Cards / GPUs (which, according to your "inxi" output, is not your case):
According to the following comment by the "Launchpad Janitor", that particular bug has been fixed in kernel 5.15.0-56.62:
Maybe it's worth trying to boot your server from another USB flash drive that has Ubuntu MATE 20.04.5 (Focal Fossa) but with that kernel "5.15.0-56.62" applied, and seeing if the system behaves properly (before risking a kernel upgrade on the server itself, which could undesirably remove the functioning 5.13.0-52 kernel and keep only the bad 5.13.0-53 kernel and the 5.13.0-56 ones)?
For what it's worth, in my laptop (with Ubuntu MATE 22.04), I already have that 5.15.0-56 kernel installed as the default one:
$ uname -a
Linux [ hostname ] 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
I recently downloaded the kernel 5.15.0-56-generic (but didn't see 5.15.0-56.62). There's a pending update for linux-headers-5.4.0-135-generic, but unless I'm off, this is development stuff and not for day-to-day running.
The Rescatux disk shows me I have 5.15.0-56-generic available. I'll give that a try. Thanks!
That "5.15.0-56-generic" kernel that you have already downloaded is most likely "5.15-0.56.62". Here's the related package information for my laptop:
$ dpkg --list | grep -i 'linux kernel' | grep '5.15.0-56' | grep generic
ii linux-headers-5.15.0-56-generic 5.15.0-56.62 amd64 Linux kernel headers for version 5.15.0 on 64 bit x86 SMP
ii linux-modules-5.15.0-56-generic 5.15.0-56.62 amd64 Linux kernel extra modules for version 5.15.0 on 64 bit x86 SMP
ii linux-modules-extra-5.15.0-56-generic 5.15.0-56.62 amd64 Linux kernel extra modules for version 5.15.0 on 64 bit x86 SMP
Notice that the names of the installed packages above are only "linux-[something]-5.15.0-56-generic", but the Version is actually "5.15.0-56.62". That "62" also appears in the "uname -a" output, after the # (hash) character:
$ uname -a
Linux [ hostname ] 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
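A small sketch of how the two version strings line up, parsed from the sample uname -a output above (on a live system you'd just use uname -r and uname -v directly):

```shell
# Sample `uname -a` output (hostname redacted, as above).
sample='Linux hostname 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux'

# Field 3 is the kernel release; field 4 carries the "#62" build number.
release=$(printf '%s\n' "$sample" | awk '{print $3}')
build=$(printf '%s\n' "$sample" | awk '{sub(/^#/, "", $4); sub(/-Ubuntu$/, "", $4); print $4}')

# Combine them into the package-style version, e.g. 5.15.0-56.62
echo "package version: ${release%-generic}.$build"
```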
Got a notification that 7 new updates were available. Among them were new kernel headers. Since I can still boot from a live USB, I figured, "Why not?" So I ran
sudo apt clean && sudo apt update
As expected, it choked on update-initramfs and told me dpkg needed to be reconfigured. Sigh.
So, off I went. Much to my surprise, after the first chokepoint, it progressed on!
sudo dpkg --configure -a
Setting up linux-firmware (1.187.35) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-53-generic
cp: error reading '/lib/modules/5.15.0-53-generic/kernel/drivers/gpu/drm/i2c/ch7006.ko': Input/output error
depmod: ERROR: failed to load symbols from /var/tmp/mkinitramfs_CenHVH/lib/modules/5.15.0-53-generic/kernel/drivers/gpu/drm/i2c/ch7006.ko: Invalid argument
update-initramfs: Generating /boot/initrd.img-5.15.0-52-generic
update-initramfs: Generating /boot/initrd.img-5.13.0-52-generic
Setting up linux-modules-extra-5.15.0-56-generic (5.15.0-56.62~20.04.1) ...
Setting up linux-image-generic-hwe-20.04 (5.15.0.56.62~20.04.22) ...
Setting up linux-generic-hwe-20.04 (5.15.0.56.62~20.04.22) ...
Processing triggers for linux-image-5.15.0-56-generic (5.15.0-56.62~20.04.1) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-5.15.0-56-generic
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.15.0-56-generic
Found initrd image: /boot/initrd.img-5.15.0-56-generic
Found linux image: /boot/vmlinuz-5.15.0-53-generic
Found initrd image: /boot/initrd.img-5.15.0-53-generic
Found linux image: /boot/vmlinuz-5.15.0-52-generic
Found initrd image: /boot/initrd.img-5.15.0-52-generic
Found linux image: /boot/vmlinuz-5.13.0-52-generic
Found initrd image: /boot/initrd.img-5.13.0-52-generic
Found Mac OS X on /dev/sdb3
done
And now... I'm going to cross my fingers and see if the system boots. More to come...
I am now cautiously optimistic. The system rebooted! I've been looking over services and installed software, and everything seems to be intact. Even better, VMware Workstation Pro 16.2.4 installed completely, and now I am working through my VMs. I did need to run the VMware Network Editor to get back to my chosen virtual network but I had no problems with that, either.
At the moment, everything seems back to normal. I have no confirmation that the issue was something buggy in kernel 5.15.0-53-generic, but the fact that everything works now under 5.15.0-56-generic suggests the issue was kernel-based. I'm hoping this is now behind me; I lost a week of productive time troubleshooting it. Fortunately, all that was lost was time!