I have posted here recently about problems I had reconfiguring initramfs and getting VMware Workstation Pro to launch/compile, and now I think my next step is to reboot in emergency mode and run fsck interactively on my file systems. I have 20 TB of disk, but haven't chunked it up since I first installed Ubuntu MATE, and that may be an oversight on my part.
My latest attempt was to completely remove all vestiges of VMware Workstation and do a fresh re-install. I found directions on how to do this, and have done it successfully before. Not this time.
VMware Workstation Pro is delivered as a .bundle file, which is simply a script to be executed with elevated privileges. So, I cheerily went forth and issued sudo ./VMware-Workstation-blah-blah-blah.bundle, and off it went.
However, it's now stalled at 51%. I've been waiting for the progress bar to continue, but it doesn't move. Just like the other processes I ran, it simply stalls without throwing any errors.
So, even though booting my server is an automatic process, I believe I can enter "emergency boot mode" at the startup screen by pressing Escape, then 'e' to edit the boot command. Once at the GRUB script, I should just add systemd.unit=emergency.target. Then, if all goes well, I should be able to run fsck on my file systems and see what's what.
You mention that you have "20Tb of disk"?! That's a lot! Is that a single disk (given the huge size, I'm assuming it's an HDD and not an SSD) or is it some kind of RAID setup with several disks?
Before running "fsck" (filesystem check), I would first:
1 - Check the /var/log/syslog and search for disk errors there (messages like "DRDY ERR" and/or "Unrecovered read error" and/or "I/O error" are likely indicators of disk problems and/or a bad cable).
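A minimal sketch of that log search (the error strings are the ones mentioned above; the sample log line is invented for demonstration, and on a real system you'd grep /var/log/syslog itself):

```shell
# Disk-error signatures mentioned above, as one extended regex.
pattern='DRDY ERR|Unrecovered read error|I/O error'

# On a live system (needs read access to the log):
#   grep -E "$pattern" /var/log/syslog

# Demonstration against an invented sample log line:
sample='kernel: blk_update_request: I/O error, dev sda, sector 12345'
matches=$(printf '%s\n' "$sample" | grep -Ec "$pattern")
echo "disk-error lines found: $matches"
```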
2 - Do you have a good backup of your data? If not, I suggest you create a backup first, before you run a filesystem check, especially if that filesystem check will try to repair things. The article "Repair a damaged filesystem" - https://help.ubuntu.com/stable/ubuntu-help/disk-repair.html.en - includes, among other useful information, the following warning:
"(...) Possible data loss when repairing
If the filesystem structure is damaged it can affect the files stored in it. In some cases these files can not be brought into a valid form again and will be deleted or moved to a special directory. It is normally the lost+found folder in the top level directory of the filesystem where these recovered file parts can be found.
If the data is too valuable to be lost during this process, you are advised to back it up by saving an image of the volume before repairing.
This image can be then processed with forensic analysis tools like sleuthkit to further recover missing files and data parts which were not restored during the repair, and also previously removed files. (...)"
Hard disks have a built-in health-check tool called SMART (Self-Monitoring, Analysis, and Reporting Technology), which continually checks the disk for potential problems. SMART also warns you if the disk is about to fail, helping you avoid loss of important data.
Although SMART runs automatically, you can also check your disk’s health by running the Disks application:
Check your disk’s health using the Disks application
1. Open Disks from the Activities overview.
2. Select the disk you want to check from the list of storage devices on the left. Information and status of the disk will be shown.
3. Click the menu button and select SMART Data & Self-Tests…. The Overall Assessment should say "Disk is OK".
4. See more information under SMART Attributes, or click the Start Self-test button to run a self-test. (...)"
(Alternatively, if you're familiar with the "smartctl" command that comes with the "smartmontools" package, you can also use that for checking the "SMART" Data and doing the "SMART Self-tests")
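For reference, a sketch of how that could look with "smartctl" (assuming the smartmontools package is installed; /dev/sda is a placeholder device name, and the sample output line below is invented for demonstration):

```shell
# Typical smartctl invocations (placeholders; need root and a real device):
#   sudo smartctl -H /dev/sda          # overall health assessment
#   sudo smartctl -t short /dev/sda    # start a short self-test
#   sudo smartctl -a /dev/sda          # full SMART attributes and logs

# Demonstration: extracting the health verdict from a sample output line.
sample='SMART overall-health self-assessment test result: PASSED'
verdict=$(printf '%s\n' "$sample" | awk -F': ' '/overall-health/ {print $2}')
echo "SMART verdict: $verdict"
```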
4 - If you don't already have a good backup of your data, you may consider using "ddrescue" (I've never used it) - Ddrescue - GNU Project - Free Software Foundation (FSF) - to create a disk image before running "fsck" in a mode that may try to repair things. HOWEVER, creating a disk image can take a LOT of time, and using "ddrescue" on a 20 TB disk like yours will very likely take a REALLY LONG time! There's the following post here in the "Ubuntu MATE Community" regarding "ddrescue", posted on 10th April 2018 by @andyp6 :
Other than that: your approach looks good to me (entering "emergency boot mode" at the startup screen by pressing Escape, then 'e' to edit the boot command, and adding systemd.unit=emergency.target at the GRUB script), as described, for instance, in the article How to Boot Ubuntu 20.04 LTS in Rescue / Emergency Mode, particularly in the section "Booting Ubuntu 20.04 LTS in Emergency Mode" of that article.
Having said that, in Emergency Mode I would first run "fsck" in its "dry run" mode by using the -N switch (uppercase N) that, according to "man fsck", means "Don't execute, just show what would be done." I would also add the "-V" switch (uppercase V) to get verbose output, check the results, and only then run the "fsck" command again without the -N switch.
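A small sketch of that sequence (/dev/sdXY is a placeholder partition name; the command is only assembled and printed here, not executed):

```shell
# Placeholder partition; replace with the real one (e.g. from `lsblk`).
device='/dev/sdXY'

# Dry run first: -N = "Don't execute, just show what would be done",
# -V = verbose output (both per `man fsck`).
fsck_dry_run="fsck -N -V $device"
echo "Would run: sudo $fsck_dry_run"

# Only after reviewing the dry-run output, run the real check (drop -N):
#   sudo fsck -V /dev/sdXY
```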
@ricmarques, I appreciate your taking the time to help. I've just spent a frustrating three hours trying to boot into recovery/emergency mode with no success. In fact, I can't even get to the GRUB menu; I suspect it doesn't exist, as my system is a Dell server running in UEFI mode.
When I boot up, the system goes through its hardware (onboard) diagnostics (memory, disk, network, remote management access) and then immediately launches into MATE. If I press at any point, the boot process simply halts and I'm left with a blank screen and a blinking cursor, which does not respond to keyboard or mouse; at that point it's back to Ctrl-Alt-Del to try another restart.
At this point, I think my next approach is to make a bootable USB drive with an Ubuntu MATE 20.04 LTS .iso on it and attempt to install on top of my system (unless there's a repair option).
Thanks for your appreciation of my help. I'm sorry that you are having such a hard time with that Dell server. You mention that you suspect you may not even have a GRUB menu because your system is running in UEFI mode. I don't think that is the case: my laptop runs in UEFI mode and I have a GRUB menu. I think it's more likely that the GRUB menu in your installation is hidden.
First, I like your idea of making a bootable USB drive with Ubuntu MATE 20.04. If I were you, I would start by doing that, so that you have a way to boot the machine if other problems arise.
Then - assuming that you already have a good backup - my suggestion would be to edit the GRUB configuration so that you can see the menu. However, as you're using Ubuntu MATE 20.04, I've searched and found the following topic here in the "Ubuntu MATE Community", started by @tomec on 14th February 2020, which suggests that you may be hitting a bug when you try to do that:
In that discussion (which had several replies), @franksmcb (Bill) wrote the following reply that mentions a Bug:
Some comments in that bug point to a workaround described in "Get grub menu back after installing Ubuntu 20.04 alongside Windows | by Jerry | Medium", which basically amounts to editing the "/etc/default/grub" file (I suggest that you make a backup of that file before editing it) and, in that file, changing the "GRUB_TIMEOUT_STYLE" line so its value is menu instead of hidden:
GRUB_TIMEOUT_STYLE=menu
... and uncomment the following line:
GRUB_TERMINAL=console
Also check/confirm that GRUB_TIMEOUT in that file is set to a positive number:
GRUB_TIMEOUT=10
Finally, run sudo update-grub to update the GRUB configuration and then reboot.
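The three edits above could be scripted with sed along these lines. This is a sketch run against a sample copy of the file (invented minimal content) so nothing on the system is touched; on the real machine you'd back up and edit /etc/default/grub, then run sudo update-grub:

```shell
# Sample stand-in for /etc/default/grub (invented minimal content).
cat > grub.sample <<'EOF'
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
#GRUB_TERMINAL=console
EOF

# Apply the three changes described above.
sed -i \
    -e 's/^GRUB_TIMEOUT_STYLE=.*/GRUB_TIMEOUT_STYLE=menu/' \
    -e 's/^GRUB_TIMEOUT=.*/GRUB_TIMEOUT=10/' \
    -e 's/^#GRUB_TERMINAL=console/GRUB_TERMINAL=console/' \
    grub.sample

cat grub.sample
# On the real file, finish with:  sudo update-grub
```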
Hi @ricmarques. I was a few steps ahead of you and realized GRUB was present, just hidden. So, I went and changed the /etc/default/grub file and ran the sudo update-grub command. But this only made things worse. Now the boot process winds up at a black screen with a white cursor in the upper-left corner. No keyboard or mouse input is recognized.
I have now spent the better part of last night and today attempting a variety of solutions. I have burned to bootable USB drives Boot Repair Disk, Super Grub Disk, Rescatux and a full .iso of Ubuntu-MATE 20.04.5 LTS. I have been through the UEFI boot process so many times I can pick out the appropriate F key when needed, in the dark!
Every one of these "rescue" disks at least loads. Super Grub Disk even let me boot into Linux, but even after re-installing GRUB, I couldn't get a clean boot. At least fsck didn't present me with any disk errors. I confess, I'm baffled.
I have been trying to avoid re-installing Ubuntu-MATE, but it looks like that may be my last resort. I do not have 12 TB of backup storage, so I'll likely be looking at preserving what is most important (and least re-creatable). I'll probably slice the large disk into partitions and copy over what I can.
What's most frustrating about this is that when Windows or Mac collapses, everything becomes a dust heap. Here, I have Linux, and can boot into it, but there isn't much I can do with it at this level.
That is a very tough situation, to be sure. There's an additional change in GRUB that you can also make (if you haven't already) which is to remove the "quiet splash" part of the GRUB_CMDLINE_LINUX_DEFAULT value in the "/etc/default/grub" file, in order to see the Linux boot messages. So, if your GRUB_CMDLINE_LINUX_DEFAULT line currently reads as follows:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
... then change it so it becomes simply this:
GRUB_CMDLINE_LINUX_DEFAULT=""
... and then run sudo update-grub and reboot.
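That edit could also be done with sed; here's a sketch against a sample copy of the line (on the real system the target is /etc/default/grub, followed by sudo update-grub):

```shell
# Sample stand-in for the relevant /etc/default/grub line.
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > grub.sample

# Empty out the kernel command-line defaults to see boot messages.
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT=.*/GRUB_CMDLINE_LINUX_DEFAULT=""/' grub.sample

cat grub.sample
```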
In a related note: have you also checked the output of sudo dmesg after booting, to see if any other errors / relevant information appear there?
Like you said: "At least fsck didn't present me with any disk errors". That's good news! It's still a very good idea to back up all the important stuff you have now, before making any other changes, but at least you may now feel a bit more relieved about the condition of the hard disk.
In your original post, you wrote that the VMware Workstation Pro installer, "like the other processes I ran, just stalls and doesn't throw any errors". Assuming this is a hardware problem, maybe the problem is with some memory (RAM) module (DIMM)? Have you run some memory checks using, for instance, "MemTest86" - https://www.memtest86.com/ - or "Memtest86+" - https://www.memtest.org/ ?
A somewhat related question about memory: does that Dell server of yours (is it some PowerEdge model?) have "ECC Memory" (Error Correction Code memory), which should detect and correct some types of data corruption in memory? According, for instance, to https://www.dell.com/community/PowerEdge-Hardware-General/How-to-verify-the-installed-memory-type-ECC-or-no/td-p/5061163 you can run sudo dmidecode -t 16 and check the "Error Correction Type" line. In the case of the laptop I'm using now (to write this reply), as you'd expect, it does NOT have ECC memory (that kind of memory is more common in servers, as far as I know), so I have "Error Correction Type: None" in the output of sudo dmidecode -t 16:
$ sudo dmidecode -t 16
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.
Handle 0x0025, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 16 GB
Error Information Handle: No Error
Number Of Devices: 2
As you may already know, you get a bit more information if you run sudo dmidecode -t memory (instead of sudo dmidecode -t 16) specifically you'll get some information for each installed memory module.
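A quick sketch of pulling just that line out of the output (the sample line is copied from the output above; on a live system you'd pipe sudo dmidecode -t 16 into the awk instead):

```shell
# On a live system:
#   sudo dmidecode -t 16 | awk -F': ' '/Error Correction Type/ {print $2}'

# Demonstration on the sample line quoted above:
sample='Error Correction Type: None'
ecc=$(printf '%s\n' "$sample" | awk -F': ' '/Error Correction Type/ {print $2}')
echo "Error Correction Type: $ecc"
```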
It's almost like having you in my back pocket, @ricmarques. I like that! At times I feel I'm one step ahead of you, but having another "set of eyes" goes a long way toward making me comfortable with this mess, even if I'm not getting solutions.
I too, thought maybe memory might be an issue. When my server boots, the first thing it reports is "Configuring memory..." which takes nearly a minute to complete. But I have 18 slots on the motherboard, and 16 of those are filled with 4GB DIMM modules, so I can expect this to take a little time. My sudo dmidecode -t 16 and sudo dmidecode -t memory don't reveal any anomalies. I also ran through the entire dmesg output looking for issues there. Nothing out of the ordinary.
Here's a sample of my dmidecode -t memory output (first 32 lines, via head -32):
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.6 present.
Handle 0x1000, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 288 GB
Error Information Handle: Not Provided
Number Of Devices: 18
Handle 0x1100, DMI type 17, 28 bytes
Memory Device
Array Handle: 0x1000
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: DIMM
Set: 1
Locator: DIMM_A1
Bank Locator: Not Specified
Type: DDR3
Type Detail: Synchronous Registered (Buffered)
Speed: 1333 MT/s
Manufacturer: 00CE00B380CE
Serial Number: 86220626
Asset Tag: 02111161
Part Number: M393B5273CH0-YH9
Rank: 2
Yep: according to that "sudo dmidecode -t memory" output, your Memory Array is indeed ECC, specifically Multi-bit ECC (as we can see in the part that reads Error Correction Type: Multi-bit ECC). The fact that, for the first memory module of your server, the "Total Width" value (Total Width: 72 bits) is greater than the "Data Width" value (Data Width: 64 bits) is another likely indicator that that specific memory module (DIMM) is an ECC module (and I assume the same holds for the other memory modules in your system).
"(...) Having looked at a number of our servers, my conclusion is that if dmidecode reports a 'Total Width' larger than the 'Data Width' (typically 72 and 64), you can definitely conclude that you have ECC DIMMs. If it also reports that ECC is enabled in the 'Physical Memory Array' section, ECC is almost certainly on. Otherwise, who knows short of the kernel complaining about ECC problems. (...)"
All indications suggest a healthy system, yet I cannot boot into the system without using a bootable thumb drive. (Note the kernel I booted into: 5.13.0-52-generic; I had actually upgraded the kernel to 5.13.0-53-generic, which is approximately when I started having issues.)
Hmmm.... I've found the following Bug report in Launchpad for the "5.13.0-53-generic" Linux kernel, reported by Johan van Dijk on 16th November 2022, but it's for Ubuntu 22.04 (Jammy Jellyfish) and not for Ubuntu 20.04 (Focal Fossa). Judging by that bug report and the comments on the Bug page, it seems to be (mostly?) applicable to people who have AMD Graphics Cards / GPUs (which, according to your "inxi" output, is not your case):
According to the following comment by the "Launchpad Janitor", that particular bug has been fixed in kernel 5.15.0-56.62:
Maybe it's worth trying to boot your server from another USB flash drive that has Ubuntu MATE 20.04.5 (Focal Fossa) but with that kernel "5.15.0-56.62" applied, and seeing if the system behaves properly (before risking a kernel upgrade on the server itself, which could undesirably remove the functioning 5.13.0-52 kernel and keep only the bad 5.13.0-53 kernel and the 5.13.0-56 ones)?
For what it's worth, in my laptop (with Ubuntu MATE 22.04), I already have that 5.15.0-56 kernel installed as the default one:
$ uname -a
Linux [ hostname ] 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
I recently downloaded the kernel 5.15.0-56-generic (but didn't see 5.15.0-56.62). There's a pending update for linux-headers-5.4.0-135-generic, but unless I'm off, this is development stuff and not for day-to-day running.
The Rescatux disk shows me I have 5.15.0-56-generic available. I'll give that a try. Thanks!
That "5.15.0-56-generic" kernel that you have already downloaded is most likely "5.15-0.56.62". Here's the related package information for my laptop:
$ dpkg --list | grep -i 'linux kernel' | grep '5.15.0-56' | grep generic
ii linux-headers-5.15.0-56-generic 5.15.0-56.62 amd64 Linux kernel headers for version 5.15.0 on 64 bit x86 SMP
ii linux-modules-5.15.0-56-generic 5.15.0-56.62 amd64 Linux kernel extra modules for version 5.15.0 on 64 bit x86 SMP
ii linux-modules-extra-5.15.0-56-generic 5.15.0-56.62 amd64 Linux kernel extra modules for version 5.15.0 on 64 bit x86 SMP
Notice that the names of the installed packages above are only "linux-[something]-5.15.0-56-generic", but the Version is actually "5.15.0-56.62". That "62" also appears in the "uname -a" output, after the # (hash) character:
$ uname -a
Linux [ hostname ] 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
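A small sketch of how the two version strings line up, parsed from the sample uname -a output above (on a live system you'd just use uname -r and uname -v directly):

```shell
# Sample `uname -a` output (hostname redacted, as above).
sample='Linux hostname 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux'

# Field 3 is the kernel release; field 4 carries the "#62" build number.
release=$(printf '%s\n' "$sample" | awk '{print $3}')
build=$(printf '%s\n' "$sample" | awk '{sub(/^#/, "", $4); sub(/-Ubuntu$/, "", $4); print $4}')

# Combine them into the package-style version, e.g. 5.15.0-56.62
echo "package version: ${release%-generic}.$build"
```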
Got a notification that 7 new updates were available. Among them were new kernel headers. Since I can still boot from a live USB, I figured, "Why not?" So I ran
sudo apt clean && sudo apt update
As expected, it choked on update-initramfs and told me dpkg needed to be reconfigured. Sigh.
So, off I went. Much to my surprise, after the first chokepoint, it progressed on!
sudo dpkg --configure -a
Setting up linux-firmware (1.187.35) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-53-generic
cp: error reading '/lib/modules/5.15.0-53-generic/kernel/drivers/gpu/drm/i2c/ch7006.ko': Input/output error
depmod: ERROR: failed to load symbols from /var/tmp/mkinitramfs_CenHVH/lib/modules/5.15.0-53-generic/kernel/drivers/gpu/drm/i2c/ch7006.ko: Invalid argument
update-initramfs: Generating /boot/initrd.img-5.15.0-52-generic
update-initramfs: Generating /boot/initrd.img-5.13.0-52-generic
Setting up linux-modules-extra-5.15.0-56-generic (5.15.0-56.62~20.04.1) ...
Setting up linux-image-generic-hwe-20.04 (5.15.0.56.62~20.04.22) ...
Setting up linux-generic-hwe-20.04 (5.15.0.56.62~20.04.22) ...
Processing triggers for linux-image-5.15.0-56-generic (5.15.0-56.62~20.04.1) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-5.15.0-56-generic
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.15.0-56-generic
Found initrd image: /boot/initrd.img-5.15.0-56-generic
Found linux image: /boot/vmlinuz-5.15.0-53-generic
Found initrd image: /boot/initrd.img-5.15.0-53-generic
Found linux image: /boot/vmlinuz-5.15.0-52-generic
Found initrd image: /boot/initrd.img-5.15.0-52-generic
Found linux image: /boot/vmlinuz-5.13.0-52-generic
Found initrd image: /boot/initrd.img-5.13.0-52-generic
Found Mac OS X on /dev/sdb3
done
And now... I'm going to cross my fingers and see if the system boots. More to come...
I am now cautiously optimistic. The system rebooted! I've been looking over services and installed software, and everything seems to be intact. Even better, VMware Workstation Pro 16.2.4 installed completely, and now I am working through my VMs. I did need to run the VMware Network Editor to get back to my chosen virtual network but I had no problems with that, either.
At the moment, everything seems back to normal. I have no confirmation that the issue was something buggy in kernel 5.15.0-53-generic, but the fact that everything works now under 5.15.0-56-generic suggests the issue was kernel-based. I'm hoping this is now behind me; I lost a week of productive time troubleshooting it. Fortunately, all that was lost was time!