Intermittent boot problem in LTS 18.04

Hi Everyone.

I am a long-time user of Ubuntu MATE, which I think is great.

I just signed up to the community today to see if I can get some help with a strange problem I am having.

Intermittently, when I boot up my laptop, I get errors like the ones below and cannot boot when they occur:

journalctl -b -1 | grep -i long
Jun 07 10:26:30 dell systemd-udevd[516]: seq 2959 '/module/nvidia' is taking a long time
Jun 07 10:26:30 dell systemd-udevd[516]: seq 3425 '/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C14:02' is taking a long time
Jun 07 10:26:30 dell systemd-udevd[516]: seq 3269 '/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:10/PNP0C09:00/INT3403:00' is taking a long time
Jun 07 10:26:30 dell systemd-udevd[516]: seq 3248 '/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/PNP0C14:01' is taking a long time
Jun 07 10:26:30 dell systemd-udevd[516]: seq 3458 '/devices/pci0000:00/0000:00:04.0' is taking a long time

journalctl -b -1 | grep -i sda2.device
Jun 07 10:26:59 dell systemd[1]: dev-sda2.device: Job dev-sda2.device/start timed out.
Jun 07 10:26:59 dell systemd[1]: Timed out waiting for device dev-sda2.device.
Jun 07 10:26:59 dell systemd[1]: dev-sda2.device: Job dev-sda2.device/start failed with result 'timeout'.
Jun 07 10:28:29 dell systemd[1]: dev-sda2.device: Job dev-sda2.device/start timed out.
Jun 07 10:28:29 dell systemd[1]: Timed out waiting for device dev-sda2.device.
Jun 07 10:28:29 dell systemd[1]: dev-sda2.device: Job dev-sda2.device/start failed with result 'timeout'.

Sometimes it boots up normally, with none of those kinds of errors at all.

I seem to have had those sorts of errors since installing the latest kernel:
/var/log/dpkg.log.1:2022-05-28 12:40:07 status installed linux-image-4.15.0-180-generic:amd64 4.15.0-180.189

I have tried booting from the previous kernel, though, and still had the same problem.
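
(For reference, the kernels still installed can be listed with something like: dpkg --list 'linux-image-*' | grep ^ii)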

To fix the problem I boot to rescue mode and run an fsck on /dev/sda1 (/boot/efi);
sometimes it is marked as dirty and sometimes it is not.

and then I run
grub-install /dev/sda
update-grub

and then it usually boots up ok
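
For reference, the full rescue-mode sequence is roughly this (assuming /dev/sda1 is not mounted at that point; fsck.vfat comes from the dosfstools package):

fsck.vfat -a /dev/sda1     # check/repair the vfat EFI partition
grub-install /dev/sda      # reinstall GRUB to the disk
update-grub                # regenerate grub.cfg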

My hardware is a Dell Inspiron 7560 

My OS is 

cat /etc/os-release 

NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

My disk layout is like so...

fdisk -l /dev/sda

Disk /dev/sda: 238.5 GiB, 256060514304 bytes, 500118192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 8C7A1912-3565-4853-9E63-55749D58E708

Device       Start       End   Sectors   Size Type
/dev/sda1     2048     43007     40960    20M EFI System
/dev/sda2    43008   2140159   2097152     1G Linux filesystem
/dev/sda3  2140160 500118158 497977999 237.5G Linux filesystem

The output of lsblk -f is:

sda                                                                     
├─sda1         vfat              0275-84DA                              /boot/efi
├─sda2         ext4              0650ed2e-5153-459a-9eac-1e026f8e2d54   /boot
└─sda3         crypto_LUKS       d3851cb9-79e1-4e8d-af6b-1e2032234039   
  └─sda3_crypt LVM2_member       7F1rhm-U6qV-oMd4-effZ-cZy8-O0s1-8AS5xU 
    ├─ub-root  ext4              f29e8f45-765c-4551-9a93-f5715fcc5458   /
    ├─ub-home  ext4              21244b90-fa13-44a3-8cb0-879331d2ae6b   /home
    ├─ub-e     ext4              1941ef6f-7447-486a-8a88-4d620590ab3e   /e
    └─ub-swap  swap              3b470346-9672-4711-837c-ee0090dc4841   [SWAP]

sda3 is encrypted, and on the encrypted device I have a VG called "ub" with my LVs for root, swap, home, e, etc.

ls -l /dev/mapper/*
crw------- 1 root root 10, 236 Jun  8 07:29 /dev/mapper/control
lrwxrwxrwx 1 root root       7 Jun  8 07:29 /dev/mapper/sda3_crypt -> ../dm-0
lrwxrwxrwx 1 root root       7 Jun  8 07:29 /dev/mapper/ub-e -> ../dm-3
lrwxrwxrwx 1 root root       7 Jun  8 07:29 /dev/mapper/ub-home -> ../dm-2
lrwxrwxrwx 1 root root       7 Jun  8 07:29 /dev/mapper/ub-root -> ../dm-1
lrwxrwxrwx 1 root root       7 Jun  8 07:29 /dev/mapper/ub-swap -> ../dm-4


# mount | grep sda
/dev/sda2 on /boot type ext4 (rw,relatime,data=ordered)
/dev/sda1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)

Currently my hard disk is a:
Samsung 860 EVO M.2 SATA III internal SSD (non-NVMe type)

I previously had an additional 2.5" 1TB drive as well, with my LVs spread across both disks, and it seems that either the second disk, its cable, its connection to the motherboard, or even the bus it is on has gone faulty.

So I had to do a bare metal restore from a backup onto one disk.

Originally the boot disk was seen as /dev/sdb; now, with a single disk, it is /dev/sda.

To do the bare metal restore, I booted from a live ISO image on a USB stick and manually repartitioned /dev/sda with a GPT disk label,

and created the LVs on the encrypted sda3 partition, ran mkfs.vfat on /dev/sda1 for /boot/efi, and mkfs.ext4 on /dev/sda2 for /boot.
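
Roughly, the setup from the live USB looked like this (reconstructed from memory, so the sizes and options here are only illustrative, not exactly what I typed):

cryptsetup luksFormat /dev/sda3             # encrypt the big partition
cryptsetup open /dev/sda3 sda3_crypt        # unlock it as /dev/mapper/sda3_crypt
pvcreate /dev/mapper/sda3_crypt             # LVM physical volume on the unlocked device
vgcreate ub /dev/mapper/sda3_crypt          # volume group "ub"
lvcreate -n root -L 40G ub                  # LV sizes here are examples only
lvcreate -n home -L 100G ub
lvcreate -n e -L 50G ub
lvcreate -n swap -L 8G ub
mkfs.vfat /dev/sda1                         # EFI system partition (/boot/efi)
mkfs.ext4 /dev/sda2                         # /boot
mkfs.ext4 /dev/mapper/ub-root               # and the same for ub-home and ub-e
mkswap /dev/mapper/ub-swap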

Then I edited /etc/fstab and corrected the UUIDs for the mount points to match the newly recreated filesystems, updated the UUID in /etc/crypttab too, and ran grub-install /dev/sda and update-grub.
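
(The new UUIDs can be read off with blkid.) My /etc/crypttab now has a single line along these lines (the LUKS UUID matches the lsblk output above, though I may be misremembering the exact trailing options):

sda3_crypt UUID=d3851cb9-79e1-4e8d-af6b-1e2032234039 none luks,discard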

At that stage I should have been able to boot up, but it failed with similar errors, so I booted from the live ISO again and changed /etc/fstab to use the /dev/mapper/ub-root (and other) LV names, and then it booted okay.

grep -v ^# /etc/fstab

/dev/mapper/ub-root /               ext4    errors=remount-ro 0       1
/dev/sda2 /boot           ext4    defaults        0       2
UUID=0275-84DA  /boot/efi       vfat    umask=0077      0       0
/dev/mapper/ub-home /home           ext4    defaults        0       2
/dev/mapper/ub-e /e           ext4    defaults        0       2
/dev/mapper/ub-swap none            swap    sw              0       0
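
One thing I have wondered about (pure speculation on my part) is whether referring to /boot by UUID, the same way /boot/efi is, would make any difference, i.e. something like:

UUID=0650ed2e-5153-459a-9eac-1e026f8e2d54 /boot ext4 defaults 0 2

but I have not tried that yet.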

I also removed the UUID related to the resume-from-hibernate entry in /etc/default/grub and ran update-grub:

GRUB_DEFAULT=0
#GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR="`lsb_release -i -s 2> /dev/null || echo Debian`"
## GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
## GRUB_CMDLINE_LINUX_DEFAULT="UUID=50f603eb-4b97-4027-a652-dc275336bc6f"
# corrected swap UUID after BMR restore on 3-jun-2022
# GRUB_CMDLINE_LINUX_DEFAULT="UUID=3b470346-9672-4711-837c-ee0090dc4841"
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX=""

But, as mentioned, sometimes it won't boot.

I also still had the cable from the second drive that I removed connected to the motherboard, and I suspected it might be causing intermittent problems with devices on the PCI bus "taking a long time".

So later I removed the cable, and it looked like the problem had gone away, but a couple of days later I had the same boot problem again.

I am at a loss as to what is causing the problem.
It might be an intermittent hardware fault, or it might be some sort of software bug, or I might have configured my disk layout since the bare metal restore in a way that makes it prone to this problem.
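
Next time it fails, I plan to check the failed boot's journal for kernel-level disk or SATA errors alongside the udev messages, with something like:

journalctl -b -1 -k -p err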

At this stage I am considering reinstalling from scratch with 20.04 or 22.04 LTS, but that means I would have to manually reinstall various programs and set up my desktop with the various customisations I have made.

But I suppose I ought to be upgrading anyway, as 18.04 LTS is nearing its end of life.

Has anyone got any ideas as to what my problem might be please?

Could it be an intermittent hardware problem or some sort of software issue, or have I mis-configured my disk layout in some way?

Try booting from an older kernel.

Have you checked on the health of your HD? This problem has an eerie similarity to an HD with bad blocks or a failing HD cable.

Hi Thom,
thanks for your reply.

I was suspecting the hard drive could be the real cause of the problem too.

I am not really sure of the best tool to check it with - it is an M.2 card type of SSD.

Do you have advice on a good tool to check the SSD drive with?
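
(I wondered whether smartctl from the smartmontools package would do, e.g. smartctl -a /dev/sda run as root, but I am not sure how much the SMART attributes would reveal about an intermittent detection problem like this.)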

Because the current SSD in the laptop is only 256 GB, I have ordered a 1TB replacement. It should arrive in a few days.

So if I do a bare metal restore onto the new drive, it should indicate whether or not the problem was with the original 256 GB drive.

Also, because I could not seem to detect the second 2.5" 1TB drive (which I have removed from the laptop, along with the connecting cable), I wondered if it could be an intermittent problem on the motherboard.

I put the 2.5" 1TB second hard disk into a USB enclosure last night, and it seems to work okay when connected

I hope it is not the motherboard, or that will be more hardware to buy and replace - but I will find out I guess.

I will update this post when I have had the new drive in the laptop anyway.

Thanks
Brad

Hi Basil_Cat.

The current kernel is 4.15.0-180.

I did already try booting from the older kernel, 4.15.0-177 (the only earlier one still on the system).

I get the same problem with the older kernel.

Thanks
Brad

Hi Brad,

No, unfortunately I am not familiar with any hardware-level SSD tools. But if this happened to me I would temporarily replace the SSD with another boot device in the same socket (I presume that the 'card type' is an M.2 slot?) and try to run with that. That would definitely tell me whether it was the motherboard or the storage device.

To be honest, after reading your logs again, it could indeed be the motherboard. I remember I had identical intermittent boot problems with my previous motherboard due to a failing capacitor in the power-on-reset circuit. It took me some time to figure that out.