CPU hard and soft lockups - desktop crashing

OK, so I have been experiencing some major crashes on my workstation, and it is getting me down because this is way beyond my problem-solving abilities.

I came to the desktop this morning after some chores and my screen was all black, but with a cursor, and my graphics tablet display read: "No Device detected." (It's plugged into my GPU - AMD RX 6600 8GB)

I should have pinged the tower or tunneled in from another machine to snoop around, but I was in a frantic state and it slipped my mind. I just hit the reboot button, and once I was back at the desktop I read the kern.log and syslog output for the times I experienced the lockup.

I pasted what looked like CPU Hard & Soft lockup errors on Ubuntu Pastebin (which will be up for about a month - happy holidays :slight_smile: )

Here are the links:
SysLog output: Ubuntu Pastebin
KernLog output: Ubuntu Pastebin

Can someone help me figure out what is causing the crashing?
This also happened the other day while I left Krita open with a drawing and OBS recording my art session. I got up to get some MATE tea, and when I came back the system was unresponsive. (Probably a different issue: driver issues with the tablet.)

Some system info:
pleiades 5.15.0-56-generic #62-Ubuntu SMP

lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  6
  On-line CPU(s) list:   0-5
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  1
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            10
    CPU max MHz:         4300.0000
    CPU min MHz:         800.0000
    BogoMIPS:            7200.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
                         rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
                         tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep
                         bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1.5 MiB (6 instances)
  L3:                    9 MiB (1 instance)

lshw -short
H/W path               Device          Class          Description
=================================================================
                                       system         MS-7B98 (Default string)
/0                                     bus            Z390-A PRO (MS-7B98)
/0/0                                   memory         64KiB BIOS
/0/39                                  memory         32GiB System Memory
/0/39/0                                memory         8GiB DIMM DDR4 Synchronous 3200 MHz (0.3 ns)
/0/39/1                                memory         8GiB DIMM DDR4 Synchronous 3200 MHz (0.3 ns)
/0/39/2                                memory         8GiB DIMM DDR4 Synchronous 3200 MHz (0.3 ns)
/0/39/3                                memory         8GiB DIMM DDR4 Synchronous 3200 MHz (0.3 ns)
/0/43                                  memory         384KiB L1 cache
/0/44                                  memory         1536KiB L2 cache
/0/45                                  memory         9MiB L3 cache
/0/46                                  processor      Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz
/0/100                                 bridge         8th Gen Core Processor Host Bridge/DRAM Registers
/0/100/1                               bridge         6th-10th Gen Core Processor PCIe Controller (x16)
/0/100/1/0                             bridge         Navi 10 XL Upstream Port of PCI Express Switch
/0/100/1/0/0                           bridge         Navi 10 XL Downstream Port of PCI Express Switch
/0/100/1/0/0/0         /dev/fb0        display        Navi 23 [Radeon RX 6600/6600 XT/6600M]
/0/100/1/0/0/0.1       card1           multimedia     Navi 21 HDMI Audio [Radeon RX 6800/6800 XT / 6900 XT]
/0/100/1/0/0/0.1/0     input10         input          HDA ATI HDMI HDMI/DP,pcm=7
/0/100/1/0/0/0.1/1     input11         input          HDA ATI HDMI HDMI/DP,pcm=8
/0/100/1/0/0/0.1/2     input12         input          HDA ATI HDMI HDMI/DP,pcm=9
/0/100/1/0/0/0.1/3     input13         input          HDA ATI HDMI HDMI/DP,pcm=10
/0/100/1/0/0/0.1/4     input9          input          HDA ATI HDMI HDMI/DP,pcm=3
/0/100/8                               generic        Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
/0/100/12                              generic        Cannon Lake PCH Thermal Controller
/0/100/14                              bus            Cannon Lake PCH USB 3.1 xHCI Host Controller
/0/100/14/0            usb1            bus            xHCI Host Controller
/0/100/14/0/1          input3          input          Logitech Mechanical keyboard Logitech Mechanical keyboard Keyboard
/0/100/14/0/2          input5          input          SteelSeries SteelSeries Rival 310 eSports Mouse Keyboard
/0/100/14/0/9          input7          input          Logitech Gamepad F710
/0/100/14/0/a                          bus            HighSpeed Hub
/0/100/14/0/a/1                        input          15.6 inch PenDisplay
/0/100/14/1            usb2            bus            xHCI Host Controller
/0/100/14/1/3          scsi6           storage        My Book 1235
/0/100/14/1/3/0.0.0    /dev/sdc        disk           4TB My Book 1235
/0/100/14/1/3/0.0.0/1  /dev/sdc1       volume         3725GiB EFI partition
/0/100/14/1/3/0.0.1                    generic        SES Device
/0/100/14/1/4          card2           multimedia     Logitech StreamCam
/0/100/14.2                            memory         RAM memory
/0/100/16                              communication  Cannon Lake PCH HECI Controller
/0/100/17              scsi0           storage        Cannon Lake PCH SATA AHCI Controller
/0/100/17/0            /dev/sda        disk           256GB ADATA SU760
/0/100/17/0/1          /dev/sda1       volume         238GiB EXT4 volume
/0/100/17/1            /dev/sdb        disk           1TB WDC WD1003FZEX-0
/0/100/17/1/1          /dev/sdb1       volume         465GiB EXT4 volume
/0/100/17/1/2          /dev/sdb2       volume         465GiB EXT4 volume
/0/100/1d                              bridge         Cannon Lake PCH PCI Express Root Port #9
/0/100/1d/0            /dev/nvme0      storage        ADATA FALCON
/0/100/1d/0/0          hwmon1          disk           NVMe disk
/0/100/1d/0/2          /dev/ng0n1      disk           NVMe disk
/0/100/1d/0/1          /dev/nvme0n1    disk           512GB NVMe disk
/0/100/1d/0/1/1        /dev/nvme0n1p1  volume         975MiB Windows FAT volume
/0/100/1d/0/1/2        /dev/nvme0n1p2  volume         1952MiB Linux swap volume
/0/100/1d/0/1/3        /dev/nvme0n1p3  volume         57GiB EXT4 volume
/0/100/1d/0/1/4        /dev/nvme0n1p4  volume         416GiB EXT4 volume
/0/100/1f                              bridge         Z390 Chipset LPC/eSPI Controller
/0/100/1f/0                            system         PnP device PNP0c02
/0/100/1f/1                            system         PnP device PNP0c02
/0/100/1f/2                            printer        PnP device PNP0400
/0/100/1f/3                            communication  PnP device PNP0501
/0/100/1f/4                            system         PnP device PNP0c02
/0/100/1f/5                            generic        PnP device INT3f0d
/0/100/1f/6                            system         PnP device PNP0c02
/0/100/1f/7                            system         PnP device PNP0c02
/0/100/1f/8                            system         PnP device PNP0c02
/0/100/1f/9                            system         PnP device PNP0c02
/0/100/1f/a                            system         PnP device PNP0c02
/0/100/1f.3            card0           multimedia     Cannon Lake PCH cAVS
/0/100/1f.3/0          input14         input          HDA Intel PCH Front Mic
/0/100/1f.3/1          input15         input          HDA Intel PCH Rear Mic
/0/100/1f.3/2          input16         input          HDA Intel PCH Line
/0/100/1f.3/3          input17         input          HDA Intel PCH Line Out Front
/0/100/1f.3/4          input18         input          HDA Intel PCH Line Out Surround
/0/100/1f.3/5          input19         input          HDA Intel PCH Line Out CLFE
/0/100/1f.3/6          input20         input          HDA Intel PCH Line Out Side
/0/100/1f.3/7          input21         input          HDA Intel PCH Front Headphone
/0/100/1f.4                            bus            Cannon Lake PCH SMBus Controller
/0/100/1f.5                            bus            Cannon Lake PCH SPI Controller
/0/100/1f.6            eno1            network        Ethernet Connection (7) I219-V
/1                                     power          To Be Filled By O.E.M.
/2                     input0          input          Sleep Button
/3                     input1          input          Power Button
/4                     input2          input          Power Button
/5                     input22         input          UGTABLET 15.6 inch PenDisplay Mouse
/6                     input23         input          UGTABLET 15.6 inch PenDisplay Keyboard
/7                     input24         input          UGTABLET 15.6 inch PenDisplay
/8                     input25         input          UGTABLET 15.6 inch PenDisplay

Thank you so much :heart:


OK, upon reboot, all screens are black. I'm so awesome I set UFW to block everything the other day... so helpful.

I went to GRUB menu and here's what I found:
I used kernel 5.15.0-56 (recovery mode), tried to fix broken packages, updated GRUB, and was able to boot into the desktop, but only my main monitor worked (and it showed up as unrecognized).
With the normal 5.15.0-56 entry I could not load the desktop (all screens are black).

I can only get into the desktop by selecting kernel 5.15.0-53.

So I can use the system at least, but it feels very unstable and I'm uneasy, which hinders my workflow. :worried:


I am afraid that it is, with a very high probability, a hardware failure.
CPU Hard & Soft lockup errors practically always point to a piece of hardware failing.
(but a loose connector can do the same type of damage)

After doing a full backup of your home directory, re-download and reinstall kernel 5.15.0-56 because it was, again with high probability, corrupted by whatever it is that happened.

Checklist:
A. Connections
B. Temperatures
C. Hard disk or solid-state drive health
D. RAM
E. Power supply

A. Connections:

  1. clean the inside of your PC case
  2. reseat your GPU and your RAM sticks (to ensure good contact)
  3. disconnect and reconnect the SATA cables and the power connectors.
    make sure the connectors fit 'snugly', i.e. not too loose or flexy.
    replace cables that look dodgy.

B. Temperatures: just check, and keep an eye on it.
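
For keeping an eye on temperatures from a terminal, here is a minimal sketch that reads the kernel's hwmon sensor files directly. It assumes the standard `/sys/class/hwmon` layout, where each `temp*_input` file holds millidegrees Celsius. The function name `read_hwmon_temps` is made up for this example, and which chips show up depends on the drivers loaded on your machine.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return (chip, label, degrees_celsius) for every readable hwmon temperature sensor."""
    readings = []
    for chip_dir in sorted(Path(base).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            # temp1_input may have a matching temp1_label with a friendlier name.
            label_file = input_file.with_name(input_file.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else input_file.name
            try:
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors refuse reads or return junk; skip them
            readings.append((chip, label, millidegrees / 1000.0))
    return readings


if __name__ == "__main__":
    for chip, label, celsius in read_hwmon_temps():
        print(f"{chip:12s} {label:20s} {celsius:5.1f} C")
```

Running it every few seconds during a heavy workload (for example with `watch`) should show whether the CPU or GPU is creeping towards its limit before a lockup.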

C. Possibly your HD/SSD is slowly deteriorating.
A very good and useful "how to check" is written by @ricmarques here:

D. Memory check: a failing element on a RAM stick can cause CPU lockups and data corruption.
Most Linux live ISOs have a memory check option in the boot menu.
Let it run for 12 hours or so. Physical RAM errors will not always show up in a single short run.

E. Check your power supply.
A failing or overloaded power supply can starve the motherboard, which can lead to unstable and erratic behaviour. If it smells bad or is making a whining sound, it's dying.
(Also, check that its power rating is a step or two above what your CPU + GPU + motherboard need under load.)
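
The headroom check is simple arithmetic, so here is a rough sketch of it. Every wattage below is an illustrative placeholder, not a measurement of this system, and the 1.3 safety margin is only an assumed rule of thumb. Substitute the figures from your own parts' spec sheets.

```python
def psu_headroom(psu_watts, component_watts, margin=1.3):
    """Compare a PSU rating against the summed component draw plus a safety margin.

    Returns (total_draw, recommended_rating, is_sufficient).
    """
    total = sum(component_watts.values())
    recommended = total * margin
    return total, recommended, psu_watts >= recommended


# Placeholder numbers for illustration only.
example_load = {"cpu": 95, "gpu": 132, "board_drives_fans": 75}
total, recommended, ok = psu_headroom(650, example_load)
print(f"draw ~{total} W, want >= {recommended:.0f} W, sufficient: {ok}")
```

If a second GPU is planned, add it to the dictionary before comparing. The margin matters more as the load approaches the rating.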

There are, of course, more possible causes.
The checklist above is just a small list of the most probable causes.

P.S.
Also, because you are in the USA, check if any brownouts or other instabilities of the electricity supply/net/grid happened in your district. Brownouts can pretty much mess up your computer.


Interesting. I just bought this GPU a few weeks ago, and the power supply to go along with it because I wanted to run 2 GPUs on this motherboard.

The CPU I bought last year, and I had to file a claim on Amazon because they bait-and-switched me: they sold me a used CPU as new, and it arrived in a cardboard box without the manufacturer's packaging.

I also just added 2 extra RAM sticks right before I bought the GPU.
This doesn't sound fun TBH

Edit:

Good news, I have clean parts and everything is seated tight. Semi-bad news is it was already very clean and tight. Better news is I re-installed the kernel 5.15.0-56 and it booted into the desktop without issues.

So now I'm wondering the best way to run a memory test since it's a UEFI system. I'm running memtester right now, but I really would rather test all the RAM, or at least individual DIMMs one at a time or something. That way I know if I have a faulty stick or slot or something.

Not sure what is the best way to go about this.

I got this so far:

Random Value : FAILURE: 0x39bd9ac36fcf7d51 != 0x39bd9a436fcf7d51 at offset 0x2e1ebca0.
FAILURE: 0x2ddb7380fcff96f2 != 0x2ddb7388fcff96f2 at offset 0x2c8fc50d8.
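
To see what kind of error these are, the two values in each FAILURE line can be XORed to find which bits differ. This is a small sketch with the values copied from the output above. The helper name `flipped_bits` is made up for the example.

```python
def flipped_bits(left, right):
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = left ^ right
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


# The two mismatching pairs reported by memtester above.
failures = [
    (0x39bd9ac36fcf7d51, 0x39bd9a436fcf7d51),
    (0x2ddb7380fcff96f2, 0x2ddb7388fcff96f2),
]

for left, right in failures:
    print(hex(left ^ right), flipped_bits(left, right))
```

Each pair differs in exactly one bit (bit 39 in the first, bit 35 in the second). Isolated single-bit flips like these are the sort of pattern I would expect from marginal RAM or memory settings rather than from a software bug, although two samples alone can't prove that.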

UPDATE

When I added the second set of DIMMs back in October, I turned on XMP, which I never did during the two years I ran Debian. So I removed all the RAM and tested each module, one by one, at the base speed of 2133 MHz. They seem to pass the memory tests at 2133 MHz. I used a live image and ran gnome-disk-utility to test the disks, and it says they're fine. One HDD has an old-age reading, but that is expected as it's been around a while. The other disks are OK. All the temps have been normal this whole time. I did go into the BIOS and enable PWM on the extra fans, which I thought was enabled already.

IDK, maybe the XMP profile is misconfigured? I still need to install all 4 DIMMs and run a complete test on all the memory.

One good method to test RAM in this case is this:

  1. Run for a couple of days on two RAM sticks.
  2. If all goes well, swap the two RAM sticks for the other pair.
  3. If the crashes return in one of the two situations, you know which pair of RAM sticks contains the faulty one.
  4. Mark the good pair (with a permanent marker or something).
  5. Repeat the above with pairs consisting of one marked and one unmarked stick.

You will end up with one unmarked RAM stick. This one is the faulty one. :slight_smile:
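
The steps above are an elimination search, and can be sketched as code to make the logic explicit. This is only an illustration: `is_stable` is a hypothetical callback standing in for "run the machine on just this pair for a couple of days and see if it crashes". The sketch assumes exactly one stick is faulty and that the fault reproduces reliably, so it skips testing the second pair once the first is known to be good or bad.

```python
def find_faulty_stick(sticks, is_stable):
    """Locate the single faulty stick among four by testing pairs.

    sticks    -- four stick identifiers, e.g. ["A", "B", "C", "D"]
    is_stable -- hypothetical callback: is_stable(pair) is True if the
                 system ran without crashing on just that pair
    """
    first_pair, second_pair = tuple(sticks[:2]), tuple(sticks[2:])

    # Steps 1-4: find out which pair is good ("marked") and which is suspect.
    if is_stable(first_pair):
        good, suspect = first_pair, second_pair
    else:
        good, suspect = second_pair, first_pair

    # Step 5: pair one suspect stick with a known-good one.
    if is_stable((good[0], suspect[0])):
        return suspect[1]  # that suspect passed, so the other one is faulty
    return suspect[0]


# Simulated run in which stick "C" is the bad one.
print(find_faulty_stick(["A", "B", "C", "D"], lambda pair: "C" not in pair))
```

In practice an intermittent fault may need longer runs per pair, and a bad slot rather than a bad stick would need the same test repeated across slots.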

The other question:
XMP can indeed induce instability if the RAM sticks are not suitable for "overclocking".
Do your tests with XMP on, but if you have critical work to do, switch XMP off for now to prevent data loss and crashes.
