System abend rebooting frequently. How to diagnose?

My Ubuntu MATE (22.04.5 LTS) has been auto-rebooting today nearly every 15 minutes or so. In the past, I have usually believed this to be caused by a networking problem, but my network has been very stable today. So, I'm curious what diagnostics I can view to determine the cause. I've checked sudo dmesg but with all the ring buffer messages, it's hard to isolate one. My last reboot shows

$ last reboot | head -7
reboot system boot 6.8.0-87-generic Thu Oct 30 17:38 still running
reboot system boot 6.8.0-87-generic Thu Oct 30 17:15 still running
reboot system boot 6.8.0-87-generic Thu Oct 30 16:47 still running
reboot system boot 6.8.0-87-generic Thu Oct 30 15:21 still running
reboot system boot 6.8.0-87-generic Wed Oct 29 18:32 still running
reboot system boot 6.8.0-86-generic Wed Oct 29 18:18 - 18:29 (00:10)
reboot system boot 6.8.0-86-generic Mon Oct 27 14:29 - 18:15 (2+03:46)

All of today's reboots were spurious.

Any ideas?

Is your PC getting hot? It could be a thermal safety shutdown.
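
One way to check this without extra tooling is to read the kernel's thermal zones directly. This is only a minimal sketch: it assumes the kernel exposes sensors under /sys/class/thermal (not all hardware does), and `millideg_to_c` is just an illustrative helper name.

```shell
#!/bin/bash
# Convert a sysfs reading in millidegrees Celsius (e.g. 45500) to "45.5".
millideg_to_c() {
    local m=$1
    printf '%d.%d\n' "$((m / 1000))" "$(((m % 1000) / 100))"
}

# Print every thermal zone the kernel exposes; prints nothing if there are none.
for zone in /sys/class/thermal/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    printf '%s (%s): %s C\n' "${zone##*/}" "$(cat "$zone/type" 2>/dev/null)" \
        "$(millideg_to_c "$(cat "$zone/temp" 2>/dev/null)")"
done
```

Running it in a loop (for example with `watch`) while the machine is under load would show whether any zone climbs shortly before a reboot.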

2 Likes

I don't think so. I have a cron job that checks the ambient temperature every 5 minutes, and it hasn't reported any changes.

1 Like

systemctl list-timers will show which timers exist/run.

Can you share that cron script?
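
Besides systemd timers, it may be worth ruling out a scheduled job that reboots the machine on purpose. A rough sketch, assuming the standard Ubuntu cron locations; `find_reboot_jobs` is a hypothetical helper name, and a match only means the file mentions one of these commands, not that it caused the reboots.

```shell
#!/bin/bash
# List files under the given paths that mention a reboot/shutdown command.
find_reboot_jobs() {
    grep -rlwE 'reboot|shutdown|poweroff' "$@" 2>/dev/null
}

# Typical use (run as root so every crontab is readable):
#   systemctl list-timers --all
#   find_reboot_jobs /etc/crontab /etc/cron.d /etc/cron.hourly /var/spool/cron/crontabs
```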

2 Likes

Someone brought me a computer to fix, two days ago, which switched itself off after 15 minutes but sometimes even during boot.

I moved the RAM to another slot and the problem was solved, so it might be a defective RAM slot.

1 Like

It looks like a hardware issue.

  • Is the PSU dying? If it can't deliver enough power on the 5V and 12V rails, the CPU may call it quits.
  • Memory sticks: maybe re-seat them.

3 Likes
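
If the hardware is at fault, the kernel sometimes logs a hint just before the machine goes down. The sketch below assumes a persistent systemd journal (so the previous boot's log survives the reboot) and uses a hypothetical filter, `hw_error_lines`, whose pattern list is only a starting point.

```shell
#!/bin/bash
# Keep only lines that commonly accompany hardware trouble.
hw_error_lines() {
    grep -iE 'mce:|machine check|hardware error|edac|thermal|over-?current|i/o error'
}

# Typical use:
#   journalctl --list-boots                          # one line per recorded boot
#   journalctl -b -1 -k --no-pager | hw_error_lines  # kernel log of the previous boot
#   journalctl -b -1 -n 50 --no-pager                # last lines before the reset
```

If the previous boot's log simply stops with no shutdown sequence, that would be more consistent with a power loss or hard reset than with a reboot requested by software.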

This is the script; it requires ipmitool. If your system doesn't support the "Intelligent Platform Management Interface" (IPMI) standard, it likely won't do much for you. It's available on GitHub:

Here's my script, modified from the default:

```shell
#!/bin/bash
# Report the ambient temperature from the BMC's sensor data repository (sdr).
if [ "$EUID" -ne 0 ]; then
    echo "Please run as root" >&2
    exit 1
fi

# Notes: grep -Po '\d{2}' uses a Perl-compatible regular expression (the -P
# option of GNU grep); -o prints only the match, i.e. two digits taken from
# the lines kept by the previous greps. tail -1 keeps the last such match.
TEMP=$(ipmitool sdr type temperature | grep Ambient | grep degrees | grep -Po '\d{2}' | tail -1)

echo "Ambient temperature on the server is ($TEMP C)"
```
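
Because the script only prints the latest ambient reading, a spike right before a reboot is easy to miss, and the ambient sensor may not move much even when another component runs hot. A possible extension is to log every temperature sensor with a timestamp. This is a sketch under the assumption that `ipmitool sdr type temperature` prints pipe-separated rows ending in a reading such as `22 degrees C`; `parse_temps`, `log_temps` and the log path are illustrative names, not part of ipmitool.

```shell
#!/bin/bash
# Turn pipe-separated sensor rows into "name=value" pairs, one per line.
# Rows without a "degrees C" reading (disabled sensors, fans) are skipped.
parse_temps() {
    awk -F'|' '/degrees C/ {
        name = $1
        gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the padded sensor name
        split($NF, reading, " ")            # last column: "22 degrees C"
        print name "=" reading[1]
    }'
}

# Append one timestamped line with all temperature readings to a log file.
log_temps() {
    local log=${1:-/var/log/ipmi-temps.log}   # hypothetical default path
    printf '%s %s\n' "$(date '+%F %T')" \
        "$(ipmitool sdr type temperature | parse_temps | tr '\n' ' ')" >> "$log"
}

# Example root crontab entry (every 5 minutes), assuming this file is saved
# as /usr/local/sbin/ipmi-temps.sh with a final line that calls log_temps:
#   */5 * * * * /usr/local/sbin/ipmi-temps.sh
```

After the next unexpected reboot, the last few lines of the log show what the sensors reported in the minutes before it.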

2 Likes

I have a supermicro server which has IPMI.

Can you log on directly to IPMI (admin/admin) and look at the health of the server?

1 Like

I don't know, I've never tried. How would I do that?

You should have two IP addresses for that server: one for the host itself and another for the IPMI interface. In my case, pve9 is a Proxmox server, and I reach its IPMI page from the browser at http://10.0.0.56

pve9.lan                  => 10.0.0.9
supermicro.lan.           => 10.0.0.56

Login page

In my case the login is ADMIN/ADMIN; other default user/password combinations exist, depending on your server.

Second tab, Server Health, on the left Sensor Readings

I have an issue with FAN 1

There are other options, such as powering the server on or off. Even when the server is turned off, the IPMI controller stays active, so you can log on and power the server on through it.

2 Likes

I have a Dell R710. I don't think IPMI is installed, just the ipmitool I added to it. I have two IP addresses, but the second is for Dell's own Integrated Dell Remote Access Controller (iDRAC), which is pretty old and not very helpful.
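
Since ipmitool already returns sensor data on this machine, the same controller may also answer other IPMI queries, including its System Event Log, which records events such as power-supply or temperature faults independently of the OS. A hedged sketch: the subcommands are standard ipmitool ones, but whether this particular controller fills in every field is an assumption, and `sel_on_date` is a hypothetical helper that assumes `sel elist` rows carry an MM/DD/YYYY date in the second column.

```shell
#!/bin/bash
# Keep event-log rows whose date column matches the given MM/DD/YYYY date.
sel_on_date() {
    awk -F'|' -v d="$1" '{ day = $2; gsub(/[ \t]/, "", day) } day == d'
}

# Typical use (as root):
#   ipmitool sel elist | sel_on_date "$(date +%m/%d/%Y)"   # events logged today
#   ipmitool chassis restart_cause    # what the controller recorded for the last restart
#   ipmitool sdr type "Power Supply"  # status of the power-supply sensors
```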

1 Like

I have an R710, too. The power supplies each have a green light; check that both are on. You may have to re-seat the PSUs. Also, the front panel has a small display that can show you status. I cannot power mine on now to show you the iDRAC interface but, in my opinion, a server that reboots every 15 minutes has a hardware issue.

1 Like

And, as quickly and unexpectedly as these reboots began, they've ended.

I hate it when a problem "solves" itself. It will likely resurface again, and I'll go through this same exercise again. :frowning:

1 Like

I can't tell you much, so I'll just share my personal experience:
An unsupported UM LTS installation kept receiving upstream Ubuntu updates and eventually became unstable. The OS was freezing at random several times a day, and the only cure was a reboot. I could not find a trace of the root cause in the logs. The situation resolved itself after an in-place version upgrade.

2 Likes

Your R710 needs a good clean with compressed air. Also re-seat the PSUs and RAM, and check for loose cables.

3 Likes

That's probably a good idea even without trouble. I haven't opened it in a couple of years.

2 Likes