new_server

Differences

This shows you the differences between two versions of the page.

new_server [2017/07/23 18:42] josh
new_server [2018/01/16 10:35] (current) josh
Line 89: Line 89:
   Latency              2731us     267us     257us     122us      25us     122us   Latency              2731us     267us     257us     122us      25us     122us
   1.97,1.97,hathor,1,1500205275,3G,,1591,99,195864,9,192590,10,+++++,+++,2287940,56,+++++,+++,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,13792us,3537us,3623us,3168us,1356us,1846us,2731us,267us,257us,122us,25us,122us   1.97,1.97,hathor,1,1500205275,3G,,1591,99,195864,9,192590,10,+++++,+++,2287940,56,+++++,+++,16,,,,,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,13792us,3537us,3623us,3168us,1356us,1846us,2731us,267us,257us,122us,25us,122us
- 
- 
-====== Issues ====== 
- 
-===== APC UPS not killing power ===== 
- 
-I think this is a Fedora / systemd problem. apcupsd will detect the AC power failure and initiate a shutdown (shutdown -h -H now via apccontrol). But apccontrol killpower does not seem to be called. Also, during shutdown, systemd reports "A stop job is running for APC UPS Power Control Daemon for Linux". Not sure if this is because the apcupsd service was what kicked off the shutdown command, so it wasn't able to be stopped. 
  
 ====== Installation ====== ====== Installation ======
Line 140: Line 133:
 ====== Services ====== ====== Services ======
  
-^ Server ^ Services ^ Notes ^
 +^ Server ^ Services ^ OS ^ Hardware ^
 | anubis (host) | <WRAP> | anubis (host) | <WRAP>
 +  * apcupsd
   * NFS   * NFS
-  * KVM
 +  * KVM (libvirt-guests)
   * gmail backups   * gmail backups
   * NFS backups   * NFS backups
-</WRAP> | Fedora 26 Server |
 +</WRAP> | Fedora Server 26 | 32GB RAM; 12 CPU threads | 
 +| oneill (VM) | <WRAP> 
 +  * HTTP 
 +    * wiki 
 +    * tt-rss 
 +    * mythweb (proxy) 
 +    * cameras (proxy) 
 +    * cgit (proxy) 
 +</WRAP> | Ubuntu Server 16.04 | 2GB RAM; 1 VCPU | 
 +| carter (VM) | <WRAP> 
 +  * git-daemon 
 +  * gitolite git hosting over ssh 
 +  * cgit 
 +</WRAP> | Ubuntu Server 16.04 | 2GB RAM; 1 VCPU | 
 +| baal (VM) | <WRAP> 
 +  * openvpn 
 +  * squid proxy 
 +</WRAP> | Fedora Server 26 | 1GB RAM; 1 VCPU | 
 +| hathor (VM) | <WRAP> 
 +  * minetest 
 +</WRAP> | Ubuntu Server 16.04 | 2GB RAM; 1 VCPU | 
 +| ra (VM) | <WRAP> 
 +  * mythtv backend 
 +  * mythweb 
 +</WRAP> | Mythbuntu 16.04 | 4GB RAM; 2 VCPUs | 
 + 
 +====== Issues ====== 
 + 
 +===== APC UPS not killing power ===== 
 + 
 +I think this is a Fedora / systemd problem. apcupsd will detect the AC power failure and initiate a shutdown (shutdown -h -H now via apccontrol). But apccontrol killpower does not seem to be called. Also, during shutdown, systemd reports "A stop job is running for APC UPS Power Control Daemon for Linux". Not sure if this is because the apcupsd service was what kicked off the shutdown command, so it wasn't able to be stopped. 
 + 
 +Bug report filed: [[https://bugzilla.redhat.com/show_bug.cgi?id=1472062]]. The UPS not cutting power seems to be caused by the ''/etc/apcupsd/powerfail'' file not being created, which might be because of SELinux. The other part of the problem (hanging while stopping the apcupsd service) still remains, but I work around it by changing the service timeout to 10s. 
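 + 
 +A minimal sketch of that timeout workaround, assuming the timeout in question is the unit's stop timeout and that it is set through a systemd drop-in. The drop-in path is an assumption (the usual location for unit overrides), ''TimeoutStopSec'' is the standard systemd directive for the stop timeout, and 10s is the value mentioned above: 
 + 
 +<code> 
 +# /etc/systemd/system/apcupsd.service.d/timeout.conf (assumed path) 
 +[Service] 
 +# Give apcupsd at most 10 seconds to stop during shutdown 
 +TimeoutStopSec=10s 
 +</code> 
 + 
 +After adding a drop-in like this, ''systemctl daemon-reload'' makes systemd pick it up. 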
 + 
 +===== M.2 PCIe NVMe drive disappearing ===== 
 + 
 +Three times now my server has stopped responding and shown a blank console. After resetting and going into the UEFI setup, the Western Digital M.2 drive no longer shows up. After a power-off and cold boot, the drive reappears in the UEFI setup, and I have to reset my boot settings to select it as the default boot entry. The system then appears to run properly until the next occurrence. 
 + 
 +The problem I'm observing seems very similar to this: [[https://superuser.com/questions/1194478/ssd-suddenly-becomes-unreadable-how-to-diagnose]] 
 + 
 +After the third time this happened, on 2017-12-13, I updated the UEFI firmware on the ASRock motherboard to version 3.30. The upgrade was successful, and after resetting my options (particularly re-enabling SVM and power on after AC loss), the system appears to be running properly again now. 
 + 
 +On 2017-12-15 the system froze again. I disabled "C6 Mode" in the UEFI setup and started up again. 
 + 
 +2017-12-17: I have not observed the problem since disabling "C6 Mode", but I have a feeling it will still come back and could be related to NVMe APST modes. A similar problem is described here: [[https://bbs.archlinux.org/viewtopic.php?id=232692]]. I added the kernel parameter: 
 + 
 +<code> 
 +nvme_core.default_ps_max_latency_us=0 
 +</code> 
 + 
 +Now ''nvme get-feature -f 0x0c -H /dev/nvme0n1'' shows APST is disabled. 
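 + 
 +A sketch of one way to make that parameter persistent on a GRUB-based install, assuming the kernel command line is managed through ''/etc/default/grub''. ''<existing arguments>'' is a placeholder for whatever is already on that line, not something to type literally: 
 + 
 +<code> 
 +# /etc/default/grub (fragment) 
 +# Append the NVMe setting to the existing kernel arguments; 0 disables APST 
 +GRUB_CMDLINE_LINUX="<existing arguments> nvme_core.default_ps_max_latency_us=0" 
 +</code> 
 + 
 +The GRUB configuration has to be regenerated (for example with ''grub2-mkconfig'') before the change takes effect. After a reboot, ''cat /proc/cmdline'' should list the parameter. 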
 + 
 +Turned "C6 Mode" back on but have not observed the M.2 drive disappearing problem again with APST disabled. 
 + 
 +===== BUG: soft lockup ===== 
 + 
 +A few times since fixing the M.2 drive disappearing issue, my server has frozen. On one occasion I caught the console output, which listed several messages like "watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [worker:14788]" and one "INFO: rcu_sched detected stalls on CPUs/tasks:" ... "rcu_sched kthread starved for 19150377 jiffies!". I'm not sure what is causing this. 
 + 
 +On 2018-01-08, I upgraded from kernel 4.14.4 to 4.14.11 and added "consoleblank=0" to the kernel command line so if this happens again hopefully I will not lose console output. 
 + 
 +2018-01-09: Igor seems to be experiencing the same bug and pointed me to [[https://bugzilla.kernel.org/show_bug.cgi?id=196683]]. I will probably disable C6 in UEFI setup again.
  
 +2018-01-15: The freeze happened again after disabling C6 mode in setup. I looked deeper into the setup options and found the buried "Global C-State Control" option that Igor and Mike had disabled, so I disabled that as well. No freezes since then.
new_server.1500849777.txt.gz · Last modified: by josh