
Scyld Clusterware - Admin info

From Montana Tech High Performance Computing

Revision as of 10:03, 18 September 2017 by Bdeng

Power

bpctl (soft shutdown/restart)

This is the recommended way of restarting/shutting down nodes.

  • Shutdown
    • bpctl -S all -P
  • Restart
    • bpctl -S all -R

IPMI (hard shutdown, cold boot)

Not recommended; use only as a last resort (e.g., when servers won't come online).

  • Get the power status of the nodes (a node being powered on does not necessarily mean it has booted)
    • for i in {0..19};do ipmitool -H n$i-ipmi -U admin -P ****** power status;done
  • Force shutdown of all nodes (not recommended; if the nodes are online and running, use bpctl above. Use this only if the nodes are offline or unresponsive.)
    • for i in {0..19};do ipmitool -H n$i-ipmi -U admin -P ****** power off;done
  • Power on all nodes (You will need to do this if you used bpctl to shut a node down)
    • for i in {0..19};do ipmitool -H n$i-ipmi -U admin -P ****** power on;done
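
The three loops above differ only in the power subcommand, so they can be folded into one helper. The following is a minimal sketch, not a tested site script: the function name ipmi_power_all and the DRY_RUN switch are hypothetical, and the n<i>-ipmi hostnames and admin user are taken from the commands above. It uses ipmitool's -E option, which reads the password from the IPMI_PASSWORD environment variable, so the password does not appear on the command line.

```shell
# Sketch: run one ipmitool power subcommand on nodes n<first>..n<last>.
# Assumes IPMI_PASSWORD is exported beforehand (read by ipmitool -E).
# Set DRY_RUN=1 (hypothetical switch) to print the commands instead of running them.
ipmi_power_all() {
    action="$1"
    first="${2:-0}"
    last="${3:-19}"
    case "$action" in
        status|on|off) ;;
        *)
            echo "usage: ipmi_power_all status|on|off [first] [last]" >&2
            return 2
            ;;
    esac
    i="$first"
    while [ "$i" -le "$last" ]; do
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty
        ${DRY_RUN:+echo} ipmitool -H "n${i}-ipmi" -U admin -E power "$action"
        i=$((i + 1))
    done
}
```

For example, DRY_RUN=1 ipmi_power_all status 0 3 prints the four commands it would run for n0 through n3, which is a cheap check before issuing a real power off.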

Logs

Logs from all compute nodes are collected on the management node in /var/log/messages.

System Event Logs (SEL) can be collected locally or remotely with ipmitool:

ipmitool sel save /tmp/sel.save

ipmitool -H n0-ipmi -U admin -P ****** sel save /tmp/sel.n0
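
To collect the SEL from every node in one pass, the remote command above can be wrapped in a loop. This is a sketch under assumptions: the function name sel_save_all, the output directory argument, and the DRY_RUN switch are hypothetical, the node range 0..19 is borrowed from the power commands earlier on this page, and the password is expected in IPMI_PASSWORD (ipmitool -E).

```shell
# Sketch: save each node's System Event Log to <outdir>/sel.n<i>.
# Assumes IPMI_PASSWORD is exported; DRY_RUN=1 only prints the commands.
sel_save_all() {
    outdir="${1:-/tmp}"
    first="${2:-0}"
    last="${3:-19}"
    i="$first"
    while [ "$i" -le "$last" ]; do
        ${DRY_RUN:+echo} ipmitool -H "n${i}-ipmi" -U admin -E sel save "${outdir}/sel.n${i}"
        i=$((i + 1))
    done
}
```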

Torque server and scheduler logs are stored in subdirectories of /var/spool/torque.

Updating

Scyld Software

It is a good idea to check the latest release notes at http://www.penguincomputing.com/files/scyld-docs/CW6/ReleaseNotes.pdf

Certain packages should NOT be upgraded from CentOS or EPEL repositories. So far, these include:

  • beobootutils
  • beoconfig
  • beoconfig-devel
  • beoconfig-libs
  • beosi
  • bproc
  • bproc-devel
  • bproc-libs
  • bproc-python
  • kernel
  • kernel-devel
  • kernel-firmware
  • kernel-headers
  • kmod-aacraid
  • kmod-bproc
  • kmod-filecache
  • kmod-igb
  • kmod-task_packer
  • nodescripts
  • openmpi-scyld
  • openmpi-scyld-gnu
  • openmpi-scyld-intel
  • openmpi-scyld-pgi
  • scyld-doc
  • scyld-doc-HTML
  • scyld-doc-HTTPD
  • scyld-doc-PDF
  • scyld-doc-indexhtml
  • scyld-release
  • beonss-kickbackclient
  • beostat-sendstats
  • scyld-insight
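
One way to keep these packages out of an update is yum's --exclude option. The sketch below turns a list of package names into --exclude arguments; the function name yum_exclude_args and the file scyld-packages.txt (the list above, one name per line) are hypothetical. Note that --exclude applies to every enabled repository, including Scyld's own, so this fits best when the update is restricted to the CentOS or EPEL repositories.

```shell
# Sketch: read package names (one per line) on stdin and print
# a "--exclude=<name>" option for each non-empty line.
yum_exclude_args() {
    while IFS= read -r pkg; do
        [ -n "$pkg" ] && printf '%s ' "--exclude=$pkg"
    done
    echo
}

# Possible usage (scyld-packages.txt is a hypothetical file holding the list above):
#   yum --disablerepo "*" --enablerepo "Fedora-EPEL" \
#       $(yum_exclude_args < scyld-packages.txt) update
```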

YUM

To find a specific program, use yum search <searchterms>

Installing Packages

  • Install a package:
    • yum install <packagename>

Updating Packages

Before upgrading to a new version of Scyld, read and follow the release notes: http://www.penguincomputing.com/services-support/documentation/

  • To list all updates:
    • yum list updates
  • To list updated packages from a specific repository (e.g., Fedora-EPEL):
    • yum --disablerepo "*" --enablerepo "Fedora-EPEL" list updates
  • To run a full system update
    • yum update
  • To update a specific package
    • yum update <packagename>

Removing Packages

  • Remove a package:
    • yum remove <packagename>

How to actually do an update

  • yum update

If the kernel was updated:

  • bpctl -S all -P
  • shutdown -r now (sometimes this will need to be entered twice)

If the kernel was not updated:

  • bpctl -S all -R
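
The choice between the two branches above can be sketched as a small helper that prints the follow-up commands. The function name post_update_steps is hypothetical, and treating "the kernel was updated" as "the running kernel release differs from the newest installed kernel package" is an assumption, not something this page states.

```shell
# Sketch: print the follow-up commands from the procedure above.
#   $1 = running kernel release
#   $2 = newest installed kernel release
post_update_steps() {
    if [ "$1" != "$2" ]; then
        # Kernel was updated: power off compute nodes, then reboot the master.
        echo "bpctl -S all -P"
        echo "shutdown -r now"
    else
        # Kernel was not updated: restart the compute nodes only.
        echo "bpctl -S all -R"
    fi
}

# Possible usage (assumes the kernel RPM name minus "kernel-" matches `uname -r`):
#   running=$(uname -r)
#   newest=$(rpm -q --last kernel | head -n 1 | awk '{print $1}' | sed 's/^kernel-//')
#   post_update_steps "$running" "$newest"
```

The helper only prints the commands, so an administrator can review them before running anything.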

InfiniBand Problems after an update

If /etc/beowulf/init.d/15openib and 16ipoib have been modified after an update, two changes may be required:

  • 15openib, line 45: change Infiniband to Mellanox
  • 16ipoib, line 38: change Infiniband to Mellanox

Then reboot all nodes.