Cumulus Networks Product Bulletin 2016-01-26: Thermal Management Issues on Dell S4048-ON or Dell S6000-ON Switches Running Cumulus Linux


Issue

A Dell S6000-ON or Dell S4048-ON switch may reboot at a temperature lower than the system is rated for, and does so without logging any messages.

Environment

Hardware:

  • Dell S4048-ON or Dell S6000-ON

Software:

  • Cumulus Linux versions 2.5.5 and earlier

Root Cause

The thermal management software did not match the thermal management specifications of the hardware platforms. The onboard hardware temperature sensors have registers for programming temperature threshold values. Cumulus Linux, consistent with the Linux driver and the thermal sensor data sheet, treated these registers as the point at which to warn the user. The hardware, however, treated the same registers as shutdown thresholds. As a result, when the temperature reached the programmed value, the thermal chip shut down the switch. This was unintended, and it also gave the switch no chance to log the warning.
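The mismatch described above can be illustrated with a toy model. This is only a sketch: the function name, the temperature readings, and the 75 C threshold are hypothetical values chosen for illustration, not taken from the hardware or from Cumulus Linux.

```python
def replay(temps, threshold, hardware_shuts_down_at_threshold):
    """Replay temperature readings; return (log_messages, still_powered_on).

    Toy model only: all names and numbers here are illustrative assumptions.
    """
    log = []
    for temp in temps:
        # The thermal chip acts on its own and needs no software to cut power.
        if hardware_shuts_down_at_threshold and temp >= threshold:
            return log, False
        # Software treats the same threshold as a warning level.
        if temp >= threshold:
            log.append(f"WARNING: sensor at {temp} C reached threshold {threshold} C")
    return log, True


# Hardware treats the threshold as a shutdown point: power is lost, nothing is logged.
print(replay([60, 70, 80], 75, hardware_shuts_down_at_threshold=True))

# Behavior the software expected: the switch stays up and a warning is logged.
print(replay([60, 70, 80], 75, hardware_shuts_down_at_threshold=False))
```

In the first call the log is empty because power is cut at the same threshold that would have triggered the warning.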

Resolution

The Dell and Cumulus Networks engineering teams collaborated to fully understand the software and hardware expectations for thermal management, and the Cumulus Linux thermal management software was refactored to match the latest S4048-ON thermal management specifications. Sensor temperatures can be monitored in the output of smonctl; if any sensor reaches its maximum or critical threshold, a log message is sent to syslog.

To fix this issue, upgrade the switch to Cumulus Linux 2.5.6 or later.
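
When auditing several switches, the version rule in this bulletin (2.5.5 and earlier affected, 2.5.6 and later fixed) can be checked with a short script. This is a minimal sketch: the helper names are hypothetical, it assumes the version is supplied as a plain dotted numeric string, and it applies only to the Dell S4048-ON and S6000-ON platforms listed above.

```python
# First release containing the fix, per this bulletin.
FIXED_IN = (2, 5, 6)


def parse_version(text):
    """Turn a dotted numeric version such as '2.5.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_affected(version):
    """Return True if the given Cumulus Linux version predates the fix."""
    return parse_version(version) < FIXED_IN


for version in ("2.5.3", "2.5.5", "2.5.6", "3.0.0"):
    status = "upgrade needed" if is_affected(version) else "contains the fix"
    print(f"{version}: {status}")
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character, for example "2.5.10" sorting before "2.5.6".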

Stay Informed

Subscribe to our product bulletin mailing list to learn about these announcements as soon as they're made.
