{{table_of_contents}}
Issue
There may be slow throughput performance in a switch and RX error counters are incrementing, as well as possibly TX error counters. These error counters may be seen in the output of different commands:
cl-netstat
ip -s link show
ethtool -S
If you want to monitor the output of these commands to see the statistics live as they change, use the linux watch
command. For more information on using this command, please refer to the following article.
cl-netstat
Shows RX_ERR
RX error counters can be seen in the output of cl-netstat
as "RX_ERR", as shown below.
cumulus@switch$ cl-netstat Kernel Interface table Iface MTU Met RX_OK RX_ERR RX_DRP RX_OVR TX_OK TX_ERR TX_DRP TX_OVR Flg --------------------------------------------------------------------------------------------- eth0 1500 0 7361728 0 0 0 2030188 0 0 0 BMRU lo 16436 0 173 0 0 0 173 0 0 0 LRU swp1 9000 0 7669976333 15682741 1439 0 3035723493 0 0 0 BMRU swp2 9000 0 3023667770 10728822 978 0 9840616134 0 0 0 BMRU swp3 9000 0 24315580462 14877988 1307 0 80763548753 0 0 0 BMRU swp4 9000 0 13869960451 8452232 897 0 7477191326 0 0 0 BMRU
<Output is truncated>
For additional information on how to use the cl-netstat
command, please read this article.
ip -s link show
Shows RX errors
RX error counters can be seen in the output of ip -s link show
, as shown below.
cumulus@switch$ ip -s link show swp5 7: swp5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 500 link/ether 08:9e:01:ce:e0:6c brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped overrun mcast 8552309 71000 1899 1 0 63108 TX: bytes packets errors dropped carrier collsns 1940799 15779 0 0 0 0
ethtool -S
Shows HwIfInErrors
RX error counters can be seen in the output of ethtool -S <interface>
as "HwIfInErrors", as shown below.
cumulus@switch$ ethtool -S swp1 NIC statistics: HwIfInOctets: 51883086875273 HwIfInUcastPkts: 7669711571 HwIfInBcastPkts: 0 HwIfInMcastPkts: 264791 HwIfOutOctets: 10590370555531 HwIfOutUcastPkts: 3035458717 HwIfOutMcastPkts: 264792 HwIfOutBcastPkts: 0 HwIfInDiscards: 1439 HwIfInL3Drops: 0 HwIfInBufferDrops: 1439 HwIfInAclDrops: 115 HwIfInDot3LengthErrors: 0 HwIfInErrors: 15682741 SoftInErrors: 0 SoftInDrops: 0 SoftInFrameErrors: 0 HwIfOutDiscards: 0 HwIfOutErrors: 0 HwIfOutQDrops: 0 HwIfOutNonQDrops: 0 SoftOutErrors: 0 SoftOutDrops: 0 SoftOutTxFifoFull: 0 HwIfOutQLen: 0
Environment
- Cumulus Linux, all versions
Overview
Cause of the Errors
These RX_ERR or HwIfInErrors indicate some Ethernet data frames are being corrupted somewhere along the transmission line, typically due to some bad cable or transceiver. These errors may be detected as part of the cyclic redundancy check (CRC) algorithm in the Frame Check Sequences (FCS) calculation.
When the switch receives a frame, it runs its own checksum on the frame and compares the resulting CRC value to the value in the Ethernet frame. If they are not equal, it means some bits were corrupted and thus the switch counts these as RX errors. In half-duplex mode, some FCS errors may be normal. In full-duplex mode, FCS errors are not normal.
Propagation of the Errors
When a platform detects an FCS error, what the platform does with the Ethernet frame depends on which switching mode is configured, one of either cut-through or store and forward. In cut-through mode, the frame with the FCS error may be propagated to the next switch. In store and forward mode, the frame with the FCS error will be discarded.
Cut-through Switching Mode
The cut-through mode of forwarding is used to minimize the latency (delay) through the switch by beginning the forwarding process before the entire packet has been received from the upstream sender. The data may begin to be transmitted while it is still being received on the inbound interface which minimizes the time the packet is held in the switch and thus minimizes delays in propagation. The disadvantage is that data frames with FCS errors may be propagated to the next hop because transmission out of the switch begins before the FCS error is detected. Since the next hop switch would have begun receiving this packet with no indication of a problem with the packet, it may also begin transmitting to its outbound interface before detecting the FCS error, thereby propagating the error even further.
Store and Forward Switching Mode
As the name implies, store and forward waits until the entire packet has been received and validated before starting the transmit process on the outbound interface. This allows the switch to verify that the received packet is valid before sending it onward, but it increases latency by holding each packet longer in buffers in the switch. It may also increase buffer utilization by having each packet utilize the resources for a longer period of time. If store and forward is configured, the platform is able to detect FCS errors before beginning transmission, and thus can discard the frame and not propagate the errors to the next hop.
Resolution
Replace the Bad Components
The frame corruption occurs because of some bad component somewhere in the data path, such as cables or transceivers. Trace the RX errors upstream across all the hops in the end-to-end data path
You can use lldpctl
to trace the ports upstream, hop-by-hop. Here is an example output:
cumulus@switch$ lldpctl ------------------------------------------------------------------------------- LLDP neighbors: ------------------------------------------------------------------------------- Interface: eth0, via: LLDP, RID: 1, Time: 0 day, 23:36:08 Chassis: ChassisID: mac 6c:64:1a:00:2f:54 SysName: backbone SysDescr: Cumulus Linux version 2.5.2 running on cel kennisis MgmtIP: 192.168.1.5 Capability: Bridge, on Capability: Router, on Port: PortID: ifname swp21 PortDescr: swp21 ------------------------------------------------------------------------------- Interface: swp1, via: LLDP, RID: 5, Time: 0 day, 05:51:40 Chassis: ChassisID: mac 08:9e:01:ce:e4:0c SysName: sw23 SysDescr: Cumulus Linux version 2.5.2 running on quanta ly2 MgmtIP: 192.168.2.30 Capability: Bridge, off Capability: Router, on Port: PortID: ifname swp7 PortDescr: swp7 -------------------------------------------------------------------------------
Once you have identified the source point, try replacing the cable or transceiver to resolve the component introducing the data corruption.
Change the Switching Mode
While cut-through forwarding decreases latency and buffer consumption, one of its disadvantages is that packets are not verified as valid before they begin transmission on the outbound interface. Thus forwarding may begin out the output interface before the FCS error is detected.
By changing from cut-through to store and forward mode of forwarding operation, each packet is verified as correct before the forwarding process begins, limiting the reach of any corrupt packets. This verification comes at the cost of potential increased latency and buffer consumption.
You will need to configure the switches in the data path, particularly:
- the switches upstream from the switch with the RX errors (i.e. "previous" switches in the data path) to eliminate the RX errors on the switch in question
- the switch showing the RX errors to prevent it from propagating the errors to the downstream switch (i.e. "next" switch in the data path)
To change the forwarding behavior from cut-through to store and forward on Trident and Trident-II based switches:
1. Perform the following command:
cumulus@switch$ sudo vi /etc/cumulus/datapath/traffic.conf
2. Search for the following line in the traffic.conf file:
# To enable cut-through forwarding cut_through_enable = true
3. Modify the value of cut_through_enable to false:
# To enable cut-through forwarding cut_through_enable = false
4. To let the change in forwarding mode take effect, restart switchd. Please note that restarting the switchd daemon is minimally disruptive.
cumulus@switch$ sudo service switchd restart
Caveats and Warnings
While these instructions are being provided on how to change the mode of operation for forwarding on a Cumulus Linux switch, the default setting of cut-though is the recommended value in almost every circumstance. If you make this change on a switch for testing purposes, you should continue to monitor its performance.
Comments