RX Error Counters and Slow Throughput Performance


Issue

Throughput through a switch is slow and RX error counters are incrementing, possibly along with TX error counters. These error counters appear in the output of several commands:

  • cl-netstat
  • ip -s link show
  • ethtool -S

To see these statistics change live, run any of these commands under the Linux watch command. For more information on using this command, please refer to the following article.
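
As a rough sketch of the same idea without watch, the function below samples the kernel's rx_errors counter for one interface twice and prints the difference. The name rx_errors_delta and its optional arguments are illustrative only and are not part of Cumulus Linux. It assumes the standard Linux sysfs path /sys/class/net/<interface>/statistics/rx_errors, which holds the kernel-level counter rather than the hardware counters printed by ethtool -S.

```shell
# Usage: rx_errors_delta IFACE [INTERVAL_SECONDS] [SYSFS_ROOT]
# SYSFS_ROOT is overridable only so the function can be tried against a
# scratch directory; on a switch the default is what you want.
rx_errors_delta() {
    iface=$1
    interval=${2:-5}
    sysfs_root=${3:-/sys/class/net}
    counter="$sysfs_root/$iface/statistics/rx_errors"

    # Read the counter, wait, then read it again.
    before=$(cat "$counter") || return 1
    sleep "$interval"
    after=$(cat "$counter") || return 1

    echo "$iface rx_errors: $before -> $after (delta $((after - before)))"
}
```

For example, rx_errors_delta swp1 10 reports how many RX errors swp1 accumulated over ten seconds. For a live view with watch itself, a typical invocation is watch -n 1 -d 'ip -s link show swp1', where -n sets the refresh interval in seconds and -d highlights the values that changed.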

cl-netstat Shows RX_ERR

cl-netstat reports RX errors in the "RX_ERR" column, as shown below.

cumulus@switch$ cl-netstat
Kernel Interface table
Iface    MTU Met       RX_OK   RX_ERR RX_DRP RX_OVR       TX_OK TX_ERR TX_DRP TX_OVR Flg
-----------------------------------------------------------------------------------------
eth0    1500   0     7361728        0      0      0     2030188      0      0      0 BMRU
lo     16436   0         173        0      0      0         173      0      0      0 LRU
swp1    9000   0  7669976333 15682741   1439      0  3035723493      0      0      0 BMRU
swp2    9000   0  3023667770 10728822    978      0  9840616134      0      0      0 BMRU
swp3    9000   0 24315580462 14877988   1307      0 80763548753      0      0      0 BMRU
swp4    9000   0 13869960451  8452232    897      0  7477191326      0      0      0 BMRU

<Output is truncated> 

For additional information on how to use the cl-netstat command, please read this article.
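
On a switch with many ports, it can help to list only the interfaces whose RX_ERR count is non-zero. The following is a minimal sketch that assumes the column layout shown in the sample output above, where RX_ERR is the fifth whitespace-separated field. The function name rx_err_ports is made up for this example.

```shell
# Read cl-netstat output on stdin and print "<interface> <RX_ERR>" for every
# interface with a non-zero RX_ERR count. Header and separator lines are
# skipped because their fifth field is not a plain number.
rx_err_ports() {
    awk '$5 ~ /^[0-9]+$/ && $5 + 0 > 0 { print $1, $5 }'
}
```

Use it as cl-netstat | rx_err_ports. If your version of cl-netstat prints the columns in a different order, adjust the field number accordingly.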

ip -s link show Shows RX errors

ip -s link show reports RX errors in the "errors" column of the RX row, as shown below.

cumulus@switch$ ip -s link show swp5
7: swp5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 500 
    link/ether 08:9e:01:ce:e0:6c brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    8552309    71000    1899    1       0       63108  
    TX: bytes  packets  errors  dropped carrier collsns 
    1940799    15779    0       0       0       0 
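
To put the error count in proportion, you can compare it with the number of received packets. The sketch below assumes the ip -s link show layout from the sample above, where the line after "RX:" lists packets in the second field and errors in the third. The function name rx_error_rate is hypothetical.

```shell
# Read "ip -s link show <interface>" output on stdin and print the RX error
# count, the RX packet count, and errors as a percentage of packets.
rx_error_rate() {
    awk '
        $1 == "RX:" {
            getline                      # move to the line holding the RX values
            if ($2 + 0 > 0)
                printf "errors=%s packets=%s rate=%.2f%%\n", $3, $2, 100 * $3 / $2
        }'
}
```

Use it as ip -s link show swp5 | rx_error_rate. With the sample values above (1899 errors against 71000 packets) it prints a rate of 2.67%. These counters are running totals, so the percentage describes the whole period since the counters started, not the current moment.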

ethtool -S Shows HwIfInErrors

ethtool -S <interface> reports RX errors as "HwIfInErrors", as shown below.

cumulus@switch$ ethtool -S swp1
NIC statistics:
     HwIfInOctets: 51883086875273
     HwIfInUcastPkts: 7669711571
     HwIfInBcastPkts: 0
     HwIfInMcastPkts: 264791
     HwIfOutOctets: 10590370555531
     HwIfOutUcastPkts: 3035458717
     HwIfOutMcastPkts: 264792
     HwIfOutBcastPkts: 0
     HwIfInDiscards: 1439
     HwIfInL3Drops: 0
     HwIfInBufferDrops: 1439
     HwIfInAclDrops: 115
     HwIfInDot3LengthErrors: 0
     HwIfInErrors: 15682741
     SoftInErrors: 0
     SoftInDrops: 0
     SoftInFrameErrors: 0
     HwIfOutDiscards: 0
     HwIfOutErrors: 0
     HwIfOutQDrops: 0
     HwIfOutNonQDrops: 0
     SoftOutErrors: 0
     SoftOutDrops: 0
     SoftOutTxFifoFull: 0
     HwIfOutQLen: 0 
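
Because ethtool -S prints many counters, a small filter can reduce the output to the error, drop, and discard counters that are non-zero. This sketch assumes the "name: value" layout and the counter names shown in the sample above. The function name nonzero_error_counters is made up for this example.

```shell
# Read "ethtool -S <interface>" output on stdin and print only the counters
# whose name contains Errors, Drops, or Discards and whose value is non-zero.
nonzero_error_counters() {
    awk -F: '
        $1 ~ /Errors|Drops|Discards/ {
            name = $1
            value = $2
            gsub(/[ \t]/, "", name)      # strip the indentation
            gsub(/[ \t]/, "", value)
            if (value + 0 > 0)
                print name ": " value
        }'
}
```

Use it as ethtool -S swp1 | nonzero_error_counters.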

Environment

  • Cumulus Linux, all versions

Overview

Cause of the Errors

RX_ERR and HwIfInErrors counts indicate that Ethernet frames are being corrupted somewhere along the transmission path, typically by a bad cable or transceiver. The corruption is detected by the cyclic redundancy check (CRC) that the receiver computes and compares against the frame check sequence (FCS) carried in each frame.

When the switch receives a frame, it computes its own CRC over the frame and compares the result to the FCS value in the Ethernet frame. If the two values differ, some bits were corrupted in transit, and the switch counts the frame as an RX error. In half-duplex mode, some FCS errors may be normal because of collisions. In full-duplex mode, FCS errors are not normal.
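
The compare-on-receive idea can be illustrated with the standard cksum utility. This is only an analogy: cksum computes the POSIX CRC, which is not bit-for-bit the same calculation as the Ethernet FCS, and the function names crc_of and frame_is_valid are hypothetical.

```shell
# Compute a CRC over a payload string (POSIX cksum CRC, used here only as a
# stand-in for the Ethernet FCS calculation).
crc_of() {
    printf '%s' "$1" | cksum | cut -d' ' -f1
}

# Succeed if the CRC computed over the received payload matches the CRC the
# sender attached; fail otherwise, as a receiver would for a corrupted frame.
frame_is_valid() {
    [ "$(crc_of "$1")" = "$2" ]
}
```

A sender would attach fcs=$(crc_of "payload") to the data. On the receiving side, frame_is_valid "payload" "$fcs" succeeds, while frame_is_valid "paylobd" "$fcs" fails because one byte changed in transit, which is the case the switch counts as an RX error.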

Propagation of the Errors

When a platform detects an FCS error, what it does with the Ethernet frame depends on which switching mode is configured: cut-through or store and forward. In cut-through mode, the frame with the FCS error may be propagated to the next switch. In store and forward mode, the frame with the FCS error is discarded.

Cut-through Switching Mode

The cut-through mode of forwarding is used to minimize the latency (delay) through the switch by beginning the forwarding process before the entire packet has been received from the upstream sender. The data may begin to be transmitted while it is still being received on the inbound interface which minimizes the time the packet is held in the switch and thus minimizes delays in propagation. The disadvantage is that data frames with FCS errors may be propagated to the next hop because transmission out of the switch begins before the FCS error is detected. Since the next hop switch would have begun receiving this packet with no indication of a problem with the packet, it may also begin transmitting to its outbound interface before detecting the FCS error, thereby propagating the error even further.

Store and Forward Switching Mode

As the name implies, store and forward waits until the entire packet has been received and validated before starting the transmit process on the outbound interface.  This allows the switch to verify that the received packet is valid before sending it onward, but it increases latency by holding each packet longer in buffers in the switch.  It may also increase buffer utilization by having each packet utilize the resources for a longer period of time.  If store and forward is configured, the platform is able to detect FCS errors before beginning transmission, and thus can discard the frame and not propagate the errors to the next hop.
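
To get a feel for the latency cost, note that a store and forward switch cannot start transmitting until the last bit of the frame has arrived, so it holds each frame for at least the frame's serialization time on the inbound link. The sketch below computes that time. The 10 Gb/s link speed in the example is an assumption chosen for illustration, and the function name serialization_delay_us is hypothetical.

```shell
# Print the time in microseconds needed to receive a frame of BYTES bytes
# on a link running at GBPS gigabits per second.
# bits / (GBPS * 10^9) seconds  ==  bits / (GBPS * 1000) microseconds
serialization_delay_us() {
    awk -v bytes="$1" -v gbps="$2" \
        'BEGIN { printf "%.2f\n", bytes * 8 / (gbps * 1000) }'
}
```

For example, serialization_delay_us 9000 10 prints 7.20, so a 9000-byte frame on a 10 Gb/s link is held for at least about 7.2 microseconds before store and forward can begin sending it. This is a lower bound on the added delay per hop, not a measurement of any particular platform.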

Resolution

Replace the Bad Components

The frame corruption is caused by a bad component somewhere in the data path, such as a cable or transceiver. Trace the RX errors upstream across all the hops in the end-to-end data path to find the link where the errors first appear.

You can use lldpctl to trace the ports upstream, hop-by-hop. Here is an example output:

cumulus@switch$ lldpctl 
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    eth0, via: LLDP, RID: 1, Time: 0 day, 23:36:08
  Chassis:     
    ChassisID:    mac 6c:64:1a:00:2f:54
    SysName:      backbone
    SysDescr:     Cumulus Linux version 2.5.2 running on cel kennisis
    MgmtIP:       192.168.1.5
    Capability:   Bridge, on
    Capability:   Router, on
  Port:        
    PortID:       ifname swp21
    PortDescr:    swp21
-------------------------------------------------------------------------------
Interface:    swp1, via: LLDP, RID: 5, Time: 0 day, 05:51:40
  Chassis:     
    ChassisID:    mac 08:9e:01:ce:e4:0c
    SysName:      sw23
    SysDescr:     Cumulus Linux version 2.5.2 running on quanta ly2
    MgmtIP:       192.168.2.30
    Capability:   Bridge, off
    Capability:   Router, on
  Port:        
    PortID:       ifname swp7
    PortDescr:    swp7
-------------------------------------------------------------------------------
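
When a switch has many neighbors, a one-line summary per local port makes the upstream trace easier to follow. The sketch below assumes the plain-text lldpctl layout shown in the sample above. The function name lldp_neighbors is made up for this example.

```shell
# Read lldpctl output on stdin and print one line per neighbor in the form
# "<local interface> -> <neighbor SysName> <neighbor port>".
lldp_neighbors() {
    awk '
        $1 == "Interface:" { iface = $2; sub(/,$/, "", iface) }   # "swp1," -> "swp1"
        $1 == "SysName:"   { sys = $2 }
        $1 == "PortID:"    { print iface " -> " sys " " $NF }
    '
}
```

Use it as lldpctl | lldp_neighbors. For the sample above, the swp1 entry reads "swp1 -> sw23 swp7", meaning the next place to check counters is port swp7 on sw23.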

Once you have identified the source point, replace the cable or transceiver on that link to eliminate the component introducing the data corruption.

Change the Switching Mode

While cut-through forwarding decreases latency and buffer consumption, one of its disadvantages is that packets are not verified as valid before they begin transmission on the outbound interface. Thus transmission on the output interface may begin before the FCS error is detected.

When you change from cut-through to store and forward mode, each packet is verified as correct before the forwarding process begins, which limits the reach of any corrupt packets. This verification comes at the cost of potentially increased latency and buffer consumption.

You will need to configure the switches in the data path, particularly:

  • the switches upstream from the switch with the RX errors (i.e. "previous" switches in the data path) to eliminate the RX errors on the switch in question
  • the switch showing the RX errors to prevent it from propagating the errors to the downstream switch (i.e. "next" switch in the data path)

To change the forwarding behavior from cut-through to store and forward on Trident and Trident-II based switches:

1. Open the traffic.conf file in a text editor:

cumulus@switch$ sudo vi /etc/cumulus/datapath/traffic.conf

2. Search for the following line in the traffic.conf file:

# To enable cut-through forwarding
cut_through_enable = true

3. Modify the value of cut_through_enable to false:

# To enable cut-through forwarding
cut_through_enable = false

4. To let the change in forwarding mode take effect, restart switchd.  Please note that restarting the switchd daemon is minimally disruptive.

cumulus@switch$ sudo service switchd restart
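
If you prefer not to edit the file by hand, steps 1 through 3 can be sketched as a single sed substitution. This assumes the setting appears at the start of a line as cut_through_enable = true, as shown in step 2. The function name disable_cut_through is hypothetical, and the function must run with permission to write the file (for example, from a root shell).

```shell
# Usage: disable_cut_through [PATH_TO_TRAFFIC_CONF]
# Rewrites "cut_through_enable = true" to "cut_through_enable = false",
# keeps the original file as <file>.bak, and prints the resulting setting.
disable_cut_through() {
    conf=${1:-/etc/cumulus/datapath/traffic.conf}
    sed -i.bak \
        's/^cut_through_enable[[:space:]]*=[[:space:]]*true/cut_through_enable = false/' \
        "$conf" &&
        grep '^cut_through_enable' "$conf"
}
```

The printed line should read cut_through_enable = false. The change still requires the switchd restart from step 4 to take effect, and the .bak copy lets you restore the previous setting if needed.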

Caveats and Warnings

Although these instructions describe how to change the forwarding mode on a Cumulus Linux switch, the default setting of cut-through is the recommended value in almost every circumstance. If you make this change on a switch for testing purposes, you should continue to monitor its performance.
