On Linux hypervisors such as KVM or Xen with Intel 10GbE NICs, a server can see throughput well below line rate (2-3Gbps) when VXLAN is in use and the hypervisor or dom0 acts as the VXLAN endpoint.
This problem is not caused by Cumulus Linux, but it comes up often when Cumulus Linux is deployed in the data center. It is a problem with specific Intel NICs on the host/server side when VXLAN is used, and it affects Linux-based offerings with Intel 82599, X520 and X540 adapters.
The cause is a specific design decision made by the Intel ixgbe driver maintainers. The Intel 10GbE hardware can perform RSS (Receive Side Scaling), which spreads the work of packet reception across multiple CPUs/queues.
When a fragmented UDP frame arrives at the host, it is delivered to CPU/queue 0 rather than to any of the other CPUs/queues. Because combining this with RSS for the remaining UDP frames may result in out-of-order frames (which is especially bad when streaming video), Intel decided not to perform RSS on any UDP traffic by default. VXLAN encapsulates traffic in UDP, so all VXLAN traffic is received on a single CPU/queue, which limits throughput. TCP traffic does not have this restriction by default. Since STT (Stateless Transport Tunneling), another network overlay protocol, has a header that looks like TCP, STT-encapsulated frames are classified as TCP by the Intel hardware and RSS is performed, so there is no loss of throughput.
The fix is easy as long as the hypervisor or dom0 kernel and ethtool are recent enough to contain the patches that make this setting configurable.
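Before changing anything, it can help to see which fields the NIC currently hashes on for UDP. The sketch below assumes that `ethtool -n [device] rx-flow-hash udp4` lists the hash fields one per line and mentions "L4 bytes" when ports are included; the exact wording may differ between ethtool versions. The helper name `udp_rss_uses_ports` and the interface name eth0 are illustrative only.

```shell
# Succeeds if the reported hash fields include the L4 (port) bytes.
# Expects the output of `ethtool -n <device> rx-flow-hash udp4` on stdin.
# Assumption: ethtool prints lines containing "L4 bytes" when ports are hashed.
udp_rss_uses_ports() {
    grep -q 'L4 bytes'
}

# Example (eth0 is a placeholder interface name):
# ethtool -n eth0 rx-flow-hash udp4 | udp_rss_uses_ports \
#     && echo "UDP RSS already hashes on ports" \
#     || echo "UDP RSS hashes on IP addresses only"
```

If only the source and destination IP addresses are listed, traffic between two VXLAN endpoints hashes to the same queue regardless of the inner flow.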
At a minimum, the ethtool change below must be applied. If throughput does not change, also check the dom0 vCPU assignment described afterward.
Run the following command on the hypervisor or dom0 on each boot:
ethtool -N [device] rx-flow-hash udp4 sdfn
Expect to see output like this after running the command:
enabling UDP RSS: fragmented packets may arrive out of order to the stack above
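Since the setting does not persist across reboots, one option is a small script called from whatever boot mechanism the hypervisor or dom0 uses. The following is a minimal sketch, not a tested deployment script: the function names, the `ETHTOOL` override and the interface names are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: apply the UDP RSS setting to a list of interfaces at boot.
# ETHTOOL can be overridden (for example ETHTOOL=echo) for a dry run.
ETHTOOL="${ETHTOOL:-ethtool}"

# $1 = interface name
enable_udp_rss() {
    "$ETHTOOL" -N "$1" rx-flow-hash udp4 sdfn
}

# Apply the setting to every interface given as an argument,
# reporting failures without stopping at the first one.
enable_udp_rss_all() {
    for dev in "$@"; do
        enable_udp_rss "$dev" || echo "failed to set rx-flow-hash on $dev" >&2
    done
}

# Example call from a boot script (eth0 and eth1 are placeholders):
# enable_udp_rss_all eth0 eth1
```

Setting `ETHTOOL=echo` before calling the functions prints the commands instead of running them, which is a cheap way to check the interface list first.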
This ethtool option removes the default restriction placed on UDP traffic and spreads this traffic across all CPUs in the hypervisor or dom0. To check network device statistics, run:
ethtool -S [device]
This command is a good way to confirm that all receive queues are in use during a multi-stream test, or to verify that they were not in use before the rx-flow-hash settings were changed with ethtool.
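The statistics output is long, so filtering it down to the per-queue receive counters makes the comparison easier. The sketch below assumes the ixgbe counter naming `rx_queue_N_packets`; other drivers name their counters differently. The helper name and eth0 are illustrative.

```shell
# Count how many receive queues report a non-zero packet counter.
# Expects `ethtool -S <device>` output on stdin.
# Assumption: per-queue counters are named rx_queue_<N>_packets (ixgbe style).
active_rx_queues() {
    awk -F': *' '/rx_queue_[0-9]+_packets/ && $2 > 0 { n++ } END { print n + 0 }'
}

# Example (eth0 is a placeholder interface name):
# ethtool -S eth0 | active_rx_queues
```

These counters are cumulative, so a meaningful comparison takes one snapshot before the multi-stream test and one after it, then looks at which queues actually increased.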
If throughput does not change in a multi-stream test, make sure that more than one vCPU is assigned to the dom0. A guide to changing these settings on Citrix XenServer 6.2 can be found at http://support.citrix.com/article/CTX139714.
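A quick way to see how many CPUs the dom0 kernel has available is to count the processor entries in /proc/cpuinfo from inside dom0. This is a rough sketch that assumes the x86 cpuinfo layout, where each CPU has a line starting with "processor"; the helper name is illustrative.

```shell
# Count the CPUs visible to the running kernel (the dom0 vCPUs when run in dom0).
# An alternate cpuinfo path can be passed as $1, mainly for testing.
visible_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Example:
# [ "$(visible_cpus)" -gt 1 ] || echo "only one vCPU visible, RSS has nowhere to spread"
```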