Hello list,
We're using nProbe to export flow information to Kafka. We capture from two
10Gb interfaces, merge them with zbalance_ipc, and split the traffic into 16
queues feeding 16 nProbe instances.
The problem is that zbalance_ipc reports about 40% packet drops, so it looks
like nProbe cannot keep up with reading and processing all the traffic. CPU
usage is very high, and the load average sits around 25-30.
Merging both interfaces we see up to 5.5 Gbps and 1.2 million packets per
second; we're using the i40e_zc driver.
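For context (a rough back-of-the-envelope figure, assuming zbalance_ipc spreads
the load evenly), that works out to about 5.5 Gbps / 16 ≈ 340 Mbps and
1.2 Mpps / 16 ≈ 75 kpps per nProbe instance.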
Do you have any advice to improve this performance?
Is it expected to see packet drops with this amount of traffic, i.e., are we
hitting the server's limits? Or is there any configuration we could tune to
improve it?
Thanks in advance.
-- System:
nProbe: nProbe v.8.5.180625 (r6185)
System RAM: 64GB
System CPU: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 6 cores / 12 threads
(2 threads per core)
System OS: CentOS Linux release 7.4.1708 (Core)
Linux Kernel: 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC
2018 x86_64 x86_64 x86_64 GNU/Linux
-- zbalance configuration:
zbalance_ipc -i p2p1,p2p2 -c 1 -n 16 -m 4 -a -p -l /var/tmp/zbalance.log -v -w
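Each of the 16 queues created above is read by one nProbe instance using the
zc:<cluster id>@<queue id> notation. The instances share the configuration
shown below and differ only in the queue index and the pid/stats file names;
roughly (the config file paths here are just illustrative):

  nprobe /etc/nprobe/nprobe-zc1-00.conf   # --interface=zc:1@0
  nprobe /etc/nprobe/nprobe-zc1-01.conf   # --interface=zc:1@1
  ...
  nprobe /etc/nprobe/nprobe-zc1-15.conf   # --interface=zc:1@15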
-- nProbe configuration:
--interface=zc:1@0
--pid-file=/var/run/nprobe-zc1-00.pid
--dump-stats=/var/log/nprobe/zc1-00_flows_stats.txt
--kafka "192.168.0.1:9092,192.168.0.2:9092,192.168.0.3:9092;topic"
--collector=none
--idle-timeout=60
--snaplen=128
--aggregation=0/1/1/1/0/0/0
--all-collectors=0
--verbose=1
--dump-format=t
--vlanid-as-iface-idx=none
--hash-size=1024000
--flow-delay=1
--count-delay=10
--min-flow-size=0
--netflow-engine=0:0
--sample-rate=1:1
--as-list=/usr/share/ntopng/httpdocs/geoip/GeoIPASNum.dat
--city-list=/usr/share/ntopng/httpdocs/geoip/GeoLiteCity.dat
--flow-templ="%IPV4_SRC_ADDR %IPV4_DST_ADDR %IN_PKTS %IN_BYTES %OUT_PKTS
%OUT_BYTES %FIRST_SWITCHED %LAST_SWITCHED %L4_SRC_PORT %L4_DST_PORT
%TCP_FLAGS %PROTOCOL %SRC_TOS %SRC_AS %DST_AS %L7_PROTO %L7_PROTO_NAME
%SRC_IP_COUNTRY %SRC_IP_CITY %SRC_IP_LONG %SRC_IP_LAT %DST_IP_COUNTRY
%DST_IP_CITY %DST_IP_LONG %DST_IP_LAT %SRC_VLAN %DST_VLAN %DOT1Q_SRC_VLAN
%DOT1Q_DST_VLAN %DIRECTION %SSL_SERVER_NAME %SRC_AS_MAP %DST_AS_MAP
%HTTP_METHOD %HTTP_RET_CODE %HTTP_REFERER %HTTP_UA %HTTP_MIME %HTTP_HOST
%HTTP_SITE %UPSTREAM_TUNNEL_ID %UPSTREAM_SESSION_ID %DOWNSTREAM_TUNNEL_ID
%DOWNSTREAM_SESSION_ID %UNTUNNELED_PROTOCOL %UNTUNNELED_IPV4_SRC_ADDR
%UNTUNNELED_L4_SRC_PORT %UNTUNNELED_IPV4_DST_ADDR %UNTUNNELED_L4_DST_PORT
%GTPV2_REQ_MSG_TYPE %GTPV2_RSP_MSG_TYPE %GTPV2_C2S_S1U_GTPU_TEID
%GTPV2_C2S_S1U_GTPU_IP %GTPV2_S2C_S1U_GTPU_TEID %GTPV2_S5_S8_GTPC_TEID
%GTPV2_S2C_S1U_GTPU_IP %GTPV2_C2S_S5_S8_GTPU_TEID
%GTPV2_S2C_S5_S8_GTPU_TEID %GTPV2_C2S_S5_S8_GTPU_IP
%GTPV2_S2C_S5_S8_GTPU_IP %GTPV2_END_USER_IMSI %GTPV2_END_USER_MSISDN
%GTPV2_APN_NAME %GTPV2_ULI_MCC %GTPV2_ULI_MNC %GTPV2_ULI_CELL_TAC
%GTPV2_ULI_CELL_ID %GTPV2_RESPONSE_CAUSE %GTPV2_RAT_TYPE %GTPV2_PDN_IP
%GTPV2_END_USER_IMEI %GTPV2_C2S_S5_S8_GTPC_IP %GTPV2_S2C_S5_S8_GTPC_IP
%GTPV2_C2S_S5_S8_SGW_GTPU_TEID %GTPV2_S2C_S5_S8_SGW_GTPU_TEID
%GTPV2_C2S_S5_S8_SGW_GTPU_IP %GTPV2_S2C_S5_S8_SGW_GTPU_IP
%GTPV1_REQ_MSG_TYPE %GTPV1_RSP_MSG_TYPE %GTPV1_C2S_TEID_DATA
%GTPV1_C2S_TEID_CTRL %GTPV1_S2C_TEID_DATA %GTPV1_S2C_TEID_CTRL
%GTPV1_END_USER_IP %GTPV1_END_USER_IMSI %GTPV1_END_USER_MSISDN
%GTPV1_END_USER_IMEI %GTPV1_APN_NAME %GTPV1_RAT_TYPE %GTPV1_RAI_MCC
%GTPV1_RAI_MNC %GTPV1_RAI_LAC %GTPV1_RAI_RAC %GTPV1_ULI_MCC %GTPV1_ULI_MNC
%GTPV1_ULI_CELL_LAC %GTPV1_ULI_CELL_CI %GTPV1_ULI_SAC %GTPV1_RESPONSE_CAUSE
%SRC_FRAGMENTS %DST_FRAGMENTS %CLIENT_NW_LATENCY_MS %SERVER_NW_LATENCY_MS
%APPL_LATENCY_MS %RETRANSMITTED_IN_BYTES %RETRANSMITTED_IN_PKTS
%RETRANSMITTED_OUT_BYTES %RETRANSMITTED_OUT_PKTS %OOORDER_IN_PKTS
%OOORDER_OUT_PKTS %FLOW_ACTIVE_TIMEOUT %FLOW_INACTIVE_TIMEOUT %MIN_TTL
%MAX_TTL %IN_SRC_MAC %OUT_DST_MAC %PACKET_SECTION_OFFSET %FRAME_LENGTH
%SRC_TO_DST_MAX_THROUGHPUT %SRC_TO_DST_MIN_THROUGHPUT
%SRC_TO_DST_AVG_THROUGHPUT %DST_TO_SRC_MAX_THROUGHPUT
%DST_TO_SRC_MIN_THROUGHPUT %DST_TO_SRC_AVG_THROUGHPUT
%NUM_PKTS_UP_TO_128_BYTES %NUM_PKTS_128_TO_256_BYTES
%NUM_PKTS_256_TO_512_BYTES %NUM_PKTS_512_TO_1024_BYTES
%NUM_PKTS_1024_TO_1514_BYTES %NUM_PKTS_OVER_1514_BYTES %LONGEST_FLOW_PKT
%SHORTEST_FLOW_PKT %NUM_PKTS_TTL_EQ_1 %NUM_PKTS_TTL_2_5 %NUM_PKTS_TTL_5_32
%NUM_PKTS_TTL_32_64 %NUM_PKTS_TTL_64_96 %NUM_PKTS_TTL_96_128
%NUM_PKTS_TTL_128_160 %NUM_PKTS_TTL_160_192 %NUM_PKTS_TTL_192_224
%NUM_PKTS_TTL_224_255 %DURATION_IN %DURATION_OUT %TCP_WIN_MIN_IN
%TCP_WIN_MAX_IN %TCP_WIN_MSS_IN %TCP_WIN_SCALE_IN %TCP_WIN_MIN_OUT
%TCP_WIN_MAX_OUT %TCP_WIN_MSS_OUT %TCP_WIN_SCALE_OUT"
--flow-version=9
--tunnel
--smart-udp-frags
--
Regards,
David Notivol
dnotivol@gmail.com