I am running nprobe 8.2 in collector mode. I am currently designing a
collection infrastructure, so I want to understand what nprobe is
doing internally to better understand how data is being processed. I have a
number of questions in this regard. I have read the latest version of
the user guide PDF but still have some questions. I tried to organize my
questions in blocks to hopefully allow for easier commenting on each
question. This is fairly long but I figured asking this all together, in
context, would be better. Thanks in advance to whoever takes this on - I
really appreciate it. :)
Is there any detailed documentation on what is going on internally with
nprobe? In particular, I am using it as a collector to forward UDP netflow
v9 from our Cisco routers to Kafka. I am particularly interested in
understanding some of these stats and what they imply is happening under
the hood:
19/Dec/2017 13:36:09 [nprobe.c:3202] Average traffic: [0.00 pps][All
Traffic 0 b/sec][IP Traffic 0 b/sec][ratio -nan]
19/Dec/2017 13:36:09 [nprobe.c:3210] Current traffic: [0.00 pps][0 b/sec]
19/Dec/2017 13:36:09 [nprobe.c:3216] Current flow export rate: [1818.5
flows/sec]
19/Dec/2017 13:36:09 [nprobe.c:3219] Flow drops: [export queue too
long=0][too many flows=0][ELK queue flow drops=0]
19/Dec/2017 13:36:09 [nprobe.c:3224] Export Queue: 0/512000 [0.0 %]
19/Dec/2017 13:36:09 [nprobe.c:3229] Flow Buckets:
[active=92792][allocated=92792][toBeExported=0]
19/Dec/2017 13:36:09 [nprobe.c:3235] Kafka [flows exported=366299/1818.5
flows/sec][msgs sent=366299/1.0 flows/msg][send errors=0]
19/Dec/2017 13:36:09 [nprobe.c:3260] Collector Threads: [167203 pkts@0]
19/Dec/2017 13:36:09 [nprobe.c:3052] Processed packets: 0 (max bucket
search: 8)
19/Dec/2017 13:36:09 [nprobe.c:3035] Fragment queue length: 0
19/Dec/2017 13:36:09 [nprobe.c:3061] Flow export stats: [0 bytes/0 pkts][0
flows/0 pkts sent]
19/Dec/2017 13:36:09 [nprobe.c:3068] Flow collection: [collected pkts:
167203][processed flows: 4561802]
19/Dec/2017 13:36:09 [nprobe.c:3071] Flow drop stats: [0 bytes/0 pkts][0
flows]
19/Dec/2017 13:36:09 [nprobe.c:3076] Total flow stats: [0 bytes/0 pkts][0
flows/0 pkts sent]
19/Dec/2017 13:36:09 [nprobe.c:3087] Kafka [flows exported=366299][msgs
sent=366299/1.0 flows/msg][send errors=0]
For these two stats:
Flow collection: [collected pkts: 167203][processed flows: 4561802]
Kafka [flows exported=366299][msgs sent=366299/1.0 flows/msg][send errors=0]
I am thinking they mean that 167203 UDP packets were received from routers,
comprising a total of 4561802 individual flow records. However, I see only
366299 flows exported to Kafka. So, am I correct in assuming that nprobe is
doing some internal aggregation of flow records that is essentially
squashing the 4561802 received flow records into 366299 aggregates?
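If that assumption holds, the ratios implied by the stats above can be sanity-checked directly (this is just my arithmetic on the counters, assuming they mean what I think they mean):

```python
# Sanity-check the ratios implied by the nprobe stats above, assuming the
# counters mean what I think they mean (an assumption, not confirmed).
collected_pkts  = 167203    # UDP NetFlow packets received from routers
processed_flows = 4561802   # individual flow records inside those packets
exported_flows  = 366299    # flows actually published to Kafka

# NetFlow v9 packs many flow records per datagram:
records_per_pkt = processed_flows / collected_pkts
print(f"flow records per UDP packet: {records_per_pkt:.1f}")   # ~27.3

# If nprobe is aggregating, this would be the average squash factor:
squash_factor = processed_flows / exported_flows
print(f"received records per exported flow: {squash_factor:.1f}")  # ~12.5
```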
A follow on question to this, then, is related to:
Flow Buckets: [active=92792][allocated=92792][toBeExported=0]
What are these and how are they utilized? Again, I am assuming these are
hash buckets used for internal aggregation, per the user's guide. I have
seen warnings indicating that the allotment of these buckets is too small
and to expect drops. So, my guess is that, based on flows/sec ingested, these
have to be sized appropriately to support the flow volume. Is that a
correct assumption?
I also notice that, when I start up nprobe in collector mode publishing to
Kafka, it takes about 30 or more seconds before any flows actually are
published to Kafka. This leads me to believe internal aggregations are
occurring that are delaying publishing of data. If I crank up the --verbose
to 2, I can see UDP packets being processed and then, after some time, I
start to see log messages indicating flows are being exported to Kafka. It
is not so much the latency issue I am concerned with here, but rather just
understanding what is happening so that I can properly monitor and
configure/size the system.
Do these parameters impact the utilization of the flow buckets in collector
mode, or just when running in sniffer mode? I ask because I know the
routers are already doing aggregation, meaning they accumulate counts for
flows over time before emitting a flow record for an active flow. Does this mean
that nprobe is then doing the same thing again for these flows and
essentially aggregating already aggregated flow records coming from my
routers?
[--lifetime-timeout|-t] <timeout> It specifies the maximum (seconds) flow
lifetime [default=120]
[--idle-timeout|-d] <timeout> It specifies the maximum (seconds) flow
idle lifetime [default=30]
[--queue-timeout|-l] <timeout> It specifies how long expired flows
(queued before delivery) are emitted [default=30]
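For reference, here is a hypothetical invocation sketch showing those three timeouts made explicit in collector mode. The -t/-d/-l values are just the documented defaults; the broker and topic names are placeholders, and the exact --kafka argument format should be checked against your version's --help:

```shell
# Hypothetical sketch (not from the docs): collect NetFlow v9 on UDP 2055
# and export to Kafka, with the three timeouts from the help text above
# stated explicitly at their default values.
nprobe --collector-port 2055 --lifetime-timeout 120 --idle-timeout 30 \
       --queue-timeout 30 --kafka "broker1:9092;netflow-topic"
```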
Also, based on the assumption of aggregating already aggregated data and
the type of traffic on the network I am monitoring (lots of short-lived
transactions, like credit card swipe processing by vendors and DNS
lookups), does it even make sense to have nprobe aggregating this traffic
that I know is NOT going to consist of more than one flow record anyway?
The user document does not mention anything about monitoring nprobe
programmatically. What is the best way to monitor nprobe for internal
packet drops? I can get various OS stats from /proc/xxx, like UDP queue
size, drops, etc, but I need nprobe internal stats to round out the
picture. I see that there is information like this on stdout:
Flow drops: [export queue too long=0][too many flows=0][ELK queue flow
drops=0]
However, I want to monitor my nprobe instances with Nagios and generate
alerts on threshold checks as well as track utilization over time by
posting periodic stats to our InfluxDB/Grafana setup. Is there some way
(other than parsing stdout in a log) to gain programmatic access to these
stats for monitoring tools to use?
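Lacking a documented API, the interim approach I am describing (scraping the log) could look like the sketch below. The regexes assume the exact log lines shown above from nprobe 8.x; the metric names are my own:

```python
import re

# Interim monitoring approach: scrape nprobe's periodic stats lines from its
# log and turn them into metrics for Nagios/InfluxDB. The patterns assume
# the exact log format shown above; adjust if your build differs.
DROP_RE = re.compile(
    r"Flow drops: \[export queue too long=(\d+)\]"
    r"\[too many flows=(\d+)\]\[ELK queue flow drops=(\d+)\]"
)
BUCKET_RE = re.compile(
    r"Flow Buckets:\s*\[active=(\d+)\]\[allocated=(\d+)\]\[toBeExported=(\d+)\]"
)

def parse_stats(line: str) -> dict:
    """Extract metric name/value pairs from one nprobe log line."""
    m = DROP_RE.search(line)
    if m:
        return {"export_queue_drops": int(m.group(1)),
                "too_many_flows_drops": int(m.group(2)),
                "elk_queue_drops": int(m.group(3))}
    m = BUCKET_RE.search(line)
    if m:
        return {"buckets_active": int(m.group(1)),
                "buckets_allocated": int(m.group(2)),
                "buckets_to_export": int(m.group(3))}
    return {}

line = ("19/Dec/2017 13:36:09 [nprobe.c:3219] Flow drops: "
        "[export queue too long=0][too many flows=0][ELK queue flow drops=0]")
print(parse_stats(line))
```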
Regarding Kafka, the producer has many configuration options but only very
few are exposed for configuration in nprobe. Let me ask these one by one:
1. batch.size, linger.ms, buffer.memory - These are essential to
controlling batching in Kafka. nprobe has options --kafka-enable-batch
and --kafka-batch-len. However, these end up wrapping N messages into a JSON
array of size N and publishing that to Kafka. I feel this is the wrong
approach. Consider the downstream Kafka consumer. It expects to receive a
series of messages off a topic. The format of those messages should not
change due to batching. When batching is not enabled in nprobe, the
consumer sees a series of JSON dictionaries - each a single flow record.
When batching is enabled, the consumer now sees a series of JSON arrays,
each with N JSON dictionaries. IMO, the proper way to do this is to use the
Kafka configuration values to control batching. In that case, the producer
simply queues up messages (each a dictionary) and, when configured
thresholds are met, emits those messages. This results in a batch of
dictionaries being sent and the consumer ONLY sees dictionaries. Changing
the message structure due to batching complicates things for consumers and
is not a typical pattern in Kafka processing.
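As a workaround on the consumer side, both shapes can be normalized into a flat stream of records; a minimal sketch using only the JSON structures described above:

```python
import json

def normalize(payload: bytes) -> list:
    """Return a flat list of flow-record dicts regardless of whether the
    producer sent a single JSON dictionary (no batching) or a JSON array
    of N dictionaries (--kafka-enable-batch)."""
    obj = json.loads(payload)
    if isinstance(obj, dict):
        return [obj]          # unbatched: one flow record per message
    if isinstance(obj, list):
        return obj            # batched: N flow records per message
    raise ValueError(f"unexpected payload type: {type(obj).__name__}")

# Both message shapes yield the same stream of records:
single = b'{"IPV4_SRC_ADDR": "10.0.0.1", "IN_BYTES": 500}'
batch  = b'[{"IPV4_SRC_ADDR": "10.0.0.1"}, {"IPV4_SRC_ADDR": "10.0.0.2"}]'
print(len(normalize(single)), len(normalize(batch)))  # 1 2
```

Of course, this only papers over the problem; every consumer in the pipeline has to carry this shim, which is exactly why I would prefer batching to happen at the Kafka producer level.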
2. Options topic - Your documentation does not mention this (nprobe
--help does), and I don't understand what it means. What is a Kafka options
topic?
3. Partitioning - If we want to perform stream processing of netflow data,
then we want to ensure that all flow records from a given n-tuple are
placed on the same Kafka partition. We need to partition the data because
it is the only way to scale consumers in Kafka. If I want to perform some
aggregations on the data stream then I have to be sure that all netflow
records for a given conversation, for example, are on the same topic
partition. A simple example that will make that happen would be to use the
IPV4_SRC_ADDR field of the flow record as the partition key. Or, maybe an
N-tuple of (IPV4_SRC_ADDR, IPV4_DST_ADDR, L4_SRC_PORT, L4_DST_PORT) as the
partition key. In Java, a producer would do this by hashing the string that
comprises the partition key desired then doing a hash % num-partitions to
figure out the partition to send the message on. I am guessing that nprobe
relies on the default partitioning scheme in the producer which is a simple
round-robin approach based on the number of partitions that exist for the
topic being used. This, however, would randomly distribute flow records for
a given conversation across multiple partitions and, therefore across
multiple consumers in a downstream consumer group. That would break the
aggregations. So, my request is that you consider allowing a configuration
option that enables the user to define the partition key. This might be
done, for example, by allowing the user to define a CSV list of template
fields to use to form the partition key string. You could just concatenate
them together and hash that value then modulo divide by the number of
partitions for the topic being used and use that to enable the producer to
publish on the appropriate topic partition. This gives the user the freedom
to define the partition key while keeping the implementation in nprobe fairly
generic. Maybe this could also be done via some sort of "partition plugin"
to make it even more extensible? Have you considered any such capability?
Without such a capability, we will have to initially publish all flows on,
say, a "netflow-raw" topic (using round-robin), then consume this topic in a
consumer group only to republish it, repartitioned (as described above using
some N-tuple of fields), only to then be consumed by another consumer group
that will do the aggregations and enrichments needed. Sure, we can make it
work, but partitioning should really be done at the source. The approach I
just described necessarily doubles our broker traffic, which I would not
like to have to do.
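The key derivation I am proposing is simple; a sketch, with field names taken from the NetFlow template and crc32 standing in for whatever stable hash the producer would actually use:

```python
import zlib

def partition_for(record: dict, key_fields: list, num_partitions: int) -> int:
    """Derive a Kafka partition as proposed above: concatenate the values
    of a user-defined list of template fields, hash the resulting string,
    then take the hash modulo the topic's partition count. crc32 is an
    arbitrary stand-in for the producer's actual hash function."""
    key = "|".join(str(record[f]) for f in key_fields)
    return zlib.crc32(key.encode()) % num_partitions

record = {"IPV4_SRC_ADDR": "10.1.2.3", "IPV4_DST_ADDR": "8.8.8.8",
          "L4_SRC_PORT": 52344, "L4_DST_PORT": 53}
fields = ["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "L4_SRC_PORT", "L4_DST_PORT"]
p = partition_for(record, fields, 12)
# Every record of this conversation maps to the same partition:
assert p == partition_for(dict(record, IN_BYTES=999), fields, 12)
```

One caveat with the raw 4-tuple: the reverse direction of a conversation hashes to a different partition, so for bidirectional affinity the tuple would need to be canonicalized (e.g. sort the endpoints) before hashing.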
4. Producer Options in General - Why not just make them all
configurable? For example, allow the user to define a name=value config
file using any supported producer configuration options and provide the
path to the file as an nprobe Kafka configuration option. Then, when you
instantiate the producer in nprobe, read in those configuration values and
pass them into the producer. This gives the users access to all options
available and not just the current topic, acks, and compression values.
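For what it's worth, the mechanism I have in mind is a plain pass-through of name=value pairs; something along these lines (file format and option names are illustrative):

```python
def load_producer_config(path: str) -> dict:
    """Parse a simple name=value file of Kafka producer options
    (e.g. batch.size=65536, linger.ms=50) so nprobe could pass them
    straight through to the producer it instantiates. Blank lines and
    '#' comments are skipped; values are left as strings."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, value = line.partition("=")
            config[name.strip()] = value.strip()
    return config
```

nprobe would then hand the resulting dictionary to the producer unmodified, so any current or future producer option works without nprobe needing to know about it.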
Miscellaneous Notes:
1. The v8.1 user's guide lists "New Options --kafka-enable-batch and
--kafka-batch-len to batch flow export to kafka" but does not provide
any detailed documentation on these. It looks like someone forgot to add
the description of these options later in the document.
2. nprobe --help shows this under the Kafka options: "<options topic>
Flow options topic", but the v8.1 user's guide makes no mention of it. I
have no idea what an options topic is.