I am running nprobe 8.2 in collector mode. I am currently designing a
collection infrastructure, so I want to understand what nprobe is
doing internally to better understand how data is being processed. I have a
number of questions in this regard. I have read the latest version of
the user guide PDF but still have some questions. I tried to organize my
questions in blocks to hopefully allow for easier commenting on each
question. This is fairly long but I figured asking this all together, in
context, would be better. Thanks in advance to whoever takes this on - I
really appreciate it. :)
Is there any detailed documentation on what is going on internally with
nprobe? In particular, I am using it as a collector to forward UDP netflow
v9 from our Cisco routers to Kafka. I am particularly interested in
understanding some of these stats and what they imply is happening under
the hood:
19/Dec/2017 13:36:09 [nprobe.c:3202] Average traffic: [0.00 pps][All
Traffic 0 b/sec][IP Traffic 0 b/sec][ratio -nan]
19/Dec/2017 13:36:09 [nprobe.c:3210] Current traffic: [0.00 pps][0 b/sec]
19/Dec/2017 13:36:09 [nprobe.c:3216] Current flow export rate: [1818.5
flows/sec]
19/Dec/2017 13:36:09 [nprobe.c:3219] Flow drops: [export queue too
long=0][too many flows=0][ELK queue flow drops=0]
19/Dec/2017 13:36:09 [nprobe.c:3224] Export Queue: 0/512000 [0.0 %]
19/Dec/2017 13:36:09 [nprobe.c:3229] Flow Buckets:
[active=92792][allocated=92792][toBeExported=0]
19/Dec/2017 13:36:09 [nprobe.c:3235] Kafka [flows exported=366299/1818.5
flows/sec][msgs sent=366299/1.0 flows/msg][send errors=0]
19/Dec/2017 13:36:09 [nprobe.c:3260] Collector Threads: [167203 pkts@0]
19/Dec/2017 13:36:09 [nprobe.c:3052] Processed packets: 0 (max bucket
search: 8)
19/Dec/2017 13:36:09 [nprobe.c:3035] Fragment queue length: 0
19/Dec/2017 13:36:09 [nprobe.c:3061] Flow export stats: [0 bytes/0 pkts][0
flows/0 pkts sent]
19/Dec/2017 13:36:09 [nprobe.c:3068] Flow collection: [collected pkts:
167203][processed flows: 4561802]
19/Dec/2017 13:36:09 [nprobe.c:3071] Flow drop stats: [0 bytes/0 pkts][0
flows]
19/Dec/2017 13:36:09 [nprobe.c:3076] Total flow stats: [0 bytes/0 pkts][0
flows/0 pkts sent]
19/Dec/2017 13:36:09 [nprobe.c:3087] Kafka [flows exported=366299][msgs
sent=366299/1.0 flows/msg][send errors=0]
For these two stats:
Flow collection: [collected pkts: 167203][processed flows: 4561802]
Kafka [flows exported=366299][msgs sent=366299/1.0 flows/msg][send errors=0]
I am thinking they mean that 167203 UDP packets were received from routers,
comprising a total of 4561802 individual flow records. However, I see only
366299 flows exported to Kafka. So, am I correct in assuming that nprobe is
doing some internal aggregation of flow records that is essentially
squashing the 4561802 received flow records into 366299 aggregates?
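If that assumption holds, the ratios implied by the stats above can be sanity-checked directly (this is just my arithmetic on the counters, assuming they mean what I think they mean):

```python
# Sanity-check the ratios implied by the nprobe stats above, assuming the
# counters mean what I think they mean (an assumption, not confirmed).
collected_pkts  = 167203    # UDP NetFlow packets received from routers
processed_flows = 4561802   # individual flow records inside those packets
exported_flows  = 366299    # flows actually published to Kafka

# NetFlow v9 packs many flow records per datagram:
records_per_pkt = processed_flows / collected_pkts
print(f"flow records per UDP packet: {records_per_pkt:.1f}")   # ~27.3

# If nprobe is aggregating, this would be the average squash factor:
squash_factor = processed_flows / exported_flows
print(f"received records per exported flow: {squash_factor:.1f}")  # ~12.5
```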
A follow on question to this, then, is related to:
Flow Buckets: [active=92792][allocated=92792][toBeExported=0]
What are these and how are they utilized? Again, I am assuming these are
hash buckets used for internal aggregation, per the user's guide. I have
seen warnings indicating that the allotment of these buckets is too small
and to expect drops. So, my guess is that, based on flows/sec ingested, these
have to be sized appropriately to support the flow volume. Is that a
correct assumption?
I also notice that, when I start up nprobe in collector mode publishing to
Kafka, it takes about 30 or more seconds before any flows actually are
published to Kafka. This leads me to believe internal aggregations are
occurring that are delaying publishing of data. If I crank up the --verbose
to 2, I can see UDP packets being processed and then, after some time, I
start to see log messages indicating flows are being exported to Kafka. It
is not so much the latency issue I am concerned with here, but rather just
understanding what is happening so that I can properly monitor and
configure/size the system.
Do these parameters impact the utilization of the flow buckets in collector
mode, or just when running in sniffer mode? I ask because I know the
routers are already doing aggregation, meaning they accumulate counts for
flows over time before emitting a flow record for an active flow. Does this mean
that nprobe is then doing the same thing again for these flows and
essentially aggregating already aggregated flow records coming from my
routers?
[--lifetime-timeout|-t] <timeout> It specifies the maximum (seconds) flow
lifetime [default=120]
[--idle-timeout|-d] <timeout> It specifies the maximum (seconds) flow
idle lifetime [default=30]
[--queue-timeout|-l] <timeout> It specifies how long expired flows
(queued before delivery) are emitted [default=30]
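For reference, here is a hypothetical invocation sketch showing those three timeouts made explicit in collector mode. The -t/-d/-l values are just the documented defaults; the broker and topic names are placeholders, and the exact --kafka argument format should be checked against your version's --help:

```shell
# Hypothetical sketch (not from the docs): collect NetFlow v9 on UDP 2055
# and export to Kafka, with the three timeouts from the help text above
# stated explicitly at their default values.
nprobe --collector-port 2055 --lifetime-timeout 120 --idle-timeout 30 \
       --queue-timeout 30 --kafka "broker1:9092;netflow-topic"
```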
Also, based on the assumption of aggregating already aggregated data and
the type of traffic on the network I am monitoring (lots of short-lived
transactions, like credit card swipe processing by vendors and DNS
lookups), does it even make sense to have nprobe aggregating this traffic
that I know is NOT going to consist of more than one flow record anyway?
The user document does not mention anything about monitoring nprobe
programmatically. What is the best way to monitor nprobe for internal
packet drops? I can get various OS stats from /proc/xxx, like UDP queue
size, drops, etc, but I need nprobe internal stats to round out the
picture. I see that there is information like this on stdout:
Flow drops: [export queue too long=0][too many flows=0][ELK queue flow
drops=0]
However, I want to monitor my nprobe instances with Nagios and generate
alerts on threshold checks as well as track utilization over time by
posting periodic stats to our InfluxDB/Grafana setup. Is there some way
(other than parsing stdout in a log) to gain programmatic access to these
stats for monitoring tools to use?
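Lacking a documented API, the interim approach I am describing (scraping the log) could look like the sketch below. The regexes assume the exact log lines shown above from nprobe 8.x; the metric names are my own:

```python
import re

# Interim monitoring approach: scrape nprobe's periodic stats lines from its
# log and turn them into metrics for Nagios/InfluxDB. The patterns assume
# the exact log format shown above; adjust if your build differs.
DROP_RE = re.compile(
    r"Flow drops: \[export queue too long=(\d+)\]"
    r"\[too many flows=(\d+)\]\[ELK queue flow drops=(\d+)\]"
)
BUCKET_RE = re.compile(
    r"Flow Buckets:\s*\[active=(\d+)\]\[allocated=(\d+)\]\[toBeExported=(\d+)\]"
)

def parse_stats(line: str) -> dict:
    """Extract metric name/value pairs from one nprobe log line."""
    m = DROP_RE.search(line)
    if m:
        return {"export_queue_drops": int(m.group(1)),
                "too_many_flows_drops": int(m.group(2)),
                "elk_queue_drops": int(m.group(3))}
    m = BUCKET_RE.search(line)
    if m:
        return {"buckets_active": int(m.group(1)),
                "buckets_allocated": int(m.group(2)),
                "buckets_to_export": int(m.group(3))}
    return {}

line = ("19/Dec/2017 13:36:09 [nprobe.c:3219] Flow drops: "
        "[export queue too long=0][too many flows=0][ELK queue flow drops=0]")
print(parse_stats(line))
```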
Regarding Kafka, the producer has many configuration options but only very
few are exposed for configuration in nprobe. Let me ask these one by one:
1. batch.size, linger.ms, buffer.memory - These are essential to
controlling batching in Kafka. nprobe has options --kafka-enable-batch
and --kafka-batch-len. However, these end up wrapping N messages into a JSON
array of size N and publishing that to Kafka. I feel this is the wrong
approach. Consider the downstream Kafka consumer. It expects to receive a
series of messages off a topic. The format of those messages should not
change due to batching. When batching is not enabled in nprobe, the
consumer sees a series of JSON dictionaries - each a single flow record.
When batching is enabled, the consumer now sees a series of JSON arrays,
each with N JSON dictionaries. IMO, the proper way to do this is to use the
Kafka configuration values to control batching. In that case, the producer
simply queues up messages (each a dictionary) and, when configured
thresholds are met, emits those messages. This results in a batch of
dictionaries being sent and the consumer ONLY sees dictionaries. Changing
the message structure due to batching complicates things for consumers and
is not a typical pattern in Kafka processing.
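As a workaround on the consumer side, both shapes can be normalized into a flat stream of records; a minimal sketch using only the JSON structures described above:

```python
import json

def normalize(payload: bytes) -> list:
    """Return a flat list of flow-record dicts regardless of whether the
    producer sent a single JSON dictionary (no batching) or a JSON array
    of N dictionaries (--kafka-enable-batch)."""
    obj = json.loads(payload)
    if isinstance(obj, dict):
        return [obj]          # unbatched: one flow record per message
    if isinstance(obj, list):
        return obj            # batched: N flow records per message
    raise ValueError(f"unexpected payload type: {type(obj).__name__}")

# Both message shapes yield the same stream of records:
single = b'{"IPV4_SRC_ADDR": "10.0.0.1", "IN_BYTES": 500}'
batch  = b'[{"IPV4_SRC_ADDR": "10.0.0.1"}, {"IPV4_SRC_ADDR": "10.0.0.2"}]'
print(len(normalize(single)), len(normalize(batch)))  # 1 2
```

Of course, this only papers over the problem; every consumer in the pipeline has to carry this shim, which is exactly why I would prefer batching to happen at the Kafka producer level.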
2. Options topic - Your documentation does not mention this (nprobe
--help does), and I don't understand what it means. What is a Kafka options
topic?
3. Partitioning - If we want to perform stream processing of netflow data,
then we want to ensure that all flow records from a given n-tuple are
placed on the same Kafka partition. We need to partition the data because
it is the only way to scale consumers in Kafka. If I want to perform some
aggregations on the data stream then I have to be sure that all netflow
records for a given conversation, for example, are on the same topic
partition. A simple example that will make that happen would be to use the
IPV4_SRC_ADDR field of the flow record as the partition key. Or, maybe an
N-tuple of (IPV4_SRC_ADDR, IPV4_DST_ADDR, L4_SRC_PORT, L4_DST_PORT) as the
partition key. In Java, a producer would do this by hashing the string that
comprises the partition key desired then doing a hash % num-partitions to
figure out the partition to send the message on. I am guessing that nprobe
relies on the default partitioning scheme in the producer which is a simple
round-robin approach based on the number of partitions that exist for the
topic being used. This, however, would randomly distribute flow records for
a given conversation across multiple partitions and, therefore across
multiple consumers in a downstream consumer group. That would break the
aggregations. So, my request is that you consider allowing a configuration
option that enables the user to define the partition key. This might be
done, for example, by allowing the user to define a CSV list of template
fields to use to form the partition key string. You could just concatenate
them together and hash that value then modulo divide by the number of
partitions for the topic being used and use that to enable the producer to
publish on the appropriate topic partition. This gives the user the freedom
to define the partition key while keeping the implementation in nprobe fairly
generic. Maybe this could also be done via some sort of "partition plugin"
to make it even more extensible? Have you considered any such capability?
Without such a capability, we will have to initially publish all flows on,
say, a "netflow-raw" topic (using round-robin), then consume this topic in a
consumer group only to republish it, repartitioned (as described above using
some N-tuple of fields), only to then be consumed by another consumer group
that will do the aggregations and enrichments needed. Sure, we can make it
work, but partitioning should really be done at the source. The approach I
just described necessarily doubles our broker traffic, which I would not
like to have to do.
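The key derivation I am proposing is simple; a sketch, with field names taken from the NetFlow template and crc32 standing in for whatever stable hash the producer would actually use:

```python
import zlib

def partition_for(record: dict, key_fields: list, num_partitions: int) -> int:
    """Derive a Kafka partition as proposed above: concatenate the values
    of a user-defined list of template fields, hash the resulting string,
    then take the hash modulo the topic's partition count. crc32 is an
    arbitrary stand-in for the producer's actual hash function."""
    key = "|".join(str(record[f]) for f in key_fields)
    return zlib.crc32(key.encode()) % num_partitions

record = {"IPV4_SRC_ADDR": "10.1.2.3", "IPV4_DST_ADDR": "8.8.8.8",
          "L4_SRC_PORT": 52344, "L4_DST_PORT": 53}
fields = ["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "L4_SRC_PORT", "L4_DST_PORT"]
p = partition_for(record, fields, 12)
# Every record of this conversation maps to the same partition:
assert p == partition_for(dict(record, IN_BYTES=999), fields, 12)
```

One caveat with the raw 4-tuple: the reverse direction of a conversation hashes to a different partition, so for bidirectional affinity the tuple would need to be canonicalized (e.g. sort the endpoints) before hashing.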
4. Producer Options in General - Why not just make them all
configurable? For example, allow the user to define a name=value config
file using any supported producer configuration options and provide the
path to the file as an nprobe Kafka configuration option. Then, when you
instantiate the producer in nprobe, read in those configuration values and
pass them into the producer. This gives the users access to all options
available and not just the current topic, acks, and compression values.
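For what it's worth, the mechanism I have in mind is a plain pass-through of name=value pairs; something along these lines (file format and option names are illustrative):

```python
def load_producer_config(path: str) -> dict:
    """Parse a simple name=value file of Kafka producer options
    (e.g. batch.size=65536, linger.ms=50) so nprobe could pass them
    straight through to the producer it instantiates. Blank lines and
    '#' comments are skipped; values are left as strings."""
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, value = line.partition("=")
            config[name.strip()] = value.strip()
    return config
```

nprobe would then hand the resulting dictionary to the producer unmodified, so any current or future producer option works without nprobe needing to know about it.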
Miscellaneous Notes:
1. The v8.1 user's guide lists "New Options --kafka-enable-batch and
--kafka-batch-len to batch flow export to kafka" but does not provide
any detailed documentation on these. It looks like someone forgot to add
the description of these options later in the document.
2. nprobe --help shows this under the Kafka options: "<options topic>
Flow options topic", but the v8.1 user's guide makes no mention of it. I
have no idea what an options topic is.