Mailing List Archive

Anyone running a cluster on K8s?
I’ve been more and more moving things over to K8s from Docker, and just wondering if anyone is running a stateful set, i.e. I only want 1 server to run freshclam, but use the same defs for all other clamd daemons.

I’m assuming I can just put Example in freshclam.conf, and send a clamdscan --reload to the service to hit them all?

Any guidance would be appreciated.

Sincerely,

Eric Tykwinski
TrueNet, Inc.
P: 610-429-8300
Re: Anyone running a cluster on K8s?
Hi there,

On Mon, 12 Sep 2022, Eric Tykwinski via clamav-users wrote:

> I’ve been more and more moving things over to K8s from Docker ...

Could you explain that a bit more for me? My understanding was that
Kubernetes and Docker were more than a little bit complementary. [1]

Disclaimer: I've never actually used any of this new-fangled stuff [2]
and I'm wondering if we might be able to help each other here.

> just wondering if anyone is running a stateful set, i.e. I only want 1
> server to run freshclam, but use the same defs for all other clamd

Maybe if I give you my understanding of how things hang together for
clamd it will help.

As far as clamd is concerned, the signature database is read-only. It
resides in a single directory. For the 'official' signatures that's
at least three files - main, daily and bytecode - but if you have e.g.
third-party signatures and/or your own Yara rules in the database, it
can be many; there are 82 at the moment in our own database directory.

On startup the clamd daemon reads the whole thing and builds in-memory
a somewhat optimized representation of what it's found. The in-memory
representation is, as far as the engine is concerned, itself also
read-only. It will consume on the order of a gigabyte, so it can take
a while to build, during which time the engine can't scan anything. [3]

The freshclam utility is normally what changes files in the database
directory, but third-party tools exist which also do that and you can
even do it manually if you wish. I used to do that all the time for
my Yara rules, but after years of pain I've given up on ClamAV's Yara
implementation and now use a separate Yara engine with separate rules.
It's much more efficient, easier to work with, and I haven't yet found
anything in Yara 4.2.2 which behaves other than exactly as documented.
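Coming back to freshclam: for context, its side of this is just a few
directives in freshclam.conf. The directive names below are real, but
the paths and values are only illustrative, so check them against your
own installation:

```
# freshclam.conf -- sketch; paths and values are illustrative
DatabaseDirectory /var/lib/clamav    # where main/daily/bytecode are written
Checks 12                            # update checks per day
NotifyClamd /etc/clamav/clamd.conf   # send RELOAD to the local clamd after an update
```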

> I’m assuming I can just put Example in freshclam.conf, and send a
> clamdscan --reload to the service to hit them all?

After the database has been read, clamd does not read any of the files
again until it's time to reload the whole thing. This can be because
clamd itself detects some change in one of the files in the directory
(there's an internal timeout specified in clamd's configuration file)
or because an instruction is sent to clamd via the socket on which it
is listening. You can command a 'RELOAD' using clamdscan or simply by
sending the command to the socket, e.g. using 'telnet' or 'socat' from
the command line after modifying signature files. After an update,
freshclam itself sends the command if so configured. All methods of
causing a reload have exactly the same effect. They aren't mutually
exclusive, but I don't know what might happen if you tried to use two
of them at once. :)
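As a sketch of the "send the command to the socket" route: clamd
listens on TCP port 3310 by default when TCPSocket is configured, and
answers RELOAD with the string "RELOADING". The host below is
illustrative; clamdscan --reload does the same job via whatever socket
it is configured to use.

```python
import socket

def clamd_command(host: str, port: int, command: str, timeout: float = 10.0) -> str:
    """Send one newline-terminated command (e.g. PING, VERSION, RELOAD)
    to a clamd TCP socket and return its reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # clamd closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii").strip()

# e.g. clamd_command("clamd.example.internal", 3310, "RELOAD")
```

To hit every clamd in a cluster you would loop this over each daemon's
address, which is exactly the "one freshclam, many clamds" shape being
asked about.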

I haven't thought about what advantage if any might result from using
multiple clamd daemons running in containers as compared with running
more threads on a single clamd server. My gut feel is that it would
probably be more efficient just to run more threads. By now I guess
that's a pretty well tested approach; there have been issues with
containers, but I'm not well informed about them. The issues on GitHub
are probably the best place to look for that kind of thing.
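If you do go the more-threads route, that's a single directive in
clamd.conf; the value below is only an example, and the right number
depends on your workload and memory:

```
# clamd.conf -- sketch; value is only an example
MaxThreads 20    # number of concurrent scanning threads
```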

Now I'm going into the realms of conjecture. Because you might have
multiple clamd daemons running on a single host sharing resources with
some containerization method, and because the in-memory representation
of the signature database is AFAIK read-only, it stands to reason that
you might carry the abstraction beyond containerization, sharing memory
between all the daemons. I'd bet that would take serious coding if it
were to be done explicitly, but you might be able to get it to work by
accident (almost) in a container environment. The problem I see is
that each time an instance of clamd reloads its database it will write
all over its in-memory representation and mess up any optimizations of
the copy-on-write variety that the OS has probably already done. But
fundamentally, as far as the scanner is concerned, the ruleset is just
a pointer to a structure.

[1] https://containerjournal.com/topics/container-ecosystems/kubernetes-vs-docker-a-primer/
[2] I use VMs. I feel I don't know nearly enough about containers to use them safely. [4]
[3] For this reason a recent improvement has been that the engine can have
a second, entirely separate in-memory representation, which it can build
while using the first for scanning. This means that while it is building
the second it can use up to twice as much memory, but after the second is
built, the first chunk of memory will be freed and returned to the OS. For
that reason, I think you might *not* want all your clamds to reload at once.
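If I've understood the release notes correctly, that behaviour maps to
a single clamd.conf directive in recent releases, so you can trade the
temporary double memory footprint against a scanning pause:

```
# clamd.conf -- sketch
# yes: build the new database in memory while scanning continues
#      (costs up to double the memory during the reload)
# no:  reload in place and block scanning until it finishes
ConcurrentDatabaseReload yes
```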
[4] e.g. https://containerjournal.com/features/the-state-of-k8s-software-supply-chain-attacks/

HTH

A final question: You're putting quite a bit of work into this. Do
you have a feel for the probability that clamd will find what you're
asking it to look for?

--

73,
Ged.
_______________________________________________

Manage your clamav-users mailing list subscription / unsubscribe:
https://lists.clamav.net/mailman/listinfo/clamav-users


Help us build a comprehensive ClamAV guide:
https://github.com/Cisco-Talos/clamav-documentation

https://docs.clamav.net/#mailing-lists-and-chat