I wonder why you're seeing 90% CPU use with a file channel; I would expect high disk I/O instead. As a counterexample, I have 4 spooling directory sources on a single server, each going to a separate file channel, also on an SSD-based server. I do not see any noticeable CPU or even disk I/O utilization. I am pushing about 10 million events per day across all 4 sources, and the setup has been running reliably for 2 years now.
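For illustration, one source/channel pair in a setup like that might be configured roughly as follows. This is a minimal sketch, not the actual config: the agent name a1, the component names spool1 and fc1, and all directory paths are placeholders, and the capacity values are arbitrary.

```properties
# Hypothetical agent "a1": one spooling directory source feeding one file channel.
a1.sources = spool1
a1.channels = fc1

# Spooling directory source: ingests files dropped into spoolDir.
a1.sources.spool1.type = spooldir
a1.sources.spool1.spoolDir = /var/spool/flume/in1
a1.sources.spool1.channels = fc1

# File channel: events are persisted to disk before they are handed to a sink.
a1.channels.fc1.type = file
a1.channels.fc1.checkpointDir = /data/flume/fc1/checkpoint
a1.channels.fc1.dataDirs = /data/flume/fc1/data
# Illustrative sizes only; tune for your event volume.
a1.channels.fc1.capacity = 1000000
a1.channels.fc1.transactionCapacity = 1000
```

Each additional source would get its own channel with separate checkpointDir and dataDirs, since two file channels cannot share the same directories.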
I would always use a file channel; any memory channel runs the risk of data loss if the node were to fail. I would be at least as worried about the local node failing as about the sink, seeing that a 3-node Kafka cluster would have to lose 2 nodes before it lost quorum.
I'm not sure what your data source is; if you can add more Flume nodes, that would of course help.
Have you given the agent ample heap space? Maybe GCs are causing the high CPU.
I'm currently planning to migrate from Flume 0.9 to Flume-ng 1.6, but I'm having trouble finding a reliable setup for it.
My sink is a 3-node Kafka cluster. I must avoid losing events if the main sink is down, broken or unreachable for a while.
In Flume 0.9, I use a memory channel with the store-on-failure feature, which starts writing events to the local disk when the target sink is not available.
In Flume-ng 1.6, the same behaviour could be achieved by setting up a Spillable Memory Channel, but the problem with this solution is stated at the end of the channel's description: "This channel is currently experimental and not recommended for use in production."
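For reference, a Spillable Memory Channel would be declared roughly like this. It is only a sketch: the agent name a1, the channel name sc1, the paths and the capacity values are placeholders, and the experimental warning quoted above still applies.

```properties
# Hypothetical agent "a1" with a Spillable Memory Channel "sc1".
a1.channels = sc1
a1.channels.sc1.type = SPILLABLEMEMORY
# Events held in the in-memory queue before spilling to disk (illustrative value).
a1.channels.sc1.memoryCapacity = 10000
# Events that may overflow to the on-disk store (illustrative value).
a1.channels.sc1.overflowCapacity = 1000000
# The disk overflow is backed by a file channel, so it needs these directories.
a1.channels.sc1.checkpointDir = /data/flume/sc1/checkpoint
a1.channels.sc1.dataDirs = /data/flume/sc1/data
```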
In Flume-ng 1.6, it's also possible to set up a group of failover sinks. So I was thinking of configuring a File Roll sink as the secondary sink in case the primary is down. However, once the primary sink is back online, the data placed on the secondary sink (local disk) won't be automatically pushed to the primary one.
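The failover idea could be sketched as below, assuming a Flume 1.6 agent named a1 with an existing channel c1. The sink names k1 and k2, the broker addresses, the topic name, the fallback directory and the priority and penalty values are all placeholders.

```properties
# Hypothetical sinks: k1 = Kafka (primary), k2 = File Roll (secondary).
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Failover processor: the available sink with the highest priority is used.
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Upper bound, in milliseconds, on the back-off applied to a failed sink.
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Primary: Kafka sink, using the Flume 1.6 property names.
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = kafka1:9092,kafka2:9092,kafka3:9092
a1.sinks.k1.topic = events
a1.sinks.k1.channel = c1

# Secondary: File Roll sink writing to local disk while Kafka is unavailable.
a1.sinks.k2.type = file_roll
a1.sinks.k2.sink.directory = /data/flume/fallback
a1.sinks.k2.channel = c1
```

Nothing in this configuration replays the files under the fallback directory into Kafka afterwards; that step would have to be handled separately.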
Another option would be setting up a file channel: write each event to disk and then sink it. Leaving aside that I don't love the idea of continuously writing and deleting every single event on an SSD, this setup takes 90% of the CPU. The exact same configuration with a memory channel takes 3%.
Are there other solutions I should evaluate?