Are you running on EBS or ephemeral storage? I have seen slow IO on AWS when EBS is used without provisioned IOPS. This might be what you are seeing.

Also, what checkpoint size do you see when the channel starts up?

To be clear, this load is spread across 3 EC2 instances running Flume, so we are asking each one individually to handle 3.3k (5k).  With 16GB of data in the channel, I would have expected the replay to be faster.

Our capacity setting is:

agent-1.channels.trdbuy-bid-req-ch1.capacity = 100000000

Our current channel size cannot be accessed because the channel is still in this odd 'replay' mode.  There are no logs, but the CPU is cranking on the Flume node and the Avro source ports have not yet opened.  The pattern we see is that after anywhere from 15 to 30 minutes, the ports magically open and we can continue.
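
Since nothing is logged during the replay, one way to put a number on it is to time how long the source port stays closed after a restart. The following is only a minimal sketch: `wait_for_port` is a hypothetical helper, not part of Flume, and the host and port in the usage comment are placeholders for wherever the Avro source binds.

```python
import socket
import time

def wait_for_port(host, port, timeout=3600.0, interval=5.0):
    """Poll until a TCP connect to host:port succeeds; return elapsed seconds.

    Raises TimeoutError if the port has not opened within `timeout` seconds.
    """
    start = time.monotonic()
    while True:
        try:
            # A successful connect means the source is accepting connections.
            with socket.create_connection((host, port), timeout=2.0):
                return time.monotonic() - start
        except OSError:
            # Connection refused or timed out: the port is not open yet.
            if time.monotonic() - start >= timeout:
                raise TimeoutError(f"{host}:{port} still closed after {timeout}s")
            time.sleep(interval)

# Usage (hypothetical host and port), started right after restarting the agent:
#   elapsed = wait_for_port("localhost", 4141)
#   print(f"port opened after {elapsed / 60:.1f} minutes")
```

Running it right after each restart would give a replay duration that can be compared against the checkpoint size and the amount of data in the channel.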

This is because we are logging around 10k messages/second and did not want to lose any data during brief interruptions.
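
As a rough check of what that capacity buys, here is the arithmetic as a small sketch. It assumes the 10k messages/second is split evenly across the 3 agents and that the sink is fully stalled for the whole window; both are simplifications, not measurements.

```python
# Rough buffering headroom implied by the numbers in this thread.
TOTAL_RATE = 10_000       # messages/second across all agents
AGENTS = 3                # EC2 instances running Flume
CAPACITY = 100_000_000    # channel capacity in events, per agent

def buffer_seconds(capacity, total_rate, agents):
    """Seconds one agent's channel can absorb input while nothing drains it.

    Assumes the load is split evenly across agents (an assumption).
    """
    per_agent_rate = total_rate / agents
    return capacity / per_agent_rate

seconds = buffer_seconds(CAPACITY, TOTAL_RATE, AGENTS)
print(f"per-agent rate: {TOTAL_RATE / AGENTS:.0f} msg/s")
print(f"buffer window: {seconds / 3600:.1f} hours")
```

Under those assumptions each agent can ride out an outage of roughly 30,000 seconds, a little over 8 hours, before the channel fills.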

How large is your channel, and how long does it take to replay?

For the record, we are using Flume 1.4.0 packaged with CDH5.0.2.

    We are repeatedly running into cases where replays from a
    file channel going to HDFS take an eternity.

    I've read this thread

    but I just am not convinced that our checkpoints are constantly
    being corrupted.

    We are seeing messages such as:

    20 Aug 2014 03:52:26,849 INFO  [lifecycleSupervisor-1-2]

      - Reading checkpoint metadata from

    How can it be that this takes so long?