flume-user mailing list archives

From Brock Noland <br...@cloudera.com>
Subject Re: checkpoint lifecycle
Date Thu, 30 Jan 2014 14:38:57 GMT
On Thu, Jan 30, 2014 at 8:16 AM, Umesh Telang <Umesh.Telang@bbc.co.uk> wrote:

>  Hi Brock,
>  Our heap size is 2GB.

That is not enough heap for 150M events. It's 150 million * 32 bytes =
4.8GB (about 4.5GiB) + say 100-500MB for the rest of Flume.
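
To make that arithmetic easy to rerun for other channel sizes, here is a minimal sketch in Python. The 32 bytes per event and the 100-500MB allowance are the rough figures from this reply; the function and variable names are illustrative only and are not part of Flume.

```python
# Rough heap estimate for a file channel, using the figures from this thread.
# BYTES_PER_EVENT is the approximate per-event heap cost quoted above;
# all names here are illustrative and not part of any Flume API.

BYTES_PER_EVENT = 32
MIB = 1024 ** 2
GIB = 1024 ** 3


def estimate_heap_bytes(channel_capacity, flume_overhead_bytes):
    """Heap for the channel queue plus a fixed allowance for the rest of Flume."""
    return channel_capacity * BYTES_PER_EVENT + flume_overhead_bytes


capacity = 150_000_000
low = estimate_heap_bytes(capacity, 100 * MIB)
high = estimate_heap_bytes(capacity, 500 * MIB)

print(f"queue alone: {capacity * BYTES_PER_EVENT / GIB:.2f} GiB")
print(f"with Flume overhead: {low / GIB:.2f} to {high / GIB:.2f} GiB")
```

Either way the result is well above a 2GB heap.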

>  Thanks for the advice on data directories. Could you please let me know
> the heuristic for that?   (e.g. 1 data directory per N-sized channel where
> N is...)

File channel at present cannot utilize an entire disk from an IO
perspective, which is why I suggest multiple disks. Of course you'll want to
ensure that you have enough disk to support a full channel, but that is a
different discussion (avg event size * channel size).
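
As a quick illustration of the (avg event size * channel size) estimate, here is a small Python sketch. The 1 KiB average event size is a made-up example value, not a measurement from this thread, and the function name is hypothetical. The estimate also ignores any per-event overhead in the on-disk log files.

```python
# Disk needed to hold a completely full channel:
# average event size multiplied by channel capacity.
# The 1 KiB average below is an assumed example value only.

GIB = 1024 ** 3


def estimate_disk_bytes(avg_event_size_bytes, channel_capacity):
    """Lower bound on disk space needed for a full channel."""
    return avg_event_size_bytes * channel_capacity


needed = estimate_disk_bytes(1024, 150_000_000)
print(f"full channel needs at least {needed / GIB:.1f} GiB across the data directories")
```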

>  Thanks also for suggesting back up checkpoints - are these something
> that increases the integrity of Flume's execution in an automatic fashion,
> or does it aid in some form of manual recovery?

Automatic. If Flume is killed or shut down during a checkpoint, that
checkpoint is invalid, and unless a backup checkpoint exists a full replay
will have to take place. Furthermore, without FLUME-2155, full replays are
very time consuming under certain conditions.
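
For reference, a sketch of what a file channel with a backup checkpoint and several data directories might look like, written as a small Python helper that renders the properties lines. The property names (checkpointDir, useDualCheckpoints, backupCheckpointDir, dataDirs, capacity) are the file channel settings as I understand them from the Flume user guide; the agent name, channel name, and all paths are hypothetical placeholders, so check everything against the guide for your version.

```python
# Renders Flume file channel properties with a backup checkpoint enabled.
# Agent name, channel name, and paths are hypothetical placeholders.


def file_channel_properties(agent, channel, checkpoint_dir,
                            backup_checkpoint_dir, data_dirs, capacity):
    """Return the properties text for one file channel."""
    prefix = f"{agent}.channels.{channel}"
    lines = [
        f"{prefix}.type = file",
        f"{prefix}.capacity = {capacity}",
        f"{prefix}.checkpointDir = {checkpoint_dir}",
        # Keep a second copy of the checkpoint so an interrupted
        # checkpoint does not force a full replay.
        f"{prefix}.useDualCheckpoints = true",
        f"{prefix}.backupCheckpointDir = {backup_checkpoint_dir}",
        # One data directory per physical disk, comma separated.
        f"{prefix}.dataDirs = {','.join(data_dirs)}",
    ]
    return "\n".join(lines)


config = file_channel_properties(
    agent="agent1",
    channel="fc1",
    checkpoint_dir="/disk1/flume/checkpoint",
    backup_checkpoint_dir="/disk2/flume/checkpoint-backup",
    data_dirs=["/disk3/flume/data", "/disk4/flume/data"],
    capacity=1_000_000,
)
print(config)
```

The backup checkpoint directory should be a separate directory from the checkpoint and data directories.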

>  Re: FLUME-2155, I've scanned through it, and will read it in more
> detail. I'm not sure about the unit of measurement for some of the metrics
> (milliseconds?), but is there any guidance as to at which order of
> magnitude (10^4, 10^6 or 10^8 ?) the channel size causes the replay issue
> to become apparent?

It's not purely about channel size. Specifically it's about:

1) Large channel size
2) Having a large number of events in your channel (queue depth)
3) Having run the channel for some time such that old WALs were cleaned up
(causing there to be removes for which no event exists)
4) Performing a full replay in these conditions

Generally I wouldn't go over a 1M channel size without a backup checkpoint,
this change (FLUME-2155), or both. There are more details here:


