flume-user mailing list archives

From Chris Horrocks <ch...@hor.rocks>
Subject Re: Understand JMS source + HDFS sink batch management
Date Wed, 16 Nov 2016 15:34:05 GMT
Hi Roberto,

Setting the roll intervals to 0 will stop the sink rolling the files in HDFS. Try setting
hdfs.rollCount to the number of messages you want to roll the file on (i.e. the number of
messages per file). Bear in mind that a low value means many small files and therefore
higher HDFS overhead.

Chris Horrocks

On Wed, Nov 16, 2016 at 10:35 am, Roberto Coluccio <'roberto.coluccio@eng.it'> wrote:

Hello folks,

I'm testing a Flume agent defined by the following topology:

JMS source (Tibco implementation) -> memory channel -> hdfs sink

The JMS source has:

- my_agent.sources.my_source.batchSize = 100

The memory channel has:

- my_agent.channels.my_channel.capacity = 100

The HDFS sink has:

- my_agent.sinks.my_sink.hdfs.batchSize = 100
- my_agent.sinks.my_sink.hdfs.rollCount = 0
- my_agent.sinks.my_sink.hdfs.rollInterval = 0
- my_agent.sinks.my_sink.hdfs.idleTimeout = 0

I don't understand how/why new files on HDFS are created/closed. In fact, when I:

1. launch the agent (JMS queue empty)
2. push a new text message on the JMS queue

a new file is created on HDFS, but not yet closed (as I expect). BUT,
when I

3. push again a new text message on the JMS queue

regardless of how long I waited before performing step 3, the HDFS sink closes the previously
open file, then opens a new one for the new incoming message consumed from the queue and
processed through the channel.

This way, each file always contains exactly 1 message. I was expecting that number
to be 100, according to the configuration mentioned above.

Any hints?

Best regards,
