flume-user mailing list archives

From Hari Shreedharan <hshreedha...@cloudera.com>
Subject Re: HDFS Sink performance
Date Thu, 23 Jul 2015 17:33:28 GMT
This is interesting. I believe Johny is actually looking into this
performance issue.


Thanks,
Hari

On Thu, Jul 23, 2015 at 9:27 AM, lohit <lohit.vijayarenu@gmail.com> wrote:

> The majority of our messages need not be persisted to disk, so we are
> interested in MemoryChannel.
> There has been a gradual performance degradation from 1.3.1 -> 1.4.0 ->
> 1.6.0.
> See the graph below, where I have a constant stream of messages (blue
> line). While this is happening I swap in different versions of Flume for
> the agent.
> The orange line shows messages dropped (a flat line is when data is
> streamed to HDFS), and I have marked the flat lines with the different
> versions.
>
>
>
> 2015-07-22 19:48 GMT-07:00 Roshan Naik <roshan@hortonworks.com>:
>
>>
>>  My guess is that most of you probably use the File channel in
>> production with the HDFS sink? In that scenario the common observation
>> seems to be that the File channel becomes the primary bottleneck. Going by
>> Robert's observations, its performance also seems to have dropped since
>> v1.3.
>>
>>  Robert, can you confirm how many data dirs were used for your readings
>> with the file channel (FCh)?
>>
>>  -roshan
>>
>>
>>
>>   From: lohit <lohit.vijayarenu@gmail.com>
>> Reply-To: "user@flume.apache.org" <user@flume.apache.org>
>> Date: Wednesday, July 22, 2015 3:01 PM
>> To: "user@flume.apache.org" <user@flume.apache.org>
>>
>> Subject: Re: HDFS Sink performance
>>
>>   Thanks for sharing these numbers, Robert. Out of curiosity, I ran the
>> same experiment.
>> Flume 1.3.1 has higher throughput than 1.6.0 (I was able to get a
>> sustained 60MB/s with Flume 1.3.1).
>> With no config or setup change, just changing the Flume version shows this
>> difference. We should probably look at the change set between 1.3.1 and
>> 1.5 to see if there were any obvious changes.
>>
>> 2015-07-22 14:00 GMT-07:00 Robert B Hamilton <robert.hamilton@gm.com>:
>>
>>> Here is a comparison between versions 1.3, 1.5, and 1.6.
>>> I would estimate that error bars are plus or minus 15%.
>>>
>>> All parameters are identical, as between runs all I change is the
>>> version of flume.
>>> Lohit’s numbers are fairly consistent with this, because if we double
>>> the sinks from my 4 to his 8 and assuming linear scalability we would
>>> expect to get somewhere close to 30-40MB/s.
>>>
>>> It looks like the drop off is more pronounced for the larger event
>>> size.  This is of concern to us because we are looking at this for a high
>>> volume feed with message sizes up to 80 kB.
>>>
>>> HDFS x4 sinks, memory channel
>>> ----------------------------------------------------------
>>> Payload (kB)   v1.3 (MB/s)   v1.5 (MB/s)   v1.6 (MB/s)
>>> ------------   -----------   -----------   -----------
>>> 1              27            17            20
>>> 25             56            15            15
>>>
>>>
>>>
>>> From: Hari Shreedharan [mailto:hshreedharan@cloudera.com]
>>> Sent: Wednesday, July 22, 2015 1:27 PM
>>> To: user@flume.apache.org
>>> Subject: Re: HDFS Sink performance
>>>
>>> That is a bit disconcerting. Are you using the same HDFS setup and same
>>> config for both tests? Would it be possible for you to take a look at Flume
>>> 1.6.0? Such drops in performance should be taken care of.
>>>
>>>
>>>
>>> Thanks,
>>> Hari
>>>
>>> On Wed, Jul 22, 2015 at 11:04 AM, Robert B Hamilton <
>>> robert.hamilton@gm.com> wrote:
>>> My mailer totally scrambled the numbers, probably by inserting special
>>> characters.
>>> Sorry, here are the actual results....
>>>
>>> All rates in MB/s
>>> Payload in kB
>>>
>>> Flume 1.3.1
>>> Payload   Rate memch   Rate Fch
>>> 25        34           29
>>> 25        31           27.6
>>> 25        50           23.3
>>> 25        46.5         27.2
>>> 50        31.3         23.8
>>> 50        37.4         31.3
>>> 50        32.3         31.8
>>> 80        30.5         25.8
>>> 80        46.2         25.2
>>> 80        39.1         25.8
>>> 80        56.5         25.1
>>>
>>> Flume 1.5
>>> Payload   Rate memch   Rate Fch
>>> 25        18.7         15.6
>>> 50        18.3         17.3
>>> 80        18.4         15.6
>>>
>>> -----Original Message-----
>>> From: Robert B Hamilton [mailto:robert.hamilton@gm.com]
>>> Sent: Wednesday, July 22, 2015 11:00 AM
>>>  To: user@flume.apache.org
>>> Subject: RE: HDFS Sink performance
>>>
>>>  I only see that kind of throughput for event sizes of 25kB to 50kB or
>>> larger.
>>>
>>> These particular tests are done on flume version 1.3.1.
>>> But because you asked,  I thought to do a few quick runs on 1.5.0.1 and
>>> added those results below.  The results are significantly different for 1.5
>>> and I wonder if this is a cause for concern.
>>>
>>> None of this has been peer reviewed so it should be considered as
>>> tentative.
>>>
>>> As to the HDD, here is the result of a quick and dirty dd test.
>>>
>>>   dd if=/dev/zero of=100M bs=1M count=100 conv=fsync oflag=sync
>>>    104857600 bytes (105 MB) copied, 0.685646 s, 153 MB/s
>>>
>>>
>>> Source data: each record consists of random ascii strings of constant
>>> length (25k,50k,or 80k depending on the run).
>>> Source: spooldir
>>> Channel: file channel single dataDir, or memory channel.
>>> Sink: four HDFS, SequenceFile, Text, Batch size=10, rollInterval=20
>>> seconds.
>>>
>>> Batch size was kept small because of memory channel capacity. Increasing
>>> batch size for file channel did not improve performance so I kept it at 10.
>>>
>>> Here I have numbers for some runs where the payload is varied among
>>> 25K, 50K, and 80K. I include memory channel for comparison.
>>>
>>> Multiple runs were performed for each event size. As you can see the
>>> throughput can vary from run to run because these particular measurements
>>> were done on an environment that is not tightly controlled.  Think of them
>>> as "in situ" measurements :)
>>>
>>> Flume 1.3.1 memory channel and file channel
>>> -------------------------------------------------------
>>> Payload   Rate memch   Rate filech
>>> (kB)      (MB/s)       (MB/s)
>>> -------------------------------------------------------
>>> 25        34           29
>>> 25        31           27.6
>>> 25        50           23.3
>>> 25        46.5         27.2
>>> 50        31.2         23.8
>>> 50        37.4         31.3
>>> 50        32.3         31.8
>>> 80        30.5         25.8
>>> 80        46.2         25.2
>>> 80        39.1         25.8
>>> 80        56.5         25.1
>>>
>>>
>>> Flume 1.5 File Channel and Memory Channel
>>> ---------------------------------------------------
>>> Event size   Rate memch   Rate filech
>>> (kB)         (MB/s)       (MB/s)
>>> ---------------------------------------------------
>>> 25           18.7         15.6
>>> 50           18.3         17.3
>>> 80           18.4         15.6
>>>
>>> -----Original Message-----
>>>  From: Roshan Naik [mailto:roshan@hortonworks.com]
>>> Sent: Friday, July 17, 2015 6:21 PM
>>> To: user@flume.apache.org
>>> Subject: Re: HDFS Sink performance
>>>
>>> I updated the Flume wiki with my measurements. Also added a section with
>>> Hive sink measurements.
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/FLUME/Performance+Measurements+-+round+2
>>>
>>>
>>> @Robert:
>>>   What sort of HDD are you using?
>>>   What is the event size?
>>>   Which version of Flume?
>>>
>>> -roshan
>>>
>>>
>>>
>>>
>>> On 7/17/15 12:51 PM, "Robert B Hamilton" <robert.hamilton@gm.com> wrote:
>>>
>>> >Our testing has shown up to 60MB/s to HDFS if we use up to 8 or 10
>>> >sinks per agent, and with a file channel with a single dataDir.
>>> >
>>> >
>>> >From: lohit [mailto:lohit.vijayarenu@gmail.com]
>>> >Sent: Wednesday, July 15, 2015 11:11 AM
>>> >To: user@flume.apache.org
>>>  >Subject: HDFS Sink performance
>>> >
>>> >Hello,
>>> >
>>> >Does anyone have numbers they can share on HDFS sink performance? In
>>> >our testing, a single sink writing to HDFS (CompressedStream) and
>>> >reading from MemoryChannel can only do about 35000 events per second
>>> >(each event is about 1K in size). After compression this comes to a
>>> >~10MB/s write stream to the HDFS file, which is pretty low. Our
>>> >configuration looks like this:
>>> >
>>> >agent.sinks.hdfsSink.type = hdfs
>>> >agent.sinks.hdfsSink.channel = memoryChannel
>>> >agent.sinks.hdfsSink.hdfs.path = /tmp/lohit
>>> >agent.sinks.hdfsSink.hdfs.codeC = lzo
>>> >agent.sinks.hdfsSink.hdfs.fileType = CompressedStream
>>> >agent.sinks.hdfsSink.hdfs.writeFormat = Writable
>>> >agent.sinks.hdfsSink.hdfs.rollInterval = 3600
>>> >agent.sinks.hdfsSink.hdfs.rollSize = 1073741824
>>> >agent.sinks.hdfsSink.hdfs.rollCount = 0
>>> >agent.sinks.hdfsSink.hdfs.batchSize = 10000
>>> >agent.sinks.hdfsSink.hdfs.txnEventMax = 10000
>>> >
>>> >agent.channels.memoryChannel.type = memory
>>> >
>>> >agent.channels.memoryChannel.capacity = 3000000
>>> >agent.channels.memoryChannel.transactionCapacity = 10000
>>> >
>>> >--
>>> >Have a Nice Day!
>>> >Lohit
>>> >
>>> >
>>> >Nothing in this message is intended to constitute an electronic
>>> >signature unless a specific statement to the contrary is included in
>>> >this message.
>>> >
>>> >Confidentiality Note: This message is intended only for the person or
>>> >entity to which it is addressed. It may contain confidential and/or
>>> >privileged material. Any review, transmission, dissemination or other
>>> >use, or taking of any action in reliance upon this message by persons
>>> >or entities other than the intended recipient is prohibited and may be
>>> >unlawful. If you received this message in error, please contact the
>>> >sender and delete it from your computer.
>>>
>>
>>
>>
>>  --
>> Have a Nice Day!
>> Lohit
>>
>
>
>
> --
> Have a Nice Day!
> Lohit
>
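
The throughput figures quoted in this thread can be compared mechanically. The following is a minimal Python sketch, not anything posted by the participants: it copies the numbers from Robert's resent 1.3.1 and 1.5 tables, averages the repeated 1.3.1 runs per payload size (the averaging is an assumption made here for illustration), and reports the percentage drop for the memory channel and the file channel. The names `mean_rates`, `percent_drop`, and `summarize` are hypothetical helpers. Robert estimated error bars of roughly plus or minus 15%, so the results should be read as rough indications only.

```python
from statistics import mean

# Throughput in MB/s, keyed by payload size in kB, copied from the resent
# tables in the thread. Each tuple is (memory channel, file channel).
FLUME_1_3_1 = {
    25: [(34, 29), (31, 27.6), (50, 23.3), (46.5, 27.2)],
    50: [(31.3, 23.8), (37.4, 31.3), (32.3, 31.8)],
    80: [(30.5, 25.8), (46.2, 25.2), (39.1, 25.8), (56.5, 25.1)],
}
FLUME_1_5 = {
    25: [(18.7, 15.6)],
    50: [(18.3, 17.3)],
    80: [(18.4, 15.6)],
}


def mean_rates(runs):
    """Average the memory-channel and file-channel rates over repeated runs."""
    mem = mean(run[0] for run in runs)
    fch = mean(run[1] for run in runs)
    return mem, fch


def percent_drop(old, new):
    """Relative throughput loss going from `old` to `new`, in percent."""
    return 100.0 * (old - new) / old


def summarize(baseline, candidate):
    """Map payload size -> (memory channel drop %, file channel drop %)."""
    result = {}
    for payload_kb in sorted(baseline):
        old_mem, old_fch = mean_rates(baseline[payload_kb])
        new_mem, new_fch = mean_rates(candidate[payload_kb])
        result[payload_kb] = (
            percent_drop(old_mem, new_mem),
            percent_drop(old_fch, new_fch),
        )
    return result


DROPS = summarize(FLUME_1_3_1, FLUME_1_5)

if __name__ == "__main__":
    for payload_kb, (mem_drop, fch_drop) in DROPS.items():
        print(f"{payload_kb} kB: memch -{mem_drop:.1f}%, filech -{fch_drop:.1f}%")
```

Averaging hides the run-to-run spread in the 1.3.1 memory channel numbers, so comparing against the minimum and maximum of each group would be a reasonable variation.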
