flume-user mailing list archives

From Brock Noland <br...@cloudera.com>
Subject Re: Flume throughput correlation with RAM
Date Wed, 10 Oct 2012 16:05:27 GMT
OK, the disk that is giving you 40 KB/second is telling you the truth,
and the faster disk is lying to you. Look up "fsync lies" to see what
I am referring to.

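A spinning disk can do roughly 100 fsync operations per second, and the
file channel does an fsync at the end of every batch. That is how I
estimated your event size: 40 KB/second at 100 fsyncs/second is
40960 / 100 = about 410 bytes per fsync, which is the event size if
each batch holds a single event.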
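
To make that arithmetic explicit, here is a minimal sketch in Python. The
100 fsyncs/second figure is a rule of thumb rather than a measurement of
your disk, and the variable names are only illustrative.

```python
# Back-of-the-envelope estimate of how many bytes are written per fsync.
# Assumption: a spinning disk completes about 100 fsyncs per second.
FSYNCS_PER_SECOND = 100

# Observed throughput on the slow machine: 40 KB/second.
observed_bytes_per_second = 40 * 1024

# If every batch ends with one fsync, this is the payload per batch.
bytes_per_fsync = observed_bytes_per_second / FSYNCS_PER_SECOND
print(f"{bytes_per_fsync:.1f} bytes per fsync")
```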

Once again, if you want increased performance, you should increase the
batch size.
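
As a rough illustration of why batch size matters, the sketch below
assumes throughput is limited only by the fsync rate (one fsync per
batch) and uses 490 bytes per event, the midpoint of the 480 - 500 bytes
you reported. The function name and the batch sizes tried are
hypothetical, and real throughput will flatten out once disk bandwidth,
the network, or the sink becomes the limit.

```python
def estimated_throughput(batch_size, event_bytes=490, fsyncs_per_second=100):
    """Return bytes/second under a simplified model: one fsync per batch.

    Assumptions (not measured): about 100 fsyncs/second on a spinning disk,
    and nothing other than the fsync rate limits throughput.
    """
    return fsyncs_per_second * batch_size * event_bytes


# Compare a few illustrative batch sizes.
for batch_size in (1, 10, 100, 1000):
    kb_per_second = estimated_throughput(batch_size) / 1024
    print(f"batch size {batch_size:>4}: ~{kb_per_second:,.0f} KB/second")
```

With a batch size of 1 the model lands close to the throughput you saw on
the slow disk, which is consistent with each event paying for its own
fsync.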

Brock

On Wed, Oct 10, 2012 at 11:00 AM, Jagadish Bihani
<jagadish.bihani@pubmatic.com> wrote:
> Hi
>
> Yes. It is around 480 - 500 bytes.
>
>
> On 10/10/2012 09:24 PM, Brock Noland wrote:
>>
>> How big are your events? Average about 400 bytes?
>>
>> Brock
>>
>> On Wed, Oct 10, 2012 at 5:11 AM, Jagadish Bihani
>> <jagadish.bihani@pubmatic.com> wrote:
>>>
>>> Hi
>>>
>>> Thanks for the inputs, Brock. After several experiments, the problem
>>> eventually boiled down to the disks.
>>>
>>> -- I used the same configuration on all 3 machines, so all software
>>> components are the same.
>>> -- The user guide says that if multiple file channel instances are
>>> active on the same agent, then different disks are preferable. But in
>>> my case only one file channel is active per agent.
>>> -- The only pattern I observed is that the machines where I got better
>>> performance have multiple disks. But I don't understand how that helps
>>> if I have only 1 active file channel.
>>> -- What is the impact of the type of disk/disk device driver on
>>> performance? I don't understand why I am getting 40 KB/sec with one
>>> disk and 2 MB/sec with another.
>>>
>>> Could you please elaborate on the correlation between the file channel
>>> and disks?
>>>
>>> Regards,
>>> Jagadish
>>>
>>>
>>> On 10/09/2012 08:01 PM, Brock Noland wrote:
>>>
>>> Hi,
>>>
>>> Using file channel, in terms of performance, the number and type of
>>> disks is going to be much more predictive of performance than CPU or
>>> RAM. Note that consumer level drives/controllers will give you much
>>> "better" performance because they lie to you about when your data is
>>> actually written to the drive. If you search for "fsync lies" you'll
>>> find more information on this.
>>>
>>> You probably want to increase the batch size to get better performance.
>>>
>>> Brock
>>>
>>> On Tue, Oct 9, 2012 at 2:46 AM, Jagadish Bihani
>>> <jagadish.bihani@pubmatic.com> wrote:
>>>
>>> Hi
>>>
>>> My flume setup is:
>>>
>>> Source Agent : cat source - File Channel - Avro Sink
>>> Dest Agent :     avro source - File Channel - HDFS Sink.
>>>
>>> There is only 1 source agent and 1 destination agent.
>>>
>>> I measure throughput as the amount of data written to HDFS per second.
>>> (I have a rolling interval of 30 sec, so if a 60 MB file is generated
>>> in 30 sec, the throughput is 2 MB/sec.)
>>>
>>> I have run the source agent on various machines with different hardware
>>> configurations. (In all cases I run the flume agent with JAVA OPTIONS as
>>> "-DJAVA_OPTS="-Xms500m -Xmx1g -Dcom.sun.management.jmxremote
>>> -XX:MaxDirectMemorySize=2g")
>>>
>>> JDK is 32 bit.
>>>
>>> Experiment 1:
>>> =====
>>> RAM : 16 GB
>>> Processor: Intel Xeon E5620 @ 2.40 GHz (16 cores).
>>> 64 bit Processor with 64 bit Kernel.
>>> Throughput: 2 MB/sec
>>>
>>> Experiment 2:
>>> ======
>>> RAM : 4 GB
>>> Processor: Intel Xeon E5504 @ 2.00 GHz (4 cores).
>>> 64 bit Processor with 32 bit Kernel.
>>> Throughput : 30 KB/sec
>>>
>>> Experiment 3:
>>> ======
>>> RAM : 8 GB
>>> Processor: Intel Xeon E5520 @ 2.27 GHz (16 cores).
>>> 64 bit Processor with 32 bit Kernel.
>>> Throughput : 80 KB/sec
>>>
>>> -- As can be seen, there is a huge difference in throughput with the
>>> same configuration but different hardware.
>>> -- In the first case, where throughput is higher, RES is around 160 MB;
>>> in the other cases it is in the range of 40 MB - 50 MB.
>>>
>>> Can anybody please give insight into why there is this huge difference
>>> in throughput? What is the correlation between RAM and file channel/HDFS
>>> sink performance, and also with a 32-bit/64-bit kernel?
>>>
>>> Regards,
>>> Jagadish
>>>
>>>
>>>
>>
>>
>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
