flume-user mailing list archives

From Brock Noland <br...@cloudera.com>
Subject Re: Flume throughput correlation with RAM
Date Wed, 10 Oct 2012 18:00:28 GMT
Hi,

On Wed, Oct 10, 2012 at 11:22 AM, Jagadish Bihani
<jagadish.bihani@pubmatic.com> wrote:
> Hi Brock
>
> I will surely look into 'fsync lies'.
>
> But as per my experiments I think "file channel" is causing the issue,
> because on those 2 machines (one with higher throughput and the other
> with lower) I did the following experiment:
>
> cat source - memory channel - file sink.
>
> With this setup I got the same throughput on both machines (around
> 3 MB/sec). As I have used "File sink", it should also do "fsync" at some
> point. 'File Sink' and 'File Channel' both do disk writes, so if there
> is a difference in disk behaviour then it should be visible even with
> the 'File Sink'.
>
> Am I missing something here?

File sink does not call fsync.
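
The estimate quoted further down in this thread (a spinning disk doing 100 fsyncs per second, one fsync at the end of every batch) can be written out as a small calculation. This is only a rough sketch of that reasoning, not Flume code: the function names are made up for illustration, and the model assumes that fsync is the only bottleneck and that each batch costs exactly one fsync.

```python
def implied_bytes_per_fsync(throughput_bytes_per_sec, fsyncs_per_sec):
    """Bytes written per fsync, given observed throughput and fsync rate."""
    return throughput_bytes_per_sec / fsyncs_per_sec


def estimated_throughput(fsyncs_per_sec, batch_size, event_bytes):
    """Bytes per second if every batch of events costs exactly one fsync.

    Assumption (not from Flume itself): fsync is the only limit.
    """
    return fsyncs_per_sec * batch_size * event_bytes


if __name__ == "__main__":
    # 40 KB/sec at 100 fsyncs/sec: roughly one ~409 byte event per fsync.
    print(implied_bytes_per_fsync(40 * 1024, 100))
    # Same disk and event size, with a hypothetical batch of 100 events.
    print(estimated_throughput(100, 100, 409))
```

Under this model throughput grows linearly with batch size until something other than fsync becomes the limit, which is why a larger batch size is the suggested fix.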

>
> Regards,
> Jagadish
>
>
>
> On 10/10/2012 09:35 PM, Brock Noland wrote:
>>
>> OK your disk that is giving you 40KB/second is telling you the truth
>> and the faster disk is lying to you. Look up "fsync lies" to see what
>> I am referring to.
>>
>> A spinning disk can do 100 fsync operations per second (an fsync is
>> done at the end of every batch). That is how I estimated your event
>> size: 40KB/second at 100 fsyncs/second is 40KB / 100 = ~409 bytes per
>> fsync.
>>
>> Once again, if you want increased performance, you should increase the
>> batch size.
>>
>> Brock
>>
>> On Wed, Oct 10, 2012 at 11:00 AM, Jagadish Bihani
>> <jagadish.bihani@pubmatic.com> wrote:
>>>
>>> Hi
>>>
>>> Yes. It is around 480 - 500 bytes.
>>>
>>>
>>> On 10/10/2012 09:24 PM, Brock Noland wrote:
>>>>
>>>> How big are your events? Average about 400 bytes?
>>>>
>>>> Brock
>>>>
>>>> On Wed, Oct 10, 2012 at 5:11 AM, Jagadish Bihani
>>>> <jagadish.bihani@pubmatic.com> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> Thanks for the inputs Brock. After doing several experiments
>>>>> eventually problem boiled down to disks.
>>>>>
>>>>> -- But I had used the same configuration on all 3 machines (so all
>>>>> software components are the same).
>>>>> -- The User Guide says that if multiple file channel instances are
>>>>> active on the same agent, then different disks are preferable. But in
>>>>> my case only one file channel is active per agent.
>>>>> -- The only pattern I observed is that the machines where I got better
>>>>> performance have multiple disks. But I don't understand how that
>>>>> helps if I have only 1 active file channel.
>>>>> -- What is the impact of the type of disk/disk device driver on
>>>>> performance? I don't understand why with one disk I am getting
>>>>> 40 KB/sec and with the other 2 MB/sec.
>>>>>
>>>>> Could you please elaborate on the correlation between File channel
>>>>> and disks?
>>>>>
>>>>> Regards,
>>>>> Jagadish
>>>>>
>>>>>
>>>>> On 10/09/2012 08:01 PM, Brock Noland wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Using file channel, in terms of performance, the number and type of
>>>>> disks is going to be much more predictive of performance than CPU or
>>>>> RAM. Note that consumer level drives/controllers will give you much
>>>>> "better" performance because they lie to you about when your data is
>>>>> actually written to the drive. If you search for "fsync lies" you'll
>>>>> find more information on this.
>>>>>
>>>>> You probably want to increase the batch size to get better performance.
>>>>>
>>>>> Brock
>>>>>
>>>>> On Tue, Oct 9, 2012 at 2:46 AM, Jagadish Bihani
>>>>> <jagadish.bihani@pubmatic.com> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> My flume setup is:
>>>>>
>>>>> Source Agent : cat source - File Channel - Avro Sink
>>>>> Dest Agent :     avro source - File Channel - HDFS Sink.
>>>>>
>>>>> There is only 1 source agent and 1 destination agent.
>>>>>
>>>>> I measure throughput as the amount of data written to HDFS per second.
>>>>> (I have a rolling interval of 30 sec, so if a 60 MB file is generated
>>>>> in 30 sec the throughput is 2 MB/sec.)
>>>>>
>>>>> I have run source agent on various machines with different hardware
>>>>> configurations :
>>>>> (In all cases I run flume agent with JAVA OPTIONS as
>>>>> "-DJAVA_OPTS="-Xms500m -Xmx1g -Dcom.sun.management.jmxremote
>>>>> -XX:MaxDirectMemorySize=2g")
>>>>>
>>>>> JDK is 32 bit.
>>>>>
>>>>> Experiment 1:
>>>>> =====
>>>>> RAM : 16 GB
>>>>> Processor: Intel Xeon E5620 @ 2.40 GHz (16 cores).
>>>>> 64 bit Processor with 64 bit Kernel.
>>>>> Throughput: 2 MB/sec
>>>>>
>>>>> Experiment 2:
>>>>> ======
>>>>> RAM : 4 GB
>>>>> Processor: Intel Xeon E5504  @ 2.00GHz (4 cores). 32 bit Processor
>>>>> 64 bit Processor with 32 bit Kernel.
>>>>> Throughput : 30 KB/sec
>>>>>
>>>>> Experiment 3:
>>>>> ======
>>>>> RAM : 8 GB
>>>>> Processor:Intel Xeon E5520 @ 2.27 GHz (16 cores).32 bit Processor
>>>>> 64 bit Processor with 32 bit Kernel.
>>>>> Throughput : 80 KB/sec
>>>>>
>>>>> -- So as can be seen there is a huge difference in throughput with the
>>>>> same configuration but different hardware.
>>>>> -- In the first case, where throughput is higher, RES is around 160 MB;
>>>>> in the other cases it is in the range of 40 MB - 50 MB.
>>>>>
>>>>> Can anybody please give insights into why there is this huge
>>>>> difference in throughput? What is the correlation between RAM and
>>>>> filechannel/HDFS sink performance, and also with 32-bit/64-bit kernel?
>>>>>
>>>>> Regards,
>>>>> Jagadish
>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>
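
One way to check what a given disk does with fsync, as discussed in the thread above, is to time a loop of small writes that are each followed by an fsync. The following is a minimal sketch, not part of Flume: the function name, payload size, and write count are arbitrary choices for illustration. A single spinning disk that honours fsync would be expected to land near the roughly 100 per second mentioned above, while a much higher number on similar hardware suggests a write cache is acknowledging the fsync early.

```python
import os
import tempfile
import time


def fsyncs_per_second(directory, writes=50, payload=b"x" * 512):
    """Time `writes` small write+fsync pairs on a temp file in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush this file to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return writes / elapsed


if __name__ == "__main__":
    # Point this at a directory on the disk that holds the file channel data.
    print(fsyncs_per_second(tempfile.gettempdir()))
```

The result depends on the filesystem, mount options, and controller cache, so it is best read as a comparison between machines rather than as an absolute figure.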



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
