flume-user mailing list archives

From Bojan Kostić <blood9ra...@gmail.com>
Subject Re: Can Flume handle +100k events per second?
Date Wed, 06 Nov 2013 09:39:15 GMT
It was late when I wrote my last mail, and my explanation was not clear.
I will illustrate:
20 servers, each with 60 different log files.
I was thinking that I could have this kind of structure on hdfs:
/logs/server0/logstat0.log
/logs/server0/logstat1.log
.
.
.
/logs/server20/logstat0.log
.
.
.

But from your info I see that I can't do that.
I could try to add a server id column to every file and then aggregate the
files from all servers into one file per log:
/logs/logstat0.log
/logs/logstat1.log
.
.
.

But again I would need 60 sinks.
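Thinking about it more: if header substitution in the HDFS sink path works
the way I think it does, a host interceptor might at least give me the
per-server directories from a single sink. An untested sketch, property
names from memory:

    # tag each event with the host it came from (header "host")
    agent.sources.r1.interceptors = i1
    agent.sources.r1.interceptors.i1.type = host

    # one sink splits output into per-server directories via the header
    agent.sinks.k1.type = hdfs
    agent.sinks.k1.hdfs.path = /logs/%{host}

The 60-way split per log file would still need 60 sinks, though.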
On Nov 6, 2013 2:02 AM, "Roshan Naik" <roshan@hortonworks.com> wrote:

> I assume you mean you have 120 source files to be streamed into HDFS.
> There is not a 1-1 correspondence between source files and destination
> HDFS files. If they are on the same host, you can have them all picked up
> through one source, one channel and one HDFS sink... winding up in a
> single HDFS file.
>
> In case you have a config with multiple HDFS sinks (part of a single agent
> or spanning multiple agents) you want to ensure each HDFS sink writes to a
> separate file in HDFS.
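For my own reference, a minimal single-flow config as I understand the
above; the agent name, component names and spool directory are made up:

    agent.sources = r1
    agent.channels = c1
    agent.sinks = k1

    agent.sources.r1.type = spooldir
    agent.sources.r1.spoolDir = /var/log/myapp
    agent.sources.r1.channels = c1

    agent.channels.c1.type = memory

    agent.sinks.k1.type = hdfs
    agent.sinks.k1.hdfs.path = /logs
    agent.sinks.k1.channel = c1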
>
>
> On Tue, Nov 5, 2013 at 4:41 PM, Bojan Kostić <blood9raven@gmail.com> wrote:
>
>> Hello Roshan,
>>
>> Thanks for the response.
>> But I am now confused. If I have 120 files, do I need to configure 120
>> sinks/sources/channels separately? Or have I missed something in the docs?
>> Maybe I should use a fan-out flow? But then again I must set 120 params.
>>
>> Best regards.
>> On Nov 5, 2013 8:47 PM, "Roshan Naik" <roshan@hortonworks.com> wrote:
>>
>>> Yes, to avoid them clobbering each other's writes.
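So if I run two HDFS sinks I should give them distinct file prefixes. My
guess at what that looks like, untested:

    # two sinks draining the same channel, kept from colliding by prefix
    agent.sinks = k1 k2
    agent.sinks.k1.type = hdfs
    agent.sinks.k1.channel = c1
    agent.sinks.k1.hdfs.path = /logs
    agent.sinks.k1.hdfs.filePrefix = sink1
    agent.sinks.k2.type = hdfs
    agent.sinks.k2.channel = c1
    agent.sinks.k2.hdfs.path = /logs
    agent.sinks.k2.hdfs.filePrefix = sink2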
>>>
>>>
>>> On Tue, Nov 5, 2013 at 4:34 AM, Bojan Kostić <blood9raven@gmail.com> wrote:
>>>
>>>> Sorry for the late response, but I somehow lost this email.
>>>>
>>>> Thanks for the read; it is a nice start even though it is old.
>>>> And the numbers are really promising.
>>>>
>>>> I'm testing the memory channel; there are about 20 data sources (log
>>>> servers) with 60 different files each.
>>>>
>>>> My RPC client app is basic, like in the examples, but it has load
>>>> balancing across the two Flume agents which are writing data to HDFS.
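The load balancing is just the stock RPC client setup from the Developer
Guide; roughly these client properties, with hypothetical host names:

    client.type = default_loadbalance
    hosts = h1 h2
    hosts.h1 = agent1.example.com:41414
    hosts.h2 = agent2.example.com:41414
    host-selector = round_robin
    backoff = true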
>>>>
>>>> I think I read somewhere that you should have one sink per file. Is
>>>> that true?
>>>>
>>>> Best regards, and sorry again for late response.
>>>> On Oct 22, 2013 8:50 AM, "Juhani Connolly" <juhani_connolly@cyberagent.co.jp> wrote:
>>>>
>>>>> Hi Bojan,
>>>>>
>>>>> This is pretty old, but Mike did some testing on performance about a
>>>>> year and a half ago:
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Syslog+Performance+Test+2012-04-30
>>>>>
>>>>> He was getting a max of 70k events/sec on a single machine.
>>>>>
>>>>> Thing is, this is the result of a huge number of variables:
>>>>> - Parallelization of flows allows better parallel processing.
>>>>> - Use of the memory channel as opposed to a slower persistent channel.
>>>>> - Possibly the source; I have no idea how you wrote your app.
>>>>> - Batching of events is important. Also, are all events written to one
>>>>> file, or are they split over many? Every file is processed separately.
>>>>> - Network congestion, your HDFS setup.
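On the batching point: as far as I can tell these are the relevant knobs;
the values below are just guesses I would tune, not recommendations:

    # events the HDFS sink writes per flush/transaction
    agent.sinks.k1.hdfs.batchSize = 1000
    # must be at least as large as the biggest batch in one transaction
    agent.channels.c1.transactionCapacity = 1000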
>>>>>
>>>>> Reaching 100k events per second is definitely possible. The resources
>>>>> you need for it will vary significantly depending on your setup. The
>>>>> more HA-type features you use, the slower delivery is likely to become.
>>>>> On the flip side, allowing fairly lax conditions that carry a small
>>>>> potential for data loss (on a crash, for example, memory channel
>>>>> contents are gone) will allow for close to 100k even on a single
>>>>> machine.
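My reading of that trade-off in config terms, sketched with made-up
directories:

    # fast, but events still in the channel are lost if the agent crashes
    agent.channels.c1.type = memory

    # durable alternative, slower
    agent.channels.c1.type = file
    agent.channels.c1.checkpointDir = /flume/checkpoint
    agent.channels.c1.dataDirs = /flume/data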
>>>>>
>>>>> On 10/14/2013 09:00 PM, Bojan Kostić wrote:
>>>>>
>>>>>> Hi, this is my first post here, but I have been playing with Flume
>>>>>> for some time now.
>>>>>> My question is: how well does Flume scale?
>>>>>> Can Flume ingest 100k+ events per second? Has anyone tried something
>>>>>> like this?
>>>>>>
>>>>>> I created a simple test and the results are really slow.
>>>>>> I wrote a simple app with an RPC client with failover, using the Flume
>>>>>> SDK, which reads a dummy log file.
>>>>>> In the end I have two Flume agents which are writing to HDFS.
>>>>>> rollInterval = 60
>>>>>> And in HDFS I get files of ~12 MB.
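If larger files would help, I could presumably roll by size instead of
time; untested values:

    # disable time- and count-based rolling, roll on size only
    agent.sinks.k1.hdfs.rollInterval = 0
    agent.sinks.k1.hdfs.rollCount = 0
    agent.sinks.k1.hdfs.rollSize = 134217728   # bytes, ~128 MB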
>>>>>>
>>>>>> Do I need to use some complex topology with 3 tiers?
>>>>>> How many Flume agents should write to HDFS?
>>>>>>
>>>>>> Best regards.
>>>>>>
>>>>>
>>>>>
