flume-user mailing list archives

From Jagadish Bihani <jagadish.bih...@pubmatic.com>
Subject Re: In flume-ng is there any advantages of 2-tier topology in a cluster of 30-40 nodes?
Date Fri, 01 Feb 2013 13:27:37 GMT

So to conclude from our discussion:

1. A Flume 2-tier architecture is useful mainly when the number of nodes in the
cluster is large, as it won't stress the network and will also go easy on the
namenode (it won't trouble it with too many connections).
What counts as 'large' will vary from cluster to cluster and will depend on the
network characteristics.

2. But in terms of HA, it doesn't add anything. (I stressed this because
everywhere the "collector" agent was being taken for granted, and I thought
there were some subtle benefits of the 2-tier architecture other than what is
written in point 1.)
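
As an illustration of the 2-tier layout discussed in this thread, a minimal
configuration sketch might look like the following. It assumes Flume NG 1.x
property names; the agent names (agent1, collector), host names, port, log
path and HDFS path are all hypothetical placeholders, not values from our
setup. The sink group only lets a first-tier agent keep delivering when one
collector is unreachable. It does not change point 2: if a first-tier node
itself goes down, its data still does not reach HDFS.

```properties
# ---- Tier 1: runs on each source machine (start with --name agent1) ----
agent1.sources = src
agent1.channels = ch
agent1.sinks = k1 k2
agent1.sinkgroups = g1

# Hypothetical source: tail an application log
agent1.sources.src.type = exec
agent1.sources.src.command = tail -F /var/log/app/app.log
agent1.sources.src.channels = ch

# Durable channel so events survive an agent restart
agent1.channels.ch.type = file

# Two Avro sinks, one per collector (placeholder host names)
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector1.example.com
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = ch

agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = collector2.example.com
agent1.sinks.k2.port = 4545
agent1.sinks.k2.channel = ch

# Spread events over both collectors; back off from one that fails
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin

# ---- Tier 2: runs on each collector machine (start with --name collector) ----
collector.sources = avroIn
collector.channels = ch
collector.sinks = hdfsOut

# Receives events from the tier-1 Avro sinks
collector.sources.avroIn.type = avro
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4545
collector.sources.avroIn.channels = ch

collector.channels.ch.type = file

# Only the collectors hold connections to HDFS (placeholder namenode and path)
collector.sinks.hdfsOut.type = hdfs
collector.sinks.hdfsOut.channel = ch
collector.sinks.hdfsOut.hdfs.path = hdfs://namenode.example.com:8020/flume/events
collector.sinks.hdfsOut.hdfs.filePrefix = events
collector.sinks.hdfsOut.hdfs.fileType = DataStream
```

For the 1-tier alternative, the hdfs sink block would move into agent1 in
place of the Avro sinks and the sink group, so every source machine would
write its own files to HDFS.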


On 02/01/2013 06:45 PM, Alexander Alten-Lorenz wrote:
> Ah, I missed your response. Inline
> On Jan 30, 2013, at 3:43 PM, Jagadish Bihani <jagadish.bihani@pubmatic.com> wrote:
>> Hi
>> Thanks  Alexander for the reply.
>> I have added my thoughts in line.
>> On 01/30/2013 11:56 AM, Alexander Alten-Lorenz wrote:
>>> Hi,
>>> If the agents (Tier 1) have access to HDFS, each single client can put data into
>>> HDFS. But this doesn't really make sense; instead you want different files from different
>>> hosts in a structured view (maybe a directory per host, with the contents inside split into buckets).
>> -- But if the number of clients is small (say 30-40), why doesn't it make sense to write
>> to HDFS directly? Ultimately the purpose is to deliver the source data to HDFS (say, into
>> a single HDFS directory).
> Since every agent writes its own file into HDFS, it will work if that doesn't matter to you.
> Even for 40 or fewer agents the impact on your HDFS could be high, depending on the delivery
> rate from the agents, and it can stress your network. Are the agents running directly on HDFS
> nodes, or do they connect over a dedicated network?
>>> When you implement a Tier 2 (maybe 2 or more servers that have access to HDFS),
>>> you get more features such as load balancing, HA and mirrored sinks (for example, one sink
>>> puts the data into HDFS while the other puts it into another system as a backup). For stability
>>> and reliability a Tier 2 architecture is recommended. And it makes some things easier ;)
>> -- I didn't get how we get HA and load balancing using 2 tiers. For example:
>> 1. If HDFS goes down, then in both the 1-tier and the 2-tier
>> case the channel will grow until it reaches its maximum size.
>> 2. If in the 1-tier scenario one node goes down, then its data won't reach HDFS.
>> Similarly, in the 2-tier scenario, if a node from the 1st tier goes down, then its data
>> won't reach HDFS.
> That's correct. Here you have no advantage from a Tier 2 solution.
>> Could you please elaborate if I am missing something?
>>> Cheers,
>>>   Alex
>>> On Jan 30, 2013, at 7:05 AM, Jagadish Bihani <jagadish.bihani@pubmatic.com> wrote:
>>>> Hi
>>>> In our scenario there are around 30 machines from which we want to put data
>>>> into HDFS.
>>>> Now the approach we thought of initially was:
>>>> 1. First tier: agents which collect data from the source and then pass it on via Avro.
>>>> 2. Second tier: let's call those agents 'collectors'; they collect data from the
>>>> first-tier agents and then dump it to HDFS.
>>>> (Second-tier agents are fewer in number, say 4:1.)
>>>> Instead of the above topology, I could simply use an HDFS sink in the first-tier agents.
>>>> It can serve the purpose.
>>>> Also, the number of nodes is small (say 30), so that won't hurt the HDFS namenode
>>>> too much compared to if the number of nodes were, say, 1000.
>>>> But apart from that I don't see any advantage in adding the 2nd tier.
>>>> Is there any advantage I am missing in terms of failover, HDFS performance,
>>>> or any other parameter?
>>>> Regards,
>>>> Jagadish
>>> --
>>> Alexander Alten-Lorenz
>>> http://mapredit.blogspot.com
>>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
> --
> Alexander Alten-Lorenz
> http://mapredit.blogspot.com
> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
