flume-user mailing list archives

From Chris Horrocks <ch...@hor.rocks>
Subject Re: Where to put the flume agents within a cluster
Date Sat, 24 Jun 2017 06:37:19 GMT
I've seen this before. If you put a Flume agent on a worker node that is also running an HDFS
DataNode, and assuming you are using Flume to write into HDFS, you will find that the worker
hosting the Flume agent is the DataNode chosen to house the first replica of the data. This
may slightly warp the distribution of data across your workers (up to the HDFS balancer limit,
anyway) and have an impact on locality. This is due to the bias that various Hadoop services
have toward choosing (box-)local instances of a service rather than engaging in expensive
operations like copying data across the network. A simple fix is to add some edge nodes that
run nothing but Flume.
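
To make the skew concrete, here is a rough Python simulation of the behaviour described above. It is only a sketch under stated assumptions: the first replica goes to the writer's own node when the writer sits on a DataNode, the remaining replicas go to random other nodes, and rack awareness and the balancer are ignored. The function name, the node ids and the "edge-N" labels are made up for illustration and are not part of HDFS or Flume.

```python
import random
from collections import Counter


def simulate_block_placement(num_datanodes, writer_nodes, num_blocks,
                             replication=3, seed=0):
    """Count replicas per DataNode under a simplified placement policy.

    Assumption: DataNodes are numbered 0..num_datanodes-1. A writer whose id
    is one of those numbers is co-located with that DataNode; any other id
    (for example "edge-1") is treated as an edge node outside the cluster.
    """
    rng = random.Random(seed)
    datanodes = list(range(num_datanodes))
    datanode_set = set(datanodes)
    counts = Counter({n: 0 for n in datanodes})

    for _ in range(num_blocks):
        writer = rng.choice(writer_nodes)
        if writer in datanode_set:
            # Writer runs on a DataNode: the first replica stays local.
            first = writer
        else:
            # Writer is an edge node: the first replica goes to a random node.
            first = rng.choice(datanodes)
        # Remaining replicas go to distinct random nodes (no rack logic here).
        others = rng.sample([n for n in datanodes if n != first],
                            replication - 1)
        for node in [first, *others]:
            counts[node] += 1
    return counts
```

Calling `simulate_block_placement(32, [0, 1, 2], 30000)` models three agents co-located with DataNodes 0 to 2, while `simulate_block_placement(32, ["edge-1", "edge-2", "edge-3"], 30000)` models the same load coming from edge nodes. In the first case nodes 0 to 2 end up with several times as many replicas as the rest; in the second the counts come out roughly even.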
DNS RR seems a clunky way of load sharing, btw. If you can get the data into something like
Kafka, the Flume Kafka source's consumer group will distribute the topic's partitions evenly
across the agents.
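
As a rough illustration of why a consumer group balances better than DNS RR, the Python sketch below spreads a topic's partitions across the agents in a group. It is a simplified round-robin stand-in, not Kafka's actual assignor code, and the function name and agent names are hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids 0..num_partitions-1 across consumers round-robin.

    Assumption: every consumer is in the same group and reads one topic, so
    each partition is owned by exactly one consumer at a time.
    """
    if not consumers:
        raise ValueError("at least one consumer is required")
    ordered = sorted(consumers)  # stable order so the result is deterministic
    assignment = {name: [] for name in ordered}
    for partition in range(num_partitions):
        owner = ordered[partition % len(ordered)]
        assignment[owner].append(partition)
    return assignment
```

With 12 partitions and three agents, each agent owns four partitions. If one agent drops out and the function is called again with the two survivors, each owns six, which is the kind of rebalancing DNS RR cannot do on its own.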

On Fri, Jun 23, 2017 at 7:50 pm, Guyle M. Taber <guyle@gmtech.net> wrote:

> We have a 32 data node Hadoop cluster that receives incoming flume data via three data
nodes acting as flume agents. We’re using round robin DNS entries to spread incoming flume
data from various external architectures to the three flume agents on those three data nodes.
It seems like historically, the three data nodes that are the flume agents always have many
more blocks than other data nodes, so I’m wondering what the best approach for placement
of flume agents would be within a cluster. Should all data nodes in the cluster be flume nodes,
or should the flume agent be placed on a name node or other non-data node? Thanks for any