flume-user mailing list archives

From Jeff Lord <jl...@cloudera.com>
Subject Re: Flume Configuration & topology approach
Date Thu, 03 Apr 2014 13:36:37 GMT

Are you using a memory channel? You mention you are getting an OOME, but you don't say what heap size you have set on the Flume JVM.

Don't run an agent on the namenode. Occasionally you will see folks
installing an agent on one of the datanodes in the cluster, but that's not
typically recommended either. It's fine to install the agent on your web
server, but a more scalable approach would be to dedicate two servers to
Flume agents. That will let you load balance your writes into the Flume
pipeline at some point. As you scale you will not want every agent
writing to HDFS, so at some point you may consider adding a collector tier
that aggregates the flow and reduces the number of connections going into
your HDFS cluster.
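As a minimal sketch of that tiered layout, a first-tier agent on each web server could read the rsyslog TCP feed and load-balance across two dedicated collector hosts via Avro sinks. The host names, ports, and directories below are illustrative assumptions, not values from this thread:

```properties
# First-tier agent (one per web server): syslog TCP in, Avro out,
# load-balanced across two collectors.
agent1.sources = syslogSrc
agent1.channels = fileCh
agent1.sinks = avroSink1 avroSink2
agent1.sinkgroups = lbGroup

agent1.sources.syslogSrc.type = syslogtcp
agent1.sources.syslogSrc.host = localhost
agent1.sources.syslogSrc.port = 5140
agent1.sources.syslogSrc.channels = fileCh

# A file channel avoids the OOME risk of a large memory channel and
# survives agent restarts, at the cost of some throughput.
agent1.channels.fileCh.type = file
agent1.channels.fileCh.checkpointDir = /var/flume/checkpoint
agent1.channels.fileCh.dataDirs = /var/flume/data

agent1.sinks.avroSink1.type = avro
agent1.sinks.avroSink1.hostname = collector1.example.com
agent1.sinks.avroSink1.port = 4545
agent1.sinks.avroSink1.channel = fileCh

agent1.sinks.avroSink2.type = avro
agent1.sinks.avroSink2.hostname = collector2.example.com
agent1.sinks.avroSink2.port = 4545
agent1.sinks.avroSink2.channel = fileCh

# Round-robin load balancing with backoff on failed sinks.
agent1.sinkgroups.lbGroup.sinks = avroSink1 avroSink2
agent1.sinkgroups.lbGroup.processor.type = load_balance
agent1.sinkgroups.lbGroup.processor.backoff = true
agent1.sinkgroups.lbGroup.processor.selector = round_robin
```

With this shape, only the two collectors open connections to HDFS, however many web servers sit in front of them.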


On Thu, Apr 3, 2014 at 6:20 AM, Mohit Durgapal <durgapalmohit@gmail.com> wrote:

> Hi,
> We are setting up a Flume cluster, but we are facing some issues related
> to heap size (out of memory). Is there a standard configuration for a
> standard load? If so, can you suggest what it would be for the load stats
> given below?
> Also, we are not sure what topology to go ahead with in our use case.
> We basically have two web servers, each of which can generate logs at
> around 2000 entries per second, with each entry around 137 bytes.
> Currently a PHP script writes these logs to rsyslog, which forwards them
> to a TCP port, and we run a local Flume agent on each web server. These
> local agents listen on that TCP port and put the data directly into HDFS.
> So localhost:tcpport is the Flume source and HDFS is the Flume sink.
> I am confused between three approaches:
> Approach 1: Web server, rsyslog, and Flume agent on the same machine, with
> a Flume collector running on the namenode in the Hadoop cluster to collect
> the data and dump it into HDFS.
> Approach 2: Web server and rsyslog on the same machine, with a Flume
> collector (listening on a remote port for events written by rsyslog on the
> web server) running on the namenode in the Hadoop cluster to collect the
> data and dump it into HDFS.
> Approach 3: Web server, rsyslog, and Flume agent on the same machine, with
> all agents writing directly to HDFS.
> Also, we are using Hive, so we are writing directly into partitioned
> directories. We want an approach that allows us to write to hourly
> partitions.
> I hope that's not too vague.
> Regards
> Mohit Durgapal
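On the hourly-partition question, the HDFS sink can write into one directory per hour using timestamp escape sequences in `hdfs.path`, which lines up with Hive's partition layout. A minimal sketch of a collector-side sink, with an assumed namenode URI and partition naming (`dt=`/`hr=`) that Hive would need matching partition definitions for:

```properties
# Collector tier: Avro in from the web-server agents, HDFS out,
# partitioned by day and hour via %Y/%m/%d/%H escapes.
collector.sources = avroSrc
collector.channels = fileCh
collector.sinks = hdfsSink

collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4545
collector.sources.avroSrc.channels = fileCh

collector.channels.fileCh.type = file
collector.channels.fileCh.checkpointDir = /var/flume/checkpoint
collector.channels.fileCh.dataDirs = /var/flume/data

collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.channel = fileCh
collector.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events/dt=%Y-%m-%d/hr=%H
# Use the agent's clock if events lack a timestamp header;
# alternatively add a timestamp interceptor at the first tier.
collector.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
# Roll on time rather than size/count so each hour closes cleanly.
collector.sinks.hdfsSink.hdfs.rollInterval = 300
collector.sinks.hdfsSink.hdfs.rollSize = 0
collector.sinks.hdfsSink.hdfs.rollCount = 0
collector.sinks.hdfsSink.hdfs.fileType = DataStream
```

At ~2000 events/sec × 137 bytes that is roughly 270 KB/s per web server, so a five-minute roll interval keeps files from being pathologically small while still closing within the hour boundary.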
