flume-user mailing list archives

From Mohit Durgapal <durgapalmo...@gmail.com>
Subject Re: Flume Configuration & topology approach
Date Thu, 03 Apr 2014 18:27:35 GMT
Hi Jeff,

Yes, I am using the memory channel, and that's because I want it to be more
reliable and not miss any events/messages.
I've read in the flume documentation that the memory channel is fast, but
that there is a chance of missing events if the in-memory buffer fills up.

I am sorry for not mentioning the heap settings. I was running it with the
default vm settings, which I later increased (to 1GB); after that I did not
get the OOME. But then again, I am not sure what the right setting is, or
maybe this is more of a trial-and-error setting that depends on our data
load and environment.
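
One way to make the heap setting less of a guess is to start from the load figures quoted from the original message further down (2000 entries per second, about 137 bytes each, two web servers). The sketch below is only back-of-envelope arithmetic: it assumes the 2000 entries/s figure is per server, the channel capacity of 100000 events is a hypothetical value picked for illustration, and per-event JVM overhead (headers, object wrappers) is ignored, so the real heap needed will be higher.

```python
# Back-of-envelope load figures taken from the original message in this thread.
EVENTS_PER_SEC_PER_SERVER = 2000  # assumption: the quoted rate is per web server
EVENT_SIZE_BYTES = 137            # approximate entry size from the thread
WEB_SERVERS = 2


def bytes_per_second(servers: int = WEB_SERVERS) -> int:
    """Raw log payload produced per second across `servers` web servers."""
    return servers * EVENTS_PER_SEC_PER_SERVER * EVENT_SIZE_BYTES


def events_per_hour(servers: int = WEB_SERVERS) -> int:
    """Number of events produced in one hour across `servers` web servers."""
    return servers * EVENTS_PER_SEC_PER_SERVER * 3600


def channel_payload_bytes(capacity_events: int) -> int:
    """Raw payload held by a completely full channel of `capacity_events` events.

    This is a lower bound only: it ignores per-event JVM overhead.
    """
    return capacity_events * EVENT_SIZE_BYTES


if __name__ == "__main__":
    hypothetical_capacity = 100_000  # illustrative value, not a recommendation
    print("bytes/s (all servers):", bytes_per_second())
    print("events/hour (all servers):", events_per_hour())
    print("payload of a full channel:", channel_payload_bytes(hypothetical_capacity))
```

Under those assumptions the two servers together produce roughly 548 KB of payload per second and 14.4 million events per hour, which gives a starting point for choosing a channel capacity and heap size before tuning by observation.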

So, as per your suggestion, I need to consider having two dedicated machines
running flume agents for the two web servers, and one for the collector? We
have just started working on flume and I think your suggestion really makes
sense because we are pretty sure that we will need to scale.

Also, we are using rsyslog to log to a tcp port on localhost, with flume
listening to that tcp port on the same machine. Is that a good and reliable
design? We tried the exec source with the tail -F command on the log file,
but I guess that's not a very dependable way (this is also mentioned in the
flume documentation), as it fetches all the rows from the file again if flume
restarts. Also, I am a little skeptical of the logrotate cron job that rotates
the logs, as I did a few tests and found a lot of problems with it.

The rsyslog tcp option, on the other hand, can dump data to local
disk if the tcp queue gets full. So even if flume goes down we don't lose
the data.

One more thing: I installed cloudera manager just a week back, but I have
done all my testing using flume from the command line. I want to know if I
could use cloudera manager to install and manage flume instances on the new
machines. It'd be great to have one UI to manage all the agents and
collector nodes and even change their configurations.

We are very much beginners in this field, so any suggestions or
recommendations are welcome. Thanks for your help :)


On Thu, Apr 3, 2014 at 7:06 PM, Jeff Lord <jlord@cloudera.com> wrote:

> Mohit,
> Are you using memory channel? You mention you are getting OOME but you
> don't say what heap size you are setting on the flume jvm.
> Don't run an agent on the namenode. Occasionally you will see folks
> installing an agent on one of the datanodes in the cluster but it's not
> typically recommended. It's fine to install the agent on your webserver but
> perhaps a more scalable approach would be to dedicate two servers to flume
> agents. This will allow you to load balance your writes into the flume
> pipeline at some point. As you scale you will not want to have every agent
> writing to hdfs so at some point you may consider adding a collector tier
> that will aggregate the flow and reduce the connections going into your
> hdfs cluster.
> -Jeff
> On Thu, Apr 3, 2014 at 6:20 AM, Mohit Durgapal <durgapalmohit@gmail.com> wrote:
>> Hi,
>> We are setting up a flume cluster but we are facing some issues related
>> to heap size (out of memory). Is there a standard configuration for a
>> standard load?
>> If there is can you suggest what would it be for the load stats given
>> below?
>> Also, we are not sure what topology to go ahead with in our use case.
>> We basically have two web servers which can generate logs at the speed of
>> 2000 entries per second. Each entry is around 137 bytes in size.
>> Currently we use rsyslog (writing to a tcp port), to which a php
>> script writes these logs. And we are running a local flume agent on each
>> webserver; these local agents listen to a tcp port and put data directly
>> in hdfs.
>> So localhost:tcpport is the flume "source" and hdfs is the flume
>> sink.
>> I am confused between three approaches:
>> Approach 1: Web Server, RSyslog & Flume Agent on same machine and a
>> Flume collector running on the Namenode in hadoop cluster, to collect the
>> data and dump into hdfs.
>> Approach 2: Web Server, RSyslog on same machine and a Flume collector
>> (listening on a remote port for events written by rsyslog on web
>> server) running on the Namenode in hadoop cluster, to collect the data and
>> dump into hdfs.
>> Approach 3: Web Server, RSyslog & Flume Agent on same machine. And all
>> agents writing directly to the hdfs.
>> Also, we are using hive, so we are writing directly into partitioned
>> directories. So we want to think of an approach that allows us to write on
>> Hourly partitions.
>> I hope that's not too vague.
>> Regards
>> Mohit Durgapal
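
For the hourly Hive partitions mentioned in the original message, the idea is that each event's timestamp decides which directory it lands in. The sketch below only illustrates that mapping. The base path `/logs/web` and the `dt=`/`hr=` partition names are hypothetical, timestamps are assumed to be UTC, and this is not how flume itself is configured (the flume HDFS sink expresses the same idea with time escape sequences in its path setting).

```python
from datetime import datetime, timezone


def hourly_partition_dir(base: str, event_time: datetime) -> str:
    """Return the hourly partition directory for an event timestamp.

    `base`, `dt=` and `hr=` are illustrative names, not taken from the thread.
    """
    ts = event_time.astimezone(timezone.utc)  # assumption: partitions follow UTC
    return f"{base}/dt={ts.strftime('%Y-%m-%d')}/hr={ts.strftime('%H')}"


if __name__ == "__main__":
    # Timestamp of the reply in this thread, used purely as sample input.
    sample = datetime(2014, 4, 3, 18, 27, 35, tzinfo=timezone.utc)
    print(hourly_partition_dir("/logs/web", sample))
```

Two events one second apart on either side of an hour boundary end up in different directories, which is what lets Hive treat each hour as its own partition.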
