flume-user mailing list archives

From Mohit Durgapal <durgapalmo...@gmail.com>
Subject Flume Configuration & topology approach
Date Thu, 03 Apr 2014 13:20:15 GMT
Hi,

We are setting up a Flume cluster, but we are facing some issues related to
heap size (out-of-memory errors). Is there a standard configuration for a
standard load?

If there is, could you suggest what it would be for the load stats given below?
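
For reference, the only tuning we have tried so far is raising the agent
heap in conf/flume-env.sh (the 2048m value is just a guess on our part):

  # conf/flume-env.sh on each agent machine
  export JAVA_OPTS="-Xms512m -Xmx2048m"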

Also, we are not sure which topology to go with for our use case.

We basically have two web servers that can generate logs at a rate of 2000
entries per second, with each entry around 137 bytes in size.

Currently we use rsyslog (writing to a TCP port), to which a PHP script
writes these logs. We are also running a local Flume agent on each web
server; these local agents listen on a TCP port and put the data directly
into HDFS.

So localhost:tcpport is the Flume source and HDFS is the Flume sink.
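
To make that concrete, our per-server agent config looks roughly like the
sketch below (agent name, port, and HDFS path are placeholders, not our
real values):

  agent1.sources = r1
  agent1.channels = c1
  agent1.sinks = k1

  # syslog TCP source fed by rsyslog on the same box
  agent1.sources.r1.type = syslogtcp
  agent1.sources.r1.host = localhost
  agent1.sources.r1.port = 5140
  agent1.sources.r1.channels = c1

  # in-memory channel; its capacity may be part of our heap problem
  agent1.channels.c1.type = memory
  agent1.channels.c1.capacity = 10000
  agent1.channels.c1.transactionCapacity = 1000

  # HDFS sink writing straight to the cluster
  agent1.sinks.k1.type = hdfs
  agent1.sinks.k1.channel = c1
  agent1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs
  agent1.sinks.k1.hdfs.fileType = DataStream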

I am trying to decide between three approaches:

Approach 1: Web server, rsyslog, and a Flume agent on the same machine, plus
a Flume collector running on the NameNode in the Hadoop cluster to collect
the data and dump it into HDFS (a rough sketch of this hop follows the three
approaches).

Approach 2: Web server and rsyslog on the same machine, and a Flume
collector (listening on a remote port for events written by rsyslog on the
web server) running on the NameNode in the Hadoop cluster to collect the
data and dump it into HDFS.


Approach 3: Web server, rsyslog, and a Flume agent on the same machine, with
all agents writing directly to HDFS.
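
For Approach 1, I assume the hop between the local agents and the collector
would be an Avro sink/source pair, roughly like this (hostnames and ports
are made up):

  # on each web server: replace the hdfs sink with an avro sink
  agent1.sinks.k1.type = avro
  agent1.sinks.k1.channel = c1
  agent1.sinks.k1.hostname = collector.example.com
  agent1.sinks.k1.port = 4545

  # on the collector: avro source in, hdfs sink out
  collector.sources = r1
  collector.sources.r1.type = avro
  collector.sources.r1.bind = 0.0.0.0
  collector.sources.r1.port = 4545
  collector.sources.r1.channels = c1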


Also, we are using Hive and writing directly into partitioned directories,
so we want an approach that allows us to write to hourly partitions.
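
From the HDFS sink docs, the path supports time escape sequences, so I was
picturing something like this for Hive-style hourly partitions (the dt=/hr=
layout is just an example):

  agent1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/dt=%Y-%m-%d/hr=%H
  # events need a timestamp header for the escapes to resolve;
  # otherwise the local time can be used instead:
  agent1.sinks.k1.hdfs.useLocalTimeStamp = true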

I hope that's not too vague.



Regards
Mohit Durgapal
