flume-user mailing list archives

From Mike Percy <mpe...@apache.org>
Subject Re: Writing to HDFS from multiple HDFS agents (separate machines)
Date Fri, 15 Mar 2013 01:46:46 GMT
Hi Gary,
All the suggestions in this thread are good. Something else to consider is
that adding multiple HDFS sinks pulling from the same channel is a
recommended practice for maximizing performance (the competing-consumers
pattern). In that case, not only is it a good idea to put the data into
directories specific to the hostname of the Flume agent writing to HDFS,
but you will also need to do something like numbering the HDFS sink path
(or filePrefix) to indicate which HDFS sink wrote out the event, in order
to prevent name collisions.


# add hostname interceptor to your source as described above

# hdfs sinks...
agent.sinks.hdfs-1.hdfs.path = /some/path/%{host}/1/web-events
# … snip ...
agent.sinks.hdfs-2.hdfs.path = /some/path/%{host}/2/web-events
# … etc ...
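
For context, a fuller sketch of an agent wired this way might look like the
following. This is only an illustration of the pattern Mike describes, not a
config from the thread: the source type, port, channel name, and capacity
values are assumptions, and you would adapt them to your own setup.

# Illustrative Flume agent config (names and values are assumptions)
agent.sources = avro-src
agent.channels = mem-ch
agent.sinks = hdfs-1 hdfs-2

# Host interceptor stamps each event with the agent's hostname,
# which the sinks reference as %{host} in their paths
agent.sources.avro-src.type = avro
agent.sources.avro-src.bind = 0.0.0.0
agent.sources.avro-src.port = 4141
agent.sources.avro-src.channels = mem-ch
agent.sources.avro-src.interceptors = host-int
agent.sources.avro-src.interceptors.host-int.type = host
agent.sources.avro-src.interceptors.host-int.useIP = false

agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 10000

# Two HDFS sinks draining the same channel (competing consumers);
# the /1/ and /2/ path components keep their output files from colliding
agent.sinks.hdfs-1.type = hdfs
agent.sinks.hdfs-1.channel = mem-ch
agent.sinks.hdfs-1.hdfs.path = /some/path/%{host}/1/web-events
agent.sinks.hdfs-2.type = hdfs
agent.sinks.hdfs-2.channel = mem-ch
agent.sinks.hdfs-2.hdfs.path = /some/path/%{host}/2/web-events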

Hope that helps.


On Thu, Mar 14, 2013 at 3:34 PM, Gary Malouf <malouf.gary@gmail.com> wrote:

> To be clear, I am referring to the segregating of data from different
> flume sinks as opposed to the original source of the event.  Having said
> that, it sounds like your approach is the easiest.
> -Gary
> On Thu, Mar 14, 2013 at 5:54 PM, Gary Malouf <malouf.gary@gmail.com> wrote:
>> Hi guys,
>> I'm new to Flume (HDFS for that matter), using the version packaged with
>> CDH4 (1.3.0), and was wondering how others are maintaining different file
>> names being written to per HDFS sink.
>> My initial thought is to create a separate sub-directory in hdfs for each
>> sink - though I feel like the better way is to somehow prefix each file
>> with a unique sink id.  Are there any patterns that others are following
>> for this?
>> -Gary
