flume-user mailing list archives

From: Chen Wang <chen.apache.s...@gmail.com>
Subject: Re: seeking help on flume cluster deployment
Date: Fri, 10 Jan 2014 06:09:37 GMT
Ashish,
Interestingly enough, I was initially doing option 1 too, and had a working
version. But I finally gave it up, because in my bolt I have to flush to HDFS
either when the data reaches a certain size or when a timer fires, which is
exactly what Flume already offers. It also involves some complexity in grouping
entries within the same partition, while with Flume that is a piece of cake.

Thank you so much for everyone's input. It helped me a lot!
Chen
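
The size-or-timer flush Chen describes is what the HDFS sink's roll settings
provide out of the box. A minimal sketch, assuming an agent named a1 with an
HDFS sink k1 (names and values are illustrative, not from Chen's setup):

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
    # roll on whichever trigger fires first; 0 disables a trigger
    a1.sinks.k1.hdfs.rollSize = 134217728
    a1.sinks.k1.hdfs.rollInterval = 300
    a1.sinks.k1.hdfs.rollCount = 0
    # the %Y-%m-%d escapes need a timestamp; use the agent's clock
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

With rollSize and rollInterval both set, a file is closed at roughly 128 MB or
after 5 minutes, whichever comes first, so no custom flushing code is needed.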



On Thu, Jan 9, 2014 at 10:00 PM, Ashish <paliwalashish@gmail.com> wrote:

> Got it!
>
> My first reaction was to use an HDFS bolt to write data directly to HDFS, but
> I couldn't find an implementation of one. My knowledge of Storm is limited.
> If the data is already flowing through Storm, you have two options:
> 1. Write a bolt to dump data to HDFS.
> 2. Write a Flume bolt using the RPC client, as recommended in this thread, and
> reuse Flume's capabilities (see the sketch below).
>
> If you already have a Flume installation running, #2 is the quickest way to
> get going. Even otherwise, installing and running Flume is a walk in the park :)
>
> You can also follow the related discussion on
> https://issues.apache.org/jira/browse/FLUME-1286. There is some good info
> in that JIRA.
>
> thanks
> ashish
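
A rough sketch of what option #2 could look like. It assumes a Flume Avro
source listening on flume-host:41414; the class name, host, and tuple layout
are made up for illustration:

    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;

    public class FlumeForwardBolt extends BaseRichBolt {
        private transient RpcClient client;
        private transient OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // assumes a Flume Avro source on flume-host:41414
            this.client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        }

        @Override
        public void execute(Tuple tuple) {
            try {
                // one tuple -> one Flume event; switch to appendBatch()
                // if throughput matters
                client.append(EventBuilder.withBody(
                        tuple.getString(0), StandardCharsets.UTF_8));
                collector.ack(tuple);
            } catch (EventDeliveryException e) {
                collector.fail(tuple); // let Storm replay the tuple
            }
        }

        @Override
        public void cleanup() {
            if (client != null) {
                client.close();
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing to declare
        }
    }

Failing the tuple on EventDeliveryException lets Storm's replay mechanism
cover transient Flume outages, which is what makes this combination attractive.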
>
>
>
>
> On Fri, Jan 10, 2014 at 11:08 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>
>> Ashish,
>> Since we already use Storm for other real-time processing, I thus want to
>> reuse it. The biggest advantage for me of using Storm in this case is
>> that I can use Storm's spout to read from our socket server continuously,
>> and the Storm framework can ensure it never stops. Meanwhile, I can also
>> easily filter or translate the data in the bolt before sending it to Flume. For
>> this piece of the data stream, my first step right now is to get it into HDFS,
>> but I will add real-time processing soon.
>> Does that make sense to you?
>> Thanks,
>> Chen
>>
>>
>> On Thu, Jan 9, 2014 at 9:29 PM, Ashish <paliwalashish@gmail.com> wrote:
>>
>>> Why do you need Storm? Are you doing any real-time processing? If not,
>>> IMHO, avoid Storm.
>>>
>>> You can use something like this:
>>>
>>> Socket -> Load Balanced RPC Client -> Flume Topology with HA
>>>
>>> What application-level protocol are you using at the socket level?
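
For concreteness, the load-balanced client Ashish sketches is created from a
Properties object, as in the Flume Developer Guide; the hostnames below are
placeholders:

    import java.util.Properties;

    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;

    public class LbClientSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("client.type", "default_loadbalance");
            // logical names, resolved to host:port entries below
            props.put("hosts", "h1 h2 h3");
            props.put("hosts.h1", "host1.example.com:41414");
            props.put("hosts.h2", "host2.example.com:41414");
            props.put("hosts.h3", "host3.example.com:41414");
            props.put("host-selector", "round_robin"); // or "random"
            props.put("backoff", "true"); // temporarily skip failing hosts
            RpcClient client = RpcClientFactory.getInstance(props);
            // ... append events as usual, then client.close()
        }
    }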
>>>
>>>
>>> On Fri, Jan 10, 2014 at 10:50 AM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>
>>>> Jeff, Joao,
>>>> Thanks for the pointer!
>>>> I think I am getting close here:
>>>> 1. Set up a cluster of Flume agents with redundancy: source as Avro,
>>>> sink as HDFS.
>>>> 2. Use Storm (not strictly necessary) to read from our socket server, then
>>>> in the bolt use the Flume client (the load-balancing RPC client) to send the
>>>> events to the agents set up in step 1.
>>>>
>>>> I then get all the benefits of both Storm and Flume. Does this setup
>>>> look right to you?
>>>> thank you very much,
>>>> Chen
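
For step 1, a minimal sketch of one of the redundant agents, assuming the RPC
clients target port 41414 on each host (agent, channel, and sink names are
illustrative):

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Avro source: the target of the load-balancing RPC client
    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 41414
    a1.sources.r1.channels = c1

    # a file channel survives an agent restart without losing events
    a1.channels.c1.type = file

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

Running the same configuration on several machines, with all of them listed in
the load-balancing client, gives the redundancy described in step 1.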
>>>>
>>>>
>>>> On Thu, Jan 9, 2014 at 8:58 PM, Joao Salcedo <joao.salcedo@gmail.com> wrote:
>>>>
>>>>> Hi Chen,
>>>>>
>>>>> Maybe it would be worth checking this
>>>>>
>>>>> http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client
>>>>>
>>>>> Regards,
>>>>>
>>>>> Joao
>>>>>
>>>>>
>>>>> On Fri, Jan 10, 2014 at 3:50 PM, Jeff Lord <jlord@cloudera.com> wrote:
>>>>>
>>>>>> Have you taken a look at the load-balancing RPC client?
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>
>>>>>>> Jeff,
>>>>>>> I read this presentation at the beginning, but didn't find a solution to
>>>>>>> my use case. To simplify: I have only one data source (composed of 5
>>>>>>> socket servers), and I am looking for a fault-tolerant deployment of Flume
>>>>>>> that can read from this single data source and sink to HDFS in a
>>>>>>> fault-tolerant mode: when one node dies, another Flume node can pick up and
>>>>>>> continue.
>>>>>>> Thanks,
>>>>>>> Chen
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lord <jlord@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Chen,
>>>>>>>>
>>>>>>>> Have you taken a look at this presentation on Planning and
>>>>>>>> Deploying Flume from ApacheCon?
>>>>>>>>
>>>>>>>>
>>>>>>>> http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
>>>>>>>>
>>>>>>>> It may have the answers you need.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jeff
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks Saurabh.
>>>>>>>>> If that is the case, I am actually thinking about using a Storm
>>>>>>>>> spout to talk to our socket server, so that the Storm cluster can take
>>>>>>>>> care of the socket-reading part. Then on each Storm node, start a Flume
>>>>>>>>> agent listening on an RPC port and writing to HDFS (with failover). Then
>>>>>>>>> in the Storm bolt, simply send the data over RPC so that Flume can get it.
>>>>>>>>> What do you think of this setup? It takes care of failover both on
>>>>>>>>> the source (by Storm) and on the sink (by Flume), but it looks a little
>>>>>>>>> complicated to me.
>>>>>>>>> Chen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <qna.list.141211@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Chen,
>>>>>>>>>>
>>>>>>>>>> I think Flume doesn't have a way to configure multiple sources
>>>>>>>>>> pointing to the same data source. Of course you can do that, but you
>>>>>>>>>> will end up with duplicate data. Flume offers failover at the sink level.
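
Sink-level failover is configured through a sink group with the failover
processor. A minimal sketch for an agent a1 with two sinks k1 and k2 (names
illustrative):

    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    # higher priority is tried first; k2 takes over only if k1 fails
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    # maximum back-off (ms) for a failed sink before it is retried
    a1.sinkgroups.g1.processor.maxpenalty = 10000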
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> OK, so after more researching :) It seems that what I need is
>>>>>>>>>>> failover for the agent source (not failover for the sink):
>>>>>>>>>>> if one agent dies, another agent of the same kind starts running.
>>>>>>>>>>> Does Flume support this scenario?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Chen
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> After reading more docs, it seems that if I want to achieve my
>>>>>>>>>>>> goal, I have to do the following:
>>>>>>>>>>>> 1. Have one agent with the custom source running on one node.
>>>>>>>>>>>> This agent reads from those 5 socket servers and sinks to some
>>>>>>>>>>>> kind of sink (maybe another socket?).
>>>>>>>>>>>> 2. On one or more other machines, set up collectors that read
>>>>>>>>>>>> from the agent's sink in 1, and sink to HDFS.
>>>>>>>>>>>> 3. Have a master node managing the nodes in 1 and 2.
>>>>>>>>>>>>
>>>>>>>>>>>> But this seems to be overkill in my case: in 1, I can already
>>>>>>>>>>>> sink to HDFS, and the data arrives at the socket servers much
>>>>>>>>>>>> faster than the translation part can process it. I want to be able
>>>>>>>>>>>> to later add more nodes to do the translation job. So what is the
>>>>>>>>>>>> correct setup?
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Chen
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <chen.apache.solr@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Guys,
>>>>>>>>>>>>> In my environment, the client is 5 socket servers. Thus I
>>>>>>>>>>>>> wrote a custom source spawning 5 threads, one reading from each
>>>>>>>>>>>>> of them indefinitely, and the sink is HDFS (a Hive table). These
>>>>>>>>>>>>> work fine when running the flume-ng agent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But how can I deploy this in distributed mode (as a cluster)? I
>>>>>>>>>>>>> am confused about the 3 tiers (agent, collector, storage)
>>>>>>>>>>>>> mentioned in the doc. Do they apply to my case? How can I
>>>>>>>>>>>>> separate my agent/collector/storage? Apparently I can only have
>>>>>>>>>>>>> one agent running: multiple agents would result in getting
>>>>>>>>>>>>> duplicates from the socket server. But I want another agent to
>>>>>>>>>>>>> take over if one agent dies. I would also like to be able to add
>>>>>>>>>>>>> horizontal scalability for writing to HDFS. How can I achieve
>>>>>>>>>>>>> all this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> thank you very much for your advice.
>>>>>>>>>>>>> Chen
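
For readers curious what such a source looks like, a hedged sketch of a
multi-threaded socket-reading source; the hosts/port settings and the
newline-delimited framing are assumptions, not details from Chen's code:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    import org.apache.flume.Context;
    import org.apache.flume.EventDrivenSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    public class MultiSocketSource extends AbstractSource
            implements EventDrivenSource, Configurable {

        private String[] hosts;
        private int port;
        private volatile boolean running;

        @Override
        public void configure(Context context) {
            hosts = context.getString("hosts").split(","); // e.g. "h1,h2,h3,h4,h5"
            port = context.getInteger("port", 9999);
        }

        @Override
        public synchronized void start() {
            running = true;
            for (final String host : hosts) {
                new Thread(new Runnable() {
                    public void run() {
                        // assumes newline-delimited messages
                        try (Socket s = new Socket(host, port);
                             BufferedReader in = new BufferedReader(
                                     new InputStreamReader(s.getInputStream(),
                                             StandardCharsets.UTF_8))) {
                            String line;
                            while (running && (line = in.readLine()) != null) {
                                getChannelProcessor().processEvent(
                                        EventBuilder.withBody(line, StandardCharsets.UTF_8));
                            }
                        } catch (Exception e) {
                            // a real implementation would log and reconnect with backoff
                        }
                    }
                }, "socket-reader-" + host).start();
            }
            super.start();
        }

        @Override
        public synchronized void stop() {
            running = false;
            super.stop();
        }
    }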
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Mailing List Archives,
>>>>>>>>>> QnaList.com
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> thanks
>>> ashish
>>>
>>> Blog: http://www.ashishpaliwal.com/blog
>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>>
>>
>>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
