From: Kamal Bahadur
Date: Thu, 20 Oct 2011 07:48:40 -0700
Subject: Re: flume dying on InterruptException (nanos)
To: flume-user@incubator.apache.org

I agree with Prasad's solution. Since we are going to use different backends (I use Cassandra) to store data, we cannot have some fixed time there.

Thanks,
Kamal

On Wed, Oct 19, 2011 at 6:08 PM, Prasad Mujumdar wrote:
>
> hmm ... I am wondering if the Trigger thread should just bail out without resetting the trigger if it can't get hold of the lock in 1 sec. The next append or the next trigger should take care of rotating the files ..
>
> thanks
> Prasad
>
>
> On Wed, Oct 19, 2011 at 1:42 PM, Cameron Gandevia wrote:
>
>> We recently modified the RollSink to hide our problem by giving it a few seconds to finish writing before rolling. We are going to test it out and, if it fixes our issue, we will provide a patch later today.
>> On Oct 19, 2011 1:27 PM, "AD" wrote:
>>
>>> Yeah, I am using the HBase sink, so I guess it's possible something is getting hung up there and causing the collector to die. The number of file descriptors seems safely under the limit.
>>>
>>> On Wed, Oct 19, 2011 at 3:16 PM, Cameron Gandevia wrote:
>>>
>>>> We were seeing the same issue when our HDFS instance was overloaded and taking over a second to respond. I assume that if whatever backend is down, the collector will die and need to be restarted when it becomes available again?
>>>> Doesn't seem very reliable.
>>>>
>>>>
>>>> On Wed, Oct 19, 2011 at 8:13 AM, Ralph Goers <ralph.goers@dslextreme.com> wrote:
>>>>
>>>>> We saw this problem when it was taking more than 1 second for a response from writing to Cassandra (our back end). A single long response will kill the collector. We had to revert back to the version of Flume that uses synchronization instead of read/write locking to get around this.
>>>>>
>>>>> Ralph
>>>>>
>>>>> On Oct 18, 2011, at 1:55 PM, AD wrote:
>>>>>
>>>>> > Hello,
>>>>> >
>>>>> > My collector keeps dying with the following error. Is this a known issue? Any idea how to prevent it, or find out what is causing it? Is format("%{nanos}") an issue?
>>>>> >
>>>>> > 2011-10-17 23:16:33,957 INFO com.cloudera.flume.core.connector.DirectDriver: Connector logicalNode flume1-18 exited with error: null
>>>>> > java.lang.InterruptedException
>>>>> >        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(AbstractQueuedSynchronizer.java:1246)
>>>>> >        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(ReentrantReadWriteLock.java:1009)
>>>>> >        at com.cloudera.flume.handlers.rolling.RollSink.close(RollSink.java:296)
>>>>> >        at com.cloudera.flume.core.EventSinkDecorator.close(EventSinkDecorator.java:67)
>>>>> >        at com.cloudera.flume.core.EventSinkDecorator.close(EventSinkDecorator.java:67)
>>>>> >
>>>>> >
>>>>> > source: collectorSource("35853")
>>>>> > sink: regexAll("^([0-9.]+)\\s\\[([0-9a-zA-z\\/: -]+)\\]\\s([A-Z]+)\\s([a-zA-Z0-9.:]+)\\s\"([^\\s]+)\"\\s([0-9]+)\\s([0-9]+)\\s\"([^\\s]+)\"\\s\"([a-zA-Z0-9\\/()_ -;]+)\"\\s(hit|miss)\\s([0-9.]+)","hbase_remote_host","hbase_request_date","hbase_request_method","hbase_request_host","hbase_request_url","hbase_response_status","hbase_response_bytes","hbase_referrer","hbase_user_agent","hbase_cache_hitmiss","hbase_origin_firstbyte") format("%{nanos}:") split(":", 0, "hbase_") format("%{node}:") split(":",0,"hbase_node") digest("MD5","hbase_md5") collector(10000) { attr2hbase("apache_logs","f1","","hbase_") }
>>>>
>>>> --
>>>> Thanks
>>>>
>>>> Cameron Gandevia
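[Archive note] The failure mode in the stack trace above can be reproduced outside Flume. The following is a minimal, hypothetical sketch (not the actual RollSink code): a thread parked in a timed tryLock on a ReentrantReadWriteLock write lock receives an interrupt, and tryLock throws InterruptedException, which is the exception that kills the collector if it escapes. The class and method names are invented for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TryLockInterruptDemo {

    // Returns true when the "closer" thread is interrupted while parked in
    // the timed tryLock -- the same code path as the stack trace above
    // (tryAcquireNanos via WriteLock.tryLock).
    static boolean interruptedWhileClosing() throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        lock.readLock().lock(); // simulate an in-flight append holding the read lock

        AtomicBoolean interrupted = new AtomicBoolean(false);
        Thread closer = new Thread(() -> {
            try {
                // Same call shape as the trace: timed tryLock on the write lock.
                lock.writeLock().tryLock(1, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                // This is the exception that propagates up and kills the
                // collector; Prasad's suggestion amounts to bailing out here
                // without resetting the trigger, instead of letting it escape.
                interrupted.set(true);
            }
        });
        closer.start();
        Thread.sleep(100);  // let the closer block waiting for the write lock
        closer.interrupt(); // interrupt it mid-wait, as a driver shutdown does
        closer.join();
        return interrupted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("interrupted while closing: " + interruptedWhileClosing());
    }
}
```

Note that the untimed `lock()` does not throw InterruptedException, which is consistent with Ralph's observation that the older synchronized-based version did not die this way.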