flume-user mailing list archives

From Hari Shreedharan <hshreedha...@cloudera.com>
Subject Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"
Date Fri, 15 Aug 2014 23:57:59 GMT
Can you try the 1.5 release? There were a few fixes that went in.

Mangtani, Kushal wrote:
>
> Apache Flume 1.4 tarball
> ------------------------------------------------------------------------
> *From:* Hari Shreedharan [hshreedharan@apache.org]
> *Sent:* Friday, August 15, 2014 9:27 AM
> *To:* user@flume.apache.org
> *Subject:* Re: File Channel Exception "Failed to obtain lock for
> writing to the log.Try increasing the log write timeout value"
>
> What version of Flume are you using?
>
>
> On Tue, Aug 12, 2014 at 1:51 PM, Mangtani, Kushal
> <Kushal.Mangtani@viasat.com> wrote:
>
> Bumping this up to make sure someone answers this.
>
> P.S.: Let me know if I need to post these questions in a separate
> thread.
>
> Thanks,
> Kushal Mangtani
>
> ------------------------------------------------------------------------
> *From:* Mangtani, Kushal
> *Sent:* Friday, August 08, 2014 12:39 PM
> *To:* user@flume.apache.org
> *Subject:* RE: File Channel Exception "Failed to obtain lock for
> writing to the log.Try increasing the log write timeout value"
>
> Hello FlumeTeam,
>
> I have recently seen a bug/odd behaviour in the File Channel. I am
> using the File Channel in my prod env to save me from hiccups in
> prod, and recently my File Channel filled up.
> So the only ways of fixing this were:
>
> 1. restart the Flume process, or
> 2. tweak the transactionCapacity of the File Channel (see the
> property sketch below).
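>
> For reference, a minimal sketch of the sizing properties in
> question. The agent name "collector" and channel name "c1" are
> placeholders, and the values are illustrative only:
>
> # File Channel sizing (placeholder names, illustrative values)
> collector.channels.c1.type = file
> # maximum number of events the channel can hold on disk
> collector.channels.c1.capacity = 1000000
> # maximum number of events per put/take transaction
> collector.channels.c1.transactionCapacity = 10000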
>
> I went with 1). However, after doing so my Flume process was stuck
> and the logs showed:
>
> 08 Aug 2014 19:03:54,014 INFO [lifecycleSupervisor-1-4]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:597)
> - File position exceeds the threshold: 1623195647, position:
> 1623195649
>
> 08 Aug 2014 19:03:54,015 INFO [lifecycleSupervisor-1-4]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:608)
> - Encountered EOF at 1623195649 in
> /usr/lib/flume-ng/datastore/channel1/logs/log-5802
>
>
> It looks like, for some reason, the file pointer was at a position
> greater than the file size. Ultimately I had to delete the logs,
> checkpoint, and backup-checkpoint for my Flume process to resume
> processing events.
>
> So the whole purpose of the File Channel, i.e. better durability at
> the cost of average performance, was defeated here.
>
>
> Questions:
>
> 1. Is there something I could have done to prevent this data loss?
> 2. Also, I believe Flume NG uses a push-pull mechanism, where
> sources push events to channels and sinks pull events from
> channels, in contrast to Flume OG (a push-only mechanism).
> Correct me if I'm wrong. Was there a reason for this push-pull
> architecture in Flume land?
>
> Thanks,
> Kushal Mangtani
>
> ------------------------------------------------------------------------
> *From:* Hari Shreedharan [hshreedharan@cloudera.com]
> *Sent:* Friday, February 28, 2014 11:38 AM
> *To:* user@flume.apache.org
> *Subject:* Re: File Channel Exception "Failed to obtain lock for
> writing to the log.Try increasing the log write timeout value"
>
> It is currently in trunk, so it will be in Flume 1.5.
>
>
> Thanks,
> Hari
>
> On Friday, February 28, 2014 at 11:30 AM, Mangtani, Kushal wrote:
>
>>
>> Hari,
>>
>> Thanks for the feedback. This was really helpful. I am going to
>> use provisioned IOPS for a while to make sure the exception does
>> not come back.
>>
>> Also, from the comments section of the JIRA ticket given below, I
>> noticed that you were able to identify the reason for the
>> exception: perhaps old logs are never deleted. Are you going to
>> put a patch into Flume 1.5 so that this exception is resolved?
>>
>> -Kushal Mangtani
>>
>> *From:*Hari Shreedharan [mailto:hshreedharan@cloudera.com]
>> *Sent:* Thursday, February 27, 2014 11:19 AM
>> *To:* user@flume.apache.org
>> *Subject:* Re: File Channel Exception "Failed to obtain lock for
>> writing to the log.Try increasing the log write timeout value"
>>
>> See https://issues.apache.org/jira/browse/FLUME-2307
>>
>>
>> This JIRA removed the write-timeout, but that only makes sure
>> that no transaction is left in limbo. The real reason, like I
>> said, is slow IO. Try using provisioned IOPS for better throughput.
>>
>> Thanks,
>>
>> Hari
>>
>> On Thursday, February 27, 2014 at 10:48 AM, Mangtani, Kushal wrote:
>>
>> Hari,
>>
>> Thanks for the prompt reply. The current file channel
>> write-timeout is 30 sec, the EBS drive's current capacity is
>> 200 GB, and the rate of writes is 60 events/min, where each
>> event is approx. 40 KB.
>>
>> I am thinking of increasing the file channel write-timeout to
>> 60 sec. What do you suggest?
>>
>> Also, one strange thing I noticed: all the Flume collectors
>> get the same exception, even though each has a separate EBS
>> drive. Any input?
>>
>> Thanks,
>>
>> Kushal Mangtani
>>
>> *From:*Hari Shreedharan [mailto:hshreedharan@cloudera.com]
>> *Sent:* Thursday, February 27, 2014 10:35 AM
>> *To:* user@flume.apache.org
>> *Subject:* Re: File Channel Exception "Failed to obtain lock
>> for writing to the log.Try increasing the log write timeout
>> value"
>>
>> For now, increase the file channel's write-timeout parameter
>> to around 30 or so (basically the file channel is timing out
>> while writing to disk); a sketch of the property follows. But
>> the basic problem you are seeing is that your EBS volume is
>> very slow and IO is taking too long. You either need to
>> increase your EBS IO capacity, or reduce the rate of writes.
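>>
>> A minimal sketch of the property I mean; the agent name
>> "collector" is a placeholder, the channel name "c2" is taken
>> from your stack trace, and the value is in seconds:
>>
>> # File Channel write lock timeout in seconds (the 1.4 default is 3)
>> collector.channels.c2.write-timeout = 30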
>>
>> Thanks,
>>
>> Hari
>>
>> On Thursday, February 27, 2014 at 10:28 AM, Mangtani, Kushal
>> wrote:
>>
>> *From:*Mangtani, Kushal
>> *Sent:* Wednesday, February 26, 2014 4:51 PM
>> *To:* 'user@flume.apache.org'; 'user-subscribe@flume.apache.org'
>> *Cc:* Rangnekar, Rohit; 'dev@flume.apache.org'
>> *Subject:* File Channel Exception "Failed to obtain lock
>> for writing to the log.Try increasing the log write
>> timeout value"
>>
>> Hi,
>>
>> I'm using the Flume NG 1.4 (CDH 4.4) tarball for collecting
>> aggregated logs.
>>
>> I am running a two-tier (agent, collector) Flume
>> configuration with custom plugins. There are approximately
>> 20 agent machines (receiving data) and 6 collector machines
>> (writing to HDFS), all running independently. However, I
>> have been seeing some File Channel exceptions on the
>> collector side. The agents appear to be working fine.
>>
>> Error stacktrace:
>>
>> org.apache.flume.ChannelException: Failed to obtain lock
>> for writing to the log. Try increasing the log write
>> timeout value. [channel=c2]
>>
>> at
>> org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doRollback(FileChannel.java:621)
>>
>> at
>> org.apache.flume.channel.BasicTransactionSemantics.rollback(BasicTransactionSemantics.java:168)
>>
>> at
>> org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:421)
>>
>> at
>> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>>
>> at
>> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>
>> …..
>>
>> And I keep getting the same error.
>>
>> P.S.: This same exception is repeated on most of the Flume
>> collector machines, but not at the same time; there is
>> usually a difference of a couple of hours or more.
>>
>> 1. The HDFS sinks write to HDFS hosted on Amazon EC2 instances.
>>
>> 2. The dataDir and checkpointDir of the File Channel on every
>> Flume collector instance are mounted on a separate Hadoop EBS
>> drive. This makes sure that two separate collectors do not
>> overlap their log and checkpoint dirs. There is a symbolic
>> link, i.e. /usr/lib/flume-ng/datasource -> /hadoop/ebs/mnt-1.
>> (A property sketch follows this list.)
>>
>> 3. Flume works fine for a couple of days, and all the agents
>> and collectors are initialized properly without exceptions.
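>>
>> A minimal sketch of the layout described in 2., with a
>> placeholder agent name "collector" and channel name "c2"; the
>> subdirectory names under the symlinked mount are placeholders
>> too:
>>
>> # each collector keeps its File Channel dirs on its own EBS mount
>> collector.channels.c2.type = file
>> collector.channels.c2.checkpointDir = /usr/lib/flume-ng/datasource/checkpoint
>> collector.channels.c2.dataDirs = /usr/lib/flume-ng/datasource/data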
>>
>> Questions:
>>
>> Regarding the exception "Failed to obtain lock for writing to
>> the log. Try increasing the log write timeout value.
>> [channel=c2]": according to the documentation, such an
>> exception occurs only if two processes are accessing the
>> same file/directory. However, each channel is configured
>> separately, so no two channels should access the same dir.
>> Hence, this exception does not indicate anything meaningful
>> to me. Please correct me if I'm wrong.
>>
>> Also, hdfs.callTimeout covers HDFS calls such as open and
>> write operations; if there is no response within that
>> duration, the call times out, and on timeout the sink closes
>> the file. Please correct me if I'm wrong. Also, is there a
>> way to specify the number of retries before it closes the
>> file? (See the sketch below.)
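>>
>> A minimal sketch of the timeout property in question, with
>> placeholder agent and sink names ("collector" and "hdfsSink");
>> the value is in milliseconds:
>>
>> # time allowed for HDFS open/write/flush/close calls (default 10000 ms)
>> collector.sinks.hdfsSink.hdfs.callTimeout = 30000
>>
>> As far as I can tell, 1.4 has no knob for the number of close
>> retries; later releases add hdfs.closeTries and
>> hdfs.retryInterval for that (check the user guide for your
>> version).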
>>
>> Your input/suggestions would be greatly appreciated.
>>
>> Regards
>>
>> Kushal Mangtani
>>
>> Software Engineer
>>
>
>
