From: Guy Doulberg
Date: Sun, 16 Oct 2011 17:36:12 +0200
To: flume-user@incubator.apache.org
Subject: Loggin Large Events to S3
Message-ID: <4E9AF9EC.3080902@conduit.com>

Hi fellow Flumers,

I have been struggling with Flume for a couple of weeks. I am trying to log events to Amazon S3 so that I can later use Amazon EMR to analyze them.

The architecture I am trying to build is:

The client posts bzipped data
-> an endpoint decompresses the data and attaches extra data (such as HTTP headers)
-> the endpoint writes the data to a file on the local file system
-> a Flume agent tails that file
-> the agent sends the events to a Flume collector
-> the Flume collector sends the file to S3, bzipped

After some effort I got this architecture working for small events. The problem is that the events I need to store are large (72 KB expanded), and I have no control over the client (it writes large zipped XML files and I can't change this behavior), so the architecture has to be able to deal with events of this size.

I have been thinking about two approaches, and I wanted to share them with you and hear what you think:

1. Flume supports a 32 KB event size, but it can support larger events by changing the "flume.event.max.size.bytes" property. I tried that, but:
   a. I am worried about the performance impact.
   b. It didn't work well: the events it writes seem to be trimmed, and it also keeps writing them endlessly.

2. Sending the events through Flume still bzipped (not decompressing them at the endpoint) to S3, and decompressing them later in EMR. In that case:
   a. In what format should I store the events?
   b. How would I enrich the data with the request headers?

Thanks for your time.

Guy Doulberg
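
To make questions 2a and 2b concrete, here is a minimal sketch of one possible record format. It does not use any Flume API. It assumes the endpoint writes one JSON record per line to the tailed file, with the request headers stored next to the still-compressed body, which is base64-encoded so the record contains no raw newlines. The names wrap_event, unwrap_event and body_bz2_b64 are hypothetical, chosen only for illustration. Whether a record stays under the 32 KB limit depends on how well the XML compresses.

```python
import base64
import bz2
import json


def wrap_event(compressed_body: bytes, headers: dict) -> str:
    """Build one single-line record from a bzipped body and its HTTP headers.

    The body is NOT decompressed here; it is only base64-encoded so that
    the record is plain text without raw newline characters.
    """
    record = {
        "headers": headers,
        "body_bz2_b64": base64.b64encode(compressed_body).decode("ascii"),
    }
    # json.dumps escapes control characters, so the result is a single line.
    return json.dumps(record, separators=(",", ":"))


def unwrap_event(line: str):
    """Reverse wrap_event: return (headers, decompressed XML bytes).

    This is the step that would run later, on the analysis side.
    """
    record = json.loads(line)
    xml = bz2.decompress(base64.b64decode(record["body_bz2_b64"]))
    return record["headers"], xml
```

On the analysis side, each line of the stored files could then be passed to unwrap_event to recover both the headers and the original XML.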