flume-user mailing list archives

From Ashish <paliwalash...@gmail.com>
Subject Re: AWS S3 flume source
Date Mon, 11 Aug 2014 06:35:25 GMT
Seems like there is a bit of confusion here. FLUME-1491 only deals with the
configuration part, nothing else. Even if it gets integrated, you would
still need to write/expose an API to store metadata in ZK (FLUME-1491
doesn't bring that in).
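
To illustrate the kind of metadata API this implies, here is a minimal sketch
in plain Java. Nothing in it comes from Flume or FLUME-1491:
`ProcessedObjectStore` and `FileProcessedObjectStore` are hypothetical names,
and a local file stands in for ZK only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical interface; it is not part of Flume or FLUME-1491. */
interface ProcessedObjectStore {
    boolean isProcessed(String objectKey);
    void markProcessed(String objectKey);
}

/** Keeps processed keys in memory and appends each new key to a local file. */
final class FileProcessedObjectStore implements ProcessedObjectStore {
    private final Path stateFile;
    private final Set<String> processed = new HashSet<>();

    FileProcessedObjectStore(Path stateFile) {
        this.stateFile = stateFile;
        if (Files.exists(stateFile)) {
            try {
                // Reload keys recorded before a restart (assumes one key per line).
                processed.addAll(Files.readAllLines(stateFile, StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    @Override
    public synchronized boolean isProcessed(String objectKey) {
        return processed.contains(objectKey);
    }

    @Override
    public synchronized void markProcessed(String objectKey) {
        if (!processed.add(objectKey)) {
            return; // already recorded
        }
        try {
            Files.write(stateFile,
                    (objectKey + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class ProcessedObjectStoreDemo {
    public static void main(String[] args) {
        Path state = Paths.get(System.getProperty("java.io.tmpdir"),
                "s3-processed-" + System.nanoTime() + ".txt");
        new FileProcessedObjectStore(state).markProcessed("logs/2014-08-11.gz");
        // A second instance simulates the source coming back after a restart.
        ProcessedObjectStore restarted = new FileProcessedObjectStore(state);
        System.out.println(restarted.isProcessed("logs/2014-08-11.gz"));
    }
}
```

A ZK-backed implementation of the same interface could replace the file-backed
one if several sources had to share state; that part is not shown here.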

HTH !


On Mon, Aug 11, 2014 at 11:39 AM, Jonathan Natkins <natty@streamsets.com>
wrote:

> Given that FLUME-1491 hasn't been committed yet, and may still be a ways
> away, does it seem reasonable to punt on having multiple sources working
> off of a single bucket until ZK is integrated into Flume? The alternative
> probably requires write access to the S3 bucket to record some shared
> state, and would likely have to get rewritten once ZK integration happens
> anyway.
>
>
> On Tue, Aug 5, 2014 at 10:07 PM, Paweł Róg <prog88@gmail.com> wrote:
>
>> Hi,
>>
>> I think that it is not possible to simply use SpoolDirectorySource. Maybe
>> it will be possible to use some elements of SpoolDirectory, but without
>> touching its code I think SpoolDirectory is not a good base. At the very
>> beginning SpoolDirectorySource does this:
>>
>> File directory = new File(spoolDirectory);
>>
>> ReliableSpoolingFileEventReader also instantiates the File class.
>> There is also a question: how does ReliableSpoolingFileEventReader store
>> information about files that have already been processed in non-deleting
>> mode? What happens after a Flume restart?
>>
>> I agree with Jonathan that the S3 source should be able to store the last
>> processed file, e.g. in ZooKeeper.
>> Another thing, Jonathan: I think you shouldn't care about multiple buckets
>> being handled by a single S3Source. As you wrote, multiple sources are
>> the solution here. I thought it was already discussed but maybe I'm wrong.
>>
>>
>> >> 2. Is it fair to assume that we're dealing with character files,
>> >> rather than binary objects?
>>
>> In my opinion the S3 source can by default read a file as a simple text file
>> but also take as a configuration parameter the class name of an "InputStream
>> processor". This processor will be able to e.g. unzip, deserialize Avro or
>> read JSON and convert it into log events. What do you think?
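
The processor idea above could look roughly like the following. This is only
a sketch in plain Java: `InputStreamProcessor`, `LineProcessor`,
`GzipLineProcessor`, and `Processors` are hypothetical names, not Flume
classes, and event bodies are simplified to strings.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical plug-in point; this interface does not exist in Flume. */
interface InputStreamProcessor {
    /** Turns the raw bytes of one S3 object into event bodies. */
    List<String> toEvents(InputStream in);
}

/** Default behaviour: treat the object as UTF-8 text, one event per line. */
class LineProcessor implements InputStreamProcessor {
    @Override
    public List<String> toEvents(InputStream in) {
        List<String> events = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                events.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }
}

/** Decompresses gzip first, then delegates to the line processor. */
class GzipLineProcessor implements InputStreamProcessor {
    private final InputStreamProcessor delegate = new LineProcessor();

    @Override
    public List<String> toEvents(InputStream in) {
        try {
            return delegate.toEvents(new GZIPInputStream(in));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class Processors {
    /** Instantiates a processor from a class name, as a config value might supply. */
    static InputStreamProcessor load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(InputStreamProcessor.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load processor " + className, e);
        }
    }

    /** Helper for the demo only: gzip a string in memory. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}

final class ProcessorDemo {
    public static void main(String[] args) {
        InputStreamProcessor processor = Processors.load(GzipLineProcessor.class.getName());
        byte[] compressed = Processors.gzip("first line\nsecond line\n");
        System.out.println(processor.toEvents(new ByteArrayInputStream(compressed)));
    }
}
```

Avro or JSON handling would be further implementations of the same interface;
they are left out because they need libraries beyond the JDK.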
>>
>> --
>> Paweł Róg
>>
>> 2014-08-06 5:12 GMT+02:00 Viral Bajaria <viral.bajaria@gmail.com>:
>>
>>> Agree with the feedback provided by Ashish.
>>>
>>> I have started writing one which is similar to the ExecSource, but I
>>> like the idea of doing something where spooldir takes over most of the hard
>>> work of spitting out events to sinks. Let me think more on how to structure
>>> that.
>>>
>>> Quick thinking out loud, I could create a source which extends the
>>> spooldir and just spins off a thread to manage moving things from S3 to the
>>> spooldir via a temporary directory.
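
One way to sketch the "S3 to the spooldir via a temporary directory" step is
to write each download under the temporary directory first and then rename it
into the spool directory, so the spooling reader never sees a half-written
file. `SpoolStager` is a hypothetical name, the S3 download is stood in for by
any `InputStream`, and the sketch assumes both directories live on the same
filesystem (an atomic move can fail otherwise) and that file names are flat
and unique.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not part of Flume. */
final class SpoolStager {
    private final Path tempDir;
    private final Path spoolDir;

    SpoolStager(Path tempDir, Path spoolDir) {
        this.tempDir = tempDir;
        this.spoolDir = spoolDir;
    }

    /** Copies the stream into tempDir, then renames the finished file into spoolDir. */
    Path stage(String fileName, InputStream content) {
        try {
            Files.createDirectories(tempDir);
            Files.createDirectories(spoolDir);
            Path partial = tempDir.resolve(fileName);
            // The file is only ever incomplete while it sits in tempDir.
            Files.copy(content, partial, StandardCopyOption.REPLACE_EXISTING);
            // Rename in one step so the spool directory only sees complete files.
            return Files.move(partial, spoolDir.resolve(fileName),
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class SpoolStagerDemo {
    public static void main(String[] args) {
        Path base = Paths.get(System.getProperty("java.io.tmpdir"),
                "s3-spool-" + System.nanoTime());
        SpoolStager stager = new SpoolStager(base.resolve("tmp"), base.resolve("spool"));
        InputStream fakeDownload =
                new ByteArrayInputStream("log line\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(stager.stage("object-0001.log", fakeDownload));
    }
}
```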
>>>
>>> Regarding maintaining metadata, there are 2 ways:
>>> 1) DB: I currently maintain it in a database because there are a lot of
>>> other tools built around it
>>> 2) File: Just keep the info in memory and in a file to help with crash
>>> recovery and/or high memory usage.
>>>
>>> Thanks,
>>> Viral
>>>
>>>
>>>
>>>
>>> On Tue, Aug 5, 2014 at 8:04 PM, Ashish <paliwalashish@gmail.com> wrote:
>>>
>>>> Sharing some random thoughts
>>>>
>>>> 1. Download the file using S3 SDK and let the SpoolDirectory
>>>> implementation take care of rest. Like a Decorator in front of
>>>> SpoolDirectory
>>>>
>>>> 2. Use S3 SDK to create InputStream of S3 objects directly in code and
>>>> create events out of it.
>>>>
>>>> It would be great to reuse an existing implementation which is based on
>>>> InputStream and feed it the S3 object input stream; the concern of metadata
>>>> storage still remains. Most often S3 objects are stored in compressed form,
>>>> so this source would need to take care of compression (gz/avro/others).
>>>>
>>>> Best is to start with something that works and then start adding more
>>>> features to it.
>>>>
>>>>
>>>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <natty@streamsets.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I started trying to write some code on this, and realized there are a
>>>>> number of issues that need to be discussed in order to really design this
>>>>> feature effectively. The requirements that have been discussed thus far are:
>>>>>
>>>>> 1. Fetching data from S3 periodically
>>>>> 2. Fetching data from multiple S3 buckets -- This may be something
>>>>> that should be punted on until later. For a first implementation, this
>>>>> could be solved just by having multiple sources, each with a single S3
>>>>> bucket
>>>>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>>>>> clarify what you mean by this?*
>>>>> 4. Dynamically reconfigure the source -- This is blocked by
>>>>> FLUME-1491, so I think this is out-of-scope for discussions at the moment
>>>>>
>>>>> Some questions I want to try to answer:
>>>>>
>>>>> 1. How do we identify and track objects that need to be processed
>>>>> versus objects that have been processed already?
>>>>> 1a. What about if we want to have multiple sources working against the
>>>>> same bucket to speed processing?
>>>>> 2. Is it fair to assume that we're dealing with character files,
>>>>> rather than binary objects?
>>>>>
>>>>> For the first question, if we ignore the multiple source extension
>>>>> of the question, I think the simplest answer is to do something on the
>>>>> local filesystem, like have a tracking directory that contains a list of
>>>>> to-be-processed objects and a list of already-processed objects. However,
>>>>> if the source goes down, what should the restart semantics be? It seems
>>>>> that the ideal situation is to store this state in a system like ZooKeeper,
>>>>> which would ensure that a number of sources could operate off of the same
>>>>> bucket, but this probably requires FLUME-1491 first.
>>>>>
>>>>> For the second question, my feeling was just that we should work with
>>>>> similar assumptions to how the SpoolingDirectorySource works, where each
>>>>> line is a separate event. Does that seem reasonable?
>>>>>
>>>>> Thanks,
>>>>> Natty
>>>>>
>>>>>
>>>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <prog88@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> Thanks for the explanation, Jonathan. I think I will also start working on
>>>>>> it. When you have any patch (even a draft) I'd be glad if you can attach it
>>>>>> in JIRA. I'll do the same.
>>>>>> What do you think?
>>>>>>
>>>>>> --
>>>>>> Paweł Róg
>>>>>>
>>>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedharan@cloudera.com>:
>>>>>>
>>>>>>> +1 on an S3 Source. I would gladly review.
>>>>>>>
>>>>>>> Jonathan Natkins wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hey Pawel,
>>>>>>>
>>>>>>> My intention is to start working on it, but I don't know exactly how
>>>>>>> long it will take, and I'm not a committer, so time estimates would
>>>>>>> have to be taken with a grain of salt regardless. If this is something
>>>>>>> that you need urgently, it may not be ideal to wait for me to start
>>>>>>> building something for yourself.
>>>>>>>
>>>>>>> That said, as mentioned in the other thread, dynamic configuration can
>>>>>>> be done by refreshing the configuration files across the set of Flume
>>>>>>> agents. It's certainly not as great as having a single place to change
>>>>>>> it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Natty
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com> wrote:
>>>>>>>
>>>>>>>     Hi,
>>>>>>>     Jonathan, how should we interpret your last e-mail? You opened a
>>>>>>>     JIRA issue and want to start implementing this? Do you have any
>>>>>>>     estimate of how long it will take?
>>>>>>>
>>>>>>>     I think the biggest challenge here is to have dynamic
>>>>>>>     configuration of Flume. It doesn't seem to be part of FLUME-2437
>>>>>>>     issue. Am I right?
>>>>>>>
>>>>>>>     > Would you need to be able to pull files from multiple S3
>>>>>>>     > directories with the same source?
>>>>>>>
>>>>>>>     I think we don't need to track multiple S3 buckets with a single
>>>>>>>     source. I just imagine an approach where each S3 source can be
>>>>>>>     added or deleted on demand and attached to any Channel. I'm only
>>>>>>>     worried about this dynamic configuration. I'll open a new thread
>>>>>>>     about this. It seems we have two totally separate things:
>>>>>>>     * build S3 source
>>>>>>>     * make flume configurable dynamically
>>>>>>>
>>>>>>>     --
>>>>>>>     Paweł
>>>>>>>
>>>>>>>
>>>>>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic
>>>>>>>     <otis.gospodnetic@gmail.com>:
>>>>>>>
>>>>>>>
>>>>>>>         Hi,
>>>>>>>
>>>>>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>>>>>         <natty@streamsets.com> wrote:
>>>>>>>
>>>>>>>             Hey all,
>>>>>>>
>>>>>>>             I created a JIRA for this:
>>>>>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>>>>>
>>>>>>>
>>>>>>>         Thanks!  Should Fix Version be set to the next Flume release
>>>>>>>         version?
>>>>>>>
>>>>>>>             I thought I'd start working on one myself, which can
>>>>>>>             hopefully be contributed back. I'm curious: do you have
>>>>>>>             particular requirements? Based on the emails in this
>>>>>>>             thread, it sounds like the original goal was to have
>>>>>>>             something that's like a SpoolDirectorySource that just
>>>>>>>             picks up new files from S3. Is that accurate?
>>>>>>>
>>>>>>>
>>>>>>>         Yes, I think so.  We need to be able to:
>>>>>>>         * fetch data (logs for pulling them in Logsene
>>>>>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>>>>>         every 1 min, every 5 min, etc.)
>>>>>>>         * fetch data from multiple S3 buckets
>>>>>>>         * associate an S3 bucket with a user/token/key
>>>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>>>         stored on disk) add new S3 buckets from which data should be
>>>>>>>         fetched
>>>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>>>         stored on disk) stop fetching data from some S3 buckets
>>>>>>>
>>>>>>>             Would you need to be able to pull files from multiple S3
>>>>>>>             directories with the same source?
>>>>>>>
>>>>>>>
>>>>>>>         I think the above addresses this question.
>>>>>>>
>>>>>>>             Thanks,
>>>>>>>             Natty
>>>>>>>
>>>>>>>
>>>>>>>         Thanks!
>>>>>>>
>>>>>>>         Otis
>>>>>>>         --
>>>>>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>>>>             <otis.gospodnetic@gmail.com> wrote:
>>>>>>>
>>>>>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>>>>>
>>>>>>>                 But being able to dynamically add/remove S3 buckets
>>>>>>>                 from which to pull data seems important.
>>>>>>>
>>>>>>>                 Any suggestions for how to approach that?
>>>>>>>
>>>>>>>                 Otis
>>>>>>>                 --
>>>>>>>                 Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>
>>>>>>>
>>>>>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>>>>                 <hshreedharan@cloudera.com> wrote:
>>>>>>>
>>>>>>>                     Please go ahead and file a jira. If you are
>>>>>>>                     willing to submit a patch, you can post it on the
>>>>>>>                     jira.
>>>>>>>
>>>>>>>                     Viral Bajaria wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                     I have a similar use case that cropped up
>>>>>>>                     yesterday. I saw the archive and found that there
>>>>>>>                     was a recommendation to build it as Sharninder
>>>>>>>                     suggested.
>>>>>>>
>>>>>>>                     For now, I went down the route of writing a
>>>>>>>                     python script which downloads from S3 and puts the
>>>>>>>                     files in a directory which is configured to be
>>>>>>>                     picked up via a spooldir.
>>>>>>>
>>>>>>>                     I would prefer to get a direct S3 source, and
>>>>>>>                     maybe we could collaborate on it and open-source
>>>>>>>                     it. Let me know if you prefer that and we can work
>>>>>>>                     directly on it by creating a JIRA.
>>>>>>>
>>>>>>>                     Thanks,
>>>>>>>                     Viral
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
>>>>>>>                     <hshreedharan@cloudera.com> wrote:
>>>>>>>
>>>>>>>                         In both cases, Sharninder is right :)
>>>>>>>
>>>>>>>                         Sharninder wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                         As far as I know, there is no (open source)
>>>>>>>                         implementation of an S3 source, so yes, you'll
>>>>>>>                         have to implement your own. You'll have to
>>>>>>>                         implement a Pollable source and the dev
>>>>>>>                         documentation has an outline that you can use.
>>>>>>>                         You can also look at the existing ExecSource
>>>>>>>                         and work your way up.
>>>>>>>
>>>>>>>                         As far as I know, there is no way to configure
>>>>>>>                         flume without using the configuration file.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>>>>>                         <prog88@gmail.com> wrote:
>>>>>>>
>>>>>>>                             Hi,
>>>>>>>                             I'm wondering if Flume is able to read
>>>>>>>                             directly from S3.
>>>>>>>
>>>>>>>                             I'll describe my case. I have log files
>>>>>>>                             stored in AWS S3. I have to periodically
>>>>>>>                             fetch new S3 objects and read log lines
>>>>>>>                             from them. Then the log lines (events) are
>>>>>>>                             processed in Flume's standard way (as with
>>>>>>>                             other sources).
>>>>>>>
>>>>>>>                             *1) Is there any way to fetch S3 objects,
>>>>>>>                             or do I have to write my own Source?*
>>>>>>>
>>>>>>>
>>>>>>>                             There is also a second case. I want the
>>>>>>>                             Flume configuration to be dynamic. Flume
>>>>>>>                             sources can change over time. A new AWS
>>>>>>>                             key and S3 bucket can be added or deleted.
>>>>>>>
>>>>>>>                             *2) Is there any other way to configure
>>>>>>>                             Flume than by a static configuration file?*
>>>>>>>
>>>>>>>                             --
>>>>>>>                             Paweł Róg
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> thanks
>>>> ashish
>>>>
>>>> Blog: http://www.ashishpaliwal.com/blog
>>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>>>
>>>
>>>
>>
>


-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
