flume-user mailing list archives

From Jonathan Natkins <na...@streamsets.com>
Subject Re: AWS S3 flume source
Date Mon, 11 Aug 2014 07:23:07 GMT
Yeah, I realize that. The reason I think it should be somewhat dependent
upon FLUME-1491 is that ZooKeeper seems to me to be a pretty heavy-weight
requirement just to use a particular source. FLUME-1491 would make Flume
generally dependent upon ZooKeeper, which is a good transition point to
start using ZK for other state that would be necessary for Flume
components. Would you agree?


On Sun, Aug 10, 2014 at 11:35 PM, Ashish <paliwalashish@gmail.com> wrote:

> Seems like a bit of confusion here. FLUME-1491 only deals with the
> configuration part, nothing else. Even if it gets integrated, you would
> still need to write/expose an API to store metadata info in ZK (FLUME-1491
> doesn't bring that in).
>
> HTH !
>
>
> On Mon, Aug 11, 2014 at 11:39 AM, Jonathan Natkins <natty@streamsets.com>
> wrote:
>
>> Given that FLUME-1491 hasn't been committed yet, and may still be a ways
>> away, does it seem reasonable to punt on having multiple sources working
>> off of a single bucket until ZK is integrated into Flume? The alternative
>> probably requires write access to the S3 bucket to record some shared
>> state, and would likely have to get rewritten once ZK integration happens
>> anyway.
>>
>>
>> On Tue, Aug 5, 2014 at 10:07 PM, Paweł Róg <prog88@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I think that it is not possible to simply use SpoolDirectorySource.
>>> Maybe it will be possible to reuse some elements of SpoolDirectory, but
>>> without touching its code I think SpoolDirectory is not a good base. At
>>> the very beginning, SpoolDirectorySource does this:
>>>
>>> File directory = new File(spoolDirectory);
>>>
>>> ReliableSpoolingFileEventReader also instantiates the File class.
>>> There is also a question: how does ReliableSpoolingFileEventReader store
>>> information about files that have already been processed in non-deleting
>>> mode? What happens after a Flume restart?
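>>>
>>> For comparison, a minimal sketch of what listing candidate objects could
>>> look like against S3 instead of the local filesystem (assuming the AWS
>>> SDK for Java; bucketName, accessKey and secretKey are placeholders):
>>>
>>> import com.amazonaws.auth.BasicAWSCredentials;
>>> import com.amazonaws.services.s3.AmazonS3Client;
>>> import com.amazonaws.services.s3.model.ObjectListing;
>>> import com.amazonaws.services.s3.model.S3ObjectSummary;
>>>
>>> AmazonS3Client s3 =
>>>     new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
>>> ObjectListing listing = s3.listObjects(bucketName);
>>> for (S3ObjectSummary summary : listing.getObjectSummaries()) {
>>>   String key = summary.getKey();  // each key is a candidate "file"
>>> }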
>>>
>>> I agree with Jonathan that the S3 source should be able to store the
>>> last processed file, e.g. in ZooKeeper.
>>> Another thing, Jonathan: I think you shouldn't worry about multiple
>>> buckets being handled by a single S3Source. As you wrote, multiple
>>> sources are the solution here. I thought it was already discussed, but
>>> maybe I'm wrong.
>>>
>>>
>>> >> 2. Is it fair to assume that we're dealing with character files,
>>> >> rather than binary objects?
>>>
>>> In my opinion, the S3 source can by default read a file as simple text,
>>> but also take a configuration parameter with the class name of an
>>> "InputStream processor". This processor would be able to e.g. unzip,
>>> deserialize Avro, or read JSON and convert it into log events. What do
>>> you think?
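>>>
>>> Something like this hypothetical interface, loaded by class name from
>>> the source configuration (the name and signature are just illustrative):
>>>
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>> import java.util.List;
>>> import org.apache.flume.Event;
>>>
>>> public interface InputStreamProcessor {
>>>   // turn a raw S3 object stream into Flume events
>>>   List<Event> process(InputStream in) throws IOException;
>>> }
>>>
>>> A plain-text implementation would split on newlines; gzip or Avro
>>> implementations would wrap the stream before decoding.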
>>>
>>> --
>>> Paweł Róg
>>>
>>> 2014-08-06 5:12 GMT+02:00 Viral Bajaria <viral.bajaria@gmail.com>:
>>>
>>>> Agree with the feedback provided by Ashish.
>>>>
>>>> I have started writing one which is similar to the ExecSource, but I
>>>> like the idea of doing something where spooldir takes over most of the hard
>>>> work of spitting out events to sinks. Let me think more on how to structure
>>>> that.
>>>>
>>>> Quick thinking out loud, I could create a source which extends the
>>>> spooldir and just spins off a thread to manage moving things from S3 to the
>>>> spooldir via a temporary directory.
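>>>>
>>>> Roughly this shape, maybe (completely untested, and the S3 fetch logic
>>>> is elided):
>>>>
>>>> import org.apache.flume.source.SpoolDirectorySource;
>>>>
>>>> public class S3SpoolDirectorySource extends SpoolDirectorySource {
>>>>   private Thread fetcher;
>>>>
>>>>   @Override
>>>>   public synchronized void start() {
>>>>     fetcher = new Thread(new Runnable() {
>>>>       public void run() {
>>>>         // periodically download new S3 objects into a temp dir, then
>>>>         // atomically move them into the spool directory
>>>>       }
>>>>     });
>>>>     fetcher.start();
>>>>     super.start();
>>>>   }
>>>> }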
>>>>
>>>> Regarding maintaining metadata, there are 2 ways:
>>>> 1) DB: I currently maintain it in a database because there are a lot of
>>>> other tools built around it
>>>> 2) File: Just keep the info in memory and in a file, to recover from
>>>> crashes and/or limit memory usage.
>>>>
>>>> Thanks,
>>>> Viral
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Aug 5, 2014 at 8:04 PM, Ashish <paliwalashish@gmail.com> wrote:
>>>>
>>>>> Sharing some random thoughts:
>>>>>
>>>>> 1. Download the file using the S3 SDK and let the SpoolDirectory
>>>>> implementation take care of the rest. Like a Decorator in front of
>>>>> SpoolDirectory.
>>>>>
>>>>> 2. Use the S3 SDK to create an InputStream of S3 objects directly in
>>>>> code and create events out of it.
>>>>>
>>>>> It would be great to reuse an existing implementation which is based
>>>>> on InputStream and feed it with the S3 object input stream; the concern
>>>>> of metadata storage still remains. Most often S3 objects are stored in
>>>>> compressed form, so this source would need to take care of compression
>>>>> (gz/avro/others), e.g. as sketched below.
>>>>>
>>>>> Best is to start with something that works and then start adding more
>>>>> features to it.
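>>>>>
>>>>> For the compression part, a minimal sketch (assuming the AWS SDK for
>>>>> Java; bucket and key are placeholders):
>>>>>
>>>>> import java.io.InputStream;
>>>>> import java.util.zip.GZIPInputStream;
>>>>> import com.amazonaws.services.s3.AmazonS3Client;
>>>>>
>>>>> AmazonS3Client s3 = new AmazonS3Client();
>>>>> InputStream in = s3.getObject(bucket, key).getObjectContent();
>>>>> if (key.endsWith(".gz")) {
>>>>>   // transparently decompress gzipped objects; Avro and others would
>>>>>   // need their own readers
>>>>>   in = new GZIPInputStream(in);
>>>>> }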
>>>>>
>>>>>
>>>>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <natty@streamsets.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I started trying to write some code on this, and realized there are a
>>>>>> number of issues that need to be discussed in order to really design
>>>>>> this feature effectively. The requirements that have been discussed
>>>>>> thus far are:
>>>>>>
>>>>>> 1. Fetching data from S3 periodically
>>>>>> 2. Fetching data from multiple S3 buckets -- This may be something
>>>>>> that should be punted on until later. For a first implementation, this
>>>>>> could be solved just by having multiple sources, each with a single S3
>>>>>> bucket
>>>>>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you
>>>>>> clarify what you mean by this?*
>>>>>> 4. Dynamically reconfiguring the source -- This is blocked by
>>>>>> FLUME-1491, so I think this is out-of-scope for discussions at the
>>>>>> moment
>>>>>>
>>>>>> Some questions I want to try to answer:
>>>>>>
>>>>>> 1. How do we identify and track objects that need to be processed
>>>>>> versus objects that have been processed already?
>>>>>> 1a. What about if we want to have multiple sources working against
>>>>>> the same bucket to speed processing?
>>>>>> 2. Is it fair to assume that we're dealing with character files,
>>>>>> rather than binary objects?
>>>>>>
>>>>>> For the first question, if we ignore the multiple source extension
>>>>>> of the question, I think the simplest answer is to do something on the
>>>>>> local filesystem, like have a tracking directory that contains a list
>>>>>> of to-be-processed objects and a list of already-processed objects.
>>>>>> However, if the source goes down, what should the restart semantics
>>>>>> be? It seems that the ideal situation is to store this state in a
>>>>>> system like ZooKeeper, which would ensure that a number of sources
>>>>>> could operate off of the same bucket, but this probably requires
>>>>>> FLUME-1491 first.
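>>>>>>
>>>>>> For the local-filesystem variant, I'm imagining something as simple
>>>>>> as this (a sketch only; trackerDir is a hypothetical config property):
>>>>>>
>>>>>> import java.io.File;
>>>>>> import java.nio.charset.StandardCharsets;
>>>>>> import java.nio.file.Files;
>>>>>> import java.nio.file.StandardOpenOption;
>>>>>> import java.util.HashSet;
>>>>>> import java.util.Set;
>>>>>>
>>>>>> File tracker = new File(trackerDir, "processed.list");
>>>>>> // on (re)start: reload the set of keys that were already handled
>>>>>> Set<String> processed = new HashSet<String>(
>>>>>>     Files.readAllLines(tracker.toPath(), StandardCharsets.UTF_8));
>>>>>> // after an object is fully delivered to the channel, record it
>>>>>> Files.write(tracker.toPath(),
>>>>>>     (key + "\n").getBytes(StandardCharsets.UTF_8),
>>>>>>     StandardOpenOption.CREATE, StandardOpenOption.APPEND);
>>>>>>
>>>>>> The restart semantics then reduce to "re-list the bucket and skip
>>>>>> anything in the set", though an object that was mid-delivery at crash
>>>>>> time would be re-read (at-least-once).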
>>>>>>
>>>>>> For the second question, my feeling was just that we should work with
>>>>>> similar assumptions to how the SpoolingDirectorySource works, where
>>>>>> each line is a separate event. Does that seem reasonable?
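>>>>>>
>>>>>> In other words, something like this inside the source's read loop
>>>>>> (sketch only; s3Stream stands in for the object's InputStream):
>>>>>>
>>>>>> import java.io.BufferedReader;
>>>>>> import java.io.InputStreamReader;
>>>>>> import java.nio.charset.StandardCharsets;
>>>>>> import org.apache.flume.event.EventBuilder;
>>>>>>
>>>>>> BufferedReader reader = new BufferedReader(
>>>>>>     new InputStreamReader(s3Stream, StandardCharsets.UTF_8));
>>>>>> String line;
>>>>>> while ((line = reader.readLine()) != null) {
>>>>>>   // one line of the S3 object == one Flume event
>>>>>>   getChannelProcessor().processEvent(
>>>>>>       EventBuilder.withBody(line, StandardCharsets.UTF_8));
>>>>>> }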
>>>>>>
>>>>>> Thanks,
>>>>>> Natty
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <prog88@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Thanks for the explanation, Jonathan. I think I will also start
>>>>>>> working on it. When you have any patch (even a draft) I'd be glad if
>>>>>>> you could attach it to the JIRA. I'll do the same.
>>>>>>> What do you think?
>>>>>>>
>>>>>>> --
>>>>>>> Paweł Róg
>>>>>>>
>>>>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedharan@cloudera.com>:
>>>>>>>
>>>>>>>> +1 on an S3 Source. I would gladly review.
>>>>>>>>
>>>>>>>> Jonathan Natkins wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hey Pawel,
>>>>>>>>
>>>>>>>> My intention is to start working on it, but I don't know exactly
>>>>>>>> how long it will take, and I'm not a committer, so time estimates
>>>>>>>> would have to be taken with a grain of salt regardless. If this is
>>>>>>>> something that you need urgently, it may not be ideal to wait for
>>>>>>>> me, and you may want to start building something for yourself.
>>>>>>>>
>>>>>>>> That said, as mentioned in the other thread, dynamic configuration
>>>>>>>> can be done by refreshing the configuration files across the set of
>>>>>>>> Flume agents. It's certainly not as great as having a single place
>>>>>>>> to change it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Natty
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com> wrote:
>>>>>>>>
>>>>>>>>     Hi,
>>>>>>>>     Jonathan, how should we interpret your last e-mail? You opened
>>>>>>>>     a JIRA issue and want to start implementing this; do you have
>>>>>>>>     any estimate how long it will take?
>>>>>>>>
>>>>>>>>     I think the biggest challenge here is to have dynamic
>>>>>>>>     configuration of Flume. It doesn't seem to be part of the
>>>>>>>>     FLUME-2437 issue. Am I right?
>>>>>>>>
>>>>>>>>     > Would you need to be able to pull files from multiple S3
>>>>>>>>     > directories with the same source?
>>>>>>>>
>>>>>>>>     I think we don't need to track multiple S3 buckets with a
>>>>>>>>     single source. I just imagine an approach where each S3 source
>>>>>>>>     can be added or deleted on demand and attached to any Channel.
>>>>>>>>     I'm only worried about this dynamic configuration. I'll open a
>>>>>>>>     new thread about this. It seems we have two totally separate
>>>>>>>>     things:
>>>>>>>>     * build an S3 source
>>>>>>>>     * make Flume configurable dynamically
>>>>>>>>
>>>>>>>>     --
>>>>>>>>     Paweł
>>>>>>>>
>>>>>>>>
>>>>>>>>     2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodnetic@gmail.com>:
>>>>>>>>
>>>>>>>>
>>>>>>>>         Hi,
>>>>>>>>
>>>>>>>>         On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins
>>>>>>>>         <natty@streamsets.com> wrote:
>>>>>>>>
>>>>>>>>             Hey all,
>>>>>>>>
>>>>>>>>             I created a JIRA for this:
>>>>>>>>             https://issues.apache.org/jira/browse/FLUME-2437
>>>>>>>>
>>>>>>>>
>>>>>>>>         Thanks! Should Fix Version be set to the next Flume
>>>>>>>>         release version?
>>>>>>>>
>>>>>>>>             I thought I'd start working on one myself, which can
>>>>>>>>             hopefully be contributed back. I'm curious: do you have
>>>>>>>>             particular requirements? Based on the emails in this
>>>>>>>>             thread, it sounds like the original goal was to have
>>>>>>>>             something that's like a SpoolDirectorySource that just
>>>>>>>>             picks up new files from S3. Is that accurate?
>>>>>>>>
>>>>>>>>
>>>>>>>>         Yes, I think so. We need to be able to:
>>>>>>>>         * fetch data (logs for pulling them into Logsene
>>>>>>>>         <http://sematext.com/logsene/>) from S3 periodically (e.g.
>>>>>>>>         every 1 min, every 5 min, etc.)
>>>>>>>>         * fetch data from multiple S3 buckets
>>>>>>>>         * associate an S3 bucket with a user/token/key
>>>>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>>>>         stored on disk) add new S3 buckets from which data should
>>>>>>>>         be fetched
>>>>>>>>         * dynamically (i.e. without editing/writing config files
>>>>>>>>         stored on disk) stop fetching data from some S3 buckets
>>>>>>>>
>>>>>>>>
>>>>>>>>             Would you need to be able to pull files from multiple
>>>>>>>>             S3 directories with the same source?
>>>>>>>>
>>>>>>>>
>>>>>>>>         I think the above addresses this question.
>>>>>>>>
>>>>>>>>             Thanks,
>>>>>>>>             Natty
>>>>>>>>
>>>>>>>>
>>>>>>>>         Thanks!
>>>>>>>>
>>>>>>>>         Otis
>>>>>>>>         --
>>>>>>>>         Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>>         Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>             On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
>>>>>>>>             <otis.gospodnetic@gmail.com> wrote:
>>>>>>>>
>>>>>>>>                 +1 for seeing S3Source, starting with a JIRA issue.
>>>>>>>>
>>>>>>>>                 But being able to dynamically add/remove S3 buckets
>>>>>>>>                 from which to pull data seems important.
>>>>>>>>
>>>>>>>>                 Any suggestions for how to approach that?
>>>>>>>>
>>>>>>>>                 Otis
>>>>>>>>                 --
>>>>>>>>                 Performance Monitoring * Log Analytics * Search
>>>>>>>>                 Analytics
>>>>>>>>                 Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>
>>>>>>>>
>>>>>>>>                 On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
>>>>>>>>                 <hshreedharan@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>                     Please go ahead and file a jira. If you are
>>>>>>>>                     willing to submit a patch, you can post it on
>>>>>>>>                     the jira.
>>>>>>>>
>>>>>>>>                     Viral Bajaria wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                     I have a similar use case that cropped up
>>>>>>>>                     yesterday. I saw the archive and found that
>>>>>>>>                     there was a recommendation to build it as
>>>>>>>>                     Sharninder suggested.
>>>>>>>>
>>>>>>>>                     For now, I went down the route of writing a
>>>>>>>>                     Python script which downloads from S3 and puts
>>>>>>>>                     the files in a directory which is configured to
>>>>>>>>                     be picked up via a spooldir.
>>>>>>>>
>>>>>>>>                     I would prefer to get a direct S3 source, and
>>>>>>>>                     maybe we could collaborate on it and
>>>>>>>>                     open-source it. Let me know if you prefer that
>>>>>>>>                     and we can work directly on it by creating a
>>>>>>>>                     JIRA.
>>>>>>>>
>>>>>>>>                     Thanks,
>>>>>>>>                     Viral
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                     On Thu, Jul 31, 2014 at 10:26 AM, Hari
>>>>>>>>                     Shreedharan <hshreedharan@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>                         In both cases, Sharninder is right :)
>>>>>>>>
>>>>>>>>                         Sharninder wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                         As far as I know, there is no (open source)
>>>>>>>>                         implementation of an S3 source, so yes,
>>>>>>>>                         you'll have to implement your own. You'll
>>>>>>>>                         have to implement a PollableSource, and the
>>>>>>>>                         dev documentation has an outline that you
>>>>>>>>                         can use. You can also look at the existing
>>>>>>>>                         ExecSource and work your way up.
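>>>>>>>>
>>>>>>>>                         The outline in the dev guide boils down to
>>>>>>>>                         roughly this skeleton (a sketch only; the
>>>>>>>>                         S3 polling details are left out):
>>>>>>>>
>>>>>>>>                         import org.apache.flume.Context;
>>>>>>>>                         import org.apache.flume.EventDeliveryException;
>>>>>>>>                         import org.apache.flume.PollableSource;
>>>>>>>>                         import org.apache.flume.conf.Configurable;
>>>>>>>>                         import org.apache.flume.source.AbstractSource;
>>>>>>>>
>>>>>>>>                         public class S3Source extends AbstractSource
>>>>>>>>                             implements Configurable, PollableSource {
>>>>>>>>
>>>>>>>>                           public void configure(Context context) {
>>>>>>>>                             // read bucket, credentials, poll interval
>>>>>>>>                           }
>>>>>>>>
>>>>>>>>                           public Status process()
>>>>>>>>                               throws EventDeliveryException {
>>>>>>>>                             try {
>>>>>>>>                               // list new S3 objects, build events,
>>>>>>>>                               // hand them to getChannelProcessor()
>>>>>>>>                               return Status.READY;
>>>>>>>>                             } catch (Throwable t) {
>>>>>>>>                               return Status.BACKOFF;
>>>>>>>>                             }
>>>>>>>>                           }
>>>>>>>>                         }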
>>>>>>>>
>>>>>>>>                         As far as I know, there is no way to
>>>>>>>>                         configure Flume without using the
>>>>>>>>                         configuration file.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                         On Thu, Jul 31, 2014 at 7:57 PM, Paweł
>>>>>>>>                         <prog88@gmail.com> wrote:
>>>>>>>>
>>>>>>>>                             Hi,
>>>>>>>>                             I'm wondering if Flume is able to read
>>>>>>>>                             directly from S3.
>>>>>>>>
>>>>>>>>                             I'll describe my case. I have log files
>>>>>>>>                             stored in AWS S3. I have to fetch new
>>>>>>>>                             S3 objects periodically and read log
>>>>>>>>                             lines from them. Then the log lines
>>>>>>>>                             (events) are processed in the standard
>>>>>>>>                             Flume way (as with other sources).
>>>>>>>>
>>>>>>>>                             *1) Is there any way to fetch S3
>>>>>>>>                             objects, or do I have to write my own
>>>>>>>>                             Source?*
>>>>>>>>
>>>>>>>>
>>>>>>>>                             There is also a second case. I want
>>>>>>>>                             the Flume configuration to be dynamic.
>>>>>>>>                             Flume sources can change over time. A
>>>>>>>>                             new AWS key and S3 bucket can be added
>>>>>>>>                             or deleted.
>>>>>>>>
>>>>>>>>                             *2) Is there any other way to configure
>>>>>>>>                             Flume than by a static configuration
>>>>>>>>                             file?*
>>>>>>>>
>>>>>>>>                             --
>>>>>>>>                             Paweł Róg
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> thanks
>>>>> ashish
>>>>>
>>>>> Blog: http://www.ashishpaliwal.com/blog
>>>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
