Seems like a bit of confusion here. FLUME-1491 only deals with the configuration part, nothing else. Even if it gets integrated, you would still need to write/expose an API to store metadata in ZK (FLUME-1491 doesn't bring that in).

HTH !


On Mon, Aug 11, 2014 at 11:39 AM, Jonathan Natkins <natty@streamsets.com> wrote:
Given that FLUME-1491 hasn't been committed yet, and may still be a ways away, does it seem reasonable to punt on having multiple sources working off of a single bucket until ZK is integrated into Flume? The alternative probably requires write access to the S3 bucket to record some shared state, and would likely have to get rewritten once ZK integration happens anyway.


On Tue, Aug 5, 2014 at 10:07 PM, Paweł Róg <prog88@gmail.com> wrote:
Hi,

I think it is not possible to simply use SpoolDirectorySource. Maybe it will be possible to reuse some elements of it, but without touching its code I don't think SpoolDirectorySource is a good base. At the very beginning, SpoolDirectorySource does this:

File directory = new File(spoolDirectory);

ReliableSpoolingFileEventReader also instantiates the File class.
There is also a question: how does ReliableSpoolingFileEventReader store information about files that have already been processed in non-deleting mode? What happens after a Flume restart?

I agree with Jonathan that the S3 source should be able to store the last processed file, e.g. in ZooKeeper.
Another thing, Jonathan: I think you shouldn't worry about multiple buckets being handled by a single S3Source. As you wrote, multiple sources are the solution here. I thought this was already discussed, but maybe I'm wrong.
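
To make the ZooKeeper idea concrete, here is a minimal sketch of storing the last processed key using Apache Curator. The class name ZkS3Marker and the znode path are made up for illustration, not an existing API:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hypothetical helper: persists the last processed S3 key under a znode,
// so a restarted (or additional) source knows where to resume.
public class ZkS3Marker {
    private final CuratorFramework client;
    private final String path; // e.g. /flume/s3source/<bucket>/lastKey

    public ZkS3Marker(String zkConnect, String path) {
        this.client = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        this.client.start();
        this.path = path;
    }

    public void saveLastKey(String key) throws Exception {
        byte[] data = key.getBytes("UTF-8");
        if (client.checkExists().forPath(path) == null) {
            client.create().creatingParentsIfNeeded().forPath(path, data);
        } else {
            client.setData().forPath(path, data);
        }
    }

    public String readLastKey() throws Exception {
        if (client.checkExists().forPath(path) == null) {
            return null; // nothing processed yet
        }
        return new String(client.getData().forPath(path), "UTF-8");
    }
}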


>> 2. Is it fair to assume that we're dealing with character files, rather than binary objects?

In my opinion the S3 source can by default read each file as a simple text file, but also take a configuration parameter with the class name of an "InputStream processor". This processor would be able to e.g. unzip, deserialize Avro, or read JSON and convert it into log events. What do you think?
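
To illustrate, the plug-in point could be as small as the following hypothetical interface (the name S3ObjectProcessor is only a suggestion; Flume's existing EventDeserializer interface is similar in spirit):

import java.io.InputStream;
import java.util.List;
import org.apache.flume.Event;

// Hypothetical plug-in point: the source hands over the raw S3 object
// stream, and the implementation turns it into Flume events (plain
// text, gzip, Avro, JSON, ...). The implementation class name would
// come from the source's configuration.
public interface S3ObjectProcessor {
    List<Event> process(InputStream objectStream) throws Exception;
}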

--
Paweł Róg

2014-08-06 5:12 GMT+02:00 Viral Bajaria <viral.bajaria@gmail.com>:

I agree with the feedback provided by Ashish.

I have started writing one similar to ExecSource, but I like the idea of letting spooldir take over most of the hard work of spitting out events to sinks. Let me think more about how to structure that.

Quick thinking out loud: I could create a source which extends the spooldir source and just spins off a thread that moves things from S3 into the spool directory via a temporary directory.
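
Something like this rough sketch, assuming the AWS SDK for Java. The class name and the flattened naming of downloaded files are placeholders, and pagination plus already-downloaded tracking are omitted:

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// Hypothetical background task: downloads S3 objects into a temp
// directory, then moves them into the spool directory so the normal
// SpoolDirectorySource machinery picks them up.
public class S3ToSpoolDirTask implements Runnable {
    private final AmazonS3 s3 = new AmazonS3Client(); // default credential chain
    private final String bucket;
    private final File tempDir;
    private final File spoolDir;

    public S3ToSpoolDirTask(String bucket, File tempDir, File spoolDir) {
        this.bucket = bucket;
        this.tempDir = tempDir;
        this.spoolDir = spoolDir;
    }

    @Override
    public void run() {
        // First page of listings only; a real version would paginate
        // and skip keys it has already fetched.
        for (S3ObjectSummary summary : s3.listObjects(bucket).getObjectSummaries()) {
            String key = summary.getKey();
            File tmp = new File(tempDir, key.replace('/', '_'));
            // Download fully into the temp dir first...
            s3.getObject(new GetObjectRequest(bucket, key), tmp);
            // ...then rename into the spool dir, so the spooling source
            // never sees a partially written file.
            tmp.renameTo(new File(spoolDir, tmp.getName()));
        }
    }
}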

Regarding maintaining metadata, there are two ways:
1) DB: I currently maintain it in a database because there are a lot of other tools built around it.
2) File: Just keep the info in memory and in a file, to help with crash recovery and/or keep memory usage down.

Thanks,
Viral

On Tue, Aug 5, 2014 at 8:04 PM, Ashish <paliwalashish@gmail.com> wrote:
Sharing some random thoughts

1. Download the file using the S3 SDK and let the SpoolDirectory implementation take care of the rest. Like a Decorator in front of SpoolDirectory.

2. Use the S3 SDK to create InputStreams of S3 objects directly in code and create events out of them.

It would be great to reuse an existing implementation which is based on InputStream and feed it with the S3 object input stream; the concern of metadata storage still remains. Most often S3 objects are stored in compressed form, so this source would need to take care of compression (gz/Avro/others).
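
A rough sketch of option 2 with the AWS SDK for Java, just to show the shape of it (error handling and metadata tracking omitted; gzip is detected naively by file extension):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

// Sketch: open an S3 object as a stream, unwrapping gzip when needed,
// and read it line by line (each line would become one Flume event).
public class S3StreamReader {
    private final AmazonS3 s3 = new AmazonS3Client();

    public void readLines(String bucket, String key) throws Exception {
        InputStream in = s3.getObject(bucket, key).getObjectContent();
        if (key.endsWith(".gz")) {
            in = new GZIPInputStream(in); // handle compressed objects
        }
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // A real source would hand each line to the channel
                // processor as an event; printing stands in for that.
                System.out.println(line);
            }
        } finally {
            reader.close();
        }
    }
}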

Best is to start with something that works and then start adding more features to it.


On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <natty@streamsets.com> wrote:
Hi all,

I started trying to write some code on this, and realized there are a number of issues that need to be discussed in order to really design this feature effectively. The requirements that have been discussed thus far are:

1. Fetching data from S3 periodically
2. Fetching data from multiple S3 buckets -- This may be something that should be punted on until later. For a first implementation, this could be solved just by having multiple sources, each with a single S3 bucket
3. Associating an S3 bucket with a user/token/key -- Otis - can you clarify what you mean by this?
4. Dynamically reconfigure the source -- This is blocked by FLUME-1491, so I think this is out-of-scope for discussions at the moment

Some questions I want to try to answer:

1. How do we identify and track objects that need to be processed versus objects that have been processed already?
1a. What if we want to have multiple sources working against the same bucket to speed up processing?
2. Is it fair to assume that we're dealing with character files, rather than binary objects?

For the first question, if we ignore the multiple source extension of the question, I think the simplest answer is to do something on the local filesystem, like have a tracking directory that contains a list of to-be-processed objects and a list of already-processed objects. However, if the source goes down, what should the restart semantics be? It seems that the ideal situation is to store this state in a system like ZooKeeper, which would ensure that a number of sources could operate off of the same bucket, but this probably requires FLUME-1491 first.
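
As a strawman for the tracking-directory idea, something like the following could work for the single-source case (the file layout is made up, it is not crash-safe across partial writes, and it does not help multiple sources sharing one bucket):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.Set;

// Hypothetical tracker: one append-only file in a tracking directory
// lists the S3 keys that have already been turned into events.
public class ProcessedKeyTracker {
    private final File trackerFile;
    private final Set<String> processed = new HashSet<String>();

    public ProcessedKeyTracker(File trackingDir) throws IOException {
        trackerFile = new File(trackingDir, "processed-keys");
        if (trackerFile.exists()) {
            // Restart semantics: reload whatever was finished before
            // the source went down.
            processed.addAll(Files.readAllLines(
                    trackerFile.toPath(), Charset.forName("UTF-8")));
        }
    }

    public boolean isProcessed(String key) {
        return processed.contains(key);
    }

    public void markProcessed(String key) throws IOException {
        FileWriter w = new FileWriter(trackerFile, true); // append
        try {
            w.write(key + "\n");
        } finally {
            w.close();
        }
        processed.add(key);
    }
}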

For the second question, my feeling was just that we should work with assumptions similar to how SpoolDirectorySource works, where each line is a separate event. Does that seem reasonable?

Thanks,
Natty


On Fri, Aug 1, 2014 at 11:31 AM, Paweł <prog88@gmail.com> wrote:
Hi,
Thanks for the explanation, Jonathan. I think I will also start working on it. When you have any patch (even a draft), I'd be glad if you could attach it to the JIRA. I'll do the same.
What do you think?

--
Paweł Róg

2014-08-01 20:19 GMT+02:00 Hari Shreedharan <hshreedharan@cloudera.com>:

+1 on an S3 Source. I would gladly review.

Jonathan Natkins wrote:

Hey Pawel,

My intention is to start working on it, but I don't know exactly how
long it will take, and I'm not a committer, so time estimates would
have to be taken with a grain of salt regardless. If this is something
you need urgently, it may not be ideal to wait for me rather than
building something yourself.

That said, as mentioned in the other thread, dynamic configuration can
be done by refreshing the configuration files across the set of Flume
agents. It's certainly not as great as having a single place to change
it (e.g. ZooKeeper), but it's a way to get the job done.

Thanks,
Natty


On Fri, Aug 1, 2014 at 1:33 AM, Paweł <prog88@gmail.com> wrote:

    Hi,
    Jonathan, how should we interpret your last e-mail? You opened a
    JIRA issue and want to start implementing this; do you have any
    estimate of how long it will take?

    I think the biggest challenge here is to have dynamic
    configuration of Flume. It doesn't seem to be part of the
    FLUME-2437 issue. Am I right?

    > Would you need to be able to pull files from multiple S3
    directories with the same source?

    I think we don't need to track multiple S3 buckets with a single
    source. I just imagine an approach where each S3 source can be
    added or deleted on demand and attached to any Channel. I'm only
    concerned about this dynamic configuration. I'll open a new thread
    about it. It seems we have two totally separate things:
    * build an S3 source
    * make Flume configurable dynamically

    --
    Paweł


    2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <otis.gospodnetic@gmail.com>:


        Hi,

        On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <natty@streamsets.com> wrote:

            Hey all,

            I created a JIRA for this:
            https://issues.apache.org/jira/browse/FLUME-2437


        Thanks!  Should Fix Version be set to the next Flume release
        version?

            I thought I'd start working on one myself, which can
            hopefully be contributed back. I'm curious: do you have
            particular requirements? Based on the emails in this
            thread, it sounds like the original goal was to have
            something that's like a SpoolDirectorySource that just
            picks up new files from S3. Is that accurate?


        Yes, I think so.  We need to be able to:
        * fetch data (logs we pull into Logsene
        <http://sematext.com/logsene/>) from S3 periodically (e.g.
        every 1 min, every 5 min, etc.)
        * fetch data from multiple S3 buckets
        * associate an S3 bucket with a user/token/key
        * dynamically (i.e. without editing/writing config files
        stored on disk) add new S3 buckets from which data should be fetched
        * dynamically (i.e. without editing/writing config files
        stored on disk) stop fetching data from some S3 buckets


            Would you need to be able to pull files from multiple S3
            directories with the same source?


        I think the above addresses this question.

            Thanks,
            Natty


        Thanks!

        Otis
        --
        Performance Monitoring * Log Analytics * Search Analytics
        Solr & Elasticsearch Support * http://sematext.com/



            On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic
            <otis.gospodnetic@gmail.com> wrote:

                +1 for seeing S3Source, starting with a JIRA issue.

                But being able to dynamically add/remove S3 buckets
                from which to pull data seems important.

                Any suggestions for how to approach that?

                Otis
                --
                Performance Monitoring * Log Analytics * Search Analytics
                Solr & Elasticsearch Support * http://sematext.com/


                On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan
                <hshreedharan@cloudera.com> wrote:

                    Please go ahead and file a jira. If you are
                    willing to submit a patch, you can post it on the
                    jira.

                    Viral Bajaria wrote:


                    I have a similar use case that cropped up
                    yesterday. I saw the archive
                    and found that there was a recommendation to
                    build it as Sharninder
                    suggested.

                    For now, I went down the route of writing a
                    python script which
                    downloads from S3 and puts the files in a
                    directory which is
                    configured to be picked up via a spooldir.

                    I would prefer to get a direct S3 source, and
                    maybe we could
                    collaborate on it and open-source it. Let me know
                    if you prefer that
                    and we can work directly on it by creating a JIRA.

                    Thanks,
                    Viral



                    On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan
                    <hshreedharan@cloudera.com> wrote:

                        In both cases, Sharninder is right :)

                        Sharninder wrote:



                        As far as I know, there is no (open source)
                        implementation of an S3 source, so yes, you'll
                        have to implement your own. You'll have to
                        implement a PollableSource, and the dev
                        documentation has an outline that you can use.
                        You can also look at the existing ExecSource
                        and work your way up.

                        As far as I know, there is no way to configure
                        Flume without using the configuration file.
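
[For reference, the skeleton Sharninder describes looks roughly like this; only the Flume plumbing is shown, with the actual S3 reading left as a placeholder stub:]

import java.util.Collections;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

// Minimal PollableSource skeleton; Flume calls process() in a loop.
public class S3Source extends AbstractSource
        implements Configurable, PollableSource {

    private String bucket;

    @Override
    public void configure(Context context) {
        bucket = context.getString("bucket"); // from the agent config file
    }

    @Override
    public Status process() throws EventDeliveryException {
        List<String> lines = fetchNewLines();
        if (lines.isEmpty()) {
            return Status.BACKOFF; // nothing new; Flume polls again later
        }
        for (String line : lines) {
            // Each line becomes one event on the configured channel(s).
            getChannelProcessor().processEvent(
                    EventBuilder.withBody(line.getBytes()));
        }
        return Status.READY;
    }

    // Stub for illustration; listing the bucket, skipping processed
    // objects, and reading lines from new ones would live here.
    private List<String> fetchNewLines() {
        return Collections.emptyList();
    }
}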



                        On Thu, Jul 31, 2014 at 7:57 PM, Paweł
                        <prog88@gmail.com> wrote:

                            Hi,
                            I'm wondering if Flume is able to read
                            directly from S3.

                            I'll describe my case. I have log files
                            stored in AWS S3. I have to periodically
                            fetch new S3 objects and read log lines
                            from them. Then the log lines (events) are
                            processed in the standard Flume way (as
                            with other sources).

                            *1) Is there any way to fetch S3 objects,
                            or do I have to write my own Source?*


                            There is also a second case. I want the
                            Flume configuration to be dynamic. Flume
                            sources can change over time: a new AWS key
                            and S3 bucket can be added or deleted.

                            *2) Is there any other way to configure
                            Flume than by a static configuration file?*

                            --
                            Paweł Róg

--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
