flume-user mailing list archives

From Alaa Ali <contact.a...@gmail.com>
Subject Re: Extract data using regex into HBase
Date Mon, 27 Oct 2014 12:34:33 GMT
I guess I'll keep the batch approach as my backup, but using
RegexHbaseEventSerializer (in the realtime approach you proposed) only
allows me to have one regex pattern in the Flume config and asks me to
specify the columns to write the match groups to, but my regex won't be a
single pattern because the input varies a lot (logs from different
operating systems).

Back to one of my thoughts, I found an article on building a custom Flume
interceptor. This is the part where headers are added in interceptors:

public Event intercept(Event event) {
        // This is the event's body
        String body = new String(event.getBody());

        // These are the event's headers
        Map<String, String> headers = event.getHeaders();

        // Here, can I start using regex to match patterns from String body?

        // Enrich header with hostname
        headers.put(hostHeader, hostValue);

        // Let the enriched event go
        return event;
}

Would it be feasible to use regex patterns (in the place I marked above) to
start matching from String body and save into variables, then use
headers.put multiple times to add new headers with those variables?
Shouldn't that work?
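
A minimal sketch of that idea, kept free of Flume dependencies so it can be run on its own: several regexes are tried in order against the body, and the capture groups of the first one that matches become header entries. The class name SyslogFieldExtractor, the Rule helper, the two patterns, and the header names (username, src_ip, dst_ip) are all hypothetical; real patterns would depend on the actual log formats.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: try several regexes against an event body and collect fields as headers. */
final class SyslogFieldExtractor {

    /** One regex plus the header name for each of its capture groups, in order. */
    static final class Rule {
        final Pattern pattern;
        final String[] headerNames;

        Rule(String regex, String... headerNames) {
            this.pattern = Pattern.compile(regex);
            this.headerNames = headerNames;
        }
    }

    // Hypothetical rules for two made-up log formats (assumptions, not from the thread).
    private static final List<Rule> RULES = Arrays.asList(
        // e.g. "sshd[123]: Accepted password for alice from 10.0.0.5 port 22 ssh2"
        new Rule("Accepted \\w+ for (\\S+) from (\\S+) port", "username", "src_ip"),
        // e.g. "fw: user=bob src=192.168.1.7 dst=10.1.1.1"
        new Rule("user=(\\S+) src=(\\S+) dst=(\\S+)", "username", "src_ip", "dst_ip"));

    /** Returns the fields of the first matching rule, or an empty map if none match. */
    static Map<String, String> extractHeaders(String body) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (Rule rule : RULES) {
            Matcher m = rule.pattern.matcher(body);
            if (m.find()) {
                // Capture groups are 1-based; group i+1 maps to headerNames[i].
                for (int i = 0; i < rule.headerNames.length; i++) {
                    headers.put(rule.headerNames[i], m.group(i + 1));
                }
                break; // first matching rule wins
            }
        }
        return headers;
    }
}
```

Inside intercept(), the marked spot would then presumably become something like headers.putAll(SyslogFieldExtractor.extractHeaders(body)); before returning the event. Pairing each pattern with its own list of header names keeps the "lots of ifs" in data rather than in code, so supporting another log format means adding a rule.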

Alaa Ali

On Sun, Oct 26, 2014 at 4:17 PM, Ehsan ul Haq <m.ehsan.haq@gmail.com> wrote:

> Hi,
>    Here are my thoughts.
> Using Batch Approach (Inspired by Lambda)
> 1. Store your syslog events as-is in HBase using the sync or async
> HBase sink.
> 2. Write a MapReduce job that does your regular expression extraction and
> writes the output to HDFS or HBase, whichever you need.
> 3. Run your MapReduce job periodically.
> + Once the syslog is imported into HBase, you can easily discard the syslogs
> from the actual source.
> + Syslog will be stored as immutable data in an HBase table, allowing you to
> fix your regular expression extraction without destroying the log events.
> - You need to rerun the MapReduce job periodically.
> (I assume you want to expose the output as an HBase table.)
> Using Realtime approach
> You can use the RegexHbaseEventSerializer. You can look at the following
> usage
> http://stackoverflow.com/questions/12304826/regular-expression-confiuration-in-flumeng
> + Your data is available immediately.
> - Hard to fix errors.
> - Can't add more fields to already processed syslog events. (You would
> have to run a MapReduce job or reimport all the syslog events.)
> Regards
> Ehsan
> On Sun, Oct 26, 2014 at 7:50 PM, Alaa Ali <contact.alaa@gmail.com> wrote:
>> Hello! I want to receive syslog, parse out the input using regex into
>> fields (for example username, source IP, destination IP), and store the
>> data in HBase into columns corresponding to those fields. I know how to do
>> the syslog source, but how do I go about doing the extraction+storing?
>> My thoughts:
>> 1. Can I use a Regex Extractor Interceptor to make my own serializer
>> implementation that extracts data into multiple headers in the event? Then
>> use the AsyncHBase sink serializer to simply store the header values into
>> columns? Can I do that?
>> 2. Should I pass the data to the AsyncHBase sink unaltered, and implement
>> everything in the sink's serializer?
>> It is worth noting that the input is in different formats, so my regex
>> implementation isn't one simple regex and will probably contain a lot of
>> ifs to, for example, extract the username because it won't always be in the
>> same place in the log. Which approach is best, or is there another
>> approach, or am I getting it wrong?
>> Alaa Ali
