flume-user mailing list archives

From Ehsan ul Haq <m.ehsan....@gmail.com>
Subject Re: Extract data using regex into HBase
Date Mon, 27 Oct 2014 14:18:15 GMT
If you write an interceptor to extract your fields into headers, you still
need to write a serializer to put the header properties into HBase table
columns, since there is no default HBase serializer that writes event header
properties as columns.

Since you will need to write a serializer anyway, why not put your logic
in your custom serializer, your suggested option (2)?
You can look at the default implementation of
SimpleAsyncHbaseEventSerializer for reference.
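
As a rough sketch of the kind of format-dispatching extraction being discussed
(the patterns, sample log lines, and field names below are made up for
illustration, not taken from this thread), the "lot of ifs" logic can live in
one plain helper class that tries each format-specific pattern in turn:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: try several format-specific patterns in turn,
// since the logs come from different operating systems. Both patterns
// and field names here are illustrative assumptions.
public class SyslogFieldExtractor {

    // e.g. "sshd: Accepted password for alice from 10.0.0.5"
    private static final Pattern SSH_STYLE =
        Pattern.compile("Accepted \\w+ for (\\S+) from (\\S+)");

    // e.g. "login user=bob src=192.168.1.9 dst=192.168.1.1"
    private static final Pattern KV_STYLE =
        Pattern.compile("user=(\\S+) src=(\\S+) dst=(\\S+)");

    public static Map<String, String> extract(String body) {
        Map<String, String> fields = new HashMap<>();
        Matcher m = SSH_STYLE.matcher(body);
        if (m.find()) {
            fields.put("username", m.group(1));
            fields.put("srcIp", m.group(2));
            return fields;
        }
        m = KV_STYLE.matcher(body);
        if (m.find()) {
            fields.put("username", m.group(1));
            fields.put("srcIp", m.group(2));
            fields.put("dstIp", m.group(3));
        }
        return fields; // empty map if no pattern matched
    }

    public static void main(String[] args) {
        Map<String, String> f =
            extract("sshd: Accepted password for alice from 10.0.0.5");
        System.out.println(f.get("username") + " " + f.get("srcIp"));
    }
}
```

The same map could then feed either approach from the thread: in an
interceptor you would loop over the entries and call headers.put(...) for
each, while in a custom serializer modeled on
SimpleAsyncHbaseEventSerializer you would turn each entry into a column put
inside getActions().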



On Mon, Oct 27, 2014 at 1:34 PM, Alaa Ali <contact.alaa@gmail.com> wrote:

> I guess I'll keep the batch approach as my backup. Using
> RegexHbaseEventSerializer (in the realtime approach you proposed) only
> allows me to have one regex pattern in the Flume config and asks me to
> specify the columns to write the match groups to, but my regex won't be a
> single pattern because the input varies a lot (logs from different
> operating systems).
> Back to one of my thoughts, I found an article on building a custom Flume
> interceptor:
> http://hadoopi.wordpress.com/2014/06/11/flume-getting-started-with-interceptors/.
> This is the part where headers are added in interceptors:
> public Event intercept(Event event) {
>         // This is the event's body
>         String body = new String(event.getBody());
>         // These are the event's headers
>         Map<String, String> headers = event.getHeaders();
>         // Here, can I start using regex to match patterns from String body?
>         // Enrich header with hostname
>         headers.put(hostHeader, hostValue);
>         // Let the enriched event go
>         return event;
> }
> Would it be feasible to use regex patterns (in the place I marked above)
> to match against String body and save the results into variables, then
> call headers.put multiple times to add new headers with those values?
> Shouldn't that work?
> Regards,
> Alaa Ali
> On Sun, Oct 26, 2014 at 4:17 PM, Ehsan ul Haq <m.ehsan.haq@gmail.com>
> wrote:
>> Hi,
>>    Here are my thoughts.
>> Using Batch Approach (Inspired by Lambda)
>> 1. Store your syslog events as-is in HBase using the sync or async
>> HBase sink.
>> 2. Write a MapReduce job with your regular-expression extraction and
>> output to either HDFS or HBase, whatever you need.
>> 3. Run your MapReduce job periodically.
>> + Once the syslog is imported into HBase you can easily discard syslogs
>> from the actual source.
>> + Syslog will be stored as immutable data in an HBase table, allowing you
>> to fix your regular-expression extraction without destroying the log events.
>> - Need to periodically rerun the MapReduce job.
>> (I assume you want to expose the output as an HBase table.)
>> Using Realtime approach
>> You can use the RegexHbaseEventSerializer. You can look at the following
>> usage
>> http://stackoverflow.com/questions/12304826/regular-expression-confiuration-in-flumeng
>> + Your data is available immediately.
>> - Hard to fix errors.
>> - Can't add more fields to already-processed syslog events. (You will
>> have to run a MapReduce job or reimport all the syslog events again.)
>> Regards
>> Ehsan
>> On Sun, Oct 26, 2014 at 7:50 PM, Alaa Ali <contact.alaa@gmail.com> wrote:
>>> Hello! I want to receive syslog, parse out the input using regex into
>>> fields (for example username, source IP, destination IP), and store the
>>> data in HBase into columns corresponding to those fields. I know how to do
>>> the syslog source, but how do I go about doing the extraction+storing?
>>> My thoughts:
>>> 1. Can I use a Regex Extractor Interceptor with my own serializer
>>> implementation that extracts data into multiple headers in the event, then
>>> use the AsyncHBase sink serializer to simply store the header values into
>>> columns? Can I do that?
>>> 2. Should I pass the data to the AsyncHBase sink unaltered, and
>>> implement everything in the sink's serializer?
>>> It is worth noting that the input is in different formats, so my regex
>>> implementation isn't one simple regex and will probably contain a lot of
>>> ifs to, for example, extract the username because it won't always be in the
>>> same place in the log. Which approach is best, or is there another
>>> approach, or am I getting it wrong?
>>> -
>>> Alaa Ali
