incubator-general mailing list archives

From Julian Feinauer <j.feina...@pragmaticminds.de>
Subject Re: Hello World / CRUNCH Framework
Date Sun, 16 Dec 2018 23:07:28 GMT
Hi Julian,

thanks for your answer and your insights.
I agree with you on many points (especially our last discussion on the Calcite ML made me
think a lot).
So I agree with your "layered" approach, and in fact this is what we currently do (without
stating it explicitly enough, I think).

Basically, we do two things, I guess. First, we provide a (Java) DSL to make it easy to write
specific operations (and do some very limited optimization, not at all comparable to what
Calcite does).
Second, we also provide some functions which are useful or necessary for signal processing
(smoothing, filtering, ...) and we plan to extend them soon with things like short- or long-term
predictions, anomaly detection, and so on.
By providing suitable wrappers for all that stuff we are able to translate this to "real"
streaming engines (currently Flink and Akka Streams) and run it there.
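
To give a rough idea of what a "smoothing" function means here, below is a minimal Java sketch
of a trailing moving average. It is only an illustration under my own assumptions: the class
name SmoothingSketch and the method movingAverage are hypothetical and are not part of the
CRUNCH API, and a real streaming operator would keep this state inside the engine.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not CRUNCH code.
public class SmoothingSketch {

    /**
     * Trailing moving average: each output value is the mean of the
     * current sample and up to (window - 1) samples before it.
     */
    public static List<Double> movingAverage(List<Double> samples, int window) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        Deque<Double> buffer = new ArrayDeque<>(); // samples currently in the window
        List<Double> smoothed = new ArrayList<>();
        double sum = 0.0;
        for (double sample : samples) {
            buffer.addLast(sample);
            sum += sample;
            if (buffer.size() > window) {
                sum -= buffer.removeFirst(); // drop the oldest sample
            }
            smoothed.add(sum / buffer.size());
        }
        return smoothed;
    }

    public static void main(String[] args) {
        // prints [1.0, 1.5, 2.5, 3.5]
        System.out.println(movingAverage(List.of(1.0, 2.0, 3.0, 4.0), 2));
    }
}
```

The per-sample state (a small buffer and a running sum) is what would have to be mapped onto
operator state when such a function is translated to a streaming engine.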

And indeed MATCH_RECOGNIZE could be a good implementation for many situations (definitely
not all) and I hope that I can contribute soon to your recent work (I will continue the discussion
on the Calcite list). But overall I'm really unsure if our problem can be seen as a problem
of relational algebra. I know and like the overall framework very much (I would even say it's
one of the most elegant applications of math I've seen so far). But it feels like it doesn’t
fit that well. As soon as you have a problem where relations are related, even for simple
things like LAG or LEAD as window functions, it gets pretty complicated and unnatural with
regard to the definition of the algebra. But, as I'm lacking a lot of expertise there, I would
love to discuss the matter further with you (but again, I think we should do it on the Calcite
list).
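
To make the LAG example a bit more concrete: a question like "how long did the machine spend
in each state" compares every sample with its predecessor. Below is a minimal Java sketch of
that idea. The names StateDurationSketch and timeInState are hypothetical and not CRUNCH code,
and the sketch assumes the samples arrive ordered by time, which is exactly the kind of
ordering assumption that is not part of plain relational algebra.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not CRUNCH code.
public class StateDurationSketch {

    /**
     * Sums, per state, the time between each sample and its successor.
     * timestampsMillis and states are parallel arrays, ordered by time.
     * The state of the last sample gets no duration, since it has no successor.
     */
    public static Map<String, Long> timeInState(long[] timestampsMillis, String[] states) {
        if (timestampsMillis.length != states.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        Map<String, Long> durations = new LinkedHashMap<>();
        for (int i = 1; i < states.length; i++) {
            // states[i - 1] plays the role of LAG(state): the value of the previous row
            long elapsed = timestampsMillis[i] - timestampsMillis[i - 1];
            durations.merge(states[i - 1], elapsed, Long::sum);
        }
        return durations;
    }

    public static void main(String[] args) {
        long[] timestamps = {0L, 1000L, 3000L, 4000L};
        String[] states = {"IDLE", "RUN", "RUN", "IDLE"};
        // prints {IDLE=1000, RUN=3000}
        System.out.println(timeInState(timestamps, states));
    }
}
```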

The following small ASCII image depicts how I think of these "layers". From our perspective,
MATCH_RECOGNIZE is one way to solve the problem, but we can also provide "native" blocks that
run directly on a streaming engine. There are surely pros and cons on both sides:

                O CRUNCH Evaluation
                |
        -----------------
        |               |
      STREAM     Rel. Expression with MATCH_RECOGNIZE
        |               |
 Streaming Engines      |
                        |
                SQL based Engines

So, I'm not exactly sure from your mail which approach you would prefer, but my suggestion
for the next steps with CRUNCH would be to enrich the DSL, add more domain-specific functions,
find more use cases and get more users on board. In other words, work on the semantics side of
things. But in parallel we should follow a path to get a better separation of "business logic"
and execution, with support for multiple frameworks and especially the relational algebra side.
Perhaps we can conclude at some point that we can cover everything with Calcite (I'm skeptical
right now), but I think what's needed for this discussion is a valid basis to also show you
Calcite devs in depth what exactly we are doing.

Julian


On 16.12.18, 08:20, "Julian Hyde" <jhyde@apache.org> wrote:

    Hi Julian,
    
    Regarding whether to do this as a streaming engine (with its own query language) or as
    a framework above a streaming engine, I’d say that’s a false choice. If there is relational
    algebra inside your system, you can provide a high-level query language that can be translated
    to a lower-level query language in a streaming engine.
    
    This approach of “layered” databases has worked well for me for several projects,
and is ever more applicable these days as data is becoming federated.
    
    You and I have discussed SQL’s MATCH_RECOGNIZE clause as a way to build complex time-based
    logic. You have probably noticed that it is now in Flink, I am working on it in Calcite, and
    Beam will probably get it at some point. Even if MATCH_RECOGNIZE doesn’t solve your problem,
    let’s follow the same approach: convert your problem to a DSL that maps to or extends relational
    algebra, and then figure out how to translate that to SQL in an underlying engine. Calcite
    is a very good platform for building new “data languages”, so let’s carry on talking.
    
    Julian
    
    
    > On Dec 14, 2018, at 2:11 AM, Julian Feinauer <j.feinauer@pragmaticminds.de> wrote:
    > 
    > Hi all,
    > 
    > I just joined the incubator ML and wanted to introduce myself and possibly also start
    > a discussion about a software project we developed in the past.
    > But first things first. My name is Julian Feinauer and I come from Germany, where
    > I run two “start-up” companies that work a lot on “industrial IoT” topics,
    > data science and processing of “larger amounts of data”. We love open source and so we
    > love the ASF. Most notably, I closely follow the Apache Calcite project and hope to find
    > some time soon to contribute a bit more than in the last months. Furthermore, I am engaged in
    > the (incubating) PLC4X project as (P)PMC and in the (incubating) Edgent project, where I try
    > to “revive” the community as a new (P)PMC together with Christopher Dutz.
    > 
    > Now to the real topic. Over the last 3 years I have been developing a “Framework/Library”
    > (currently a set of jars) to facilitate processing of time series data. The focus is mostly
    > on processing of data from test stands, e.g., automotive tests, driving profiles and so on.
    > Furthermore, in the past year we added a lot of functionality for processing of “industrial
    > data”. This means that we want to make it easy to analyze things like “how long did the
    > machine spend in this state”, “when is the following set of bits set” or “notify
    > when the following condition is true for the first time”.
    > It is a bit technical and I don’t want to go too deep into it, but generally speaking
    > we try to introduce the “right” semantics to answer the typical questions when analyzing
    > machine or test data. This project is called “CRUNCH” and we are in the process of making
    > it open source (it will be moved to a public GitHub repo this year) under the Apache 2.0 License.
    > 
    > As there is a close relationship to other (incubating or TLP) projects, we
    > are wondering whether this project could fit into the incubator. Some examples of Apache
    > projects that we see as “related” are Apache Flink (which we can use as the streaming
    > engine to process the stream) and (incubating) Edgent, which we can also support as a streaming
    > engine and where we are currently trying to find a suitable project goal and community, as some of
    > the (P)PMC members retired or went inactive. Finally, CRUNCH has a very natural fit with PLC4X
    > because it can directly process the data gathered from PLCs (and in fact we are already using
    > it in some of our projects that way). I had several discussions with some of the (P)PMCs of
    > PLC4X, namely Sebastian Rühl and Christopher Dutz, who encouraged me to introduce the project
    > to the incubator because they also see some potential for the project to enrich the OSS ecosystem
    > with regard to edge / stream processing of (I)IoT data.
    > 
    > So please feel free to ask questions or discuss your view on this topic, as I would
    > like to find out whether this project could fit in the Apache ecosystem and the Incubator or not.
    > 
    > Thank you already!
    > Julian
    
    
    
    
