incubator-general mailing list archives

From: Julian Hyde
Subject: Re: [DISCUSS] CarbonData incubation proposal
Date: Thu, 19 May 2016 22:11:16 GMT
I see code derived from Mondrian in the org.carbondata.core.carbon package[1] (I’m familiar
with Mondrian’s code structure because I wrote it). Mondrian was originally EPL and as such
cannot be re-licensed under ASL. Everything is probably fine, but as part of incubation, we
will need to make sure that this and other code has a clear provenance.



> On May 19, 2016, at 10:04 AM, Liang Chen wrote:
> Hi Lars
> Thanks for participating in the discussion.
> Based on the requirements below, we investigated existing file formats in
> the Hadoop ecosystem, but we could not find a suitable solution that
> satisfies all of the requirements at the same time, so we started designing
> CarbonData.
> R1. Support big scans that fetch only a few columns.
> R2. Support primary key lookups with sub-second response.
> R3. Support interactive OLAP-style queries over big data that involve many
> filters in a query; this type of workload should respond in seconds.
> R4. Support fast individual record extraction that fetches all columns of the
> record.
> R5. Support HDFS so that customers can leverage an existing Hadoop cluster.
> When we investigated Parquet/ORC, they seemed to work very well for R1 and R5,
> but they do not meet R2, R3, and R4. So we designed CarbonData mainly to add
> the following differentiating features:
> 1. Stores data along with an index: this can significantly accelerate query
> performance and reduce I/O scans and CPU usage when there are
> filters in the query. The CarbonData index consists of multiple levels; a
> processing framework can leverage this index to reduce the number of tasks it
> needs to schedule and process, and it can also do skip scans at a finer-grained
> unit (called a blocklet) in task-side scanning instead of scanning the whole file.
> 2. Operable encoded data: by supporting efficient compression and global
> encoding schemes, queries can run on compressed/encoded data, and the data is
> converted just before the results are returned to users, which is "late
> materialization".
> 3. Column groups: allow multiple columns to form a column group that is stored
> in row format, which reduces the cost of column reconstruction.
> 4. Support for various use cases with a single data format, such as
> interactive OLAP-style queries, sequential access (big scans), and random access
> (narrow scans).
> Please let me know if the above information answers your questions.
> Regards
> Liang
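For readers following the thread, features 1 and 2 in Liang's list (blocklet-level skip scans and querying on encoded data with late materialization) can be illustrated with a toy sketch. This is not CarbonData's actual file layout or API. The `Blocklet` class, the sorted global dictionary, and the per-blocklet min/max statistics are all assumptions chosen to keep the example small.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Blocklet:
    """Hypothetical unit of storage: encoded values plus min/max statistics."""
    codes: List[int]
    min_code: int
    max_code: int


def build_dictionary(values: List[str]) -> List[str]:
    # Sorted global dictionary: the code of a value is its position in this list.
    return sorted(set(values))


def encode(values: List[str], dictionary: List[str]) -> List[int]:
    # Every value is present in the dictionary, so bisect_left returns its exact index.
    return [bisect_left(dictionary, v) for v in values]


def make_blocklets(codes: List[int], size: int) -> List[Blocklet]:
    # Split the encoded column into fixed-size chunks and record min/max per chunk.
    blocklets = []
    for start in range(0, len(codes), size):
        chunk = codes[start:start + size]
        blocklets.append(Blocklet(chunk, min(chunk), max(chunk)))
    return blocklets


def filter_equals(blocklets: List[Blocklet], dictionary: List[str],
                  value: str) -> Tuple[List[str], int]:
    """Return the decoded matches and the number of blocklets actually scanned."""
    code = bisect_left(dictionary, value)
    if code == len(dictionary) or dictionary[code] != value:
        return [], 0  # value is not in the dictionary, so no blocklet can match
    scanned = 0
    hits: List[int] = []
    for blocklet in blocklets:
        if code < blocklet.min_code or code > blocklet.max_code:
            continue  # skip scan: the statistics rule this blocklet out
        scanned += 1
        # The comparison runs on encoded integers, not on the original strings.
        hits.extend(c for c in blocklet.codes if c == code)
    # Late materialization: decode only the rows that survived the filter.
    return [dictionary[c] for c in hits], scanned


countries = ["cn", "cn", "de", "de", "fr", "fr", "us", "us"]
dictionary = build_dictionary(countries)
blocklets = make_blocklets(encode(countries, dictionary), size=2)
print(filter_equals(blocklets, dictionary, "fr"))
```

With this sample data the filter on "fr" scans only one of the four blocklets, and strings are reconstructed only for the two matching rows. The skipping works well here because the input happens to be sorted; with unsorted data the min/max ranges overlap more and fewer blocklets can be skipped.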
