phoenix-user mailing list archives

From Josh Elser <els...@apache.org>
Subject Re: Materialized views in Hbase/Phoenix
Date Mon, 30 Sep 2019 22:37:52 GMT
Bulk loading would help a little bit with the "all-or-nothing" problem, 
but it still would not be foolproof.

You could have a set of files destined for different tables, each with 
very clear data that needs to be loaded, but if one or more files failed 
to load, you would have to take steps to keep retrying them.
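
A rough sketch of what that retry bookkeeping could look like, assuming 
Phoenix's CsvBulkLoadTool with the phoenix-client jar on the classpath; the 
file-to-table mapping, table names and retry count below are purely 
illustrative:

    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.phoenix.mapreduce.CsvBulkLoadTool;

    public class RetryingBulkLoad {
        public static void main(String[] args) throws Exception {
            // Illustrative file -> destination table mapping; in practice this
            // is driven by whatever convention ties the files to tables.
            Map<String, String> filesToTables = new LinkedHashMap<>();
            filesToTables.put("/staging/medians.csv", "MEDIAN_VIEW");
            filesToTables.put("/staging/means.csv", "MEAN_VIEW");

            Configuration conf = new Configuration();
            int maxAttempts = 3;

            for (Map.Entry<String, String> e : filesToTables.entrySet()) {
                boolean loaded = false;
                for (int attempt = 1; attempt <= maxAttempts && !loaded; attempt++) {
                    int rc = ToolRunner.run(conf, new CsvBulkLoadTool(),
                            new String[] { "--table", e.getValue(), "--input", e.getKey() });
                    loaded = (rc == 0);
                }
                if (!loaded) {
                    // Bulk load gives no atomicity across tables: some tables are
                    // now loaded and this one is not, so the caller must park the
                    // file and keep retrying, or undo the earlier loads itself.
                    throw new IllegalStateException("Could not load " + e.getKey());
                }
            }
        }
    }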

On 9/27/19 12:22 PM, Gautham Acharya wrote:
> We will be reaching 100 million rows early next year, and then billions 
> shortly after that, so HBase will be needed to scale to that degree.
> 
> If one of the tables fails to write, we need some kind of a rollback 
> mechanism, which is why I was considering a transaction. We cannot be in 
> a partial state where some of the ‘views’ are written and some aren’t.
> 
> *From:*Pedro Boado [mailto:pedro.boado@gmail.com]
> *Sent:* Friday, September 27, 2019 7:22 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
> 
> For just a few million rows I would go for an RDBMS and not Phoenix / HBase.
> 
> You don't really need transactions to control completion; just write a 
> flag (an empty COMPLETED file, for instance) as the final step of your job.
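> 
> A minimal sketch of that flag-file idea, assuming the job output lives on 
> HDFS (the same call works against S3 through an s3a:// path); the class and 
> marker names are made up for illustration:
> 
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
> 
>     public class CompletionFlag {
>         // Write an empty _COMPLETED marker as the very last step of the job;
>         // readers only treat the batch of tables as current once it exists.
>         public static void markComplete(String outputDir) throws Exception {
>             Configuration conf = new Configuration();
>             Path marker = new Path(outputDir, "_COMPLETED");
>             FileSystem fs = FileSystem.get(marker.toUri(), conf);
>             fs.create(marker, true).close();   // overwrite any stale marker
>         }
>     }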
> 
> On Fri, 27 Sep 2019, 15:03 Gautham Acharya, <gauthama@alleninstitute.org> wrote:
> 
>     Thanks Anil.
> 
>     So, what you’re essentially advocating for is to use some kind of
>     Spark/compute framework (I was going to use AWS Glue) job to write
>     the ‘materialized views’ as separate tables (maybe tied together
>     with some kind of a naming convention?)
> 
>     In this case, we’d end up with some sticky data consistency issues
>     if the write job failed halfway through (some ‘materialized view’
>     tables would be updated, and some wouldn’t). Can I use Phoenix
>     transactions to wrap the write jobs together, to make sure either
>     all the data is updated, or none?
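> 
>     If that works, the write could look roughly like the sketch below over
>     the Phoenix JDBC driver. This assumes transactions are enabled on the
>     cluster (phoenix.transactions.enabled=true) and the target tables were
>     created with TRANSACTIONAL=true; every table and column name here is
>     invented for illustration:
> 
>         import java.sql.Connection;
>         import java.sql.DriverManager;
>         import java.sql.Statement;
> 
>         public class AtomicViewRefresh {
>             public static void refresh(String zkQuorum) throws Exception {
>                 try (Connection conn =
>                         DriverManager.getConnection("jdbc:phoenix:" + zkQuorum)) {
>                     conn.setAutoCommit(false);
>                     try (Statement st = conn.createStatement()) {
>                         st.executeUpdate("UPSERT INTO MEDIAN_VIEW "
>                                 + "SELECT column_name, median_val FROM STAGING_MEDIANS");
>                         st.executeUpdate("UPSERT INTO MEAN_VIEW "
>                                 + "SELECT column_name, mean_val FROM STAGING_MEANS");
>                         conn.commit();    // all 'views' become visible together...
>                     } catch (Exception e) {
>                         conn.rollback();  // ...or none of them do
>                         throw e;
>                     }
>                 }
>             }
>         }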
> 
>     --gautham
> 
>     *From:* anil gupta [mailto:anilgupta84@gmail.com]
>     *Sent:* Friday, September 27, 2019 6:58 AM
>     *To:* user@phoenix.apache.org
>     *Subject:* Re: Materialized views in Hbase/Phoenix
> 
>     For your use case, I would suggest creating another table that stores
>     the matrix. Since this data doesn't change that often, maybe you can
>     write a nightly Spark/MR job to update/rebuild the matrix table. (If
>     you want near real time, that is also possible with any streaming
>     system.) Have you looked into bloom filters? They might help if you
>     have a sparse dataset and you are using Phoenix dynamic columns. We use
>     dynamic columns for a table that has up to 40k columns. Here is the
>     presentation on the optimizations we made for that use case:
>     https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
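> 
>     For reference, a dynamic-column read is just declared inline in the
>     query. A made-up example over the Phoenix JDBC driver (table and column
>     names are illustrative, not from your schema):
> 
>         import java.sql.Connection;
>         import java.sql.DriverManager;
>         import java.sql.ResultSet;
>         import java.sql.Statement;
> 
>         public class DynamicColumnRead {
>             public static void read(String zkQuorum) throws Exception {
>                 try (Connection conn =
>                          DriverManager.getConnection("jdbc:phoenix:" + zkQuorum);
>                      Statement st = conn.createStatement();
>                      // Columns not in the schema are declared after the table name.
>                      ResultSet rs = st.executeQuery(
>                          "SELECT cell_id, gene_0001, gene_0002 "
>                          + "FROM EXPRESSION_MATRIX (gene_0001 INTEGER, gene_0002 INTEGER) "
>                          + "WHERE cell_id = 'cell-123'")) {
>                     while (rs.next()) {
>                         System.out.println(rs.getString(1) + " "
>                                 + rs.getInt(2) + " " + rs.getInt(3));
>                     }
>                 }
>             }
>         }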
> 
>     IMO, Hive integration with HBase is not fully baked and has a lot of
>     rough edges. So, it is better to stick with native Phoenix/HBase if you
>     care about performance and ease of operations.
> 
>     HTH,
> 
>     Anil Gupta
> 
>     On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya
>     <gauthama@alleninstitute.org>
>     wrote:
> 
>         Hi,
> 
>         Currently I'm using HBase to store large, sparse matrices of
>         integers with 50,000 columns and 10+ million rows.
> 
>         This matrix is used for fast, random access - we need to be able
>         to fetch random row/column subsets, as well as entire columns.
>         We also want to fetch aggregates (mean, median, etc.) on this
>         matrix very quickly.
> 
>         The data does not change very often for these matrices (a few
>         times a week at most), so pre-computing is very feasible here.
>         What I would like to do is maintain a column store (store the
>         column names as row keys, and a compressed list of all the row
>         values) for the use case where we select an entire column.
>         Additionally, I would like to maintain a separate table for each
>         precomputed aggregate (median table, mean table, etc).
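> 
>         Purely to illustrate the shape I have in mind, the two optimized
>         stores might be declared in Phoenix something like this (all names
>         invented here; SNAPPY assumes the codec is available on the cluster):
> 
>             import java.sql.Connection;
>             import java.sql.DriverManager;
>             import java.sql.Statement;
> 
>             public class ViewTablesDdl {
>                 public static void create(String zkQuorum) throws Exception {
>                     try (Connection conn =
>                              DriverManager.getConnection("jdbc:phoenix:" + zkQuorum);
>                          Statement st = conn.createStatement()) {
>                         // Column store: one row per matrix column, with the whole
>                         // column packed/compressed into a single cell.
>                         st.execute("CREATE TABLE IF NOT EXISTS COLUMN_STORE ("
>                                 + " column_name VARCHAR PRIMARY KEY,"
>                                 + " packed_values VARBINARY)"
>                                 + " COMPRESSION='SNAPPY'");
>                         // One narrow table per precomputed aggregate.
>                         st.execute("CREATE TABLE IF NOT EXISTS MEDIAN_VIEW ("
>                                 + " column_name VARCHAR PRIMARY KEY,"
>                                 + " median_val DOUBLE)");
>                     }
>                 }
>             }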
> 
>         The query time for all these use cases needs to be low latency -
>         under 100ms.
> 
>         When the data does change for a certain matrix, it would be nice
>         to easily update the optimized table. Ideally, I would like the
>         column store/aggregation tables to just be materialized views of
>         the original matrix. It doesn't look like Apache Phoenix
>         supports materialized views. It looks like Hive does, but
>         unfortunately Hive doesn't normally offer low latency queries.
> 
>         Maybe Hive can create the materialized view, and we can just
>         query the underlying HBase store for lower-latency responses?
> 
>         What would be a good solution for this?
> 
>         --gautham
> 
> 
> 
> 
>     -- 
> 
>     Thanks & Regards,
>     Anil Gupta
> 
