phoenix-user mailing list archives

From Josh Elser <>
Subject Re: Materialized views in Hbase/Phoenix
Date Mon, 30 Sep 2019 22:37:52 GMT
Bulk loading would help a little with the "all-or-nothing" problem, 
but it still wouldn't be foolproof.

You could have a set of files destined for different tables, with very 
clear data that needs to be loaded, but if any of the files failed to 
load, you would have to take steps to keep retrying.
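A retry loop of the kind described might look like the sketch below. This is a minimal illustration, not Phoenix/HBase API: `load_file` is a hypothetical caller-supplied callable (e.g. a wrapper around your actual bulk-load step) that raises on failure.

```python
import time

def bulk_load_with_retry(files, load_file, max_attempts=3, delay_s=1.0):
    """Try to load every file, retrying failures up to max_attempts rounds.

    `load_file` is a hypothetical callable that raises on failure.
    Returns the set of files that never loaded, so the caller can
    alert or schedule another retry later.
    """
    pending = list(files)
    for attempt in range(max_attempts):
        failed = []
        for f in pending:
            try:
                load_file(f)
            except Exception:
                failed.append(f)  # keep this file for the next round
        if not failed:
            return set()          # everything loaded
        pending = failed
        time.sleep(delay_s)       # back off before retrying
    return set(pending)           # files that still failed after all rounds
```

Note that even with retries you can end in a partial state, which is the point above: the caller still has to decide what to do with the leftovers.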

On 9/27/19 12:22 PM, Gautham Acharya wrote:
> We will be reaching 100 million rows early next year, and then billions 
> shortly after that. So HBase will need to scale to that degree.
> If one of the tables fails to write, we need some kind of a rollback 
> mechanism, which is why I was considering a transaction. We cannot be in 
> a partial state where some of the ‘views’ are written and some aren’t.
> *From:*Pedro Boado []
> *Sent:* Friday, September 27, 2019 7:22 AM
> *To:*
> *Subject:* Re: Materialized views in Hbase/Phoenix
> *CAUTION:*This email originated from outside the Allen Institute. Please 
> do not click links or open attachments unless you've validated the 
> sender and know the content is safe.
> ------------------------------------------------------------------------
> For just a few million rows I would go for an RDBMS and not Phoenix / HBase.
> You don't really need transactions to control completion, just write a 
> flag (an empty COMPLETED file, for instance) as the final step in your job.
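The flag-file idea above can be sketched as follows. This is a minimal sketch; the `_COMPLETED` file name and the write-then-rename detail are assumptions of this example, not anything specified in the thread.

```python
import os

def write_completion_flag(output_dir):
    """Write an empty _COMPLETED marker as the job's final step.

    Readers treat the output directory as consistent only once the
    marker exists. Writing to a temp name and renaming avoids readers
    seeing a half-created flag (rename is atomic on POSIX filesystems).
    """
    flag = os.path.join(output_dir, "_COMPLETED")
    tmp = flag + ".tmp"
    with open(tmp, "w"):
        pass                  # empty file; its existence is the signal
    os.rename(tmp, flag)
    return flag

def is_complete(output_dir):
    """Check whether the job that wrote output_dir finished cleanly."""
    return os.path.exists(os.path.join(output_dir, "_COMPLETED"))
```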
> On Fri, 27 Sep 2019, 15:03 Gautham Acharya, < 
> <>> wrote:
>     Thanks Anil.
>     So, what you’re essentially advocating for is to use some kind of
>     Spark/compute framework (I was going to use AWS Glue) job to write
>     the ‘materialized views’ as separate tables (maybe tied together
>     with some kind of a naming convention?)
>     In this case, we’d end up with some sticky data consistency issues
>     if the write job failed halfway through (some ‘materialized view’
>     tables would be updated, and some wouldn’t). Can I use Phoenix
>     transactions to wrap the write jobs together, to make sure either
>     all the data is updated, or none?
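One common alternative to wrapping the writes in a Phoenix transaction is a client-side stage-then-promote pattern: build every 'view' in a staging area first and publish them together only if all builds succeed. The sketch below is an in-memory illustration of that pattern only; plain dicts and zero-arg builder callables stand in for the hypothetical view tables and write jobs.

```python
def publish_views(live, staged_writes):
    """All-or-nothing publish of precomputed 'views'.

    `live` maps view-name -> rows (the tables readers see).
    `staged_writes` maps view-name -> a zero-arg callable that builds
    the new rows, raising on failure. Live tables are untouched unless
    every build succeeds.
    """
    staging = {}
    for name, build in staged_writes.items():
        staging[name] = build()   # any failure aborts before touching `live`
    live.update(staging)          # promote all views together
    return live
```

With real tables the "promote" step would be something like swapping staging tables for live ones under a naming convention, which keeps readers from ever observing a partial state.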
>     --gautham
>     *From:*anil gupta [
>     <>]
>     *Sent:* Friday, September 27, 2019 6:58 AM
>     *To:* <>
>     *Subject:* Re: Materialized views in Hbase/Phoenix
>     ------------------------------------------------------------------------
>     For your use case, I would suggest creating another table that
>     stores the matrix. Since this data doesn't change that often, maybe
>     you can write a nightly Spark/MR job to update/rebuild the matrix
>     table. (If you want near real time, that is also possible with any
>     streaming system.) Have you looked into bloom filters? They might
>     help if you have a sparse dataset and are using Phoenix dynamic columns.
>     We use dynamic columns for a table that has up to 40k columns. Here
>     is the presentation and the optimizations we made for that use case:
>     <>
>     IMO, Hive integration with HBase is not fully baked and has a lot
>     of rough edges. So it's better to stick with native Phoenix/HBase if
>     you care about performance and ease of operations.
>     HTH,
>     Anil Gupta
>     On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya
>     < <>>
>     wrote:
>         Hi,
>         Currently I'm using HBase to store large, sparse matrices of
>         50,000 columns and 10+ million rows of integers.
>         This matrix is used for fast, random access - we need to be able
>         to fetch random row/column subsets, as well as entire columns.
>         We also want to very quickly fetch aggregates (Mean, median,
>         etc) on this matrix.
>         The data does not change very often for these matrices (a few
>         times a week at most), so pre-computing is very feasible here.
>         What I would like to do is maintain a column store (store the
>         column names as row keys, and a compressed list of all the row
>         values) for the use case where we select an entire column.
>         Additionally, I would like to maintain a separate table for each
>         precomputed aggregate (median table, mean table, etc).
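The precompute step described above (a column store plus one table per aggregate) can be sketched as a plain Python job. This is an illustrative sketch only; the sparse-matrix encoding and function names are assumptions of this example, and the resulting dicts stand in for the rows you would then upsert into the per-view tables.

```python
from collections import defaultdict
from statistics import mean, median

def build_views(cells, n_rows):
    """Precompute a column store and per-column aggregate tables.

    `cells` is a sparse matrix given as {(row, col): value}; absent
    cells count as 0. Returns (column_store, mean_table, median_table),
    where column_store maps each column name to its dense list of row
    values -- the layout described above, with column names as keys.
    """
    cols = defaultdict(lambda: [0] * n_rows)
    for (r, c), v in cells.items():
        cols[c][r] = v
    column_store = dict(cols)
    # One precomputed table per aggregate, keyed by column name.
    mean_table = {c: mean(vals) for c, vals in column_store.items()}
    median_table = {c: median(vals) for c, vals in column_store.items()}
    return column_store, mean_table, median_table
```

Serving a whole-column or aggregate query then becomes a single key lookup, which is what keeps the latency low.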
>         The query time for all these use cases needs to be low latency -
>         under 100ms.
>         When the data does change for a certain matrix, it would be nice
>         to easily update the optimized table. Ideally, I would like the
>         column store/aggregation tables to just be materialized views of
>         the original matrix. It doesn't look like Apache Phoenix
>         supports materialized views. It looks like Hive does, but
>         unfortunately Hive doesn't normally offer low latency queries.
>         Maybe Hive can create the materialized view, and we can just
>         query the underlying Hbase store for lower latency responses?
>         What would be a good solution for this?
>         --gautham
>     -- 
>     Thanks & Regards,
>     Anil Gupta
