phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gautham Acharya <>
Subject RE: Materialized views in Hbase/Phoenix
Date Fri, 27 Sep 2019 14:02:51 GMT
Thanks Anil.

So, what you’re essentially advocating for is to use some kind of Spark/compute framework
(I was going to use AWS Glue) job to write the ‘materialized views’ as separate tables
(maybe tied together with some kind of a naming convention?)

In this case, we’d end up with some sticky data consistency issues if the write job failed
halfway through (some ‘materialized view’ tables would be updated, and some wouldn’t).
Can I use Phoenix transactions to wrap the write jobs together, to make sure either all the
data is updated, or none?


From: anil gupta []
Sent: Friday, September 27, 2019 6:58 AM
Subject: Re: Materialized views in Hbase/Phoenix

CAUTION: This email originated from outside the Allen Institute. Please do not click links
or open attachments unless you've validated the sender and know the content is safe.
For your use case, i would suggest to create another table that stores the matrix. Since this
data doesnt change that often, maybe you can write a nightly spark/MR job to update/rebuild
the matrix table.(If you want near real time that is also possible with any streaming system)
Have you looked into bloom filters? It might help if you have sparse dataset and you are using
Phoenix dynamic columns.
We use dynamic columns for a table that has columns upto 40k. Here is the presentation and
optimizations we made for that use case:<>
IMO, Hive integration with HBase is not fully baked and it has a lot of rough edges. So, it
better to stick with native Phoenix/HBase if you care about performance and ease of operations.

Anil Gupta

On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <<>>

Currently I'm using Hbase to store large, sparse matrices of 50,000 columns 10+ million rows
of integers.

This matrix is used for fast, random access - we need to be able to fetch random row/column
subsets, as well as entire columns. We also want to very quickly fetch aggregates (Mean, median,
etc) on this matrix.

The data does not change very often for these matrices (a few times a week at most), so pre-computing
is very feasible here. What I would like to do is maintain a column store (store the column
names as row keys, and a compressed list of all the row values) for the use case where we
select an entire column. Additionally, I would like to maintain a separate table for each
precomputed aggregate (median table, mean table, etc).

The query time for all these use cases needs to be low latency - under 100ms.

When the data does change for a certain matrix, it would be nice to easily update the optimized
table. Ideally, I would like the column store/aggregation tables to just be materialized views
of the original matrix. It doesn't look like Apache Phoenix supports materialized views. It
looks like Hive does, but unfortunately Hive doesn't normally offer low latency queries.

Maybe Hive can create the materialized view, and we can just query the underlying Hbase store
for lower latency responses?

What would be a good solution for this?



Thanks & Regards,
Anil Gupta
View raw message