phoenix-user mailing list archives

From Pedro Boado <pedro.bo...@gmail.com>
Subject Re: Materialized views in Hbase/Phoenix
Date Fri, 27 Sep 2019 17:52:51 GMT
Yeah, phoenix won't aggregate billions of rows in under 100ms (probably,
nothing will).

This sounds more and more like an OLAP use case, doesn't it? A facts table
with billions of rows (still, you can handle those volumes with a shared
RDBMS) that will never be queried directly, and precomputed aggregations
to be queried interactively (maybe you could use Phoenix here, but you
could also use an RDBMS, which can additionally give you all the guarantees
you're looking for).
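
A minimal Python sketch of the "precompute, then query interactively" split
described above. It is not Phoenix or HBase code: the fact rows, the column
names (tissue, gene_a, gene_b) and the helper names are hypothetical, and a
plain dict stands in for whatever store would serve the aggregates.

```python
from collections import defaultdict
from statistics import mean, median


def precompute_aggregates(rows, group_key):
    """Batch step: group fact rows by one property, aggregate every other column."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        group = row[group_key]
        for column, value in row.items():
            if column != group_key:
                # Sparse rows are fine: only columns present in a row are collected.
                grouped[group][column].append(value)
    return {
        group: {
            column: {"mean": mean(values), "median": median(values)}
            for column, values in columns.items()
        }
        for group, columns in grouped.items()
    }


# Hypothetical fact rows; in the thread these would be billions of matrix rows.
facts = [
    {"tissue": "cortex", "gene_a": 2, "gene_b": 10},
    {"tissue": "cortex", "gene_a": 4, "gene_b": 20},
    {"tissue": "retina", "gene_a": 6, "gene_b": 30},
]

# Run once per data refresh (a few times a week in the thread's scenario).
aggregates = precompute_aggregates(facts, "tissue")


def lookup(group, column, stat):
    """Interactive step: a key lookup, with no scan over the fact rows."""
    return aggregates[group][column][stat]
```

The interactive path never touches the facts, which is what makes a tight
latency budget plausible regardless of how large the fact table grows.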

If that's the case, I don't really think HBase/Phoenix is the right choice
(it is good at doing gets by key or running scans/aggregations over
reasonable key intervals).
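
To illustrate the two access patterns named above, here is a toy Python model
of a store whose rows are kept sorted by key, as HBase keeps them sorted by
row key. The class SortedKeyStore and the sample keys are hypothetical; this
is not the HBase or Phoenix client API, only a sketch of why gets and bounded
scans are cheap while whole-table aggregations are not.

```python
import bisect


class SortedKeyStore:
    """Toy stand-in for a table whose rows are sorted by row key."""

    def __init__(self, items):
        self._values = dict(items)
        self._keys = sorted(self._values)

    def get(self, key):
        """Point lookup by key."""
        return self._values.get(key)

    def scan(self, start_key, stop_key):
        """Yield (key, value) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKeyStore({"row-001": 5, "row-002": 7, "row-010": 1, "row-100": 9})

single = store.get("row-002")
interval = list(store.scan("row-002", "row-100"))
# Aggregating over a bounded key interval only reads the rows in that interval.
interval_sum = sum(value for _, value in interval)
```

The cost of scan is proportional to the number of rows in the interval, so an
aggregation over billions of rows still has to read billions of rows.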

Maybe explaining the use case could help (we are getting more info drop by
drop in each new message in terms of volume, different query patterns
expected, concurrency, etc.). For instance, how are these hundreds of queries
interacting with the DB? Via a REST API?





On Fri, 27 Sep 2019, 17:39 Gautham Acharya, <gauthama@alleninstitute.org>
wrote:

> We are looking at being able to support hundreds of concurrent queries,
> but not too many more.
>
>
>
> Will aggregations be performant across these large datasets? (e.g. give me
> the mean value of each column when all rows are grouped by a certain row
> property).
>
>
>
> Precomputing seems much more efficient.
>
>
>
> *From:* Pedro Boado [mailto:pedro.boado@gmail.com]
> *Sent:* Friday, September 27, 2019 9:27 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
> *CAUTION:* This email originated from outside the Allen Institute. Please
> do not click links or open attachments unless you've validated the sender
> and know the content is safe.
> ------------------------------
>
> Can the aggregation be run on the fly in a Phoenix query? 100ms
> response time but... With how many concurrent queries?
>
>
>
> On Fri, 27 Sep 2019, 17:23 Gautham Acharya, <gauthama@alleninstitute.org>
> wrote:
>
> We will be reaching 100 million rows early next year, and then billions
> shortly after that. So, HBase will be needed to scale to that degree.
>
>
>
> If one of the tables fails to write, we need some kind of a rollback
> mechanism, which is why I was considering a transaction. We cannot be in a
> partial state where some of the ‘views’ are written and some aren’t.
>
>
>
>
>
> *From:* Pedro Boado [mailto:pedro.boado@gmail.com]
> *Sent:* Friday, September 27, 2019 7:22 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> For just a few million rows I would go for an RDBMS and not Phoenix / HBase.
>
>
>
> You don't really need transactions to control completion, just write a
> flag (a COMPLETED empty file, for instance) as a final step in your job.
>
>
>
>
>
>
>
> On Fri, 27 Sep 2019, 15:03 Gautham Acharya, <gauthama@alleninstitute.org>
> wrote:
>
> Thanks Anil.
>
>
>
> So, what you’re essentially advocating for is to use some kind of
> Spark/compute framework (I was going to use AWS Glue) job to write the
> ‘materialized views’ as separate tables (maybe tied together with some kind
> of a naming convention?)
>
>
>
> In this case, we’d end up with some sticky data consistency issues if the
> write job failed halfway through (some ‘materialized view’ tables would be
> updated, and some wouldn’t). Can I use Phoenix transactions to wrap the
> write jobs together, to make sure either all the data is updated, or none?
>
>
>
> --gautham
>
>
>
>
>
> *From:* anil gupta [mailto:anilgupta84@gmail.com]
> *Sent:* Friday, September 27, 2019 6:58 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> For your use case, I would suggest creating another table that stores the
> matrix. Since this data doesn't change that often, maybe you can write a
> nightly Spark/MR job to update/rebuild the matrix table. (If you want near
> real time, that is also possible with any streaming system.) Have you
> looked into bloom filters? They might help if you have a sparse dataset
> and you are using Phoenix dynamic columns.
> We use dynamic columns for a table that has up to 40k columns. Here is the
> presentation and optimizations we made for that use case:
> https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
>
> IMO, Hive integration with HBase is not fully baked and it has a lot of
> rough edges. So, it is better to stick with native Phoenix/HBase if you
> care about performance and ease of operations.
>
>
>
> HTH,
>
> Anil Gupta
>
>
>
>
>
> On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <
> gauthama@alleninstitute.org> wrote:
>
> Hi,
>
>
>
> Currently I'm using HBase to store large, sparse matrices of 50,000
> columns by 10+ million rows of integers.
>
>
>
> This matrix is used for fast, random access - we need to be able to fetch
> random row/column subsets, as well as entire columns. We also want to very
> quickly fetch aggregates (Mean, median, etc) on this matrix.
>
>
>
> The data does not change very often for these matrices (a few times a week
> at most), so pre-computing is very feasible here. What I would like to do
> is maintain a column store (store the column names as row keys, and a
> compressed list of all the row values) for the use case where we select an
> entire column. Additionally, I would like to maintain a separate table for
> each precomputed aggregate (median table, mean table, etc).
>
>
>
> The query time for all these use cases needs to be low latency - under
> 100ms.
>
>
>
> When the data does change for a certain matrix, it would be nice to easily
> update the optimized table. Ideally, I would like the column
> store/aggregation tables to just be materialized views of the original
> matrix. It doesn't look like Apache Phoenix supports materialized views. It
> looks like Hive does, but unfortunately Hive doesn't normally offer low
> latency queries.
>
>
>
> Maybe Hive can create the materialized view, and we can just query the
> underlying HBase store for lower latency responses?
>
>
>
> What would be a good solution for this?
>
>
>
> --gautham
>
>
>
>
>
>
>
>
>
>
>
> --
>
> Thanks & Regards,
> Anil Gupta
>
>

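The completion-flag idea quoted above (write an empty COMPLETED file as the
final step of the job, instead of wrapping the writes in a transaction) could
be sketched in Python as follows. This is one possible reading of that
suggestion, not something spelled out in the thread: the per-run directory
layout, the JSON files, and the names write_run and latest_complete_run are
all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

MARKER = "COMPLETED"  # empty flag file, always written last


def write_run(base_dir, run_id, views):
    """Write every precomputed view for one run, then drop the marker."""
    run_dir = Path(base_dir) / run_id
    run_dir.mkdir(parents=True)
    for name, data in views.items():
        (run_dir / f"{name}.json").write_text(json.dumps(data))
    # Reached only if every view above was written without an exception.
    (run_dir / MARKER).touch()
    return run_dir


def latest_complete_run(base_dir):
    """Return the newest run directory that has a marker, or None.

    Assumes run ids sort lexicographically in time order (e.g. ISO dates).
    """
    runs = sorted(
        p for p in Path(base_dir).iterdir()
        if p.is_dir() and (p / MARKER).exists()
    )
    return runs[-1] if runs else None


with tempfile.TemporaryDirectory() as base:
    write_run(base, "2019-09-26", {"mean": {"gene_a": 3}, "median": {"gene_a": 3}})

    # Simulate a later job that failed halfway: one view written, no marker.
    partial = Path(base) / "2019-09-27"
    partial.mkdir()
    (partial / "mean.json").write_text("{}")

    served = latest_complete_run(base).name
```

Readers only ever follow latest_complete_run, so a half-written run is simply
ignored and the previous complete set of views keeps being served.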