phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Taylor <>
Subject Re: Schema on read vs. secondary indexes
Date Thu, 27 Mar 2014 19:31:34 GMT
On Thu, Mar 27, 2014 at 9:36 AM, Otis Gospodnetic <> wrote:

> Hi,
> Sorry, more Qs.  This is exciting stuff!

No problem. Thanks for the interest!

> On Wed, Mar 26, 2014 at 2:29 PM, James Taylor <>wrote:
>> On Wed, Mar 26, 2014 at 9:17 AM, Otis Gospodnetic <
>>> wrote:
>>> Hi James,
>>> On Wed, Mar 26, 2014 at 2:15 AM, James Taylor <>wrote:
>>>> Hi Otis,
>>>> That's an excellent idea. Phoenix does support (1) & (2), but we don't
>>>> support adding a secondary index on a dynamic column. However, there's
>>>> really no reason why we couldn't - we've never had anyone ask for this. Our
>>>> mutable secondary index support is done at the HBase level, so as long as
>>>> only Puts and Deletes are done on the data, it should work fine. We'd just
>>>> need to add the syntax for dynamic column declaration for CREATE INDEX to
>>>> our grammar.
>>> Only Puts in our case.  And batch Deletes when data's TTL is reached,
>>> but I assume an external job/process would have to go through HBase rows
>>> and delete and that would automatically update any data structures and
>>> files Phoenix uses?
>> If you're using the standard batch Delete API, you'd be fine. If you're
>> relying on setting a TTL on the HBase table, then you'd need to set the
>> same TTL on the index table (and this would be somewhat dangerous, as HBase
>> might prune the data table and index table at different times).
> How about time-based partitioning instead of TTL? e.g. a table for today,
> a table for yesterday, etc. (or maybe not "table", but some sort of
> non-table partitioning mechanism?).
> I'm asking because I assume dropping tables/partitions when they are old
> enough to be removed would be faster than having HBase scan everything and
> remove based on TTL.

The TTL feature in HBase piggybacks on compaction, so there's no additional
overhead. As far as partitioning, essentially your row key in HBase is your
partitioning mechanism, so if you use a date/time column as the leading
part of your primary key in Phoenix, you've partitioned the table by time.

> And if such partitioned structure is doable at the HBase level (e.g.
> multiple tables), would that work for Phoenix?  For example, if I have
> daily partitions/tables, and I query for SELECT model, colour FROM Cars
> WHERE saleDate BETWEEN now() and now() - 10 days GROUP BY colour, would
> Phoenix be able to query across all 10 daily partitions?
>  What's the use case you have in mind? Keep in mind too, that adding
>>>> secondary indexes has an impact on write performance (from the HBase POV,
>>>> your doing two Puts instead of one and there's some cost associated with
>>>> the incremental maintenance).
>>> Think of it as an organization-wide "throw anything in and query it
>>> right away database". :)
>> +1. I like it!
>>> Ideally, I'd like to be able to have HBase that is schemaless ("throw
>>> anything in"), that is multi-tenant ("anyone at my company can start
>>> inserting their data"), where each tenant can start adding rows with their
>>> own columns ("schemaless"), and then search against any of them right away,
>>> without anyone having to run the "CREATE INDEX..." stuff.  Imagine I work
>>> in an org with such a database service and I decide I want to start
>>> tracking the number of user logins in my web application.  So I may decide
>>> to start adding rows like this:
>> Ah, this is an ideal use case for our multi-tenant support. Take a look
>> at this: So
>> you'd create a "base" multi-tenant table with the columns below. Then each
>> tenant would create a "view" over this base table and they could evolve
>> independently.
> But imagine there is no base table.  Really, imagine only something like
> "timestamp" column is required everywhere, and maybe tenantID, but all
> other columns are completely ad-hoc.  That is, somebody in the Sales
> department will want to insert rows with sales-specific columns, Marketing
> person will shove in marketing data, the IT person may throw in names and
> hardware specs and locations of their 10000 servers, the engineer may
> decide to stuff performance metrics in this, and the front-end developer
> may feed in clickstream data.  So there is no such thing as "base table",
> it's really completely open.
> Would the multi-tenant approach with a super minimal base table still
> work, even though the number of shared/base columns would be very, very
> small compared to the total number of different columns across all tenants?

Yes, that's no problem, and not uncommon. You might just have a tenantID
column (that's a requirement as the leading column) plus a date/time
column. Not sure if you'd have a userID column as well. Note that with
Phoenix, our DATE type maintains time at the granularity of milliseconds,
so assuming that's a fine enough granularity, I'd recommend that over a

>    timestamp, userId, countryOfUser, genderOfUser, userAgent,
>>> operatingSystem
>>>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
>>> operatingSystem
>>>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
>>> operatingSystem
>>>   ...
>>> And then I want to switch to some UI where I can enter:
>>>   select count(*) where userAgent like '%chrome%' and countryOfUser='FR'
>>> group by gender
>>> I assume this will work ONLY if somebody does CREATE INDEX on userAgent
>>> and countryOfUser first, right?
>> No, this will work fine with and without an index. An index will just
>> make the query execute faster (at the cost of slower writes, of course).
> Right, but now much slower are reads without an index vs. what is the cost
> of maintaining that index at write time?

That's difficult to generalize - it depends also on the filters in your
WHERE clause, as they get pushed down to HBase as well. Remember, though,
that in a multi-tenant world, you'll always be constraining the rows that
get queried by the tenantID (that's done automatically for you
under-the-covers when a tenant-specific connection is used).

> If you look at my examples above, some of them will be low volume (e.g.
> sales figures), but some will be *crazy* high volume with many thousands of
> inserts per second across hundreds or more tenants.
> So data volume will grow very very fast.
> As such, I'm guessing searching against non-indexed columns will go well
> beyond a second or two.
> But I don't have a sense of the index write cost (disk IO, CPU, disk
> space, etc.) and memory cost (I'm assuming some portion of the index is
> read into memory/heap and cached?)

Besides metadata caching, the caching is all handled by HBase (through it's
block cache).

> Are there any guidelines or anecdotes or benchmarks that would give one
> some rough ideas about this?
>  Assuming the answer is yes:
>>> 1) How much overhead does this add in terms of disk space or heap/memory
>>> or CPU used for creation/updates of this index?
>> Take a look at this:
>> and in
>> particular, checkout the presentation at the bottom that Jesse put
>> together, as that provides more technical depth. Post 3.0/4.0 release, we
>> (specifically Rajeshbabu) plan to also support an orthogonal "local index"
>> mechanism where the index data and table data are colocated on the same RS.
>> He has a nice writeup on it here:
> I had a look, thanks.
> But this makes me wonder - isn't this, at least conceptually, then very,
> very, very similar to saying "Here is my HBase table, let me hook up Lucene
> and build side-car Lucene index(es), so I can use them for queries that
> would be better handled through the index Lucene provides instead of doing
> indexless scans against the primary HBase table"?  Sure, there is no
> inverted index in the hindex or in secondary indexing, but isn't concept
> the same?
> If yes, why not just use Lucene for example?

The difference is the level at which these secondary index solutions are
integrated with HBase. The HIndex solution is going to be awesome on the
write side of the fence - basically you're not hit with any additional cost
for writes. The crux of their solution is a custom load balancer that
ensures that the index data remains colocated with the corresponding table
data. There are tons of corner cases the get right, and using Lucene, I
suspect would be difficult to get those right, b/c you'd have to integrate
at a higher level.

Our current secondary index solution works well if many more reads are done
than writes. You take no hit at read time - it's as if you had a
different/optimal row key for your query, and you're just querying the
index table instead of the data table (which is really what's happening).

> And the reason I'm asking is because we have tried using Lucene-based tech
> for the use case described here.  The problem was the maintenance of the
> index.  Lucene has the notion of segments and these segments need to be
> merged (
> nice
video showing the merges) as one indexes data, and these merges make
> it very hard to process thousands of inserts per second.

That is one fantastic blog post. Wow. I'd recommend first measuring the
insert performance of HBase to see if it's going to meet your needs. HBase
is another implementation of an LSM tree. There are only so many ways to
"skin a cat" (as Lars H likes to say), so I wouldn't expect things to be
drastically different than it was with Lucene (I say, knowing nothing about
Lucene :-) ).

> So seeing "secondary indexing" in Phoenix makes me worried that we'd see
> the same problem here.
> Although.... I suspect Phoenix's indexing is faster because it sounds like
> the extra processing is simpler - sounds like the main thing that happens
> is that "special keys" are constructed and inserted in separate HBase
> tables.  So there is no notion of inverted index, segment merges, just the
> usual puts, region splits and compactions, etc.  Right?

Yes, that's correct. The extra processing is to lookup the prior value of a
row that is being changed. This is required during index maintenance to
issue a delete of the prior index row plus to form the row key of the new
index row (as often times, for an index with multiple columns, you only
have the new column value for one of the columns, the one that changed, and
you need to find the current value for the other one as well).

Note that we do have an optimization for write-once data where you can
declare at DDL time that your table has IMMUTABLE_ROWS=true. In this case,
we do not need to do incremental index maintenance, as the same row is
never written more than once.

>>  2) I assume there is nothing to autodetect new columns being inserted,
>>> so one would have to "trap" insertion of new columns and quickly do CREATE
>>> INDEX?
>>  See above - indexes are really orthogonal to querying. They just make
>> them faster.
> Right, but when the Marketing person says "Hey, let's start tracking <some
> new type of marketing data with new columns>" then I assume something
> should first issue a CREATE TABLE or CREATE VIEW and/or CREATE INDEX
> statements first, before anything is inserted, right?  Actually, I'm
> guessing CREATE INDEX can be done at a later time, but the first two would
> have to be done before inserts?

Yes, that's correct. You can issue the CREATE VIEW or ALTER VIEW on-the-fly
(and even include the IF NOT EXISTS if you can to just ignore the case
where the table or column already exist). These DDL statements are not your
typical heavy weight RDBMS calls. They basically boil down to doing a
batched Put on our metadata table.

>>  3) How realistic is such a system in a large org with a few 10s of
>>> thousands of potential users throwing in millions of rows per day and
>>> expecting responses in a few seconds at most, for simple queries like in my
>>> above example?.  Sure, depends on the details and hardware, etc., but is
>>> CREATE INDEX on any one column really realistic?
>> There's the rub, and why we added multi-tenant support. HBase won't scale
>> past a few hundred tables (at most). In general, HBase is happier with few
>> big tables than lots of small tables. With multi-tenancy, the Phoenix
>> tables share the same physical HBase table. Then each "tenant" has his
>> "view" which is independent of other tenant's views (i.e. columns can be
>> added to one and won't affect the other). The advantage of this (beyond the
>> scaling one) is that each tenant can still introspect their own table to
>> see what columns it has. Plus you get all the nice type checking and error
>> checking that you'd want, AND a tenant can add secondary indexes if they
>> need better perf (these are shared in a second HBase table).
> Makes sense.
> I was going to ask about partitioning as a way to handle (querying
> against) large volumes of data.  This is related to my Q above about
> date-based partitioning.  But I'm wondering if one can go further.
>  Partitioning by date, partitioning by tenant, but then also partitioning
> by some other columns, which would be different for each type of data being
> inserted. e.g. for sales data maybe the partitions would be date, tenantID,
> but then also customerCountry, customerGender, etc.  For performance
> metrics data maybe it would be date, tenantID, but then also environment
> (prod vs. dev), or applicationType (e.g. my HBase cluster performance
> metrics vs. my Tomcat performance metrics), and so on.

Essentially, a secondary index is declaring a partitioning. The indexed
columns make up the row key which in HBase determines the partitioning.

> Is there any room for that in Phoenix to further speed up queries? (or
> maybe that's what Impala does?  Didn't fully read up on that yet...)

There's always room for better performance! :-) Some at the Phoenix layer,
some at the HBase layer. We work with them constantly - they're super
responsive and great to work with. Lot's going on there.

Would be great to have you contribute too if this sounds promising.

> Thank you for answering all my questions so far, I really appreciate that!
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support *
>>>> On Tue, Mar 25, 2014 at 7:58 PM, Otis Gospodnetic <
>>>>> wrote:
>>>>> Hi,
>>>>> When I saw "Schema on read" my heart jumped because I thought that
>>>>> meant:
>>>>> 1) being able to insert rows without having to define columns ahead of
>>>>> time, and
>>>>> 2) being able to query against any column in a row without having to
>>>>> know which columns one will be searching against.  For example, if a
>>>>> with "anyRandomColumn" gets added, I could run a query like select ....
>>>>> where anyRandomColumn='foo' and select that row even though I didn't
set a
>>>>> secondary index on anyRandomColumn.
>>>>> But after reading a bit about Phoenix I think Phoenix can do 1), but
>>>>> cannot do 2) -- one has to tell it which columns to build indexes.  Is
>>>>> correct?
>>>>> Thanks,
>>>>> Otis
>>>>> --
>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>> Solr & Elasticsearch Support *

View raw message