phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: Schema on read vs. secondary indexes
Date Wed, 26 Mar 2014 18:29:03 GMT
On Wed, Mar 26, 2014 at 9:17 AM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi James,
>
> On Wed, Mar 26, 2014 at 2:15 AM, James Taylor <jamestaylor@apache.org> wrote:
>
>> Hi Otis,
>> That's an excellent idea. Phoenix does support (1) & (2), but we don't
>> support adding a secondary index on a dynamic column. However, there's
>> really no reason why we couldn't - we've never had anyone ask for this. Our
>> mutable secondary index support is done at the HBase level, so as long as
>> only Puts and Deletes are done on the data, it should work fine. We'd just
>> need to add the syntax for dynamic column declaration for CREATE INDEX to
>> our grammar.
>>
>
> Only Puts in our case.  And batch Deletes when data's TTL is reached, but
> I assume an external job/process would have to go through the HBase rows,
> delete them, and that would automatically update any data structures and
> files Phoenix uses?
>

If you're using the standard batch Delete API, you'd be fine. If you're
relying on setting a TTL on the HBase table, then you'd need to set the
same TTL on the index table (and this would be somewhat dangerous, as HBase
might prune the data table and index table at different times).
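
A sketch of what I mean (untested, and all the names are just placeholders):
the TTL can be passed through as an HBase property on the Phoenix DDL, so the
data table and index table get the same value:

  -- hypothetical table; TTL (in seconds) is handed to HBase as a
  -- column family property on the underlying table
  CREATE TABLE event_log (
      event_time DATE NOT NULL,
      user_id    VARCHAR NOT NULL,
      user_agent VARCHAR
      CONSTRAINT pk PRIMARY KEY (event_time, user_id))
      TTL=604800;

  -- I believe the same property can be supplied on the index DDL;
  -- if not, set it on the index's HBase table from the HBase shell
  CREATE INDEX event_log_by_agent ON event_log (user_agent)
      TTL=604800;

Whichever way you set it, keep the two values identical, since compactions on
the two tables run independently.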


>
>
>> What's the use case you have in mind? Keep in mind too, that adding
>> secondary indexes has an impact on write performance (from the HBase POV,
>> you're doing two Puts instead of one and there's some cost associated with
>> the incremental maintenance).
>>
>
> Think of it as an organization-wide "throw anything in and query it right
> away database". :)
>

+1. I like it!

>
> Ideally, I'd like to be able to have HBase that is schemaless ("throw
> anything in"), that is multi-tenant ("anyone at my company can start
> inserting their data"), where each tenant can start adding rows with their
> own columns ("schemaless"), and then search against any of them right away,
> without anyone having to run the "CREATE INDEX..." stuff.  Imagine I work
> in an org with such a database service and I decide I want to start
> tracking the number of user logins in my web application.  So I may decide
> to start adding rows like this:
>

Ah, this is an ideal use case for our multi-tenant support. Take a look at
this: http://phoenix.incubator.apache.org/multi-tenancy.html. So you'd
create a "base" multi-tenant table with the columns below. Then each tenant
would create a "view" over this base table and they could evolve
independently.
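
Roughly (names made up, untested): you declare the base table with
MULTI_TENANT=true and a leading tenant identifier column in the primary key,
and each tenant creates its view through a tenant-specific connection:

  -- base table, created once by an admin
  CREATE TABLE event_base (
      tenant_id  VARCHAR NOT NULL,
      event_time DATE    NOT NULL,
      user_id    VARCHAR NOT NULL,
      country_of_user  VARCHAR,
      gender_of_user   VARCHAR,
      user_agent       VARCHAR,
      operating_system VARCHAR
      CONSTRAINT pk PRIMARY KEY (tenant_id, event_time, user_id))
      MULTI_TENANT=true;

  -- each tenant connects with its own TenantId
  -- (e.g. jdbc:phoenix:<zookeeper quorum>;TenantId=webteam)
  -- and creates a view over the base table
  CREATE VIEW login_events AS SELECT * FROM event_base;

The tenant never sees or supplies tenant_id (Phoenix fills it in from the
connection), and each tenant's view can then evolve its own columns.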


>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
> operatingSystem
>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
> operatingSystem
>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
> operatingSystem
>   ...
>
> And then I want to switch to some UI where I can enter:
>
>   select count(*) where userAgent like '%chrome%' and countryOfUser='FR'
> group by genderOfUser
>
> I assume this will work ONLY if somebody does CREATE INDEX on userAgent
> and countryOfUser first, right?
>

No, this will work fine with and without an index. An index will just make
the query execute faster (at the cost of slower writes, of course).
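
Against a tenant view like the sketch above, your query would just be
(hypothetical names again):

  SELECT count(*) FROM login_events
  WHERE user_agent LIKE '%chrome%' AND country_of_user = 'FR'
  GROUP BY gender_of_user;

With no index this runs as a scan over that tenant's slice of the table. If
it gets too slow, something along these lines should let the optimizer use an
index instead (the INCLUDE keeps the query covered, so it never has to go back
to the data table):

  CREATE INDEX login_by_country
      ON login_events (country_of_user, user_agent)
      INCLUDE (gender_of_user);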


>
> Assuming the answer is yes:
> 1) How much overhead does this add in terms of disk space or heap/memory
> or CPU used for creation/updates of this index?
>

Take a look at this:
http://phoenix.incubator.apache.org/secondary_indexing.html and in
particular, check out the presentation at the bottom that Jesse put
together, as that provides more technical depth. Post 3.0/4.0 release, we
(specifically Rajeshbabu) plan to also support an orthogonal "local index"
mechanism where the index data and table data are co-located on the same region server.
He has a nice writeup on it here: https://github.com/Huawei-Hadoop/hindex


> 2) I assume there is nothing to autodetect new columns being inserted, so
> one would have to "trap" insertion of new columns and quickly do CREATE
> INDEX?
>

See above - indexes are really orthogonal to querying. They just make queries
faster.
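
For example (sketch only, syntax from memory), a tenant could add a brand new
column to their view and query it immediately, with no CREATE INDEX anywhere:

  ALTER VIEW login_events ADD referrer VARCHAR;

  SELECT count(*) FROM login_events
  WHERE referrer LIKE '%twitter%';

The SELECT just does more work without an index; it doesn't need one to be
correct.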


> 3) How realistic is such a system in a large org with a few 10s of
> thousands of potential users throwing in millions of rows per day and
> expecting responses in a few seconds at most, for simple queries like in my
> above example?  Sure, it depends on the details and hardware, etc., but is
> CREATE INDEX on any one column really realistic?
>

There's the rub, and why we added multi-tenant support. HBase won't scale
past a few hundred tables (at most). In general, HBase is happier with few
big tables than lots of small tables. With multi-tenancy, the Phoenix
tables share the same physical HBase table. Then each "tenant" has their own
"view", which is independent of other tenants' views (i.e. columns can be
added to one and won't affect the other). The advantage of this (beyond the
scaling one) is that each tenant can still introspect their own table to
see what columns it has. Plus you get all the nice type checking and error
checking that you'd want, AND a tenant can add secondary indexes if they
need better perf (these are shared in a second HBase table).
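
So a single tenant who finds their scans too slow could just run (hypothetical
names):

  CREATE INDEX login_by_agent ON login_events (user_agent);

and only that tenant pays the extra write cost, while the rows for all of
these view indexes share one physical HBase index table.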

Thanks,
James

>
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
>>
>> On Tue, Mar 25, 2014 at 7:58 PM, Otis Gospodnetic <
>> otis.gospodnetic@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> When I saw "Schema on read" my heart jumped because I thought that meant:
>>>
>>> 1) being able to insert rows without having to define columns ahead of
>>> time, and
>>>
>>> 2) being able to query against any column in a row without having to
>>> know which columns one will be searching against.  For example, if a row
>>> with "anyRandomColumn" gets added, I could run a query like select ....
>>> where anyRandomColumn='foo' and select that row even though I didn't set a
>>> secondary index on anyRandomColumn.
>>>
>>> But after reading a bit about Phoenix I think Phoenix can do 1), but
>>> cannot do 2) -- one has to tell it which columns to build indexes on.  Is this
>>> correct?
>>>
>>> Thanks,
>>> Otis
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>
>
