phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: Schema on read vs. secondary indexes
Date Wed, 26 Mar 2014 16:17:44 GMT
Hi James,

On Wed, Mar 26, 2014 at 2:15 AM, James Taylor <jamestaylor@apache.org>wrote:

> Hi Otis,
> That's an excellent idea. Phoenix does support (1) & (2), but we don't
> support adding a secondary index on a dynamic column. However, there's
> really no reason why we couldn't - we've never had anyone ask for this. Our
> mutable secondary index support is done at the HBase level, so as long as
> only Puts and Deletes are done on the data, it should work fine. We'd just
> need to add the syntax for dynamic column declaration for CREATE INDEX to
> our grammar.
>

Only Puts in our case.  And batch Deletes when data's TTL is reached, but I
assume an external job/process would have to go through HBase rows and
delete and that would automatically update any data structures and files
Phoenix uses?


> What's the use case you have in mind? Keep in mind too, that adding
> secondary indexes has an impact on write performance (from the HBase POV,
> your doing two Puts instead of one and there's some cost associated with
> the incremental maintenance).
>

Think of it as an organization-wide "throw anything in and query it right
away database". :)

Ideally, I'd like to be able to have HBase that is schemaless ("throw
anything in"), that is multi-tenant ("anyone at my company can start
inserting their data"), where each tenant can start adding rows with their
own columns ("schemaless"), and then search against any of them right away,
without anyone having to run the "CREATE INDEX..." stuff.  Imagine I work
in an org with such a database service and I decide I want to start
tracking the number of user logins in my web application.  So I may decide
to start adding rows like this:

  timestamp, userId, countryOfUser, genderOfUser, userAgent, operatingSystem
  timestamp, userId, countryOfUser, genderOfUser, userAgent, operatingSystem
  timestamp, userId, countryOfUser, genderOfUser, userAgent, operatingSystem
  ...

And then I want to switch to some UI where I can enter:

  select count(*) where userAgent like '%chrome%' and countryOfUser='FR'
group by gender

I assume this will work ONLY if somebody does CREATE INDEX on userAgent and
countryOfUser first, right?

Assuming the answer is yes:
1) How much overhead does this add in terms of disk space or heap/memory or
CPU used for creation/updates of this index?
2) I assume there is nothing to autodetect new columns being inserted, so
one would have to "trap" insertion of new columns and quickly do CREATE
INDEX?
3) How realistic is such a system in a large org with a few 10s of
thousands of potential users throwing in millions of rows per day and
expecting responses in a few seconds at most, for simple queries like in my
above example?.  Sure, depends on the details and hardware, etc., but is
CREATE INDEX on any one column really realistic?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/




>
> On Tue, Mar 25, 2014 at 7:58 PM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
>> Hi,
>>
>> When I saw "Schema on read" my heart jumped because I thought that meant:
>>
>> 1) being able to insert rows without having to define columns ahead of
>> time, and
>>
>> 2) being able to query against any column in a row without having to know
>> which columns one will be searching against.  For example, if a row with
>> "anyRandomColumn" gets added, I could run a query like select .... where
>> anyRandomColumn='foo' and select that row even though I didn't set a
>> secondary index on anyRandomColumn.
>>
>> But after reading a bit about Phoenix I think Phoenix can do 1), but
>> cannot do 2) -- one has to tell it which columns to build indexes.  Is this
>> correct?
>>
>> Thanks,
>> Otis
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>

Mime
View raw message