phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: Schema on read vs. secondary indexes
Date Thu, 27 Mar 2014 16:36:09 GMT
Hi,

Sorry, more Qs.  This is exciting stuff!

On Wed, Mar 26, 2014 at 2:29 PM, James Taylor <jamestaylor@apache.org>wrote:

> On Wed, Mar 26, 2014 at 9:17 AM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
>> Hi James,
>>
>> On Wed, Mar 26, 2014 at 2:15 AM, James Taylor <jamestaylor@apache.org>wrote:
>>
>>> Hi Otis,
>>> That's an excellent idea. Phoenix does support (1) & (2), but we don't
>>> support adding a secondary index on a dynamic column. However, there's
>>> really no reason why we couldn't - we've never had anyone ask for this. Our
>>> mutable secondary index support is done at the HBase level, so as long as
>>> only Puts and Deletes are done on the data, it should work fine. We'd just
>>> need to add the syntax for dynamic column declaration for CREATE INDEX to
>>> our grammar.
>>>
>>
>> Only Puts in our case.  And batch Deletes when data's TTL is reached, but
>> I assume an external job/process would have to go through HBase rows and
>> delete and that would automatically update any data structures and files
>> Phoenix uses?
>>
>
> If you're using the standard batch Delete API, you'd be fine. If you're
> relying on setting a TTL on the HBase table, then you'd need to set the
> same TTL on the index table (and this would be somewhat dangerous, as HBase
> might prune the data table and index table at different times).
>

How about time-based partitioning instead of TTL? e.g. a table for today, a
table for yesterday, etc. (or maybe not "table", but some sort of non-table
partitioning mechanism?).
I'm asking because I assume dropping tables/partitions when they are old
enough to be removed would be faster than having HBase scan everything and
remove based on TTL.

And if such partitioned structure is doable at the HBase level (e.g.
multiple tables), would that work for Phoenix?  For example, if I have
daily partitions/tables, and I query for SELECT model, colour FROM Cars
WHERE saleDate BETWEEN now() and now() - 10 days GROUP BY colour, would
Phoenix be able to query across all 10 daily partitions?

What's the use case you have in mind? Keep in mind too, that adding
>>> secondary indexes has an impact on write performance (from the HBase POV,
>>> your doing two Puts instead of one and there's some cost associated with
>>> the incremental maintenance).
>>>
>>
>> Think of it as an organization-wide "throw anything in and query it right
>> away database". :)
>>
>
> +1. I like it!
>
>>
>> Ideally, I'd like to be able to have HBase that is schemaless ("throw
>> anything in"), that is multi-tenant ("anyone at my company can start
>> inserting their data"), where each tenant can start adding rows with their
>> own columns ("schemaless"), and then search against any of them right away,
>> without anyone having to run the "CREATE INDEX..." stuff.  Imagine I work
>> in an org with such a database service and I decide I want to start
>> tracking the number of user logins in my web application.  So I may decide
>> to start adding rows like this:
>>
>
> Ah, this is an ideal use case for our multi-tenant support. Take a look at
> this: http://phoenix.incubator.apache.org/multi-tenancy.html. So you'd
> create a "base" multi-tenant table with the columns below. Then each tenant
> would create a "view" over this base table and they could evolve
> independently.
>

But imagine there is no base table.  Really, imagine only something like
"timestamp" column is required everywhere, and maybe tenantID, but all
other columns are completely ad-hoc.  That is, somebody in the Sales
department will want to insert rows with sales-specific columns, Marketing
person will shove in marketing data, the IT person may throw in names and
hardware specs and locations of their 10000 servers, the engineer may
decide to stuff performance metrics in this, and the front-end developer
may feed in clickstream data.  So there is no such thing as "base table",
it's really completely open.

Would the multi-tenant approach with a super minimal base table still work,
even though the number of shared/base columns would be very, very small
compared to the total number of different columns across all tenants?

  timestamp, userId, countryOfUser, genderOfUser, userAgent, operatingSystem
>>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
>> operatingSystem
>>   timestamp, userId, countryOfUser, genderOfUser, userAgent,
>> operatingSystem
>>   ...
>>
>> And then I want to switch to some UI where I can enter:
>>
>>   select count(*) where userAgent like '%chrome%' and countryOfUser='FR'
>> group by gender
>>
>> I assume this will work ONLY if somebody does CREATE INDEX on userAgent
>> and countryOfUser first, right?
>>
>
> No, this will work fine with and without an index. An index will just make
> the query execute faster (at the cost of slower writes, of course).
>

Right, but now much slower are reads without an index vs. what is the cost
of maintaining that index at write time?

If you look at my examples above, some of them will be low volume (e.g.
sales figures), but some will be *crazy* high volume with many thousands of
inserts per second across hundreds or more tenants.
So data volume will grow very very fast.
As such, I'm guessing searching against non-indexed columns will go well
beyond a second or two.
But I don't have a sense of the index write cost (disk IO, CPU, disk space,
etc.) and memory cost (I'm assuming some portion of the index is read into
memory/heap and cached?)

Are there any guidelines or anecdotes or benchmarks that would give one
some rough ideas about this?

Assuming the answer is yes:
>> 1) How much overhead does this add in terms of disk space or heap/memory
>> or CPU used for creation/updates of this index?
>>
>
> Take a look at this:
> http://phoenix.incubator.apache.org/secondary_indexing.html and in
> particular, checkout the presentation at the bottom that Jesse put
> together, as that provides more technical depth. Post 3.0/4.0 release, we
> (specifically Rajeshbabu) plan to also support an orthogonal "local index"
> mechanism where the index data and table data are colocated on the same RS.
> He has a nice writeup on it here: https://github.com/Huawei-Hadoop/hindex
>

I had a look, thanks.
But this makes me wonder - isn't this, at least conceptually, then very,
very, very similar to saying "Here is my HBase table, let me hook up Lucene
and build side-car Lucene index(es), so I can use them for queries that
would be better handled through the index Lucene provides instead of doing
indexless scans against the primary HBase table"?  Sure, there is no
inverted index in the hindex or in secondary indexing, but isn't concept
the same?
If yes, why not just use Lucene for example?

And the reason I'm asking is because we have tried using Lucene-based tech
for the use case described here.  The problem was the maintenance of the
index.  Lucene has the notion of segments and these segments need to be
merged (
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html-
nice video showing the merges) as one indexes data, and these merges
make
it very hard to process thousands of inserts per second.

So seeing "secondary indexing" in Phoenix makes me worried that we'd see
the same problem here.
Although.... I suspect Phoenix's indexing is faster because it sounds like
the extra processing is simpler - sounds like the main thing that happens
is that "special keys" are constructed and inserted in separate HBase
tables.  So there is no notion of inverted index, segment merges, just the
usual puts, region splits and compactions, etc.  Right?


> 2) I assume there is nothing to autodetect new columns being inserted, so
>> one would have to "trap" insertion of new columns and quickly do CREATE
>> INDEX?
>>
>
> See above - indexes are really orthogonal to querying. They just make them
> faster.
>

Right, but when the Marketing person says "Hey, let's start tracking <some
new type of marketing data with new columns>" then I assume something
should first issue a CREATE TABLE or CREATE VIEW and/or CREATE INDEX
statements first, before anything is inserted, right?  Actually, I'm
guessing CREATE INDEX can be done at a later time, but the first two would
have to be done before inserts?


> 3) How realistic is such a system in a large org with a few 10s of
>> thousands of potential users throwing in millions of rows per day and
>> expecting responses in a few seconds at most, for simple queries like in my
>> above example?.  Sure, depends on the details and hardware, etc., but is
>> CREATE INDEX on any one column really realistic?
>>
>
> There's the rub, and why we added multi-tenant support. HBase won't scale
> past a few hundred tables (at most). In general, HBase is happier with few
> big tables than lots of small tables. With multi-tenancy, the Phoenix
> tables share the same physical HBase table. Then each "tenant" has his
> "view" which is independent of other tenant's views (i.e. columns can be
> added to one and won't affect the other). The advantage of this (beyond the
> scaling one) is that each tenant can still introspect their own table to
> see what columns it has. Plus you get all the nice type checking and error
> checking that you'd want, AND a tenant can add secondary indexes if they
> need better perf (these are shared in a second HBase table).
>

Makes sense.

I was going to ask about partitioning as a way to handle (querying against)
large volumes of data.  This is related to my Q above about date-based
partitioning.  But I'm wondering if one can go further.  Partitioning by
date, partitioning by tenant, but then also partitioning by some other
columns, which would be different for each type of data being inserted.
e.g. for sales data maybe the partitions would be date, tenantID, but then
also customerCountry, customerGender, etc.  For performance metrics data
maybe it would be date, tenantID, but then also environment (prod vs. dev),
or applicationType (e.g. my HBase cluster performance metrics vs. my Tomcat
performance metrics), and so on.

Is there any room for that in Phoenix to further speed up queries? (or
maybe that's what Impala does?  Didn't fully read up on that yet...)

Thank you for answering all my questions so far, I really appreciate that!

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/




>
>>
>>
>>>
>>> On Tue, Mar 25, 2014 at 7:58 PM, Otis Gospodnetic <
>>> otis.gospodnetic@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> When I saw "Schema on read" my heart jumped because I thought that
>>>> meant:
>>>>
>>>> 1) being able to insert rows without having to define columns ahead of
>>>> time, and
>>>>
>>>> 2) being able to query against any column in a row without having to
>>>> know which columns one will be searching against.  For example, if a row
>>>> with "anyRandomColumn" gets added, I could run a query like select ....
>>>> where anyRandomColumn='foo' and select that row even though I didn't set
a
>>>> secondary index on anyRandomColumn.
>>>>
>>>> But after reading a bit about Phoenix I think Phoenix can do 1), but
>>>> cannot do 2) -- one has to tell it which columns to build indexes.  Is this
>>>> correct?
>>>>
>>>> Thanks,
>>>> Otis
>>>> --
>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>
>>>>
>>>
>>
>

Mime
View raw message