phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marcin Januszkiewicz <katamaran...@gmail.com>
Subject Re: SELECT + ORDER BY vs self-join
Date Tue, 31 Oct 2017 08:02:16 GMT
Created https://issues.apache.org/jira/browse/PHOENIX-4336

On Tue, Oct 31, 2017 at 8:44 AM, Marcin Januszkiewicz
<katamaran666@gmail.com> wrote:
> Sorry, I forgot to include the DDL for the index:
> CREATE LOCAL INDEX ix_0 ON traces (UPPER(number)) INCLUDE (time, +
> some other columns used for filtering).
> I'm going to file a JIRA with the details, but local indexes are being
> used in both cases.
> My suspicion is that the first query will attempt to load all 30M rows
> of data from HBase and then sort it, and the second will only sort 30M
> (rowkey, time) pairs.
>
> On Tue, Oct 31, 2017 at 8:02 AM, Sergey Soldatov
> <sergeysoldatov@gmail.com> wrote:
>> I agree with James that this happens because the index was not involved
>> because it doesn't cover all columns. I believe that in the second case, the
>> RHT  is using the index to create a list of rowkeys and they are used for
>> point lookups by skipscan.
>>
>> bq. When is using the self-join a worse choice than the simple select?
>>
>> Hash join has it's own limitations:
>> 1.  RHT is supposed to be small, so it's better to keep LIMIT small (far
>> less than 30 mil).
>> 2.  Client is always involved to collect data from RHT, build the hash join
>> cache and send it to all RSes.
>>
>>
>> bq. Is there a better way to construct this query?
>>
>> Using local index may help in this case.
>>
>> Thanks,
>> Sergey
>>
>>
>> On Mon, Oct 30, 2017 at 11:26 PM, James Taylor <jamestaylor@apache.org>
>> wrote:
>>>
>>> Please file a JIRA and include the explain plan for each of the queries. I
>>> suspect your index is not being used in the first query due to the selection
>>> of all the columns. You can try hinting the query to force your index to be
>>> used. See https://phoenix.apache.org/secondary_indexing.html#Index_Usage
>>>
>>> Thanks,
>>> James
>>>
>>> On Mon, Oct 30, 2017 at 7:02 AM, Marcin Januszkiewicz
>>> <januszkiewicz.marcin@gmail.com> wrote:
>>>>
>>>> We have a wide table with 100M records created with the following DDL:
>>>>
>>>> CREATE TABLE traces (
>>>>   rowkey VARCHAR PRIMARY KEY,
>>>>   time VARCHAR,
>>>>   number VARCHAR,
>>>>   +40 more columns)
>>>>
>>>> We want to select a large (~30M records) subset of this data with the
>>>> query:
>>>>
>>>> SELECT *all columns*
>>>>   FROM traces
>>>>   WHERE (UPPER(number) LIKE 'PO %')
>>>>   ORDER BY time DESC, ROWKEY
>>>>   LIMIT 101;
>>>>
>>>> This times out after 15 minutes and puts a huge load on our cluster.
>>>> We have an alternate way of selecting this data:
>>>>
>>>> SELECT t.rowkey, *all columns*
>>>> FROM TRACES t
>>>> JOIN (
>>>>   SELECT rowkey
>>>>   FROM TRACES
>>>>   WHERE (UPPER(number) LIKE 'PO %')
>>>>   ORDER BY time DESC, ROWKEY
>>>>   LIMIT 101
>>>> ) ix
>>>> ON t.ROWKEY = ix.ROWKEY
>>>> order by t.ROWKEY;
>>>>
>>>> Which completes in just under a minute.
>>>> Is there a better way to construct this query?
>>>> When is using the self-join a worse choice than the simple select?
>>>> Given that we have a functional index on UPPER(number), could this
>>>> potentially be a statistics-based optimizer decision?
>>>>
>>>> --
>>>> Pozdrawiam,
>>>> Marcin Januszkiewicz
>>>
>>>
>>

Mime
View raw message