phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maryann Xue <maryann....@gmail.com>
Subject Re: Phoenix query performance
Date Wed, 22 Feb 2017 21:35:31 GMT
Hi Pradheep,

Thank you for the answers! Please see my response inline.


On Wed, Feb 22, 2017 at 12:39 PM, Pradheep Shanmugam <
Pradheep.Shanmugam@infor.com> wrote:

> Hi Maryann
>
> Please find my answers inline.
>
> Thanks,
> Pradheep
>
> From: Maryann Xue <maryann.xue@gmail.com>
> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Date: Wednesday, February 22, 2017 at 2:22 PM
> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Subject: Re: Phoenix query performance
>
> Hi Pradheep,
>
> Thank you for posting the query and the log file! There are two things
> going on on the server side at the same time here. I think it'd be a good
> idea to isolate the problem first. So a few questions:
> 1. When you say data size went from "< 1M" to 30M, did the data from both
> LHS and RHS grow proportionately?
> *It is basically the same table..the query is like a self join..yes, you
> can say that it is proportional.*
>

Sorry that I didn't read the query well enough. Just went through it again,
but it looks to me that instead of a LEFT OUTER join, you are actually
trying to do an ANTI join, is that correct? i.e.,
SELECT PARENTID
FROM msbo_phoenix_comp_rowkey msbo1
WHERE <other_lhs_conditions> AND NOT EXISTS (
        SELECT 1 FROM msbo_phoenix_comp_rowkey
        WHERE <other_rhs_conditions> AND PARENTID = msbo1. PARENTID)
If the query can be rewritten to an ANTI join, the join operation can be
more efficient.


> 2. If yes to (1), what if we only increase the data in LHS, but keep it
> small for RHS? Would the query run significantly faster?
> *When RHS count is 420336, time taken is 37 seconds*
> *When RHS count is 63575, time taken is 32 seconds (not a significant
> difference)*
> 3. What if we only do group by on LHS? Would the query time be linear to
> the data size?
> After Removing group by on RHS
> *When RHS count is 420336, time taken is 34 seconds*
> *When RHS count is 63575, time taken is 32 seconds*
>

Just to confirm, are you saying that you removed GROUP BY and kept the
JOIN, and it's taking a long time? One more question, how long would it
take to further remove the JOIN?


> 4. How was GC when running the query?
> *About 12ms in 1 RS, 10ms in 1 RS, 4-5ms in couple of them and less than
> 1ms in the rest of the region servers when the query is runnning.*
>
> Thanks,
> Maryann
>
> On Wed, Feb 22, 2017 at 9:28 AM, Pradheep Shanmugam <
> Pradheep.Shanmugam@infor.com> wrote:
>
>> Hi,
>>
>> We have a hbase cluster with 8 region servers with 20G memory
>> We have a table  with 1 column family along with a secondary index.
>> Following query took only few milliseconds when we had less data(< 1
>> million)
>> After adding more data(~30M rows) the performance declined and took about
>> a minute or more(not stable)
>>
>> select msbo1.PARENTID
>>   from msbo_phoenix_comp_rowkey msbo1
>>   left outer join (
>>          select PARENTID,MILESTONETYPEID
>>            from msbo_phoenix_comp_rowkey
>>           where PARENTREFERENCETIME between 1479964000 and 1480464000
>>             and OWNERORGID = 100
>>             and PARENTTYPE = 'SHIPMENT'
>>             and MILESTONETYPEID = 19661
>>             group by PARENTID,MILESTONETYPEID
>>              ) msbo2
>>   on msbo1.PARENTID = msbo2.PARENTID
>>   where msbo1.PARENTTYPE = 'SHIPMENT'
>>    and msbo1.OWNERORGID = 100
>>    and msbo2.MILESTONETYPEID is null
>>    and msbo1.PARENTREFERENCETIME between 1479964000 and 1480464000
>> group by msbo1.PARENTID
>>   order by msbo1.PARENTID
>>
>> The RHS return about a 500K rows ..LHS about 18M rows…final result about
>> 500K rows
>>
>> MSBO_PHOENIX_COMP_ROWKEY_INDEX_PARENTTYPE_OWNERORGID_MILESTONETYPEID_PARENTREFERENCETIME
>>  is the index
>>
>> *Query plan:*
>> CLIENT 8-CHUNK PARALLEL 8-WAY RANGE SCAN OVER
>> MSBO_PHOENIX_COMP_ROWKEY_INDEX_PARENTTYPE_OWNERORGID_MILESTONETYPEID_PARENTREFERENCETIME
>> [0,'SHIPMENT',100]
>>     SERVER FILTER BY FIRST KEY ONLY AND (TO_UNSIGNED_LONG("PARENTREFERENCETIME")
>> >= 1477958400 AND TO_UNSIGNED_LONG("PARENTREFERENCETIME") <= 1480464000)
>>     SERVER AGGREGATE INTO DISTINCT ROWS BY ["MSBO1.:PARENTID"]
>> CLIENT MERGE SORT
>>     PARALLEL LEFT-JOIN TABLE 0
>>         CLIENT 8-CHUNK PARALLEL 8-WAY RANGE SCAN OVER
>> MSBO_PHOENIX_COMP_ROWKEY_INDEX_PARENTTYPE_OWNERORGID_MILESTONETYPEID_PARENTREFERENCETIME
>> [0,'SHIPMENT',100,19661,1,477,958,400] - [0,'SHIPMENT',100,19661,1,480,
>> 464,000]
>>             SERVER FILTER BY FIRST KEY ONLY
>>             SERVER AGGREGATE INTO DISTINCT ROWS BY ["PARENTID",
>> "MILESTONETYPEID"]
>>         CLIENT MERGE SORT
>>     AFTER-JOIN SERVER FILTER BY MSBO2.MILESTONETYPEID IS NULL
>>
>> Attached the phoenix log.
>> I see the caching to set as 100..and "maxResultSize”:2097152..is that
>> something that can be tuned will help?
>> Is that the client merge sort consuming more time can be improved? Is
>> there any other tuning possible?
>>
>> Thanks,
>> Pradheep
>>
>
>

Mime
View raw message