phoenix-user mailing list archives

From Josh Elser <els...@apache.org>
Subject Re: Large differences in query execution time for similar queries
Date Mon, 22 Apr 2019 15:58:06 GMT
Further, I'd try to implement James' suggestions _not_ using the Phoenix 
Query Server. Remember that the thin-client uses PQS, adding a level of 
indirection and re-serialization.

By using the "thick" driver, you can avoid this overhead, which will help
you get repeatable test results with fewer moving parts.
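
As a rough sketch, the two connection modes look like this (host names and
ports are placeholders for your environment):

  # thin client: all traffic goes through the Phoenix Query Server
  ./bin/sqlline-thin.py http://pqs-host:8765

  # thick client: talks directly to HBase via the ZooKeeper quorum
  ./bin/sqlline.py zk-host:2181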

On 4/17/19 11:30 AM, James Taylor wrote:
> Hi Hieu,
> You could try adding the /*+ SERIAL */ hint to see if that has any impact.
> Also, have you tried not salting the table? The SALT_BUCKETS value of
> 128 is pretty high.
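> 
> For example, a quick sketch against your view (untested; the unsalted
> table name below is just a placeholder):
> 
> SELECT /*+ SERIAL */ "pk", "doubleCol", "intCol"
> FROM "extended"
> WHERE "doubleCol" > 100
> LIMIT 10000;
> 
> -- unsalted copy of the table, for comparison
> CREATE TABLE "metadata_unsalted" (
>    "pk" VARCHAR PRIMARY KEY
> )
> COLUMN_ENCODED_BYTES = 0,
> BLOOMFILTER = 'ROWCOL',
> COMPRESSION = 'GZ';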
> 
> For the other issue, do you have a lot of deleted cells? If so, try
> running a major compaction. You might also try adding a secondary index
> on "doubleCol" if that's a common query pattern.
> 
> Thanks,
> James
> 
> On Thu, Apr 11, 2019 at 5:44 PM Hieu Nguyen <hieu@box.com> wrote:
> 
>     Hi,
> 
>     I am using Phoenix 4.14-cdh5.11, with sqlline-thin as the client.  I
>     am seeing strange patterns around SELECT query execution time:
>     1. Increasing the LIMIT past a certain "threshold" results in
>     significantly slower execution time.
>     2. Adding just one column (BIGINT) to the SELECT results in
>     significantly slower execution time.
> 
>     This is our schema (names are changed for readability):
>     CREATE TABLE "metadata" (
>        "pk"                           VARCHAR PRIMARY KEY
>     )
>     SALT_BUCKETS = 128,
>     COLUMN_ENCODED_BYTES = 0,
>     BLOOMFILTER = 'ROWCOL',
>     COMPRESSION = 'GZ';
> 
>     CREATE VIEW "extended" (
>     "doubleCol" DOUBLE,
>     "intCol" BIGINT,
>     "intCol2" BIGINT,
>     "intCol3" BIGINT,
>     "stringCol" VARCHAR,
>     "stringCol2" VARCHAR,
>     "stringCol3" VARCHAR,
>     "stringCol4" VARCHAR,
>     "stringCol5" VARCHAR,
>     "stringCol6" VARCHAR,
>     "stringCol7" VARCHAR,
>     ) AS SELECT * FROM "metadata"
> 
>     We have other views created that also select from "metadata" that
>     define their own columns.  Overall, there are 1 million rows in this
>     table, and 20k rows match the condition "doubleCol" > 100.
> 
>     Base query:
>     SELECT
>     "pk","doubleCol","intCol","intCol2","stringCol","stringCol2","intCol3"
>     FROM "templatealldatattype-7d55c5a6-efe3-419d-9bce-9fea7c14f8bc"
>     WHERE "doubleCol" > 100
>     LIMIT 10000
>     -> 1.976 seconds
> 
>     Decreasing LIMIT to 9500 (only 5% decrease in number of rows):
>     SELECT
>     "pk","doubleCol","intCol","intCol2","stringCol","stringCol2","intCol3"
>     FROM "templatealldatattype-7d55c5a6-efe3-419d-9bce-9fea7c14f8bc"
>     WHERE "doubleCol" > 100
>     LIMIT 9500
>     -> 0.409 seconds
> 
>     Removing "intCol3" from SELECT, keeping LIMIT at 10000:
>     SELECT "pk","doubleCol","intCol","intCol2","stringCol","stringCol2"
>     FROM "templatealldatattype-7d55c5a6-efe3-419d-9bce-9fea7c14f8bc"
>     WHERE "doubleCol" > 100
>     LIMIT 10000
>     -> 0.339 seconds
> 
>     I ran each of these queries a few times in a row.  There was some
>     small variation in execution time, but the 2nd and 3rd queries were
>     never slower than the 1st query.
> 
>     The EXPLAIN plan did not change across the queries, except for the
>     ROW LIMIT value when explaining the 2nd query (9500 instead of 10000):
>     +----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
>     |                                                PLAN                                                | EST_BYTES_READ  | EST_ROWS_READ  |  EST_INFO_TS   |
>     +----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
>     | CLIENT 128-CHUNK 382226 ROWS 314572800 BYTES PARALLEL 128-WAY ROUND ROBIN FULL SCAN OVER metadata  | 314572800       | 382226         | 1554973434637  |
>     |     SERVER FILTER BY "doubleCol" > 100.0                                                           | 314572800       | 382226         | 1554973434637  |
>     |     SERVER 10000 ROW LIMIT                                                                         | 314572800       | 382226         | 1554973434637  |
>     | CLIENT 10000 ROW LIMIT                                                                             | 314572800       | 382226         | 1554973434637  |
>     +----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
> 
>     I tried adding the SEEK_TO_COLUMN and NO_SEEK_TO_COLUMN hints as
>     suggested in a similar thread
>     (https://lists.apache.org/thread.html/4ef8384ecd31f30fdaf5837e3abc613142426d899e916c7aae4a46d4@%3Cuser.phoenix.apache.org%3E),
>     but they had no effect.
> 
>     Any pointers on how we can investigate the 4-5x slowdown when
>     increasing the LIMIT by only ~5%, or when selecting just one more
>     BIGINT column?  Could we have exceeded some threshold in the result
>     size that causes the query to perform much more slowly for seemingly
>     small changes to the query?
> 
>     Thanks,
>     -Hieu
> 
