phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Perko, Ralph J" <Ralph.Pe...@pnnl.gov>
Subject Re: Query performance question
Date Mon, 15 Dec 2014 19:46:50 GMT
My apologies for the delay.  I had to switch clusters and use a smaller dataset.  The discrepancy
still exists but the numbers are a little different:

I ran the same queries as in the original email (below)

Total records: 581M

Simple query based on secondary index value used in the subselect:
1747 recs – 0.256 sec

Subselect query:  177s

Join query: 179s

The answers to your questions are below.

1. What is the Primary Key definition of your BULK_TABLE?

CONSTRAINT pkey PRIMARY KEY (file_id,recnum)

2. How many (approximately) distinct "file_id" values are there in the BULK_TABLE? (If you
don't know for sure, you can just run a query to find out).

"select count(distinct(file_id)) from BULK_TABLE” - 7902 - 92s

"select distinct(file_id) from BULK_TABLE”  - returns in 25m

3. How long does it take to run a full-scan query on BULK_TABLE, like "select * from BULK_TABLE”?

Results began returning after about 25min

4. How long does it take to run a full-scan join query on BULK_TABLE, like "select * from
BULK_TABLE join (select file_id, recnum from BULK_TABLE) as SS on BULK_TABLE.file_id = SS.file_id
and BULK_TABLE.recnum = SS.recnum”?

The full-scan join fails with a MaxServerCacheSizeExceededException  - server cache set to
1G.

Custom hbase/phoenix settings are attached.

Thanks,
Ralph


From: Maryann Xue <maryann.xue@gmail.com<mailto:maryann.xue@gmail.com>>
Reply-To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>" <user@phoenix.apache.org<mailto:user@phoenix.apache.org>>
Date: Friday, December 12, 2014 at 8:07 AM
To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>" <user@phoenix.apache.org<mailto:user@phoenix.apache.org>>
Subject: Re: Query performance question

Hi Ralph,

Thanks for the question!
According to the "explain" result you got, the optimization worked exactly as expected with
this query:

"DYNAMIC SERVER FILTER BY FILE_ID IN (SS.FILE_ID)" means a skip-scan instead of a full-scan
over BULK_TABLE will be executed at runtime based on the values of "file_id" it got from the
inner query.

So I need to know a few more things:
1. What is the Primary Key definition of your BULK_TABLE?
2. How many (approximately) distinct "file_id" values are there in the BULK_TABLE? (If you
don't know for sure, you can just run a query to find out).
3. How long does it take to run a full-scan query on BULK_TABLE, like "select * from BULK_TABLE"?
4. How long does it take to run a full-scan join query on BULK_TABLE, like "select * from
BULK_TABLE join (select file_id, recnum from BULK_TABLE) as SS on BULK_TABLE.file_id = SS.file_id
and BULK_TABLE.recnum = SS.recnum"?


Thanks,
Maryann


On Thu, Dec 11, 2014 at 6:28 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov<mailto:Ralph.Perko@pnnl.gov>>
wrote:
Hi,

Thanks for all your help thus far with Phoenix.

I am trying to understand the best way to construct a query that returns all the fields from
a table but still takes advantage of a single field secondary index.   I have a table with
upwards of 50 fields and do not wish to index them all but the use case exists to return them
all.

My general approach is to first select the records I want using an indexed field then use
the returned pk values to get the entire record in the form of a subselect or join.

The initial select executes very fast, sub-second , returning close to 3000 records.  When
used as a subselect or join the entire query takes very long (over 15min)  or does not return.
  Based on the processing plans it appears there is a lot more going on than just a simple
look-up of the values returned in the subselect.  Is there a way to do this using Phoenix
SQL syntax?

Any suggestions are appreciated.

Initial indexed query (very fast):

SELECT file_id,recnum
FROM BULK_TABLE
WHERE saddr IN (ip1,ip2,ip3))

file_id and recnum make up the primary key

Plan:
| CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076]
- [9,1,000,142,114] |
| CLIENT MERGE SORT |

Used as a subselect (times out):

SELECT * FROM BULK_TABLE
WHERE (file_id,recnum) IN(SELECT file_id,recnum
             FROM BULK_TABLE
      WHERE saddr IN (ip1,ip2,ip3));
Plan:
| CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
| CLIENT MERGE SORT |
|     PARALLEL SEMI-JOIN TABLE 0 (SKIP MERGE) |
|         CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076]
- [9,1,000,142,114] |
|             SERVER AGGREGATE INTO DISTINCT ROWS BY [RECNUM, FILE_ID] |
|         CLIENT MERGE SORT |
|     DYNAMIC SERVER FILTER BY FILE_ID IN ($1.$3) |

Another approach using using a join instead:

SELECT *
FROM BULK_TABLE
JOIN
    (SELECT file_id, recnum
     FROM BULK_TABLE
     WHERE saddr in (ip1,ip2,ip3)) AS SS
ON BULK_TABLE.file_id = SS.file_id AND BULK_TABLE.recnum = SS.recnum;

Runs faster but still can take about 15min

Plan:

| CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
| CLIENT MERGE SORT |
|     PARALLEL INNER-JOIN TABLE 0 |
|         CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076]
- [9,1,000,142,114] |
|         CLIENT MERGE SORT |
|     DYNAMIC SERVER FILTER BY FILE_ID IN (SS.FILE_ID) |


Is there a more efficient way to run a query such as this?

Thanks!
Ralph







--
Thanks,
Maryann

Mime
View raw message