Hi Ralph,

Thanks for the question!
According to the "explain" output you got, the optimization worked exactly as expected for the join version of your query:

"DYNAMIC SERVER FILTER BY FILE_ID IN (SS.FILE_ID)" means a skip-scan instead of a full-scan over BULK_TABLE will be executed at runtime based on the values of "file_id" it got from the inner query.

So I need to know a few more things:
1. What is the Primary Key definition of your BULK_TABLE?
2. How many distinct "file_id" values (approximately) are there in BULK_TABLE? (If you don't know offhand, you can just run a query to find out; see the sketch after this list.)
3. How long does it take to run a full-scan query on BULK_TABLE, like "select * from BULK_TABLE"?
4. How long does it take to run a full-scan join query on BULK_TABLE, like "select * from BULK_TABLE join (select file_id, recnum from BULK_TABLE) as SS on BULK_TABLE.file_id = SS.file_id and BULK_TABLE.recnum = SS.recnum"?
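
For question 2, a rough count can be obtained with a simple aggregate; this is just a sketch (it will scan the table itself, so it may take a while on a large one):

SELECT COUNT(DISTINCT file_id) FROM BULK_TABLE;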


Thanks,
Maryann


On Thu, Dec 11, 2014 at 6:28 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov> wrote:
Hi,

Thanks for all your help thus far with Phoenix. 

I am trying to understand the best way to construct a query that returns all the fields from a table while still taking advantage of a single-field secondary index. I have a table with upwards of 50 fields and do not wish to index them all, but I do need to be able to return them all.

My general approach is to first select the records I want using an indexed field, then use the returned primary-key values to fetch the entire records via a subselect or join.

The initial select executes very fast (sub-second), returning close to 3000 records. When used as a subselect or join, the entire query takes very long (over 15 minutes) or does not return at all. Based on the query plans, it appears there is a lot more going on than a simple lookup of the values returned by the subselect. Is there a way to do this using Phoenix SQL syntax?

Any suggestions are appreciated.

Initial indexed query (very fast):

SELECT file_id, recnum
FROM BULK_TABLE
WHERE saddr IN (ip1, ip2, ip3);


file_id and recnum make up the primary key.
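
For context, a minimal sketch of what the DDL might look like, given that file_id and recnum form the primary key and saddr is indexed by the BULK_TABLE_SADDR_IDX secondary index (the column types here are assumptions):

CREATE TABLE BULK_TABLE (
    file_id UNSIGNED_INT NOT NULL,    -- assumed type
    recnum  UNSIGNED_LONG NOT NULL,   -- assumed type
    saddr   VARCHAR,                  -- assumed type; indexed below
    -- ... roughly 50 fields in total ...
    CONSTRAINT pk PRIMARY KEY (file_id, recnum)
);

CREATE INDEX BULK_TABLE_SADDR_IDX ON BULK_TABLE (saddr);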


Plan:

| CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076] - [9,1,000,142,114] |
| CLIENT MERGE SORT |


Used as a subselect (times out):

SELECT * FROM BULK_TABLE
WHERE (file_id, recnum) IN (SELECT file_id, recnum
                            FROM BULK_TABLE
                            WHERE saddr IN (ip1, ip2, ip3));

Plan:

| CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
| CLIENT MERGE SORT |
|     PARALLEL SEMI-JOIN TABLE 0 (SKIP MERGE) |
|         CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076] - [9,1,000,142,114] |
|             SERVER AGGREGATE INTO DISTINCT ROWS BY [RECNUM, FILE_ID] |
|         CLIENT MERGE SORT |
|     DYNAMIC SERVER FILTER BY FILE_ID IN ($1.$3) |


Another approach, using a join instead:


SELECT *
FROM BULK_TABLE
JOIN (SELECT file_id, recnum
      FROM BULK_TABLE
      WHERE saddr IN (ip1, ip2, ip3)) AS SS
ON BULK_TABLE.file_id = SS.file_id AND BULK_TABLE.recnum = SS.recnum;


Runs faster, but can still take about 15 minutes.


Plan:

| CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
| CLIENT MERGE SORT |
|     PARALLEL INNER-JOIN TABLE 0 |
|         CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076] - [9,1,000,142,114] |
|         CLIENT MERGE SORT |
|     DYNAMIC SERVER FILTER BY FILE_ID IN (SS.FILE_ID) |



Is there a more efficient way to run a query such as this?

Thanks!
Ralph






