phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Taylor <jamestay...@apache.org>
Subject Re: Skip Scan
Date Mon, 05 Oct 2015 18:37:40 GMT
Hi Sumit,

No, it doesn't matter whether the values in an IN clause are sorted or not
- we'll sort them internally to ensure they are. We've run perf tests in
the past where we had 250K values in an IN clause over a 1B row data set,
and performance was still good. Take a look at this blog for more detail:
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html

Thanks,
James

On Mon, Oct 5, 2015 at 2:08 AM, James Heather <james.heather@mendeley.com>
wrote:

> I'll leave someone else to comment on the Phoenix specifics.
>
> I recall from some experiments on MySQL that if you have a massive load of
> IDs to pass, it's quicker if you split them into batches of some reasonable
> (but still large) size, and that for this, you would want to sort them
> first. I don't think it made any difference whether the IDs were sorted
> within an individual SQL statement, but you want to split into batches that
> cover disjoint ranges, so the easiest is to sort the whole lot first and
> then split.
>
> This might be MySQL-specific, though. I think each query was being turned
> into a range scan, from the lowest ID in the IN clause to the highest,
> which was why it was useful to get the ranges disjoint and not too huge.
>
> James
>
>
> On 05/10/15 09:49, Sumit Nigam wrote:
>
> Hi,
>
> Would it make any difference if I were to pass non-sorted IDs (secondary
> indexed) to a huge IN clause? I assume that skip scan optimization would
> work in either case.
>
> Also, can any one let me know if there is some limit to beyond how many
> such IDs in a large IN clause do I get into diminishing returns? Or is it
> plainly dependent on specific workloads and memory of region servers?
>
> Thanks,
> Sumit
>
>
>

Mime
View raw message