phoenix-user mailing list archives

From Yohan Bismuth <yohan.bismu...@gmail.com>
Subject Re: Phoenix table scan performance
Date Mon, 09 Mar 2015 17:52:22 GMT
I've been facing this issue for a long time, so I'm pretty sure a major
compaction has already occurred.
Running your query returns 27006.

I have run UPDATE STATISTICS on my table, but this didn't solve my problem.
If I understand correctly, these guideposts are used to parallelize a scan
within a region, not across regions of the same regionserver, is that right?
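
For scale, the guidepost total can be turned into a per-region average. This is only a rough sketch: it assumes the ~1000 regions mentioned in the original message and an even spread of guideposts, and the function name is made up for illustration.

```python
def guideposts_per_region(total_guideposts: int, region_count: int) -> float:
    """Average number of guideposts per region, assuming an even spread."""
    return total_guideposts / region_count


# 27006 is the value returned by the SYSTEM.STATS query above;
# 1000 is the approximate region count from the original message.
avg = guideposts_per_region(27006, 1000)
print(f"about {avg:.0f} guideposts per region on average")
```

So each region should already be split into a few dozen chunks, which is why the stats alone do not seem to explain the behavior.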

On Mon, Mar 9, 2015 at 6:45 PM, James Taylor <jamestaylor@apache.org> wrote:

> Hi Yohan,
> Have you done a major compaction on your table and are stats generated
> for your table? You can run this to confirm:
> SELECT sum(guide_posts_count) from SYSTEM.STATS where
> physical_name=<your full table name>;
>
> Phoenix does intra-region parallelization based on these guideposts as
> described briefly here:
> http://phoenix.apache.org/update_statistics.html
>
> Thanks,
> James
>
> On Mon, Mar 9, 2015 at 10:35 AM, Jerry <chilinglam@gmail.com> wrote:
> > Hi Yohan,
> >
> > I think your observation is correct. A scan in HBase is sequential by
> > default unless you use something like HBASE-10502.
> >
> > Best Regards,
> >
> > Jerry
> >
> > Sent from my iPad
> >
> > On Mar 9, 2015, at 1:01 PM, Yohan Bismuth <yohan.bismuth1@gmail.com> wrote:
> >
> > Hello,
> > we're currently using Phoenix 4.2 with HBase 0.98.6 from CDH5.3.2 on our
> > cluster and we're experiencing some perf issues.
> >
> > What we need to do is a full table scan over 1 billion rows. We've got 50
> > regionservers and approximately 1000 regions of 1 GB each, equally
> > distributed across these rs (which means ~20 regions per rs). Each node
> > has 14 disks and 12 cores.
> >
> > A simple "Select count(1) from table" is currently taking 400~500 sec.
> >
> > We noticed that a range scan over 2 regions located on 2 different rs
> > seems to be done in parallel (taking 15~20 sec), but a range scan over 2
> > regions of a single rs takes twice as long (about 30~40 sec). We see the
> > same result with more than 2 regions.
> >
> > Could this mean that parallelization is done at the regionserver level
> > but not at the region level? In that case, 400~500 seconds seems
> > plausible with 20~25 regions per rs. We expected regions of a single rs
> > to be scanned in parallel. Is this normal behavior, or are we doing
> > something wrong?
> >
> > Thanks for your help
>
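
The estimate in the original message can be checked with a small back-of-envelope model. This is a sketch under stated assumptions, not a description of how Phoenix actually schedules scans: it assumes regions on one regionserver are scanned strictly one after another, that all regionservers work in parallel, and that one region takes the 15~20 sec observed for the two-regionserver range scan. The function name is hypothetical.

```python
def serial_scan_estimate(regions_per_rs: int, seconds_per_region: float) -> float:
    """Estimated full-scan time if each regionserver scans its regions serially.

    Regionservers are assumed to run in parallel, so the total is driven by
    the number of regions hosted on a single regionserver.
    """
    return regions_per_rs * seconds_per_region


# Figures taken from the thread: 20~25 regions per rs, 15~20 sec per region.
for regions in (20, 25):
    for seconds in (15, 20):
        total = serial_scan_estimate(regions, seconds)
        print(f"{regions} regions x {seconds} sec = {total} sec")
```

Under these assumptions the model gives between 300 and 500 seconds, which brackets the observed 400~500 sec and is consistent with the suspicion that regions of a single rs are not being scanned in parallel.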
