phoenix-user mailing list archives

From "Brady, John" <>
Subject RE: Phoenix table scan performance
Date Mon, 09 Mar 2015 17:13:48 GMT
Hi Yohan,

Apologies, I don’t have an answer to your question.

Could I ask a separate question please? Is your cluster on AWS?

I have Apache Phoenix installed on a 5-node cluster with 3 ZooKeeper nodes on AWS, also using
Phoenix 4.2 with HBase 0.98.6 from CDH 5.3.2. I put the Phoenix server and client jars in
the HBase classpath on all nodes and restarted the cluster. The Phoenix command line works
on the cluster, and running a JDBC app on the cluster returns data.

The problem is that I can’t run a JDBC app outside the cluster.

I've read at the link below that there is an issue on AWS where internal and external IPs
get confused and ZooKeeper can't connect to HBase properly. Did you have this problem?

As suggested in the link, I worked around this by creating aliases in /etc/hosts on the machines
in the cluster pointing at the internal IP addresses, then using the same aliases on my local
desktop but pointing at the external IPs. I then altered my cluster setup to use the aliases
everywhere instead of IP addresses. With that, I could run the app on my local machine. But
modifying the Cloudera config files to point to aliases on the servers ultimately breaks Cloudera,
so it isn't a viable long-term solution.
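
For reference, the client-side half of that workaround can be sketched in Java as below. This is only an illustration under stated assumptions: the aliases zk1, zk2 and zk3 are hypothetical names, 2181 is the usual ZooKeeper client port, and /hbase is assumed to be the znode parent. The class and method names are made up for this sketch. It builds a Phoenix JDBC URL of the form jdbc:phoenix:<quorum>:<port>:<znode> and checks whether each alias resolves from the machine running the app.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Phoenix: builds the JDBC URL and
// checks that the aliases used in it resolve on the client machine.
final class PhoenixClientCheck {

    // Phoenix JDBC URLs have the form jdbc:phoenix:<zk quorum>:<port>:<znode parent>.
    static String buildUrl(List<String> zkHosts, int zkPort, String znodeParent) {
        return "jdbc:phoenix:" + String.join(",", zkHosts) + ":" + zkPort + ":" + znodeParent;
    }

    // True if the local resolver (including /etc/hosts) knows this name.
    static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed aliases; on the desktop they would map to the external IPs.
        List<String> aliases = Arrays.asList("zk1", "zk2", "zk3");
        for (String host : aliases) {
            System.out.println(host + " resolves: " + resolves(host));
        }
        System.out.println(buildUrl(aliases, 2181, "/hbase"));
    }
}
```

The resulting string would be passed to DriverManager.getConnection with the Phoenix client jar on the classpath. The region server hostnames that HBase reports back to the client would need to resolve in the same way.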


From: Yohan Bismuth []
Sent: Monday, March 09, 2015 5:02 PM
Subject: Phoenix table scan performance

We're currently using Phoenix 4.2 with HBase 0.98.6 from CDH 5.3.2 on our cluster and we're
experiencing some performance issues.

What we need to do is a full table scan over 1 billion rows. We've got 50 regionservers and
approximately 1000 regions of 1 GB each, equally distributed across these rs (which means ~20
regions per rs). Each node has 14 disks and 12 cores.

A simple "Select count(1) from table" is currently taking 400~500 sec.

We noticed that a range scan over 2 regions located on 2 different rs seems to be done in
parallel (taking 15~20 sec) but a range scan over 2 regions of a single rs is taking twice
this time (about 30~40 sec). We experience the same result with more than 2 regions.

Could this mean that parallelization is done at the regionserver level but not at the region
level? In that case 400~500 seconds seems legit with 20~25 regions per rs. We expected regions
of a single rs to be scanned in parallel. Is this normal behavior, or are we doing something
wrong?
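
The arithmetic behind that estimate can be written out as a rough Java sketch. It assumes, as a hypothesis only, that regions on one rs are scanned strictly one after another and that each region takes the 15~20 sec observed for the two-regionserver range scan. The class and method names are made up for illustration.

```java
// Back-of-envelope estimate only; assumes fully serial region scans per rs.
final class ScanEstimate {

    // Time for one rs to scan its regions one at a time.
    static int serialSeconds(int regionsPerServer, int secondsPerRegion) {
        return regionsPerServer * secondsPerRegion;
    }

    public static void main(String[] args) {
        // Lower bound: 20 regions per rs at 15 sec each.
        System.out.println("low:  " + serialSeconds(20, 15) + " sec");
        // Upper bound: 25 regions per rs at 20 sec each.
        System.out.println("high: " + serialSeconds(25, 20) + " sec");
    }
}
```

Under these assumptions the range is 300 to 500 sec, which brackets the observed 400~500 sec for the full count.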

Thanks for your help