Please ignore the current thread as I forgot to set email subject :(

Please use the new one with the correct subject: "Slow performance on PRIMARY KEY Queries compared to HBase API"

Adrianos Dadis.

On Thu, Apr 7, 2016 at 6:20 PM, Adrianos Dadis <adddev@gmail.com> wrote:
Hello all,

as part of a larger project, we are evaluating Phoenix performance compared to pure HBase API.

Demo cluster have 6 slave nodes (plus required nodes for additional helper nodes) and all nodes are real machines (not VMs). Cluster is running Cloudera 5.4.5 distribution (Hadoop, HBase, etc.). 
Phoenix distribution (4.5.2 with some changes by CDH) is provided by Cloudera too, but it has some minor alterations in order to be compatible with CDH 5.4.5.

Our code use phoenix version required by CDH.
<dependency>
      <groupId>org.apache.phoenix</groupId>
      <artifactId>phoenix-core</artifactId>
      <version>4.5.2-cdh5.4.5</version>
</dependency>

For Phoenix and Cloudera details please check this: https://blog.cloudera.com/blog/2015/11/new-apache-phoenix-4-5-2-package-from-cloudera-labs/


When running our test, we saw a considerably slower performance on the select query on row key. 

Our test scenario is the following:
1) WRITE to HBase 4.000.000 rows using Phoenix
2) When WRITE is completed, then FETCH random (existed) ROWKEYs from HBase (using Phoenix). Code that run FETCH is running on the same nodes with HBase region servers (using Apache Storm streaming engine).


The CREATE statement used on my example and the select query code are below:
CREATE TABLE IF NOT EXISTS TA (ID varchar not null primary key, TA1 varchar, TA2 varchar, TA3 varchar, TA4 varchar, ta5 varchar) SPLIT ON ('666833', '1333666', '2000500','2667333', '3334166');

The only code related to Phoenix is the following:

public void findById(String id) {
  Connection conn = DriverManager.getConnection(databaseURL);
  PreparedStatement ps = null;
  ResultSet rs = null;
  try {
    ps = conn.prepareStatement("select TA1,TA2,TA4 from TA where ID = ?");
    ps.setString(1, id);
    rs = ps.executeQuery();
    while (rs.next()) {
      LOG.debug("Fetched row OK");
    }         
  } catch (SQLException ex) {
    LOG.error("Could not fetch data for Id {}", Id, ex);
  } finally {
    rs.close();
    ps.close();
    conn.close();
  }
}

The explain query (explain select TA1,TA2,TA4 from TA where ID = '10') returns the following plan:
| CLIENT 1-CHUNK PARALLEL 1-WAY POINT LOOKUP ON 1 KEY OVER TA |

We have used metrics to get the mean execution time of each part of previous code:
- ps.executeQuery(): 0.3ms
- rs.next(): 1.2ms

The query returns only one row and it seems that the slow part is rs.next(). 
The total execution of the same query with HBase API is about 0.3ms , making phoenix 5 times slower.

Just to say that the upsert query of a row (not shown in code) takes about the same time on both implementations (Phoenix and HBase API).

Is it normal or there is something that I should improve?

Thanks,
Adrianos Dadis.