It'd be helpful to see the table DDL and the query too along with an idea of how many regions might be involved in the query. If a query is a commonly run query, usually you'll design the row key around optimizing it. If you have other, simpler queries that have determined your row key, then another alternative is to add one or more secondary indexes. Another common technique is to denormalize your data in ways that precompute the join to avoid having to do it at run time.
With joins, make sure to order your tables from post filtered largest (on LHS) to smallest (on RHS). Also, if you're joining on the PK of both tables, you should use the USE_SORT_MERGE_JOIN hint. Another common tuning exercise is around determining the best parallelization to use (i.e. guidepost width) or even disabling parallelization for more than an entire region's worth of data.
It'd also be interesting to see the raw HBase code for a query of this complexity.