Anchal,
HBase orders data based on your row key. This allows efficient range scans over data. If your data isn't ordered, then range scans become prohibitively expensive. Phoenix leverages this property of HBase to allow queries to run fast. 

In the case of data serialized with the Bytes.toBytes(long) method, negative numbers would sort ahead of positive numbers, so if you ran a query WHERE value > -10 and value < 10 you wouldn't get the correct results.

In your case, the data is outside of the row key, as the value of Cell, so the above doesn't affect it as much (though there are still optimizations Phoenix does which rely on the serialized bytes matching the natural sort order to prevent having to deserialize the bytes frequently). I'd suggest filing a JIRA to support negative values for Bytes.toBytes() serialized values in Cells - we welcome patches.

Alternatives I can think of would be to
- serialize the data differently where ever it is being serialized (you can use our PDataType to do that).
- use a VARBINARY in Phoenix and create a UDF to deserialize the value.

FWIW, HBase introduced a new type system that fixes this sort order issue - you can read about it here: https://blogs.apache.org/hbase/entry/data_types_schema

Thanks,
James

On Wed, Jul 29, 2015 at 6:23 PM, Anchal Agrawal <anchal@yahoo-inc.com> wrote:
Hi James,

Thanks for your reply. I don't understand the issue fully - do HBase's Bytes.toBytes() methods not have the same sort order as that of Phoenix? I'd really appreciate it if you could give more insight on this. Their documentation doesn't mention the sort order. If negative numbers sort ahead of the positive numbers, why is that incompatible with Phoenix?

It's interesting because it seems that we can't have columns in Phoenix views/tables where the values have (negative) long values. In my setup, it is not feasible to create a new Phoenix table and copy over the data because the table is very large and we'd need to recreate the Phoenix table with updated data every time we want to run queries.

Is it feasible to write a UDF (for a SELECT statement) that converts the bytearray in that column to a long value? If it is, would I use the Tuple object that's passed in to the evaluate() method to get the values in that column? I tried using Tuple's getValue() method to grab the bytearray in the column, but I'm running into issues. I'm looking at ToNumberFunction for reference.

I really appreciate your help.

- Anchal


On Wednesday, July 29, 2015 8:43 AM, James Taylor <jamestaylor@apache.org> wrote:


Hi Anchal,
Phoenix depends on the sort order of the serialized bytes to match the natural sort order of the column value. The HBase Bytes.toBytes() methods do not meet this requirement, as negative numbers will sort ahead of positive numbers. About the only option you have in this case is to create a new Phoenix table and copy the data over from your old table. If the data is being created by some external process, then you'd need to change it to use the PDataType toBytes() method instead of the HBase Bytes.toBytes() method.

It's possible that Phoenix could relax this constraint for columns that are not part of the primary key constraint - please file a JIRA for this. We'd need to define a new PDataType (it could share almost all of it's implementation with PUnsignedLong) and handle ORDER BY differently for these types.

Thanks,
James

On Tue, Jul 28, 2015 at 10:40 PM, Anchal Agrawal <anchal@yahoo-inc.com> wrote:
Hi,

I'm creating a Phoenix view of an existing HBase table on v4.4.0.

Command: CREATE VIEW "table_name" (pk VARBINARY PRIMARY KEY, "cf"."col" DATA_TYPE_HERE);

The col column has long values that are serialized by Bytes.toBytes(long) but since some values are negative, I can't use UNSIGNED_LONG. I tried BIGINT instead since the documentation says that it maps to java.lang.Long, but that resulted in incorrect column values. The datatype documentation for UNSIGNED_LONG says "use the regular signed type instead" - which datatype is this referring to? LONG isn't supported.

I could create the view with the column values as bytearrays and write a UDF to extract long values, but I think that will add to the latency. Is there a way around this? I really appreciate your help.

Sincerely,
Anchal Agrawal