phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Craig Roberts <craig.robe...@frogasia.com>
Subject Re: Duplicate Records Showing in Apache Phoenix
Date Fri, 28 Oct 2016 05:52:38 GMT
Hi all,

I've managed to fix this by manually dropping the SYSTEM.* tables and
restarting HBase/Phoenix. I did a full reset for this, so I'm not quite
sure if I would have been able to keep my data and just drop the SYSTEM
tables, but when I get some spare time I'll see if I can get a set of
reproducible steps for this.

Thanks for the responses :)

*Craig Roberts*
*Senior Developer*

*FrogAsia Sdn Bhd (A YTL Company) *| Unit 9, Level 2, D6 at Sentul East |
801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter
<http://www.twitter.com/FrogAsia> | Facebook
<http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>

*This message (including any attachments) is for the use of the addressee
only. It may contain private proprietary or legally privileged statements
and information. No confidentiality or privilege is waived or lost by any
mistransmission. If you are not the intended recipient, please immediately
delete it and all copies of it from your system, destroy any hard copies of
it and notify the sender. You must not, directly or indirectly, use,
disclose, distribute, print, copy or rely on any part of the message if you
are not the intended recipient. Any views expressed in this message
(including any attachments) are those of the individual sender and not
those of any member of the YTL Group, except where the message states
otherwise and the sender is authorized to state them to be the views of any
such entity.*

On Fri, Oct 28, 2016 at 12:15 PM, Craig Roberts <craig.roberts@frogasia.com>
wrote:

> Hi Brian,
>
> Thanks for the response.
>
> I've managed to pin it down a bit more precisely.
>
> An SQL query by "uuid" in the Phoenix sqlline.py script (JDBC):
>
> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "uuid" =
> 'c21bcfaf-2a2f-4776-9de6-7e25608baf06';
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
> |               school_code                |
> code                   |        iso_8601         |
> uuid                   |
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
> | WRA0004                                  |
> site.view                                | 2016-02-18 16:00:00.000 |
> c21bcfaf-2a2f-4776-9de6-7e25608baf06     |
> | WRA0004                                  |
> site.view                                | 2016-02-18 16:00:00.000 |
> c21bcfaf-2a2f-4776-9de6-7e25608baf06     |
> +------------------------------------------+----------------
> --------------------------+-------------------------+-------
> -----------------------------------+
>
> And a HBASE scan for the RowKey:
>
> base(main):001:0> get 'ANALYTICS', "WRA0004\x00site.view\x00\x80\
> x00\x01R\xF5\x1C\x04\x00c21bcfaf-2a2f-4776-9de6-7e25608baf06"
> COLUMN                                                CELL
>  events:_0
> timestamp=1477549465155, value=
>  events:auth.name
> timestamp=1477549465155, value=NURUL296389@Yes.My
>  events:auth.role
> timestamp=1477549465155, value=User
>  events:auth.uuid
> timestamp=1477549465155, value=23B1BEAB200055F315DE1F30
> 3EA4C00141F9704C0A51A527
>  events:date
> timestamp=1477549465155, value=2016-02-19 00:00:00
>  events:migrated
> timestamp=1477549465155, value=\x01
>  events:site_uuid
> timestamp=1477549465155, value=2BAC0082200254FE67D6CF0C
> 5E675D052A3F53BC00260F67
>  events:type
> timestamp=1477549465155, value=site
>  events:user_uuid
> timestamp=1477549465155, value=23B1BEAB200055F315DE1F30
> 3EA4C00141F9704C0A51A527
>
> We've been checking, and re-checking - but HBase is adamant it has 745,281
> rows, while Phoenix is adamant it has 1,527,352...
>
> Is there any situation where Phoenix and HBase row counts will differ?
> Some of our records will be similar, but should have different UUIDs. Does
> Phoenix/HBase de-duplication occur at any point that might cause this?
>
> Thanks for any help you guys can offer,
>
> *Craig Roberts*
> *Senior Developer*
>
> *FrogAsia Sdn Bhd (A YTL Company) *| Unit 9, Level 2, D6 at Sentul East |
> 801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter
> <http://www.twitter.com/FrogAsia> | Facebook
> <http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>
>
> *This message (including any attachments) is for the use of the addressee
> only. It may contain private proprietary or legally privileged statements
> and information. No confidentiality or privilege is waived or lost by any
> mistransmission. If you are not the intended recipient, please immediately
> delete it and all copies of it from your system, destroy any hard copies of
> it and notify the sender. You must not, directly or indirectly, use,
> disclose, distribute, print, copy or rely on any part of the message if you
> are not the intended recipient. Any views expressed in this message
> (including any attachments) are those of the individual sender and not
> those of any member of the YTL Group, except where the message states
> otherwise and the sender is authorized to state them to be the views of any
> such entity.*
>
> On Wed, Oct 26, 2016 at 6:58 PM, Brian Jeltema <bdjeltema@gmail.com>
> wrote:
>
>> I suspect that both rows exist and that they are different,  but you
>> can’t see the difference due to
>> some non-printing character. Did you try doing the query using JDBC or a
>> direct HBase scan?
>>
>> Brian
>>
>> On Oct 26, 2016, at 6:20 AM, Craig Roberts <craig.roberts@frogasia.com>
>> wrote:
>>
>> Hi all,
>>
>> I have a strange issue with Phoenix/HBase (Phoenix 4.4 and HBase 1.1 in
>> HDP 2.4).
>>
>> This is my table DDL:
>>
>> CREATE TABLE IF NOT EXISTS ANALYTICS(
>>   "school_code" VARCHAR NOT NULL,
>>   "code" VARCHAR NOT NULL,
>>   "iso_8601" DATE NOT NULL,
>>   "uuid" VARCHAR NOT NULL,
>>   -- Used to ensure the "events" column family exists
>>   "events"."___" VARCHAR,
>>   CONSTRAINT id PRIMARY KEY ("school_code", "code", "iso_8601", "uuid")
>> )
>>
>> I then load events into the table using a custom Flume sink.
>>
>>
>> 0: jdbc:phoenix:10.124.12.191> SELECT COUNT(*) FROM ANALYTICS;
>> +------------------------------------------+
>> |                 COUNT(1)                 |
>> +------------------------------------------+
>> | 1248515                                  |
>> +------------------------------------------+
>> 1 row selected (2.415 seconds)
>> 0: jdbc:phoenix:10.124.12.191> SELECT COUNT(DISTINCT("uuid")) FROM
>> ANALYTICS;
>> +------------------------------------------+
>> |          DISTINCT_COUNT("uuid")          |
>> +------------------------------------------+
>> | 773373                                   |
>> +------------------------------------------+
>> 1 row selected (5.751 seconds)
>>
>> Now, this is only showing up with lots of events, and I have almost 1
>> million events.
>>
>> We've managed to pin down an errant event UUID. The primary key is meant
>> to ensure that repeated events (Flume's at-least-once guarantees) are
>> overwritten, rather than duplicated (due to UPSERT). Here's some example
>> queries for the errant UUID:
>>
>> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE "uuid" =
>> '00011783-c870-463c-9c01-4e530a57714f';
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>> |               school_code                |
>> code                   |        iso_8601         |
>> uuid                   |
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>> | WRA0004                                  |
>> site.view                                | 2016-03-21 16:00:00.000 |
>> 00011783-c870-463c-9c01-4e530a57714f     |
>>
>> | WRA0004                                  |
>> site.view                                | 2016-03-21 16:00:00.000 |
>> 00011783-c870-463c-9c01-4e530a57714f     |
>>
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>>
>>
>> Which shows two events with what *should* be a duplicate primary key.
>>
>> If I pin it down by including all the primary key fields, however, I get
>> one result:
>>
>> 0: jdbc:phoenix:10.124.12.191> SELECT * FROM ANALYTICS WHERE
>> "school_code" = 'WRA0004' AND "code" = 'site.view' AND "iso_8601" =
>> TO_DATE('2016-03-21 16:00:00') AND "uuid" = '00011783-c870-463c-9c01-4e530
>> a57714f';
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>> |               school_code                |
>> code                   |        iso_8601         |
>> uuid                   |
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>> | WRA0004                                  |
>> site.view                                | 2016-03-21 16:00:00.000 |
>> 00011783-c870-463c-9c01-4e530a57714f     |
>> +------------------------------------------+----------------
>> --------------------------+-------------------------+-------
>> -----------------------------------+
>>
>> Finally, hbase shell is showing me the correct number of records:
>>
>> hbase(main):001:0> count 'ANALYTICS'
>> # lots of output
>> 773376 row(s) in 142.3770 seconds
>>
>> Now, I'll admit, I would have expected this to be "impossible".
>>
>> Does anybody have any idea how Phoenix can show me two primary keys with
>> a generic query, but only one for a specific query? And how I might fix
>> this?
>>
>> Thanks in advance,
>> *Craig Roberts*
>> *Senior Developer*
>>
>> *FrogAsia Sdn Bhd (A YTL Company) *| Unit 9, Level 2, D6 at Sentul East
>> | 801, Jalan Sentul, 51000 Kuala Lumpur | 01125618093 | Twitter
>> <http://www.twitter.com/FrogAsia> | Facebook
>> <http://www.facebook.com/FrogAsia> | Website <http://www.frogasia.com/>
>>
>> *This message (including any attachments) is for the use of the addressee
>> only. It may contain private proprietary or legally privileged statements
>> and information. No confidentiality or privilege is waived or lost by any
>> mistransmission. If you are not the intended recipient, please immediately
>> delete it and all copies of it from your system, destroy any hard copies of
>> it and notify the sender. You must not, directly or indirectly, use,
>> disclose, distribute, print, copy or rely on any part of the message if you
>> are not the intended recipient. Any views expressed in this message
>> (including any attachments) are those of the individual sender and not
>> those of any member of the YTL Group, except where the message states
>> otherwise and the sender is authorized to state them to be the views of any
>> such entity.*
>>
>>
>>
>

Mime
View raw message