phoenix-user mailing list archives

From Mujtaba Chohan <mujt...@apache.org>
Subject Re: missing rows after using performance.py
Date Tue, 08 Sep 2015 21:16:40 GMT
Thanks James. Filed https://issues.apache.org/jira/browse/PHOENIX-2240.

On Tue, Sep 8, 2015 at 12:38 PM, James Heather <james.heather@mendeley.com>
wrote:

> Thanks.
>
> I've discovered that the cause is even simpler. With 100M rows, you get
> collisions in the primary key in the generated CSV file. An experiment
> (capturing the CSV file and counting the distinct primary keys) reveals
> that the number of distinct primary keys is about 500 short of the full
> 100M. So the upserting is working as it should!
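
For anyone wanting to repeat that check, the counting step could look
something like the sketch below. The file name and the key-column indices
are placeholders; point them at however the CSV was captured and at the
columns that actually make up the table's primary key.

    # Sketch: count total vs. distinct primary keys in a captured CSV.
    # 'data.csv' and PK_COLUMNS are placeholders, not what performance.py
    # produces by default. Note that holding 100M keys in a set needs
    # several GB of RAM; running 'sort -u' over the key columns is a
    # lower-memory alternative.
    import csv

    PK_COLUMNS = (0, 1, 2, 3)  # placeholder indices of the PK columns

    seen = set()
    total = 0
    with open('data.csv', newline='') as f:
        for row in csv.reader(f):
            total += 1
            seen.add(tuple(row[i] for i in PK_COLUMNS))

    print('total rows:     %d' % total)
    print('distinct keys:  %d' % len(seen))
    print('duplicate keys: %d' % (total - len(seen)))
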
>
> I don't know if there's a way round this, but it does produce rather
> suspicious-looking results. It might be worth having the program emit a
> warning to this effect when the row count is large, or finding a way to
> increase the entropy in the generated primary keys to ensure that there
> won't be collisions.
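
As a rough sanity check on the ~500 figure: if each row key were drawn
independently and uniformly from a space of K possible values, the expected
number of duplicates among N draws is N - K*(1 - (1 - 1/K)^N), which is
approximately N^2/(2K) when N is much smaller than K. The key-space sizes
below are purely illustrative (the real space depends on which columns the
script randomises), but they show that roughly 500 duplicates out of 100M
draws is what a key space on the order of 10^13 values would give you.

    import math

    def expected_duplicates(n, k):
        """Expected number of duplicate keys when drawing n keys uniformly
        at random from a space of k possible values."""
        # E[distinct] = k * (1 - (1 - 1/k)**n); log1p/expm1 keep the result
        # accurate when 1/k is tiny.
        distinct = k * -math.expm1(n * math.log1p(-1.0 / k))
        return n - distinct

    n = 10**8  # 100 million generated rows
    for k in (10**12, 10**13, 10**14):  # illustrative key-space sizes only
        print('key space %.0e -> about %.0f expected duplicates'
              % (k, expected_duplicates(n, k)))
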
>
> It's a bit surprising no one has run into this before! Hopefully this
> script has been run on that many rows before... it seems a reasonable
> number for testing performance of a scalable database... (in fact I was
> planning to increase the row count somewhat).
>
> James
>
>
> On 08/09/15 20:16, James Taylor wrote:
>
> Hi James,
> Looks like you'll currently get an error log message if a row cannot be
> imported (usually because the data isn't compatible with the schema). For
> psql.py this would appear in the client-side log, and the messages look
> like this:
>             LOG.error("Error upserting record {}: {}", csvRecord, errorMessage);
>
> FWIW, we have a "strict" option for CSV loading (the -s or --strict
> option) which is meant to make the load abort if bad data is found, but it
> doesn't look like the option is actually checked when bad data is
> encountered. I've filed PHOENIX-2239 for this.
>
> Thanks,
> James
>
> On Tue, Sep 8, 2015 at 11:26 AM, James Heather <james.heather@mendeley.com> wrote:
>
>> I've had another go at running the performance.py script to upsert
>> 100,000,000 rows into a Phoenix table, and again I've ended up with around
>> 500 rows missing.
>>
>> Can anyone explain this, or reproduce it?
>>
>> It is rather concerning: I'm reluctant to use Phoenix if I'm not sure
>> whether rows will be silently dropped.
>>
>> James
>>
>
>
>
