phoenix-user mailing list archives

From Ravi Kiran <maghamraviki...@gmail.com>
Subject Re: Pig vs Bulk Load record count
Date Wed, 04 Feb 2015 00:49:54 GMT
Hi Ralph,

   Also, can you please attach the schema to the JIRA as well, using
"DESCRIBE Z", since you don't explicitly specify the data types for the
columns.

Regards
Ravi

On Tue, Feb 3, 2015 at 4:47 PM, Ravi Kiran <maghamravikiran@gmail.com>
wrote:

> Hi Ralph,
>
>    Also, can you please attach the schema to the JIRA as well, using
> "DESCRIBE Z"
>
> Regards
> Ravi
>
> On Tue, Feb 3, 2015 at 4:39 PM, James Taylor <jamestaylor@apache.org>
> wrote:
>
>> Glad to hear it, Ralph. Still sounds like there's a bug here (or at a
>> minimum a usability issue), but not a showstopper for the 4.3 release.
>> Would you mind filing a JIRA for it?
>> Thanks,
>> James
>>
>> On Tue, Feb 3, 2015 at 4:31 PM, Ravi Kiran <maghamravikiran@gmail.com>
>> wrote:
>> > Hi Ralph,
>> >
>> >    Glad it is working!!
>> >
>> > Regards
>> > Ravi
>> >
>> > On Tue, Feb 3, 2015 at 3:29 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
>> wrote:
>> >>
>> >> I have solved the problem.  This was a mystery because the same data
>> >> loaded into the same schema gave conflicting counts depending on the
>> >> load technique.  While the data itself had no duplicate keys, the
>> >> behavior suggested something was wrong with the keys (for instance, the
>> >> MR input / output record counts were correct for both load techniques).
>> >> I confirmed this by creating a Pig UDF that generated a UUID for each
>> >> row to use as the PK.  When I ran that test, every row appeared as
>> >> expected and I got the correct count.  But I couldn't figure out why
>> >> the actual data would behave differently, since it was also unique.  My
>> >> Pig script could hardly be simpler, with no transformations; it is a
>> >> simple load and store.  This ended up being the issue!
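[Editor's note] A UUID-per-row Pig UDF like the one Ralph describes can be sketched as a Python (Jython) script; the module and function names below are illustrative assumptions, not his actual UDF:

```python
import uuid

try:
    # Provided by Pig when the script is registered "using jython"
    from pig_util import outputSchema
except ImportError:
    # Stand-in decorator so the module also runs outside of Pig
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

@outputSchema('pk:chararray')
def row_uuid(*fields):
    """Return a fresh UUID string to serve as a synthetic primary key."""
    return str(uuid.uuid4())
```

It would be registered in the Pig script with something like `register 'udfs.py' using jython as udfs;` and applied in a `FOREACH ... GENERATE` before the STORE.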
>> >>
>> >> Solution:
>> >> Assign the correct Pig data type to the PK values rather than letting
>> >> Pig figure it out.  I am not sure what the exact underlying issue is,
>> >> but this fixed it (perhaps when Pig coerced the values to the data type
>> >> it thought best, it munged them somehow).
>> >>
>> >> Changes to the Pig script from below:
>> >>
>> >> Z = load '$data' USING PigStorage(',') as (
>> >>   file_name:chararray,
>> >>   rec_num:int,
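[Editor's note] Extending that change to the full schema would look something like the sketch below. Only file_name:chararray and rec_num:int are confirmed in the thread; the remaining types are assumptions for illustration and would need to be checked against the Phoenix table (DESCRIBE Z / the CREATE TABLE statement):

```pig
-- Sketch: explicit types on at least the PK columns, instead of
-- letting Pig default everything to bytearray.
Z = load '$data' USING PigStorage(',') as (
  file_name:chararray,  -- confirmed in the thread
  rec_num:int,          -- confirmed in the thread
  epoch_time:long,      -- assumed type
  timet:chararray,      -- assumed type
  site:chararray,       -- assumed type
  -- ... remaining columns, each typed to match the Phoenix table ...
  category:chararray);
```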
>> >>
>> >> Thanks for the help
>> >>
>> >> Ralph
>> >>
>> >>
>> >> From: "Ciureanu, Constantin (GfK)" <Constantin.Ciureanu@gfk.com>
>> >> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> >> Date: Tuesday, February 3, 2015 at 1:52 AM
>> >> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> >> Subject: RE: Pig vs Bulk Load record count
>> >>
>> >> Hello Ralph,
>> >>
>> >>
>> >>
>> >> Try to check whether the Pig script produces keys that overlap (that
>> >> would explain the reduced number of rows).
>> >>
>> >>
>> >>
>> >> Good luck,
>> >>
>> >>    Constantin
>> >>
>> >>
>> >>
>> >> From: Ravi Kiran [mailto:maghamravikiran@gmail.com]
>> >> Sent: Tuesday, February 03, 2015 2:42 AM
>> >> To: user@phoenix.apache.org
>> >> Subject: Re: Pig vs Bulk Load record count
>> >>
>> >>
>> >>
>> >> Thanks Ralph. I will try to reproduce this on my end with a sample data
>> >> set and get back to you.
>> >>
>> >> Regards
>> >>
>> >> Ravi
>> >>
>> >>
>> >>
>> >> On Mon, Feb 2, 2015 at 5:27 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
>> >> wrote:
>> >>
>> >> Ravi,
>> >>
>> >>
>> >>
>> >> The create statement is attached.  You will see some additional fields
>> >> I excluded from the first email.
>> >>
>> >>
>> >>
>> >> Thanks!
>> >>
>> >> Ralph
>> >>
>> >>
>> >>
>> >> ________________________________
>> >>
>> >> From: Ravi Kiran [maghamravikiran@gmail.com]
>> >> Sent: Monday, February 02, 2015 5:03 PM
>> >> To: user@phoenix.apache.org
>> >>
>> >>
>> >> Subject: Re: Pig vs Bulk Load record count
>> >>
>> >>
>> >>
>> >> Hi Ralph,
>> >>
>> >>    Is it possible to share the CREATE TABLE command?  I would like to
>> >> reproduce the error on my side with a sample dataset using your
>> >> specific data types.
>> >>
>> >> Regards
>> >> Ravi
>> >>
>> >>
>> >>
>> >> On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
>> >> wrote:
>> >>
>> >> Ravi,
>> >>
>> >>
>> >>
>> >> Thanks for the help - I am sorry, I am not finding the upsert
>> >> statement.  Attached are the logs and output.  I specify the columns
>> >> because I get errors if I do not.
>> >>
>> >>
>> >>
>> >> I ran a test on 10K records.  Pig states it processed 10K records.
>> >> Select count() says 9030.  I analyzed the 10K data in Excel and there
>> >> are no duplicates.
>> >>
>> >>
>> >>
>> >> Thanks!
>> >>
>> >> Ralph
>> >>
>> >>
>> >>
>> >> __________________________________________________
>> >>
>> >> Ralph Perko
>> >>
>> >> Pacific Northwest National Laboratory
>> >>
>> >> (509) 375-2272
>> >>
>> >> ralph.perko@pnnl.gov
>> >>
>> >>
>> >>
>> >> From: Ravi Kiran <maghamravikiran@gmail.com>
>> >> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> >> Date: Monday, February 2, 2015 at 12:23 PM
>> >>
>> >>
>> >> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> >> Subject: Re: Pig vs Bulk Load record count
>> >>
>> >>
>> >>
>> >> Hi Ralph,
>> >>
>> >>    Regarding the upsert query in the logs, it should be "Phoenix Custom
>> >> Upsert Statement:", as you have explicitly specified the fields in
>> >> STORE.  Is it possible to give it a try with a smaller set of records,
>> >> say 8K, to see the behavior?
>> >>
>> >> Regards
>> >> Ravi
>> >>
>> >>
>> >>
>> >> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
>> >> wrote:
>> >>
>> >> Thanks for the quick response.  Here is what I have below:
>> >>
>> >>
>> >>
>> >> ========================================
>> >>
>> >> Pig script:
>> >>
>> >> -------------------------------
>> >>
>> >> register $phoenix_jar;
>> >>
>> >> Z = load '$data' USING PigStorage(',') as (
>> >>   file_name,
>> >>   rec_num,
>> >>   epoch_time,
>> >>   timet,
>> >>   site,
>> >>   proto,
>> >>   saddr,
>> >>   daddr,
>> >>   sport,
>> >>   dport,
>> >>   mf,
>> >>   cf,
>> >>   dur,
>> >>   sdata,
>> >>   ddata,
>> >>   sbyte,
>> >>   dbyte,
>> >>   spkt,
>> >>   dpkt,
>> >>   siopt,
>> >>   diopt,
>> >>   stopt,
>> >>   dtopt,
>> >>   sflags,
>> >>   dflags,
>> >>   flags,
>> >>   sfseq,
>> >>   dfseq,
>> >>   slseq,
>> >>   dlseq,
>> >>   category);
>> >>
>> >> STORE Z into
>> >> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
>> >> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper', '-batchSize 5000');
>> >>
>> >>
>> >>
>> >> =========================
>> >>
>> >>
>> >>
>> >> I cannot find the upsert statement you are referring to in either the
>> >> MR logs or the Pig output, but I do have this below – Pig thinks it
>> >> output the correct number of records:
>> >>
>> >>
>> >>
>> >> Input(s):
>> >> Successfully read 42871627 records (1479463169 bytes) from:
>> >> "/data/incoming/201501124931/SAMPLE"
>> >>
>> >> Output(s):
>> >> Successfully stored 42871627 records in:
>> >> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> Count command:
>> >>
>> >> select count(1) from TEST;
>> >>
>> >>
>> >>
>> >> __________________________________________________
>> >>
>> >> Ralph Perko
>> >>
>> >> Pacific Northwest National Laboratory
>> >>
>> >> (509) 375-2272
>> >>
>> >> ralph.perko@pnnl.gov
>> >>
>> >>
>> >>
>> >> From: Ravi Kiran <maghamravikiran@gmail.com>
>> >> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> >> Date: Monday, February 2, 2015 at 11:01 AM
>> >> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> >> Subject: Re: Pig vs Bulk Load record count
>> >>
>> >>
>> >>
>> >> Hi Ralph,
>> >>
>> >>    That's definitely a cause for worry.  Can you please share the
>> >> UPSERT query being built by Phoenix?  You should see it in the logs
>> >> with an entry "Phoenix Generic Upsert Statement: ...".
>> >>
>> >> Also, what do the MapReduce counters say for the job?  If possible,
>> >> can you share the Pig script, as sometimes the order of columns in the
>> >> STORE command matters.
>> >>
>> >> Regards
>> >> Ravi
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
>> >> wrote:
>> >>
>> >> Hi, I’ve run into a peculiar issue between loading data using Pig vs.
>> >> the CsvBulkLoadTool.  I have 42M CSV records to load and I am comparing
>> >> the performance.
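[Editor's note] For context, a typical CsvBulkLoadTool run looks something like the sketch below; the jar name, input path, table name, and ZooKeeper quorum are placeholders, not values taken from this thread:

```shell
hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table TEST \
    --input /data/incoming/201501124931/SAMPLE \
    --zookeeper zk-host:2181
```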
>> >>
>> >>
>> >>
>> >> In both cases the MR jobs are successful, and there are no errors.
>> >>
>> >> In both cases the MR job counters state there are 42M map input and
>> >> output records.
>> >>
>> >>
>> >>
>> >> However, when I run a count on the table after the jobs complete,
>> >> something is terribly off.
>> >>
>> >> After the bulk load, select count shows all 42M recs in Phoenix, as
>> >> expected.
>> >>
>> >> After the Pig load there are only 3M recs in Phoenix – not even close.
>> >>
>> >>
>> >>
>> >> I have no errors to send.  I have run the same test multiple times and
>> >> gotten the same results.  The Pig script is not doing any
>> >> transformations; it is a simple LOAD and STORE.
>> >>
>> >> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
>> >> 4.2.3-SNAPSHOT is running on the region servers.
>> >>
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Ralph
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>>
>
>
