phoenix-user mailing list archives

From Ravi Kiran <maghamraviki...@gmail.com>
Subject Re: Pig vs Bulk Load record count
Date Wed, 04 Feb 2015 00:47:38 GMT
Hi Ralph,

   Also, could you please attach the schema to the JIRA as well, using
"DESCRIBE Z"?

Regards
Ravi
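
[Editor's note: the fix Ralph describes in the thread below is to declare explicit Pig types in the LOAD schema instead of leaving the fields as untyped bytearrays. A minimal sketch of the pattern follows; the jar path, table name, ZooKeeper quorum, and the abbreviated column list are placeholders, not the exact values from the thread.]

```pig
register phoenix-client.jar;  -- placeholder path to the Phoenix client jar

-- Declare types explicitly; untyped fields default to bytearray, and the
-- implicit coercion appears to be what munged the PK values in this thread.
Z = LOAD 'input.csv' USING PigStorage(',') AS (
  file_name:chararray,
  rec_num:int,
  epoch_time:long,
  site:chararray);

STORE Z INTO 'hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,SITE'
  USING org.apache.phoenix.pig.PhoenixHBaseStorage('localhost', '-batchSize 1000');
```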

On Tue, Feb 3, 2015 at 4:39 PM, James Taylor <jamestaylor@apache.org> wrote:

> Glad to hear it, Ralph. Still sounds like there's a bug here (or at a
> minimum a usability issue), but not a showstopper for the 4.3 release.
> Would you mind filing a JIRA for it?
> Thanks,
> James
>
> On Tue, Feb 3, 2015 at 4:31 PM, Ravi Kiran <maghamravikiran@gmail.com>
> wrote:
> > Hi Ralph,
> >
> >    Glad it is working!!
> >
> > Regards
> > Ravi
> >
> > On Tue, Feb 3, 2015 at 3:29 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
> wrote:
> >>
> >> I have solved the problem.  This was a mystery because the same data
> >> loaded into the same schema gave conflicting counts depending on the load
> >> technique.  While the data itself had no duplicate keys, the behavior
> >> suggested something was up with the keys (MR input/output had the correct
> >> record count for both load techniques, for instance).  I confirmed this by
> >> creating a Pig UDF that created a UUID for each row as the PK.  The result
> >> of running this test was that each row appeared as expected and I got the
> >> correct count.  But I couldn’t figure out why the data itself would behave
> >> differently, because it was also unique.  My Pig script could hardly be
> >> simpler, with no transformations; it is a simple load and store.  This
> >> ended up being the issue!
> >>
> >> Solution:
> >> Assign the correct Pig data type to the PK values rather than letting Pig
> >> figure it out.  I am not sure what the exact underlying issue is, but this
> >> fixed it (perhaps when Pig coerced the values to a datatype it thought
> >> best, it munged them somehow).
> >>
> >> Changes to pig script from below:
> >>
> >>
> >> Z = load '$data' USING PigStorage(',') as (
> >>
> >>   file_name:chararray,
> >>
> >>   rec_num:int,
> >>
> >>
> >> Thanks for the help
> >>
> >> Ralph
> >>
> >>
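
[Editor's note: Ralph's UUID experiment above (tagging each row with a synthetic PK to rule out key collisions) can be approximated without a custom UDF. A sketch, assuming a recent Pig version that ships the built-in UniqueID() function, a hypothetical TEST_UUID table with a chararray key column, and placeholder quorum/batch settings:]

```pig
Z = LOAD 'input.csv' USING PigStorage(',') AS (file_name:chararray, rec_num:int);

-- Tag each row with a synthetic unique key; if the stored row count now
-- matches the input count, the original keys were being collapsed somehow.
T = FOREACH Z GENERATE UniqueID() AS row_id, file_name, rec_num;

STORE T INTO 'hbase://TEST_UUID/ROW_ID,FILE_NAME,REC_NUM'
  USING org.apache.phoenix.pig.PhoenixHBaseStorage('localhost', '-batchSize 1000');
```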
> >> From: <Ciureanu>, "Constantin (GfK)" <Constantin.Ciureanu@gfk.com>
> >> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> >> Date: Tuesday, February 3, 2015 at 1:52 AM
> >> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> >> Subject: RE: Pig vs Bulk Load record count
> >>
> >> Hello Ralph,
> >>
> >>
> >>
> >> Try to check whether the Pig script produces keys that overlap (that
> >> would explain the reduced number of rows).
> >>
> >>
> >>
> >> Good luck,
> >>
> >>    Constantin
> >>
> >>
> >>
> >> From: Ravi Kiran [mailto:maghamravikiran@gmail.com]
> >> Sent: Tuesday, February 03, 2015 2:42 AM
> >> To: user@phoenix.apache.org
> >> Subject: Re: Pig vs Bulk Load record count
> >>
> >>
> >>
> >> Thanks Ralph. I will try to reproduce this on my end with a sample data
> >> set and get back to you.
> >>
> >> Regards
> >>
> >> Ravi
> >>
> >>
> >>
> >> On Mon, Feb 2, 2015 at 5:27 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
> >> wrote:
> >>
> >> Ravi,
> >>
> >>
> >>
> >> The create statement is attached.  You will see some additional fields I
> >> excluded from the first email.
> >>
> >>
> >>
> >> Thanks!
> >>
> >> Ralph
> >>
> >>
> >>
> >> ________________________________
> >>
> >> From: Ravi Kiran [maghamravikiran@gmail.com]
> >> Sent: Monday, February 02, 2015 5:03 PM
> >> To: user@phoenix.apache.org
> >>
> >>
> >> Subject: Re: Pig vs Bulk Load record count
> >>
> >>
> >>
> >> Hi Ralph,
> >>
> >>    Is it possible to share the CREATE TABLE command? I would like to
> >> reproduce the error on my side with a sample dataset using your specific
> >> data types.
> >>
> >> Regards
> >> Ravi
> >>
> >>
> >>
> >> On Mon, Feb 2, 2015 at 1:29 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
> >> wrote:
> >>
> >> Ravi,
> >>
> >>
> >>
> >> Thanks for the help - I am sorry, I am not finding the upsert statement.
> >> Attached are the logs and output.  I specify the columns because I get
> >> errors if I do not.
> >>
> >>
> >>
> >> I ran a test on 10K records.  Pig states it processed 10K records.  Select
> >> count() says 9030.  I analyzed the 10K data in Excel and there are no
> >> duplicates.
> >>
> >>
> >>
> >> Thanks!
> >>
> >> Ralph
> >>
> >>
> >>
> >> __________________________________________________
> >>
> >> Ralph Perko
> >>
> >> Pacific Northwest National Laboratory
> >>
> >> (509) 375-2272
> >>
> >> ralph.perko@pnnl.gov
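
[Editor's note: the duplicate check done in Excel above can also be done directly in Pig. A sketch, assuming for illustration that the primary key is the (file_name, rec_num) pair; the actual PK is in the schema attached to the thread:]

```pig
Z = LOAD 'input.csv' USING PigStorage(',') AS (file_name:chararray, rec_num:int);

-- Group by the candidate primary key and keep only groups with more than
-- one row; an empty DUPS relation means the candidate keys are unique.
K = GROUP Z BY (file_name, rec_num);
DUPS = FILTER K BY COUNT(Z) > 1L;
DUMP DUPS;
```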
> >>
> >>
> >>
> >> From: Ravi Kiran <maghamravikiran@gmail.com>
> >> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> >> Date: Monday, February 2, 2015 at 12:23 PM
> >>
> >>
> >> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> >> Subject: Re: Pig vs Bulk Load record count
> >>
> >>
> >>
> >> Hi Ralph,
> >>
> >>    Regarding the upsert query in the logs: it should be a "Phoenix Custom
> >> Upsert Statement:" entry, as you have explicitly specified the fields in
> >> STORE.  Is it possible to give it a try with a smaller set of records, say
> >> 8k, to see the behavior?
> >>
> >> Regards
> >> Ravi
> >>
> >>
> >>
> >> On Mon, Feb 2, 2015 at 11:27 AM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
> >> wrote:
> >>
> >> Thanks for the quick response.  Here is what I have below:
> >>
> >>
> >>
> >> ========================================
> >>
> >> Pig script:
> >>
> >> -------------------------------
> >>
> >> register $phoenix_jar;
> >>
> >>
> >>
> >> Z = load '$data' USING PigStorage(',') as (
> >>
> >>   file_name,
> >>
> >>   rec_num,
> >>
> >>   epoch_time,
> >>
> >>   timet,
> >>
> >>   site,
> >>
> >>   proto,
> >>
> >>   saddr,
> >>
> >>   daddr,
> >>
> >>   sport,
> >>
> >>   dport,
> >>
> >>   mf,
> >>
> >>   cf,
> >>
> >>   dur,
> >>
> >>   sdata,
> >>
> >>   ddata,
> >>
> >>   sbyte,
> >>
> >>   dbyte,
> >>
> >>   spkt,
> >>
> >>   dpkt,
> >>
> >>   siopt,
> >>
> >>   diopt,
> >>
> >>   stopt,
> >>
> >>   dtopt,
> >>
> >>   sflags,
> >>
> >>   dflags,
> >>
> >>   flags,
> >>
> >>   sfseq,
> >>
> >>   dfseq,
> >>
> >>   slseq,
> >>
> >>   dlseq,
> >>
> >>   category);
> >>
> >>
> >>
> >> STORE Z into
> >> 'hbase://$table_name/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY'
> >> using org.apache.phoenix.pig.PhoenixHBaseStorage('$zookeeper','-batchSize
> >> 5000');
> >>
> >>
> >>
> >> =========================
> >>
> >>
> >>
> >> I cannot find the upsert statement you are referring to in either the MR
> >> logs or the Pig output, but I do have this below – Pig thinks it output
> >> the correct number of records:
> >>
> >>
> >> Input(s):
> >>
> >> Successfully read 42871627 records (1479463169 bytes) from:
> >> "/data/incoming/201501124931/SAMPLE"
> >>
> >>
> >>
> >> Output(s):
> >>
> >> Successfully stored 42871627 records in:
> >> "hbase://TEST/FILE_NAME,REC_NUM,EPOCH_TIME,TIMET,SITE,PROTO,SADDR,DADDR,SPORT,DPORT,MF,CF,DUR,SDATA,DDATA,SBYTE,DBYTE,SPKT,DPKT,SIOPT,DIOPT,STOPT,DTOPT,SFLAGS,DFLAGS,FLAGS,SFSEQ,DFSEQ,SLSEQ,DLSEQ,CATEGORY"
> >>
> >>
> >>
> >>
> >>
> >> Count command:
> >>
> >> select count(1) from TEST;
> >>
> >>
> >>
> >> __________________________________________________
> >>
> >> Ralph Perko
> >>
> >> Pacific Northwest National Laboratory
> >>
> >> (509) 375-2272
> >>
> >> ralph.perko@pnnl.gov
> >>
> >>
> >>
> >> From: Ravi Kiran <maghamravikiran@gmail.com>
> >> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> >> Date: Monday, February 2, 2015 at 11:01 AM
> >> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> >> Subject: Re: Pig vs Bulk Load record count
> >>
> >>
> >>
> >> Hi Ralph,
> >>
> >>    That's definitely a cause for worry. Can you please share the UPSERT
> >> query being built by Phoenix? You should see it in the logs with an entry
> >> "Phoenix Generic Upsert Statement: ...".
> >>
> >> Also, what do the MapReduce counters say for the job?  If possible, can
> >> you share the Pig script, as sometimes the order of columns in the STORE
> >> command has an impact.
> >>
> >> Regards
> >> Ravi
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Feb 2, 2015 at 10:46 AM, Perko, Ralph J <Ralph.Perko@pnnl.gov>
> >> wrote:
> >>
> >> Hi, I’ve run into a peculiar issue between loading data using Pig vs the
> >> CsvBulkLoadTool.  I have 42M csv records to load and I am comparing the
> >> performance.
> >>
> >>
> >>
> >> In both cases the MR jobs are successful, and there are no errors.
> >>
> >> In both cases the MR job counters state there are 42M map input and
> >> output records.
> >>
> >>
> >>
> >> However, when I run a count on the table after the jobs are complete,
> >> something is terribly off.
> >>
> >> After the bulk load, select count shows all 42M recs in Phoenix as is
> >> expected.
> >>
> >> After the pig load there are only 3M recs in Phoenix – not even close.
> >>
> >>
> >>
> >> I have no errors to send.  I have run the same test multiple times and
> >> gotten the same results.  The Pig script is not doing any transformations;
> >> it is a simple LOAD and STORE.
> >>
> >> I get the same result using client jars from 4.2.2 and 4.2.3-SNAPSHOT.
> >> 4.2.3-SNAPSHOT is running on the region servers.
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Ralph
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
>
