phoenix-user mailing list archives

From James Taylor <jamestay...@apache.org>
Subject Re: Auto-increment field?
Date Wed, 27 Aug 2014 16:14:46 GMT
Hi JM,
Let me make sure I understand your use case. You have 156M rows worth
of data in the form CustID (BIGINT), URL (VARCHAR). You have a CSV
file with the data. Is CustID already unique in the CSV file? If not,
won't you run into issues trying to load the data, as you'll be
overwriting rows wherever CustID repeats? If the URL is
unique and you don't have a CustID yet, then you can load that into
one table through our CSV bulk loader and then, in a second SQL
statement, populate a different table with an auto-incrementing
sequence.
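
A rough sketch of that two-step flow (STAGING and TARGET are
hypothetical table names here, and the sequence uses default settings):

    CREATE SEQUENCE url_seq;
    -- step 1: bulk load the CSV (URL column only) into STAGING
    -- step 2: assign an auto-incremented ID while copying into TARGET
    UPSERT INTO target(cust_id, url)
        SELECT NEXT VALUE FOR url_seq, url FROM staging;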

There is an easy way of updating a field in place (or writing it out
to a different table) in a single statement, which may help you:
    conn.setAutoCommit(true); // automatically commit the statement below in batches
    UPSERT INTO my_table(id, my_col)
        SELECT id, NEXT VALUE FOR my_seq FROM my_table;
If my_col is part of the PK, though, you'll be inserting new rows, so
I'm not sure if this is what you want.
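
Note that my_seq has to exist before that statement runs; assuming
default settings, creating it is just:

    CREATE SEQUENCE my_seq;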

We don't have a way of auto incrementing a field during CSV bulk load.
We do have a JIRA to support defining a DEFAULT in your CREATE TABLE
statement which would allow this
(https://issues.apache.org/jira/browse/PHOENIX-476).
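
Once that lands, the syntax could look roughly like this (purely
hypothetical; the JIRA will settle the actual form):

    CREATE TABLE testdata (
        id BIGINT NOT NULL,
        subid BIGINT DEFAULT NEXT VALUE FOR my_seq, -- hypothetical
        url VARCHAR
        CONSTRAINT my_pk PRIMARY KEY (id, subid));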

Also, FWIW, we support running a map-reduce job to do bulk CSV import
as well: http://phoenix.apache.org/bulk_dataload.html
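
The invocation is roughly along these lines (version and paths are
placeholders; see the linked page for the exact command):

    hadoop jar phoenix-<version>-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table TESTDATA --input /data/example.csv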

Thanks,
James

On Wed, Aug 27, 2014 at 5:56 AM, Jean-Marc Spaggiari
<jean-marc@spaggiari.org> wrote:
> Hi,
>
> I have data like:
>
> CustID, URL
>
> and I want to put that into Phoenix. Is there a way to have an
> auto-increment field to do something like:
> CREATE TABLE IF NOT EXISTS testdata (
>     id BIGINT NOT NULL,
>     subid AUTO-INCREMENT,
>     url VARCHAR
>     CONSTRAINT my_pk PRIMARY KEY (id, subid));
>
> The idea is, I have duplicates in the key. One customer can access 2 different
> URLs, so I want to store both under this customer.
>
> I'm totally fine doing that with HBase only, but I'm wondering how I can do
> it with Phoenix?
>
> I planned to use the bulk load given in the documentation, as I have a 22GB
> CSV sample I want to load:
> /usr/lib/phoenix/bin/psql.py hbm101 us_population.sql us_population.csv
> us_population_queries.sql
>
> But this doesn't seem to have any increment option.
>
> I can probably use a sequence and loop over all my rows to generate UPSERT
> statements like in this example: http://phoenix.apache.org/sequences.html but
> I have 156,576,856 lines in the file and doing 156M UPSERTs might take a
> while.
>
> Do you have any shortcut for my issue? So far, the only idea I have is to
> update the .csv file and add an increment column manually.
>
> Thanks,
>
> JM
