phoenix-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Phoenix bulk loading
Date Thu, 12 Feb 2015 20:17:33 GMT
Hi Siva,

If I'm not mistaken, the bulk loader will currently just drop rows
that don't have the correct number of fields.

I'm actually in favor of this behavior (i.e. enforcing the consistency
of records in an input file). A CSV file is generally considered a
table with a given set of columns, so I think it's pretty reasonable
to consider a CSV record corrupt if it doesn't include the correct
number of columns.

The workaround here (which I think you already realized) would be to
encode your input data as

r1,c1,c2,c3,
r2,c1,c2,,
r2,c1,c2,c3,c4

I realize that this makes your use case more complex, because it
requires preprocessing the file before loading it into Phoenix, but I
think that keeping the Phoenix loader relatively strict (and simple)
is the better choice in this case.
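
If it helps, here is a minimal preprocessing sketch. It's just an
illustration: the field count of 5 (row key plus 4 columns) and the
file names are assumptions based on your example, and a naive split on
commas won't handle quoted fields that themselves contain commas.

    # Pad every record to 5 comma-separated fields before handing the
    # file to the Phoenix bulk loader; full records pass through as-is.
    awk -F, -v OFS=, 'NF < 5 { $5 = "" } 1' input.csv > padded.csv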

- Gabriel

On Thu, Feb 12, 2015 at 7:32 PM, Siva <sbhavanari@gmail.com> wrote:
> Hi Gabriel,
>
> Thanks for your response.
>
> Your understanding is correct. The use case we have is: we get the data from
> different sources (with different table structures, in terms of columns,
> depending on the client type) in CSV format. If a column is not available in
> the source, we don't even have the option of appending a blank comma (,) in
> its place. HBase, however, just ignores a column if it doesn't find the data
> within a record.
>
> If I have a set of records like the ones below and I specify 4 columns
> (excluding the row key), then for the first record it inserts data for 3
> columns, for the 2nd record 2 columns, and for the 3rd record 4 columns; it
> just ignores a column if it doesn't find the data.
>
> r1,c1,c2,c3
> r2,c1,c2
> r2,c1,c2,c3,c4
>
>
> Since Phoenix doesn't have this capability, we have to create the tables in
> HBase and load them through it, using Phoenix just for SQL queries.
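>
> For reference, the direct HBase load looks roughly like this (the table
> name, column family, and path here are made up for illustration):
>
>     hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
>       -Dimporttsv.separator=',' \
>       -Dimporttsv.columns=HBASE_ROW_KEY,cf:c1,cf:c2,cf:c3,cf:c4 \
>       P_TEST /user/sbhavanari/p_h_test.csv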
>
> I think we should enhance the Phoenix data loader to work the same way as the
> HBase loader. What do you say, any thoughts on this?
>
> Thanks,
> Siva.
>
> On Wed, Feb 11, 2015 at 11:34 PM, Gabriel Reid <gabriel.reid@gmail.com>
> wrote:
>>
>> Hi Siva,
>>
>> If I understand correctly, you want to explicitly supply null values
>> in a CSV file for some fields. In general, this should work by just
>> leaving the field empty in your CSV file. For example, if you have
>> three fields (id, first_name, last_name) in your CSV file, then a
>> record like "1,,Reid" should create a record with first_name left as
>> null.
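>>
>> As a quick illustration (the table name and DDL here are hypothetical,
>> just matching the fields above):
>>
>>     CREATE TABLE users (
>>         id BIGINT NOT NULL PRIMARY KEY,
>>         first_name VARCHAR,
>>         last_name VARCHAR);
>>
>> Loading a file containing
>>
>>     1,,Reid
>>     2,Siva,
>>
>> should leave first_name null for the first row and last_name null for
>> the second.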
>>
>> Note that there is still an open bug, PHOENIX-1277 [1], that will
>> prevent inserting null values via the bulk loader or psql, so for some
>> data types there currently isn't a way to explicitly supply null
>> values.
>>
>> - Gabriel
>>
>>
>> 1. https://issues.apache.org/jira/browse/PHOENIX-1277
>>
>> On Thu, Feb 12, 2015 at 1:28 AM, Siva <sbhavanari@gmail.com> wrote:
>> > Hello all,
>> >
>> > Is there a way to specify that NULL values should be kept for columns that
>> > are not present in the CSV file as part of bulk loading?
>> >
>> > The requirement I have is: a few rows in the CSV file contain all the
>> > columns, but other rows contain only a few of them.
>> >
>> > In HBase, if a given record doesn't have the desired columns, ImportTsv
>> > just ignores the missing columns and goes on to the next record while
>> > loading the data.
>> >
>> >
>> > HADOOP_CLASSPATH=/usr/hdp/2.2.0.0-2041/hbase/lib/hbase-protocol.jar:/usr/hdp/2.2.0.0-2041/hbase/conf \
>> >   hadoop jar /usr/hdp/2.2.0.0-2041/phoenix/phoenix-4.2.0.2.2.0.0-2041-client.jar \
>> >   org.apache.phoenix.mapreduce.CsvBulkLoadTool \
>> >   --table P_TEST_2_COLS \
>> >   --input /user/sbhavanari/p_h_test_2_cols_less.csv \
>> >   --import-columns NAME,LEADID,D \
>> >   --zookeeper 172.31.45.176:2181:/hbase
>> >
>> > Thanks,
>> > Siva.
>
>
