phoenix-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ken Hampson <hamps...@gmail.com>
Subject Re: Importing Postgres-generated CSVs into Phoenix
Date Sat, 09 Jan 2016 01:04:25 GMT
Hi, James:

Thanks for the reply. I took a look at the patch attached to PHOENIX-2434,
and it does look like it has a good chance to solve the problem, as
Postgres exports boolean values as 't' and 'f', which the patch now checks
for. Once we're able to get to Phoenix 4.7.0, I'll definitely try it out to
confirm.

We don't and won't be able to use Flyway due to various architectural
considerations, but we do use Sqoop to get data from Postgres into Hive. I
had skimmed  SQOOP-2649 when originally looking at how to bulk load into
Phoenix, and figured CSV was the way to go for now since the  SQOOP-2649
enhancement was still in patch form. It's definitely something I will keep
an eye on going forward.

Thanks again,
- Ken

On Fri, Jan 8, 2016 at 1:13 PM James Taylor <jamestaylor@apache.org> wrote:

> Hi Ken,
> PHOENIX-2434 improved our CSV handling of booleans and will appear in our
> upcoming 4.7.0 release. It'd be good if you can confirm whether or not this
> is what you need. We definitely want to support ingest of CSVs from other
> RDBMSs.
>
> There are a couple of other avenues of ingest into Phoenix:
> - using Flyway DB (http://flywaydb.org/) where Phoenix support has been
> added (not sure if this has appeared in a release yet or not, though)
> - through a pull request to add support for Phoenix in Scoop (
> http://sqoop.apache.org/). Unfortunately, this isn't getting much
> attention in that community (if this is useful for you, you may want to
> chime in on SQOOP-2649 or on their mailing list).
>
> Thanks,
> James
>
> On Thu, Jan 7, 2016 at 5:28 PM, Ken Hampson <hampsonk@gmail.com> wrote:
>
>> Hi, all:
>>
>> I am in the process of importing some data from Postgres into Phoenix as
>> part of a larger project. I noticed after doing so and cross-checking the
>> results of a fairly complex JOIN query (of 6 tables of so) that the number
>> of rows returned by Postgres and Phoenix differed -- Phoenix had a
>> significant number more rows. I investigated this further, and ultimately
>> determined that the reason was that all *boolean *values were imported
>> as being false, regardless of their actual value.
>>
>> When I investigated that aspect further, I discovered that Phoenix
>> appears to require the full strings "true" and "false" in the CSV in order
>> to properly register their actual values. As a workaround, I am currently
>> using a full SELECT in the "\copy" statement in psql to select each column
>> individually, and for each boolean column, I am using a CASE statement to
>> emit the full strings. I have since also learned that casting the boolean
>> to the "text" datatype will also cause Postgres to emit the full string.
>>
>> In either case, though, the pain point is that this discrepancy requires
>> that I examine each table closely for boolean columns, and if there are
>> any, I then need to enumerate all the columns out explicitly and use either
>> the CASE statement or the cast method on the boolean columns in order to
>> get the values Phoenix expects. In my case, I didn't need to transform any
>> of the data otherwise, so this forced a much more complex procedure.
>>
>> While psql provides many options to change various values -- delimiters
>> and what not -- there is no option to change the boolean values. Likewise,
>> as near as I can tell, there are no options to change what Phoenix expects,
>> either.
>>
>> I am wondering:
>>
>> - Is there any way from within Phoenix to change the boolean values that
>> it expects in the CSV?
>> - Are there any other methods from a Phoenix perspective that you can
>> think of that would obviate the need for the column-by-column enumeration
>> procedure I outlined above?
>> - Since Postgres is a major open source RDBMS which I suspect many people
>> -- myself included -- will want to export data from into Phoenix, could
>> there be an enhancement to Phoenix to allow an option to specify a default
>> boolean value, or perhaps an option to specify the CSV source, from which
>> the proper value could be defined and utilized automatically?
>>
>> Thanks,
>> - Ken
>>
>
>

Mime
View raw message