Added: websites/staging/sqoop/trunk/content/docs/1.4.4/SqoopUserGuide.html ============================================================================== --- websites/staging/sqoop/trunk/content/docs/1.4.4/SqoopUserGuide.html (added) +++ websites/staging/sqoop/trunk/content/docs/1.4.4/SqoopUserGuide.html Wed Jul 31 18:02:50 2013 @@ -0,0 +1,3017 @@ +Sqoop User Guide (v1.4.4)
Documentation Home

Sqoop User Guide (v1.4.4)


Table of Contents

1. Introduction
2. Supported Releases
3. Sqoop Releases
4. Prerequisites
5. Basic Usage
6. Sqoop Tools
6.1. Using Command Aliases
6.2. Controlling the Hadoop Installation
6.3. Using Generic and Specific Arguments
6.4. Using Options Files to Pass Arguments
6.5. Using Tools
7. sqoop-import
7.1. Purpose
7.2. Syntax
7.2.1. Connecting to a Database Server
7.2.2. Selecting the Data to Import
7.2.3. Free-form Query Imports
7.2.4. Controlling Parallelism
7.2.5. Controlling the Import Process
7.2.6. Controlling type mapping
7.2.7. Incremental Imports
7.2.8. File Formats
7.2.9. Large Objects
7.2.10. Importing Data Into Hive
7.2.11. Importing Data Into HBase
7.2.12. Additional Import Configuration Properties
7.3. Example Invocations
8. sqoop-import-all-tables
8.1. Purpose
8.2. Syntax
8.3. Example Invocations
9. sqoop-export
9.1. Purpose
9.2. Syntax
9.3. Inserts vs. Updates
9.4. Exports and Transactions
9.5. Failed Exports
9.6. Example Invocations
10. validation
10.1. Purpose
10.2. Introduction
10.3. Syntax
10.4. Configuration
10.5. Limitations
10.6. Example Invocations
11. Saved Jobs
12. sqoop-job
12.1. Purpose
12.2. Syntax
12.3. Saved jobs and passwords
12.4. Saved jobs and incremental imports
13. sqoop-metastore
13.1. Purpose
13.2. Syntax
14. sqoop-merge
14.1. Purpose
14.2. Syntax
15. sqoop-codegen
15.1. Purpose
15.2. Syntax
15.3. Example Invocations
16. sqoop-create-hive-table
16.1. Purpose
16.2. Syntax
16.3. Example Invocations
17. sqoop-eval
17.1. Purpose
17.2. Syntax
17.3. Example Invocations
18. sqoop-list-databases
18.1. Purpose
18.2. Syntax
18.3. Example Invocations
19. sqoop-list-tables
19.1. Purpose
19.2. Syntax
19.3. Example Invocations
20. sqoop-help
20.1. Purpose
20.2. Syntax
20.3. Example Invocations
21. sqoop-version
21.1. Purpose
21.2. Syntax
21.3. Example Invocations
22. Sqoop-HCatalog Integration
22.1. HCatalog Background
22.2. Exposing HCatalog Tables to Sqoop
22.2.1. New Command Line Options
22.2.2. Supported Sqoop Hive Options
22.2.3. Unsupported Sqoop Options
22.2.3.1. Unsupported Sqoop Hive Import Options
22.2.3.2. Unsupported Sqoop Export and Import Options
22.2.4. Ignored Sqoop Options
22.3. Automatic Table Creation
22.4. Delimited Text Formats and Field and Line Delimiter Characters
22.5. HCatalog Table Requirements
22.6. Support for Partitioning
22.7. Schema Mapping
22.8. Support for HCatalog Data Types
22.9. Providing Hive and HCatalog Libraries for the Sqoop Job
22.10. Examples
22.11. Import
22.12. Export
23. Compatibility Notes
23.1. Supported Databases
23.2. MySQL
23.2.1. zeroDateTimeBehavior
23.2.2. UNSIGNED columns
23.2.3. BLOB and CLOB columns
23.2.4. Importing views in direct mode
23.2.5. Direct-mode Transactions
23.3. PostgreSQL
23.3.1. Importing views in direct mode
23.4. Oracle
23.4.1. Dates and Times
23.5. Schema Definition in Hive
24. Notes for specific connectors
24.1. MySQL JDBC Connector
24.1.1. Upsert functionality
24.2. Microsoft SQL Connector
24.2.1. Extra arguments
24.2.2. Schema support
24.2.3. Table hints
24.3. PostgreSQL Connector
24.3.1. Extra arguments
24.3.2. Schema support
24.4. pg_bulkload connector
24.4.1. Purpose
24.4.2. Requirements
24.4.3. Syntax
24.4.4. Data Staging
24.5. Netezza Connector
24.5.1. Extra arguments
24.5.2. Direct Mode
24.5.3. Null string handling
25. Getting Support
26. Troubleshooting
26.1. General Troubleshooting Process
26.2. Specific Troubleshooting Tips
26.2.1. Oracle: Connection Reset Errors
26.2.2. Oracle: Case-Sensitive Catalog Query Errors
26.2.3. MySQL: Connection Failure
26.2.4. Oracle: ORA-00933 error (SQL command not properly ended)
26.2.5. MySQL: Import of TINYINT(1) from MySQL behaves strangely
  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.

1. Introduction

Sqoop is a tool designed to transfer data between Hadoop and +relational databases. You can use Sqoop to import data from a +relational database management system (RDBMS) such as MySQL or Oracle +into the Hadoop Distributed File System (HDFS), +transform the data in Hadoop MapReduce, and then export the data back +into an RDBMS.

Sqoop automates most of this process, relying on the database to +describe the schema for the data to be imported. Sqoop uses MapReduce +to import and export the data, which provides parallel operation as +well as fault tolerance.

This document describes how to get started using Sqoop to move data +between databases and Hadoop and provides reference information for +the operation of the Sqoop command-line tool suite. This document is +intended for:

  • +System and application programmers +
  • +System administrators +
  • +Database administrators +
  • +Data analysts +
  • +Data engineers +

2. Supported Releases

This documentation applies to Sqoop v1.4.4.

3. Sqoop Releases

Sqoop is an open source software product of the Apache Software Foundation.

Software development for Sqoop occurs at http://sqoop.apache.org. At that site you can obtain:

  • +New releases of Sqoop as well as its most recent source code +
  • +An issue tracker +
  • +A wiki that contains Sqoop documentation +

4. Prerequisites

The following prerequisite knowledge is required for this product:

  • +Basic computer technology and terminology +
  • +Familiarity with command-line interfaces such as bash +
  • +Relational database management systems +
  • +Basic familiarity with the purpose and operation of Hadoop +

Before you can use Sqoop, a release of Hadoop must be installed and configured. Sqoop currently supports four major Hadoop release lines: 0.20, 0.23, 1.0, and 2.0.

This document assumes you are using a Linux or Linux-like environment. +If you are using Windows, you may be able to use cygwin to accomplish +most of the following tasks. If you are using Mac OS X, you should see +few (if any) compatibility errors. Sqoop is predominantly operated and +tested on Linux.

5. Basic Usage

With Sqoop, you can import data from a relational database system into +HDFS. The input to the import process is a database table. Sqoop +will read the table row-by-row into HDFS. The output of this import +process is a set of files containing a copy of the imported table. +The import process is performed in parallel. For this reason, the +output will be in multiple files. These files may be delimited text +files (for example, with commas or tabs separating each field), or +binary Avro or SequenceFiles containing serialized record data.

A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.

After manipulating the imported records (for example, with MapReduce +or Hive) you may have a result data set which you can then export +back to the relational database. Sqoop’s export process will read +a set of delimited text files from HDFS in parallel, parse them into +records, and insert them as new rows in a target database table, for +consumption by external applications or users.

Sqoop includes some other commands which allow you to inspect the +database you are working with. For example, you can list the available +database schemas (with the sqoop-list-databases tool) and tables +within a schema (with the sqoop-list-tables tool). Sqoop also +includes a primitive SQL execution shell (the sqoop-eval tool).

Most aspects of the import, code generation, and export processes can +be customized. You can control the specific row range or columns imported. +You can specify particular delimiters and escape characters for the +file-based representation of the data, as well as the file format +used. You can also control the class or package names used in +generated code. Subsequent sections of this document explain how to +specify these and other arguments to Sqoop.

6. Sqoop Tools

Sqoop is a collection of related tools. To use Sqoop, you specify the +tool you want to use and the arguments that control the tool.

If Sqoop is compiled from its own source, you can run Sqoop without a formal +installation process by running the bin/sqoop program. Users +of a packaged deployment of Sqoop (such as an RPM shipped with Apache Bigtop) +will see this program installed as /usr/bin/sqoop. The remainder of this +documentation will refer to this program as sqoop. For example:

$ sqoop tool-name [tool-arguments]
[Note]Note

The following examples that begin with a $ character indicate +that the commands must be entered at a terminal prompt (such as +bash). The $ character represents the prompt itself; you should +not start these commands by typing a $. You can also enter commands +inline in the text of a paragraph; for example, sqoop help. These +examples do not show a $ prefix, but you should enter them the same +way. Don’t confuse the $ shell prompt in the examples with the $ +that precedes an environment variable name. For example, the string +literal $HADOOP_HOME includes a "$".

Sqoop ships with a help tool. To display a list of all available +tools, type the following command:

$ sqoop help
+usage: sqoop COMMAND [ARGS]
+
+Available commands:
+  codegen            Generate code to interact with database records
+  create-hive-table  Import a table definition into Hive
+  eval               Evaluate a SQL statement and display the results
+  export             Export an HDFS directory to a database table
+  help               List available commands
+  import             Import a table from a database to HDFS
+  import-all-tables  Import tables from a database to HDFS
+  list-databases     List available databases on a server
+  list-tables        List available tables in a database
+  version            Display version information
+
+See 'sqoop help COMMAND' for information on a specific command.

You can display help for a specific tool by entering: sqoop help +(tool-name); for example, sqoop help import.

You can also add the --help argument to any command: sqoop import +--help.

6.1. Using Command Aliases

In addition to typing the sqoop (toolname) syntax, you can use alias +scripts that specify the sqoop-(toolname) syntax. For example, the +scripts sqoop-import, sqoop-export, etc. each select a specific +tool.
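As a sketch (reusing the connect string and table name that appear elsewhere in this guide), the following two invocations are equivalent:

$ sqoop import --connect jdbc:mysql://database.example.com/employees --table employees
$ sqoop-import --connect jdbc:mysql://database.example.com/employees --table employees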

6.2. Controlling the Hadoop Installation

You invoke Sqoop through the program launch capability provided by +Hadoop. The sqoop command-line program is a wrapper which runs the +bin/hadoop script shipped with Hadoop. If you have multiple +installations of Hadoop present on your machine, you can select the +Hadoop installation by setting the $HADOOP_COMMON_HOME and +$HADOOP_MAPRED_HOME environment variables.

For example:

$ HADOOP_COMMON_HOME=/path/to/some/hadoop \
+  HADOOP_MAPRED_HOME=/path/to/some/hadoop-mapreduce \
+  sqoop import --arguments...

or:

$ export HADOOP_COMMON_HOME=/some/path/to/hadoop
+$ export HADOOP_MAPRED_HOME=/some/path/to/hadoop-mapreduce
+$ sqoop import --arguments...

If either of these variables are not set, Sqoop will fall back to +$HADOOP_HOME. If it is not set either, Sqoop will use the default +installation locations for Apache Bigtop, /usr/lib/hadoop and +/usr/lib/hadoop-mapreduce, respectively.

The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, +unless the $HADOOP_CONF_DIR environment variable is set.
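For example, to point Sqoop at a non-default configuration directory (the path below is only an illustration):

$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ sqoop import --arguments...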

6.3. Using Generic and Specific Arguments

To control the operation of each Sqoop tool, you use generic and +specific arguments.

For example:

$ sqoop help import
+usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
+
+Common arguments:
+   --connect <jdbc-uri>     Specify JDBC connect string
+   --connection-manager <class-name>     Specify connection manager class to use
+   --driver <class-name>    Manually specify JDBC driver class to use
+   --hadoop-mapred-home <dir>+      Override $HADOOP_MAPRED_HOME
+   --help                   Print usage instructions
+   --password-file          Set path for file containing authentication password
+   -P                       Read password from console
+   --password <password>    Set authentication password
+   --username <username>    Set authentication username
+   --verbose                Print more information while working
+   --hadoop-home <dir>+     Deprecated. Override $HADOOP_HOME
+
+[...]
+
+Generic Hadoop command-line arguments:
+(must preceed any tool-specific arguments)
+Generic options supported are
+-conf <configuration file>     specify an application configuration file
+-D <property=value>            use value for given property
+-fs <local|namenode:port>      specify a namenode
+-jt <local|jobtracker:port>    specify a job tracker
+-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
+-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
+-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
+
+The general command line syntax is
+bin/hadoop command [genericOptions] [commandOptions]

You must supply the generic arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect). Note that generic Hadoop arguments are preceded by a single dash character (-), whereas tool-specific arguments start with two dashes (--), unless they are single character arguments such as -P.

The -conf, -D, -fs and -jt arguments control the configuration and Hadoop server settings. For example, -D mapred.job.name=<job_name> can be used to set the name of the MapReduce job that Sqoop launches; if it is not specified, the name defaults to the jar name for the job, which is derived from the table name used.

The -files, -libjars, and -archives arguments are not typically used with +Sqoop, but they are included as part of Hadoop’s internal argument-parsing +system.
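As an illustrative sketch (the job name and connect string are placeholders), a generic -D argument is placed after the tool name and before the tool-specific arguments:

$ sqoop import -D mapred.job.name=nightly_employee_import \
    --connect jdbc:mysql://database.example.com/employees --table employees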

6.4. Using Options Files to Pass Arguments

When using Sqoop, command line options that do not change from invocation to invocation can be put in an options file for convenience. An options file is a text file where each line identifies an option in the order that it would otherwise appear on the command line. Options files allow specifying a single option on multiple lines by using the backslash character at the end of intermediate lines. Comments are also supported within options files; they begin with the hash character, must be specified on a new line, and may not be mixed with option text. All comments and empty lines are ignored when options files are expanded. Unless an option appears as a quoted string, any leading or trailing spaces are ignored. Quoted strings, if used, must not extend beyond the line on which they are specified.

Option files can be specified anywhere in the command line as long as +the options within them follow the otherwise prescribed rules of +options ordering. For instance, regardless of where the options are +loaded from, they must follow the ordering such that generic options +appear first, tool specific options next, finally followed by options +that are intended to be passed to child programs.

To specify an options file, simply create an options file in a convenient location and pass it to the command line via the --options-file argument.

Whenever an options file is specified, it is expanded on the command line before the tool is invoked. You can specify more than one options file within the same invocation if needed.

For example, the following Sqoop invocation for import can +be specified alternatively as shown below:

$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
+
+$ sqoop --options-file /users/homer/work/import.txt --table TEST

where the options file /users/homer/work/import.txt contains the following:

import
+--connect
+jdbc:mysql://localhost/db
+--username
+foo

The options file can have empty lines and comments for readability purposes. +So the above example would work exactly the same if the options file +/users/homer/work/import.txt contained the following:

#
+# Options file for Sqoop import
+#
+
+# Specifies the tool being invoked
+import
+
+# Connect parameter and value
+--connect
+jdbc:mysql://localhost/db
+
+# Username parameter and value
+--username
+foo
+
+#
+# Remaining options should be specified in the command line.
+#

6.5. Using Tools

The following sections will describe each tool’s operation. The +tools are listed in the most likely order you will find them useful.

7. sqoop-import

7.1. Purpose

The import tool imports an individual table from an RDBMS to HDFS. +Each row from a table is represented as a separate record in HDFS. +Records can be stored as text files (one record per line), or in +binary representation as Avro or SequenceFiles.

7.2. Syntax

$ sqoop import (generic-args) (import-args)
+$ sqoop-import (generic-args) (import-args)

While the Hadoop generic arguments must precede any import arguments, +you can type the import arguments in any order with respect to one +another.

[Note]Note

In this document, arguments are grouped into collections +organized by function. Some collections are present in several tools +(for example, the "common" arguments). An extended description of their +functionality is given only on the first presentation in this +document.

Table 1. Common arguments

Argument                                Description
--connect <jdbc-uri>                    Specify JDBC connect string
--connection-manager <class-name>       Specify connection manager class to use
--driver <class-name>                   Manually specify JDBC driver class to use
--hadoop-mapred-home <dir>              Override $HADOOP_MAPRED_HOME
--help                                  Print usage instructions
--password-file                         Set path for a file containing the authentication password
-P                                      Read password from console
--password <password>                   Set authentication password
--username <username>                   Set authentication username
--verbose                               Print more information while working
--connection-param-file <filename>      Optional properties file that provides connection parameters

7.2.1. Connecting to a Database Server

Sqoop is designed to import tables from a database into HDFS. To do +so, you must specify a connect string that describes how to connect to the +database. The connect string is similar to a URL, and is communicated to +Sqoop with the --connect argument. This describes the server and +database to connect to; it may also specify the port. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees

This string will connect to a MySQL database named employees on the +host database.example.com. It’s important that you do not use the URL +localhost if you intend to use Sqoop with a distributed Hadoop +cluster. The connect string you supply will be used on TaskTracker nodes +throughout your MapReduce cluster; if you specify the +literal name localhost, each node will connect to a different +database (or more likely, no database at all). Instead, you should use +the full hostname or IP address of the database host that can be seen +by all your remote nodes.
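For example, a connect string that also specifies the port might look like the following (3306 is shown purely to illustrate the port syntax):

$ sqoop import --connect jdbc:mysql://database.example.com:3306/employees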

You might need to authenticate against the database before you can access it. You can use the --username argument to supply a username to the database. Sqoop provides a couple of different ways to supply a password to the database, secure and non-secure, as detailed below.

Secure way of supplying password to the database. You should save the password in a file in the user's home directory with 400 permissions and specify the path to that file using the --password-file argument; this is the preferred method of entering credentials. Sqoop will then read the password from the file and pass it to the MapReduce cluster using secure means without exposing the password in the job configuration. The file containing the password can be either on the local FS or on HDFS. For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
+    --username venkatesh --password-file ${user.home}/.password
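A minimal sketch of creating such a password file (the file name and location are only examples; echo -n is used so that no trailing newline ends up in the file):

$ echo -n "secret" > ${HOME}/.password
$ chmod 400 ${HOME}/.password

The file can also be copied to HDFS, for example with hadoop fs -put ${HOME}/.password /user/$USER/.password, and referenced by that HDFS path instead.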

Another way of supplying passwords is using the -P argument, which will read a password from a console prompt.

[Warning]Warning

The --password parameter is insecure, as other users may +be able to read your password from the command-line arguments via +the output of programs such as ps. The -P argument is the preferred +method over using the --password argument. Credentials may still be +transferred between nodes of the MapReduce cluster using insecure means. +For example:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
+    --username aaron --password 12345

Sqoop automatically supports several databases, including MySQL. Connect +strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A +full list of databases with built-in support is provided in the "Supported +Databases" section. For some, you may need to install the JDBC driver +yourself.)

You can use Sqoop with any other +JDBC-compliant database. First, download the appropriate JDBC +driver for the type of database you want to import, and install the .jar +file in the $SQOOP_HOME/lib directory on your client machine. (This will +be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) +Each driver .jar file also has a specific driver class which defines +the entry-point to the driver. For example, MySQL’s Connector/J library has +a driver class of com.mysql.jdbc.Driver. Refer to your database +vendor-specific documentation to determine the main driver class. +This class must be provided as an argument to Sqoop with --driver.

For example, to connect to a SQLServer database, first download the driver from +microsoft.com and install it in your Sqoop lib path.

Then run Sqoop. For example:

$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
+    --connect <connect-string> ...

When connecting to a database using JDBC, you can optionally specify extra +JDBC parameters via a property file using the option +--connection-param-file. The contents of this file are parsed as standard +Java properties and passed into the driver while creating a connection.

[Note]Note

The parameters specified via the optional property file are only +applicable to JDBC connections. Any fastpath connectors that use connections +other than JDBC will ignore these parameters.
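A minimal sketch of using a connection parameter file (the property names below are placeholders; valid property names depend entirely on your JDBC driver):

$ cat conn-params.properties
someDriverProperty=someValue
anotherDriverProperty=anotherValue

$ sqoop import --connect <connect-string> --table employees \
    --connection-param-file conn-params.properties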

Table 2. Validation arguments (see the "validation" section for more details)

Argument                                     Description
--validate                                   Enable validation of data copied, supports single table copy only.
--validator <class-name>                     Specify validator class to use.
--validation-threshold <class-name>          Specify validation threshold class to use.
--validation-failurehandler <class-name>     Specify validation failure handler class to use.

Table 3. Import control arguments:

Argument                              Description
--append                              Append data to an existing dataset in HDFS
--as-avrodatafile                     Imports data to Avro Data Files
--as-sequencefile                     Imports data to SequenceFiles
--as-textfile                         Imports data as plain text (default)
--boundary-query <statement>          Boundary query to use for creating splits
--columns <col,col,col…>              Columns to import from table
--delete-target-dir                   Delete the import target directory if it exists
--direct                              Use direct import fast path
--direct-split-size <n>               Split the input stream every n bytes when importing in direct mode
--fetch-size <n>                      Number of entries to read from database at once
--inline-lob-limit <n>                Set the maximum size for an inline LOB
-m,--num-mappers <n>                  Use n map tasks to import in parallel
-e,--query <statement>                Import the results of statement
--split-by <column-name>              Column of the table used to split work units
--table <table-name>                  Table to read
--target-dir <dir>                    HDFS destination dir
--warehouse-dir <dir>                 HDFS parent for table destination
--where <where clause>                WHERE clause to use during import
-z,--compress                         Enable compression
--compression-codec <c>               Use Hadoop codec (default gzip)
--null-string <null-string>           The string to be written for a null value for string columns
--null-non-string <null-string>       The string to be written for a null value for non-string columns

The --null-string and --null-non-string arguments are optional. If not specified, then the string "null" will be used.

7.2.2. Selecting the Data to Import

Sqoop typically imports data in a table-centric fashion. Use the +--table argument to select the table to import. For example, --table +employees. This argument can also identify a VIEW or other table-like +entity in a database.

By default, all columns within a table are selected for import. Imported data is written to HDFS in its "natural order;" that is, a table containing columns A, B, and C results in an import of data such as:

A1,B1,C1
+A2,B2,C2
+...

You can select a subset of columns and control their ordering by using +the --columns argument. This should include a comma-delimited list +of columns to import. For example: --columns "name,employee_id,jobtitle".

You can control which rows are imported by adding a SQL WHERE clause +to the import statement. By default, Sqoop generates statements of the +form SELECT <column list> FROM <table name>. You can append a +WHERE clause to this with the --where argument. For example: --where +"id > 400". Only rows where the id column has a value greater than +400 will be imported.
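Putting the row and column selection arguments together, a sketch of such an import (the table and column names are illustrative) might look like:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --columns "name,employee_id,jobtitle" --where "id > 400"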

By default, Sqoop will use the query select min(<split-by>), max(<split-by>) from <table name> to find boundaries for creating splits. In some cases this query is not the most optimal, so you can specify any arbitrary query that returns two numeric columns using the --boundary-query argument.
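For instance, if the splitting column is sparsely populated at the low end, a custom boundary query such as the following (illustrative) one could be supplied:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by id \
    --boundary-query "SELECT MIN(id), MAX(id) FROM employees WHERE id >= 1000"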

7.2.3. Free-form Query Imports

Sqoop can also import the result set of an arbitrary SQL query. Instead of +using the --table, --columns and --where arguments, you can specify +a SQL statement with the --query argument.

When importing a free-form query, you must specify a destination directory +with --target-dir.

If you want to import the results of a query in parallel, then each map task +will need to execute a copy of the query, with results partitioned by bounding +conditions inferred by Sqoop. Your query must include the token $CONDITIONS +which each Sqoop process will replace with a unique condition expression. +You must also select a splitting column with --split-by.

For example:

$ sqoop import \
+  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
+  --split-by a.id --target-dir /user/foo/joinresults

Alternately, the query can be executed once and imported serially, by +specifying a single map task with -m 1:

$ sqoop import \
+  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id = b.id) WHERE $CONDITIONS' \
+  -m 1 --target-dir /user/foo/joinresults
[Note]Note

If you are issuing the query wrapped with double quotes ("), +you will have to use \$CONDITIONS instead of just $CONDITIONS +to disallow your shell from treating it as a shell variable. +For example, a double quoted query may look like: +"SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

[Note]Note

The facility of using free-form query in the current version of Sqoop +is limited to simple queries where there are no ambiguous projections and +no OR conditions in the WHERE clause. Use of complex queries such as +queries that have sub-queries or joins leading to ambiguous projections can +lead to unexpected results.

7.2.4. Controlling Parallelism

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism beyond that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.

When performing parallel imports, Sqoop needs a criterion by which it +can split the workload. Sqoop uses a splitting column to split the +workload. By default, Sqoop will identify the primary key column (if +present) in a table and use it as the splitting column. The low and +high values for the splitting column are retrieved from the database, +and the map tasks operate on evenly-sized components of the total +range. For example, if you had a table with a primary key column of +id whose minimum value was 0 and maximum value was 1000, and Sqoop +was directed to use 4 tasks, Sqoop would run four processes which each +execute SQL statements of the form SELECT * FROM sometable WHERE id +>= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), +(500, 750), and (750, 1001) in the different tasks.

If the actual values for the primary key are not uniformly distributed +across its range, then this can result in unbalanced tasks. You should +explicitly choose a different column with the --split-by argument. +For example, --split-by employee_id. Sqoop cannot currently split on +multi-column indices. If your table has no index column, or has a +multi-column key, then you must also manually choose a splitting +column.
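A sketch combining these options (the values are illustrative; choose a degree of parallelism your cluster and database can sustain):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --split-by employee_id --num-mappers 8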

7.2.5. Controlling the Import Process

By default, the import process will use JDBC which provides a +reasonable cross-vendor import channel. Some databases can perform +imports in a more high-performance fashion by using database-specific +data movement tools. For example, MySQL provides the mysqldump tool +which can export data from MySQL to other systems very quickly. By +supplying the --direct argument, you are specifying that Sqoop +should attempt the direct import channel. This channel may be +higher performance than using JDBC. Currently, direct mode does not +support imports of large object columns.

When importing from PostgreSQL in conjunction with direct mode, you +can split the import into separate files after +individual files reach a certain size. This size limit is controlled +with the --direct-split-size argument.

By default, Sqoop will import a table named foo to a directory named +foo inside your home directory in HDFS. For example, if your +username is someuser, then the import tool will write to +/user/someuser/foo/(files). You can adjust the parent directory of +the import with the --warehouse-dir argument. For example:

$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared \
+    ...

This command would write to a set of files in the /shared/foo/ directory.

You can also explicitly choose the target directory, like so:

$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
+    ...

This will import the files into the /dest directory. --target-dir is +incompatible with --warehouse-dir.

When using direct mode, you can specify additional arguments which +should be passed to the underlying tool. If the argument +-- is given on the command-line, then subsequent arguments are sent +directly to the underlying tool. For example, the following adjusts +the character set used by mysqldump:

$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
+    --direct -- --default-character-set=latin1

By default, imports go to a new target location. If the destination directory +already exists in HDFS, Sqoop will refuse to import and overwrite that +directory’s contents. If you use the --append argument, Sqoop will import +data to a temporary directory and then rename the files into the normal +target directory in a manner that does not conflict with existing filenames +in that directory.
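For example, to add new files alongside the results of an earlier import into the same target directory (the paths are illustrative):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --target-dir /user/someuser/employees --append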

[Note]Note

When using the direct mode of import, certain database client utilities +are expected to be present in the shell path of the task process. For MySQL +the utilities mysqldump and mysqlimport are required, whereas for +PostgreSQL the utility psql is required.

7.2.6. Controlling type mapping

Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives. However, the default mapping might not be suitable for everyone and can be overridden with --map-column-java (for changing the mapping to Java) or --map-column-hive (for changing the Hive mapping).

Table 4. Parameters for overriding mapping

Argument                              Description
--map-column-java <mapping>           Override mapping from SQL to Java type for configured columns.
--map-column-hive <mapping>           Override mapping from SQL to Hive type for configured columns.

Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:

$ sqoop import ... --map-column-java id=String,value=Integer

Sqoop will raise an exception if a configured mapping is not used.
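A corresponding sketch for the Hive mapping (the column names and Hive types shown are only an example):

$ sqoop import ... --map-column-hive id=STRING,value=INT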

7.2.7. Incremental Imports

Sqoop provides an incremental import mode which can be used to retrieve +only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

Table 5. Incremental import arguments:

Argument                  Description
--check-column (col)      Specifies the column to be examined when determining which rows to import. (The column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/LONGVARCHAR/LONGNVARCHAR.)
--incremental (mode)      Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)      Specifies the maximum value of the check column from the previous import.

Sqoop supports two types of incremental imports: append and lastmodified. +You can use the --incremental argument to specify the type of incremental +import to perform.

You should specify append mode when importing a table where new rows are +continually being added with increasing row id values. You specify the column +containing the row’s id with --check-column. Sqoop imports rows where the +check column has a value greater than the one specified with --last-value.

An alternate table update strategy supported by Sqoop is called lastmodified +mode. You should use this when rows of the source table may be updated, and +each such update will set the value of a last-modified column to the current +timestamp. Rows where the check column holds a timestamp more recent than the +timestamp specified with --last-value are imported.

At the end of an incremental import, the value which should be specified as +--last-value for a subsequent import is printed to the screen. When running +a subsequent import, you should specify --last-value in this way to ensure +you import only the new or updated data. This is handled automatically by +creating an incremental import as a saved job, which is the preferred +mechanism for performing a recurring incremental import. See the section on +saved jobs later in this document for more information.
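A sketch of an append-mode incremental import (the --last-value shown would come from the previous run's output or from a saved job):

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --incremental append --check-column id --last-value 100000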

7.2.8. File Formats

You can import data in one of two file formats: delimited text or +SequenceFiles.

Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. This argument will write string-based representations of each record to the output files, with delimiter characters between individual columns and rows. These delimiters may be commas, tabs, or other characters. (The delimiters can be selected; see "Output line formatting arguments.") The following is the result of an example text-based import:

1,here is a message,2010-05-01
+2,happy new year!,2010-01-01
+3,another message,2009-11-12

Delimited text is appropriate for most non-binary data types. It also +readily supports further manipulation by other tools, such as Hive.

SequenceFiles are a binary format that store individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you. This format supports exact storage of all data in binary representations, and is appropriate for storing binary data (for example, VARBINARY columns), or data that will be principally manipulated by custom MapReduce programs (reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed).

Avro data files are a compact, efficient binary format that provides +interoperability with applications written in other programming +languages. Avro also supports versioning, so that when, e.g., columns +are added or removed from a table, previously imported data files can +be processed along with new ones.

By default, data is not compressed. You can compress your data by +using the deflate (gzip) algorithm with the -z or --compress +argument, or specify any Hadoop compression codec using the +--compression-codec argument. This applies to SequenceFile, text, +and Avro files.
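For example, assuming the Snappy codec is available in your Hadoop installation, an Avro import could be compressed as follows:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --as-avrodatafile \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec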

7.2.9. Large Objects

Sqoop handles large objects (BLOB and CLOB columns) in particular ways. If this data is truly large, then these columns should not be fully materialized in memory for manipulation, as most columns are. Instead, their data is handled in a streaming fashion. Large objects can be stored inline with the rest of the data, in which case they are fully materialized in memory on every access, or they can be stored in a secondary storage file linked to the primary data storage. By default, large objects less than 16 MB in size are stored inline with the rest of the data. At a larger size, they are stored in files in the _lobs subdirectory of the import target directory. These files are stored in a separate format optimized for large record storage, which can accommodate records of up to 2^63 bytes each. The size at which lobs spill into separate files is controlled by the --inline-lob-limit argument, which takes a parameter specifying the largest lob size to keep inline, in bytes. If you set the inline LOB limit to 0, all large objects will be placed in external storage.
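For example, to force every large object into external storage in the _lobs subdirectory (as described above), the inline limit can be set to 0:

$ sqoop import --connect jdbc:mysql://database.example.com/employees \
    --table employees --inline-lob-limit 0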

Table 6. Output line formatting arguments:

Argument                              Description
--enclosed-by <char>                  Sets a required field enclosing character
--escaped-by <char>                   Sets the escape character
--fields-terminated-by <char>         Sets the field separator character
--lines-terminated-by <char>          Sets the end-of-line character
--mysql-delimiters                    Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
--optionally-enclosed-by <char>       Sets a field enclosing character

When importing to delimited files, the choice of delimiter is +important. Delimiters which appear inside string-based fields may +cause ambiguous parsing of the imported data by subsequent analysis +passes. For example, the string "Hello, pleased to meet you" should +not be imported with the end-of-field delimiter set to a comma.

Delimiters may be specified as:

  • +a character (--fields-terminated-by X) +
  • +an escape character (--fields-terminated-by \t). Supported escape + characters are: +

    • +\b (backspace) +
    • +\n (newline) +
    • +\r (carriage return) +
    • +\t (tab) +
    • +\" (double-quote) +
    • +\\' (single-quote) +
    • +\\ (backslash) +
    • +\0 (NUL) - This will insert NUL characters between fields or lines, + or will disable enclosing/escaping if used for one of the --enclosed-by, + --optionally-enclosed-by, or --escaped-by arguments. +
  • +The octal representation of a UTF-8 character’s code point. This + should be of the form \0ooo, where ooo is the octal value. + For example, --fields-terminated-by \001 would yield the ^A character. +
  • +The hexadecimal representation of a UTF-8 character’s code point. This + should be of the form \0xhhh, where hhh is the hex value. + For example, --fields-terminated-by \0x10 would yield the carriage + return character. +

The default delimiters are a comma (,) for fields, a newline (\n) for records, no quote character, and no escape character. Note that this can lead to ambiguous/unparsable records if you import database records containing commas or newlines in the field data. For unambiguous parsing, both enclosing and escaping must be enabled, for example via --mysql-delimiters.

If unambiguous delimiters cannot be presented, then use enclosing and +escaping characters. The combination of (optional) +enclosing and escaping characters will allow unambiguous parsing of +lines. For example, suppose one column of a dataset contained the +following values:

Some string, with a comma.
+Another "string with quotes"

The following arguments would provide delimiters which can be +unambiguously parsed:

$ sqoop import --fields-terminated-by , --escaped-by \\ --enclosed-by '\"' ...

(Note that to prevent the shell from mangling the enclosing character, +we have enclosed that argument itself in single-quotes.)

The result of the above arguments applied to the above dataset would +be:

"Some string, with a comma.","1","2","3"...
+"Another \"string with quotes\"","4","5","6"...

Here the imported strings are shown in the context of additional +columns ("1","2","3", etc.) to demonstrate the full effect of enclosing +and escaping. The enclosing character is only strictly necessary when +delimiter characters appear in the imported text. The enclosing +character can therefore be specified as optional:

$ sqoop import --optionally-enclosed-by '\"' (the rest as above)...

Which would result in the following import:

"Some string, with a comma.",1,2,3...
+"Another \"string with quotes\"",4,5,6...
[Note]Note

Even though Hive supports escaping characters, it does not handle escaping of the new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string. It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities.

The --mysql-delimiters argument is a shorthand argument which uses +the default delimiters for the mysqldump program. +If you use the mysqldump delimiters in conjunction with a +direct-mode import (with --direct), very fast imports can be +achieved.

While the choice of delimiters is most important for a text-mode +import, it is still relevant if you import to SequenceFiles with +--as-sequencefile. The generated class' toString() method +will use the delimiters you specify, so subsequent formatting of +the output data will rely on the delimiters you choose.

Table 7. Input parsing arguments:

Argument                              Description
--input-enclosed-by <char>            Sets a required field encloser
--input-escaped-by <char>
[... 2285 lines stripped ...]