Sqoop 是什么及安装
Apache sqoop (SQL to Hadoop)
Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop是一种用于在Hadoop和关系数据库或大型机之间传输数据的工具。你可以使用Sqoop从一个关系数据库管理系统(RDBMS)中导入数据,比如MySQL或Oracle,或者一个大型机到Hadoop分布式文件系统(HDFS),在Hadoop MapReduce中转换数据,然后将数据导出到RDBMS中。sqoop 安装
1.解压
2.配置sqoop-env.sh文件
#Set path to where bin/hadoop is availableexport HADOOP_COMMON_HOME=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6#Set path to where hadoop-*-core.jar is availableexport HADOOP_MAPRED_HOME=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6#Set the path to where bin/hive is availableexport HIVE_HOME=/opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6
3.拷贝 MySQL 连接 jar 包到 lib 目录下
4.使用工具
# 帮助文档bin/sqoop helpbin/sqoop list-databases --help# 连接数据库bin/sqoop list-databases \--connect jdbc:mysql://hostname-ubuntu:3306 \--username root \--password ubuntu
Imports( RDBMS > HDFS )
底层通过JDBC连接数据库,通过生成 JAVA (只有 map 程序) 代码,编译成 jar ,经过 YARN 执行
#执行命令(五要素)bin/sqoop import \--connect jdbc:mysql://hostname-ubuntu:3306/test \--username root \--password ubuntu \--table user \--target-dir /user/cen/sqoop/imp_user \# 注意文件夹不能存在# 若此项不设置默认会在 /user/cen/tablename/ 创建文件夹# 可添加以下选项解决# 删除目标目录--delete-target-dir \# 选择列--columns id,name \# 增加查询条件--where "age = '18'" \# 设置压缩--compress \--compression-codec org.apache.hadoop.io.compress.SnappyCodec# 设置输出的分隔符--fields-terminated-by '\t' \# 设置map数目--num-mappers 1 \# 设置文件存储格式(默认是textfile)--as-parquetfile \# 设置直接从 MySQL 直接导入(不经过MapReduce)--direct #日志分析Running Sqoop version: 1.4.5-cdh5.3.6 说明执行版本Setting your password on the command-line is insecure. Consider using -P instead. 这样使用密码是不安全的,推荐使用-p的方式Preparing to use a MySQL streaming resultset. 使用MySQL流导出数据Beginning code generation 开始生成Java代码Executing SQL statement: SELECT t.* FROM `user` AS t LIMIT 1 执行SQL查询语句Executing SQL statement: SELECT t.* FROM `user` AS t LIMIT 1 执行SQL查询语句HADOOP_MAPRED_HOME is /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6 Hadoop的目录是。。。Note: /tmp/sqoop-cen/compile/55ff43412ccc6eae25cf74ff6a4b119a/user.java uses or overrides a deprecated API. 生成jarNote: Recompile with -Xlint:deprecation for details.Writing jar file: /tmp/sqoop-cen/compile/55ff43412ccc6eae25cf74ff6a4b119a/user.jar 写入jar文件It looks like you are importing from mysql.This transfer can be faster! Use the --direct 这个转换可以更快,使用直接模式option to exercise a MySQL-specific fast path. Setting zero DATETIME behavior to convertToNull (mysql)Beginning import of user 开始导入数据Unable to load native-hadoop library for your platform... using builtin-java classes where applicablemapred.jar is deprecated. Instead, use mapreduce.job.jar 使用mapreduce的jar包
使用SQL查询语句(无需 tablename 无需选择列 无需where )
bin/sqoop import \--connect jdbc:mysql://hostname-ubuntu:3306/test \--username root \--password ubuntu \--query 'select id from user where $CONDITIONS' \--target-dir /user/cen/sqoop/imp_user_query \--split-by id
- 错误:When importing query results in parallel, you must specify --split-by.
增加:--split-by id
- 错误:Query [select id from user] must contain '$CONDITIONS' in WHERE clause.
- 解决:--query 'select id from user where $CONDITIONS'
增量数据导入Incremental import(只导入新增加的)
1. query 对时间戳进行条件查询(更好用)
where createtime => 20170721000000 and createtimt < 201707220000
2. 使用 sqoop 的参数
--check-column Source column to check for incremental
change 根据哪一个字段进行查询 --incremental Define an incremental import of type 'append' or 'lastmodified' 追加方式 --last-value Last imported value in the incremental check column 查询的开始值Export( HDFS > RDBMS )( Hive > RDBMS )
底层通过JDBC连接数据库,通过生成 JAVA (只有 map 程序) 代码,编译成 jar ,经过 YARN 执行
#执行命令(五要素)bin/sqoop export \--connect jdbc:mysql://hostname-ubuntu:3306/test \--username root \--password ubuntu \--table user \--export-dir /user/cen/sqoop/imp_user
Import Hive table
从 RDBMS 导入到 Hive 表中,底层实现:正常 Import > HDFS > load
# 关系型数据库 4 要素(IP+database/username/pw/table) + Hive 4 要素(import/database/table/分隔符)bin/sqoop import \--connect jdbc:mysql://hostname-ubuntu:3306/test \--username root \--password ubuntu \--table user \--hive-import \--hive-database default \--hive-table user \--fields-terminated-by '\t' \--delete-target-dir \# 需要缓冲目录,若目录存在则删除--create-hive-table #若表不存在,需要创建的话
通过文件执行 sqoop 命令
!!格式如下:
#这是注释!!bin/sqoopimport--connectjdbc:mysql://hostname-ubuntu:3306/test--usernameroot--passwordubuntu--tableuser--target-dir/user/cen/sqoop/imp_user
执行:
sqoop --options-file /opt/datas/xxx.xx