The Spark that ships with CDH does not support spark-sql or SparkR. To use them, the Hive-related dependencies have to be built into the Spark assembly jar. The following are the build and installation steps for spark-sql.
I. Prepare the build environment on any Linux machine
spark-1.5.0.tgz, downloaded from https://spark.apache.org/downloads.html
JDK 1.7.0_79
Scala 2.10.4
Maven 3.3.9
These versions follow the requirements on the Spark website; for details see https://spark.apache.org/docs/:
Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.5.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x).
Building Spark using Maven requires Maven 3.3.3 or newer and Java 7+. The Spark build can supply a suitable Maven binary;
Configure the environment variables as follows, and make them take effect with source /etc/profile:
export JAVA_HOME=/data/jdk1.7.0_79
export M2_HOME=/data/apache-maven-3.3.9
export SCALA_HOME=/data/scala-2.10.4
export PATH=$JAVA_HOME/bin:$M2_HOME/bin:$SCALA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
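A quick way to confirm the toolchain is picked up correctly after sourcing the profile (standard version checks, nothing Spark-specific):
java -version
mvn -version
scala -version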
II. Build steps
More details are on the official site: https://spark.apache.org/docs/1.5.0/building-spark.html
1. Increase the memory available to Maven, since the build is complex and takes a long time:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
2. Unpack spark-1.5.0.tgz (for example into /data), then start the build in the background with nohup mvn, redirecting the output to a log file.
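A minimal sketch of the unpack step, assuming the tarball was downloaded to /data:
cd /data
tar -zxf spark-1.5.0.tgz
cd spark-1.5.0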
nohup mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests clean package > ./spark-mvn-`date +%Y%m%d%H`.log 2>&1 &
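While the build runs in the background, its progress can be followed with, for example:
tail -f ./spark-mvn-*.log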
The first build takes 2 to 3 hours depending on network conditions (I had to build several times before it succeeded). The tail end of a successful build log looks like this:
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [3.200 s]
[INFO] Spark Project Launcher ............................. SUCCESS [8.887 s]
[INFO] Spark Project Networking ........................... SUCCESS [8.270 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [4.832 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [6.082 s]
[INFO] Spark Project Core ................................. SUCCESS [01:52 min]
[INFO] Spark Project Bagel ................................ SUCCESS [5.129 s]
[INFO] Spark Project GraphX ............................... SUCCESS [13.442 s]
[INFO] Spark Project Streaming ............................ SUCCESS [30.683 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [43.622 s]
[INFO] Spark Project SQL .................................. SUCCESS [53.463 s]
[INFO] Spark Project ML Library ........................... SUCCESS [01:06 min]
[INFO] Spark Project Tools ................................ SUCCESS [2.225 s]
[INFO] Spark Project Hive ................................. SUCCESS [42.020 s]
[INFO] Spark Project REPL ................................. SUCCESS [8.500 s]
[INFO] Spark Project YARN ................................. SUCCESS [9.665 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [7.255 s]
[INFO] Spark Project Assembly ............................. SUCCESS [02:15 min]
[INFO] Spark Project External Twitter ..................... SUCCESS [7.330 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [5.103 s]
[INFO] Spark Project External Flume ....................... SUCCESS [8.405 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [2.928 s]
[INFO] Spark Project External MQTT ........................ SUCCESS [15.932 s]
[INFO] Spark Project External MQTT Assembly ............... SUCCESS [7.792 s]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [6.057 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [10.135 s]
[INFO] Spark Project Examples ............................. SUCCESS [01:49 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [8.111 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [5.814 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12:28 min
[INFO] Finished at: 2016-07-26T16:05:11+08:00
[INFO] Final Memory: 90M/1589M
[INFO] ------------------------------------------------------------------------
The generated Spark assembly jar will then be found at:
/data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
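It is worth confirming the jar was actually produced before moving on, for example:
ls -lh /data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-*.jar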
III. Install the Spark assembly
1. Copy the assembly jar
Copy the jar remotely to the CDH machine 180.153.., for example into /home/hadoop:
scp -P 50201 /data/spark-1.5.0/assembly/target/scala-2.10/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar root@180.153.*.*:/home/hadoop
Then copy it into CDH's jars directory with the cp command below; if a jar with the same name already exists there, back it up and remove it first.
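For example, the stock jar of the same name could be moved aside like this (adjust the path to your parcel directory):
cd /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars
mv spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar.bak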
cp -p /home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars
2. Replace the assembly jar under CDH's Spark
This really just means repointing the symlink spark-assembly.jar at spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar in CDH's jars directory. The symlinks live in
/opt/cloudera/parcels/CDH/lib/spark/lib; delete the old links and create new ones:
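The ln commands below use relative paths, so they should be run from that lib directory; removing the old links first might look like this:
cd /opt/cloudera/parcels/CDH/lib/spark/lib
rm -f spark-assembly.jar spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar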
ln -s ../../../jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
ln -s spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar spark-assembly.jar
Check the symlinks:
[root@db1 lib]# ll
total 209204
-rw-r--r-- 1 root root 21645 Dec 3 2015 python.tar.gz
lrwxrwxrwx 1 root root 68 Jan 14 2016 spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> ../../../jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root 54 Jan 14 2016 spark-assembly.jar -> spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root 68 Jan 14 2016 spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar -> ../../../jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
lrwxrwxrwx 1 root root 54 Jan 14 2016 spark-examples.jar -> spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
[root@db1 lib]#
3. Copy the spark-sql launcher script
Copy it from the bin directory of the Spark source tree to the bin directory of CDH's Spark:
scp -P 50201 /data/spark-1.5.0/bin/spark-sql root@180.153.*.*:/opt/cloudera/parcels/CDH/lib/spark/bin
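As noted in the tips at the end, copied files need the right permissions; making the script executable might look like this:
chmod 755 /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql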
4. Configure environment variables
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_CMD=/opt/cloudera/parcels/CDH/bin/hadoop
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SCALA_HOME=/usr/local/scala-2.10.4
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin:$SCALA_HOME/bin
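As in part I, these exports can for example be appended to /etc/profile on the CDH node and loaded with:
source /etc/profile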
5. Copy the assembly jar to HDFS
First the assembly jar needs to be copied into /user/spark/share/lib on HDFS, with its permissions changed to 755:
hadoop fs -put /home/hadoop/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar /user/spark/share/lib
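If /user/spark/share/lib does not exist yet, create it before running the put; the 755 permission mentioned above is then applied with chmod:
hadoop fs -mkdir -p /user/spark/share/lib
hadoop fs -chmod 755 /user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar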
6. Configure in Cloudera Manager (CM)
- Log in to CM and, in Spark's service-wide configuration, set the assembly jar location to the jar's path in HDFS:
/user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
- Modify Spark's advanced configuration snippets:
spark.yarn.jar=hdfs://bestCluster/user/spark/share/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
export HIVE_CONF_DIR=/opt/cloudera/parcels/CDH/lib/hive/conf
- Click Save Changes, then deploy the client configuration.
7. Run spark-sql
Since the environment variables have been configured, spark-sql can be launched from any directory:
[hadoop@db1 ~]$ spark-sql
...
...
16/07/27 16:04:52 INFO metastore: Trying to connect to metastore with URI thrift://nn1.hadoop:9083
16/07/27 16:04:52 INFO metastore: Connected to metastore.
16/07/27 16:04:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/27 16:04:53 INFO SessionState: Created local directory: /tmp/462a9698-5bb6-4d17-bce3-9e162cfd40f8_resources
16/07/27 16:04:53 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8
16/07/27 16:04:53 INFO SessionState: Created local directory: /tmp/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8
16/07/27 16:04:53 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/462a9698-5bb6-4d17-bce3-9e162cfd40f8/_tmp_space.db
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
spark-sql>
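A quick sanity check at the prompt, assuming the Hive metastore is reachable as shown in the log above:
spark-sql> show databases;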
Tips:
1. Give newly created or copied files the proper read/write permissions.
2. Before replacing an existing file, check its owner, any symlinks pointing to it, and other attributes.
That's all!