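Samza is a distributed stream-processing framework that builds on the Kafka message queue to process streaming data in near real time. (Strictly speaking, Samza uses Kafka through a pluggable module, so it can be built on top of other message-queue systems, but Kafka is its starting point and default implementation.)
Apache Kafka is used mainly for messaging.
Apache Hadoop YARN provides fault tolerance, processor isolation, security, and resource management.
This article describes how to install Samza on a 32-bit Ubuntu 14.04 system.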
Prerequisites:
To install and configure Apache Samza, you need the following:
JDK 1.7
Maven
Kafka
YARN
ZooKeeper
# apt-get install curl gem
Download the JDK and set its path:
We need to install the JDK and set up its environment variables.
# cd /usr/java
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.Oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-i586.tar.gz"
# tar xzf jdk-7u79-linux-i586.tar.gz
With the archive extracted, set JAVA_HOME and add it to PATH:
# JAVA_HOME=/usr/java/jdk1.7.0_79
# export JAVA_HOME
# PATH=$JAVA_HOME/bin:$PATH
# export PATH
Add the lines above to ~/.bashrc and to the system-wide bashrc file (/etc/bash.bashrc on Ubuntu) so that they persist across sessions.
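Appending the same export lines every time the setup is repeated would duplicate them in the file. As a minimal sketch (the helper name append_once is hypothetical and not part of any tool mentioned here), the following adds a line to a file only if an identical line is not already present:

```shell
#!/bin/sh
# append_once FILE LINE: add LINE to FILE unless an identical line exists.
# Hypothetical helper for illustration only.
append_once() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example usage (paths taken from the steps above):
# append_once "$HOME/.bashrc" 'export JAVA_HOME=/usr/java/jdk1.7.0_79'
# append_once "$HOME/.bashrc" 'export PATH=$JAVA_HOME/bin:$PATH'
```

Running the helper twice with the same arguments leaves a single copy of the line in the file.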
Install Maven:
Next, download and install Maven.
# wget https://launchpad.net/~bneijt/+archive/ubuntu/ppa/+build/2139203/+files/maven3_3.0.1-0~ppa2_all.deb
# dpkg -i maven3_3.0.1-0~ppa2_all.deb
Check the Maven version:
# mvn3 -version
Apache Maven 3.0.1 (r1038046; 2010-11-23 16:28:32+0530)
Java version: 1.7.0_79
Java home: /usr/java/jdk1.7.0_79/jre
Default locale: en_IN, platform encoding: UTF-8
OS name: "linux" version: "3.8.0-29-generic" arch: "i386" Family: "unix"
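Before moving on, it can help to confirm that every command the later steps rely on is actually on the PATH. The following is a small sketch under that assumption; the function name missing_tools is hypothetical, and the list of commands in the usage comment simply mirrors the tools used in this article:

```shell
#!/bin/sh
# missing_tools CMD...: print each command that is not found on PATH.
# Returns non-zero if at least one command is missing.
# Hypothetical helper for illustration only.
missing_tools() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            status=1
        fi
    done
    return "$status"
}

# Example usage with the tools this guide relies on:
# missing_tools java mvn3 git curl
```

The function prints nothing and returns 0 when all of the named commands are available.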
Install Hello-Samza:
We will install it under the /usr/local directory.
# cd /usr/local
Clone hello-samza into it:
# git clone git://git.apache.org/samza-hello-samza.git hello-samza
The project includes a script named "grid" that handles the hello-samza environment setup for you. You can use it to install Kafka, YARN, and ZooKeeper.
Run the following commands:
# cd /usr/local/hello-samza
root@dev:/usr/local/hello-samza# bin/grid install kafka
EXECUTING: install kafka
Downloading kafka_2.10-0.8.2.1.tgz…
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
15 15.4M 15 2406k 0 0 304k 0 0:00:51 0:00:07 0:00:44 443k
root@dev:/usr/local/hello-samza# bin/grid install yarn
EXECUTING: install yarn
Downloading hadoop-2.6.1.tar.gz…
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
77 187M 77 145M 0 0 239k 0 0:13:23 0:10:22 0:03:01 204k
root@dev:/usr/local/hello-samza# bin/grid install zookeeper
EXECUTING: install zookeeper
Downloading zookeeper-3.4.3.tar.gz…
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
8 15.4M 8 1324k 0 0 212k 0 0:01:14 0:00:06 0:01:08 266k
All of the packages are now in a folder named "deploy" under the hello-samza root directory.
root@dev:/usr/local/hello-samza# cd deploy
root@dev:/usr/local/hello-samza/deploy# ls
kafka yarn zookeeper
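The same check can be scripted instead of reading the ls output by eye. This is only a sketch: check_deploy is a hypothetical helper name, and it assumes the three component directories are named exactly as in the listing above.

```shell
#!/bin/sh
# check_deploy DIR: confirm that DIR contains the kafka, yarn and
# zookeeper subdirectories created by "bin/grid install".
# Hypothetical helper for illustration only.
check_deploy() {
    dir=$1
    for component in kafka yarn zookeeper; do
        if [ ! -d "$dir/$component" ]; then
            echo "not installed: $component"
            return 1
        fi
    done
    echo "all components present"
}

# Example usage:
# check_deploy /usr/local/hello-samza/deploy
```

The function stops at the first missing component, so rerunning it after each install shows what is still left to do.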
Run the bin/grid bootstrap command:
root@dev:/usr/local/hello-samza# bin/grid bootstrap
Download http://repo1.maven.org/maven2/org/fusesource/scalate/scalate-util_2.10/1.6.1/scalate-util_2.10-1.6.1.jar
:samza-yarn_2.10:processResources
:samza-yarn_2.10:classes
:samza-yarn_2.10:lesscss
….
….
BUILD SUCCESSFUL
Total time: 20 mins 32.855 secs
/usr/local/hello-samza
EXECUTING: install zookeeper
Using previously downloaded file /root/.samza/download/zookeeper-3.4.3.tar.gz
EXECUTING: install yarn
Using previously downloaded file /root/.samza/download/hadoop-2.6.1.tar.gz
EXECUTING: install kafka
Using previously downloaded file /root/.samza/download/kafka_2.10-0.8.2.1.tgz
EXECUTING: start zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper … STARTED
EXECUTING: start yarn
starting resourcemanager, logging to /usr/local/hello-samza/deploy/yarn/logs/yarn-root-resourcemanager-dev.out
starting nodemanager, logging to /usr/local/hello-samza/deploy/yarn/logs/yarn-root-nodemanager-dev.out
EXECUTING: start kafka
Once the grid command has finished, you can verify that YARN is installed and running by visiting http://localhost:8088, which shows the YARN UI.
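On a machine without a browser, the same check can be done from the command line with curl, which was installed in the prerequisites. The sketch below polls a URL until it answers; wait_for_url and its default attempt and delay values are assumptions chosen for illustration, not part of the grid script.

```shell
#!/bin/sh
# wait_for_url URL [ATTEMPTS] [DELAY]: poll URL until it responds
# successfully or the attempts run out. Returns 0 on success, 1 otherwise.
# Hypothetical helper for illustration only.
wait_for_url() {
    url=$1
    attempts=${2:-10}
    delay=${3:-3}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s: silent, -f: treat HTTP errors (>= 400) as failure, -o: discard the body
        if curl -s -f -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example usage:
# wait_for_url http://localhost:8088 && echo "YARN UI is up"
```

Polling is used because the ResourceManager may need some time after startup before its web UI accepts connections.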
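Build a Samza job package:
Before you can run a job, you need to build a package for it. YARN uses this package to run the job on the grid.
Note: if you are building the latest version of the hello-samza project, remember to run the following command first.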
root@dev:/usr/local/hello-samza# ./gradlew publishToMavenLocal
You can then run these commands in the hello-samza project:
root@dev:/usr/local/hello-samza# mvn clean package
root@dev:/usr/local/hello-samza# mkdir -p deploy/samza
root@dev:/usr/local/hello-samza# tar -xvf ./target/hello-samza-0.10.0-dist.tar.gz -C deploy/samza
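If the Maven build fails, the archive will not exist and the tar command reports a fairly terse error. As a sketch, the extraction step can be wrapped so that it fails early with a clearer message; extract_dist is a hypothetical helper name, and the archive path in the usage comment is the one from the commands above.

```shell
#!/bin/sh
# extract_dist TARBALL DEST: unpack a gzip-compressed build archive into
# DEST, failing early with a clear message if the archive is missing.
# Hypothetical helper for illustration only.
extract_dist() {
    tarball=$1
    dest=$2
    if [ ! -f "$tarball" ]; then
        echo "archive not found: $tarball" >&2
        return 1
    fi
    mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}

# Example usage:
# extract_dist ./target/hello-samza-0.10.0-dist.tar.gz deploy/samza
```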
Run a Samza job:
After building the Samza package, you can use the run-job.sh script to start a job on the grid:
root@dev:/usr/local/hello-samza # deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
This job consumes a feed of real-time edits from Wikipedia and writes those edits to a topic named "thelinuxfaq-raw".
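The --config-path value in the command above is built from $PWD, so it only resolves correctly when the command is run from the hello-samza root directory. The following sketch builds the same file:// URI from an explicit base directory and checks that the properties file exists first; config_uri is a hypothetical helper name, and the directory layout is the one produced by the build steps above.

```shell
#!/bin/sh
# config_uri BASE NAME: print the file:// URI of the job properties file
# BASE/deploy/samza/config/NAME.properties, or fail if it does not exist.
# Hypothetical helper for illustration only.
config_uri() {
    path="$1/deploy/samza/config/$2.properties"
    if [ ! -f "$path" ]; then
        echo "no such config: $path" >&2
        return 1
    fi
    printf 'file://%s\n' "$path"
}

# Example usage:
# config_uri /usr/local/hello-samza wikipedia-feed
```

The printed URI can then be passed to run-job.sh as the value of --config-path.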
Let the job run for a few minutes, then tail the Kafka topic to see the latest messages:
root@dev:/usr/local/hello-samza# deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic thelinuxfaq-raw
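Visit the YARN UI (http://localhost:8088) again. You should see the Samza job running normally, with no errors reported.
Shut down Samza:
Once you are done, you can use the grid script to stop all of the related services.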
root@dev:/usr/local/hello-samza # bin/grid stop all
Sample output:
EXECUTING: stop all
EXECUTING: stop kafka
EXECUTING: stop yarn
stopping resourcemanager
stopping nodemanager
EXECUTING: stop zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Stopping zookeeper … STOPPED
Start Samza:
Similarly, you can use the grid script to start all of the services:
root@dev:/usr/local/hello-samza # bin/grid start all
Sample output:
EXECUTING: start all
EXECUTING: start zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper … STARTED
EXECUTING: start yarn
….
EXECUTING: start kafka