I. Overview
Besides entering an interactive execution environment with the spark-sql command, Spark SQL can also serve distributed queries over JDBC/ODBC or through a command-line interface. In this mode, end users or applications interact with Spark SQL directly via SQL queries, without writing any Scala code.
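As a quick illustration of the command-line path, the spark-sql shell can also run statements non-interactively; a minimal sketch (the statement and file path are placeholders):
bin/spark-sql -e "SHOW TABLES;"       # run a single statement and exit
bin/spark-sql -f /path/to/query.sql   # run statements from a script file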
II. Using the Thrift JDBC server
Spark version: 1.4.0
YARN version: CDH 5.4.0
1. Preparation
Copy or symlink hive-site.xml into $SPARK_HOME/conf.
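On CDH the Hive client configuration typically lives under /etc/hive/conf; a symlink keeps Spark pointed at the live copy (the source path is an assumption about your layout):
ln -s /etc/hive/conf/hive-site.xml $SPARK_HOME/conf/hive-site.xml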
2. Start the Hive Thrift server with the script under the Spark installation directory. With no arguments it starts in local mode, occupying a single local JVM process:
sbin/start-thriftserver.sh
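The script also accepts Hive properties via --hiveconf, so the listening port and bind host can be overridden at startup; the values below are examples:
sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10001 \
  --hiveconf hive.server2.thrift.bind.host=0.0.0.0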
3. Start in yarn-client mode. In this setup the server listens on port 10001 (set via hive.server2.thrift.port; the stock HiveServer2 default is 10000):
sbin/start-thriftserver.sh --master yarn
Next, looking at the YARN web UI, we can see that 25 containers have been launched.
Why does a single JDBC service occupy this many resources? Because conf/spark-env.sh configures SPARK_EXECUTOR_INSTANCES to 24 executor instances, plus one more container for the application's YARN ApplicationMaster in yarn-client mode:
export SPARK_EXECUTOR_INSTANCES=24
On the YARN NodeManager hosts, the thrift server keeps a resident process named org.apache.spark.executor.CoarseGrainedExecutorBackend, ready to launch tasks for subsequent SQL jobs. The benefit is that Spark SQL queries no longer pay the container startup cost; the price is that while the thrift server is idle, these containers stay allocated and are not released to other Spark or MapReduce jobs.
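If keeping 24 executors resident is too expensive, the pool can instead be sized per launch with the usual spark-submit flags rather than spark-env.sh; the numbers below are only illustrative:
sbin/start-thriftserver.sh \
  --master yarn \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2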
4. Connect to the Spark SQL interactive engine with beeline
bin/beeline -u jdbc:hive2://localhost:10001 -n root -p root
Note: in non-secure Hadoop mode, the username should be the current system user and the password can be empty or any value; in Kerberos-secured Hadoop mode, a valid principal ticket must be presented to log in through beeline.
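On a Kerberos-secured cluster, the server's principal goes into the JDBC URL after obtaining a ticket; the realm and principal below are placeholders for your environment:
kinit someuser@EXAMPLE.COM
bin/beeline -u "jdbc:hive2://localhost:10001/default;principal=hive/_HOST@EXAMPLE.COM"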
III. Command-line help
1. Thrift server
The startup script prints its own usage with --help; it accepts all bin/spark-submit command-line options (such as --master and --conf), plus Hive properties passed as --hiveconf property=value:
sbin/start-thriftserver.sh --help
2. beeline
-u <database url> the JDBC URL to connect to
-n <username> the username to connect as
-p <password> the password to connect as
-d <driver class> the driver class to use
-e <query> query that should be executed
-f <file> script file that should be executed
--hiveconf property=value Use value for given property
--hivevar name=value hive variable name and value
These are Hive-specific settings in which variables
can be set at session level and referenced in Hive
commands or queries.
--color=[true/false] control whether color is used for display
--showHeader=[true/false] show column names in query results
--headerInterval=ROWS the interval between which headers are displayed
--fastConnect=[true/false] skip building table/column list for tab-completion
--autoCommit=[true/false] enable/disable automatic transaction commit
--verbose=[true/false] show verbose error messages and debug info
--showWarnings=[true/false] display connection warnings
--showNestedErrs=[true/false] display nested errors
--numberFormat=[pattern] format numbers using DecimalFormat pattern
--force=[true/false] continue running script even after errors
--maxWidth=MAXWIDTH the maximum width of the terminal
--maxColumnWidth=MAXCOLWIDTH the maximum width to use when displaying columns
--silent=[true/false] be more silent
--autosave=[true/false] automatically save preferences
--outputformat=[table/vertical/csv/tsv] format mode for result display
--isolation=LEVEL set the transaction isolation level
--nullemptystring=[true/false] set to true to get historic behavior of printing null as empty string
--help display this message
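Putting a few of these options together for a non-interactive run, as a sketch (the script path is hypothetical):
bin/beeline -u jdbc:hive2://localhost:10001 -n root \
  --outputformat=csv --silent=true -f /tmp/report.sql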