Flink 侧流输出源码示例深度解析

101次阅读

共计 5121 个字符，预计需要花费 13 分钟才能阅读完成。

导读	这篇文章主要为大家介绍了 Flink 侧流输出源码示例解析，有需要的朋友可以借鉴参考下，希望能够有所帮助，祝大家多多进步，早日升职加薪

Flink 侧流输出源码解析

Flink 的 side output 为我们提供了侧流（分流）输出的功能，根据条件可以把一条流分为多个不同的流，之后做不同的处理逻辑，下面就来看下侧流输出相关的源码。

先来看下面的一个 Demo，一个流被分成了 3 个流，一个主流，两个侧流输出。

 SingleOutputStreamOperator process =
        kafka_source1.process(new ProcessFunction() {
                    @Override
                    public void processElement( 
                            JasonLeePOJO value,
                            ProcessFunction.Context ctx,
                            Collector out)
                            throws Exception {
                        // 这个是主流输出
                        if (value.getName().equals("flink")) {out.collect(value);
                        // 下面两个是测流输出
                        } else if (value.getName().equals("spark")) {ctx.output(test, value);
                        // 测流
                        } else if (value.getName().equals("hadoop")) {ctx.output(test1, value);
                        }
                    }
                });

为了更加清楚的查看每一个算子，我禁用了 operator chain，任务的 DAG 图如下所示：

这样就比较清晰了，很明显从 process 算子开始，1 个数据流分为了 3 个数据流，当然，在默认情况下没有禁止

operator chain 所有的算子都是 chain 在一起的。

源码解析

我们先来看第一个主流输出也就是 out.collect(value) 的源码，这里的 out 实际上是 TimestampedCollector 对象。

TimestampedCollector#collect

 @Override
public void collect(T record) {output.collect(reuse.replace(record));
}

在 collect 方法中持有一个 output 对象，用来输出数据，在这里实际上是一个 CountingOutput 它是一个包装了 Output 的对象，主要用于更新发送数据的 metric，并输出数据。

CountingOutput#collect

 @Override
public void collect(StreamRecord record) {numRecordsOut.inc();
    output.collect(record);
}

在 CountingOutput 中也持有一个 output 对象，但是这里的 output 是 BroadcastingOutputCollector 对象，从名字就可以看出它是往下游广播数据的，这里就有一个疑问？把数据广播到下游，那岂不是下游的每个数据流都有这条数据吗？这样的话是怎么实现分流的呢？带着这个疑问，我们来看 BroadcastingOutputCollector 的 collect 方法是怎么实现的。

BroadcastingOutputCollector#collect

 @Override
public void collect(StreamRecord record) {
    // 这里的 outputs 数组有三个 output 分别对应上面的三个输出流
    for (Output> output : outputs) {output.collect(record);
    }
}

在 BroadcastingOutputCollector 对象里也持有一个 output 对象，其实他们都实现了 Output 接口，用来往下游发送数据，这里的 outputs 是一个 Output 数组，代表了下游的所有 Output，因为上面有三个输出流，所以数组里面就包含了 3 个 Output 对象。

循环的调用 output 的 collect 方法往下游发送数据，因为我打断了 operator chain，所以 process 算子和下游的 Print 算子不在同一个 operatorChain 内，那么上下游算子之间数据传输用的就是 RecordWriterOutput，否则用的是 CopyingChainingOutput 或者 ChainingOutput，具体使用的是哪个 Output 这里就不多介绍了，后面有时间的话会单独介绍。

RecordWriterOutput#collect

 @Override
public void collect(StreamRecord record) {
    // 主流是没有 outputTag 的，只有测流有 outputTag
    if (this.outputTag != null) {
        // we are not responsible for emitting to the main output.
        return;
    }
 
    pushToRecordWriter(record);
}

接着来看 RecordWriterOutput 的 collect 方法，在 collect 方法里面会先判断 outputTag 是否为空，如果不为空不做任何处理，直接返回，否则就把数据推送到下游算子，只有侧流输出才需要定义 outputTag，主流（正常流）是没有 outputTag 的，所以这里会走 pushToRecordWriter 方法把数据写入到下游，也就是说虽然会以广播的形式把数据广播到所有下游，但其实另外两个侧流是直接返回的，只有主流才会把数据推送到下游，这也就解释了上面的疑问。

然后再来看第二个侧流输出 ctx.output(test, value) 的源码，这里的 ctx 实际上是 ProcessOperator#ContextImpl 对象。

ProcessOperator#ContextImpl#output

 @Override
public  void output(OutputTag outputTag, X value) {if (outputTag == null) {throw new IllegalArgumentException("OutputTag must not be null.");
    }
    output.collect(outputTag, new StreamRecord(value, element.getTimestamp()));
}

如果 outputTag 是空，直接抛出异常，因为这个是侧流，所以必须要定义 OutputTag。这里的 output 实际上是父类 AbstractStreamOperator 所持有的变量，如果 outputTag 不为空，就调用 output 的 collect 方法把数据发送到下游，这里的 output 和上面的一样是 CountingOutput 但是 collect 方法是另外一个重载的方法。

CountingOutput#collect

 @Override
public  void collect(OutputTag outputTag, StreamRecord record) {numRecordsOut.inc();
    output.collect(outputTag, record);
}

可以发现，这个 collect 方法比上面那个多了一个 OutputTag 参数，也就是使用侧流输出的时候定义的 OutputTag 对象，然后调用 output 的 collect 方法发送数据，这个也和上面的一样，同样是 BroadcastingOutputCollector 对象的另外一个重载方法，多了一个 OutputTag 参数。

BroadcastingOutputCollector#collect

 @Override
public  void collect(OutputTag outputTag, StreamRecord record) {for (Output> output : outputs) {output.collect(outputTag, record);
    }
}

这里的逻辑和上面是一样的，同样的循环调用 collect 方法发送数据。

RecordWriterOutput#collect

 @Override
public  void collect(OutputTag outputTag, StreamRecord record) {
    // 先要判断两个 OutputTag 是否一样
    if (OutputTag.isResponsibleFor(this.outputTag, outputTag)) {pushToRecordWriter(record);
    }
}

在这个 collect 方法中会先判断传入的 OutputTag 对象和成员变量 this.outputTag 是不是相等，如果是的话，就发送数据，否则不做任何处理，所以这里每次只会选择一个下游侧流输出数据，这样就实现了所谓的分流。

OutputTag#isResponsibleFor

 public static boolean isResponsibleFor(@Nullable OutputTag> owner, @Nonnull OutputTag> other) {return other.equals(owner);
}

可以看到在 isResponsibleFor 方法内是直接调用 OutputTag 的 equals 方法判断两个对象是否相等的。

第三个侧流 test1 ctx.output(test1, value) 和第二个侧流 test 是完全一样的情况，这里就不在看代码了。

上面是完成了分流操作，那怎么获取到分流后结果呢（数据流）？我们可以通过 getSideOutput 方法获取。

 DataStream sideOutput = process.getSideOutput(test);
DataStream sideOutput1 = process.getSideOutput(test1);

getSideOutput 源码

 public  DataStream getSideOutput(OutputTag sideOutputTag) {sideOutputTag = clean(requireNonNull(sideOutputTag));
 
    // make a defensive copy
    sideOutputTag = new OutputTag(sideOutputTag.getId(), sideOutputTag.getTypeInfo());
 
    TypeInformation> type = requestedSideOutputs.get(sideOutputTag);
    if (type != null && !type.equals(sideOutputTag.getTypeInfo())) {
        throw new UnsupportedOperationException(
                "A side output with a matching id was"
                        + "already requested with a different type. This is not allowed, side output"
                        + "ids need to be unique.");
    }
 
    requestedSideOutputs.put(sideOutputTag, sideOutputTag.getTypeInfo());
 
    SideOutputTransformation sideOutputTransformation =
            new SideOutputTransformation(this.getTransformation(), sideOutputTag);
    return new DataStream(this.getExecutionEnvironment(), sideOutputTransformation);
}

getSideOutput 方法里先是构建了一个 SideOutputTransformation 对象，然后又构建了 DataStream 对象，这样我们就可以基于分流后的 DataStream 做不同的处理逻辑了，从而实现了把一个 DataStream 分流成多个 DataStream 功能。

总结

通过对侧流输出的源码进行解析，在分流的时候，数据是通过广播的方式发送到下游算子的，对于主流的数据来说，只有 OutputTag 为空的才会处理，侧流因为 OutputTag 不为空，所以直接返回，不做任何处理，那对于侧流的数据来说，是通过判断两个 OutputTag 是否相等，所以每次只会把数据发送到下游对应的那一个侧流上去，这样即可实现分流逻辑。

阿里云 2 核 2G 服务器 3M 带宽 61 元 1 年，有高配

腾讯云新客低至 82 元 / 年，老客户 99 元 / 年

代金券：在阿里云专用满减优惠券

正文完

星哥玩云-微信公众号

发表至： linux教程

2024-07-24

0

转载说明：除特殊说明外本站文章皆由CC-4.0协议发布，转载请注明出处。

RADIUS服务器的演变过程

在 Ubuntu 16.04 上安装 Bro 网络分析器

shell脚本案例-mysql备份脚本

Linux面试题：怎么清屏？怎么退出当前命令？怎么执行睡眠？怎么查看当前用户 id？查看指定帮助用什么命令？

小白入门之十三：sed命令实现文本处

总说Linux，到底什么是Linux？

威联通（QNAP）使用Docker安装Zdir目录列表程序

Linux运维常用的命令介绍-文本查看命令

修改Nginx配置返回指定content-type的方法详解

Flink 侧流输出源码示例深度解析

开源堡垒机JumpServer配置教程：使用步骤与配置

申请腾讯混元的API Key并且使用LobeChat调用混元AI

【开源安全保护】如何安装JumpServer堡垒机

基于Docker快速搭建一个开源的IT人员在线工具箱-it-tools

Docker部署搭建一个开源强大的图书管理系统

通知：阿里云对象存储OSS价格下调（最高优惠降幅55%）

开源通义万相本地部署方案，文生视频、图生视频、视频生成大模型，支持消费级显卡！

Linux下Docker安装配置

教你如何快速切换Linux PHP版本

在Docker下部署专属的下载神器qBittorrent

	SingleOutputStreamOperator process =
	kafka_source1.process(new ProcessFunction() {
	@Override
	public void processElement(
	JasonLeePOJO value,
	ProcessFunction.Context ctx,
	Collector out)
	throws Exception {
	// 这个是主流输出
	if (value.getName().equals("flink")) {out.collect(value);
	// 下面两个是测流输出
	} else if (value.getName().equals("spark")) {ctx.output(test, value);
	// 测流
	} else if (value.getName().equals("hadoop")) {ctx.output(test1, value);
	}
	}
	});

	@Override
	public void collect(T record) {output.collect(reuse.replace(record));
	}

	@Override
	public void collect(StreamRecord record) {numRecordsOut.inc();
	output.collect(record);
	}

	@Override
	public void collect(StreamRecord record) {
	// 这里的 outputs 数组有三个 output 分别对应上面的三个输出流
	for (Output> output : outputs) {output.collect(record);
	}
	}

	@Override
	public void collect(StreamRecord record) {
	// 主流是没有 outputTag 的，只有测流有 outputTag
	if (this.outputTag != null) {
	// we are not responsible for emitting to the main output.
	return;
	}

	pushToRecordWriter(record);
	}

	@Override
	public void output(OutputTag outputTag, X value) {if (outputTag == null) {throw new IllegalArgumentException("OutputTag must not be null.");
	}
	output.collect(outputTag, new StreamRecord(value, element.getTimestamp()));
	}

	@Override
	public void collect(OutputTag outputTag, StreamRecord record) {numRecordsOut.inc();
	output.collect(outputTag, record);
	}

	@Override
	public void collect(OutputTag outputTag, StreamRecord record) {
	// 先要判断两个 OutputTag 是否一样
	if (OutputTag.isResponsibleFor(this.outputTag, outputTag)) {pushToRecordWriter(record);
	}
	}

	public static boolean isResponsibleFor(@Nullable OutputTag> owner, @Nonnull OutputTag> other) {return other.equals(owner);
	}

	DataStream sideOutput = process.getSideOutput(test);
	DataStream sideOutput1 = process.getSideOutput(test1);

	public DataStream getSideOutput(OutputTag sideOutputTag) {sideOutputTag = clean(requireNonNull(sideOutputTag));

	// make a defensive copy
	sideOutputTag = new OutputTag(sideOutputTag.getId(), sideOutputTag.getTypeInfo());

	TypeInformation> type = requestedSideOutputs.get(sideOutputTag);
	if (type != null && !type.equals(sideOutputTag.getTypeInfo())) {
	throw new UnsupportedOperationException(
	"A side output with a matching id was"
	+ "already requested with a different type. This is not allowed, side output"
	+ "ids need to be unique.");
	}

	requestedSideOutputs.put(sideOutputTag, sideOutputTag.getTypeInfo());

	SideOutputTransformation sideOutputTransformation =
	new SideOutputTransformation(this.getTransformation(), sideOutputTag);
	return new DataStream(this.getExecutionEnvironment(), sideOutputTransformation);
	}