2024-03-27 16:37:06,439 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: ****[4] (3/4) (b32683d65134001706acbda4898a4828_5e1cbbbc1a9055ae0faa9b9a2f8da913_2_0) switched from INITIALIZING to RUNNING.
2024-03-27 16:37:07,878 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 1 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1711528627828 for job 5114bd1336c34f02ea59e8f48f7ad8c8.
2024-03-27 16:37:08,436 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job 5114bd1336c34f02ea59e8f48f7ad8c8 (0 bytes, checkpointDuration=604 ms, finalizationTime=4 ms).
2024-03-27 16:37:12,826 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1711528632825 for job 5114bd1336c34f02ea59e8f48f7ad8c8.
2024-03-27 16:37:12,862 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 2 for job 5114bd1336c34f02ea59e8f48f7ad8c8 (0 bytes, checkpointDuration=35 ms, finalizationTime=2 ms).
2024-03-27 16:37:17,838 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 3 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1711528637825 for job 5114bd1336c34f02ea59e8f48f7ad8c8.
2024-03-27 16:37:25,669 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 3 for job 5114bd1336c34f02ea59e8f48f7ad8c8 (0 bytes, checkpointDuration=7843 ms, finalizationTime=1 ms).
2024-03-27 16:37:26,169 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 4 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1711528646169 for job 5114bd1336c34f02ea59e8f48f7ad8c8.
2024-03-27 16:37:35,268 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 4 for job 5114bd1336c34f02ea59e8f48f7ad8c8 (0 bytes, checkpointDuration=9098 ms, finalizationTime=0 ms).
2024-03-27 16:37:35,782 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 5 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1711528655772 for job 5114bd1336c34f02ea59e8f48f7ad8c8.
2024-03-27 16:37:45,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering Checkpoint 5 for job 5114bd1336c34f02ea59e8f48f7ad8c8 failed due to java.util.concurrent.TimeoutException: Invocation of [RemoteRpcInvocation(TaskExecutorGateway.triggerCheckpoint(ExecutionAttemptID, long, long, CheckpointOptions))] at recipient [akka.tcp://flink@node01:3546/user/rpc/taskmanager_0] timed out. This is usually caused by: 1) Akka failed sending the message silently, due to problems like oversized payload or serialization failures. In that case, you should find detailed error information in the logs. 2) The recipient needs more time for responding, due to problems like slow machines or network jitters. In that case, you can try to increase akka.ask.timeout.
2024-03-27 16:37:45,811 WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 5 for job 5114bd1336c34f02ea59e8f48f7ad8c8. (0 consecutive failed attempts so far)
org.apache.flink.runtime.checkpoint.CheckpointException: Trigger checkpoint failure.
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$triggerCheckpointRequest$10(CheckpointCoordinator.java:706) ~[flink-dist-1.16.1.jar:1.16.1]
    ...
2024-03-27 16:37:47,676 WARN org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 7 for job 5114bd1336c34f02ea59e8f48f7ad8c8. (0 consecutive failed attempts so far)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint Coordinator is suspending.
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1926) ~[flink-dist-1.16.1.jar:1.16.1]
    ...
2024-03-27 16:37:57,219 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late message for now expired checkpoint attempt 7 from task b32683d65134001706acbda4898a4828_5e1cbbbc1a9055ae0faa9b9a2f8da913_3_0 of job 5114bd1336c34f02ea59e8f48f7ad8c8 at container_1710904673777_0069_01_000002 @ node01 (dataPort=14516).
2024-03-27 17:38:00,868 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No checkpoint found during restore.
...
2024-03-27 17:38:07,983 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: ****[4] (2/4) (25fd5975f0d7c9d7351722295793fee4_5e1cbbbc1a9055ae0faa9b9a2f8da913_1_0) switched from INITIALIZING to RUNNING.
2024-03-27 17:38:49,552 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job **** (5721b22dfe821f225c9d5b06996b7ebb) switched from state RUNNING to CANCELLING.
...
2024-03-27 17:38:59,415 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: ****M[4] (2/4) (25fd5975f0d7c9d7351722295793fee4_5e1cbbbc1a9055ae0faa9b9a2f8da913_1_0) switched from CANCELING to CANCELED.
...
2024-03-27 17:40:24,092 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Heartbeat of TaskManager with id container_1710904673777_0074_01_000002(node01:6683) timed out.
...
2024-03-27 17:40:24,106 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 5721b22dfe821f225c9d5b06996b7ebb reached terminal state CANCELED.
2024-03-27 17:40:24,494 INFO org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application CANCELED:
java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
    at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$7(ApplicationDispatcherBootstrap.java:403) ~[flink-dist-1.16.1.jar:1.16.1]
    ...
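In fact, a careful read of the log does turn up the relevant hint: "Heartbeat of TaskManager with id container_1710904673777_0074_01_000002(node01:6683) timed out." However, this line is a plain INFO message rather than a thrown exception with a stack trace, so it was overlooked among the many other log entries, and troubleshooting took much longer than it should have.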
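One way to avoid missing such a line again is to scan the JobManager log for INFO entries whose message text still looks like a failure. The following is only a minimal sketch, not part of the original investigation: the function name find_quiet_failures, the keyword list, and the sample lines are illustrative assumptions, and the regular expression assumes the "timestamp LEVEL logger [] - message" layout seen in the log above.

```python
import re

# Assumed entry layout, taken from the log above:
# "yyyy-MM-dd HH:mm:ss,SSS LEVEL logger [] - message"
ENTRY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+\[\] - (?P<msg>.*)$"
)

# Hypothetical keyword list; extend it to match your own failure patterns.
SUSPICIOUS = re.compile(r"timed out|timeout|lost|unreachable", re.IGNORECASE)


def find_quiet_failures(lines):
    """Return (timestamp, logger, message) for INFO entries that look like failures."""
    hits = []
    for line in lines:
        m = ENTRY.match(line.rstrip("\n"))
        if m is None:
            continue  # stack-trace lines, "..." markers, and other non-entry lines
        if m.group("level") == "INFO" and SUSPICIOUS.search(m.group("msg")):
            hits.append((m.group("ts"), m.group("logger"), m.group("msg")))
    return hits


# Sample lines copied from the log above (the second one is shortened).
SAMPLE = [
    "2024-03-27 16:37:08,436 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - "
    "Completed checkpoint 1 for job 5114bd1336c34f02ea59e8f48f7ad8c8 "
    "(0 bytes, checkpointDuration=604 ms, finalizationTime=4 ms).",
    "2024-03-27 16:37:57,219 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - "
    "Received late message for now expired checkpoint attempt 7",
    "2024-03-27 17:40:24,092 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - "
    "Heartbeat of TaskManager with id container_1710904673777_0074_01_000002(node01:6683) timed out.",
]

for ts, logger, msg in find_quiet_failures(SAMPLE):
    print(ts, logger, msg)
```

To run it against a real file, pass an open file handle instead of SAMPLE, for example find_quiet_failures(open(path)) where path points at your JobManager log. The filter deliberately looks only at INFO entries, since WARN and ERROR entries are already easy to spot.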