Running a Spark job on YARN: the NodeManager's containers keep getting killed

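I submitted a Spark job on YARN. Every few hours one of the NodeManagers loses contact with the ResourceManager. After a restart, it goes down again a few hours later. The log of the NodeManager that lost contact shows the following: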

First, memory usage messages like this one are logged continuously for several hours:
Memory usage of ProcessTree 17575 for container-id container_1555224787985_0001_01_000002: 471.1 MB of 4.5 GB physical memory used; 6.1 GB of 18 GB virtual memory used
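
To make lines like the one above easier to compare across containers, here is a minimal Python sketch that parses a ContainersMonitorImpl memory line and reports how much of each limit is in use. The function and field names (`parse_memory_line`, `pmem_fraction`, and so on) are made up for illustration, and the regular expression is only written against the message format quoted in this thread.

```python
import re

# Pattern written against the "Memory usage of ProcessTree ..." message quoted above.
MEM_LINE = re.compile(
    r"Memory usage of ProcessTree (?P<pid>\d+) for container-id "
    r"(?P<cid>container_\w+): "
    r"(?P<pu>[\d.]+) (?P<pu_unit>[KMG]B) of "
    r"(?P<pl>[\d.]+) (?P<pl_unit>[KMG]B) physical memory used; "
    r"(?P<vu>[\d.]+) (?P<vu_unit>[KMG]B) of "
    r"(?P<vl>[\d.]+) (?P<vl_unit>[KMG]B) virtual memory used"
)

UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def to_mb(value, unit):
    """Convert a number with a KB/MB/GB suffix to megabytes."""
    return float(value) * UNIT_TO_MB[unit]


def parse_memory_line(line):
    """Return usage and limits in MB for one log line, or None if it does not match."""
    m = MEM_LINE.search(line)
    if m is None:
        return None
    pmem_used = to_mb(m["pu"], m["pu_unit"])
    pmem_limit = to_mb(m["pl"], m["pl_unit"])
    vmem_used = to_mb(m["vu"], m["vu_unit"])
    vmem_limit = to_mb(m["vl"], m["vl_unit"])
    return {
        "pid": int(m["pid"]),
        "container_id": m["cid"],
        "pmem_used_mb": pmem_used,
        "pmem_limit_mb": pmem_limit,
        "vmem_used_mb": vmem_used,
        "vmem_limit_mb": vmem_limit,
        "pmem_fraction": pmem_used / pmem_limit,
        "vmem_fraction": vmem_used / vmem_limit,
    }


if __name__ == "__main__":
    sample = (
        "Memory usage of ProcessTree 17575 for container-id "
        "container_1555224787985_0001_01_000002: 471.1 MB of 4.5 GB physical "
        "memory used; 6.1 GB of 18 GB virtual memory used"
    )
    print(parse_memory_line(sample))
```

For the sample line, physical usage is roughly a tenth of the limit and virtual usage about a third, so this particular sample is well below both ceilings.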

Then it waits for the containers to be killed: Waiting for containers to be killed

Then the container goes from RUNNING to KILLING: Container container_1555224787985_0001_01_000002 transitioned from RUNNING to KILLING

Then the kill completes: Container Finished - Killed
Finally, the NodeManager loses contact with the ResourceManager.

Parameters I have adjusted:
yarn.scheduler.maximum-allocation-mb=20G
yarn.nodemanager.resource.memory-mb=15G
yarn.nodemanager.vmem-check-enabled=false
yarn.nodemanager.vmem-pmem-ratio=4
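
As a sanity check on these settings: YARN derives a container's virtual-memory ceiling by multiplying its physical-memory limit by yarn.nodemanager.vmem-pmem-ratio. The short Python sketch below redoes that arithmetic with the numbers from the logs in this thread; the helper names are hypothetical and only serve the illustration.

```python
def vmem_limit_gb(pmem_limit_gb, vmem_pmem_ratio):
    """Virtual-memory ceiling: physical limit multiplied by the vmem/pmem ratio."""
    return pmem_limit_gb * vmem_pmem_ratio


def implied_ratio(pmem_limit_gb, logged_vmem_limit_gb):
    """Ratio implied by a logged pair of physical and virtual limits."""
    return logged_vmem_limit_gb / pmem_limit_gb


if __name__ == "__main__":
    # First log line: "4.5 GB physical ... 18 GB virtual" with the ratio set to 4.
    print(vmem_limit_gb(4.5, 4))
    # Later excerpt: "4 GB physical ... 10 GB virtual".
    print(implied_ratio(4, 10))
```

The first log line (4.5 GB physical, 18 GB virtual) is consistent with a ratio of 4. The later excerpt (4 GB physical, 10 GB virtual) would correspond to a ratio of 2.5, which suggests the two excerpts were captured under different settings.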

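There are 5 nodes, each with 32 GB of memory. I have tuned all of the parameters above and it still fails. I don't know what the problem is; I'd appreciate help from anyone who knows.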

过往记忆


It would be best to provide the complete log.

linefly - Studies big data and is keenly interested in it


2019-04-18 12:08:57,871 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6112 for container-id container_1555410826123_0001_02_000004: 472.4 MB of 4 GB physical memory used; 4.0 GB of 10 GB virtual memory used
2019-04-18 12:08:57,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6027 for container-id container_1555410826123_0001_02_000001: 393.7 MB of 4 GB physical memory used; 2.4 GB of 10 GB virtual memory used
2019-04-18 12:08:57,883 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6111 for container-id container_1555410826123_0001_02_000002: 465.9 MB of 4 GB physical memory used; 4.0 GB of 10 GB virtual memory used
2019-04-18 12:09:00,889 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6341 for container-id container_1555410826123_0001_02_000007: 465.8 MB of 4 GB physical memory used; 4.0 GB of 10 GB virtual memory used
2019-04-18 12:09:00,896 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6198 for container-id container_1555410826123_0001_02_000005: 477.7 MB of 4 GB physical memory used; 4.0 GB of 10 GB virtual memory used
2019-04-18 12:09:00,902 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6197 for container-id container_1555410826123_0001_02_000006: 464.5 MB of 4 GB physical memory used; 4.0 GB of 10 GB virtual memory used
2019-04-18 12:09:00,908 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6110 for container-id container_1555410826123_0001_02_000003: 464.4 MB of 4 GB physical memory used; 4.0 GB of 10 GB virtual memory used
2019-04-18 12:09:00,914 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6112 for container-id container_1555410826123_0001_02_000004: 472.3 MB of 4 GB physical memory used; 4.0 GB of 10 GB virtual memory used
2019-04-18 12:09:00,929 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6027 for container-id container_1555410826123_0001_02_000001: 393.9 MB of 4 GB physical memory used; 2.4 GB of 10 GB virtual memory used
2019-04-18 12:09:00,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6111 for container-id container_1555410826123_0001_02_000002: 465.8 MB of 4 GB physical memory used; 4.0 GB of 10 GB virtual memory used
2019-04-18 12:09:01,869 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000002 transitioned from RUNNING to KILLING
2019-04-18 12:09:01,869 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000003 transitioned from RUNNING to KILLING
2019-04-18 12:09:01,869 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000004 transitioned from RUNNING to KILLING
2019-04-18 12:09:01,869 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000005 transitioned from RUNNING to KILLING
2019-04-18 12:09:01,869 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000006 transitioned from RUNNING to KILLING
2019-04-18 12:09:01,869 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000007 transitioned from RUNNING to KILLING
2019-04-18 12:09:01,869 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1555410826123_0001_02_000002
2019-04-18 12:09:01,877 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1555410826123_0001_02_000002 is : 143
2019-04-18 12:09:01,889 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1555410826123_0001_02_000003
2019-04-18 12:09:01,896 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1555410826123_0001_02_000003 is : 143
2019-04-18 12:09:01,907 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1555410826123_0001_02_000004
2019-04-18 12:09:01,914 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1555410826123_0001_02_000004 is : 143
2019-04-18 12:09:01,924 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1555410826123_0001_02_000005
2019-04-18 12:09:01,933 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1555410826123_0001_02_000005 is : 143
2019-04-18 12:09:01,941 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1555410826123_0001_02_000006
2019-04-18 12:09:01,947 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1555410826123_0001_02_000006 is : 143
2019-04-18 12:09:01,959 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1555410826123_0001_02_000007
2019-04-18 12:09:01,965 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1555410826123_0001_02_000007 is : 143
2019-04-18 12:09:01,976 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000002 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2019-04-18 12:09:01,976 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000003 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2019-04-18 12:09:01,976 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000004 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2019-04-18 12:09:01,976 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000005 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2019-04-18 12:09:01,976 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000006 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2019-04-18 12:09:01,976 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000007 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2019-04-18 12:09:01,977 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/hdfs/appcache/application_1555410826123_0001/container_1555410826123_0001_02_000002
2019-04-18 12:09:01,977 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/hdfs/appcache/application_1555410826123_0001/container_1555410826123_0001_02_000003
2019-04-18 12:09:01,977 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/hdfs/appcache/application_1555410826123_0001/container_1555410826123_0001_02_000004
2019-04-18 12:09:01,978 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/hdfs/appcache/application_1555410826123_0001/container_1555410826123_0001_02_000005
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs    OPERATION=Container Finished - Killed    TARGET=ContainerImpl    RESULT=SUCCESS    APPID=application_1555410826123_0001    CONTAINERID=container_1555410826123_0001_02_000002
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000002 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs    OPERATION=Container Finished - Killed    TARGET=ContainerImpl    RESULT=SUCCESS    APPID=application_1555410826123_0001    CONTAINERID=container_1555410826123_0001_02_000003
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000003 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs    OPERATION=Container Finished - Killed    TARGET=ContainerImpl    RESULT=SUCCESS    APPID=application_1555410826123_0001    CONTAINERID=container_1555410826123_0001_02_000004
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000004 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/hdfs/appcache/application_1555410826123_0001/container_1555410826123_0001_02_000006
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs    OPERATION=Container Finished - Killed    TARGET=ContainerImpl    RESULT=SUCCESS    APPID=application_1555410826123_0001    CONTAINERID=container_1555410826123_0001_02_000005
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000005 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs    OPERATION=Container Finished - Killed    TARGET=ContainerImpl    RESULT=SUCCESS    APPID=application_1555410826123_0001    CONTAINERID=container_1555410826123_0001_02_000006
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000006 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs    OPERATION=Container Finished - Killed    TARGET=ContainerImpl    RESULT=SUCCESS    APPID=application_1555410826123_0001    CONTAINERID=container_1555410826123_0001_02_000007
2019-04-18 12:09:01,979 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1555410826123_0001_02_000007 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1555410826123_0001_02_000002 from application application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555410826123_0001_02_000002 for log-aggregation
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1555410826123_0001_02_000002
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1555410826123_0001_02_000003 from application application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555410826123_0001_02_000003 for log-aggregation
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1555410826123_0001_02_000003
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1555410826123_0001_02_000004 from application application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555410826123_0001_02_000004 for log-aggregation
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1555410826123_0001_02_000004
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1555410826123_0001_02_000005 from application application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /yarn/nm/usercache/hdfs/appcache/application_1555410826123_0001/container_1555410826123_0001_02_000007
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555410826123_0001_02_000005 for log-aggregation
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1555410826123_0001_02_000005
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1555410826123_0001_02_000006 from application application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555410826123_0001_02_000006 for log-aggregation
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1555410826123_0001_02_000006
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1555410826123_0001_02_000007 from application application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1555410826123_0001_02_000007 for log-aggregation
2019-04-18 12:09:01,980 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1555410826123_0001
2019-04-18 12:09:01,980 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1555410826123_0001_02_000007
2019-04-18 12:09:01,981 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1555410826123_0001_02_000007, container_1555410826123_0001_02_000005, container_1555410826123_0001_02_000006, container_1555410826123_0001_02_000003, container_1555410826123_0001_02_000004, container_1555410826123_0001_02_000002]
2019-04-18 12:09:03,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555410826123_0001_02_000002
2019-04-18 12:09:03,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555410826123_0001_02_000003
2019-04-18 12:09:03,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555410826123_0001_02_000004
2019-04-18 12:09:03,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555410826123_0001_02_000005
2019-04-18 12:09:03,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555410826123_0001_02_000006
2019-04-18 12:09:03,935 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1555410826123_0001_02_000007
2019-04-18 12:09:03,941 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6027 for container-id container_1555410826123_0001_02_000001: 394.1 MB of 4 GB physical memory used; 2.4 GB of 10 GB virtual memory used
2019-04-18 12:09:06,947 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6027 for container-id container_1555410826123_0001_02_000001: 394.1 MB of 4 GB physical memory used; 2.4 GB of 10 GB virtual memory used
2019-04-18 12:09:09,952 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6027 for container-id container_1555410826123_0001_02_000001: 394.1 MB of 4 GB physical memory used; 2.4 GB of 10 GB virtual memory used
2019-04-18 12:09:12,958 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6027 for container-id container_1555410826123_0001_02_000001: 394.1 MB of 4 GB physical memory used; 2.4 GB of 10 GB virtual memory used
2019-04-18 12:09:15,964 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6027 for container-id container_1555410826123_0001_02_000001: 394.1 MB of 4 GB physical memory used; 2.4 GB of 10 GB virtual memory used
2019-04-18 12:09:18,970 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6027 for container-id container_1555410826123_0001_02_000001: 394.1 MB of 4 GB physical memory used; 2.4 GB of 10 GB virtual memory used
2019-04-18 12:09:21,976 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 6027 for container-id container_1555410826123_0001_02_000001: 394.1 MB of 4 GB physical memory used; 2.4 GB of 10 GB virtual memory used
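The above is part of the NodeManager log from around the failure. I'm using the CDH 5.16.1 Parcel on CentOS 7.0. One of the NodeManagers loses contact every few hours or every half day with this problem. Since we are still in the testing phase, the data volume is small.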
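
One detail in the excerpt worth decoding is the container exit code 143. By the usual shell convention, an exit code above 128 means the process was ended by signal number (code - 128), so 143 maps to signal 15, SIGTERM. That indicates the containers were asked to terminate; it does not by itself say who sent the signal or why. A small Python sketch of the decoding (the function name is hypothetical):

```python
import signal


def signal_from_exit_code(exit_code):
    """Return the signal name if exit_code follows the 128 + N convention, else None."""
    if exit_code <= 128:
        return None
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        # The remainder is not a signal number known on this platform.
        return None


if __name__ == "__main__":
    print(signal_from_exit_code(143))
```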
