How to keep saveAsHadoopFile from generating empty files when sc reads no data

The scenario: every minute a Spark job checks trip data in Redis and writes it to HDFS. When there happens to be no data, an empty file still gets generated. Is there a good way to avoid this?
The RDDs are genuinely empty in those batches. I timed the job with and without an empty check. Besides checking for emptiness, is there a better approach?

过往记忆


You can check whether the RDD is empty; take a look at the isEmpty() function. Its implementation:
  /**
   * @note Due to complications in the internal implementation, this method will raise an
   * exception if called on an RDD of `Nothing` or `Null`. This may come up in practice
   * because, for example, the type of `parallelize(Seq())` is `RDD[Nothing]`.
   * (`parallelize(Seq())` should be avoided anyway in favor of `parallelize(Seq[T]())`.)
   * @return true if and only if the RDD contains no elements at all. Note that an RDD
   * may be empty even when it has at least 1 partition.
   */
  def isEmpty(): Boolean = withScope {
    partitions.length == 0 || take(1).length == 0
  }
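
A minimal guard built on this might look as follows. This is a sketch, not the asker's actual job: `sc`, `rdd`, and `outputPath` are placeholders for the per-minute trip data and HDFS target.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SaveIfNonEmpty {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-if-non-empty"))

    // Placeholder for the RDD built from Redis each minute; empty in this sketch.
    val rdd = sc.parallelize(Seq[String]())
    val outputPath = "hdfs:///tmp/trips/201812281822" // hypothetical target path

    // Skip the save entirely when there is nothing to write, so no
    // empty part files are created. The same guard works in front of
    // saveAsHadoopFile / saveAsNewAPIHadoopFile.
    if (!rdd.isEmpty()) {
      rdd.saveAsTextFile(outputPath)
    }
    sc.stop()
  }
}
```

Note that isEmpty() is not free: as the implementation above shows, it may run take(1), which launches a Spark job.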

machuan


The runtime goes up because the take(1).length inside isEmpty triggers an actual Spark job whose result is collected back to the driver, so every per-minute batch pays the cost of one extra job. That is why the check is slow.
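
One way to avoid both the empty files and the extra take(1) job, if it fits the pipeline, is Hadoop's LazyOutputFormat, which defers creating a part file until a partition actually writes its first record; an entirely empty RDD then leaves no part files behind. The sketch below is an untested illustration using the new Hadoop API; `records` and `outputPath` are hypothetical names, and the output directory itself (and a _SUCCESS marker) may still be created.

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{LazyOutputFormat, TextOutputFormat}
import org.apache.spark.rdd.RDD

// records / outputPath stand in for the per-minute trip data and HDFS target.
def save(records: RDD[String], outputPath: String): Unit = {
  val job = Job.getInstance(records.sparkContext.hadoopConfiguration)
  // Wrap TextOutputFormat so the RecordWriter (and hence the part file)
  // is only created when the first record is written.
  LazyOutputFormat.setOutputFormatClass(job, classOf[TextOutputFormat[NullWritable, Text]])
  records
    .map(line => (NullWritable.get(), new Text(line)))
    .saveAsNewAPIHadoopFile(
      outputPath,
      classOf[NullWritable],
      classOf[Text],
      classOf[LazyOutputFormat[NullWritable, Text]],
      job.getConfiguration)
}
```

Unlike the isEmpty() guard, this adds no extra job per batch; the trade-off is that it prevents empty part files rather than the save call itself.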


