
Spark Journal: Return Multiple DataFrames from a Scala Method

Until now, I have kept these posts limited to Spark. But since Scala is one of the main languages used with the Spark framework, I will start using both the Spark API and the Scala language to showcase some interesting use cases.

This time, the task at hand was to return multiple DataFrames from a Scala method. I have returned values of types such as Int, String, and DataFrame before, but always as a single value per method.
My colleague and architect helped me here by showing different options for doing this easily.

Note: Before reading further, I recommend going through this post on Stack Overflow; it will help clarify the conceptual difference between List and Tuple in Scala.
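As a quick refresher on that difference, here is a minimal plain-Scala sketch (no Spark needed). The value names are made up for illustration only:

```scala
// A List is homogeneous: every element shares one static type,
// and the number of elements is not part of the type.
val names: List[String] = List("abc", "def", "ghi")
val second: String = names(1)               // positional access, zero-based
val longer: List[String] = names :+ "jkl"   // appending keeps the type List[String]

// A Tuple has a fixed number of positions, each with its own type.
val record: (Int, String, Boolean) = (1, "abc", true)
val (id, name, active) = record             // unpack all positions into named values
val firstField: Int = record._1             // tuple accessors are one-based
```

The practical consequence for this post: a List suits "some number of things of the same type", while a Tuple suits "exactly N things, possibly of different types".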

Approach 1
Using List as the return value

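import org.apache.spark.sql.DataFrame
// Needed for .toDF outside spark-shell; assumes a SparkSession named `spark` is in scope
import spark.implicits._

// Builds two DataFrames with different schemas and returns them in a List
def returnMultipleDf: List[DataFrame] = {
    val dataList1 = List((1, "abc"), (2, "def"))
    val df1 = dataList1.toDF("id", "Name")

    val dataList2 = List((3, "ghi", "home"), (4, "jkl", "ctrl"))
    val df2 = dataList2.toDF("id", "Name", "Type")

    List(df1, df2)
}

// Elements are retrieved by position
val dfList = returnMultipleDf
val dataFrame1 = dfList(0)
val dataFrame2 = dfList(1)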

dataFrame2.show

+---+----+----+
| id|Name|Type|
+---+----+----+
|  3| ghi|home|
|  4| jkl|ctrl|
+---+----+----+

Approach 2
Using Tuple as the return value

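import org.apache.spark.sql.DataFrame
// Needed for .toDF outside spark-shell; assumes a SparkSession named `spark` is in scope
import spark.implicits._

// Builds two DataFrames with different schemas and returns them as a Tuple
def returnMultipleDf: (DataFrame, DataFrame) = {
    val dataList1 = List((1, "abc"), (2, "def"))
    val df1 = dataList1.toDF("id", "Name")

    val dataList2 = List((3, "ghi", "home"), (4, "jkl", "ctrl"))
    val df2 = dataList2.toDF("id", "Name", "Type")

    (df1, df2)
}

// Unpack both DataFrames into named values in one step
val (df1, df2) = returnMultipleDf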

df2.show

+---+----+----+
| id|Name|Type|
+---+----+----+
|  3| ghi|home|
|  4| jkl|ctrl|
+---+----+----+

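I personally prefer Approach 2. With a Tuple, the number of returned values is fixed by the method signature, each position can have its own type, and the caller can unpack the result into named values in one line. A List, by comparison, requires every element to share one type and is read by index.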
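One place where a Tuple's flexibility shows up is when the method needs to return a DataFrame together with something that is not a DataFrame. The following is a minimal sketch under stated assumptions: it creates a local SparkSession for illustration (in spark-shell, `spark` already exists), and the method name `dfWithRowCount` is hypothetical, not something from the approaches above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session for illustration; getOrCreate reuses an existing one if present
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("tuple-return-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: returns a DataFrame together with its row count.
// A List[DataFrame] could not carry the Long alongside the DataFrame.
def dfWithRowCount: (DataFrame, Long) = {
  val df = List((1, "abc"), (2, "def")).toDF("id", "Name")
  (df, df.count())
}

// Each position is unpacked with its own static type
val (namesDf, rowCount) = dfWithRowCount
println(s"rows: $rowCount")
```

Here `namesDf` is typed as a DataFrame and `rowCount` as a Long, so the compiler checks both uses without any casting.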