Comment
Author: Admin | 2025-04-28
Run a sample Spark application in all four languages supported inside Synapse: PySpark (Python), Spark (Scala), SparkSQL, and .NET for Spark (C#).

1. Create a GPU-enabled pool.
2. Create a notebook and attach it to the GPU-enabled pool you created in the first step.
3. Set the configurations as explained in the previous section (see the configuration sketch below).
4. Create a sample dataframe by copying the code below into the first cell of your notebook.

Scala:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.Row
import scala.collection.JavaConversions._

val schema = StructType(Array(
  StructField("emp_id", IntegerType),
  StructField("name", StringType),
  StructField("emp_dept_id", IntegerType),
  StructField("salary", IntegerType)
))

val emp = Seq(
  Row(1, "Smith", 10, 100000),
  Row(2, "Rose", 20, 97600),
  Row(3, "Williams", 20, 110000),
  Row(4, "Jones", 10, 80000),
  Row(5, "Brown", 40, 60000),
  Row(6, "Brown", 30, 78000)
)

val empDF = spark.createDataFrame(emp, schema)
```

Python:

```python
emp = [(1, "Smith", 10, 100000),
       (2, "Rose", 20, 97600),
       (3, "Williams", 20, 110000),
       (4, "Jones", 10, 80000),
       (5, "Brown", 40, 60000),
       (6, "Brown", 30, 78000)]

empColumns = ["emp_id", "name", "emp_dept_id", "salary"]

empDF = spark.createDataFrame(data=emp, schema=empColumns)
```

C#:

```csharp
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

var emp = new List<GenericRow>
{
    new GenericRow(new object[] { 1, "Smith", 10, 100000 }),
    new GenericRow(new object[] { 2, "Rose", 20, 97600 }),
    new GenericRow(new object[] { 3, "Williams", 20, 110000 }),
    new GenericRow(new object[] { 4, "Jones", 10, 80000 }),
    new GenericRow(new object[] { 5, "Brown", 40, 60000 }),
    new GenericRow(new object[] { 6, "Brown", 30, 78000 })
};

var schema = new StructType(new List<StructField>()
{
    new StructField("emp_id", new IntegerType()),
    new StructField("name", new StringType()),
    new StructField("emp_dept_id", new IntegerType()),
    new StructField("salary", new IntegerType())
});

DataFrame empDF = spark.CreateDataFrame(emp, schema);
```

Now let's run an aggregation that gets the maximum salary per department ID and display the result (a SparkSQL version of the same query is sketched below).

Scala:

```scala
val resultDF = empDF.groupBy("emp_dept_id").max("salary")
resultDF.show()
```

Python:

```python
resultDF = empDF.groupBy("emp_dept_id").max("salary")
resultDF.show()
```

C#:

```csharp
DataFrame resultDF = empDF.GroupBy("emp_dept_id").Max("salary");
resultDF.Show();
```

You can see which operations in your query ran on GPUs by inspecting the SQL plan in the Spark History Server.

How to tune your application for GPUs

Most Spark jobs can see improved performance through tuning configuration settings from defaults, and the same holds true for jobs leveraging GPU-enabled pools.
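Step 3 refers to configurations "explained in the previous section," which isn't included in this comment, and the tuning paragraph above is where the text leaves off. Purely as an illustration of the kind of settings involved, here is a minimal PySpark sketch assuming the pool uses the NVIDIA RAPIDS Accelerator for Apache Spark; spark.rapids.sql.enabled and spark.rapids.sql.concurrentGpuTasks are standard RAPIDS configuration keys, but whether these are the exact settings the referenced section describes is an assumption:

```python
# Sketch (assumption): enable the RAPIDS SQL plugin so eligible operators run on the GPU.
# spark.rapids.sql.enabled is a standard RAPIDS Accelerator runtime setting.
spark.conf.set("spark.rapids.sql.enabled", "true")

# Sketch (assumption): a common RAPIDS tuning knob controlling how many tasks share the
# GPU concurrently; the best value depends on the workload and available GPU memory.
spark.conf.set("spark.rapids.sql.concurrentGpuTasks", "2")
```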
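The walkthrough lists SparkSQL among the four supported languages, but the snippets above only cover Scala, Python, and C#. As a minimal sketch, the same maximum-salary-per-department aggregation can be expressed in Spark SQL from PySpark by registering the dataframe as a temporary view; the view name "emp" and the alias "max_salary" are arbitrary choices, not taken from the original article:

```python
# Register the dataframe created above as a temporary view so it can be queried with SQL.
empDF.createOrReplaceTempView("emp")

# Same aggregation as above, expressed in Spark SQL: maximum salary per department ID.
sqlResultDF = spark.sql("""
    SELECT emp_dept_id, MAX(salary) AS max_salary
    FROM emp
    GROUP BY emp_dept_id
""")
sqlResultDF.show()
```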