
🔓 Synapse / PySpark problems and solutions

SQL Server to PySpark in Synapse

To create a DataFrame in PySpark that reads data from a table in SQL Server using Synapse, you can use the following code as an example:

Python
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Read from SQL Server table") \
    .getOrCreate()

# Create a DataFrame that reads data from the specified SQL Server table
table_name = "[dbo].[your_table_name]"
url = "jdbc:sqlserver://your_server_name.database.windows.net:1433;database=your_database_name"
properties = {"user": "your_username", "password": "your_password", "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
df = spark.read.jdbc(url=url, table=table_name, properties=properties)

# Show the DataFrame
df.show()

Adjust the user, password, table name, server name and database name to match your environment. You also need the Microsoft SQL Server JDBC driver (mssql-jdbc) on the Spark pool's classpath. It is a JAR rather than a pip package and is typically already available on Synapse Spark pools; if it is missing, add it as a workspace package or pull it in through spark.jars.packages, as in the sketch below.
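As an alternative to spark.read.jdbc, here is a minimal sketch using the generic JDBC source with a pushed-down query. The server, database, credentials, column names and the spark.jars.packages coordinate (only needed when the driver JAR is not already on the pool; the version shown is just an example) are placeholders:

Python
from pyspark.sql import SparkSession

# Build a session; spark.jars.packages is only needed if the mssql-jdbc
# driver JAR is not already available on the Spark pool (version is an example).
spark = SparkSession.builder \
    .appName("Read from SQL Server query") \
    .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:12.4.2.jre11") \
    .getOrCreate()

# Push a query down to SQL Server instead of reading the whole table
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://your_server_name.database.windows.net:1433;database=your_database_name")
    .option("query", "SELECT col1, col2 FROM dbo.your_table_name WHERE col2 IS NOT NULL")
    .option("user", "your_username")
    .option("password", "your_password")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

df.show()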

Check file exists in PySpark or Python

In PySpark (or plain Python), you can use the os.path module from the Python standard library to check whether a file exists. Note that this only sees the local filesystem of the node running your code; it will not see paths in ADLS or HDFS. Here is an example using the os.path.exists() function:

Python
import os

file_path = "/path/to/your/file"

if os.path.exists(file_path):
    print("File exists.")
else:
    print("File does not exist.")

Alternatively, you can use the os.path.isfile() function to check whether a path points to a regular file (and not, for example, a directory).

Python
if os.path.isfile(file_path):
    print("File exists.")
else:
    print("File does not exist.")

Note that in both cases, file_path should be a string representing the path to the file you want to check.

You can also use spark.sql to check whether a table exists in the metastore: spark.sql("SHOW PARTITIONS tablename") or spark.sql("DESCRIBE FORMATTED tablename") raises an AnalysisException when the table does not exist, which you can catch.

Python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("CheckTableExists").getOrCreate()

table_name = "your_database.your_table"

try:
    spark.sql(f"DESCRIBE FORMATTED {table_name}").collect()
    print("Table exists.")
except AnalysisException:
    print("Table does not exist.")

This approach works for tables registered in the metastore (for example Spark or lake database tables backed by files), not for arbitrary file paths.
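To check an arbitrary file path in the Hadoop-compatible filesystem (HDFS, ADLS, and so on) rather than a metastore table, one option is the Hadoop FileSystem API called through Spark's JVM gateway. A minimal sketch, assuming a placeholder path; note that _jsc and _jvm are internal attributes of the Python API:

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CheckFileExists").getOrCreate()

file_path = "abfss://container@yourstorageaccount.dfs.core.windows.net/path/to/your/file"

# Resolve the Hadoop FileSystem for the path and ask it directly
hadoop_conf = spark._jsc.hadoopConfiguration()
path = spark._jvm.org.apache.hadoop.fs.Path(file_path)
fs = path.getFileSystem(hadoop_conf)

if fs.exists(path):
    print("File exists.")
else:
    print("File does not exist.")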

Check file in ADLS

To check if a file exists in Azure Data Lake Storage (ADLS) from a Databricks environment, you can use the dbutils.fs module. The dbutils.fs.ls() function lists the contents of a path and raises an exception when the path does not exist, so wrapping the call in try/except is the usual way to turn it into an existence check. Here is an example:

Python
from pyspark.dbutils import DBUtils

file_path = "adl://yourstorageaccount.azuredatalakestore.net/path/to/your/file"

dbutils = DBUtils(spark)

# dbutils.fs.ls raises an exception when the path does not exist
try:
    dbutils.fs.ls(file_path)
    print("File exists.")
except Exception:
    print("File does not exist.")

Alternatively, since dbutils.fs has no built-in exists() function, you can wrap the same check in a small helper that returns a boolean:

Python
from pyspark.dbutils import DBUtils

dbutils = DBUtils(spark)

def path_exists(path):
    # Return True if the path exists, False otherwise
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

file_path = "adl://yourstorageaccount.azuredatalakestore.net/path/to/your/file"

if path_exists(file_path):
    print("File exists.")
else:
    print("File does not exist.")

Please make sure that the connection to ADLS is set up properly, by providing the right credentials (for example a service principal with client ID, client secret and tenant ID, or a managed identity). The azure-storage-file-datalake package is only needed if you also access ADLS through the Azure Python SDK rather than through Spark:

Bash Session
!pip install azure-storage-file-datalake
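On a Synapse Spark pool, where dbutils is not available, the built-in mssparkutils utilities can be used with the same pattern. A minimal sketch, assuming the ABFSS path is a placeholder and that the pool identity (or linked service) already has access to the storage account:

Python
from notebookutils import mssparkutils

file_path = "abfss://container@yourstorageaccount.dfs.core.windows.net/path/to/your/file"

# mssparkutils.fs.ls raises an exception when the path does not exist,
# so wrap it in try/except to turn it into an existence check.
try:
    mssparkutils.fs.ls(file_path)
    print("File exists.")
except Exception:
    print("File does not exist.")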

Try / except in Python

In Python, try and except are used to handle exceptions, which are events that occur during the execution of a program that disrupt the normal flow of instructions.

The try block contains the code that may raise an exception. If an exception is raised within the try block, the code in the corresponding except block will be executed. If no exception is raised, the code in the except block will be skipped.

Here is an example of how to use the try and except statements in Python:

Python
try:
    # some code that may raise an exception
    result = 1/0
except ZeroDivisionError:
    # code to be executed if the exception is raised
    print("Cannot divide by zero.")

In this example, the try block contains code that tries to divide 1 by 0, which raises a ZeroDivisionError exception. The except block contains code that prints a message if this exception is raised.

You can also add a finally block, which contains code that will be executed whether or not an exception is raised.

Python
try:
    # some code that may raise an exception
    result = 1/0
except ZeroDivisionError:
    # code to be executed if the exception is raised
    print("Cannot divide by zero.")
finally:
    print("Closing any open resources")

You can also chain multiple except blocks to handle different exception types with different actions:

Python
try:
    # some code that may raise an exception
    result = 1/0
except ZeroDivisionError:
    # code to be executed if ZeroDivisionError exception is raised
    print("Cannot divide by zero.")
except Exception as e:
    # code to be executed if any other exception is raised
    print(f"An error occurred: {e}")

It's good practice to handle exceptions so that the application does not crash or behave unexpectedly.
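The same pattern is useful in Synapse notebooks for guarding Spark reads against missing paths. A minimal sketch, assuming a placeholder ABFSS path and an existing spark session; Spark raises AnalysisException when the path cannot be found:

Python
from pyspark.sql.utils import AnalysisException

try:
    # Attempt to read a path that may not exist yet
    df = spark.read.parquet("abfss://container@yourstorageaccount.dfs.core.windows.net/path/to/data")
    df.show()
except AnalysisException as e:
    print(f"Could not read the path: {e}")
finally:
    print("Read attempt finished.")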

Get latest file from data lake using PySpark in Synapse

You can list the files in a directory in Azure Data Lake Storage (ADLS) and pick the most recently modified one. In a Databricks environment this can be done with the dbutils.fs module; on a plain Synapse Spark pool, where dbutils is not available, see the Hadoop FileSystem sketch further down and the mssparkutils section below. Here's an example with dbutils:

Python
from pyspark.dbutils import DBUtils

dbutils = DBUtils(spark)

# directory path in ADLS
dir_path = "adl://yourstorageaccount.azuredatalakestore.net/path/to/directory"

# list all files in the directory
files = dbutils.fs.ls(dir_path)

# sort files by last modification time (FileInfo.modificationTime, epoch milliseconds) and take the newest
latest_file = sorted(files, key=lambda x: x.modificationTime, reverse=True)[0]

print(f'latest file is {latest_file.path}')

This code lists all the files in the specified ADLS directory using dbutils.fs.ls(), sorts them by last modification time in descending order so that the newest file comes first, then takes the first element of the list and prints its path.

Please make sure that the connection to ADLS is set up properly, by providing the right credentials (for example a service principal with client ID, client secret and tenant ID, or a managed identity). The azure-storage-file-datalake package is only needed if you also access ADLS through the Azure Python SDK rather than through Spark:

Bash Session
!pip install azure-storage-file-datalake

You can also filter by file extension (or a filename regex) before sorting:

Python
files = dbutils.fs.ls(dir_path)
files = [f for f in files if f.name.endswith('.csv')]
latest_file = sorted(files, key=lambda x: x.modificationTime, reverse=True)[0]
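If dbutils is not available (for example on a plain Synapse Spark pool), one alternative is the Hadoop FileSystem API, which exposes a modification time for every entry. A minimal sketch, assuming a placeholder directory path and an existing spark session; it relies on Spark's internal _jsc/_jvm gateway:

Python
# Resolve the Hadoop FileSystem for the directory and list its entries
dir_path = "abfss://container@yourstorageaccount.dfs.core.windows.net/path/to/directory"

hadoop_conf = spark._jsc.hadoopConfiguration()
path = spark._jvm.org.apache.hadoop.fs.Path(dir_path)
fs = path.getFileSystem(hadoop_conf)

statuses = fs.listStatus(path)

# Keep only files and pick the one with the largest modification time
files = [s for s in statuses if s.isFile()]
latest = max(files, key=lambda s: s.getModificationTime())

print(f"latest file is {latest.getPath().toString()}")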

Using mssparkutils in Synapse to get latest file from ADLS

In Synapse, the built-in mssparkutils utilities can list files in Azure Data Lake Storage (ADLS) from PySpark via mssparkutils.fs.ls(). There is no dedicated "get latest file" helper, but you can sort the directory listing yourself, for example by filename when the names embed a date or timestamp (such as data_20240101.csv).

Here's an example:

Python
from notebookutils import mssparkutils

path = "adl://yourstorageaccount.azuredatalakestore.net/path/to/your/directory"

# List the directory and keep only files (not sub-directories)
files = [f for f in mssparkutils.fs.ls(path) if f.isFile]

# When filenames embed a date/timestamp, the lexicographically largest name is the latest
latest_file = max(files, key=lambda f: f.name)
print(latest_file.path)

You can also filter on a file extension before picking the latest one:

Python
csv_files = [f for f in files if f.name.endswith(".csv")]
latest_csv = max(csv_files, key=lambda f: f.name)

mssparkutils is pre-installed on Synapse Spark pools, so no pip install is required; in a notebook it is available directly or via from notebookutils import mssparkutils.

Make sure that access to ADLS is set up properly, for example through the workspace managed identity, a linked service, or a service principal. Depending on the Synapse runtime version, the FileInfo objects returned by mssparkutils.fs.ls() may also expose a modification time; if they do, sorting on that field is more robust than relying on filenames.

How to partition and query Parquet files efficiently in a data lake or ADLS using PySpark

Parquet is a columnar storage format that is well-suited for storing large, complex datasets. Partitioning the data in Parquet format can improve the efficiency of queries by allowing Spark to only scan the necessary data, rather than the entire dataset.

Here's an example of how you can partition a DataFrame in PySpark and write it to a Parquet file in Azure Data Lake Storage (ADLS) using the partitionBy() method:

Python
from pyspark.sql.functions import col

# Create a DataFrame
df = spark.range(100).selectExpr("id", "id % 10 as mod_10")

# Partition the DataFrame by the "mod_10" column
df.write.partitionBy("mod_10").parquet("adl://yourstorageaccount.azuredatalakestore.net/path/to/your/directory")

Once the data is partitioned, you can use the filter() function to only scan the necessary partitions when querying the data:

Python
# read parquet data
df_parquet = spark.read.parquet("adl://yourstorageaccount.azuredatalakestore.net/path/to/your/directory")

# filter only data where `mod_10` is 5
df_parquet.filter(col("mod_10") == 5).show()

You can also rely on partition pruning and predicate pushdown, which push filter conditions down to the scan instead of filtering in Spark after the data has been read; this further improves query performance by reducing the amount of data that needs to be read and processed.

Python
# read parquet data and enable predicate pushdown
df_parquet = spark.read.parquet("adl://yourstorageaccount.azuredatalakestore.net/path/to/your/directory").filter("mod_10 = 5")
df_parquet.show()

The azure-storage-file-datalake package is only needed if you also access ADLS through the Azure Python SDK rather than through Spark:

Bash Session
!pip install azure-storage-file-datalake

Make sure that the connection to ADLS is set up properly, by providing the right credentials (for example a service principal with client ID, client secret and tenant ID, or a managed identity). By partitioning and filtering the data in this way, you reduce the amount of data that Spark needs to scan and process, which improves query performance.

It's also important to be selective with the partitioning key: partitioning on a high-cardinality column, such as a raw timestamp, creates a large number of small partitions, which increases the amount of metadata that must be scanned and can degrade performance rather than improve it. Prefer a column with a moderate number of distinct values, such as a date truncated to day or month, as in the sketch below.
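As a sketch of that advice, assuming a DataFrame with a timestamp column named event_time (a placeholder) and an example output path, you can derive lower-cardinality year/month columns and partition on those instead of the raw timestamp:

Python
from pyspark.sql.functions import col, year, month

# Derive coarse-grained partition columns from a timestamp
df_partitioned = (
    df.withColumn("year", year(col("event_time")))
      .withColumn("month", month(col("event_time")))
)

# Partition by year/month instead of the raw timestamp
df_partitioned.write \
    .partitionBy("year", "month") \
    .mode("overwrite") \
    .parquet("adl://yourstorageaccount.azuredatalakestore.net/path/to/your/directory")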

What should be the value for spark.sql.analyzer.maxIterations

The spark.sql.analyzer.maxIterations configuration property sets the maximum number of iterations that Spark's Catalyst analyzer will run its rule batches when resolving a logical plan. Catalyst is the engine Spark SQL uses to analyze and optimize query plans; the optimizer side has a matching spark.sql.optimizer.maxIterations property.

The default value for spark.sql.analyzer.maxIterations is 100, meaning the analyzer will run at most 100 passes over a plan. This default is sufficient for most queries.

In certain cases, however, increasing the value of spark.sql.analyzer.maxIterations may be beneficial, for example for a particularly complex query whose plan cannot be fully resolved within the default 100 iterations (the typical symptom is an analysis error complaining that the maximum number of iterations was reached). The trade-off is that more iterations increase the time required to analyze and optimize the query plan.

It's worth noting that a higher limit does not necessarily lead to a better-optimized plan; in some cases it simply makes planning slower and can degrade overall performance.

It is recommended to test different values for spark.sql.analyzer.maxIterations and keep the smallest one that lets your queries run successfully, for example by setting it at session level as shown below.
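A minimal sketch of setting the property, either when building the session or at runtime for the current session; the value 200 is only an example:

Python
from pyspark.sql import SparkSession

# Set at session build time
spark = SparkSession.builder \
    .appName("maxIterations example") \
    .config("spark.sql.analyzer.maxIterations", "200") \
    .getOrCreate()

# Or adjust it for the current session and verify the value
spark.conf.set("spark.sql.analyzer.maxIterations", "200")
print(spark.conf.get("spark.sql.analyzer.maxIterations"))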

Create SQL user-defined table type

To create a user-defined table type with the specified columns in SQL, you can use the following script:

SQL
CREATE TYPE dbo.MyTableType AS TABLE
(
    Security_ID INT,
    IsInit BIT,
    IsBuyList BIT,
    IsComposite BIT,
    IsPenalty BIT,
    PROJECT_NAME VARCHAR(255),
    CreatedDatetime DATETIME,
    CreatedBy VARCHAR(255),
    UpdatedDatetime DATETIME,
    UpdatedBy VARCHAR(255)
);

This script creates a user-defined table type called dbo.MyTableType with the specified columns: Security_ID, IsInit, IsBuyList, IsComposite, IsPenalty, PROJECT_NAME, CreatedDatetime, CreatedBy, UpdatedDatetime, UpdatedBy.

The script uses the CREATE TYPE statement to create a new user-defined table type and the AS TABLE clause to define the structure of the table type. Each column is defined with a name and data type.

You can use this table type to declare table variables and as a READONLY table-valued parameter in stored procedures and user-defined functions.

SQL
CREATE PROCEDURE sp_MyProcedure
    (@table_param dbo.MyTableType READONLY)
AS
BEGIN
    SELECT * FROM @table_param;
END
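
A short usage sketch (the values are placeholders): declare a table variable of the type, insert a row, and pass it to the procedure as the table-valued parameter.

SQL
DECLARE @rows dbo.MyTableType;

INSERT INTO @rows (Security_ID, IsInit, IsBuyList, IsComposite, IsPenalty,
                   PROJECT_NAME, CreatedDatetime, CreatedBy, UpdatedDatetime, UpdatedBy)
VALUES (1, 1, 0, 0, 0, 'SampleProject', GETDATE(), 'loader', GETDATE(), 'loader');

EXEC sp_MyProcedure @table_param = @rows;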