Azure Databricks Tutorial

In this article, I’ll guide you through a comprehensive, hands-on tutorial that will give you in-depth knowledge of Azure Databricks.

What is Azure Databricks?

Azure Databricks is Microsoft’s collaborative analytics platform built on Apache Spark, designed specifically for the Microsoft Azure cloud services platform. Think of it as a unified workspace where data engineers, data scientists, and business analysts can collaborate seamlessly on big data and machine learning projects.

Key Components

  • Workspace: Your collaborative environment for notebooks, libraries, and experiments
  • Clusters: Managed Apache Spark compute resources
  • Notebooks: Interactive documents combining code, visualizations, and narrative text
  • Jobs: Automated workflows for production data pipelines
  • MLflow: Machine learning lifecycle management

Setting Up Your Azure Databricks Environment

Step 1: Creating Your Azure Databricks Workspace

First, let me walk you through setting up your workspace.

Prerequisites:

  • An active Azure subscription
  • Permission to create resources in the target resource group (Contributor role or higher)

Creating the Workspace:

Using Azure Portal

  1. Navigate to Azure Portal (portal.azure.com)
  2. Search for “Databricks” in the top search bar
  3. Click “Create” to start the setup wizard
  4. Configure basic settings as shown in the screenshot below.
[Screenshot: creating a Databricks workspace in the Azure portal]
| Setting | Recommended Value | Purpose |
|---|---|---|
| Subscription | Your active subscription | Billing management |
| Resource Group | databricks-production-rg | Organization |
| Workspace Name | company-databricks-workspace | Identification |
| Region | East US 2 or West US 2 | Low latency |
| Pricing Tier | Premium | Advanced features |

Using Azure CLI

az databricks workspace create --resource-group myresgrp --name company-databricks-workspace --location eastus2 --sku premium

Once the command completes, the workspace is created and appears in your resource group.

Step 2: Configuring Network Security (Optional but Recommended)

For production workloads, deploy the workspace into your own virtual network (VNet injection) and disable public IPs. The corresponding deployment parameters look like this:

{
  "customParameters": {
    "enableNoPublicIp": {
      "value": true
    },
    "customVirtualNetworkId": {
      "value": "/subscriptions/{subscription-id}/resourceGroups/{rg-name}/providers/Microsoft.Network/virtualNetworks/{vnet-name}"
    }
  }
}

Method 1: Interactive Data Exploration with Notebooks

Creating Your First Notebook

Once your workspace is ready, let’s create your first notebook.

Step-by-Step Process:

  1. Launch Databricks Workspace from the Azure portal
  2. Click “New” → “Notebook”
  3. Configure notebook settings:
    • Name: “Sales_Analysis_Tutorial”
    • Language: Python
    • Cluster: (We’ll create this next)

See the screenshot below for reference.

[Screenshot: creating a notebook in Azure Databricks]

Setting Up Your Compute Cluster

# Cluster configuration I recommend for beginners
# (use either a fixed num_workers or autoscale, not both -- they are mutually exclusive)
{
  "cluster_name": "tutorial-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "driver_node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 1,
    "max_workers": 4
  },
  "auto_termination_minutes": 30
}
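Before submitting a configuration like this to the Clusters API, it's worth a quick sanity check of the autoscale bounds. Here is a small hypothetical helper (`validate_autoscale` is my own illustration, not part of any Databricks SDK) that catches the most common mistakes, including mixing a fixed `num_workers` with `autoscale`:

```python
def validate_autoscale(config: dict) -> list[str]:
    """Return a list of problems found in a cluster config dict (empty list = OK)."""
    problems = []
    autoscale = config.get("autoscale")
    if autoscale is not None:
        lo, hi = autoscale.get("min_workers"), autoscale.get("max_workers")
        if lo is None or hi is None:
            problems.append("autoscale needs both min_workers and max_workers")
        elif lo > hi:
            problems.append(f"min_workers ({lo}) exceeds max_workers ({hi})")
        if "num_workers" in config:
            problems.append("num_workers and autoscale are mutually exclusive")
    if config.get("auto_termination_minutes", 0) <= 0:
        problems.append("set auto_termination_minutes to avoid idle cluster costs")
    return problems

cluster_config = {
    "cluster_name": "tutorial-cluster",
    "autoscale": {"min_workers": 1, "max_workers": 4},
    "auto_termination_minutes": 30,
}
print(validate_autoscale(cluster_config))  # → []
```

Running the check locally before calling the API gives faster feedback than waiting for the cluster creation request to fail.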

Sample Data Analysis Workflow

Let me demonstrate with a practical example using retail sales data—similar to projects I’ve implemented for a client:

# Cell 1: Import libraries and create sample data
# (in a Databricks notebook, `spark` is already defined as a SparkSession)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, when

# Create sample sales data
data = [
    ("2024-01-15", "Electronics", "Laptop", 1200.00, "California"),
    ("2024-01-16", "Electronics", "Phone", 800.00, "Texas"),
    ("2024-01-17", "Clothing", "Jacket", 150.00, "New York"),
    ("2024-01-18", "Electronics", "Tablet", 500.00, "Florida"),
    ("2024-01-19", "Clothing", "Shoes", 120.00, "California"),
    ("2024-01-20", "Home", "Furniture", 2000.00, "Illinois")
]

columns = ["date", "category", "product", "amount", "state"]
df = spark.createDataFrame(data, columns)
df.show()

# Cell 2: Data aggregation and analysis
# Sales by category
category_sales = df.groupBy("category").agg(
    sum("amount").alias("total_sales"),
    avg("amount").alias("avg_sale"),
    count("*").alias("transaction_count")
).orderBy(col("total_sales").desc())

category_sales.show()

# Sales by state
state_sales = df.groupBy("state").agg(
    sum("amount").alias("total_sales")
).orderBy(col("total_sales").desc())

state_sales.show()
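If you don't have a cluster handy, you can sanity-check what the Spark aggregation should return by reproducing it over the six sample rows in plain Python:

```python
from collections import defaultdict

# Same six sample rows as the notebook (date, category, product, amount, state)
data = [
    ("2024-01-15", "Electronics", "Laptop", 1200.00, "California"),
    ("2024-01-16", "Electronics", "Phone", 800.00, "Texas"),
    ("2024-01-17", "Clothing", "Jacket", 150.00, "New York"),
    ("2024-01-18", "Electronics", "Tablet", 500.00, "Florida"),
    ("2024-01-19", "Clothing", "Shoes", 120.00, "California"),
    ("2024-01-20", "Home", "Furniture", 2000.00, "Illinois"),
]

totals = defaultdict(float)
counts = defaultdict(int)
for _, category, _, amount, _ in data:
    totals[category] += amount
    counts[category] += 1

# Same ordering as orderBy(total_sales desc) in the notebook
for category in sorted(totals, key=totals.get, reverse=True):
    avg_sale = totals[category] / counts[category]
    print(f"{category}: total={totals[category]:.2f}, avg={avg_sale:.2f}, n={counts[category]}")
# Electronics: total=2500.00, avg=833.33, n=3
# Home: total=2000.00, avg=2000.00, n=1
# Clothing: total=270.00, avg=135.00, n=2
```

The totals should match `category_sales.show()` line for line; if they don't, the discrepancy is in the data, not the Spark code.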

Method 2: Automated Data Pipeline Development

Building Production-Ready ETL Pipelines

Automated pipelines are essential for scalable data processing. Here’s how I approach pipeline development:

# ETL Pipeline Structure
class DataPipeline:
    def __init__(self, source_path, target_path):
        self.source_path = source_path
        self.target_path = target_path
        self.spark = SparkSession.builder.appName("DataPipeline").getOrCreate()
    
    def extract(self):
        """Extract data from source"""
        print(f"Extracting data from {self.source_path}")
        return self.spark.read.option("header", "true").csv(self.source_path)
    
    def transform(self, df):
        """Apply transformations"""
        print("Applying transformations...")
        # Clean data
        cleaned_df = df.filter(col("amount") > 0)
        
        # Add calculated columns
        enriched_df = cleaned_df.withColumn(
            "amount_category",
            when(col("amount") > 1000, "High")
            .when(col("amount") > 500, "Medium")
            .otherwise("Low")
        )
        
        return enriched_df
    
    def load(self, df):
        """Load data to target"""
        print(f"Loading data to {self.target_path}")
        df.write.mode("overwrite").parquet(self.target_path)
        print("Pipeline completed successfully!")

# Execute pipeline
pipeline = DataPipeline("/mnt/source/sales_data.csv", "/mnt/processed/sales_data")
raw_data = pipeline.extract()
transformed_data = pipeline.transform(raw_data)
pipeline.load(transformed_data)
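The tiering rule inside `transform` is ordinary conditional logic; a plain-Python mirror makes the boundaries explicit (exactly 1000 lands in "Medium" and exactly 500 in "Low", matching the `when`/`otherwise` chain):

```python
def amount_category(amount: float) -> str:
    """Mirror of the Spark when/otherwise chain in DataPipeline.transform."""
    if amount > 1000:
        return "High"
    if amount > 500:
        return "Medium"
    return "Low"

print([amount_category(a) for a in (1200.00, 800.00, 500.00, 120.00)])
# → ['High', 'Medium', 'Low', 'Low']
```

Writing the rule out like this is a cheap way to agree on boundary behavior with stakeholders before it's baked into the pipeline.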

Scheduling Jobs for Automation

Creating a Production Job:

# Job configuration
job_config = {
    "name": "daily-sales-processing",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
    },
    "notebook_task": {
        "notebook_path": "/Users/john.smith@company.com/sales_pipeline",
        "base_parameters": {
            "input_date": "{{ ds }}"
        }
    },
    "timeout_seconds": 3600,
    "max_retries": 3
}
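To run this job daily instead of triggering it by hand, the Jobs API also accepts a Quartz cron schedule alongside the fields above; the expression and timezone below are illustrative:

```json
{
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  }
}
```

This example fires once a day at 06:00 UTC; adjust the cron expression and `timezone_id` to your own processing window.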

Method 3: Machine Learning with MLflow Integration

Setting Up Your ML Experiment

Here’s a proven approach:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Start MLflow experiment
mlflow.set_experiment("/Users/sarah.johnson@company.com/sales_prediction")

with mlflow.start_run():
    # Feature engineering
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml import Pipeline
    
    indexers = [StringIndexer(inputCol=column, outputCol=column+"_encoded") 
                for column in ["category", "state"]]
    
    assembler = VectorAssembler(
        inputCols=[indexer.getOutputCol() for indexer in indexers],
        outputCol="features"
    )
    
    pipeline = Pipeline(stages=indexers + [assembler])
    model = pipeline.fit(df)
    transformed_df = model.transform(df)
    
    # Convert to Pandas for sklearn
    pandas_df = transformed_df.select("features", "amount").toPandas()
    
    # Prepare features and target
    X = np.array([x.toArray() for x in pandas_df["features"]])
    y = pandas_df["amount"].values
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train model
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    # Make predictions
    predictions = rf.predict(X_test)
    
    # Log metrics
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(rf, "random_forest_model")
    
    print(f"MSE: {mse:.2f}")
    print(f"R2 Score: {r2:.2f}")
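The two metrics logged to MLflow are easy to verify by hand. For a toy set of actuals and predictions, MSE and R² reduce to a few lines of arithmetic:

```python
# Toy example: compute MSE and R² from their definitions, without sklearn
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

mean_y = sum(y_true) / n
ss_tot = sum((t - mean_y) ** 2 for t in y_true)            # total variance
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual variance
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.4f}")  # MSE: 0.6667
print(f"R2:  {r2:.2f}")   # R2:  0.75
```

An R² of 0.75 means the model explains three quarters of the variance in the target, which is the same interpretation to apply to the value logged above.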

Model Deployment and Serving

# Register model for production use
# (the training run has already ended, so look it up via last_active_run)
model_name = "sales_prediction_model"
run_id = mlflow.last_active_run().info.run_id
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/random_forest_model",
    name=model_name
)

# Transition to production
from mlflow.tracking import MlflowClient
client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Production"
)

Method 4: Real-Time Streaming Analytics

Setting Up Structured Streaming

For real-time analytics projects, streaming is essential:

# Read streaming data from Event Hub or Kafka
# (connection_string is a placeholder dict of your Event Hubs connection options)
from pyspark.sql.functions import col, current_timestamp, get_json_object, window

streaming_df = spark.readStream \
    .format("eventhubs") \
    .options(**connection_string) \
    .load()

# Process streaming data; assumes the event body is JSON with a "category" field
processed_stream = streaming_df \
    .select("body") \
    .withColumn("data", col("body").cast("string")) \
    .withColumn("timestamp", current_timestamp()) \
    .withColumn("category", get_json_object(col("data"), "$.category"))

# Apply transformations: 5-minute tumbling windows with a 10-minute watermark
windowed_counts = processed_stream \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(col("timestamp"), "5 minutes"),
        col("category")
    ) \
    .count()

# Write streaming output (toTable starts the query and returns a StreamingQuery)
query = windowed_counts.writeStream \
    .outputMode("update") \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/streaming_job") \
    .toTable("streaming_analytics")

query.awaitTermination()
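Conceptually, the windowed count assigns each event to the 5-minute interval containing its timestamp and keeps a count per (window, category) pair. A plain-Python sketch of that bucketing (the event data is made up for illustration):

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows, as in the Spark job

def window_start(epoch_seconds: int) -> int:
    """Align a timestamp down to the start of its tumbling window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

# (epoch_seconds, category) events
events = [(0, "Electronics"), (120, "Clothing"), (290, "Electronics"),
          (300, "Electronics"), (601, "Clothing")]

counts = Counter((window_start(ts), cat) for ts, cat in events)
print(counts[(0, "Electronics")])    # → 2  (events at t=0 and t=290)
print(counts[(300, "Electronics")])  # → 1  (t=300 starts a new window)
```

What Spark adds on top of this bucketing is incremental state management and the watermark, which lets it discard windows once no event older than 10 minutes can still arrive.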

Cost Optimization Strategies

Auto-termination Configuration:

# Automatic cluster termination
cluster_config = {
    "auto_termination_minutes": 15,  # Terminate after 15 minutes of inactivity
    "autoscale": {
        "min_workers": 1,
        "max_workers": 10
    }
}
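To see why auto-termination matters, compare billed hours with and without it. The rate below is purely illustrative, not an Azure list price:

```python
# Purely illustrative cost sketch -- substitute your region's real VM + DBU rates
HOURLY_COST = 2.00          # assumed all-in cost per cluster-hour (VM + DBU), in USD
WORK_HOURS = 2.0            # actual interactive usage per day
IDLE_HOURS_NO_TERM = 6.0    # idle time if the cluster is left running all day
AUTO_TERM_MINUTES = 15      # cluster stops 15 minutes after the last activity

without_auto_term = (WORK_HOURS + IDLE_HOURS_NO_TERM) * HOURLY_COST
with_auto_term = (WORK_HOURS + AUTO_TERM_MINUTES / 60) * HOURLY_COST

print(f"Without auto-termination: ${without_auto_term:.2f}/day")  # $16.00/day
print(f"With auto-termination:    ${with_auto_term:.2f}/day")     # $4.50/day
```

Even with made-up numbers, the shape of the result holds: on interactive clusters, idle time usually dominates the bill, and auto-termination removes most of it.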

Spot Instance Usage:

# Use Azure spot VMs for non-critical workloads
spot_config = {
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",  # fall back to on-demand if spot capacity is evicted
        "first_on_demand": 1,                        # keep the driver on a regular on-demand VM
        "spot_bid_max_price": -1                     # -1 = pay up to the current on-demand price
    }
}

Conclusion

In this comprehensive Azure Databricks tutorial, we went from basic workspace setup through automated pipelines, machine learning with MLflow, and streaming analytics. With these four methods in hand, you should be ready to build and deploy your own solutions with confidence.

Key Takeaways

Remember these critical points:

  • Start with proper workspace configuration and security settings
  • Use the four core methods: interactive notebooks, automated pipelines, ML workflows, and streaming analytics
  • Always prioritize performance optimization and cost management
  • Security and access control are non-negotiable in enterprise environments

You may also like the following articles

Azure Virtual Machine
