Azure Databricks Tutorial

In this article, I’ll guide you through a comprehensive, hands-on tutorial that will give you in-depth knowledge of Azure Databricks.

What is Azure Databricks?

Azure Databricks is Microsoft’s collaborative analytics platform built on Apache Spark, designed specifically for the Microsoft Azure cloud services platform. Think of it as a unified workspace where data engineers, data scientists, and business analysts can collaborate seamlessly on big data and machine learning projects.

Key Components

  • Workspace: Your collaborative environment for notebooks, libraries, and experiments
  • Clusters: Managed Apache Spark compute resources
  • Notebooks: Interactive documents combining code, visualizations, and narrative text
  • Jobs: Automated workflows for production data pipelines
  • MLflow: Machine learning lifecycle management

Setting Up Your Azure Databricks Environment

Step 1: Creating Your Azure Databricks Workspace

First, let me walk you through setting up your workspace.

Prerequisites:

  • An active Azure subscription
  • Permission to create resources in the target resource group (Contributor role or higher)

Creating the Workspace:

Using Azure Portal

  1. Navigate to Azure Portal (portal.azure.com)
  2. Search for “Databricks” in the top search bar
  3. Click “Create” to start the setup wizard
  4. Configure basic settings as shown in the screenshot below.
[Screenshot: creating a Databricks workspace in the Azure portal]
| Setting | Recommended Value | Purpose |
|---|---|---|
| Subscription | Your active subscription | Billing management |
| Resource Group | databricks-production-rg | Organization |
| Workspace Name | company-databricks-workspace | Identification |
| Region | East US 2 or West US 2 | Low latency |
| Pricing Tier | Premium | Advanced features |

Using Azure CLI

az databricks workspace create --resource-group myresgrp --name company-databricks-workspace --location eastus2 --sku premium

Once the command completes, the workspace is created and appears in your resource group.

Step 2: Configuring Network Security (Optional but Recommended)

For production workloads, deploy the workspace into your own virtual network (VNet injection) and disable public IPs. The corresponding deployment parameters look like this:

{
  "customParameters": {
    "enableNoPublicIp": {
      "value": true
    },
    "customVirtualNetworkId": {
      "value": "/subscriptions/{subscription-id}/resourceGroups/{rg-name}/providers/Microsoft.Network/virtualNetworks/{vnet-name}"
    }
  }
}

Method 1: Interactive Data Exploration with Notebooks

Creating Your First Notebook

Once your workspace is ready, let’s create your first notebook.

Step-by-Step Process:

  1. Launch Databricks Workspace from the Azure portal
  2. Click “New” → “Notebook”
  3. Configure notebook settings:
    • Name: “Sales_Analysis_Tutorial”
    • Language: Python
    • Cluster: (We’ll create this next)

See the screenshot below for reference.

[Screenshot: creating a notebook in Azure Databricks]

Setting Up Your Compute Cluster

# Cluster configuration I recommend for beginners
# (use either a fixed num_workers or autoscale, not both -- they are mutually exclusive)
{
  "cluster_name": "tutorial-cluster",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "driver_node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 1,
    "max_workers": 4
  },
  "auto_termination_minutes": 30
}
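Before submitting a configuration like this to the Clusters API, it's worth a quick sanity check of the autoscale bounds. Here is a small hypothetical helper (`validate_autoscale` is my own illustration, not part of any Databricks SDK) that catches the most common mistakes, including mixing a fixed `num_workers` with `autoscale`:

```python
def validate_autoscale(config: dict) -> list[str]:
    """Return a list of problems found in a cluster config dict (empty list = OK)."""
    problems = []
    autoscale = config.get("autoscale")
    if autoscale is not None:
        lo, hi = autoscale.get("min_workers"), autoscale.get("max_workers")
        if lo is None or hi is None:
            problems.append("autoscale needs both min_workers and max_workers")
        elif lo > hi:
            problems.append(f"min_workers ({lo}) exceeds max_workers ({hi})")
        if "num_workers" in config:
            problems.append("num_workers and autoscale are mutually exclusive")
    if config.get("auto_termination_minutes", 0) <= 0:
        problems.append("set auto_termination_minutes to avoid idle cluster costs")
    return problems

cluster_config = {
    "cluster_name": "tutorial-cluster",
    "autoscale": {"min_workers": 1, "max_workers": 4},
    "auto_termination_minutes": 30,
}
print(validate_autoscale(cluster_config))  # → []
```

Running the check locally before calling the API gives faster feedback than waiting for the cluster creation request to fail.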

Sample Data Analysis Workflow

Let me demonstrate with a practical example using retail sales data—similar to projects I’ve implemented for a client:

# Cell 1: Import libraries and create sample data
# (in a Databricks notebook, `spark` is already defined as a SparkSession)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, when

# Create sample sales data
data = [
    ("2024-01-15", "Electronics", "Laptop", 1200.00, "California"),
    ("2024-01-16", "Electronics", "Phone", 800.00, "Texas"),
    ("2024-01-17", "Clothing", "Jacket", 150.00, "New York"),
    ("2024-01-18", "Electronics", "Tablet", 500.00, "Florida"),
    ("2024-01-19", "Clothing", "Shoes", 120.00, "California"),
    ("2024-01-20", "Home", "Furniture", 2000.00, "Illinois")
]

columns = ["date", "category", "product", "amount", "state"]
df = spark.createDataFrame(data, columns)
df.show()

# Cell 2: Data aggregation and analysis
# Sales by category
category_sales = df.groupBy("category").agg(
    sum("amount").alias("total_sales"),
    avg("amount").alias("avg_sale"),
    count("*").alias("transaction_count")
).orderBy(col("total_sales").desc())

category_sales.show()

# Sales by state
state_sales = df.groupBy("state").agg(
    sum("amount").alias("total_sales")
).orderBy(col("total_sales").desc())

state_sales.show()
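If you don't have a cluster handy, you can sanity-check what the Spark aggregation should return by reproducing it over the six sample rows in plain Python:

```python
from collections import defaultdict

# Same six sample rows as the notebook (date, category, product, amount, state)
data = [
    ("2024-01-15", "Electronics", "Laptop", 1200.00, "California"),
    ("2024-01-16", "Electronics", "Phone", 800.00, "Texas"),
    ("2024-01-17", "Clothing", "Jacket", 150.00, "New York"),
    ("2024-01-18", "Electronics", "Tablet", 500.00, "Florida"),
    ("2024-01-19", "Clothing", "Shoes", 120.00, "California"),
    ("2024-01-20", "Home", "Furniture", 2000.00, "Illinois"),
]

totals = defaultdict(float)
counts = defaultdict(int)
for _, category, _, amount, _ in data:
    totals[category] += amount
    counts[category] += 1

# Same ordering as orderBy(total_sales desc) in the notebook
for category in sorted(totals, key=totals.get, reverse=True):
    avg_sale = totals[category] / counts[category]
    print(f"{category}: total={totals[category]:.2f}, avg={avg_sale:.2f}, n={counts[category]}")
# Electronics: total=2500.00, avg=833.33, n=3
# Home: total=2000.00, avg=2000.00, n=1
# Clothing: total=270.00, avg=135.00, n=2
```

The totals should match `category_sales.show()` line for line; if they don't, the discrepancy is in the data, not the Spark code.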

Method 2: Automated Data Pipeline Development

Building Production-Ready ETL Pipelines

Automated pipelines are essential for scalable data processing. Here’s how I approach pipeline development:

# ETL Pipeline Structure
class DataPipeline:
    def __init__(self, source_path, target_path):
        self.source_path = source_path
        self.target_path = target_path
        self.spark = SparkSession.builder.appName("DataPipeline").getOrCreate()
    
    def extract(self):
        """Extract data from source"""
        print(f"Extracting data from {self.source_path}")
        return self.spark.read.option("header", "true").csv(self.source_path)
    
    def transform(self, df):
        """Apply transformations"""
        print("Applying transformations...")
        # Clean data
        cleaned_df = df.filter(col("amount") > 0)
        
        # Add calculated columns
        enriched_df = cleaned_df.withColumn(
            "amount_category",
            when(col("amount") > 1000, "High")
            .when(col("amount") > 500, "Medium")
            .otherwise("Low")
        )
        
        return enriched_df
    
    def load(self, df):
        """Load data to target"""
        print(f"Loading data to {self.target_path}")
        df.write.mode("overwrite").parquet(self.target_path)
        print("Pipeline completed successfully!")

# Execute pipeline
pipeline = DataPipeline("/mnt/source/sales_data.csv", "/mnt/processed/sales_data")
raw_data = pipeline.extract()
transformed_data = pipeline.transform(raw_data)
pipeline.load(transformed_data)
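The tiering rule inside `transform` is ordinary conditional logic; a plain-Python mirror makes the boundaries explicit (exactly 1000 lands in "Medium" and exactly 500 in "Low", matching the `when`/`otherwise` chain):

```python
def amount_category(amount: float) -> str:
    """Mirror of the Spark when/otherwise chain in DataPipeline.transform."""
    if amount > 1000:
        return "High"
    if amount > 500:
        return "Medium"
    return "Low"

print([amount_category(a) for a in (1200.00, 800.00, 500.00, 120.00)])
# → ['High', 'Medium', 'Low', 'Low']
```

Writing the rule out like this is a cheap way to agree on boundary behavior with stakeholders before it's baked into the pipeline.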

Scheduling Jobs for Automation

Creating a Production Job:

# Job configuration
job_config = {
    "name": "daily-sales-processing",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
    },
    "notebook_task": {
        "notebook_path": "/Users/john.smith@company.com/sales_pipeline",
        "base_parameters": {
            "input_date": "{{ ds }}"
        }
    },
    "timeout_seconds": 3600,
    "max_retries": 3
}
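To run this job daily instead of triggering it by hand, the Jobs API also accepts a Quartz cron schedule alongside the fields above; the expression and timezone below are illustrative:

```json
{
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  }
}
```

This example fires once a day at 06:00 UTC; adjust the cron expression and `timezone_id` to your own processing window.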

Method 3: Machine Learning with MLflow Integration

Setting Up Your ML Experiment

Here’s a proven approach:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Start MLflow experiment
mlflow.set_experiment("/Users/sarah.johnson@company.com/sales_prediction")

with mlflow.start_run():
    # Feature engineering
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml import Pipeline
    
    indexers = [StringIndexer(inputCol=column, outputCol=column+"_encoded") 
                for column in ["category", "state"]]
    
    assembler = VectorAssembler(
        inputCols=[indexer.getOutputCol() for indexer in indexers],
        outputCol="features"
    )
    
    pipeline = Pipeline(stages=indexers + [assembler])
    model = pipeline.fit(df)
    transformed_df = model.transform(df)
    
    # Convert to Pandas for sklearn
    pandas_df = transformed_df.select("features", "amount").toPandas()
    
    # Prepare features and target
    X = np.array([x.toArray() for x in pandas_df["features"]])
    y = pandas_df["amount"].values
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train model
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    # Make predictions
    predictions = rf.predict(X_test)
    
    # Log metrics
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(rf, "random_forest_model")
    
    print(f"MSE: {mse:.2f}")
    print(f"R2 Score: {r2:.2f}")
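The two metrics logged to MLflow are easy to verify by hand. For a toy set of actuals and predictions, MSE and R² reduce to a few lines of arithmetic:

```python
# Toy example: compute MSE and R² from their definitions, without sklearn
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

mean_y = sum(y_true) / n
ss_tot = sum((t - mean_y) ** 2 for t in y_true)            # total variance
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual variance
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.4f}")  # MSE: 0.6667
print(f"R2:  {r2:.2f}")   # R2:  0.75
```

An R² of 0.75 means the model explains three quarters of the variance in the target, which is the same interpretation to apply to the value logged above.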

Model Deployment and Serving

# Register model for production use
# (the training run has already ended, so look it up via last_active_run)
model_name = "sales_prediction_model"
run_id = mlflow.last_active_run().info.run_id
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/random_forest_model",
    name=model_name
)

# Transition to production
from mlflow.tracking import MlflowClient
client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Production"
)

Method 4: Real-Time Streaming Analytics

Setting Up Structured Streaming

For real-time analytics projects, streaming is essential:

# Read streaming data from Event Hub or Kafka
# (connection_string is a placeholder dict of your Event Hubs connection options)
from pyspark.sql.functions import col, current_timestamp, get_json_object, window

streaming_df = spark.readStream \
    .format("eventhubs") \
    .options(**connection_string) \
    .load()

# Process streaming data; assumes the event body is JSON with a "category" field
processed_stream = streaming_df \
    .select("body") \
    .withColumn("data", col("body").cast("string")) \
    .withColumn("timestamp", current_timestamp()) \
    .withColumn("category", get_json_object(col("data"), "$.category"))

# Apply transformations: 5-minute tumbling windows with a 10-minute watermark
windowed_counts = processed_stream \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(col("timestamp"), "5 minutes"),
        col("category")
    ) \
    .count()

# Write streaming output (toTable starts the query and returns a StreamingQuery)
query = windowed_counts.writeStream \
    .outputMode("update") \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/streaming_job") \
    .toTable("streaming_analytics")

query.awaitTermination()
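Conceptually, the windowed count assigns each event to the 5-minute interval containing its timestamp and keeps a count per (window, category) pair. A plain-Python sketch of that bucketing (the event data is made up for illustration):

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows, as in the Spark job

def window_start(epoch_seconds: int) -> int:
    """Align a timestamp down to the start of its tumbling window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

# (epoch_seconds, category) events
events = [(0, "Electronics"), (120, "Clothing"), (290, "Electronics"),
          (300, "Electronics"), (601, "Clothing")]

counts = Counter((window_start(ts), cat) for ts, cat in events)
print(counts[(0, "Electronics")])    # → 2  (events at t=0 and t=290)
print(counts[(300, "Electronics")])  # → 1  (t=300 starts a new window)
```

What Spark adds on top of this bucketing is incremental state management and the watermark, which lets it discard windows once no event older than 10 minutes can still arrive.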

Cost Optimization Strategies

Auto-termination Configuration:

# Automatic cluster termination
cluster_config = {
    "auto_termination_minutes": 15,  # Terminate after 15 minutes of inactivity
    "autoscale": {
        "min_workers": 1,
        "max_workers": 10
    }
}
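To see why auto-termination matters, compare billed hours with and without it. The rate below is purely illustrative, not an Azure list price:

```python
# Purely illustrative cost sketch -- substitute your region's real VM + DBU rates
HOURLY_COST = 2.00          # assumed all-in cost per cluster-hour (VM + DBU), in USD
WORK_HOURS = 2.0            # actual interactive usage per day
IDLE_HOURS_NO_TERM = 6.0    # idle time if the cluster is left running all day
AUTO_TERM_MINUTES = 15      # cluster stops 15 minutes after the last activity

without_auto_term = (WORK_HOURS + IDLE_HOURS_NO_TERM) * HOURLY_COST
with_auto_term = (WORK_HOURS + AUTO_TERM_MINUTES / 60) * HOURLY_COST

print(f"Without auto-termination: ${without_auto_term:.2f}/day")  # $16.00/day
print(f"With auto-termination:    ${with_auto_term:.2f}/day")     # $4.50/day
```

Even with made-up numbers, the shape of the result holds: on interactive clusters, idle time usually dominates the bill, and auto-termination removes most of it.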

Spot Instance Usage:

# Use Azure spot VMs for non-critical workloads
spot_config = {
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",  # fall back to on-demand if spot capacity is evicted
        "first_on_demand": 1,                        # keep the driver on a regular on-demand VM
        "spot_bid_max_price": -1                     # -1 = pay up to the current on-demand price
    }
}

Conclusion

In this comprehensive Azure Databricks tutorial, we went from basic workspace setup through automated pipelines, machine learning with MLflow, and streaming analytics. With these four methods in hand, you should be ready to build and deploy your own solutions with confidence.

Key Takeaways

Remember these critical points:

  • Start with proper workspace configuration and security settings
  • Use the four core methods: interactive notebooks, automated pipelines, ML workflows, and streaming analytics
  • Always prioritize performance optimization and cost management
  • Security and access control are non-negotiable in enterprise environments

You may also like the following articles

Azure Virtual Machine
