Build Your Own LinkedIn Analytics Part 9: Making a Maintainable Pipeline
Originally published on Medium on 22 December 2025
By the end of the previous post, we had built our LinkedIn analytics data pipeline from ingestion all the way to the final dashboard, and had put proper orchestration and automation in place.
The next step then is making sure that all of what we’ve just built is easy to maintain.
TL;DR
- Use Git and Databricks Asset Bundles to turn your LinkedIn analytics pipeline from a Databricks POC into a versioned, reproducible project with dev/prod separation, safe rollbacks, and one‑click deployments.
I. Options for Maintainability
A maintainable pipeline is one that has the following qualities:
- Easy to change;
- Easy to roll back; and
- Safe to run autonomously without manual babysitting.
You can thus safely introduce changes such as new engagement metrics, knowing you can roll back if something breaks. For a data‑driven LinkedIn strategy, that reliability is what lets you experiment aggressively without losing trust in your numbers.
A lot of what makes a pipeline maintainable is already part of the design of our LinkedIn analytics data product.
- Well-documented assets: This includes readable and well-structured code, clear comments, and relevant descriptions of data assets.
- Modular notebooks as tasks: The notebooks that make up our task nodes can be swapped out for code that does the equivalent; there is no dependency on any hidden state or special flag.
- Idempotent tasks and runs: We can re-run a failed task or pipeline without worrying about data consistency.
- Data versioning via Delta Lake tables: Databricks managed tables have versioning enabled by default, so any errors committed during an update can be rolled back with little issue.
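For instance, if a bad update lands in one of our Delta tables, we can inspect the table's history and restore an earlier version. A sketch in SQL (the table name `bronze.linkedin.totals` matches the defaults we define later in this post, and the version number is illustrative):

```sql
-- List the table's recent versions and the operations that produced them
DESCRIBE HISTORY bronze.linkedin.totals LIMIT 5;

-- Roll the table back to the version before the bad update
RESTORE TABLE bronze.linkedin.totals TO VERSION AS OF 3;
```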
But we need some additional qualities:
- Version control for notebooks, jobs, dashboards and schema definitions: Without it, if a pipeline were tampered with or erroneous code introduced, we would have no easy way of restoring the previous state.
- Separation of development from production: We want to have a version of the pipeline and assets that can be used for testing new features or fixes (development), before the changes are pushed to the main pipeline (production).
- Continuous integration and deployment: With this distinction between development and production, we also want to have a way to manage all these differences in a controlled and automated fashion.
We have two main ways of achieving these qualities, which we will be using in combination.
a. Git repositories
Git repositories are the modern way of managing code versions, and Databricks does support them out of the box. The problem is that we don’t yet have a code version of our Databricks infrastructure such as our Databricks Jobs or our dashboard. How do we version those?
b. Databricks Asset Bundles
Databricks introduced this feature in 2024 and has been iterating on it ever since. As the name suggests, this bundles Databricks assets together in code form, including Jobs, Dashboards and even schemas. In other words, this is Databricks’ version of Infrastructure as Code (IaC). This concept was popularized by HashiCorp Terraform, and indeed Databricks Asset Bundles (or DABs) were initially built on top of Terraform, though there is now a push to move away from that dependency.
Note that DABs are meant to be used together with Git repositories; in fact, in the Databricks UI, you can only create a DAB in a Git repository folder.
II. Setting up Git in Databricks
This step assumes that you already have a Git repository at hand. Databricks supports a comprehensive array of Git providers, including the various flavours of GitHub, GitLab, BitBucket, Azure DevOps and AWS CodeCommit. For our purposes, GitHub works just fine.
In the Databricks UI, click on your profile icon and choose the ‘Settings’ option. You should see the following menu on the left:

Click on ‘Linked accounts’ under the ‘User’ section. As of this writing, the only service available for connection is Git integration, which is exactly what we want. Click on ‘Add Git credentials’.

Here you choose your Git provider. If you choose GitHub, you can then link your Git account in a few clicks. You can also opt to use a personal access token if you don’t wish to publish your associated email to the repository (which may be a concern if you intend to connect to or create a public repository). Other Git providers require you to provide similar access tokens.

The next step is to create a Git folder in our workspace. Click ‘+ New’ at the top left corner, navigate to ‘More’, and then choose ‘Git folder’.

You can now create your Git folder. Obtain your Git repository URL (the steps to obtain it for GitHub are here), name the folder, then proceed to create it.

Below is the end result of this process: a Git folder on Databricks.

Git is a fundamental tool for code versioning, and is a required competency for any aspiring data engineer (and developer in general). If you want to dive deeper, there is a comprehensive guide here.
III. Setting up a service principal
If you look at the ‘Run as’ column in Jobs and Pipelines, you can see that all our runs so far have been using our personal credentials. This is fine for development work, but for a production run, that is misleading at best (we aren’t doing a manual run) and potentially catastrophic at worst (what happens when the pipeline developer leaves the organization?).
In general, we should always use synthetic credentials for production runs and deployments, whether they come in the form of service principals or service accounts.
With that in mind, let’s create a Databricks service principal. Back in the ‘Settings’ menu, click on the ‘Identity and access’ option under ‘Workspace admin’. You will see the following section:

Administering users and groups is beyond the scope of our exploration, especially in our Databricks Free instance, but it is important to set these up for any enterprise deployment. Databricks has some recommended best practices in their documentation.
Click on ‘Manage’ for Service principals, then ‘Add service principal’ on the next page, and finally ‘Add new’ in the subsequent page.


Type a descriptive name for the service principal, then click ‘Add’.

Note the Application ID of our newly created service principal; we will be using it later in our DAB deployment.
IV. Creating our Databricks Asset Bundle
Navigate to our Git folder, and click ‘Create’ at the top right corner. You will see ‘Asset bundle’ as one of the options; click on it.

This brings you to the list of options for creating a Databricks Asset Bundle as below:

For the purposes of this project, choose ‘Empty project’; you can explore the more guided options separately.
This is a good time to outline what our structure for our DAB is going to look like.

- There are a bunch of additional YAML files in the ‘resources’ folder. This is best practice for separation of different types of resources; we don’t want to dump everything into a single databricks.yml file, for the same reason we don’t want to have a monolithic notebook as our pipeline.
- Note the ‘variable-override.json’ file in the ‘.databricks’ folder. Because this folder is listed in the ‘.gitignore’ file, it and its contents will not be written to the Git repository. This is a good way to store any local configurations that you don’t want to expose in the repository, such as the Databricks warehouse ID and the service principal ID.
- The notebooks from our POC pipeline have been copied into this DAB. I have arranged them in numbered folders under ‘src’ (aka the source folder). I have also included the POC notebook that was the source for the individual notebook tasks; this can prove useful in trying out end-to-end ideas in the development stage.
Let’s go through the different YAML configuration files in turn.
a. variables.yml
This file serves as a single place for defining our pipeline-wide constants. With this, we no longer have to repeat them through multiple notebooks, or manually keep the naming of data assets such as schemas and tables consistent.
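Conceptually, each `${var.*}` reference elsewhere in the bundle is substituted with the variable's default (or an override) at deployment time. A toy Python sketch of that substitution, just to illustrate the mechanism (this is not the actual DAB engine):

```python
import re

# Illustrative subset of the variables defined in variables.yml
variables = {
    "bronze_catalog": "bronze",
    "common_schema": "linkedin",
}

def interpolate(template: str, variables: dict) -> str:
    """Replace each ${var.name} reference with its configured value."""
    return re.sub(
        r"\$\{var\.([a-zA-Z0-9_]+)\}",
        lambda m: variables[m.group(1)],
        template,
    )

path = interpolate("${var.bronze_catalog}.${var.common_schema}", variables)
print(path)  # bronze.linkedin
```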
The contents of the file are as below:
variables:
  # 0. Sensitive variables
  linkedin_profile_name:
    description: >
      The name of the LinkedIn profile
  warehouse_id:
    description: >
      The ID of the Databricks SQL warehouse cluster
  pipeline_runner:
    description: >
      The name of the service principal used to deploy the DAB and run the pipelines

  # 1. Settings for the medallion layer catalogs and schemas
  landing_catalog:
    description: >
      The ID of the landing catalog
    default: landing
  bronze_catalog:
    description: >
      The ID of the bronze catalog
    default: bronze
  silver_catalog:
    description: >
      The ID of the silver catalog
    default: silver
  gold_catalog:
    description: >
      The ID of the gold catalog
    default: gold
  common_schema:
    description: >
      The ID of the schema used in all medallion catalogs
    default: linkedin

  # 2. Settings for the landing volumes
  landing_content_daily_volume:
    description: >
      The ID of the landing volume for daily content metrics Excel files
    default: content_daily
  landing_content_historical_volume:
    description: >
      The ID of the landing volume for historical content metrics Excel files
    default: historical_daily
  landing_posts_volume:
    description: >
      The ID of the landing volume for post metrics Excel files
    default: posts
  landing_patch_volume:
    description: >
      The ID of the landing volume for patch CSV files
    default: patch
  ingestion_folder:
    description: >
      The folder in each volume for uploading files to be ingested
    default: pending
  processed_folder:
    description: >
      The folder in each volume for storing successfully ingested files
    default: processed
  errors_folder:
    description: >
      The folder in each volume for storing files with errors during ingestion
    default: errors
  patch_post_subfolder:
    description: >
      The subfolder in the landing patch volume for patching posts data
    default: posts

  # 3. Settings for the bronze tables
  bronze_profile_discovery_metrics_table:
    description: >
      The ID of the bronze table for profile daily discovery metrics
    default: discovery
  bronze_profile_followers_table:
    description: >
      The ID of the bronze table for profile daily follower metrics
    default: followers
  bronze_post_patch_table:
    description: >
      The ID of the bronze table for post patching
    default: linkedin_patch
  bronze_post_metrics_table:
    description: >
      The ID of the bronze table for post metrics (taken from post metrics Excel files)
    default: post_details
  bronze_post_metadata_table:
    description: >
      The ID of the bronze table for post metadata
    default: posts
  bronze_profile_metrics_table:
    description: >
      The ID of the bronze table for profile daily engagement and impressions metrics
    default: totals

  # 3.1 Settings for the bronze staging tables
  bronze_staging_impressions_prefix:
    description: >
      The prefix for bronze staging tables for daily post impressions
    default: impressions
  bronze_staging_engagements_prefix:
    description: >
      The prefix for bronze staging tables for daily post engagements
    default: engagements

  # 4. Settings for the silver tables
  silver_post_impressions:
    description: >
      The ID of the silver table for post impressions
    default: impressions
  silver_post_engagements:
    description: >
      The ID of the silver table for post engagements
    default: engagements
  silver_profile_followers:
    description: >
      The ID of the silver table for profile follower metrics
    default: followers
  silver_post_metadata:
    description: >
      The ID of the silver table for consolidated post metadata
    default: posts
  silver_profile_metrics:
    description: >
      The ID of the silver table for profile engagement and impressions metrics
    default: totals

  # 5. Settings for the gold tables
  gold_fact_daily_post_statistics_table:
    description: >
      The ID of the gold fact table/view for daily posts statistics
    default: fct_daily_post_statistics
  gold_fact_daily_profile_statistics_table:
    description: >
      The ID of the gold fact table/view for daily profile statistics
    default: fct_daily_profile_statistics
  gold_dimension_date_table:
    description: >
      The ID of the gold dimension table/view for dates
    default: dim_date
  gold_dimension_timestamp_table:
    description: >
      The ID of the gold dimension table/view for timestamps
    default: dim_timestamp

  # 6. Settings for the dashboard
  dashboard_subscriber:
    description: >
      The email of the subscriber to the dashboard
    default: ${workspace.current_user.userName}

Now changing a table name or adding a new volume is a single variable change, not a hunt through multiple notebooks and jobs.
The variables with no defaults (omitted due to their sensitivity) are instead defined in the override JSON located in the .databricks folder.

The structure of the file is as follows.
{
  "linkedin_profile_name": "placeholder",
  "warehouse_id": "placeholder",
  "pipeline_runner": "placeholder"
}

b. schemas.yml
We define our schemas as well as our volumes here:
resources:
  schemas:
    landing:
      catalog_name: ${var.landing_catalog}
      comment: LinkedIn landing schema created by DAB
      name: ${var.common_schema}
    bronze:
      catalog_name: ${var.bronze_catalog}
      comment: LinkedIn bronze schema created by DAB
      name: ${var.common_schema}
    silver:
      catalog_name: ${var.silver_catalog}
      comment: LinkedIn silver schema created by DAB
      name: ${var.common_schema}
    gold:
      catalog_name: ${var.gold_catalog}
      comment: LinkedIn gold schema created by DAB
      name: ${var.common_schema}
  volumes:
    content_daily:
      catalog_name: ${var.landing_catalog}
      schema_name: ${resources.schemas.landing.name}
      name: ${var.landing_content_daily_volume}
    content_historical:
      catalog_name: ${var.landing_catalog}
      schema_name: ${resources.schemas.landing.name}
      name: ${var.landing_content_historical_volume}
    posts:
      catalog_name: ${var.landing_catalog}
      schema_name: ${resources.schemas.landing.name}
      name: ${var.landing_posts_volume}
    patch:
      catalog_name: ${var.landing_catalog}
      schema_name: ${resources.schemas.landing.name}
      name: ${var.landing_patch_volume}

Here we see our variables in action, deploying consistent namings for schemas and volumes that are also fed into our notebooks.
c. dashboard.yml
We define our dashboard as below:
sync:
  exclude:
    - ../src/*.lvdash.json
    - ../src/*/*.lvdash.json

resources:
  dashboards:
    linkedin_statistics:
      display_name: LinkedIn Statistics
      file_path: >
        ../src/linkedin_analytics_jobs/
        4. data product/LinkedIn Statistics.lvdash.json
      warehouse_id: ${var.warehouse_id}

The exclusions are to avoid duplication of dashboards, per Databricks’ advice.
As of this writing, there is no straightforward, documented way to pull DAB variables directly into the dataset queries inside a Databricks dashboard, as evidenced by an open Databricks CLI GitHub issue. Dashboard parameters are instead designed for end-user interactivity via widgets. A workaround using third-party CI/CD tools such as GitHub Actions and Jenkins to template the dashboard JSON before deployment is possible; we touch on such tools towards the end of this article.
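As a sketch of that workaround (the file layout and the `{{GOLD_CATALOG}}` placeholder are our own invention, not a Databricks convention), a CI/CD step could template the dashboard JSON just before deployment:

```python
import json

def template_dashboard(raw: str, substitutions: dict) -> str:
    """Replace {{PLACEHOLDER}} tokens in a dashboard JSON string."""
    for key, value in substitutions.items():
        raw = raw.replace("{{" + key + "}}", value)
    # Fail fast if the result is no longer valid JSON
    json.loads(raw)
    return raw

# A minimal stand-in for a .lvdash.json dataset query
raw = ('{"datasets": [{"query": '
       '"SELECT * FROM {{GOLD_CATALOG}}.linkedin.fct_daily_post_statistics"}]}')
print(template_dashboard(raw, {"GOLD_CATALOG": "gold"}))
```

A CI/CD job would run this over the checked-out dashboard file, then invoke the bundle deployment with the templated copy.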
d. jobs.yml
Recall the pipeline structure that we finalized in the previous post:

We want to embed this all in code. But how?
Navigate to any of the Databricks Jobs. Click the three dots next to the ‘Run now’ button, and choose ‘View as code’

Now we can see the code version of the Databricks Job, which has defaulted to YAML format. We can copy this and paste it into ‘jobs.yml’, then edit accordingly (e.g. replacing hard-coded values with our constants). We need to repeat this for all of our pipelines.

One of the items that needs changing for this Databricks Job is the job_id that is triggered at the end of the pipeline. We can substitute an inferred variable in place of the hard-coded number, as below:

For more details on the format for the types of tasks that can be added to Databricks Jobs in a DAB, refer to this link.
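For example, where the exported YAML contained a hard-coded job_id for the run_job_task, the DAB version references the job resource by its key instead (a sketch; the numeric ID is a placeholder):

```yaml
# Before (exported from the UI): a hard-coded, environment-specific ID
run_job_task:
  job_id: 123456789

# After (in the DAB): an inferred reference, resolved at deployment time
run_job_task:
  job_id: ${resources.jobs.linkedin_transformation.id}
```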
At the end of this import and refactoring process, we end up with a file that consolidates all our job definitions and uses our defined constants and inferred variables, as per below:
resources:
jobs:
linkedin_transformation:
name: linkedin transformation
tasks:
- task_key: silver_engagements_transformation
notebook_task:
base_parameters:
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_ENGAGEMENTS_TABLE_PREFIX: >
${var.bronze_staging_engagements_prefix}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
SILVER_CATALOG: >
${var.silver_catalog}
SILVER_SCHEMA: >
${resources.schemas.silver.name}
SILVER_ENGAGEMENTS_TABLE: >
${var.silver_post_engagements}
notebook_path: >
../src/linkedin_analytics_jobs/
2. silver transformation/
silver engagements consolidation.ipynb
source: WORKSPACE
- task_key: silver_impressions_transformation
notebook_task:
base_parameters:
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_IMPRESSIONS_TABLE_PREFIX: >
${var.bronze_staging_impressions_prefix}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
SILVER_CATALOG: >
${var.silver_catalog}
SILVER_SCHEMA: >
${resources.schemas.silver.name}
SILVER_IMPRESSIONS_TABLE: >
${var.silver_post_impressions}
notebook_path: >
../src/linkedin_analytics_jobs/
2. silver transformation/
silver impressions consolidation.ipynb
source: WORKSPACE
- task_key: silver_views_update
notebook_task:
base_parameters:
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_FOLLOWERS_TABLE: >
${var.bronze_profile_followers_table}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
SILVER_CATALOG: >
${var.silver_catalog}
SILVER_SCHEMA: >
${resources.schemas.silver.name}
SILVER_FOLLOWERS_TABLE: >
${var.silver_profile_followers}
SILVER_TOTALS_TABLE: >
${var.silver_profile_metrics}
notebook_path: >
../src/linkedin_analytics_jobs/
2. silver transformation/
silver views update.ipynb
source: WORKSPACE
- task_key: silver_posts_enrichment
notebook_task:
base_parameters:
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_POSTS_TABLE: >
${var.bronze_post_metadata_table}
BRONZE_POST_DETAILS_TABLE: >
${var.bronze_post_metrics_table}
BRONZE_POST_PATCH_TABLE: >
${var.bronze_post_patch_table}
SILVER_CATALOG: >
${var.silver_catalog}
SILVER_SCHEMA: >
${resources.schemas.silver.name}
SILVER_POSTS_TABLE: >
${var.silver_post_metadata}
notebook_path: >
../src/linkedin_analytics_jobs/
2. silver transformation/
silver post enrichment.ipynb
source: WORKSPACE
- task_key: gold_create_or_refresh
depends_on:
- task_key: silver_engagements_transformation
- task_key: silver_impressions_transformation
- task_key: silver_posts_enrichment
- task_key: silver_views_update
pipeline_task:
pipeline_id: >
${resources.pipelines.gold_create.id}
max_retries: 3
min_retry_interval_millis: 60000
retry_on_timeout: true
- task_key: dashboard_refresh
depends_on:
- task_key: gold_create_or_refresh
dashboard_task:
subscription:
subscribers:
- user_name: >
${var.dashboard_subscriber}
warehouse_id: >
${var.warehouse_id}
dashboard_id: >
${resources.dashboards.linkedin_statistics.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED
linkedin_daily_ingest:
name: linkedin daily ingest
trigger:
pause_status: UNPAUSED
file_arrival:
url: >
/Volumes/${var.landing_catalog}/
${resources.schemas.landing.name}/
${resources.volumes.content_daily.name}/
${var.ingestion_folder}/
tasks:
- task_key: bronze_daily_ingest
notebook_task:
base_parameters:
LINKEDIN_PROFILE_NAME: >
${var.linkedin_profile_name}
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_DAILY_VOLUME: >
${resources.volumes.content_daily.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_DISCOVERY_TABLE: >
${var.bronze_profile_discovery_metrics_table}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
BRONZE_FOLLOWERS_TABLE: >
${var.bronze_profile_followers_table}
BRONZE_IMPRESSIONS_TABLE: >
${var.bronze_staging_impressions_prefix}
BRONZE_ENGAGEMENTS_TABLE: >
${var.bronze_staging_engagements_prefix}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze daily ingest.ipynb
source: WORKSPACE
- task_key: bronze_post_ingest
depends_on:
- task_key: bronze_daily_ingest
notebook_task:
base_parameters:
LINKEDIN_PROFILE_NAME: >
${var.linkedin_profile_name}
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_POSTS_VOLUME: >
${resources.volumes.posts.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_IMPRESSIONS_TABLE: >
${var.bronze_staging_impressions_prefix}
BRONZE_POSTS_TABLE: >
${var.bronze_post_metadata_table}
BRONZE_POST_DETAILS_TABLE: >
${var.bronze_post_metrics_table}
BRONZE_POST_PATCH_TABLE: >
${var.bronze_post_patch_table}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze post ingest.ipynb
source: WORKSPACE
- task_key: trigger_transformation
depends_on:
- task_key: bronze_post_ingest
run_job_task:
job_id: >
${resources.jobs.linkedin_transformation.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED
linkedin_historical_ingestion:
name: linkedin historical ingestion
trigger:
pause_status: UNPAUSED
file_arrival:
url: >
/Volumes/${var.landing_catalog}/
${resources.schemas.landing.name}/
${resources.volumes.content_historical.name}/
${var.ingestion_folder}/
tasks:
- task_key: bronze_historical_ingest
notebook_task:
base_parameters:
LINKEDIN_PROFILE_NAME: >
${var.linkedin_profile_name}
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_DAILY_VOLUME: >
${resources.volumes.content_historical.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
BRONZE_FOLLOWERS_TABLE: >
${var.bronze_profile_followers_table}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze historical ingest.ipynb
source: WORKSPACE
- task_key: trigger_transformation
depends_on:
- task_key: bronze_historical_ingest
run_job_task:
job_id: >
${resources.jobs.linkedin_transformation.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED
linkedin_post_ingest:
name: linkedin post ingest
trigger:
pause_status: UNPAUSED
file_arrival:
url: >
/Volumes/${var.landing_catalog}/
${resources.schemas.landing.name}/
${resources.volumes.posts.name}/
${var.ingestion_folder}/
tasks:
- task_key: bronze_post_ingest
notebook_task:
base_parameters:
LINKEDIN_PROFILE_NAME: >
${var.linkedin_profile_name}
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_POSTS_VOLUME: >
${resources.volumes.posts.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_IMPRESSIONS_TABLE: >
${var.bronze_staging_impressions_prefix}
BRONZE_POSTS_TABLE: >
${var.bronze_post_metadata_table}
BRONZE_POST_DETAILS_TABLE: >
${var.bronze_post_metrics_table}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze post ingest.ipynb
source: WORKSPACE
- task_key: trigger_transformation
depends_on:
- task_key: bronze_post_ingest
run_job_task:
job_id: >
${resources.jobs.linkedin_transformation.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED
linkedin_post_patch_ingest:
name: linkedin post patch ingest
trigger:
pause_status: UNPAUSED
file_arrival:
url: >
/Volumes/${var.landing_catalog}/
${resources.schemas.landing.name}/
${resources.volumes.patch.name}/
${var.ingestion_folder}/
${var.patch_post_subfolder}/
tasks:
- task_key: bronze_post_patch_ingest
notebook_task:
base_parameters:
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_PATCH_VOLUME: >
${resources.volumes.patch.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
PATCH_POST_SUBFOLDER: >
${var.patch_post_subfolder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_POST_PATCH_TABLE: >
${var.bronze_post_patch_table}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze post patch ingest.ipynb
source: WORKSPACE
- task_key: trigger_transformation
depends_on:
- task_key: bronze_post_patch_ingest
run_job_task:
job_id: >
${resources.jobs.linkedin_transformation.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED

There are a few changes to this set of Databricks Jobs that weren’t present in the POC pipeline:
- The post patch ingestion job now triggers on a file arrival event to the newly created patch landing volume, instead of on a bronze patch table update. This is because the bronze schema is being created by the DAB, and thus the bronze patch table can’t exist before that point, which makes the table update trigger invalid at the point of initial job creation. The post patch ingestion task has also been modified accordingly.
- The gold materialized views have been moved into a Spark Declarative Pipeline; more on that later.
- Task parameters have been defined for each notebook, bringing our defined constants into the notebooks.
We also need to edit our notebooks to take in the task parameters. Here’s an example of what needs to be done for Python substitutions:
import re

def get_valid_parameter(parameter_key: str):
    """
    Get a valid parameter value from the Databricks widgets.
    Hardened to thwart SQL injection attacks.
    """
    parameter_value = dbutils.widgets.get(parameter_key)
    # parameter_value must be a string with only
    # alphanumeric characters and underscores
    if not re.fullmatch(r'[a-zA-Z0-9_]+', parameter_value):
        raise ValueError(
            f"Invalid parameter value for {parameter_key}: "
            f"{parameter_value}"
        )
    # Disallow dangerous SQL keywords and patterns
    forbidden_patterns = [
        r'--', r';', r"'", r'"', r'/\*', r'\*/', r'xp_', r'char\(',
        r'nchar\(', r'varchar\(', r'\balter\b', r'\bdrop\b',
        r'\binsert\b', r'\bdelete\b', r'\bupdate\b',
        r'\bselect\b', r'\bcreate\b', r'\bexec\b', r'\bunion\b',
        r'\bor\b', r'\band\b'
    ]
    for pattern in forbidden_patterns:
        if re.search(pattern, parameter_value, re.IGNORECASE):
            raise ValueError(
                f"Potentially dangerous value for {parameter_key}: "
                f"{parameter_value} (pattern matched: {pattern})"
            )
    return parameter_value

# 1. Define our constants
try:
    BRONZE_CATALOG = get_valid_parameter("BRONZE_CATALOG")
    BRONZE_SCHEMA = get_valid_parameter("BRONZE_SCHEMA")
    BRONZE_IMPRESSIONS_TABLE_PREFIX = \
        get_valid_parameter("BRONZE_IMPRESSIONS_TABLE_PREFIX")
    BRONZE_TOTALS_TABLE = get_valid_parameter("BRONZE_TOTALS_TABLE")
    SILVER_CATALOG = get_valid_parameter("SILVER_CATALOG")
    SILVER_SCHEMA = get_valid_parameter("SILVER_SCHEMA")
    SILVER_IMPRESSIONS_TABLE = \
        get_valid_parameter("SILVER_IMPRESSIONS_TABLE")
    print("Loaded widget values")
except Exception:
    BRONZE_CATALOG = "bronze"
    BRONZE_SCHEMA = "linkedin"
    BRONZE_IMPRESSIONS_TABLE_PREFIX = "impressions"
    BRONZE_TOTALS_TABLE = "totals"
    SILVER_CATALOG = "silver"
    SILVER_SCHEMA = "linkedin"
    SILVER_IMPRESSIONS_TABLE = "impressions"
    print("Failed to load widget values, using default values")

We could have retrieved the parameters directly using Databricks’ associated utility function, but instead we’re wrapping it in additional Python code to prevent SQL injection attacks (at least on a basic level).
SQL code is a tougher ask. In theory, Databricks recommends the use of the IDENTIFIER clause for bringing parameters into SQL code, as per below:
CREATE VIEW IF NOT EXISTS IDENTIFIER
  (:SILVER_CATALOG || "." || :SILVER_SCHEMA || "." ||
   :SILVER_TOTALS_TABLE)
AS
SELECT date, impressions, engagements
FROM IDENTIFIER
  (:BRONZE_CATALOG || "." || :BRONZE_SCHEMA || "." ||
   :BRONZE_TOTALS_TABLE);

CREATE VIEW IF NOT EXISTS IDENTIFIER
  (:SILVER_CATALOG || "." || :SILVER_SCHEMA || "." ||
   :SILVER_FOLLOWERS_TABLE)
AS
SELECT date, new_followers
FROM IDENTIFIER
  (:BRONZE_CATALOG || "." || :BRONZE_SCHEMA || "." ||
   :BRONZE_FOLLOWERS_TABLE);

But the IDENTIFIER clause, as of this writing, doesn’t support DDL statements like the above, which unfortunately covers all instances of our parameter usage. According to a Databricks forum thread, this issue should go away once the Databricks runtime on serverless compute is upgraded to the latest version, so this is hopefully a temporary state of affairs.
We can use the Python approach to validate our parameters before executing the SQL code using the spark.sql function. This is what we end up doing for our logical view creation in the silver layer.
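Concretely, once every identifier has passed validation, the statement can be assembled and handed to spark.sql. A sketch (the helper name build_create_view_sql is ours, and the spark.sql call is commented out so the snippet stands alone outside Databricks):

```python
def build_create_view_sql(src_catalog: str, src_schema: str, src_table: str,
                          dst_catalog: str, dst_schema: str,
                          dst_table: str) -> str:
    """Assemble a CREATE VIEW statement from pre-validated identifiers."""
    return (
        f"CREATE VIEW IF NOT EXISTS "
        f"{dst_catalog}.{dst_schema}.{dst_table} AS "
        f"SELECT * FROM {src_catalog}.{src_schema}.{src_table}"
    )

sql = build_create_view_sql("bronze", "linkedin", "totals",
                            "silver", "linkedin", "totals")
# In the notebook, after get_valid_parameter has vetted every identifier:
# spark.sql(sql)
print(sql)
```

Because the identifiers were validated before reaching the f-string, the interpolation here cannot introduce injected SQL.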
But there is an additional wrinkle at the gold layer: SQL statements for materialized views can only run on an SQL warehouse cluster. This precludes using Python to retrieve the parameters properly.
This is why the code for the gold materialized views has been moved into a pipeline that replaces the old notebook: that is what we’ll look at next.
e. pipelines.yml
First, here’s the pipeline definition for reference:
resources:
  pipelines:
    gold_create:
      name: gold_create
      configuration:
        SILVER_CATALOG: >
          ${var.silver_catalog}
        SILVER_SCHEMA: >
          ${resources.schemas.silver.name}
        SILVER_IMPRESSIONS_TABLE: >
          ${var.silver_post_impressions}
        SILVER_ENGAGEMENTS_TABLE: >
          ${var.silver_post_engagements}
        SILVER_POSTS_TABLE: >
          ${var.silver_post_metadata}
        SILVER_TOTALS_TABLE: >
          ${var.silver_profile_metrics}
        SILVER_FOLLOWERS_TABLE: >
          ${var.silver_profile_followers}
        GOLD_CATALOG: >
          ${var.gold_catalog}
        GOLD_SCHEMA: >
          ${resources.schemas.gold.name}
        FCT_DAILY_PROFILE_STATS_TABLE: >
          ${var.gold_fact_daily_profile_statistics_table}
        FCT_DAILY_POST_STATS_TABLE: >
          ${var.gold_fact_daily_post_statistics_table}
        DIM_DATE_TABLE: >
          ${var.gold_dimension_date_table}
        DIM_TIMESTAMP_TABLE: >
          ${var.gold_dimension_timestamp_table}
      libraries:
        - glob:
            include: >
              ../src/linkedin_analytics_jobs/
              3. gold modelling/pipeline_gold_create/**
      schema: ${resources.schemas.gold.name}
      continuous: false
      development: false
      photon: true
      channel: CURRENT
      catalog: ${var.gold_catalog}
      serverless: true
      root_path: >
        ../src/linkedin_analytics_jobs/
        3. gold modelling/

Just like with the Databricks Jobs, I created a POC manually and configured it to work, then copied the YAML over to the DAB and modified it accordingly. I’ve also copied the associated pipeline files into the DAB, resulting in the following file structure:

Here are the highlights:
- Each gold materialized view definition has been separated into its own SQL file in the same folder (3. gold modelling/pipeline_gold_create).
- For now, each gold materialized view uses the older ${PARAM} parameter syntax because IDENTIFIER does not yet support our DDL materialized view statements on serverless SQL warehouses. Once the runtime is upgraded, these scripts should be migrated to parameter markers plus IDENTIFIER.
See the documentation for IDENTIFIER for more details.
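As an illustration of that older syntax, one of the gold materialized-view scripts in the pipeline might look like the following sketch (the parameters come from the pipeline configuration above; the column list is illustrative):

```sql
CREATE OR REFRESH MATERIALIZED VIEW ${FCT_DAILY_PROFILE_STATS_TABLE}
AS
SELECT date, impressions, engagements
FROM ${SILVER_CATALOG}.${SILVER_SCHEMA}.${SILVER_TOTALS_TABLE};
```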
Deprecation of language features, syntax and other dependencies can be a real concern in a production environment. The general guideline is to avoid using deprecated features and to migrate pipelines that depend on such features in a timely fashion and wherever feasible.
One last note before we move on: the risk arising from an SQL injection attack in this instance (a personal LinkedIn project) is relatively low. In addition, our scripts are internal and parameters are controlled values coming from the DAB, which greatly reduces the likelihood of an SQL injection attack.
f. databricks.yml
Now for the main YAML file.
# This is a Databricks asset bundle definition for <DAB name>.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: <DAB name>
  uuid: <DAB UUID>

include:
  - resources/*.yml
  - resources/*/*.yml

targets:
  dev:
    # The default target uses 'mode: development' to create a development copy.
    # - Deployed resources get prefixed with '[dev my_user_name]'
    # - Any job schedules and triggers are paused by default.
    # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html.
    mode: development
    default: true
    workspace:
      host: https://<redacted>.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://<redacted>.cloud.databricks.com
      # We explicitly deploy to /Workspace/Users/<current user>
      # to make sure we only have a single copy in the workspace.
      root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
    run_as:
      service_principal_name: ${var.pipeline_runner}
    variables:
      dashboard_subscriber: ${workspace.current_user.userName}
    # permissions:
    #   - user_name: <replace with your own user name>
    #     level: CAN_MANAGE

Here we see the service principal being used to run the production pipelines, and the other YAML files being included via the defined paths.
We are working on the assumption that we have only one workspace in one Databricks instance to work with, and indeed Databricks Free Edition only allows a single workspace per instance.
In an enterprise deployment, development and production environments would typically be deployed in different workspaces or different instances; the latter applies where the Databricks instances are associated with a development and production cloud project (e.g. a Google Cloud project or an Azure resource group).
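To make that concrete, a multi-workspace setup would mainly require the targets to point at different hosts; a minimal sketch, where the host names are placeholders rather than real instances:

```yaml
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-instance.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-instance.cloud.databricks.com
```

The rest of the bundle stays identical, which is precisely what makes the dev/prod promotion reproducible.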
V. Deploying the DAB
Clicking the rocket icon below the folders icon in the DAB interface brings up the Deployments navigation panel. Click ‘Deploy’.

You will be presented with a deployment screen with a progress wheel while Databricks checks in the background what can be deployed (the equivalent of the ‘terraform plan’ step). If this step fails, there is most likely something wrong with your DAB configuration, and you will need to troubleshoot accordingly.

If the check is successful, you will be shown a list of all resources that are to be created, deleted or updated.

Note the dev prefixes for the dashboards, jobs, pipelines and schemas; this is due to the ‘mode: development’ setting on the dev target in databricks.yml, and is a way to keep the development deployments created by multiple contributors in the same workspace separate.
Clicking the ‘Deploy’ button will bring up the deployment output, where you can track the progress of the deployment (this is the equivalent of ‘terraform apply’).

If the deployment is successful, you will see the list of resources created as below. Now you can go to the respective pipelines and test them accordingly.



Once we’re happy with the results, we can deploy the DAB to production, and test there again.
Below is a list of additional fixes I did to the pipeline that were not immediately apparent in the POC:
- I added statements where relevant to create empty source tables, particularly for bronze.linkedin.post_details (which is only populated when post analytics Excel files are ingested).
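The guard in question is a conditional CREATE so that downstream silver and gold queries never fail on a missing source table; a sketch, where the column list is illustrative rather than the actual schema:

```sql
-- Create the table only if ingestion has not created it yet;
-- a no-op on every subsequent run, so the statement stays idempotent.
CREATE TABLE IF NOT EXISTS bronze.linkedin.post_details (
  post_url STRING,
  created_date DATE
  -- remaining columns omitted
);
```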
VI. Areas for Improvement
We now have a maintainable pipeline where we can experiment in a development environment, with the confidence that we can implement or roll back any changes rapidly and accurately.
There are a few areas that we have not covered that are beyond the scope of the above deployment.
- The ‘pending’ folders in each volume still need to be created manually; as of this writing, folders inside a volume cannot be defined directly in asset bundles. This can be mitigated by including a small Databricks Job that runs a script or notebook to create said folders.
- We have been working exclusively in the Databricks UI, which is fine for our demonstration purposes. In an enterprise deployment we should be working from the associated Git repository and following the relevant best practices, including raising pull requests for feature changes and merging them into separate development and production (main) branches. For this to work, we will need to configure the relevant CI/CD actions (e.g. GitHub Actions on GitHub, or Jenkins for an orchestrator-agnostic setup) to manage the DAB via the Databricks CLI. This is also where the ‘pending’ folders can be created, and the queries in our Databricks dashboard JSON updated, as part of the follow-up CI/CD actions. The end result is to make deployments reproducible and reviewable, rather than dependent on manual clicks in the UI.
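As a sketch of what such a CI/CD action could look like with GitHub Actions — the workflow file name, secret names and trigger branch are assumptions for illustration, not part of this project:

```yaml
# .github/workflows/deploy-dab.yml
name: Deploy DAB
on:
  push:
    branches: [main]  # deploy to prod on merges into main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Installs the Databricks CLI, which includes the bundle commands.
      - uses: databricks/setup-cli@main
      # Equivalent of 'terraform plan': catch configuration errors early.
      - name: Validate bundle
        run: databricks bundle validate -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      # Equivalent of 'terraform apply': deploy the prod target.
      - name: Deploy bundle
        run: databricks bundle deploy -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```

Follow-up steps (creating the ‘pending’ folders, patching the dashboard JSON) would slot in as additional steps after the deploy.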
VII. What’s Next?
Once the above deployment gaps are closed, the final step is to know when the pipeline breaks, and to see at a glance where the pipeline may be breaking. That’s the subject of our next post. See you there!