Build Your Own LinkedIn Analytics Part 9: Making a Maintainable Pipeline
Originally published on Medium on 22 December 2025
By the end of the previous post, we had built our LinkedIn analytics data pipeline from ingestion all the way to the final dashboard, and had put proper orchestration and automation in place.
The next step then is making sure that all of what we’ve just built is easy to maintain.
TL;DR
- Use Git and Databricks Asset Bundles to turn your LinkedIn analytics pipeline from a Databricks POC into a versioned, reproducible project with dev/prod separation, safe rollbacks, and one‑click deployments.
I. Options for Maintainability
A maintainable pipeline is one that has the following qualities:
- Easy to change;
- Easy to roll back; and
- Safe to run autonomously without manual babysitting.
You can thus safely introduce changes such as new engagement metrics, knowing you can roll back if something breaks. For a data‑driven LinkedIn strategy, that reliability is what lets you experiment aggressively without losing trust in your numbers.
A lot of what makes a pipeline maintainable is already part of the design of our LinkedIn analytics data product.
- Well-documented assets: This includes readable and well-structured code, clear comments, and relevant descriptions of data assets.
- Modular notebooks as tasks: The notebooks that make up our task nodes can be swapped out for code that does the equivalent; there is no dependency on any hidden state or special flag.
- Idempotent tasks and runs: We can re-run a failed task or pipeline without worrying about data consistency.
- Data versioning via Delta Lake tables: Databricks managed tables have versioning enabled by default, so any errors committed during an update can be rolled back with little issue.
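For instance, if a bad update lands in one of our Delta tables, we can inspect the table's history and restore an earlier version. A sketch in SQL (the table name `bronze.linkedin.totals` matches the defaults we define later in this post, and the version number is illustrative):

```sql
-- List the table's recent versions and the operations that produced them
DESCRIBE HISTORY bronze.linkedin.totals LIMIT 5;

-- Roll the table back to the version before the bad update
RESTORE TABLE bronze.linkedin.totals TO VERSION AS OF 3;
```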
But we need some additional qualities:
- Version control for notebooks, jobs, dashboards and schema definitions: Without it, if a pipeline were tampered with or erroneous code introduced, we would have no easy way of restoring the previous state.
- Separation of development from production: We want to have a version of the pipeline and assets that can be used for testing new features or fixes (development), before the changes are pushed to the main pipeline (production).
- Continuous integration and deployment: With this distinction between development and production, we also want to have a way to manage all these differences in a controlled and automated fashion.
We have two main ways of achieving these qualities, which we will be using in combination.
a. Git repositories
Git repositories are the modern way of managing code versions, and Databricks does support them out of the box. The problem is that we don’t yet have a code version of our Databricks infrastructure such as our Databricks Jobs or our dashboard. How do we version those?
b. Databricks Asset Bundles
Databricks introduced this feature in 2024 and has been iterating on it ever since. As the name suggests, this bundles Databricks assets together in code form, including Jobs, Dashboards and even schemas. In other words, this is Databricks’ version of Infrastructure as Code (IaC). This concept was popularized by HashiCorp Terraform, and indeed Databricks Asset Bundles (or DABs) were initially built on top of Terraform, though there is now a push to move away from that dependency.
Note that DABs are meant to be used together with Git repositories; in fact, in the Databricks UI, you can only create a DAB in a Git repository folder.
II. Setting up Git in Databricks
This step assumes that you already have a Git repository at hand. Databricks supports a comprehensive array of Git providers, including the various flavours of GitHub, GitLab, BitBucket, Azure DevOps and AWS CodeCommit. For our purposes, GitHub works just fine.
In the Databricks UI, click on your profile icon and choose the ‘Settings’ option. You should see the following menu on the left:

Click on ‘Linked accounts’ under the ‘User’ section. As of this writing, the only service available for connection is Git integration, which is exactly what we want. Click on ‘Add Git credentials’.

Here you choose your Git provider. If you choose GitHub, you can then link your Git account in a few clicks. You can also opt to use a personal access token if you don’t wish to publish your associated email to the repository (which may be a concern if you intend to connect to or create a public repository). Other Git providers require you to provide similar access tokens.

The next step is to create a Git folder in our workspace. Click ‘+ New’ at the top left corner, navigate to ‘More’, and then choose ‘Git folder’.

You can now create your Git folder. Obtain your Git repository URL (the steps to obtain it for GitHub are here), name the folder, then proceed to create it.

Below is the end result of this process: a Git folder on Databricks.

Git is a fundamental tool for code versioning, and is a required competency for any aspiring data engineer (and developer in general). If you want to dive deeper, there is a comprehensive guide here.
III. Setting up a service principal
If you look at the ‘Run as’ column in Jobs and Pipelines, you can see that all our runs so far have been using our personal credentials. This is fine for development work, but for a production run, that is misleading at best (we aren’t doing a manual run) and potentially catastrophic at worst (what happens when the pipeline developer leaves the organization?).
In general, we should always use synthetic credentials for production runs and deployments, whether they come in the form of service principals or service accounts.
With that in mind, let’s create a Databricks service principal. Back in the ‘Settings’ menu, click on the ‘Identity and access’ option under ‘Workspace admin’. You will see the following section:

Administering users and groups is beyond the scope of our exploration, especially in our Databricks Free instance, but it is important to set these up for any enterprise deployment. Databricks has some recommended best practices in their documentation.
Click on ‘Manage’ for Service principals, then ‘Add service principal’ on the next page, and finally ‘Add new’ in the subsequent page.


Type a descriptive name for the service principal, then click ‘Add’.

Note the Application ID of our newly created service principal; we will be using it later in our DAB deployment.
IV. Creating our Databricks Asset Bundle
Navigate to our Git folder, and click ‘Create’ at the top right corner. You will see ‘Asset bundle’ as one of the options; click on it.

This brings you to the list of options for creating a Databricks Asset Bundle as below:

For the purposes of this project, choose ‘Empty project’; you can explore the more guided options separately.
This is a good time to outline what our structure for our DAB is going to look like.

- There are a bunch of additional YAML files in the ‘resources’ folder. This is best practice for separation of different types of resources; we don’t want to dump everything into a single databricks.yml file, for the same reason we don’t want to have a monolithic notebook as our pipeline.
- Note the ‘variable-override.json’ file in the ‘.databricks’ folder. Because this folder is listed in the ‘.gitignore’ file, it and its contents will not be written to the Git repository. This is a good way to store any local configurations that you don’t want to expose in the repository, such as the Databricks warehouse ID and the service principal ID.
- The notebooks from our POC pipeline have been copied into this DAB. I have arranged them in numbered folders under ‘src’ (aka the source folder). I have also included the POC notebook that was the source for the individual notebook tasks; this can prove useful in trying out end-to-end ideas in the development stage.
Let’s go through the different YAML configuration files in turn.
a. variables.yml
This file serves as a single place for defining our pipeline-wide constants. With this, we no longer have to repeat them through multiple notebooks, or manually keep the naming of data assets such as schemas and tables consistent.
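Conceptually, each `${var.*}` reference elsewhere in the bundle is substituted with the variable's default (or an override) at deployment time. A toy Python sketch of that substitution, just to illustrate the mechanism (this is not the actual DAB engine):

```python
import re

# Illustrative subset of the variables defined in variables.yml
variables = {
    "bronze_catalog": "bronze",
    "common_schema": "linkedin",
}

def interpolate(template: str, variables: dict) -> str:
    """Replace each ${var.name} reference with its configured value."""
    return re.sub(
        r"\$\{var\.([a-zA-Z0-9_]+)\}",
        lambda m: variables[m.group(1)],
        template,
    )

path = interpolate("${var.bronze_catalog}.${var.common_schema}", variables)
print(path)  # bronze.linkedin
```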
The contents of the file are as below:
variables:
  # 0. Sensitive variables
  linkedin_profile_name:
    description: >
      The name of the LinkedIn profile
  warehouse_id:
    description: >
      The ID of the Databricks SQL warehouse cluster
  pipeline_runner:
    description: >
      The name of the service principal used to deploy the DAB and run the pipelines

  # 1. Settings for the medallion layer catalogs and schemas
  landing_catalog:
    description: >
      The ID of the landing catalog
    default: landing
  bronze_catalog:
    description: >
      The ID of the bronze catalog
    default: bronze
  silver_catalog:
    description: >
      The ID of the silver catalog
    default: silver
  gold_catalog:
    description: >
      The ID of the gold catalog
    default: gold
  common_schema:
    description: >
      The ID of the schema used in all medallion catalogs
    default: linkedin

  # 2. Settings for the landing volumes
  landing_content_daily_volume:
    description: >
      The ID of the landing volume for daily content metrics Excel files
    default: content_daily
  landing_content_historical_volume:
    description: >
      The ID of the landing volume for historical content metrics Excel files
    default: historical_daily
  landing_posts_volume:
    description: >
      The ID of the landing volume for post metrics Excel files
    default: posts
  landing_patch_volume:
    description: >
      The ID of the landing volume for patch CSV files
    default: patch
  ingestion_folder:
    description: >
      The folder in each volume for uploading files to be ingested
    default: pending
  processed_folder:
    description: >
      The folder in each volume for storing successfully ingested files
    default: processed
  errors_folder:
    description: >
      The folder in each volume for storing files with errors during ingestion
    default: errors
  patch_post_subfolder:
    description: >
      The subfolder in the landing patch volume for patching posts data
    default: posts

  # 3. Settings for the bronze tables
  bronze_profile_discovery_metrics_table:
    description: >
      The ID of the bronze table for profile daily discovery metrics
    default: discovery
  bronze_profile_followers_table:
    description: >
      The ID of the bronze table for profile daily follower metrics
    default: followers
  bronze_post_patch_table:
    description: >
      The ID of the bronze table for post patching
    default: linkedin_patch
  bronze_post_metrics_table:
    description: >
      The ID of the bronze table for post metrics (taken from post metrics Excel files)
    default: post_details
  bronze_post_metadata_table:
    description: >
      The ID of the bronze table for post metadata
    default: posts
  bronze_profile_metrics_table:
    description: >
      The ID of the bronze table for profile daily engagement and impressions metrics
    default: totals

  # 3.1 Settings for the bronze staging tables
  bronze_staging_impressions_prefix:
    description: >
      The prefix for bronze staging tables for daily post impressions
    default: impressions
  bronze_staging_engagements_prefix:
    description: >
      The prefix for bronze staging tables for daily post engagements
    default: engagements

  # 4. Settings for the silver tables
  silver_post_impressions:
    description: >
      The ID of the silver table for post impressions
    default: impressions
  silver_post_engagements:
    description: >
      The ID of the silver table for post engagements
    default: engagements
  silver_profile_followers:
    description: >
      The ID of the silver table for profile follower metrics
    default: followers
  silver_post_metadata:
    description: >
      The ID of the silver table for consolidated post metadata
    default: posts
  silver_profile_metrics:
    description: >
      The ID of the silver table for profile engagement and impressions metrics
    default: totals

  # 5. Settings for the gold tables
  gold_fact_daily_post_statistics_table:
    description: >
      The ID of the gold fact table/view for daily posts statistics
    default: fct_daily_post_statistics
  gold_fact_daily_profile_statistics_table:
    description: >
      The ID of the gold fact table/view for daily profile statistics
    default: fct_daily_profile_statistics
  gold_dimension_date_table:
    description: >
      The ID of the gold dimension table/view for dates
    default: dim_date
  gold_dimension_timestamp_table:
    description: >
      The ID of the gold dimension table/view for timestamps
    default: dim_timestamp

  # 6. Settings for the dashboard
  dashboard_subscriber:
    description: >
      The email of the subscriber to the dashboard
    default: ${workspace.current_user.userName}

Now changing a table name or adding a new volume is a single variable change, not a hunt through multiple notebooks and jobs.
The variables with no defaults (omitted due to their sensitivity) are instead defined in the override JSON located in the .databricks folder.

The structure of the file is as follows.
{
  "linkedin_profile_name": "placeholder",
  "warehouse_id": "placeholder",
  "pipeline_runner": "placeholder"
}

b. schemas.yml
We define our schemas as well as our volumes here:
resources:
  schemas:
    landing:
      catalog_name: ${var.landing_catalog}
      comment: LinkedIn landing schema created by DAB
      name: ${var.common_schema}
    bronze:
      catalog_name: ${var.bronze_catalog}
      comment: LinkedIn bronze schema created by DAB
      name: ${var.common_schema}
    silver:
      catalog_name: ${var.silver_catalog}
      comment: LinkedIn silver schema created by DAB
      name: ${var.common_schema}
    gold:
      catalog_name: ${var.gold_catalog}
      comment: LinkedIn gold schema created by DAB
      name: ${var.common_schema}
  volumes:
    content_daily:
      catalog_name: ${var.landing_catalog}
      schema_name: ${resources.schemas.landing.name}
      name: ${var.landing_content_daily_volume}
    content_historical:
      catalog_name: ${var.landing_catalog}
      schema_name: ${resources.schemas.landing.name}
      name: ${var.landing_content_historical_volume}
    posts:
      catalog_name: ${var.landing_catalog}
      schema_name: ${resources.schemas.landing.name}
      name: ${var.landing_posts_volume}
    patch:
      catalog_name: ${var.landing_catalog}
      schema_name: ${resources.schemas.landing.name}
      name: ${var.landing_patch_volume}

Here we see our variables in action, deploying consistent namings for schemas and volumes that are also fed into our notebooks.
c. dashboard.yml
We define our dashboard as below:
sync:
  exclude:
    - ../src/*.lvdash.json
    - ../src/*/*.lvdash.json

resources:
  dashboards:
    linkedin_statistics:
      display_name: LinkedIn Statistics
      file_path: >
        ../src/linkedin_analytics_jobs/
        4. data product/LinkedIn Statistics.lvdash.json
      warehouse_id: ${var.warehouse_id}

The exclusions are to avoid duplication of dashboards, per Databricks’ advice.
As of this writing, there is no straightforward, documented way to pull DAB variables directly into the dataset queries inside a Databricks dashboard, as evidenced by an open Databricks CLI GitHub issue. Dashboard parameters are instead designed for end-user interactivity via widgets. A workaround using third-party CI/CD tools such as GitHub Actions and Jenkins to template the dashboard JSON before deployment is possible; we touch on such tools towards the end of this article.
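As a sketch of that workaround (the file layout and the `{{GOLD_CATALOG}}` placeholder are our own invention, not a Databricks convention), a CI/CD step could template the dashboard JSON just before deployment:

```python
import json

def template_dashboard(raw: str, substitutions: dict) -> str:
    """Replace {{PLACEHOLDER}} tokens in a dashboard JSON string."""
    for key, value in substitutions.items():
        raw = raw.replace("{{" + key + "}}", value)
    # Fail fast if the result is no longer valid JSON
    json.loads(raw)
    return raw

# A minimal stand-in for a .lvdash.json dataset query
raw = ('{"datasets": [{"query": '
       '"SELECT * FROM {{GOLD_CATALOG}}.linkedin.fct_daily_post_statistics"}]}')
print(template_dashboard(raw, {"GOLD_CATALOG": "gold"}))
```

A CI/CD job would run this over the checked-out dashboard file, then invoke the bundle deployment with the templated copy.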
d. jobs.yml
Recall the pipeline structure that we finalized in the previous post:

We want to embed this all in code. But how?
Navigate to any of the Databricks Jobs. Click the three dots next to the ‘Run now’ button, and choose ‘View as code’

Now we can see the code version of the Databricks Job, which has defaulted to YAML format. We can copy this and paste it into ‘jobs.yml’, then edit accordingly (e.g. replacing hard-coded values with our constants). We need to repeat this for all of our pipelines.

One of the items that needs changing for this Databricks Job is the job_id that is triggered at the end of the pipeline. We can substitute an inferred variable in place of the hard-coded number, as below:

For more details on the format for the types of tasks that can be added to Databricks Jobs in a DAB, refer to this link.
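For example, where the exported YAML contained a hard-coded job_id for the run_job_task, the DAB version references the job resource by its key instead (a sketch; the numeric ID is a placeholder):

```yaml
# Before (exported from the UI): a hard-coded, environment-specific ID
run_job_task:
  job_id: 123456789

# After (in the DAB): an inferred reference, resolved at deployment time
run_job_task:
  job_id: ${resources.jobs.linkedin_transformation.id}
```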
At the end of this import and refactoring process, we end up with a file that consolidates all our job definitions and uses our defined constants and inferred variables, as per below:
resources:
jobs:
linkedin_transformation:
name: linkedin transformation
tasks:
- task_key: silver_engagements_transformation
notebook_task:
base_parameters:
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_ENGAGEMENTS_TABLE_PREFIX: >
${var.bronze_staging_engagements_prefix}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
SILVER_CATALOG: >
${var.silver_catalog}
SILVER_SCHEMA: >
${resources.schemas.silver.name}
SILVER_ENGAGEMENTS_TABLE: >
${var.silver_post_engagements}
notebook_path: >
../src/linkedin_analytics_jobs/
2. silver transformation/
silver engagements consolidation.ipynb
source: WORKSPACE
- task_key: silver_impressions_transformation
notebook_task:
base_parameters:
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_IMPRESSIONS_TABLE_PREFIX: >
${var.bronze_staging_impressions_prefix}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
SILVER_CATALOG: >
${var.silver_catalog}
SILVER_SCHEMA: >
${resources.schemas.silver.name}
SILVER_IMPRESSIONS_TABLE: >
${var.silver_post_impressions}
notebook_path: >
../src/linkedin_analytics_jobs/
2. silver transformation/
silver impressions consolidation.ipynb
source: WORKSPACE
- task_key: silver_views_update
notebook_task:
base_parameters:
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_FOLLOWERS_TABLE: >
${var.bronze_profile_followers_table}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
SILVER_CATALOG: >
${var.silver_catalog}
SILVER_SCHEMA: >
${resources.schemas.silver.name}
SILVER_FOLLOWERS_TABLE: >
${var.silver_profile_followers}
SILVER_TOTALS_TABLE: >
${var.silver_profile_metrics}
notebook_path: >
../src/linkedin_analytics_jobs/
2. silver transformation/
silver views update.ipynb
source: WORKSPACE
- task_key: silver_posts_enrichment
notebook_task:
base_parameters:
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_POSTS_TABLE: >
${var.bronze_post_metadata_table}
BRONZE_POST_DETAILS_TABLE: >
${var.bronze_post_metrics_table}
BRONZE_POST_PATCH_TABLE: >
${var.bronze_post_patch_table}
SILVER_CATALOG: >
${var.silver_catalog}
SILVER_SCHEMA: >
${resources.schemas.silver.name}
SILVER_POSTS_TABLE: >
${var.silver_post_metadata}
notebook_path: >
../src/linkedin_analytics_jobs/
2. silver transformation/
silver post enrichment.ipynb
source: WORKSPACE
- task_key: gold_create_or_refresh
depends_on:
- task_key: silver_engagements_transformation
- task_key: silver_impressions_transformation
- task_key: silver_posts_enrichment
- task_key: silver_views_update
pipeline_task:
pipeline_id: >
${resources.pipelines.gold_create.id}
max_retries: 3
min_retry_interval_millis: 60000
retry_on_timeout: true
- task_key: dashboard_refresh
depends_on:
- task_key: gold_create_or_refresh
dashboard_task:
subscription:
subscribers:
- user_name: >
${var.dashboard_subscriber}
warehouse_id: >
${var.warehouse_id}
dashboard_id: >
${resources.dashboards.linkedin_statistics.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED
linkedin_daily_ingest:
name: linkedin daily ingest
trigger:
pause_status: UNPAUSED
file_arrival:
url: >
/Volumes/${var.landing_catalog}/
${resources.schemas.landing.name}/
${resources.volumes.content_daily.name}/
${var.ingestion_folder}/
tasks:
- task_key: bronze_daily_ingest
notebook_task:
base_parameters:
LINKEDIN_PROFILE_NAME: >
${var.linkedin_profile_name}
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_DAILY_VOLUME: >
${resources.volumes.content_daily.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_DISCOVERY_TABLE: >
${var.bronze_profile_discovery_metrics_table}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
BRONZE_FOLLOWERS_TABLE: >
${var.bronze_profile_followers_table}
BRONZE_IMPRESSIONS_TABLE: >
${var.bronze_staging_impressions_prefix}
BRONZE_ENGAGEMENTS_TABLE: >
${var.bronze_staging_engagements_prefix}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze daily ingest.ipynb
source: WORKSPACE
- task_key: bronze_post_ingest
depends_on:
- task_key: bronze_daily_ingest
notebook_task:
base_parameters:
LINKEDIN_PROFILE_NAME: >
${var.linkedin_profile_name}
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_POSTS_VOLUME: >
${resources.volumes.posts.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_IMPRESSIONS_TABLE: >
${var.bronze_staging_impressions_prefix}
BRONZE_POSTS_TABLE: >
${var.bronze_post_metadata_table}
BRONZE_POST_DETAILS_TABLE: >
${var.bronze_post_metrics_table}
BRONZE_POST_PATCH_TABLE: >
${var.bronze_post_patch_table}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze post ingest.ipynb
source: WORKSPACE
- task_key: trigger_transformation
depends_on:
- task_key: bronze_post_ingest
run_job_task:
job_id: >
${resources.jobs.linkedin_transformation.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED
linkedin_historical_ingestion:
name: linkedin historical ingestion
trigger:
pause_status: UNPAUSED
file_arrival:
url: >
/Volumes/${var.landing_catalog}/
${resources.schemas.landing.name}/
${resources.volumes.content_historical.name}/
${var.ingestion_folder}/
tasks:
- task_key: bronze_historical_ingest
notebook_task:
base_parameters:
LINKEDIN_PROFILE_NAME: >
${var.linkedin_profile_name}
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_DAILY_VOLUME: >
${resources.volumes.content_historical.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_TOTALS_TABLE: >
${var.bronze_profile_metrics_table}
BRONZE_FOLLOWERS_TABLE: >
${var.bronze_profile_followers_table}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze historical ingest.ipynb
source: WORKSPACE
- task_key: trigger_transformation
depends_on:
- task_key: bronze_historical_ingest
run_job_task:
job_id: >
${resources.jobs.linkedin_transformation.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED
linkedin_post_ingest:
name: linkedin post ingest
trigger:
pause_status: UNPAUSED
file_arrival:
url: >
/Volumes/${var.landing_catalog}/
${resources.schemas.landing.name}/
${resources.volumes.posts.name}/
${var.ingestion_folder}/
tasks:
- task_key: bronze_post_ingest
notebook_task:
base_parameters:
LINKEDIN_PROFILE_NAME: >
${var.linkedin_profile_name}
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_POSTS_VOLUME: >
${resources.volumes.posts.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_IMPRESSIONS_TABLE: >
${var.bronze_staging_impressions_prefix}
BRONZE_POSTS_TABLE: >
${var.bronze_post_metadata_table}
BRONZE_POST_DETAILS_TABLE: >
${var.bronze_post_metrics_table}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze post ingest.ipynb
source: WORKSPACE
- task_key: trigger_transformation
depends_on:
- task_key: bronze_post_ingest
run_job_task:
job_id: >
${resources.jobs.linkedin_transformation.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED
linkedin_post_patch_ingest:
name: linkedin post patch ingest
trigger:
pause_status: UNPAUSED
file_arrival:
url: >
/Volumes/${var.landing_catalog}/
${resources.schemas.landing.name}/
${resources.volumes.patch.name}/
${var.ingestion_folder}/
${var.patch_post_subfolder}/
tasks:
- task_key: bronze_post_patch_ingest
notebook_task:
base_parameters:
LANDING_CATALOG: >
${var.landing_catalog}
LANDING_SCHEMA: >
${resources.schemas.landing.name}
LANDING_PATCH_VOLUME: >
${resources.volumes.patch.name}
PENDING_FOLDER: >
${var.ingestion_folder}
PROCESSED_FOLDER: >
${var.processed_folder}
ERRORS_FOLDER: >
${var.errors_folder}
PATCH_POST_SUBFOLDER: >
${var.patch_post_subfolder}
BRONZE_CATALOG: >
${var.bronze_catalog}
BRONZE_SCHEMA: >
${resources.schemas.bronze.name}
BRONZE_POST_PATCH_TABLE: >
${var.bronze_post_patch_table}
notebook_path: >
../src/linkedin_analytics_jobs/
1. bronze ingestion/
bronze post patch ingest.ipynb
source: WORKSPACE
- task_key: trigger_transformation
depends_on:
- task_key: bronze_post_patch_ingest
run_job_task:
job_id: >
${resources.jobs.linkedin_transformation.id}
queue:
enabled: true
performance_target: PERFORMANCE_OPTIMIZED

There are a few changes to this set of Databricks Jobs that weren’t present in the POC pipeline:
- The post patch ingestion job now triggers on a file arrival event to the newly created patch landing volume, instead of on a bronze patch table update. This is because the bronze schema is being created by the DAB, and thus the bronze patch table can’t exist before that point, which makes the table update trigger invalid at the point of initial job creation. The post patch ingestion task has also been modified accordingly.
- The gold materialized views have been moved into a Spark Declarative Pipeline; more on that later.
- Task parameters have been defined for each notebook, bringing our defined constants into the notebooks.
We also need to edit our notebooks to take in the task parameters. Here’s an example of what needs to be done for Python substitutions:
import re

def get_valid_parameter(parameter_key: str):
    """
    Get a valid parameter value from the Databricks widgets.
    Hardened to thwart SQL injection attacks.
    """
    parameter_value = dbutils.widgets.get(parameter_key)
    # parameter_value must be a string with only
    # alphanumeric characters and underscores
    if not re.fullmatch(r'[a-zA-Z0-9_]+', parameter_value):
        raise ValueError(
            f"Invalid parameter value for {parameter_key}: "
            f"{parameter_value}"
        )
    # Disallow dangerous SQL keywords and patterns
    forbidden_patterns = [
        r'--', r';', r"'", r'"', r'/\*', r'\*/', r'xp_', r'char\(',
        r'nchar\(', r'varchar\(', r'\balter\b', r'\bdrop\b',
        r'\binsert\b', r'\bdelete\b', r'\bupdate\b',
        r'\bselect\b', r'\bcreate\b', r'\bexec\b', r'\bunion\b',
        r'\bor\b', r'\band\b'
    ]
    for pattern in forbidden_patterns:
        if re.search(pattern, parameter_value, re.IGNORECASE):
            raise ValueError(
                f"Potentially dangerous value for {parameter_key}: "
                f"{parameter_value} (pattern matched: {pattern})"
            )
    return parameter_value

# 1. Define our constants
try:
    BRONZE_CATALOG = get_valid_parameter("BRONZE_CATALOG")
    BRONZE_SCHEMA = get_valid_parameter("BRONZE_SCHEMA")
    BRONZE_IMPRESSIONS_TABLE_PREFIX = \
        get_valid_parameter("BRONZE_IMPRESSIONS_TABLE_PREFIX")
    BRONZE_TOTALS_TABLE = get_valid_parameter("BRONZE_TOTALS_TABLE")
    SILVER_CATALOG = get_valid_parameter("SILVER_CATALOG")
    SILVER_SCHEMA = get_valid_parameter("SILVER_SCHEMA")
    SILVER_IMPRESSIONS_TABLE = \
        get_valid_parameter("SILVER_IMPRESSIONS_TABLE")
    print("Loaded widget values")
except Exception:
    BRONZE_CATALOG = "bronze"
    BRONZE_SCHEMA = "linkedin"
    BRONZE_IMPRESSIONS_TABLE_PREFIX = "impressions"
    BRONZE_TOTALS_TABLE = "totals"
    SILVER_CATALOG = "silver"
    SILVER_SCHEMA = "linkedin"
    SILVER_IMPRESSIONS_TABLE = "impressions"
    print("Failed to load widget values, using default values")

We could have retrieved the parameters directly using Databricks’ associated utility function, but instead we’re wrapping it in additional Python code to prevent SQL injection attacks (at least on a basic level).
SQL code is a tougher ask. In theory, Databricks recommends the use of the IDENTIFIER clause for bringing parameters into SQL code, as per below:
CREATE VIEW IF NOT EXISTS IDENTIFIER
  (:SILVER_CATALOG || "." || :SILVER_SCHEMA || "." ||
   :SILVER_TOTALS_TABLE)
AS
SELECT date, impressions, engagements
FROM IDENTIFIER
  (:BRONZE_CATALOG || "." || :BRONZE_SCHEMA || "." ||
   :BRONZE_TOTALS_TABLE);

CREATE VIEW IF NOT EXISTS IDENTIFIER
  (:SILVER_CATALOG || "." || :SILVER_SCHEMA || "." ||
   :SILVER_FOLLOWERS_TABLE)
AS
SELECT date, new_followers
FROM IDENTIFIER
  (:BRONZE_CATALOG || "." || :BRONZE_SCHEMA || "." ||
   :BRONZE_FOLLOWERS_TABLE);

But the IDENTIFIER clause, as of this writing, doesn’t support DDL statements like the above, which unfortunately covers all instances of our parameter usage. According to a Databricks forum thread, this issue should go away once the Databricks runtime on serverless compute is upgraded to the latest version, so this is hopefully a temporary state of affairs.
We can use the Python approach to validate our parameters before executing the SQL code using the spark.sql function. This is what we end up doing for our logical view creation in the silver layer.
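Concretely, once every identifier has passed validation, the statement can be assembled and handed to spark.sql. A sketch (the helper name build_create_view_sql is ours, and the spark.sql call is commented out so the snippet stands alone outside Databricks):

```python
def build_create_view_sql(src_catalog: str, src_schema: str, src_table: str,
                          dst_catalog: str, dst_schema: str,
                          dst_table: str) -> str:
    """Assemble a CREATE VIEW statement from pre-validated identifiers."""
    return (
        f"CREATE VIEW IF NOT EXISTS "
        f"{dst_catalog}.{dst_schema}.{dst_table} AS "
        f"SELECT * FROM {src_catalog}.{src_schema}.{src_table}"
    )

sql = build_create_view_sql("bronze", "linkedin", "totals",
                            "silver", "linkedin", "totals")
# In the notebook, after get_valid_parameter has vetted every identifier:
# spark.sql(sql)
print(sql)
```

Because the identifiers were validated before reaching the f-string, the interpolation here cannot introduce injected SQL.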
But there is an additional wrinkle at the gold layer: SQL statements for materialized views can only run on an SQL warehouse cluster. This precludes using Python to retrieve the parameters properly.
This is why the code for the gold materialized views has been moved into a pipeline that replaces the old notebook: that is what we’ll look at next.
e. pipelines.yml
First, here’s the pipeline definition for reference:
resources:
  pipelines:
    gold_create:
      name: gold_create
      configuration:
        SILVER_CATALOG: >
          ${var.silver_catalog}
        SILVER_SCHEMA: >
          ${resources.schemas.silver.name}
        SILVER_IMPRESSIONS_TABLE: >
          ${var.silver_post_impressions}
        SILVER_ENGAGEMENTS_TABLE: >
          ${var.silver_post_engagements}
        SILVER_POSTS_TABLE: >
          ${var.silver_post_metadata}
        SILVER_TOTALS_TABLE: >
          ${var.silver_profile_metrics}
        SILVER_FOLLOWERS_TABLE: >
          ${var.silver_profile_followers}
        GOLD_CATALOG: >
          ${var.gold_catalog}
        GOLD_SCHEMA: >
          ${resources.schemas.gold.name}
        FCT_DAILY_PROFILE_STATS_TABLE: >
          ${var.gold_fact_daily_profile_statistics_table}
        FCT_DAILY_POST_STATS_TABLE: >
          ${var.gold_fact_daily_post_statistics_table}
        DIM_DATE_TABLE: >
          ${var.gold_dimension_date_table}
        DIM_TIMESTAMP_TABLE: >
          ${var.gold_dimension_timestamp_table}
      libraries:
        - glob:
            include: >
              ../src/linkedin_analytics_jobs/
              3. gold modelling/pipeline_gold_create/**
      schema: ${resources.schemas.gold.name}
      continuous: false
      development: false
      photon: true
      channel: CURRENT
      catalog: ${var.gold_catalog}
      serverless: true
      root_path: >
        ../src/linkedin_analytics_jobs/
        3. gold modelling/

Just like with the Databricks Jobs, I created a POC manually and configured it to work, then copied the YAML over to the DAB and modified it accordingly. I’ve also copied the associated pipeline files into the DAB, resulting in the following file structure:

Here are the highlights:
- Each gold materialized view definition has been separated into its own SQL file in the same folder (3. gold modelling/pipeline_gold_create).
- For now, each gold materialized view uses the older ${PARAM} parameter syntax because IDENTIFIER does not yet support our DDL materialized view statements on serverless SQL warehouses. Once the runtime is upgraded, these scripts should be migrated to parameter markers plus IDENTIFIER.
See the documentation for IDENTIFIER for more details.
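As an illustration of that older syntax, one of the gold materialized-view scripts in the pipeline might look like the following sketch (the parameters come from the pipeline configuration above; the column list is illustrative):

```sql
CREATE OR REFRESH MATERIALIZED VIEW ${FCT_DAILY_PROFILE_STATS_TABLE}
AS
SELECT date, impressions, engagements
FROM ${SILVER_CATALOG}.${SILVER_SCHEMA}.${SILVER_TOTALS_TABLE};
```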
Deprecation of language features, syntax and other dependencies can be a real concern in a production environment. The general guideline is to avoid using deprecated features and to migrate pipelines that depend on such features in a timely fashion and wherever feasible.
One last note before we move on: the risk arising from an SQL injection attack in this instance (a personal LinkedIn project) is relatively low. In addition, our scripts are internal and parameters are controlled values coming from the DAB, which greatly reduces the likelihood of an SQL injection attack.
f. databricks.yml
Now for the main YAML file.
# This is a Databricks asset bundle definition for <DAB name>.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: <DAB name>
  uuid: <DAB UUID>

include:
  - resources/*.yml
  - resources/*/*.yml

targets:
  dev:
    # The default target uses 'mode: development' to create a development copy.
    # - Deployed resources get prefixed with '[dev my_user_name]'
    # - Any job schedules and triggers are paused by default.
    # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html.
    mode: development
    default: true
    workspace:
      host: https://<redacted>.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://<redacted>.cloud.databricks.com
      # We explicitly deploy to /Workspace/Users/<current user>
      # to make sure we only have a single copy in the workspace.
      root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
    run_as:
      service_principal_name: ${var.pipeline_runner}
    variables:
      dashboard_subscriber: ${workspace.current_user.userName}
    # permissions:
    #   - user_name: <replace with your own user name>
    #     level: CAN_MANAGE

Here we see the service principal being used to run the production pipelines, and the other YAML files being included via the defined paths.
We are working on the assumption that we have only one workspace in one Databricks instance to work with, and indeed Databricks Free Edition only allows a single workspace per instance.
In an enterprise deployment, development and production environments would typically be deployed in different workspaces or different instances; the latter applies where the Databricks instances are associated with a development and production cloud project (e.g. a Google Cloud project or an Azure resource group).
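To make that concrete, a multi-workspace setup would mainly require the targets to point at different hosts; a minimal sketch, where the host names are placeholders rather than real instances:

```yaml
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-instance.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://prod-instance.cloud.databricks.com
```

The rest of the bundle stays identical, which is precisely what makes the dev/prod promotion reproducible.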
V. Deploying the DAB
Clicking the rocket icon below the folders icon in the DAB interface brings up the Deployments navigation panel. Click ‘Deploy’.

You will be presented with a deployment screen with a progress wheel while Databricks checks in the background what can be deployed (the equivalent of the ‘terraform plan’ step). If this step fails, there is most likely something wrong with your DAB configuration, and you will need to troubleshoot accordingly.

If the check is successful, you will be shown a list of all resources that are to be created, deleted or updated.

Note the dev prefixes for the dashboards, jobs, pipelines and schemas; this is due to the ‘mode: development’ setting on the dev target in databricks.yml, and is a way to keep the development deployments created by multiple contributors in the same workspace separate.
Clicking the ‘Deploy’ button will bring up the deployment output, where you can track the progress of the deployment (this is the equivalent of ‘terraform apply’).

If the deployment is successful, you will see the list of resources created as below. Now you can go to the respective pipelines and test them accordingly.



Once we’re happy with the results, we can deploy the DAB to production, and test there again.
Below is a list of additional fixes I did to the pipeline that were not immediately apparent in the POC:
- I added statements where relevant to create empty source tables, particularly for bronze.linkedin.post_details (which is only populated when post analytics Excel files are ingested).
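The guard in question is a conditional CREATE so that downstream silver and gold queries never fail on a missing source table; a sketch, where the column list is illustrative rather than the actual schema:

```sql
-- Create the table only if ingestion has not created it yet;
-- a no-op on every subsequent run, so the statement stays idempotent.
CREATE TABLE IF NOT EXISTS bronze.linkedin.post_details (
  post_url STRING,
  created_date DATE
  -- remaining columns omitted
);
```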
VI. Areas for Improvement
We now have a maintainable pipeline where we can experiment in a development environment, with the confidence that we can implement or roll back any changes rapidly and accurately.
There are a few areas that we have not covered that are beyond the scope of the above deployment.
- The ‘pending’ folders in each volume still need to be created manually; as of this writing, folders inside a volume cannot be defined directly in asset bundles. This can be mitigated by including a small Databricks Job that runs a script or notebook to create said folders.
- We have been working exclusively in the Databricks UI, which is fine for our demonstration purposes. In an enterprise deployment we should be working from the associated Git repository and following the relevant best practices, including raising pull requests for feature changes and merging them into separate development and production (main) branches. For this to work, we will need to configure the relevant CI/CD actions (e.g. GitHub Actions on GitHub, or Jenkins for an orchestrator-agnostic setup) to manage the DAB via the Databricks CLI. This is also where the ‘pending’ folders can be created, and the queries in our Databricks dashboard JSON updated, as part of the follow-up CI/CD actions. The end result is to make deployments reproducible and reviewable, rather than dependent on manual clicks in the UI.
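As a sketch of what such a CI/CD action could look like with GitHub Actions — the workflow file name, secret names and trigger branch are assumptions for illustration, not part of this project:

```yaml
# .github/workflows/deploy-dab.yml
name: Deploy DAB
on:
  push:
    branches: [main]  # deploy to prod on merges into main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Installs the Databricks CLI, which includes the bundle commands.
      - uses: databricks/setup-cli@main
      # Equivalent of 'terraform plan': catch configuration errors early.
      - name: Validate bundle
        run: databricks bundle validate -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      # Equivalent of 'terraform apply': deploy the prod target.
      - name: Deploy bundle
        run: databricks bundle deploy -t prod
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```

Follow-up steps (creating the ‘pending’ folders, patching the dashboard JSON) would slot in as additional steps after the deploy.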
VII. What’s Next?
Once the above deployment gaps are closed, the final step is to know when the pipeline breaks, and to see at a glance where the pipeline may be breaking. That’s the subject of our next post. See you there!