Build Your Own LinkedIn Analytics Part 8: Orchestrating and Automating the Pipeline
Published on Medium on 1 December 2025
By the end of the previous post, we had built our data pipeline for LinkedIn analytics, from ingestion all the way to the final dashboard product. However, every step needed to be executed manually, which is both tedious and error-prone. If you’re manually re-running notebooks, you’re the scheduler, the observability stack and the incident responder rolled into one. That is not sustainable.
That’s why it’s time to properly orchestrate and automate our data product.
TL;DR
- Move from manual notebook runs to a trigger-driven, monitored Databricks pipeline for LinkedIn analytics.
- Use file-arrival and table-update triggers to feed a consolidated transformation job that updates silver layer tables as well as refreshes gold layer materialized views and the dashboard only when needed.
- Design the pipeline with small, idempotent and composable tasks, then observe and harden it with retries and run diagnostics in Databricks Jobs.
I. Planning the Architecture
Recall that we were building our layers and ingestion piece by piece. Combining all the data flows that we planned and implemented for the bronze ingestion, silver transformation, gold modelling and dashboard update results in the following overall data flow:

I’ve made a slight adjustment by removing the ‘misc’ landing zone for the bronze.linkedin.post_patch table, in favor of updating the table directly using a CSV file. I’ve also consolidated the post scraping step into the overall post ingestion task.
Keep in mind: tasks should be idempotent and small enough to retry independently. Idempotency is something that we have been designing for throughout this blog series, but keeping tasks composable is also important to keep our pipeline reliable, debuggable and extendable.
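To make idempotency concrete, here is a minimal, self-contained sketch in plain Python, with a dict standing in for a Delta table keyed on a natural key. The key and column names are hypothetical; the point is that re-running the task with the same input leaves the table unchanged.

```python
# Idempotent ingestion sketch: upserting by key means a retry or re-run
# with the same batch produces exactly the same table state.
# The 'table' dict stands in for a Delta table keyed on (post_id, date).

def upsert_metrics(table: dict, rows: list) -> dict:
    """Upsert rows by (post_id, date); re-running with the same rows is a no-op."""
    for row in rows:
        key = (row["post_id"], row["date"])
        table[key] = row  # insert or overwrite: same result every run
    return table

batch = [
    {"post_id": "p1", "date": "2025-11-30", "impressions": 120},
    {"post_id": "p2", "date": "2025-11-30", "impressions": 45},
]

table = {}
upsert_metrics(table, batch)
first_run = dict(table)
upsert_metrics(table, batch)  # retrying the task changes nothing
assert table == first_run
```

On Databricks, the same property is typically achieved with a `MERGE INTO` on the natural key instead of a blind `INSERT`.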
For this flow to be automated as a batch pipeline, we need to implement triggers. A trigger is, broadly speaking, something that starts a pipeline when a particular event occurs. Three types of event stand out as the most common:
- Scheduled trigger: This is when pipelines run on a regular basis, for example hourly, daily or weekly. If we were able to pull LinkedIn data from an API, this would have been the ideal trigger because there would be no human intervention needed.
- Upload-based trigger: This triggers when there is a file creation or update event in the relevant storage system, usually blob storage. Pipelines that depend on batch or individual file uploads use this type of trigger, which makes this ideal for our Excel-file-based ingestion.
- Task-based trigger: This type of trigger is part and parcel of a data pipeline, where the execution of one or more tasks depends on the completion and results of one or more predecessor tasks. For instance, in the diagram above, the metrics consolidate task is triggered once the daily ingest task completes successfully. Within a data pipeline, this type of trigger is treated as a task dependency; however, external tasks can also act as triggers (typically as a way to break a large pipeline into smaller pipelines, then having the predecessor pipeline(s) trigger the successor pipeline(s)).
Alternatively, we could implement this flow as a streaming pipeline, which in this case means that the landing zone would be constantly monitored for new files. This is overkill for our purposes; why do we need to have resources spun up to do this constant monitoring when we are only uploading our files daily and on an ad-hoc basis?
Looking at our data flow, there are three types of triggers that we need to implement:
- Starting triggers for file upload: Databricks only allows for this type of trigger on a per-volume basis. This is why we have created one volume for each type of file (see the high-level data architecture for the definitions of the landing volumes).
- Starting trigger for table update: We are implementing this for our bronze.linkedin.post_patch table.
- Ending trigger for dashboard refresh: Datasets on Databricks dashboards only update on refresh, so it makes sense to trigger the refresh for our dashboard once all the data has been updated in the gold layer.
Having different types of triggers helps ensure that we don’t re-run irrelevant parts of the pipeline for different events. For example, updating posts metadata should not trigger any ingestion of metrics.
This is what our data flow looks like with the triggers included:

I alluded to this when discussing task-based triggers: there are two main approaches to building an end-to-end pipeline, the monolithic pipeline and asset-oriented pipelines.
- The monolithic pipeline: In this approach, we split our tasks into logical parts (either notebooks or script files) and orchestrate them in a single pipeline. This is a common approach if we are building a pipeline for an independent data product, i.e. one without data dependencies in any of the intermediate layers (particularly the silver layer). Having all tasks in a single pipeline makes it easy to monitor and debug the pipeline in a single place, though this becomes increasingly unwieldy if there are a multitude of tasks.
- Asset-oriented pipelines: For this approach, instead of orchestrating our tasks in a single pipeline, we split them into multiple pipelines, each dedicated to the respective data assets (e.g. a silver layer pipeline for all our silver layer tables and a gold layer pipeline for all our gold layer tables). This approach is used when other pipelines have data dependencies on intermediate layers, particularly the silver layer. An example would be the ingestion of sales data into the silver layer, which can subsequently be analysed and used in a variety of different ways, either by itself or combined with other data such as customer data.
There is also the single notebook approach, which in many cases means using the same notebook that was used to build the proof-of-concept. Except for the simplest of pipelines, I do not recommend this approach for production. In effect you have a single task responsible for all the ingestion, which means if anything fails you have to run everything all over again, or dive into the notebook to figure out where to resume from. When you are ingesting into multiple tables in multiple layers, this is a recipe for a bad day.
In short, the single notebook design is the anti-pattern for idempotent and composable tasks.
In our case, the monolithic pipeline approach works just fine. What then remains is to actually break down the tasks. Using the data flow diagram as a guide, we end up with the following:

Note the materialize section of the pipeline; we will elaborate on that in the upcoming section.
II. Building the Pipeline
a. Initial Setup
In the main menu of the Databricks UI, navigate to the ‘Jobs & Pipelines’ section as illustrated below.

Remember the materialized views we built when modelling the data? In the Pipelines section, we can see that Databricks built a pipeline for each of the materialized views.

If we check the properties of one of the pipelines (by clicking on it and navigating to ‘Pipeline details’), we can see that it is of type ‘DBSQL MV/ST’. These are pipelines that Databricks builds automatically whenever you create either a materialized view or a streaming table.

However, we did not configure any schedule for refreshes when we created these materialized views. This was deliberate; we only want to refresh the views (using SQL statements) when we have ingested new data.
With that in mind, here is our updated pipeline, with the SQL refresh statements consolidated into a single task:

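As a sketch of what that consolidated task boils down to: loop over the gold materialized views and issue one refresh statement each. The second view name below is a placeholder of my own; on Databricks, each statement would be executed with spark.sql rather than printed.

```python
# Consolidated gold refresh sketch: one REFRESH statement per gold
# materialized view. View names beyond fct_daily_post_statistics are
# illustrative placeholders.

GOLD_VIEWS = [
    "gold.linkedin.fct_daily_post_statistics",
    "gold.linkedin.dim_post",  # placeholder name
]

def refresh_statements(views):
    """Build one REFRESH MATERIALIZED VIEW statement per view."""
    return [f"REFRESH MATERIALIZED VIEW {view}" for view in views]

for stmt in refresh_statements(GOLD_VIEWS):
    print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```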
Let’s get to building. At the top of the screen, you can see three options:

- Ingestion pipeline: When you click on this, you are brought to the ‘Add data’ page that we briefly covered when implementing our ingestion. There is a list of connectors here that allow for scheduled or triggered ingestion from third-party sources such as Salesforce and GitHub, or for incorporating tabular data that already exists on AWS S3 buckets. These options are beyond the scope of this project.
- ETL pipeline: This is Databricks’ guided way of creating a declarative pipeline via streaming tables and/or materialized views.
- Job: If you want a quick and dirty way to create a pipeline with pre-existing scripts and GUI-assisted configuration, this is the way to go.
Let’s go with the ‘Job’ option for our proof-of-concept pipeline. You will be presented with a screen like the following:

At this point, it’s time to break up our tasks into individual notebooks or scripts. If you’ve been following along, you may have put all the code into a single ‘linkedin_pipeline_poc’ Databricks notebook, or you may have already broken the code down according to the respective layers or mirroring the specific blog posts. Either way, we will need to create notebooks or scripts corresponding to the tasks that we have defined in our task flow above. Let’s recap:
- bronze historical ingest: This ingests the historical Excel file into the relevant bronze tables. Even if you don’t expect to ingest this more than once, it is still best practice to implement this task just in case.
- bronze daily ingest: This ingests the daily Excel file into the relevant bronze metrics tables, and is the heart of our bronze ingestion.
- bronze post ingest: This triggers when either a daily Excel file or a post Excel file is uploaded, or when the bronze patch table is updated; it updates the relevant bronze post tables accordingly.
- silver metrics consolidation: This is triggered once either the bronze historical ingest or the bronze daily ingest is complete, and updates the relevant silver metrics tables.
- silver post enrichment: This is triggered by the bronze post ingest, and updates the silver post table.
- gold refresh: As alluded to above, this is an SQL script that, when triggered by the silver tasks, runs a refresh of all the gold materialized views.
- dashboard refresh: This is triggered by the gold materialized views refresh, and refreshes the dashboard accordingly (with the ability to also send updates to subscribers of the dashboard, which we’ll touch on in a subsequent post).
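To make the task breakdown concrete, here is a hedged sketch of how part of this recap could map onto a Databricks Jobs payload, with each task referencing its notebook or SQL file and declaring its upstream dependencies via depends_on. The field names follow the Jobs API shape; the notebook paths, SQL file path and warehouse ID are placeholders of my own.

```python
# Illustrative Jobs payload fragment for the transformation stage.
# Paths and IDs are placeholders, not the actual project layout.
transform_job = {
    "name": "linkedin_transformation",
    "tasks": [
        {
            "task_key": "silver_metrics_consolidation",
            "notebook_task": {"notebook_path": "/pipeline/silver_metrics"},
        },
        {
            "task_key": "silver_post_enrichment",
            "notebook_task": {"notebook_path": "/pipeline/silver_posts"},
        },
        {
            # gold refresh only runs once both silver tasks succeed
            "task_key": "gold_refresh",
            "sql_task": {
                "file": {"path": "/pipeline/gold_refresh.sql"},
                "warehouse_id": "<warehouse-id>",
            },
            "depends_on": [
                {"task_key": "silver_metrics_consolidation"},
                {"task_key": "silver_post_enrichment"},
            ],
        },
    ],
}
```

The same structure is what the GUI builds for you behind the scenes; defining it as code becomes useful once you want to version-control the pipeline.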
Now let’s set up the triggers. There are four types of triggers on Databricks as illustrated below:

We have covered the first three trigger types earlier in this article. ‘Continuous’ is not really a trigger; rather, it is Databricks’ way of implementing a streaming pipeline, where the pipeline is rerun as soon as it is complete.
You might have noticed something by now: Databricks only allows one trigger per job. This is in contrast to other orchestrators (notably Airflow), where you have more flexibility to combine multiple triggers in a single pipeline. Having one trigger per job makes tracking and debugging job runs substantially more straightforward, since one doesn’t have to worry about multiple trigger points and conditions. However, the single-trigger approach does complicate building pipelines that need multiple triggers.
We can overcome this limitation by creating one ingestion pipeline per trigger, and then have those pipelines trigger a consolidated transformation pipeline, as below:

We have just designed a fan-in orchestration pattern, where multiple workflows are executed in parallel before their results are consolidated in a single downstream workflow. Knowing about such design patterns can be very useful for thinking about task flows; refer to this link for an overview and some other examples.
There are duplicate tasks in each ingestion pipeline, but that doesn’t actually result in duplication of code; each duplicate task can reference the same script or notebook, which means that updating the script or notebook will also update all the related tasks.
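A sketch of that fan-in wiring, assuming a hypothetical job ID for the consolidated transformation job: each ingestion job ends with a Run Job task (run_job_task in the Jobs API payload) that kicks off the shared downstream job. Task keys and paths are illustrative.

```python
TRANSFORM_JOB_ID = 123  # placeholder for the consolidated job's ID

def ingestion_job(name, ingest_task_key):
    """Build an ingestion job that fans in to the shared transformation job."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": ingest_task_key,
                "notebook_task": {"notebook_path": f"/pipeline/{ingest_task_key}"},
            },
            {
                # last task of every ingestion job: trigger the shared job
                "task_key": "trigger_transformation",
                "depends_on": [{"task_key": ingest_task_key}],
                "run_job_task": {"job_id": TRANSFORM_JOB_ID},
            },
        ],
    }

daily = ingestion_job("linkedin_daily_ingest", "bronze_daily_ingest")
posts = ingestion_job("linkedin_post_ingest", "bronze_post_ingest")
# Both ingestion jobs point at the same downstream transformation job.
```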
Here we see the power of breaking down our pipeline into logical idempotent steps. This process of task decomposition is a fundamental aspect of software engineering, not just data engineering.
Let’s go ahead and configure our ingestion pipelines.
b. Configuring Pipelines
For most of the ingestion pipelines, we will be configuring file arrival triggers associated with their respective storage locations. Below is an example:

You can test the trigger to make sure that the storage location exists and is accessible by Databricks.

Table update triggers are configured similarly, as below.

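For reference, both trigger types can also be expressed in a job’s API payload. The shapes below follow the Jobs API; the volume path is a placeholder, while the table name is the patch table from our data flow.

```python
# File arrival trigger: fires when a file lands in the volume.
# The volume path is an illustrative placeholder.
file_arrival_trigger = {
    "trigger": {
        "file_arrival": {"url": "/Volumes/bronze/linkedin/daily_landing/"},
    }
}

# Table update trigger: fires when the listed table is updated.
table_update_trigger = {
    "trigger": {
        "table_update": {"table_names": ["bronze.linkedin.post_patch"]},
    }
}
```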
Then for each task you create, there are the following common configurations to consider:
- Task name: A descriptive name for the task; often this is just the name of the Python file or notebook.
- Type: There are a multitude of different task types available. For our purposes, the relevant types are Notebook, Python script, SQL, Dashboard and Run Job. Check out Databricks’ documentation for more details.
- Depends on: This is where we define which prior tasks the current task depends on. The nature of this dependency is defined by the next option.
- Run if dependencies: The default for this is ‘All succeeded’, and for our current purposes we don’t need any other type of dependency. A full list of the types of dependencies is below.

Below is an example of the configuration for a Notebook task.

After all that, here is the high-level task flow of the pipeline as implemented in Databricks:

The details for each trigger are as follows:

And the details for each task are as follows:

When you’re done, you should see a list of jobs/pipelines similar to the below:

We can now test the pipeline by uploading files to our configured landing folders.
III. Testing the Pipeline
Once the files have been uploaded, the associated ingestion pipeline will trigger. This is the time to monitor the pipeline as a whole to ensure that everything is executed as expected. For that, we return to the main ‘Jobs & Pipelines’ page to see what is being executed, paying particular attention to the ‘Recent runs’ field.

Above is an example from my testing. The icons have the following meanings:
- A green tick means the run executed successfully.
- Green circular arrows mean the run is in progress.
- A pale green arc means the run is queued for execution.
- A red cross means the run failed.
The row with alternating successes and failures along with a queued run belongs to one of the materialized view pipelines, specifically the one for gold.linkedin.fct_daily_post_statistics. If we dive into the error logs, we see the following:

Here is another limitation of Databricks Free Edition that we’ve just run into: we can only ever have one small serverless general-purpose cluster and one small serverless SQL warehouse running at any one time, and our tasks cannot scale beyond that. Since we’re not able to increase our resources, the alternative is to implement a retry mechanism on the affected task, i.e. the gold_refresh task.
Navigate to the task within the linkedin transformation job (staying in the ‘Tasks’ tab), and edit the ‘Retries’ option to configure the retry policy; the configuration chosen is per the screenshot below.

In the real world, you will also run into resource-constrained situations, and you have to evaluate the best approach to dealing with them. Retry mechanisms paired with exponential backoff (i.e. increasing the wait time between retries exponentially) are a common approach to managing limited API call rates, for instance; this approach is vital for any form of API-based ingestion. At the same time, you need to manage the additional time required to run the pipeline on fewer resources; if latency is a priority or if the pipeline is running for too long, it is time to look into optimizations and/or increasing resources (if possible and economical).
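A minimal sketch of that backoff pattern, with a hypothetical flaky_fetch call standing in for a rate-limited API. Databricks Jobs retries whole tasks for you; a helper like this is for retrying individual calls inside a task.

```python
import random
import time

def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call fn, retrying with exponentially growing waits plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the orchestrator
            # Wait base_delay, 2x, 4x, ... capped at max_delay, plus jitter
            # so many callers don't retry in lockstep.
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 10))

# Usage sketch: a call that fails twice before succeeding.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "metrics payload"

assert retry_with_backoff(flaky_fetch, base_delay=0.01) == "metrics payload"
assert calls["n"] == 3
```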
Switching to the ‘Jobs’ tab in the selected job gives us an overview of all runs, historical and active:

Clicking into one of the runs (usually the most recent run) presents us with the following options:

The Graph view shows the DAG (directed acyclic graph) of the run, which is good for a quick overview of how the run is doing.


The Timeline view gives us an idea of how long each job and task takes to run, with views of dependencies as well.

Finally, the List view comes in handy if the job has many tasks to look through; a monolithic pipeline could have tens or hundreds of tasks configured, for instance.

It’s worth noting that in this testing phase, I found a bug in one of the SQL scripts in the silver transformation step; I have subsequently fixed it and updated the blog post accordingly. This is why testing is such an important part of any type of development; one would not want to end up with a pipeline that does not work or that gives erroneous results.
We can also see job runs in the overall ‘Jobs & Pipelines’ page, and filter by our desired job (or some other filter):

We can see that the daily ingest run takes about 30 to 35 minutes, which is acceptable for a daily run on Databricks Free Edition but would be a clear optimization target in a production setting given the small size of our dataset.
IV. What’s Next?
Congratulations, you now have a working end-to-end LinkedIn data product! The next step would be to make the pipeline easy to maintain and improve upon: that is the topic of the next post in the series. See you there!