Build Your Own LinkedIn Analytics Part 10: Observing the Pipeline
Originally published on Medium on 30 December 2025
By the end of the previous post, we had built our LinkedIn analytics pipeline all the way from ingestion to the final dashboard product, and put proper orchestration and automation in place to keep it maintainable.
The final step is observability: making sure that we know what’s going on with our pipelines.
TL;DR
- Observability turns this LinkedIn analytics pipeline from a black box into a predictable data product with clear Service Level Agreements (SLAs) for reliability.
- Operational observability ensures you know when jobs and tasks fail, who needs to act, and how alerts flow via Databricks notifications.
- Data observability tracks quality dimensions like completeness, timeliness and uniqueness using Databricks’ built-in profiling, so you catch silent failures, not just broken runs.
- For a single-creator LinkedIn stack, Databricks’ native observability is sufficient; platform-neutral stacks (OpenTelemetry, Prometheus, Grafana) pay off once you’re coordinating many pipelines across teams and platforms.
I. Why Observability is Important
One of the critical aspects of an enterprise-ready data product is reliability. That means the pipeline needs to work most of the time and any issues should be resolved within a certain time period. These details are usually defined by a Service Level Agreement (SLA), which formalises how reliable the pipeline needs to be and how quickly incidents must be resolved.
Batch pipelines like ours typically need data ready for consumption by a fixed time or within a defined duration; streaming pipelines add uptime targets (e.g., 99.5%) and tighter SLAs.
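To make the SLA idea concrete, here is a minimal sketch of the two kinds of checks; the 07:00 deadline, the timestamps and the uptime figures are hypothetical:

```python
from datetime import datetime, timezone

def met_batch_sla(run_finished_at: datetime, deadline: datetime) -> bool:
    """True if the batch run delivered data before its agreed deadline."""
    return run_finished_at <= deadline

def uptime_pct(total_minutes: int, downtime_minutes: int) -> float:
    """Availability figure used for streaming-style uptime targets."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

# Hypothetical daily SLA: data ready for consumption by 07:00 UTC
deadline = datetime(2025, 12, 30, 7, 0, tzinfo=timezone.utc)
finished = datetime(2025, 12, 30, 6, 42, tzinfo=timezone.utc)
on_time = met_batch_sla(finished, deadline)
```

A month with 216 minutes of downtime out of 43,200 works out to exactly the 99.5% uptime target mentioned above.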
While it would be nice for our pipeline to work 100% of the time, in reality there will be disruptions or errors for any number of reasons. How do we make sure such issues are resolved in a timely manner?
- The appropriate person needs to be alerted. In smaller setups this would typically be the developer or maintainer of the data product; larger teams would have dedicated ‘Level 1’ (L1) support staff who would triage the most common issues, and escalate to higher levels only if the issue requires a deeper investigation or complex fix.
- The issue needs to be logged and traceable. Error logs need to clearly indicate where something has gone wrong with associated information; meanwhile, we need to be able to trace which task in which job had the issue. If you can’t see where it failed, you can’t promise when it’ll be fixed.
If the second point sounds familiar, that’s because we’ve already seen this in action when testing our orchestrated pipeline. The same set of skills and tools needed for debugging come into play when triaging a pipeline, and that includes knowing where to look for pipeline monitoring information as well as writing appropriate logging code within our pipeline.
If we properly document the code and annotate the logging, we should have sufficient information to at least begin to tackle whatever issue may come up.
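To make the logging point concrete, here is a minimal sketch of annotated logging inside a pipeline task. The task name `ingest_impressions` and the row structure are illustrative, not the series’ actual code:

```python
import logging

# Namespaced logger so triage can trace which job/task emitted the message
logger = logging.getLogger("linkedin_pipeline.ingest_impressions")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

def ingest_impressions(rows):
    """Ingest impression records, logging enough context to triage issues."""
    logger.info("starting ingestion, row_count=%d", len(rows))
    try:
        missing = [r for r in rows if "date" not in r]
        if missing:
            # Non-fatal data issue: log it rather than failing the run
            logger.warning("rows missing 'date' field: %d", len(missing))
        return [r for r in rows if "date" in r]
    except Exception:
        # logger.exception records the full traceback for later triage
        logger.exception("ingestion failed")
        raise
```

The key habit is logging identifying context (task name, counts, the offending field) rather than a bare "something went wrong".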
But what are the issues that need to be tracked and addressed?
a. Operational health
This is the most visible issue that we need to address. If any part of our pipeline stops working, we need to know about and resolve the issue in a timely manner. If your business use case (in this instance, perhaps a weekly LinkedIn review) depends on this data product, every failed run is a missed opportunity to learn and adjust your business strategy (or in this instance, content strategy) in a timely fashion.
Other than the health of the data pipelines and dashboards, the health of our data infrastructure is also important, even if we can’t necessarily do anything about it. For instance, what’s our recourse if the entire AWS us-east-2 region goes down and takes the entire Databricks Free Edition stack (which is what we’ve been using for this series) down with it?
This isn’t a hypothetical issue: in October 2025, only a few months before this post was written, AWS suffered a major outage that took major websites and services down with it. It was just the hardest-hitting of a string of outages that also struck Azure and Cloudflare.
If the pipeline is critical enough, we’d implement some form of multi-zonal or multi-region setup for our data infrastructure. This feature on Databricks is only available at the Enterprise level, with the associated costs. Higher resiliency almost always results in higher operational costs, and you will need to evaluate how much that higher resiliency and reliability is worth.
b. Data health (aka data quality)
Just because the pipeline is running doesn’t mean that the data is correct. We need to establish metrics to measure data health, i.e. data quality.
We used exactly such metrics (as defined by Databricks) to evaluate our data sources all the way back in part 2 of our blog series. Let’s recap:
- Consistency: Data values should not conflict with other values across data sets
- Accuracy: There should be no errors in the data
- Validity: The data should conform to a certain format
- Completeness: There should be no missing data
- Timeliness: The data should be up to date
- Uniqueness: There should be no duplicates
These are the dimensions that we watch over time to spot when something silently goes wrong. We defined Relevance as another metric when we were evaluating our data sources, but that’s no longer necessary in our operational state (we already chose our data to be relevant, unless the data product changes significantly).
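Several of these dimensions can be computed with a handful of lines. The sketch below (the record fields and values are hypothetical) measures completeness, uniqueness and timeliness over a batch of records:

```python
from datetime import date

def quality_metrics(rows, key_field, date_field, today):
    """Compute three of the dimensions above over a list of records:
    completeness (no missing values), uniqueness (no duplicate keys)
    and timeliness (days since the newest record)."""
    total = len(rows)
    complete = sum(1 for r in rows if all(v is not None for v in r.values()))
    unique = len({r[key_field] for r in rows})
    newest = max(r[date_field] for r in rows)
    return {
        "completeness": complete / total,
        "uniqueness": unique / total,
        "lag_days": (today - newest).days,
    }

# Hypothetical daily impression records; note the null value and duplicate key
rows = [
    {"post_id": "a", "date": date(2025, 12, 28), "impressions": 120},
    {"post_id": "b", "date": date(2025, 12, 29), "impressions": None},
    {"post_id": "b", "date": date(2025, 12, 29), "impressions": 85},
]
metrics = quality_metrics(rows, "post_id", "date", today=date(2025, 12, 30))
```

In practice Databricks computes equivalents of these for you (as we’ll see below), but it helps to know exactly what each dimension boils down to.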
II. Our Observability Stack
Building a data product on a single mature data platform allows us to centralise and use the native observability stack of that platform. In other setups, there can be more than one orchestrator and/or data platform. This can happen when:
- Multiple legacy systems remain from earlier design decisions and migrations were never prioritised (‘don’t fix what’s not broken’).
- Different teams own their own stacks and have independently chosen tooling.
- Parts of the system run on third-party platforms outside of the main data platform.
In these cases, there would be a greater call for a unified observability stack. A combination of OpenTelemetry (an open standard for emitting traces, metrics and logs), Prometheus (an open-source time-series database) and Grafana (a real-time dashboard) is commonly used as a platform-neutral observability stack; the respective cloud platforms also have their own solutions (e.g. Google Cloud Monitoring/Logging as well as Dataplex).
For a single-creator LinkedIn analytics stack, Databricks’ native observability is enough; the OpenTelemetry/Prometheus/Grafana route usually pays off when you are coordinating many pipelines across teams or platforms.
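To make the platform-neutral option slightly more concrete, here is a sketch of what a pipeline could emit in Prometheus’ text exposition format. The metric and label names are made up for illustration, and a real setup would use a Prometheus client library or the OpenTelemetry SDK rather than formatting strings by hand:

```python
def render_prometheus(job: str, status: str, duration_seconds: float) -> str:
    """Render two pipeline-run metrics in Prometheus text exposition format:
    a counter of runs (labelled by outcome) and a duration gauge."""
    return "\n".join([
        "# TYPE pipeline_runs_total counter",
        f'pipeline_runs_total{{job="{job}",status="{status}"}} 1',
        "# TYPE pipeline_run_duration_seconds gauge",
        f'pipeline_run_duration_seconds{{job="{job}"}} {duration_seconds}',
    ])

exposition = render_prometheus("linkedin_silver_refresh", "success", 182.4)
```

Prometheus would scrape an endpoint serving this text, and Grafana would chart the resulting time series.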
a. Operational observability
We have already seen what the operational monitoring on Databricks looks like when testing our orchestrated pipeline. But what about operational alerting?
Databricks has a notifications feature linked to its jobs and pipelines, which can be added either in the Databricks UI or as configuration in a Databricks Asset Bundle (DAB). Here’s a breakdown of when to use which:
- Job or pipeline notifications ensure you hear about any failed run without checking the Jobs UI.
- Task notifications let you focus alerts on the most critical steps and avoid noise from skipped or cancelled tasks.
Now let’s see how to enable them for Jobs and Tasks on the Databricks UI; we will explore the DABs approach afterwards.
Navigate to the interface of one of your Jobs and find the ‘Job notifications’ section as below.

Click on ‘Edit notifications’, and you are brought to the ‘Job notifications’ dialog box. Clicking on ‘Add notification’ then brings you to the setup page for Job notifications.


The default destination type for notifications is an email address, as can be seen in the screencap. However, other system destinations can be set up, such as to a Microsoft Teams channel or to a webhook.
For our purposes, email alerts are perfectly fine, though you could set up a webhook to your Telegram, WhatsApp or other messaging system of your choice if you prefer more immediate notifications than an email.
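As an example, a small relay that turns a failure payload into a Telegram message could look like the sketch below. The Telegram Bot API’s sendMessage endpoint is real, but the payload fields (`job_name`, `run_id`) are assumptions; check the actual webhook body your workspace sends before relying on them:

```python
import json
from urllib import request

def forward_to_telegram(bot_token: str, chat_id: str, payload: dict) -> request.Request:
    """Build (but don't send) a Telegram sendMessage request from a
    job-failure webhook payload. The payload keys used here are
    illustrative placeholders, not a documented Databricks schema."""
    text = (
        f"Job failed: {payload.get('job_name', 'unknown')}\n"
        f"Run: {payload.get('run_id', '?')}"
    )
    body = json.dumps({"chat_id": chat_id, "text": text}).encode()
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})
```

Sending the built request with `urllib.request.urlopen` (or `requests`) from a tiny webhook receiver completes the relay.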
Below are some screencaps of what an email notification setup looks like.


Accessing task notifications is a similar story. There is a ‘Notifications’ section in the task setup, and clicking ‘+ Add’ brings us to a similar ‘Task notifications’ dialog box.



Once you’re happy with the UI setup, you can replicate the same notifications into a DAB so they’re versioned and reproducible (i.e. maintainable):
resources:
  jobs:
    job_name:
      name: ...
      email_notifications:
        on_failure:
          - placeholder@email.com
        no_alert_for_skipped_runs: true
      notification_settings:
        no_alert_for_skipped_runs: true
        no_alert_for_canceled_runs: true
      ...
      tasks:
        - task_key: task_name_1
          ...
          email_notifications:
            on_failure:
              - placeholder@email.com
          notification_settings:
            no_alert_for_skipped_runs: true
            no_alert_for_canceled_runs: true
            alert_on_last_attempt: true
        - task_key: task_name_2
          ...

If you’re setting up custom notification systems, note that these cannot yet be set up via DABs, but you can still direct the notifications to a webhook; its configuration can be stored in a DAB variable.
Similar notification setups can be done for Declarative Pipelines, but they are not as extensive, as can be seen below:

b. Data observability
Earlier, we defined data quality dimensions like completeness, timeliness and uniqueness. In observability terms, these become checks and dashboards that tell us when the numbers we rely on for LinkedIn decisions might be wrong.
We already defined what constitutes good data quality for our data when choosing our data sources. Generally speaking, we don’t stop our pipelines on data quality issues unless such issues break our pipelines (e.g. unexpected data schema changes). However, we still want to ensure that we can monitor such issues as and when they appear, and we would like to be alerted if the quality issues cross a certain threshold.
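A threshold check of that kind can be sketched in a few lines. The floor values below are hypothetical and should be tuned to your own tolerance; each floor is the minimum acceptable value for its metric:

```python
def quality_breaches(metrics: dict, minimums: dict) -> list:
    """Return the names of quality metrics that fall below their minimum
    acceptable value. Threshold values are illustrative, not prescriptive."""
    return [name for name, floor in minimums.items()
            if name in metrics and metrics[name] < floor]

# Hypothetical floors: near-total completeness, strictly no duplicates
minimums = {"completeness": 0.99, "uniqueness": 1.0}
observed = {"completeness": 0.97, "uniqueness": 1.0}
alerts = quality_breaches(observed, minimums)
```

Anything returned by such a check would trigger a notification rather than stop the pipeline, in line with the monitor-don’t-block approach above.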
Databricks has a data quality feature built into its tables. Click on any table and navigate to the ‘Quality’ tab; you’ll see the option to enable Data Quality Monitoring.

Clicking on ‘Enable’ shows us two options: a schema-wide ‘intelligent’ anomaly detection feature, and data profiling for column and row-level metrics.

The data profiling feature is where you can set up Databricks’ more generic data quality monitoring. There are three main options, as seen in the below screencap:

- Time series profiling analyses quality metrics over a set of time windows. In our case, daily time windows make sense for much of our data, e.g. our impressions and engagements silver tables.
- Snapshot profiling analyses quality metrics over the entire table and is used when there is no useful or relevant time field to be used, e.g. for our posts silver tables. Note that materialized tables can only use snapshot profiling.
- Inference profiling is specifically for tracking machine learning model drift and performance over time. Since we’re not running any such models, this is not relevant for us. If we decide to deploy machine learning models later (e.g. forecasting model for impressions), enabling this feature will help us catch any underlying drift in the data that can result in a reduction in the models’ accuracy.
There is also an advanced options section where refresh schedules, the location of data profile metric tables, as well as optional custom metrics can be defined.

Once the data profiling has been configured and saved, a dashboard with its two associated metrics tables is created, see the screencaps below.



Finally, all of the above data profiling can be configured via DABs, as per Databricks’ documentation.
As of the time of writing, anomaly detection cannot be set up via DABs; if you want to use a CI/CD approach, you will have to explore either the Data Quality Monitoring API or the equivalent Terraform module.
III. What’s Next?
Our journey to build our LinkedIn analytics data product is almost complete. Next up is a post-mortem where I share key takeaways and lessons learned when working on this project, so stay tuned.