Build Your Own LinkedIn Analytics Part 10: Observing the Pipeline
Originally published on Medium on 30 December 2025
By the end of the previous post, we had built our LinkedIn analytics pipeline all the way from ingestion to the final dashboard product, and put proper orchestration and automation in place to keep it maintainable.
The final step is observability: making sure that we know what’s going on with our pipelines.
TL;DR
- Observability turns this LinkedIn analytics pipeline from a black box into a predictable data product with clear Service Level Agreements (SLAs) for reliability.
- Operational observability ensures you know when jobs and tasks fail, who needs to act, and how alerts flow via Databricks notifications.
- Data observability tracks quality dimensions like completeness, timeliness and uniqueness using Databricks’ built-in profiling, so you catch silent failures, not just broken runs.
- For a single-creator LinkedIn stack, Databricks’ native observability is sufficient; platform-neutral stacks (OpenTelemetry, Prometheus, Grafana) pay off once you’re coordinating many pipelines across teams and platforms.
I. Why Observability is Important
One of the critical aspects of an enterprise-ready data product is reliability. That means the pipeline needs to work most of the time and any issues should be resolved within a certain time period. These details are usually defined by a Service Level Agreement (SLA), which formalises how reliable the pipeline needs to be and how quickly incidents must be resolved.
Batch pipelines like ours typically need data ready for consumption by a fixed time or within a defined duration; streaming pipelines add uptime targets (e.g., 99.5%) and tighter SLAs.
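To make the SLA idea concrete, here is a minimal sketch of the two kinds of checks; the 07:00 deadline, the timestamps and the uptime figures are hypothetical:

```python
from datetime import datetime, timezone

def met_batch_sla(run_finished_at: datetime, deadline: datetime) -> bool:
    """True if the batch run delivered data before its agreed deadline."""
    return run_finished_at <= deadline

def uptime_pct(total_minutes: int, downtime_minutes: int) -> float:
    """Availability figure used for streaming-style uptime targets."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

# Hypothetical daily SLA: data ready for consumption by 07:00 UTC
deadline = datetime(2025, 12, 30, 7, 0, tzinfo=timezone.utc)
finished = datetime(2025, 12, 30, 6, 42, tzinfo=timezone.utc)
on_time = met_batch_sla(finished, deadline)
```

A month with 216 minutes of downtime out of 43,200 works out to exactly the 99.5% uptime target mentioned above.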
While it would be nice for our pipeline to work 100% of the time, in reality there will be disruptions or errors for any number of reasons. How do we make sure such issues are resolved in a timely manner?
- The appropriate person needs to be alerted. In smaller setups this would typically be the developer or maintainer of the data product; larger teams would have dedicated ‘Level 1’ (L1) support staff who would triage the most common issues, and escalate to higher levels only if the issue requires a deeper investigation or complex fix.
- The issue needs to be logged and traceable. Error logs need to clearly indicate where something has gone wrong with associated information; meanwhile, we need to be able to trace which task in which job had the issue. If you can’t see where it failed, you can’t promise when it’ll be fixed.
If the second point sounds familiar, that’s because we’ve already seen this in action when testing our orchestrated pipeline. The same set of skills and tools needed for debugging come into play when triaging a pipeline, and that includes knowing where to look for pipeline monitoring information as well as writing appropriate logging code within our pipeline.
If we properly document the code and annotate the logging, we should have sufficient information to at least begin to tackle whatever issue may come up.
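To make the logging point concrete, here is a minimal sketch of annotated logging inside a pipeline task. The task name `ingest_impressions` and the row structure are illustrative, not the series’ actual code:

```python
import logging

# Namespaced logger so triage can trace which job/task emitted the message
logger = logging.getLogger("linkedin_pipeline.ingest_impressions")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

def ingest_impressions(rows):
    """Ingest impression records, logging enough context to triage issues."""
    logger.info("starting ingestion, row_count=%d", len(rows))
    try:
        missing = [r for r in rows if "date" not in r]
        if missing:
            # Non-fatal data issue: log it rather than failing the run
            logger.warning("rows missing 'date' field: %d", len(missing))
        return [r for r in rows if "date" in r]
    except Exception:
        # logger.exception records the full traceback for later triage
        logger.exception("ingestion failed")
        raise
```

The key habit is logging identifying context (task name, counts, the offending field) rather than a bare "something went wrong".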
But what are the issues that need to be tracked and addressed?
a. Operational health
This is the most visible issue that we need to address. If any part of our pipeline stops working, we need to know about and resolve the issue in a timely manner. If your business use case (in this instance, perhaps a weekly LinkedIn review) depends on this data product, every failed run is a missed opportunity to learn and adjust your business strategy (or in this instance, content strategy) in a timely fashion.
Other than the health of the data pipelines and dashboards, the health of our data infrastructure is also important, even if we can’t necessarily do anything about it. For instance, what’s our recourse if the entire AWS us-east-2 region goes down and takes the entire Databricks Free Edition stack (which is what we’ve been using for this series) down with it?
This isn’t a hypothetical issue: in October 2025, only a few months before this post was written, AWS suffered a major outage that took major websites and services down with it. It was just the hardest-hitting of a string of outages that also struck Azure and Cloudflare.
If the pipeline is critical enough, we’d implement some form of multi-zonal or multi-region setup for our data infrastructure. This feature on Databricks is only available at the Enterprise level, with the associated costs. Higher resiliency almost always results in higher operational costs, and you will need to evaluate how much that higher resiliency and reliability is worth.
b. Data health (aka data quality)
Just because the pipeline is running doesn’t mean that the data is correct. We need to establish metrics to measure data health, i.e. data quality.
We used exactly such metrics (as defined by Databricks) to evaluate our data sources all the way back in part 2 of our blog series. Let’s recap:
- Consistency: Data values should not conflict with other values across data sets
- Accuracy: There should be no errors in the data
- Validity: The data should conform to a certain format
- Completeness: There should be no missing data
- Timeliness: The data should be up to date
- Uniqueness: There should be no duplicates
These are the dimensions that we watch over time to spot when something silently goes wrong. We defined Relevance as another metric when we were evaluating our data sources, but that’s no longer necessary in our operational state (we already chose our data to be relevant, unless the data product changes significantly).
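Several of these dimensions can be computed with a handful of lines. The sketch below (the record fields and values are hypothetical) measures completeness, uniqueness and timeliness over a batch of records:

```python
from datetime import date

def quality_metrics(rows, key_field, date_field, today):
    """Compute three of the dimensions above over a list of records:
    completeness (no missing values), uniqueness (no duplicate keys)
    and timeliness (days since the newest record)."""
    total = len(rows)
    complete = sum(1 for r in rows if all(v is not None for v in r.values()))
    unique = len({r[key_field] for r in rows})
    newest = max(r[date_field] for r in rows)
    return {
        "completeness": complete / total,
        "uniqueness": unique / total,
        "lag_days": (today - newest).days,
    }

# Hypothetical daily impression records; note the null value and duplicate key
rows = [
    {"post_id": "a", "date": date(2025, 12, 28), "impressions": 120},
    {"post_id": "b", "date": date(2025, 12, 29), "impressions": None},
    {"post_id": "b", "date": date(2025, 12, 29), "impressions": 85},
]
metrics = quality_metrics(rows, "post_id", "date", today=date(2025, 12, 30))
```

In practice Databricks computes equivalents of these for you (as we’ll see below), but it helps to know exactly what each dimension boils down to.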
II. Our Observability Stack
Building a data product on a single mature data platform allows us to centralise and use the native observability stack of that platform. In other setups, there can be more than one orchestrator and/or data platform. This can happen when:
- Multiple legacy systems remain from earlier design decisions and migrations were never prioritised (‘don’t fix what’s not broken’).
- Different teams own their own stacks and have independently chosen tooling.
- Parts of the system run on third-party platforms outside of the main data platform.
In these cases, there would be a greater call for a unified observability stack. A combination of OpenTelemetry (an open standard for emitting traces, metrics and logs), Prometheus (an open-source time-series database) and Grafana (a real-time dashboard) is commonly used as a platform-neutral observability stack; the respective cloud platforms also have their own solutions (e.g. Google Cloud Monitoring/Logging as well as Dataplex).
For a single-creator LinkedIn analytics stack, Databricks’ native observability is enough; the OpenTelemetry/Prometheus/Grafana route usually pays off when you are coordinating many pipelines across teams or platforms.
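To make the platform-neutral option slightly more concrete, here is a sketch of what a pipeline could emit in Prometheus’ text exposition format. The metric and label names are made up for illustration, and a real setup would use a Prometheus client library or the OpenTelemetry SDK rather than formatting strings by hand:

```python
def render_prometheus(job: str, status: str, duration_seconds: float) -> str:
    """Render two pipeline-run metrics in Prometheus text exposition format:
    a counter of runs (labelled by outcome) and a duration gauge."""
    return "\n".join([
        "# TYPE pipeline_runs_total counter",
        f'pipeline_runs_total{{job="{job}",status="{status}"}} 1',
        "# TYPE pipeline_run_duration_seconds gauge",
        f'pipeline_run_duration_seconds{{job="{job}"}} {duration_seconds}',
    ])

exposition = render_prometheus("linkedin_silver_refresh", "success", 182.4)
```

Prometheus would scrape an endpoint serving this text, and Grafana would chart the resulting time series.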
a. Operational observability
We have already seen what the operational monitoring on Databricks looks like when testing our orchestrated pipeline. But what about operational alerting?
Databricks has a notifications feature linked to its jobs and pipelines, which can be added either in the Databricks UI or as configuration in a Databricks Asset Bundle (DAB). Here’s a breakdown of when to use which:
- Job or pipeline notifications ensure you hear about any failed run without checking the Jobs UI.
- Task notifications let you focus alerts on the most critical steps and avoid noise from skipped or cancelled tasks.
Now let’s see how to enable them for Jobs and Tasks on the Databricks UI; we will explore the DABs approach afterwards.
Navigate to the interface of one of your Jobs and find the ‘Job notifications’ section as below.

Click on ‘Edit notifications’, and you are brought to the ‘Job notifications’ dialog box. Clicking on ‘Add notification’ then brings you to the setup page for Job notifications.


The default destination type for notifications is an email address, as can be seen in the screencap. However, other system destinations can be set up, such as to a Microsoft Teams channel or to a webhook.
For our purposes, email alerts are perfectly fine, though you could set up a webhook to your Telegram, WhatsApp or other messaging system of your choice if you prefer more immediate notifications than an email.
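As an example, a small relay that turns a failure payload into a Telegram message could look like the sketch below. The Telegram Bot API’s sendMessage endpoint is real, but the payload fields (`job_name`, `run_id`) are assumptions; check the actual webhook body your workspace sends before relying on them:

```python
import json
from urllib import request

def forward_to_telegram(bot_token: str, chat_id: str, payload: dict) -> request.Request:
    """Build (but don't send) a Telegram sendMessage request from a
    job-failure webhook payload. The payload keys used here are
    illustrative placeholders, not a documented Databricks schema."""
    text = (
        f"Job failed: {payload.get('job_name', 'unknown')}\n"
        f"Run: {payload.get('run_id', '?')}"
    )
    body = json.dumps({"chat_id": chat_id, "text": text}).encode()
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})
```

Sending the built request with `urllib.request.urlopen` (or `requests`) from a tiny webhook receiver completes the relay.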
Below are some screencaps of what an email notification setup looks like.


Accessing task notifications is a similar story. There is a ‘Notifications’ section in the task setup, and clicking ‘+ Add’ brings us to a similar ‘Task notifications’ dialog box.



Once you’re happy with the UI setup, you can replicate the same notifications into a DAB so they’re versioned and reproducible (i.e. maintainable):
resources:
  jobs:
    job_name:
      name: ...
      email_notifications:
        on_failure:
          - placeholder@email.com
        no_alert_for_skipped_runs: true
      notification_settings:
        no_alert_for_skipped_runs: true
        no_alert_for_canceled_runs: true
      ...
      tasks:
        - task_key: task_name_1
          ...
          email_notifications:
            on_failure:
              - placeholder@email.com
          notification_settings:
            no_alert_for_skipped_runs: true
            no_alert_for_canceled_runs: true
            alert_on_last_attempt: true
        - task_key: task_name_2
          ...

If you’re setting up custom notification systems, note that these cannot yet be set up via DABs, but you can still direct the notifications to a webhook; its configuration can be stored in a DAB variable.
Similar notification setups can be done for Declarative Pipelines, but they are not as extensive, as can be seen below:

b. Data observability
Earlier, we defined data quality dimensions like completeness, timeliness and uniqueness. In observability terms, these become checks and dashboards that tell us when the numbers we rely on for LinkedIn decisions might be wrong.
We already defined what constitutes good data quality for our data when choosing our data sources. Generally speaking, we don’t stop our pipelines on data quality issues unless such issues break our pipelines (e.g. unexpected data schema changes). However, we still want to ensure that we can monitor such issues as and when they appear, and we would like to be alerted if the quality issues cross a certain threshold.
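A threshold check of that kind can be sketched in a few lines. The floor values below are hypothetical and should be tuned to your own tolerance; each floor is the minimum acceptable value for its metric:

```python
def quality_breaches(metrics: dict, minimums: dict) -> list:
    """Return the names of quality metrics that fall below their minimum
    acceptable value. Threshold values are illustrative, not prescriptive."""
    return [name for name, floor in minimums.items()
            if name in metrics and metrics[name] < floor]

# Hypothetical floors: near-total completeness, strictly no duplicates
minimums = {"completeness": 0.99, "uniqueness": 1.0}
observed = {"completeness": 0.97, "uniqueness": 1.0}
alerts = quality_breaches(observed, minimums)
```

Anything returned by such a check would trigger a notification rather than stop the pipeline, in line with the monitor-don’t-block approach above.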
Databricks has a data quality feature built into its tables. Click on any table and navigate to the ‘Quality’ tab; you’ll see the option to enable Data Quality Monitoring.

Clicking on ‘Enable’ shows us two options: a schema-wide ‘intelligent’ anomaly detection feature, and data profiling for column and row-level metrics.

The data profiling feature is where you can set up Databricks’ more generic data quality monitoring. There are three main options, as seen in the below screencap:

- Time series profiling analyses quality metrics over a set of time windows. In our case, daily time windows make sense for much of our data, e.g. our impressions and engagements silver tables.
- Snapshot profiling analyses quality metrics over the entire table and is used when there is no useful or relevant time field to be used, e.g. for our posts silver tables. Note that materialized tables can only use snapshot profiling.
- Inference profiling is specifically for tracking machine learning model drift and performance over time. Since we’re not running any such models, this is not relevant for us. If we decide to deploy machine learning models later (e.g. forecasting model for impressions), enabling this feature will help us catch any underlying drift in the data that can result in a reduction in the models’ accuracy.
There is also an advanced options section where refresh schedules, the location of data profile metric tables, as well as optional custom metrics can be defined.

Once the data profiling has been configured and saved, a dashboard with its two associated metrics tables is created, see the screencaps below.



Finally, all of the above data profiling can be configured via DABs, as per Databricks’ documentation.
As of the time of writing, anomaly detection cannot be set up via DABs; if you want to use a CI/CD approach, you will have to explore either the Data Quality Monitoring API or the equivalent Terraform module.
III. What’s Next?
Our journey to build our LinkedIn analytics data product is almost complete. Next up is a post-mortem where I share key takeaways and lessons learned when working on this project, so stay tuned.