Build Your Own LinkedIn Analytics Part 5: Cleaning and Transforming Data
Published on Medium on 3 November 2025
In the previous article, we ingested our LinkedIn analytics data into the bronze layer. In this article, we will clean and transform our data into a Single Source of Truth (SSOT) in the silver layer.
TL;DR
- We build up our Single Source of Truth (SSOT) in the silver layer by creating views over bronze tables, as well as by consolidating and enriching data from the bronze layer
I. Recap
Before we proceed further, let’s recap where we are in building out our data product. Recall the data architecture that we set up:

We have established the landing and the bronze layer, and ingested our test data into those layers. This means that we have queryable tables with data that is close to its raw form. Our next step is to take that data and prepare it to serve as our Single Source of Truth (SSOT) in the silver layer.

II. Silver layer: transformation approach
So how shall we approach the silver layer? The proposed data flow below illustrates three approaches.

- Direct view from bronze layer: This is for bronze tables that are already well suited to serving as the source of truth for the business. Usually, the most that is done is filtering down the list of columns and/or renaming some of them.
- Consolidation: This refers to the merger of various bronze tables that share the same structure and purpose; these tables are separate in the bronze layer simply due to the nature of the ingest, but are better served as a single table.
- Enrichment: This refers to the addition and/or modification of a bronze table with information from other tables.
In many real-world pipelines, consolidation and enrichment go hand-in-hand; in certain architectural approaches, in fact, a consolidated table with minimal transformations can be considered a bronze layer table still (especially if the data is not ready to act as a source of truth for business analysts).
Let’s take a deep dive into the approaches that we landed on for our silver transformation.
a. Views from bronze: totals and followers
totals and followers are great examples of bronze tables that do not need further processing, and thus can be exposed as views on the silver layer. The DDL used to create both of them is below:
CREATE VIEW IF NOT EXISTS silver.linkedin.totals AS
SELECT date, impressions, engagements FROM bronze.linkedin.totals;
CREATE VIEW IF NOT EXISTS silver.linkedin.followers AS
SELECT date, new_followers FROM bronze.linkedin.followers;

I decided not to include timestamps in these views, deeming them not relevant from the point of view of an analyst looking at the dataset.
b. Consolidation from bronze staging: impressions and engagements
The impressions and engagements silver tables are derived from a merger, i.e. consolidation, of the staging impressions and engagements tables in the bronze layer, with the staging tables removed once the consolidation is done and dusted. An additional field is added to distinguish between the different staging tables: the analytics date, derived from the suffix of the staging table name.
For a discussion of staging tables, refer to the previous article in the series on ingesting data into the bronze layer.
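To make the suffix-to-date derivation concrete, here is a minimal, hypothetical helper (the actual pipeline code later in this section does the same thing inline rather than via a function):

```python
import datetime
import re


def analytics_date_from_name(table_name: str, prefix: str = "impressions") -> datetime.date:
    """Derive the analytics date from a staging table name such as
    'impressions_2025_11_03'. Hypothetical helper for illustration;
    not part of the actual pipeline code."""
    match = re.match(rf"{prefix}_(\d{{4}})_(\d{{2}})_(\d{{2}})$", table_name)
    if match is None:
        raise ValueError(f"{table_name} does not match the staging naming pattern")
    return datetime.date(*(int(part) for part in match.groups()))
```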
In the real world, calculated fields such as day of week or change in metric could also be added to the silver table over time, as BI analysts, data analysts and other end users of the table find themselves calculating these fields more and more often. Alternatively, such fields could be added in the gold or semantic layer instead, tuned to the particular business case. Once again, there is no hard and fast rule about where such metrics should go.
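As a toy illustration of such derived fields (plain Python with made-up numbers, rather than an actual silver or gold transformation):

```python
import datetime

# Made-up daily totals, for illustration only
dates = [
    datetime.date(2025, 11, 1),
    datetime.date(2025, 11, 2),
    datetime.date(2025, 11, 3),
]
impressions = [120, 150, 90]

# Day of week, as an analyst might want it pre-computed
day_of_week = [d.strftime("%A") for d in dates]

# Day-over-day change in the metric; undefined for the first day
impressions_change = [None] + [b - a for a, b in zip(impressions, impressions[1:])]
```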

There is one more enhancement to make. Based on our analysis of the data sources, we know that only the metrics for the 50 top posts for daily impressions or engagements are ingested individually. However, we also have the total daily impressions and engagements numbers. Based on that, we can calculate an “other” record that consolidates the metrics for all posts outside the daily top 50. For example, if the top 50 posts account for 900 impressions on a day with 1,000 total impressions, the “other” record carries the remaining 100.

So what does the code look like? Let’s see the example for ingesting the impressions silver table:
import re

import pandas as pd
from delta.tables import DeltaTable
from pyspark.sql import Row
from pyspark.sql.functions import sum as spark_sum

# 1. define our constants
BRONZE_CATALOG = "bronze"
BRONZE_SCHEMA = "linkedin"
BRONZE_IMPRESSIONS_TABLE_PREFIX = "impressions"
BRONZE_TOTALS_TABLE = "totals"
SILVER_CATALOG = "silver"
SILVER_SCHEMA = "linkedin"
SILVER_IMPRESSIONS_TABLE = "impressions"

# 2. set our input and output variables
# extract list of daily bronze staging impressions tables
# based on regular expression matching
bronze_impressions_tables = [
    table.name
    for table in spark.catalog.listTables(f"{BRONZE_CATALOG}.{BRONZE_SCHEMA}")
    if re.match(
        rf"{BRONZE_IMPRESSIONS_TABLE_PREFIX}_\d{{4}}_\d{{2}}_\d{{2}}$",
        table.name,
    )
]
silver_impressions_table = (
    f"{SILVER_CATALOG}.{SILVER_SCHEMA}.{SILVER_IMPRESSIONS_TABLE}"
)

# 3. execute the ingestion
# Process daily bronze staging impressions tables
# and merge into silver layer
print("Processing daily impressions tables...")
for bronze_impression_table in bronze_impressions_tables:
    # Extract date suffix from table name
    # and convert to a date (the suffix is in %Y_%m_%d format)
    date_str = bronze_impression_table.replace(
        f"{BRONZE_IMPRESSIONS_TABLE_PREFIX}_", ""
    )
    analytics_date = pd.to_datetime(date_str, format="%Y_%m_%d").date()
    table_name = f"{BRONZE_CATALOG}.{BRONZE_SCHEMA}.{bronze_impression_table}"
    print(f"Processing {table_name}...")

    silver_impressions_df = spark.sql(f"""
        SELECT
            post_url,
            to_date(post_publish_date, 'M/d/yyyy') AS post_publish_date,
            impressions,
            to_date('{date_str}', 'yyyy_MM_dd') AS analytics_date
        FROM {table_name}
    """)

    silver_impressions_totals_df = spark.sql(f"""
        SELECT
            impressions,
            source_file,
            source_file_timestamp,
            ingestion_timestamp
        FROM {BRONZE_CATALOG}.{BRONZE_SCHEMA}.{BRONZE_TOTALS_TABLE}
        WHERE to_date('{date_str}', 'yyyy_MM_dd') = date
    """)

    # Calculate the impressions not covered by the top-50 staging table
    impressions_others = (
        silver_impressions_totals_df.select("impressions").collect()[0].impressions
        - silver_impressions_df.agg(spark_sum("impressions")).collect()[0][0]
    )
    if impressions_others > 0:
        silver_impressions_df = silver_impressions_df.union(
            spark.createDataFrame([
                Row(
                    post_url="others",
                    post_publish_date=analytics_date,
                    impressions=impressions_others,
                    analytics_date=analytics_date,
                )
            ])
        )

    if spark.catalog.tableExists(silver_impressions_table):
        print(f"Table {silver_impressions_table} exists, merging data...")
        delta_table = DeltaTable.forName(spark, silver_impressions_table)
        (
            delta_table.alias("t")
            .merge(
                silver_impressions_df.alias("s"),
                "t.post_url = s.post_url"
                " AND t.analytics_date = s.analytics_date",
            )
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )
    else:
        print(f"Table {silver_impressions_table} does not exist, creating...")
        silver_impressions_df.write.format("delta").saveAsTable(
            silver_impressions_table
        )

    # Drop the staging table after processing
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
    print(f"Dropped table {table_name}")

To break down what’s happening here:
- We load the required libraries and define our constants, as per usual.
- We extract a list of all staging impressions tables based on matching the naming pattern that we’ve already established for these tables.
- For each staging table, we
- extract the analytics date from the table suffix;
- prepare the extract from the staging table;
- pull the total daily impressions from the totals table and calculate the remainder;
- add an “others” record to the extract if the remainder is more than 0, i.e. there are posts not covered by the existing ingestion;
- create or update the silver impressions table with the extract; and
- remove the staging table.
The code for ingesting engagements is very similar, so I leave it as an exercise for the reader.
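One way to avoid duplicating the two jobs would be to factor the metric-specific parts into a helper that builds the extraction SQL for either metric. This is a hypothetical refactoring of my own, not code from the actual pipeline:

```python
def staging_extract_sql(catalog: str, schema: str, table: str,
                        metric: str, date_str: str) -> str:
    """Build the SELECT that extracts one day's staging table into the
    silver layer; only the metric column name ('impressions' or
    'engagements') varies between the two jobs. Hypothetical helper
    for illustration."""
    return (
        f"SELECT post_url, "
        f"to_date(post_publish_date, 'M/d/yyyy') AS post_publish_date, "
        f"{metric}, "
        f"to_date('{date_str}', 'yyyy_MM_dd') AS analytics_date "
        f"FROM {catalog}.{schema}.{table}"
    )
```

The loop body could then call `spark.sql(staging_extract_sql(...))` once per metric, with everything else unchanged.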
c. Enrichment from bronze: posts
The silver posts table needs to pull from three different bronze tables:
- The posts table, which contains the bulk of the post metadata.
- The post_details table, which contains the timestamp of the post.
- The post_patch table, which contains post metadata that takes priority over whatever is in the posts table (that’s what patching is about after all).
Below is a diagram showing how this enrichment takes place:

In this instance, I’ve decided to use SQL to do the ingesting rather than Python. Databricks recommends sticking to SQL where possible so as to take advantage of their own query and table writing optimizations. For relatively small datasets such as what we’re dealing with, it doesn’t really matter which approach we end up using from an efficiency standpoint.
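To make the patching priority concrete: COALESCE simply returns the first non-null argument, so patch values win over the original posts values wherever both exist. In plain Python (illustrative only), the behaviour looks like this:

```python
def coalesce(*values):
    """Return the first argument that is not None, mirroring SQL's COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


# Patch data wins over the original post data when both exist:
unpatched_title = coalesce(None, "Original title")          # no patch row
patched_title = coalesce("Patched title", "Original title")  # patch present
```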
The combined query is as follows:
-- 1. Join patch to posts on post_url
-- Prioritize patch data over posts data
-- Deduplicate merged posts to ensure one row per post_url
CREATE OR REPLACE TEMP VIEW merged_posts_dedup AS
SELECT
    post_url,
    post_publish_date,
    post_publish_timestamp,
    link,
    title,
    content
FROM (
    SELECT
        posts.post_url AS post_url,
        CAST(posts.post_publish_date AS DATE) AS post_publish_date,
        post_details.post_timestamp AS post_publish_timestamp,
        COALESCE(patch.true_url, posts.link) AS link,
        COALESCE(patch.title, posts.title) AS title,
        COALESCE(patch.content, posts.content, posts.title) AS content,
        ROW_NUMBER() OVER (
            PARTITION BY posts.post_url
            ORDER BY
                patch.true_url DESC,
                patch.title DESC,
                patch.content DESC
        ) AS rn
    FROM
        bronze.linkedin.posts AS posts
    LEFT JOIN
        bronze.linkedin.linkedin_patch AS patch
    ON
        posts.post_url = patch.post_url
    LEFT JOIN
        bronze.linkedin.post_details AS post_details
    ON
        posts.post_url = post_details.post_url
)
WHERE rn = 1;

-- 2. Create silver posts table if not exists
CREATE TABLE IF NOT EXISTS silver.linkedin.posts (
    post_url STRING,
    post_publish_date DATE,
    post_publish_timestamp TIMESTAMP,
    link STRING,
    title STRING,
    content STRING
) USING DELTA;

-- 3. Use deduplicated view for MERGE
MERGE INTO silver.linkedin.posts AS t
USING merged_posts_dedup AS s
ON t.post_url = s.post_url
WHEN MATCHED THEN
    UPDATE SET
        post_publish_date = s.post_publish_date,
        post_publish_timestamp = s.post_publish_timestamp,
        link = s.link,
        title = s.title,
        content = s.content
WHEN NOT MATCHED THEN
    INSERT (
        post_url,
        post_publish_date,
        post_publish_timestamp,
        link,
        title,
        content
    )
    VALUES (
        s.post_url,
        s.post_publish_date,
        s.post_publish_timestamp,
        s.link,
        s.title,
        s.content
    );

We’re doing the following:
- Creating a temporary view (merged_posts_dedup) that combines our patch, post details and original posts data, using the COALESCE SQL function to determine which value wins during the merge. We also use the ROW_NUMBER window function to ensure that only one row per post URL is retained.1
- Creating the silver posts table if it does not yet exist.
- Merging the temporary view into the silver posts table.
In effect, we’re overwriting our silver posts table every time we run the SQL query. For a relatively small amount of records, this is still fine; however, if there are thousands or millions of records, we will need to implement a form of incremental ingestion by adding and referencing the ingestion timestamps recorded in the bronze layer. We will return to this towards the end of the series.
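The high-watermark pattern behind such incremental ingestion can be sketched in plain Python, with dicts standing in for Delta rows (names and sample data are my own; the real thing would filter the bronze Delta table on its ingestion_timestamp column):

```python
import datetime


def rows_since(rows, watermark):
    """Keep only rows ingested after the last processed timestamp --
    the classic high-watermark pattern for incremental ingestion.
    Illustrative sketch over plain dicts, not actual pipeline code."""
    return [r for r in rows if r["ingestion_timestamp"] > watermark]


rows = [
    {"post_url": "a", "ingestion_timestamp": datetime.datetime(2025, 11, 1, 8, 0)},
    {"post_url": "b", "ingestion_timestamp": datetime.datetime(2025, 11, 3, 8, 0)},
]
watermark = datetime.datetime(2025, 11, 2, 0, 0)

new_rows = rows_since(rows, watermark)  # only post 'b' is new
# After processing, advance the watermark for the next run
new_watermark = max(r["ingestion_timestamp"] for r in rows)
```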
III. What’s next?
In the next article, we will finally get to build our gold/semantic layer. See you there!
- 30 November 2025: Edited SQL to ensure uniqueness on merge. ↩︎