Build Your Own LinkedIn Analytics Part 5: Cleaning and Transforming Data
Published on Medium on 3 November 2025
In the previous article, we ingested our LinkedIn analytics data into the bronze layer. In this article, we will clean and transform our data into a Single Source of Truth (SSOT) in the silver layer.
TL;DR
- We build up our Single Source of Truth (SSOT) in the silver layer by creating views over bronze tables, as well as by consolidating and enriching data from the bronze layer
I. Recap
Before we proceed further, let’s recap where we are in building out our data product. Recall the data architecture that we set up:

We have established the landing and the bronze layer, and ingested our test data into those layers. This means that we have queryable tables with data that is close to its raw form. Our next step is to take that data and prepare it to serve as our Single Source of Truth (SSOT) in the silver layer.

II. Silver layer: transformation approach
So how shall we approach the silver layer? The proposed data flow below illustrates three approaches.

- Direct view from bronze layer: This is for bronze tables that are already well suited to serving as the source of truth for the business. Usually, the most that is done is filtering down the list of columns and/or renaming some of them.
- Consolidation: This refers to the merger of various bronze tables that share the same structure and purpose; these tables are separate in the bronze layer simply due to the nature of the ingest, but are better served as a single table.
- Enrichment: This refers to the addition and/or modification of a bronze table with information from other tables.
In many real-world pipelines, consolidation and enrichment go hand-in-hand; in certain architectural approaches, in fact, a consolidated table with minimal transformations can be considered a bronze layer table still (especially if the data is not ready to act as a source of truth for business analysts).
Let’s take a deep dive into the approaches that we landed on for our silver transformation.
a. Views from bronze: totals and followers
totals and followers are great examples of bronze tables that do not need further processing, and thus can be exposed as views on the silver layer. The DDL used to create both of them is below:
CREATE VIEW IF NOT EXISTS silver.linkedin.totals AS
SELECT date, impressions, engagements FROM bronze.linkedin.totals;
CREATE VIEW IF NOT EXISTS silver.linkedin.followers AS
SELECT date, new_followers FROM bronze.linkedin.followers;

I decided not to include timestamps in these views, deeming them not relevant from the point of view of an analyst looking at the dataset.
b. Consolidation from bronze staging: impressions and engagements
The impressions and engagements silver tables are derived from a merger, i.e. consolidation, of the staging impressions and engagements tables in the bronze layer, with the staging tables removed once the consolidation is done and dusted. An additional field is added to distinguish between the different staging tables: the analytics date, derived from the suffix of the staging table name.
For a discussion of staging tables, refer to the previous article in the series on ingesting data into the bronze layer.
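To make the suffix-to-date derivation concrete, here is a minimal, hypothetical helper (the actual pipeline code later in this section does the same thing inline rather than via a function):

```python
import datetime
import re


def analytics_date_from_name(table_name: str, prefix: str = "impressions") -> datetime.date:
    """Derive the analytics date from a staging table name such as
    'impressions_2025_11_03'. Hypothetical helper for illustration;
    not part of the actual pipeline code."""
    match = re.match(rf"{prefix}_(\d{{4}})_(\d{{2}})_(\d{{2}})$", table_name)
    if match is None:
        raise ValueError(f"{table_name} does not match the staging naming pattern")
    return datetime.date(*(int(part) for part in match.groups()))
```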
In the real world, calculated fields such as day of week or change in metric could also be added to the silver table over time, as BI analysts, data analysts and other end users of the table find themselves calculating these fields more and more often. Alternatively, such fields could be added in the gold or semantic layer instead, tuned to the particular business case. Once again, there is no hard and fast rule about where such metrics should go.
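As a toy illustration of such derived fields (plain Python with made-up numbers, rather than an actual silver or gold transformation):

```python
import datetime

# Made-up daily totals, for illustration only
dates = [
    datetime.date(2025, 11, 1),
    datetime.date(2025, 11, 2),
    datetime.date(2025, 11, 3),
]
impressions = [120, 150, 90]

# Day of week, as an analyst might want it pre-computed
day_of_week = [d.strftime("%A") for d in dates]

# Day-over-day change in the metric; undefined for the first day
impressions_change = [None] + [b - a for a, b in zip(impressions, impressions[1:])]
```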

There is one more enhancement to make. Based on our analysis of the data sources, we know that only the metrics for the 50 top posts for daily impressions or engagements are ingested individually. However, we also have the total daily impressions and engagements numbers. Based on that, we can calculate an “other” record that consolidates the metrics for all posts outside the daily top 50. For example, if the top 50 posts account for 900 impressions on a day with 1,000 total impressions, the “other” record carries the remaining 100.

So what does the code look like? Let’s see the example for ingesting the impressions silver table:
import re

import pandas as pd
from delta.tables import DeltaTable
from pyspark.sql import Row
from pyspark.sql.functions import sum as spark_sum

# 1. define our constants
BRONZE_CATALOG = "bronze"
BRONZE_SCHEMA = "linkedin"
BRONZE_IMPRESSIONS_TABLE_PREFIX = "impressions"
BRONZE_TOTALS_TABLE = "totals"
SILVER_CATALOG = "silver"
SILVER_SCHEMA = "linkedin"
SILVER_IMPRESSIONS_TABLE = "impressions"

# 2. set our input and output variables
# extract list of daily bronze staging impressions tables
# based on regular expression matching
bronze_impressions_tables = [
    table.name
    for table in spark.catalog.listTables(f"{BRONZE_CATALOG}.{BRONZE_SCHEMA}")
    if re.match(
        rf"{BRONZE_IMPRESSIONS_TABLE_PREFIX}_\d{{4}}_\d{{2}}_\d{{2}}$",
        table.name,
    )
]
silver_impressions_table = (
    f"{SILVER_CATALOG}.{SILVER_SCHEMA}.{SILVER_IMPRESSIONS_TABLE}"
)

# 3. execute the ingestion
# Process daily bronze staging impressions tables
# and merge into silver layer
print("Processing daily impressions tables...")
for bronze_impression_table in bronze_impressions_tables:
    # Extract date suffix from table name
    # and convert to a date (the suffix is in %Y_%m_%d format)
    date_str = bronze_impression_table.replace(
        f"{BRONZE_IMPRESSIONS_TABLE_PREFIX}_", ""
    )
    analytics_date = pd.to_datetime(date_str, format="%Y_%m_%d").date()
    table_name = f"{BRONZE_CATALOG}.{BRONZE_SCHEMA}.{bronze_impression_table}"
    print(f"Processing {table_name}...")

    silver_impressions_df = spark.sql(f"""
        SELECT
            post_url,
            to_date(post_publish_date, 'M/d/yyyy') AS post_publish_date,
            impressions,
            to_date('{date_str}', 'yyyy_MM_dd') AS analytics_date
        FROM {table_name}
    """)

    silver_impressions_totals_df = spark.sql(f"""
        SELECT
            impressions,
            source_file,
            source_file_timestamp,
            ingestion_timestamp
        FROM {BRONZE_CATALOG}.{BRONZE_SCHEMA}.{BRONZE_TOTALS_TABLE}
        WHERE to_date('{date_str}', 'yyyy_MM_dd') = date
    """)

    # Calculate the impressions not covered by the top-50 staging table
    impressions_others = (
        silver_impressions_totals_df.select("impressions").collect()[0].impressions
        - silver_impressions_df.agg(spark_sum("impressions")).collect()[0][0]
    )
    if impressions_others > 0:
        silver_impressions_df = silver_impressions_df.union(
            spark.createDataFrame([
                Row(
                    post_url="others",
                    post_publish_date=analytics_date,
                    impressions=impressions_others,
                    analytics_date=analytics_date,
                )
            ])
        )

    if spark.catalog.tableExists(silver_impressions_table):
        print(f"Table {silver_impressions_table} exists, merging data...")
        delta_table = DeltaTable.forName(spark, silver_impressions_table)
        (
            delta_table.alias("t")
            .merge(
                silver_impressions_df.alias("s"),
                "t.post_url = s.post_url"
                " AND t.analytics_date = s.analytics_date",
            )
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )
    else:
        print(f"Table {silver_impressions_table} does not exist, creating...")
        silver_impressions_df.write.format("delta").saveAsTable(
            silver_impressions_table
        )

    # Drop the staging table after processing
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
    print(f"Dropped table {table_name}")

To break down what’s happening here:
- We load the required libraries and define our constants, as per usual.
- We extract a list of all staging impressions tables based on matching the naming pattern that we’ve already established for these tables.
- For each staging table, we
- extract the analytics date from the table suffix;
- prepare the extract from the staging table;
- pull the total daily impressions from the totals table and calculate the remainder;
- add an “others” record to the extract if the remainder is more than 0, i.e. there are posts not covered by the existing ingestion;
- create or update the silver impressions table with the extract; and
- remove the staging table.
The code for ingesting engagements is very similar, so I leave it as an exercise for the reader.
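One way to avoid duplicating the two jobs would be to factor the metric-specific parts into a helper that builds the extraction SQL for either metric. This is a hypothetical refactoring of my own, not code from the actual pipeline:

```python
def staging_extract_sql(catalog: str, schema: str, table: str,
                        metric: str, date_str: str) -> str:
    """Build the SELECT that extracts one day's staging table into the
    silver layer; only the metric column name ('impressions' or
    'engagements') varies between the two jobs. Hypothetical helper
    for illustration."""
    return (
        f"SELECT post_url, "
        f"to_date(post_publish_date, 'M/d/yyyy') AS post_publish_date, "
        f"{metric}, "
        f"to_date('{date_str}', 'yyyy_MM_dd') AS analytics_date "
        f"FROM {catalog}.{schema}.{table}"
    )
```

The loop body could then call `spark.sql(staging_extract_sql(...))` once per metric, with everything else unchanged.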
c. Enrichment from bronze: posts
The silver posts table needs to pull from three different bronze tables:
- The posts table, which contains the bulk of the post metadata.
- The post_details table, which contains the timestamp of the post.
- The post_patch table, which contains post metadata that takes priority over whatever is in the posts table (that’s what patching is about after all).
Below is a diagram showing how this enrichment takes place:

In this instance, I’ve decided to use SQL to do the ingesting rather than Python. Databricks recommends sticking to SQL where possible so as to take advantage of their own query and table writing optimizations. For relatively small datasets such as what we’re dealing with, it doesn’t really matter which approach we end up using from an efficiency standpoint.
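To make the patching priority concrete: COALESCE simply returns the first non-null argument, so patch values win over the original posts values wherever both exist. In plain Python (illustrative only), the behaviour looks like this:

```python
def coalesce(*values):
    """Return the first argument that is not None, mirroring SQL's COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


# Patch data wins over the original post data when both exist:
unpatched_title = coalesce(None, "Original title")          # no patch row
patched_title = coalesce("Patched title", "Original title")  # patch present
```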
The combined query is as follows:
-- 1. Join patch to posts on post_url
-- Prioritize patch data over posts data
-- Deduplicate merged posts to ensure one row per post_url
CREATE OR REPLACE TEMP VIEW merged_posts_dedup AS
SELECT
    post_url,
    post_publish_date,
    post_publish_timestamp,
    link,
    title,
    content
FROM (
    SELECT
        posts.post_url AS post_url,
        CAST(posts.post_publish_date AS DATE) AS post_publish_date,
        post_details.post_timestamp AS post_publish_timestamp,
        COALESCE(patch.true_url, posts.link) AS link,
        COALESCE(patch.title, posts.title) AS title,
        COALESCE(patch.content, posts.content, posts.title) AS content,
        ROW_NUMBER() OVER (
            PARTITION BY posts.post_url
            ORDER BY
                patch.true_url DESC,
                patch.title DESC,
                patch.content DESC
        ) AS rn
    FROM
        bronze.linkedin.posts AS posts
    LEFT JOIN
        bronze.linkedin.linkedin_patch AS patch
    ON
        posts.post_url = patch.post_url
    LEFT JOIN
        bronze.linkedin.post_details AS post_details
    ON
        posts.post_url = post_details.post_url
)
WHERE rn = 1;

-- 2. Create silver posts table if not exists
CREATE TABLE IF NOT EXISTS silver.linkedin.posts (
    post_url STRING,
    post_publish_date DATE,
    post_publish_timestamp TIMESTAMP,
    link STRING,
    title STRING,
    content STRING
) USING DELTA;

-- 3. Use deduplicated view for MERGE
MERGE INTO silver.linkedin.posts AS t
USING merged_posts_dedup AS s
ON t.post_url = s.post_url
WHEN MATCHED THEN
    UPDATE SET
        post_publish_date = s.post_publish_date,
        post_publish_timestamp = s.post_publish_timestamp,
        link = s.link,
        title = s.title,
        content = s.content
WHEN NOT MATCHED THEN
    INSERT (
        post_url,
        post_publish_date,
        post_publish_timestamp,
        link,
        title,
        content
    )
    VALUES (
        s.post_url,
        s.post_publish_date,
        s.post_publish_timestamp,
        s.link,
        s.title,
        s.content
    );

We’re doing the following:
- Creating a temporary view (merged_posts_dedup) that combines our patch, post details and original posts data, using the COALESCE SQL function to determine which value wins during the merge. We also use the ROW_NUMBER window function to ensure that only one row per post URL is retained.1
- Creating the silver posts table if it does not yet exist.
- Merging the temporary view into the silver posts table.
In effect, we’re overwriting our silver posts table every time we run the SQL query. For a relatively small amount of records, this is still fine; however, if there are thousands or millions of records, we will need to implement a form of incremental ingestion by adding and referencing the ingestion timestamps recorded in the bronze layer. We will return to this towards the end of the series.
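The high-watermark pattern behind such incremental ingestion can be sketched in plain Python, with dicts standing in for Delta rows (names and sample data are my own; the real thing would filter the bronze Delta table on its ingestion_timestamp column):

```python
import datetime


def rows_since(rows, watermark):
    """Keep only rows ingested after the last processed timestamp --
    the classic high-watermark pattern for incremental ingestion.
    Illustrative sketch over plain dicts, not actual pipeline code."""
    return [r for r in rows if r["ingestion_timestamp"] > watermark]


rows = [
    {"post_url": "a", "ingestion_timestamp": datetime.datetime(2025, 11, 1, 8, 0)},
    {"post_url": "b", "ingestion_timestamp": datetime.datetime(2025, 11, 3, 8, 0)},
]
watermark = datetime.datetime(2025, 11, 2, 0, 0)

new_rows = rows_since(rows, watermark)  # only post 'b' is new
# After processing, advance the watermark for the next run
new_watermark = max(r["ingestion_timestamp"] for r in rows)
```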
III. What’s next?
In the next article, we will finally get to build our gold/semantic layer. See you there!
- 30 November 2025: Edited SQL to ensure uniqueness on merge. ↩︎