Build Your Own LinkedIn Analytics Part 6: Modelling the Data
Published on Medium on 10 November 2025
In the previous article, we established our Single Source of Truth (SSOT) for our LinkedIn analytics data in the silver layer on our Databricks Free lakehouse. In this article, we will proceed to model our data to prepare it for its next destination: the dashboard.
TL;DR
- We explore the different approaches to modelling business data.
- Based on those approaches, we establish our gold table schemas and transformations.
I. Approaches to Modelling Data
Before we tackle modelling our LinkedIn analytics data, let’s talk about how data is modelled in general.
The most important thing to note is that data is modelled to serve the analytical needs of the end user. That means that the modelled data needs to be:
- Relevant, i.e. only needed data included.
- Easy to work with, for easier self-service analytics.
- Adaptable for various perspectives, enabling agility and future-proofing.
Now let’s look at the possible approaches.
a. Split it Up: Fact and Dimension Tables
We see fact and dimension tables at play when dealing with star and snowflake schemas (examples are as below).


Here are definitions for these tables:
- Fact tables are used to store quantitative metrics such as impressions and engagements, along with foreign keys associated with dimension tables.
- Dimension tables are used to store metadata or descriptive elements associated with their parent tables (usually fact tables, but can be other dimensions tables in a snowflake schema).
If you’re familiar with normalised schemas in a traditional database, this pattern will look familiar to you; and indeed, this approach is often used to make joining data for analysis easier and to avoid duplication of data.
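To make the fact/dimension split concrete, here is a minimal sketch using Python's built-in sqlite3 module: a tiny fact table holding a metric plus a foreign key, a dimension table holding descriptive fields, and the join that analysis typically performs. The table names and rows are hypothetical, chosen only to illustrate the pattern.

```python
import sqlite3

# Hypothetical star-schema fragment: one fact table (metrics + foreign key)
# and one dimension table (descriptive attributes).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_post (post_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE fct_impressions (post_id TEXT, analytics_date TEXT, impressions INTEGER);
INSERT INTO dim_post VALUES ('p1', 'My first post');
INSERT INTO fct_impressions VALUES ('p1', '2025-11-01', 120);
""")

# Analysis joins the fact back to its dimension via the foreign key.
row = conn.execute("""
    SELECT d.title, f.impressions
    FROM fct_impressions f
    JOIN dim_post d ON d.post_id = f.post_id
""").fetchone()
print(row)  # ('My first post', 120)
```

The join cost here is trivial, but it grows with every extra dimension table — which is exactly the trade-off the next two approaches address.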
The snowflake schema is out of fashion due to its complexity and computational intensity (associated with the increased number of joins); it was more relevant in an era when storage space was at a premium. That said, such a schema may become appropriate for complex data domains where the risks associated with duplication are sufficiently high.
For more information on star and snowflake schemas, see the introduction by DataCamp.
b. Everything At Once: One Big Table

With storage becoming ever cheaper and more abundant, it has become common to construct a single table with all the metrics and dimensions needed, regardless of the amount of duplication. This One Big Table approach aims to greatly reduce complexity and query time by eliminating the need for joins; the end-user only needs to filter by the specific columns they need.
This approach can become unwieldy quite rapidly if the number of metrics and dimensions climbs into the hundreds, and the presence of duplicates can complicate the updating of the table. That’s where the next approach comes into play.
c. Hybrid Approach

Just like a hybrid vehicle with both electric and gasoline components, the hybrid approach seeks to combine the elements of both approaches into a whole that is more than the sum of its parts. The exact approach will depend, as always, on the profile of the data in general. This is the approach that we shall be taking.
II. Our Gold Table Schemas and Transformations
We are going with the following approach:
- One big table for daily post statistics that takes in the data from totals, impressions, engagements and posts from the silver layer.
- One big table for daily profile statistics that takes in the data from totals and followers from the silver layer.
- Dimension tables for dates and timestamps derived from the dates in all the silver tables as well as timestamps from the posts silver table.


This hybrid approach strikes a balance between ease of querying and flexibility. Let’s delve into the details of these tables and their transformations to see how that balance is achieved.
a. One Big Fact Table: fct_daily_post_statistics
The aim of the daily post statistics table is to consolidate all the data about each of the recorded posts into a single record per post. This includes the following:
- Metadata such as a post ID, post publish date and timestamp, and post content.
- Base metrics, namely impressions and engagements.
- Derived metrics or measures, e.g. days since post and changes in impressions and engagements. These could also be calculated and recorded in the silver layer table as alluded to in the previous post when discussing enrichment.
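The post_id construction above pairs the publish date with a slug extracted from the post link via a regular expression. Here is a small Python sketch of that extraction using the same pattern as the DDL; the LinkedIn URL shown is hypothetical.

```python
import re

def make_post_id(publish_date: str, link: str) -> str:
    """Combine the publish date with the slug pulled from the post link,
    mirroring the regexp_extract pattern used in the gold DDL."""
    match = re.search(r'.*_(.*)-activity.*', link)
    slug = match.group(1) if match else link
    return f"{publish_date}: {slug}"

# Hypothetical LinkedIn post URL for illustration
link = "https://www.linkedin.com/posts/jane-doe_my-post-title-activity-7123456789"
print(make_post_id("2025-11-01", link))  # 2025-11-01: my-post-title
```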

We’re using a materialized view rather than a traditional table for this gold-level table. A materialized view behaves like a view whose results are persisted on disk until the next refresh; see the Databricks documentation for a detailed explanation. Using this approach means that we have outsourced decisions about query optimization and clustering to the Databricks query engine.
This is as good a time as any to address why partitioning hasn’t yet been mentioned as an approach to optimize the transformation process. By default, Databricks uses liquid clustering to automatically seek an optimal way to cluster Delta Lake managed tables. Unless we have very specific partitioning needs for our tables and we are dealing with massive amounts of data, this is good enough for our purposes, but we will come back to this towards the end of the series.
With that out of the way, here is the DDL for the materialized view:
CREATE MATERIALIZED VIEW gold.linkedin.fct_daily_post_statistics AS
-- 1: select daily stats for individual posts
SELECT
concat(
i.post_publish_date,
": ",
regexp_extract(p.link, r'.*_(.*)-activity.*')
) AS post_id,
i.post_publish_date,
p.post_publish_timestamp,
impressions,
coalesce(engagements, 0) AS engagements,
i.analytics_date,
DATEDIFF(i.analytics_date, i.post_publish_date) AS days_since_post,
coalesce(impressions - LAG(impressions, 1) OVER (
PARTITION BY i.post_url ORDER BY i.analytics_date
)) AS change_in_impressions,
coalesce(engagements - LAG(engagements, 1) OVER (
PARTITION BY i.post_url ORDER BY i.analytics_date
)) AS change_in_engagements,
p.title,
p.content
FROM
silver.linkedin.impressions i
LEFT JOIN silver.linkedin.engagements e
ON i.post_url = e.post_url
AND i.analytics_date = e.analytics_date
LEFT JOIN silver.linkedin.posts p
ON i.post_url = p.post_url
WHERE i.post_url != "others"
UNION ALL
-- 2: select daily stats for posts outside of top 50 for the day
SELECT
"others" AS post_id,
NULL AS post_publish_date,
NULL AS post_publish_timestamp,
impressions,
coalesce(engagements, 0) AS engagements,
i.analytics_date,
NULL AS days_since_post,
NULL AS change_in_impressions,
NULL AS change_in_engagements,
NULL AS title,
NULL AS content
FROM
(
SELECT
impressions,
analytics_date
FROM
silver.linkedin.impressions
WHERE post_url = "others"
) i
LEFT JOIN (
SELECT
engagements,
analytics_date
FROM
silver.linkedin.engagements
WHERE post_url = "others"
) e
ON i.analytics_date = e.analytics_date
UNION ALL
-- 3: select total stats
SELECT
"total" AS post_id,
NULL AS post_publish_date,
NULL AS post_publish_timestamp,
t.impressions,
t.engagements,
t.date AS analytics_date,
NULL AS days_since_post,
coalesce(
t.impressions - prev_t.impressions
) AS change_in_impressions,
coalesce(
t.engagements - prev_t.engagements
) AS change_in_engagements,
NULL AS title,
NULL AS content
FROM
silver.linkedin.totals t
LEFT JOIN silver.linkedin.totals prev_t
ON prev_t.date = DATE_SUB(t.date, 1)
This query is quite complex, so here’s a visual breakdown:

We have three separate queries in play here:
- Daily records by post is the most complex query that pulls from the impressions, engagements and posts tables in the silver layer. We will look at the details of its transformations in a bit.
- Daily records for “others” pulls only from the impressions and engagements tables since there are no associated post records; we also skip recording any changes in impressions and engagements here because the “others” record does not occur enough to be meaningful, i.e. we rarely get more than 50 posts with impressions or engagements for any given day.
- Daily records for totals pulls from the silver totals table; there are no associated post records and the totals for impressions and engagements are already in the totals table. The only complication is the use of a left join to retrieve impressions and engagements from the previous day.
These three queries are then combined with a UNION ALL, and that is then our daily post statistics table.
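The previous-day self-join used in the totals query can be mimicked in plain Python with a dictionary lookup; the sketch below uses hypothetical daily numbers to show why a missing previous day yields a null change.

```python
from datetime import date, timedelta

# Hypothetical daily totals keyed by date
totals = {
    date(2025, 11, 1): {"impressions": 100, "engagements": 8},
    date(2025, 11, 2): {"impressions": 150, "engagements": 12},
}

def change_in(metric, day):
    """LEFT JOIN to the previous day's row: None when no prior row exists."""
    prev = totals.get(day - timedelta(days=1))
    if prev is None:
        return None
    return totals[day][metric] - prev[metric]

print(change_in("impressions", date(2025, 11, 2)))  # 50
print(change_in("impressions", date(2025, 11, 1)))  # None
```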
Now let’s deep dive into the transformations for daily records by post. I’ve repeated the SQL code below for reference:
SELECT
concat(
i.post_publish_date,
": ",
regexp_extract(p.link, r'.*_(.*)-activity.*')
) AS post_id,
i.post_publish_date,
p.post_publish_timestamp,
impressions,
coalesce(engagements, 0) AS engagements,
i.analytics_date,
DATEDIFF(i.analytics_date, i.post_publish_date) AS days_since_post,
coalesce(impressions - LAG(impressions, 1) OVER (
PARTITION BY i.post_url ORDER BY i.analytics_date
)) AS change_in_impressions,
coalesce(engagements - LAG(engagements, 1) OVER (
PARTITION BY i.post_url ORDER BY i.analytics_date
)) AS change_in_engagements,
p.title,
p.content
FROM
silver.linkedin.impressions i
LEFT JOIN silver.linkedin.engagements e
ON i.post_url = e.post_url
AND i.analytics_date = e.analytics_date
LEFT JOIN silver.linkedin.posts p
ON i.post_url = p.post_url
WHERE i.post_url != "others"
- We join impressions with engagements using post_url and analytics_date as the join keys, as well as with posts using post_url as the join key. We use impressions as our left-most (aka reference) table because a) any post with engagements will also have impressions but not the other way round; and b) we are not interested in posts with no impressions.
- We create post_id by combining the post publish date with a shortened title for the post derived from the post link.
- We pull post_publish_date and analytics_date from the impressions table, and post metadata such as post_publish_timestamp, title and content from the posts table.
- Impressions and engagement metrics are pulled from their respective tables, with the engagements metrics wrapped with the COALESCE function to handle any missing engagement values (since there are many dates where there are impressions but no engagements).
- days_since_post is essential for analysing the performance over time of the post in question.
- We use windowing functions to extract changes in impressions and engagements.
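The LAG windows partition by post and order by date before differencing. A pure-Python sketch of the same per-post logic, using hypothetical rows:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (post_url, analytics_date, impressions) rows
rows = [
    ("url_a", "2025-11-01", 100),
    ("url_a", "2025-11-02", 160),
    ("url_b", "2025-11-01", 40),
]

# Equivalent of PARTITION BY post_url ORDER BY analytics_date,
# then subtracting LAG(impressions, 1) from the current value.
changes = {}
rows.sort(key=itemgetter(0, 1))
for url, group in groupby(rows, key=itemgetter(0)):
    prev = None
    for _, day, impressions in group:
        changes[(url, day)] = None if prev is None else impressions - prev
        prev = impressions

print(changes[("url_a", "2025-11-02")])  # 60
print(changes[("url_b", "2025-11-01")])  # None
```

As in the SQL, the first row of each partition has no predecessor, so its change is null.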
b. Another (not so big) Fact Table: fct_daily_profile_statistics
The aim of the daily profile statistics table is to consolidate all the profile-specific data into a single record per day. In this case, it is simply total impressions and engagements from the totals silver table, as well as new followers from the followers silver table. We also calculate the change in impressions and engagements.

- Wait, didn’t we already include the daily total impressions and engagements in the daily post statistics table?
We did, but if we are querying the daily profile statistics table, we would want to have aggregate data about impressions and engagements in this table. This kind of duplication is not uncommon in the gold layer.
- What happened to the total number of followers?
We do not have a reliable way of retrieving the total number of followers for a given date (yet another limitation of the Excel exports), so we are sticking to only the new follower metric.
Often our analysis is limited by our data; this and other limitations need to be clearly communicated to business stakeholders early on as part of any requirements gathering.
The DDL for the materialized view is as below:
CREATE MATERIALIZED VIEW gold.linkedin.fct_daily_profile_statistics AS
SELECT
t.date AS analytics_date,
f.new_followers,
t.impressions,
t.engagements,
coalesce(
t.impressions - prev_t.impressions
) AS change_in_impressions,
coalesce(
t.engagements - prev_t.engagements
) AS change_in_engagements
FROM
silver.linkedin.totals t
LEFT JOIN silver.linkedin.totals prev_t
ON prev_t.date = DATE_SUB(t.date, 1)
LEFT JOIN silver.linkedin.followers f
ON f.date = t.date
ORDER BY t.date ASC
c. The Time Dimension: dim_date, dim_timestamp
Now we come to our two dimension tables, both related to time in different granularities. Time dimensions are the most commonly defined dimensions for good reason; practically any kind of historical analytics and forecasting requires a time dimension.
Let’s first take a look at the date dimension table, dim_date:

- day of month is good for seeing if there is any trend related to the beginning, middle or end of the month.
- Similarly, month is good for seeing if there are any monthly trends. This will become more useful once we have more than a year’s data; for now this is mainly for downstream filtering or grouping. week does something similar but in a more granular weekly fashion.
- year will let us perform analytics based on the year, again useful once we have built up a few years’ worth of data.
- weekday and weekday_str let us analyse trends based on the day of the week; this is the most useful date dimension for us right now.
- is_weekend lets us analyse the difference between weekdays and weekends. We can define similar fields for other categorical features, e.g. is_sg_public_holiday.
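The fields above can all be derived from a single date value. A Python sketch of the same derivations, assuming Spark's WEEKDAY convention (Monday = 0), which Python's datetime happens to share:

```python
from datetime import date

def date_dimensions(d):
    """Derive the dim_date fields; weekday follows the Monday = 0
    convention, so Saturday and Sunday are 5 and 6."""
    weekday = d.weekday()  # Python also uses Monday = 0
    return {
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "week": d.isocalendar()[1],  # ISO week, as WEEKOFYEAR does
        "weekday": weekday,
        "weekday_str": d.strftime("%a"),
        "is_weekend": weekday >= 5,
    }

dims = date_dimensions(date(2025, 11, 8))  # a Saturday
print(dims["weekday"], dims["is_weekend"])  # 5 True
```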
The DDL for this materialized view is as below.
CREATE MATERIALIZED VIEW gold.linkedin.dim_date AS
SELECT
date,
DAY(date) AS day,
MONTH(date) AS month,
YEAR(date) AS year,
WEEKOFYEAR(date) AS week,
WEEKDAY(date) AS weekday,
DATE_FORMAT(date, 'E') AS weekday_str,
WEEKDAY(date) >= 5 AS is_weekend
FROM (
SELECT date FROM silver.linkedin.totals
UNION
SELECT post_publish_date AS date FROM silver.linkedin.posts
)
ORDER BY date ASC
Now let’s examine the schema for the timestamp dimension table, dim_timestamp:

- Besides the original timestamp, we add parts from the timestamp that include date, hour and minute. date can be used to join with dim_date and obtain date-specific enrichment, while hour and minute can be used to analyse trends over the course of a day or hour. second is not a component that can be extracted from our current data source, and anyway we do not need that level of granularity (who’s analysing posting by the second?).
- We also distinguish between UTC and SGT (aka Singapore Time) timezones for each of our timestamp dimensions. As someone based in Singapore with a mainly Asia Pacific audience, the Singapore-related fields are far more relevant to me; the UTC fields are preserved to provide a point of reference.
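The UTC-to-SGT conversion mirrors from_utc_timestamp(ts, 'Asia/Singapore') in the DDL. A Python sketch, assuming Python 3.9+ with zoneinfo available and a hypothetical publish timestamp; note how a late UTC evening rolls over to the next SGT date:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical post publish timestamp in UTC; Singapore is UTC+8
# with no daylight saving, so the offset is constant.
utc_ts = datetime(2025, 11, 1, 23, 30, tzinfo=timezone.utc)
sg_ts = utc_ts.astimezone(ZoneInfo("Asia/Singapore"))

print(sg_ts.date(), sg_ts.hour, sg_ts.minute)  # 2025-11-02 7 30
```

This rollover is exactly why sgt_date is stored separately from utc_date: grouping late-night posts by the wrong calendar date would skew any time-of-day analysis.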
The DDL for dim_timestamp is as below:
CREATE MATERIALIZED VIEW gold.linkedin.dim_timestamp AS
SELECT
timestamp,
sg_timestamp,
DATE(timestamp) AS utc_date,
HOUR(timestamp) AS utc_hour,
MINUTE(timestamp) AS utc_minute,
DATE(sg_timestamp) AS sgt_date,
HOUR(sg_timestamp) AS sgt_hour,
MINUTE(sg_timestamp) AS sgt_minute
FROM (
SELECT DISTINCT
post_publish_timestamp AS timestamp,
from_utc_timestamp(
post_publish_timestamp, 'Asia/Singapore'
) AS sg_timestamp
FROM silver.linkedin.posts
WHERE post_publish_timestamp IS NOT NULL
)
ORDER BY timestamp ASC
d. Other Possibilities
These are not the only possibilities for defining our gold-layer tables and the fields within them, not by a long shot. Below are some examples for future exploration:
- Annotation fields for the time dimension tables, which can contain information specific to the respective date or timestamp. The source of this information would ultimately be an annotations table that is manually filled and ingested into the bronze layer, similar to the post_patch table. This can be useful for filtering out outliers; an example is my post on the 2025 AWS outage that was picked up by LinkedIn News and hence has far more impressions and engagements than all my other posts as of the time of writing.
- Additional measures in daily profile statistics table, in particular related to either the number of posts in the specific day / hour or the number of days / hours since the last post.
- Categorical fields to classify the types of posts; possible types of categories include post type (e.g. “article”, “post”) and content type (e.g. “certification”, “news”, “update”). These would need to be either derived or hard-coded.
III. What’s Next?
In the next article, we will build a dashboard to finally examine our LinkedIn data. See you there!