Build Your Own LinkedIn Analytics Part 3: Creating the Data Architecture

Published on Medium on 21 October 2025

In the previous article, I introduced my rationale behind starting this end-to-end data project for my LinkedIn analytics. Today we’ll dive into the specifics, starting with what LinkedIn data to ingest and where to ingest it from.

This article is part of my LinkedIn analytics blog series.

TL;DR

  • Set up Databricks Free Edition account
  • Design a medallion architecture, then implement it in the Databricks context

I. Setting up Databricks Free Edition account

Let’s start by creating your Databricks Free Edition account. The steps are as follows:

  1. Go to https://www.databricks.com/ and click ‘Try Databricks’.
  2. Use your favoured way of signing up, whether it’s via Google account, Microsoft account or email.
  3. Choose ‘Get Free Edition’ when prompted.

Note: There are certain limits with the Free Edition compared to the full-fledged Databricks. I’ll note them where they’re relevant during the rest of this series, but for the curious, see https://docs.databricks.com/aws/en/getting-started/free-edition-limitations.

  4. Name your account and choose your location. As an aside, I don’t believe location matters for the Free Edition, since the workspace is always created in a US availability zone.
  5. After a few moments, you’re greeted with a spanking new Databricks data platform. Now we can get started.

II. Data Architecture

Before proceeding further, let’s take a moment to review our high-level data architecture.

High-level data architecture

This is a version of what is known as “medallion architecture”, so-called because the 3 primary layers are labelled like podium finishers in the Olympics. The specifics can vary between setups, but at a high level the medallion architecture consists of the following:

  1. Bronze layer: Also variously known as the staging, ingestion or extraction layer. This is where data from outside the data warehouse is supposed to land, with minimal or no cleaning and transformations. The idea is that this layer is a source of truth for external data, and any potential reingestion of raw data or any audits can start from this layer.
  2. Silver layer: Also known as the transformation layer, because this is where much of the transformation work takes place. Here, data from the bronze layer is cleaned, relabelled and enriched (especially with business metadata), with the end goal being a Single Source of Truth (SSOT) that can be used for subsequent business use cases — hence why this is also known as the curated layer. This is usually the layer that data analysts use to explore their use cases in larger enterprises.
  3. Gold layer: Also variously known as the semantic or modelling layer, this is where data from the silver layer is further transformed to serve specific business use cases, such as semantic models for dashboards. In a larger enterprise with a mature data governance setup, such gold layer datasets would have differential access such that the various stakeholders can only access the datasets relevant to their use case — in other words, the gold layer acts as a data mart.

In our case, I’ve also created a landing layer to hold our raw Excel files. Medallion layers generally contain structured data that is addressable via SQL, whereas raw Excel files are examples of unstructured data that require some processing to be accessible by a SQL query engine. If we were accessing the LinkedIn API, the raw JSON responses would also be candidates for “landing” in our landing layer.

So what is that going to look like in Databricks?

The general data architecture of a Databricks lakehouse. Source: Database objects in Databricks (https://docs.databricks.com/aws/en/database-objects/)

Within the workspace that you have access to, you can create multiple catalogs, which can in turn contain multiple schemas. Schemas can be thought of as collections of data objects; in some other data warehouses they are known as datasets. The types of data objects that can be created inside a schema are as follows:

  • Tables are structured collections of data organized by rows and columns. This is how we are going to organize our data in the medallion layers.
  • Views are saved queries against one or more tables. These can come in handy in the gold layer.
  • Volumes are logical volumes of non-tabular data in cloud object storage. Our Excel files can go here.
  • Functions are user-defined, reusable routines that can be called from other queries, including within a data pipeline. We are not using these in our pipeline, nor are we using machine learning models, which also come under this category.
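To make these object types concrete, here is how each might be created in SQL. All object names below are hypothetical, purely for illustration; the three-level names assume the layer-per-catalog layout chosen later in this article.

```sql
-- Hypothetical examples of each object type (names are illustrative only)

-- A table: rows and columns of structured data
CREATE TABLE silver.linkedin.posts (post_url STRING, impressions BIGINT);

-- A view: a saved query over one or more tables
CREATE VIEW gold.linkedin.top_posts AS
  SELECT post_url, impressions
  FROM silver.linkedin.posts
  ORDER BY impressions DESC
  LIMIT 10;

-- A volume: a home for non-tabular files such as our Excel exports
CREATE VOLUME landing.linkedin.posts;

-- A function: reusable logic callable from other queries
CREATE FUNCTION silver.linkedin.engagement_rate(engagements BIGINT, impressions BIGINT)
  RETURNS DOUBLE
  RETURN engagements / CAST(impressions AS DOUBLE);
```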

There are a few approaches to implementing the medallion structure we just described in Databricks:

  1. Use catalogs as the layers, and schemas to organize data within each layer. This approach is generally preferred by small or centralized data teams in full control of their Databricks warehouse.
  2. Use catalogs to organize data by domain, and implement the layers as schemas within each catalog. This approach is often preferable when an enterprise’s data domain boundaries are clearly defined and different domains have distinct governance or Service-Level Agreement (SLA) needs.

In my case, I decided on the first approach, which is to use catalogs as the layers, and schemas to organize data within each layer. But there isn’t a hard and fast rule about these things; ultimately it depends on your use case.
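To see the difference at a glance, here is how a hypothetical posts table would be addressed under each approach (the table name is illustrative only):

```sql
-- Approach 1 (used in this project): catalog = layer, schema = domain
SELECT * FROM silver.linkedin.posts;

-- Approach 2: catalog = domain, schema = layer
SELECT * FROM linkedin.silver.posts;
```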

In most other data warehouses, there are only two levels in the object hierarchy (datasets/schemas, then tables etc.) rather than three (catalogs, schemas, then tables etc.). How one translates the high-level architecture into those data warehouses (and in general) depends on the technical specifications of the data warehouses, the data governance requirements and the business use cases.

III. Creating the data layers

Navigate to the ‘Catalog’ section, and you can see the default and system catalogs and datasets that Databricks has set up. We can now create the following catalogs using the default settings:

  1. landing: For our raw Excel data files
  2. bronze, silver, gold: For our medallion layers of structured data
  3. metadata: This is for any miscellaneous data that is related to our data infrastructure rather than to the data itself. We will use this to store our logs, as detailed in the subsequent section.
Create catalogs via the ‘+’ button.
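If you prefer SQL to clicking through the UI, the same catalogs can be created from a notebook or the SQL editor:

```sql
CREATE CATALOG IF NOT EXISTS landing;
CREATE CATALOG IF NOT EXISTS bronze;
CREATE CATALOG IF NOT EXISTS silver;
CREATE CATALOG IF NOT EXISTS gold;
CREATE CATALOG IF NOT EXISTS metadata;
```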

You may have noticed that there are several options for types of catalogs. For our purposes, standard is what we are going for, and it is what is used for the majority of use cases. Here’s a quick rundown of the other types of catalogs and their use cases:

  • Foreign catalogs are for accessing structured data from outside Databricks as if it is part of the Databricks warehouse. This is especially useful for having Databricks act as the central node for a data federation strategy that ingests data from disparate databases and data warehouses.
  • Shared catalogs can be considered special instances of foreign catalogs, except that they are from different Databricks workspaces and thus have tighter integrations.
  • Databases are the newest type; these are serverless Postgres databases (Lakebase) on Databricks that are designed to support transactional workloads.

Note that for standard catalogs, we also have the option to choose our own cloud storage such as AWS S3, Google Cloud Storage (GCS) or Azure Data Lake Storage (ADLS) Gen2. We are going to stick with the default storage in this instance.

In Databricks Free Edition, only AWS S3 and Cloudflare R2 are available as custom cloud storage locations.

IV. Data Architecture: Schemas and Landing Volumes

It’s time to create our schemas, but what is the approach to take? Recall that we have three types of Excel files that we can choose to ingest. We have a few options:

  1. Create one schema per type of data domain, e.g. LinkedIn analytics.
  2. Create one schema per type of data, e.g. LinkedIn Posts vs LinkedIn Content.
  3. Create one schema per type of file, e.g. Excel file vs API extracts.

We face similar decisions when it comes to creating volumes within our landing schemas (or tables within our medallion schemas). Once again, there are no hard and fast rules; what you end up with will most likely conform to both your use case as well as whatever data governance policy you have in your team or organization.

For this use case, I’ve decided to go with the first approach, which is to define a linkedin schema for each medallion layer (including landing). Details in the figure below:

Detailed data schematic for our Databricks project.
  • There are only volumes in the landing catalog; volumes are the only data object that can store unstructured data such as Excel files.
  • The metadata catalog contains a logs schema; the intent is to have this as a central place to store all logs related to our forthcoming data pipeline(s). This is not the only possible approach; for instance, I could define a metadata schema for every medallion layer, and have a logs table within each of those schemas. This latter approach works better if you are defining differential access controls for each catalog.

Finally, let’s decide what our landing volumes are going to be. I settled on the following approach:

  • content_daily for storing the content analytics Excel files with a single-day time range.
  • content_historical for storing the content analytics Excel file with the maximum time range.
  • posts for storing the post analytics Excel files.

At this point, you should have a consistent naming policy at hand. I’ve gone with snake case and a time-element suffix, which means that I should stick with this convention going forward.

In the real world, it’s important to have an established naming convention so as to increase consistency and reduce confusion; either the convention needs to be clearly communicated to your stakeholders, or you need to conform to the convention already established among them.

We could also use a single volume and have folders within that volume that correspond to the different types of Excel files. Again, there is no single way to approach this architecture, so long as the approach satisfies your use case as well as relevant data governance policy — and so long as the architecture itself is well-documented, like we have just done.

V. Creating the rest of the initial data infrastructure

We’re now ready to create our schemas as well as landing zone volumes.

a. Creating schemas

Click on the landing catalog in the catalog explorer, and at the top right you can see a ‘Create schema’ button.

The landing catalog

We will create our linkedin schema in this way using the default options, and then rinse and repeat for the other catalogs. Also go ahead and create the logs schema inside the metadata catalog.
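These schemas can also be created in SQL rather than through the dialog box:

```sql
CREATE SCHEMA IF NOT EXISTS landing.linkedin;
CREATE SCHEMA IF NOT EXISTS bronze.linkedin;
CREATE SCHEMA IF NOT EXISTS silver.linkedin;
CREATE SCHEMA IF NOT EXISTS gold.linkedin;
CREATE SCHEMA IF NOT EXISTS metadata.logs;
```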

The dialog box for creating a new schema

b. Creating landing zone volumes

Navigate to the linkedin schema inside the landing catalog, then click the dropdown for ‘Create’ on the top right.

Click on ‘Volume’, then proceed to create the volume with default settings. Repeat for all the landing zone volumes. When you’re done, your catalog should appear as follows:

Initial data architecture on Databricks
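As with catalogs and schemas, the landing volumes can also be created in SQL:

```sql
CREATE VOLUME IF NOT EXISTS landing.linkedin.content_daily;
CREATE VOLUME IF NOT EXISTS landing.linkedin.content_historical;
CREATE VOLUME IF NOT EXISTS landing.linkedin.posts;
```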

Tables and views will be created once we start ingesting our data.

In the real world, and especially in a production environment, we will need to establish proper access controls and enforce data governance. That is beyond the scope of this personal project, but for the curious, refer to the relevant Databricks documentation on administration, security and compliance, and data governance.

c. Creating the pending folder in each landing volume

One last thing to do is to create a pending folder (or directory) in each of the landing volumes. Later on in the process we will be moving our files to either a processed folder or an errors folder depending on whether the ingestion succeeded or failed.

Folder structure and data flow within each landing volume.

For each landing volume, locate the ‘Create directory’ button near the top right corner.

Example screencap from the landing.linkedin.content_daily volume. The ‘Create directory’ button is to the bottom right of the screencap.

Then enter the name of the directory to be created, in this case pending, and click the ‘Create’ button.

Example screencap from the landing.linkedin.content_daily volume.

You will now be in the newly-created pending directory.

Example screencap from the pending folder/directory in the landing.linkedin.content_daily volume.
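If you’d rather script this step, the pending folders can be created programmatically. Below is a minimal sketch: the path-building helper is my own (not a Databricks API), while dbutils.fs.mkdirs is the standard notebook utility for creating directories in a volume.

```python
# Build the /Volumes paths for the "pending" folder in each landing volume.
# The volume names match the layout defined earlier in this article.
VOLUMES = ["content_daily", "content_historical", "posts"]

def pending_paths(catalog="landing", schema="linkedin", folder="pending"):
    """Return the full /Volumes path of the given folder for each landing volume."""
    return [f"/Volumes/{catalog}/{schema}/{vol}/{folder}" for vol in VOLUMES]

# In a Databricks notebook, you would then create the folders like this:
# for path in pending_paths():
#     dbutils.fs.mkdirs(path)
```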

VI. What’s Next?

With our initial data architecture in place, we are now ready to do some ingesting. See you in the next article!

Yingzhao Ouyang is an AI and data engineering specialist with a distinctive blend of humanities, business, and technical expertise, bringing a uniquely holistic perspective to enterprise data challenges that others with purely technical backgrounds miss. To find out more, follow his LinkedIn profile at https://www.linkedin.com/in/yzouyang/
