Build Your Own LinkedIn Analytics Part 2: Choosing the Data Source

Originally published on Medium on 14 October 2025

In the previous article, I introduced my rationale behind starting this end-to-end data project for my LinkedIn analytics. Today we’ll dive into the specifics, starting with what LinkedIn data to ingest and where to ingest it from.

This is part of the following blog series:

TL;DR

  • Manual Excel downloads are the only realistic option for individual users.
  • By defining our use case, we can better evaluate which data to ingest and/or preprocess.

I. Ingestion Options

While there are several options for extracting LinkedIn data, realistically speaking only one of them is practical for us as individual users.

a. Option 1: Manual Excel downloads from LinkedIn dashboards

This is the option that’s present by default on LinkedIn analytics dashboards, accessible via the ‘Export’ button. The limitations have been covered by the previous article in the series, but to re-state:

  • I cannot retrieve the content of the posts since only the activity-formatted links to the posts are included.
  • Daily statistics are only available on an aggregate basis across all posts; there is no breakdown by post unless the export is limited to a single day, an option that is not available for individual posts.
  • If I download files on a daily basis, I now have to deal with multiple files in order to get a consolidated view of my analytics.
  • Only the top 50 posts for a given time period and statistics are included in the extract. This is not yet a major problem given how recently I started my LinkedIn thought leadership efforts, but could grow to become a bigger issue further down the line.

In addition, there is no automatic way to download these Excel files, or to configure what Excel files should be downloaded.

b. Option 2: LinkedIn API access

In theory, LinkedIn API access can solve a lot of our issues. For example, the Member Posts Statistics endpoint offers a fine-grained, programmatic way to extract single post or aggregated posts analytics over an arbitrary date range, while the Posts endpoint allows for the retrieval of the contents, metadata and associated media (including articles, images and videos) of selected posts.

There is just one problem. 

All these APIs require a LinkedIn company page associated with your profile as well as with a registered company, and a developer application created on the LinkedIn Developer Portal.

This is not something that the majority of us will have access to (including myself), so it is a non-starter. However, if you are a business owner or otherwise manage your profile as a business, this is the preferred option.

c. Option 3: Automation of Excel downloads from LinkedIn dashboards

We could use web automation tools such as Selenium to help us obtain the Excel downloads. Nowadays there also exists the possibility of using AI agents such as Manus to automate the manual steps involved in extracting the appropriate Excel files from the LinkedIn dashboard. However, such automated tools are very much something that LinkedIn blocks and bans at every opportunity. 

The risk of having our personal LinkedIn profile blocked or banned in this instance is unacceptably high, so it is out of consideration.

We will touch on automation (especially the AI variety) in more detail in the final article of this series. Meanwhile, if you’ve found safe, sustainable ways to automate LinkedIn analytics for personal use, do share your experience in the comments or connect with me for broader discussion.

d. Option 4: Third-party Analytics

We touched on this in the previous article when discussing the business case for embarking on this data project. To recap:

  • The third-party solutions can be quite expensive
  • The third-party solutions are overkill for our use case

That said, if our needs outpace the scale of our data solution (e.g. we are managing multiple profiles and/or posting multiple times a day), these solutions could become a viable alternative.

This perfectly illustrates the “build or buy” dilemma that faces all organizations looking to address their use cases (not just data use cases). In a real world scenario, one should regularly revisit this question and determine whether precious internal resources should be spent on building an in-house solution; whether it makes more sense to purchase from a third-party vendor; or whether the two approaches can be combined into a hybrid solution.

e. Final Decision: Manual Excel downloads

Let’s review our options:

  1. API access is unavailable to us as individual users.
  2. Automation poses unacceptable risks.
  3. Third-party analytics are costly and overkill.

We are thus left with manual Excel downloads as our only viable option. As long as the initial and daily number of files to download are manageable, and the data available is suitable for our analysis, this is still an appropriate approach for a personal data product.

The below image illustrates our strategy to get the files to Databricks:

Process of ingesting LinkedIn Excel files into Databricks

If the number of downloads becomes unsustainable, or if the quality of the data deteriorates, it may be time to consider either jumping through the hoops to qualify for API access, or biting the bullet and purchasing a third-party LinkedIn analytics solution.

II. Examining the Raw Data

Before we dive into the raw data, we need to keep our goal (aka our use case) in mind. To recap in the form of a user story:

I want to analyze and compare the performance of my LinkedIn posts and articles so that I can determine my content strategy.

In order for the above to happen, the data that we ingest needs to be of a minimum quality. In fact, let’s take it directly from Databricks itself, which references the “Six Dimensions” data quality model defined by DAMA (an international organization for enterprise data management):

  • Consistency: Data values should not conflict with other values across data sets
  • Accuracy: There should be no errors in the data
  • Validity: The data should conform to a certain format
  • Completeness: There should be no missing data
  • Timeliness: The data should be up to date
  • Uniqueness: There should be no duplicates

We still need to add Relevance as an additional measure:

  • Relevance: The data should directly relate to our use case

With this in mind, let’s do a deeper dive into the Excel files available for download.

a. Excel File: Content Analytics

This is the file that is available for download on the overall Analytics page, with adjustable date filters. The format of its filename is as follows:

Content_{date start in YYYY-MM-DD}_{date end in YYYY-MM-DD}_{ProfileName}.xlsx

An example is as follows: 

Content_2024-08-17_2025-08-16_YingzhaoOuyang.xlsx
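Given this naming convention, the date range and profile name can be parsed straight out of the filename during ingestion. A minimal sketch in Python (the function name is mine):

```python
import re
from datetime import date

# Pattern for Content_{start}_{end}_{ProfileName}.xlsx exports
CONTENT_PATTERN = re.compile(
    r"^Content_(\d{4}-\d{2}-\d{2})_(\d{4}-\d{2}-\d{2})_(.+)\.xlsx$"
)

def parse_content_filename(filename: str) -> dict:
    """Extract start date, end date and profile name from a Content Analytics export."""
    match = CONTENT_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Not a Content Analytics export: {filename}")
    start, end, profile = match.groups()
    return {
        "start_date": date.fromisoformat(start),
        "end_date": date.fromisoformat(end),
        "profile": profile,
    }
```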

Before looking at the file, we can already make the following comments based on the date filters:

  • Consistency: The data can be considered consistent only if the current date is excluded from the export. This is because data for the current date is essentially live data that constantly updates, and our reliance on manual downloads means that we cannot control the timing of our data extraction. Hence, we can only consider data from the previous day and earlier.
  • Completeness: The earliest start date for extraction is 365 days before the current date; in other words, you cannot access analytics from more than 1 year ago. Since I started my thought leadership initiative 2 months ago, I am less interested in analytics from earlier than that, so this isn’t a big issue for me, but it is something you need to consider for your own use case.
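The two constraints above (exclude the current date, look back at most 365 days) pin down the widest extraction window we can safely use. A minimal sketch in Python (the function name is mine, and the 365-day lookback is assumed to be measured from the current date):

```python
from datetime import date, timedelta

def safe_extraction_window(today: date) -> tuple[date, date]:
    """Widest date range we can reliably export:
    from 365 days back up to (but excluding) the current date."""
    end = today - timedelta(days=1)      # exclude today's live, still-changing data
    start = today - timedelta(days=365)  # LinkedIn's 1-year lookback limit
    return start, end
```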

The following 5 sheets are in the file:

1. DISCOVERY: This is where the total number of impressions and of members reached is recorded.

The DISCOVERY tab. Seems relatively straightforward to parse…
  • Total impressions can be obtained by summing the impressions by day in the ENGAGEMENT tab; in other words, it is not unique. However, one might still consider ingesting it in order to do a consistency check on the numbers from impressions by day.
  • Members reached is a profile-wide number, not a post-specific number so it has less relevance to our use case.
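The consistency check mentioned above is cheap to express: the DISCOVERY total should equal the sum of the daily impressions from the ENGAGEMENT tab. A sketch (the function name and example values are illustrative):

```python
def check_impressions_consistency(discovery_total: int, daily_impressions: list[int]) -> bool:
    """True if the profile-wide total from DISCOVERY matches
    the sum of the per-day figures from ENGAGEMENT."""
    return discovery_total == sum(daily_impressions)
```

A mismatch would suggest the export was taken while the data was still updating, or that a day is missing from the range.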

2. ENGAGEMENT: This is where the daily numbers for impressions and engagement are recorded, aggregated across posts.

The ENGAGEMENT tab has a standard table that is very easy to parse.
  • Date appears to be formatted as ‘MM/DD/YYYY’. However, this is an Excel file, so the raw data beneath this is an Excel-specific date format, which can be parsed by most Python ingestion libraries. In addition, LinkedIn always exports its date and time in the UTC timezone. This observation applies to all subsequent date fields.
  • Impressions and engagements are summed by date as integers (whole numbers). As we will see, this sum is crucial for including metrics from posts that are not among the ones listed in the TOP POSTS tab.
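If you end up reading raw cell values rather than letting a library like pandas handle the conversion, note that Excel stores dates as a serial number of days counted from 1899-12-30 (an epoch that bakes in Excel’s historical 1900 leap-year quirk for serials after February 1900). A minimal converter:

```python
from datetime import date, timedelta

# Excel's day zero; valid for serials after 1900-02-28
# because of Excel's 1900 leap-year bug
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel serial date number to a Python date
    (UTC, per LinkedIn's export convention)."""
    return EXCEL_EPOCH + timedelta(days=serial)
```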

3. TOP POSTS: This is the only place in the file where you get the breakdown for individual posts, for both impressions and engagements. We can see that there is no breakdown by date; the data is instead aggregated over the entire date range for the extract.

The TOP POSTS tab has two separate tables to parse.
  • Note the post URL in this file; we will be revisiting this when we look at the post analytics file. Post publish date is also available here.
  • Impressions and engagements are summed by post URL as integers, aggregated over the selected date range. If the start date and end date are the same, we can obtain the total impressions and engagements by post URL for a specific date, which is exactly the metrics we are looking for.

4. FOLLOWERS: This shows the number of new followers for the profile on a daily basis. This sheet is relevant if we are using the follower trend to perform post performance analysis, but there are some complications with doing so, as will be elaborated on subsequently.

The FOLLOWERS tab. Relatively straightforward table to parse.
  • Total followers can be used to derive historical follower metrics up to 1 year prior by working backwards from the new followers metric. It does not have any other relevance.
  • New followers by date can be used to establish a follower trend, which may have utility as mentioned above.
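Working backwards from the current total, as described above, can be sketched as follows: given the total follower count on the most recent date and the new-followers-per-day series (oldest first), each earlier day’s end-of-day total is the current total minus the followers gained after that day. Note the assumption that unfollows are negligible, since the export does not expose them; the function name is mine.

```python
def derive_follower_history(current_total: int, new_followers_by_day: list[int]) -> list[int]:
    """Reconstruct end-of-day follower totals for each day in the series
    (oldest first) by subtracting daily gains back from the current total.
    Assumes no unfollows, which the LinkedIn export does not report."""
    totals = []
    running = current_total
    for gained in reversed(new_followers_by_day):
        totals.append(running)   # end-of-day total for this day
        running -= gained        # roll back to the previous day's total
    return list(reversed(totals))
```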

5. DEMOGRAPHICS: This feels like the most useless tab if we’re doing time-based analysis, as it just shows a snapshot of the percentage of followers who match different categories. That being said, we may come back to this if we decide that current demographic data is important for any future forecasting model.

The DEMOGRAPHICS tab. Also relatively straightforward to ingest but is there a point?

b. Excel File: Post Analytics

This is the file that is available for download on the individual post Analytics page; it is not possible to choose the time range. The format of its filename is as follows:

PostAnalytics_{ProfileName}_{Post ID}.xlsx

An example is as follows:

PostAnalytics_YingzhaoOuyang_7372453867575283712.xlsx
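As with the content file, the profile name and post ID can be pulled straight from the filename. A minimal sketch, assuming the profile name contains no underscores (the function name is mine):

```python
import re

# Pattern for PostAnalytics_{ProfileName}_{Post ID}.xlsx exports
POST_PATTERN = re.compile(r"^PostAnalytics_([^_]+)_(\d+)\.xlsx$")

def parse_post_filename(filename: str) -> dict:
    """Extract profile name and post ID from a Post Analytics export filename."""
    match = POST_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Not a Post Analytics export: {filename}")
    profile, post_id = match.groups()
    # Keep the post ID as a string: it exceeds 2^53 and is an identifier, not a number
    return {"profile": profile, "post_id": post_id}
```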

The following 2 sheets are in the file:

  1. PERFORMANCE: The aggregate figures for the post can be found here, as well as some metadata about the post.
  • The Post URL here is different from the Post URL we see in the TOP POSTS tab for overall analytics. In other words, there is no correlation between the two URLs. Why would you do this, LinkedIn?
  • This is the only way to extract Post Publish Time for a given post.
  • The post performance metrics are all in aggregate, and there is no way to break them up by date in this file. This is unfortunate because this is the only place where we see a breakdown of the types of engagements, as well as the members reached by the post.
  • The same applies to article performance numbers if the post has an associated article, which is especially unfortunate since this is the only way to extract article views and other article performance metrics.

2. TOP DEMOGRAPHICS: This tab is much the same as the DEMOGRAPHICS tab for overall analytics, except scoped down to the individual post. It thus shares the same issue of being aggregate percentages over an entire time period.

The TOP DEMOGRAPHICS tab.
  • A snapshot of post demographic information may be useful for determining who engages with which kind of post. Unfortunately, the inability to set a time range greatly limits the utility of this data: it can at best be used as a reference, but not relied upon for consistency, since the percentages could change across extraction dates and times.

III. Choosing what to ingest

Recall the data quality metrics from earlier; we are now going to use them to decide what exactly to ingest.

Based on our use case, there are three types of Excel files we could download:

  1. Content Analytics over the maximum time range to obtain overall historical metrics. The current date is excluded so as to avoid issues with data consistency.
  2. Content Analytics on a day-by-day basis to obtain detailed daily metrics, starting from 1st August 2025 and excluding the current date. The reason for this start date is that August 2025 is the month when I began my thought leadership initiative, and I rarely posted before that time. I would also rather not have to filter and download 365 files manually for my initial ingest.
  3. Post Analytics for posts after 1st August 2025, to extract additional details about post timing.
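For the daily extracts in item 2, the list of single-day date ranges to download can be generated upfront, which also tells us exactly how many files the initial ingest involves. A sketch, assuming the window runs from the chosen start date up to (but excluding) the current date:

```python
from datetime import date, timedelta

def daily_download_dates(start: date, today: date) -> list[date]:
    """Every date from `start` up to but excluding `today`.
    Each date maps to one Content Analytics export
    where start date == end date."""
    days = (today - start).days
    return [start + timedelta(days=i) for i in range(days)]
```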

With the above analysis in mind, here is how we might conduct our Excel ingestions:

Excel ingestion strategy for a) initial/historical load and b) daily incremental load.

Below is a breakdown of the types of fields that we could ingest from the respective files, with analysis based on the different data quality dimensions that we have identified, as well as a final decision column. We also define colour coding as follows:

  • Green: The data should be ingested.
  • Yellow: The data could be ingested, but needs additional work or has caveats attached (reason(s) highlighted).
  • Orange: The data should not be ingested (reason(s) highlighted).
Case 1: Content Analytics over the maximum time range, excluding current date
Case 2: Content Analytics on a day-by-day basis
Case 3: Post Analytics

In a real-life scenario, you may find that you are pressed for time to deliver a data product rapidly, and may thus feel tempted to skip the data quality analysis above. That is fine for proofs of concept (POCs), but is ultimately not tenable for production workloads:

  • This is an essential piece of documentation about the state of the raw data that future maintainers, developers and analysts can refer to and update where necessary.
  • Without this documentation, the knowledge of why some data was ingested in a certain way and why some other data was left out can be lost and end up contributing to the dreaded technical debt.

What if there are hundreds of fields to consider?

This is why defining the use case is so important. If we ingest data blindly without considering business utility except in the sense of “it may be useful someday”, we end up with a data swamp with no clear idea of which data is useful for what purpose. With a clearly defined use case, we can triage what fields to prioritise for evaluation, leaving the rest behind for future use cases. GenAI can be a decent assistant at times like this (e.g. for summarizing what particular fields could be for based on the data), but as with anything GenAI, a human needs to be the one to confirm the output.

In our case, we have a limited number of fields to consider, and we also aggregated some of the fields in our analysis where it made sense (e.g. post performance metrics since we could already tell that they are not relevant to our use case).

IV. What’s Next?

Now that we have selected our data sources and considered what fields to extract from them, our next step is to set up our Databricks Free platform and ingest the data. See you in the next article!

Yingzhao Ouyang is an AI and data engineering specialist with a distinctive blend of humanities, business, and technical expertise, bringing a uniquely holistic perspective to enterprise data challenges that others with purely technical backgrounds miss. To find out more, follow his LinkedIn profile at https://www.linkedin.com/in/yzouyang/
