Python scripts for structured data validation and gap analysis

Schema has been a constant topic in SEO for years. Recently, though, the conversation has shifted toward whether it still matters—both for traditional search and for LLM-driven experiences like ChatGPT.

I don’t think its importance has diminished. If anything, it’s evolving.

In March 2025, Google updated its structured data requirements for return policies. That alone is telling. Search engines don’t invest in maintaining and expanding structured data documentation unless it continues to play a meaningful role in how content is understood and surfaced.

The same applies to LLMs. Fabrice Canel, Principal Product Manager at Microsoft Bing, confirmed that schema markup helps Microsoft’s LLMs better understand web content. This isn’t just a theoretical benefit. It’s something practitioners are actively testing.

For example, Andrea Volpini shared findings showing a clear divide: websites with comprehensive structured data are more accurately represented in AI-generated responses, while those without it risk being misinterpreted or overlooked.

Taken together, this reinforces a simple point: structured data is still an important layer of communication between your content and both search engines and AI systems.

If schema is doing double duty by signaling to both crawlers and LLMs, then gaps in your schema implementation are missed opportunities to be accurately represented wherever your content is surfaced.

In practice, during audits, I repeatedly find missing, incomplete, or misaligned schema implementations. That’s what inspired me to start automating the process.

In this guide, I want to share how to audit schemas using Python and identify gaps between your implementation and your competitors’.

Step-by-step Python script + Colab notebook link

Prerequisites

You don’t need to be a developer. Basic comfort with running code in a browser is enough. The simplest way to do that is Google Colab, which lets you run Python directly in your browser with no setup required.

To get started, go to https://colab.research.google.com/ and create a new notebook for our code.

Once you have a new notebook ready, add a code cell, paste in the following lines, and click Run. This installs and imports everything the script needs.

Note: Each code block below is a separate cell. Paste them in order, running each one before moving to the next.

# Run this first
!pip install requests beautifulsoup4 pandas

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
from google.colab import files
import time

First run of script

How the script works

At a high level, the script goes through six simple steps to analyze and compare structured data between your site and your competitors’ sites.

Step 1: Get the list of your URLs and competitors’ URLs

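This script takes a CSV file with two columns as input. The header row must use the exact column names below, because the script reads the columns by name.

Column A (header: your_url) contains URLs from your website
Column B (header: competitor_url) contains URLs from competitor websites against which you want to compare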

In this example, I’m using my own website and comparing the schema on my page to the schema found on competitor pages. You can access this spreadsheet here.

Code

uploaded = files.upload()

df_input = pd.read_csv(list(uploaded.keys())[0])

print("✅ File loaded successfully!")
display(df_input.head())
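
Because the loop in Step 5 reads the columns by name, a quick header check right after the upload can catch a misnamed column before it turns into a KeyError later. This is only a minimal sketch: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names that are not part of the original script.

```python
# Hypothetical helper (not part of the original script).
# The two names below are the ones the Step 5 loop looks up.
REQUIRED_COLUMNS = ("your_url", "competitor_url")

def missing_columns(columns):
    """Return the required headers that are absent from `columns`."""
    present = {str(c).strip() for c in columns}
    return [c for c in REQUIRED_COLUMNS if c not in present]

# A plain list stands in for df_input.columns in this example.
print(missing_columns(["your_url", "competitor url"]))
# → ['competitor_url']
```

In the notebook you could call `missing_columns(df_input.columns)` and stop if the returned list is not empty.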

Step 2: Fetch HTML

The script visits each URL and downloads the page content just like a browser would.

Code

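def fetch_page(url):
    # Send a browser-like User-Agent; some sites reject the default one
    headers = {"User-Agent": "Mozilla/5.0"}

    try:
        response = requests.get(url, headers=headers, timeout=10)
        # Treat 4xx/5xx responses as failures instead of parsing an error page
        response.raise_for_status()
        return BeautifulSoup(response.text, "html.parser")

    except requests.RequestException as e:
        print(f"❌ Failed to fetch: {url} ({e})")
        return None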

Step 3: Pull raw schema data

The script scans the page and pulls out any structured data (schema) added using JSON-LD.

Note: this code only works with JSON-LD schema. If a page publishes its structured data as RDFa or microdata, the script won’t detect it.

Code

# ============================================
# STEP 3: EXTRACT JSON-LD SCHEMA
# ============================================

def extract_jsonld(soup):
    schemas = []
    
    if soup is None:
        return schemas

    scripts = soup.find_all("script", type="application/ld+json")
    
    for script in scripts:
        try:
            # script.string can be None when the tag has several children,
            # so fall back to the tag's full text
            raw_json = script.string or script.text

            if not raw_json:
                continue

            data = json.loads(raw_json.strip())
            
            if isinstance(data, list):
                schemas.extend(data)
            else:
                schemas.append(data)
        
        except json.JSONDecodeError as e:
            print(f"⚠️ Could not parse schema JSON: {e}")
            continue
    
    return schemas
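
Some sites wrap all of their entities in a single JSON-LD object under an `@graph` key. With the function above, such a block comes through as one schema with no top-level `@type`, so Step 5 would label it "Unknown". One possible way to handle this is to expand the wrapper before the audit. The sketch below assumes a helper named `unwrap_graph`, which is hypothetical and not part of the original script.

```python
def unwrap_graph(schemas):
    """Expand any {"@graph": [...]} wrapper into its individual entities."""
    unwrapped = []
    for schema in schemas:
        graph = schema.get("@graph") if isinstance(schema, dict) else None
        if isinstance(graph, list):
            # Keep only the JSON objects found inside the graph
            unwrapped.extend(node for node in graph if isinstance(node, dict))
        else:
            unwrapped.append(schema)
    return unwrapped

# Made-up input: one @graph wrapper plus one standalone schema
example = [
    {"@context": "https://schema.org",
     "@graph": [{"@type": "WebPage"}, {"@type": "Organization"}]},
    {"@type": "FAQPage"},
]
print([s["@type"] for s in unwrap_graph(example)])
# → ['WebPage', 'Organization', 'FAQPage']
```

If you adopt something like this, you would apply it to the output of `extract_jsonld` for both pages before building the comparison.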

Step 4: Make schemas comparable

Structured data is often nested (fields inside fields), which makes comparison messy.

So the script “flattens” it into a simpler format. For example:

Original: author → name
Becomes: author.name

This makes it much easier to compare fields across different websites.

Code

# ============================================
# STEP 4: FLATTEN SCHEMA
# ============================================

def flatten_schema(data, parent_key=""):
    items = []

    if isinstance(data, dict):
        for k, v in data.items():
            new_key = f"{parent_key}.{k}" if parent_key else k
            items.extend(flatten_schema(v, new_key))
    
    elif isinstance(data, list):
        for item in data:
            items.extend(flatten_schema(item, parent_key))
    
    else:
        items.append(parent_key)

    return items
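
To see what the flattening produces, here is a small worked example on a made-up Article schema. The function is repeated so the cell runs on its own, and the sample data is purely illustrative.

```python
def flatten_schema(data, parent_key=""):
    # Same logic as the function above, repeated so this example is self-contained
    items = []
    if isinstance(data, dict):
        for k, v in data.items():
            new_key = f"{parent_key}.{k}" if parent_key else k
            items.extend(flatten_schema(v, new_key))
    elif isinstance(data, list):
        for item in data:
            items.extend(flatten_schema(item, parent_key))
    else:
        items.append(parent_key)
    return items

# Illustrative schema, not taken from a real page
sample = {
    "@type": "Article",
    "headline": "Example headline",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "image": ["a.jpg", "b.jpg"],
}
print(sorted(set(flatten_schema(sample))))
# → ['@type', 'author.@type', 'author.name', 'headline', 'image']
```

Two things are worth noticing: only field names are kept (values are dropped), and every item in a list maps to the same key, which is why the audit step wraps the result in a set.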

Step 5: Audit schemas

For every URL, the script:

Reads the schema
Identifies the schema types used (e.g., Product, Article)
Extracts all the fields associated with each type

This step also contains the core of the analysis. For each schema type, the script compares your implementation with your competitor’s and identifies:

Common fields → what you both have
Missing fields → what competitors have that you don’t
Unique fields → what you have that competitors don’t

This is where the “gap analysis” happens.
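
The three comparisons are plain set operations on the flattened field names. The sketch below uses made-up field sets to show them side by side. Note that the Step 5 code only writes the missing fields to the report, so the `common` and `unique` lines show how the other two could be derived if you want them as extra columns.

```python
# Illustrative field sets for one schema type (e.g. Article)
your_fields = {"@type", "headline", "author.name", "publisher.name"}
competitor_fields = {"@type", "headline", "author.name", "dateModified"}

common = your_fields & competitor_fields   # what you both have
missing = competitor_fields - your_fields  # what the competitor has that you don't
unique = your_fields - competitor_fields   # what you have that the competitor doesn't

print("Common:", sorted(common))
print("Missing:", sorted(missing))
print("Unique:", sorted(unique))
# → Common: ['@type', 'author.name', 'headline']
# → Missing: ['dateModified']
# → Unique: ['publisher.name']
```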

Code

# ============================================
# STEP 5: AUDIT + GAP ANALYSIS
# ============================================

all_results = []

for _, row in df_input.iterrows():
    your_url = row["your_url"]
    competitor_url = row["competitor_url"]

    print(f"\n🔍 Processing:\nYou: {your_url}\nCompetitor: {competitor_url}")

    your_soup = fetch_page(your_url)
    time.sleep(2)
    competitor_soup = fetch_page(competitor_url)
    time.sleep(2)

    your_schemas = extract_jsonld(your_soup)
    competitor_schemas = extract_jsonld(competitor_soup)

    def build_schema_map(schemas, source_label, url):
        results = {}
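        for schema in schemas:
            # Skip anything that isn't a JSON object (e.g. a stray string in a list)
            if not isinstance(schema, dict):
                continue

            # @type can be a string or a list; use the first entry of a list
            schema_type = schema.get("@type", "Unknown")

            if isinstance(schema_type, list):
                schema_type = schema_type[0] if schema_type else "Unknown"

            if not isinstance(schema_type, str):
                schema_type = "Unknown"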

            fields = set(flatten_schema(schema))

            if schema_type not in results:
                results[schema_type] = []

            results[schema_type].append({
                "source": source_label,
                "url": url,
                "fields": fields
            })

        return results

    your_data    = build_schema_map(your_schemas, "you", your_url)
    competitor_data = build_schema_map(competitor_schemas, "competitor", competitor_url)

    your_types   = set(your_data.keys())
    comp_types   = set(competitor_data.keys())
    shared_types = your_types & comp_types

    missing_fields_by_type = {}

    for schema_type in sorted(shared_types):
        your_fields = set().union(*(e["fields"] for e in your_data[schema_type]))
        comp_fields = set().union(*(e["fields"] for e in competitor_data[schema_type]))

        missing = comp_fields - your_fields

        if missing:
            missing_fields_by_type[schema_type] = sorted(missing)

    fields_missing_str = " | ".join(
        f"{t}: {', '.join(fields)}"
        for t, fields in missing_fields_by_type.items()
    ) if missing_fields_by_type else ""

    all_results.append(pd.DataFrame([{
        "your_url": your_url,
        "your_schema_types": ", ".join(sorted(your_types)),
        "competitor_url": competitor_url,
        "competitor_schema_types": ", ".join(sorted(comp_types)),
        "schema_types_you_are_missing": ", ".join(sorted(comp_types - your_types)),
        "schema_types_competitor_is_missing": ", ".join(sorted(your_types - comp_types)),
        "fields_you_are_missing": fields_missing_str,
    }]))

Step 6: Generate the output report

Finally, the script:

Outputs a clean summary of findings
Exports everything into a CSV file

So you can easily filter, prioritize, and take action on the gaps.

Code

# ============================================
# STEP 6: FINAL OUTPUT TABLE
# ============================================

# pd.concat raises an error on an empty list, so fall back to an empty table
if all_results:
    final_df = pd.concat(all_results, ignore_index=True)
else:
    final_df = pd.DataFrame()

print("\n✅ Analysis complete!")
display(final_df.head())

final_df.to_csv("schema_gap_analysis.csv", index=False)

print("📁 File saved as: schema_gap_analysis.csv")

Here’s the link to the complete Colab Notebook for this code.

Example output

Once you run the script using the example input in this sheet, you will get the following output both in the Colab notebook and in an exported file.

Example script output

Each row shows one of your URLs compared against a competitor URL, along with the schema types detected on both pages and a simple gap analysis.

Here’s a detailed breakdown of each column:

Column	What it tells you
your_url / competitor_url	The two pages being compared
your_schema_types	Schema types detected on your page (e.g. Article, BreadcrumbList)
competitor_schema_types	Schema types detected on the competitor’s page
schema_types_you_are_missing	Schema types the competitor has that you don’t implement at all
schema_types_competitor_is_missing	Schema types you have that they don’t (useful context, but not your priority)
fields_you_are_missing	The most actionable column. For schema types you both implement, these are the specific fields the competitor includes that you don’t — e.g. Article: author.name, dateModified means you have Article schema but are missing those two fields

For example, you can see in row two that my website is missing the Course schema.

Missing course schema example

Some of the insights you can get from the above table:

The missing Course schema is a missed opportunity for rich results [learn more here].
The missing FAQPage schema is missed SERP real estate [learn more here].

The goal is not to copy everything your competitors are doing. Instead, your priority should be finding missed opportunities that can bring value to your website and business.

Tips on customizing the script to fit your workflow

This script is a great starting point, but there’s plenty of space to customize it for your own workflow.

Compare one URL against a full SERP

Instead of manually pairing URLs, plug in a SERP scraping API to automatically pull the top 10 results for a keyword and feed them as the competitor column.

From there, you can layer in additional SERP signals such as title tag length, meta description, and word count, and turn the schema audit into a broader on-page gap analysis in a single pass.

Weight schemas by rich result eligibility

Not all schema types have the same upside. If your goal is rich results, types like Product, Recipe, Review, and VideoObject are worth more than WebSite, for example.

You can add a priority map to the script and split the missing schema output into two columns, missing_high_priority and missing_low_priority, so you immediately know what to act on versus what to log for later.
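
A priority map can be as simple as a set of type names. The sketch below is one possible shape for it. `HIGH_PRIORITY_TYPES` and `split_by_priority` are hypothetical names, and the set itself is an assumption that you should adjust to the rich results relevant to your site.

```python
# Assumption: illustrative list only, taken from the types named above.
HIGH_PRIORITY_TYPES = {"Product", "Recipe", "Review", "VideoObject"}

def split_by_priority(missing_types):
    """Split missing schema types into high- and low-priority columns."""
    high = sorted(t for t in missing_types if t in HIGH_PRIORITY_TYPES)
    low = sorted(t for t in missing_types if t not in HIGH_PRIORITY_TYPES)
    return {
        "missing_high_priority": ", ".join(high),
        "missing_low_priority": ", ".join(low),
    }

print(split_by_priority({"WebSite", "Product", "VideoObject"}))
# → {'missing_high_priority': 'Product, VideoObject', 'missing_low_priority': 'WebSite'}
```

In Step 5, you could pass `comp_types - your_types` to this function and merge the returned dictionary into the row that gets appended to the results.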

Track changes over time

Run the script after every major Google update, or whenever you see meaningful rank movement, and save each output with a timestamp.

Over time, the CSVs become a dataset you can mine for patterns: which schema types correlate with ranking shifts, which competitors are iterating on their markup, and what the dominant schema fingerprint looks like for a given keyword category.
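
To keep each run instead of overwriting the previous one, you can build the output filename from the current date and time. This is a minimal sketch, and `timestamped_filename` is a hypothetical helper name.

```python
from datetime import datetime

def timestamped_filename(prefix="schema_gap_analysis", now=None):
    """Build a CSV filename such as schema_gap_analysis_2025-03-14_0930.csv."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y-%m-%d_%H%M')}.csv"

# A fixed date is passed here only to make the example reproducible
print(timestamped_filename(now=datetime(2025, 3, 14, 9, 30)))
# → schema_gap_analysis_2025-03-14_0930.csv
```

In Step 6, the export line would then become `final_df.to_csv(timestamped_filename(), index=False)`.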

Run large scale studies

With some minor modifications, you can use this script to take a list of keywords and their types (e.g., transactional, commercial, branded) and analyze the most common schema types in the top 10 results by keyword type in your niche.

You’ll need two things before modifying the script for this study:

A new input CSV with two columns: keyword and keyword_type
A SERP API key to fetch the top 10 results per keyword (SerpAPI has a free tier that works fine for this)

Once you have those, paste the code from this guide into ChatGPT or Claude and ask it to modify the script to do the following:

Accept a CSV input with two columns: keyword and keyword_type (e.g. transactional, commercial, branded)
Use SerpAPI to automatically fetch the top 10 organic results for each keyword
Visit each of those URLs and extract all JSON-LD schema types using the existing functions
Add a delay between each request to avoid rate limiting
Count how frequently each schema type appears across the top 10 results, grouped by keyword type
Export two CSV files: one raw file with every URL and its detected schema types, and one summary file counting schema type frequency by keyword type
Plot a graph for each keyword type, where the x-axis is the schema type and the y-axis is the frequency
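
The counting step in that list is the part that is easiest to get wrong, so here is a rough sketch of it using only the standard library. It leaves out the SERP fetching and the plotting, and the rows are made-up sample data. The function name and the row layout are assumptions, not something the modified script is guaranteed to produce.

```python
from collections import Counter, defaultdict

# Hypothetical rows: one per ranking URL, with the schema types found on it
rows = [
    {"keyword_type": "transactional", "schema_types": ["Product", "BreadcrumbList"]},
    {"keyword_type": "transactional", "schema_types": ["Product", "Organization"]},
    {"keyword_type": "branded", "schema_types": ["Organization", "WebSite"]},
]

def count_types_by_keyword_type(rows):
    """Count how many URLs use each schema type, grouped by keyword type."""
    counts = defaultdict(Counter)
    for row in rows:
        # set() so a type repeated on one page is counted once per URL
        counts[row["keyword_type"]].update(set(row["schema_types"]))
    return counts

summary = count_types_by_keyword_type(rows)
for keyword_type, counter in summary.items():
    print(keyword_type, dict(counter))
```

With this sample data, Product is counted twice for transactional keywords and every other type once.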

Integrate a schema validator

This script detects schema markup but does not validate it, so validation is something you may want to add to your own version.

To do this, you first need to build a reference table:

Decide which schema types and related fields are important to you, for example Person, Organization, Product, and Review.
Once you have this list, ask your LLM of choice to create a table of the required and recommended fields for each type.

Next, ask ChatGPT, Claude, or the LLM of your choice to update the code so that it validates the selected schema types against this table.
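
For orientation, the validation logic could look something like the sketch below. The contents of `REFERENCE_TABLE` are placeholders for illustration only, not the actual required or recommended fields of any search engine. `validate_fields` is a hypothetical function name, and the field names follow the flattened format from Step 4.

```python
# Assumption: placeholder rules for illustration. Replace them with the
# required and recommended fields from your own reference table.
REFERENCE_TABLE = {
    "Article": {
        "required": {"headline"},
        "recommended": {"author.name", "datePublished", "image"},
    },
}

def validate_fields(schema_type, fields):
    """Compare flattened field names against the reference table."""
    rules = REFERENCE_TABLE.get(schema_type)
    if rules is None:
        return None  # no rules defined for this schema type
    fields = set(fields)
    return {
        "missing_required": sorted(rules["required"] - fields),
        "missing_recommended": sorted(rules["recommended"] - fields),
    }

print(validate_fields("Article", {"@type", "headline", "author.name"}))
# → {'missing_required': [], 'missing_recommended': ['datePublished', 'image']}
```

This only checks whether a field is present. It does not check whether the value is well formed, which is a separate layer of validation.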

Conclusion (and next steps)

Structured data’s impact is no longer limited to rich results in SERPs and search engines’ knowledge graphs. It also shapes how AI systems process and understand your content. That makes schema less of a nice-to-have and more of a foundational layer of how your content gets represented, wherever it’s surfaced.

Using a script like this to run schema audits and gap analysis helps you identify missed opportunities, catch errors, and stay competitive across both traditional search and AI-driven experiences.

Your next step: run this on your top five pages by traffic or business importance, paired with their top-ranking competitors, and see what’s missing. It’s a small time investment that can surface quick wins.

Check out the code here!