Python scripts for structured data validation and gap analysis
Schema has been a constant topic in SEO for years. Recently, though, the conversation has shifted toward whether it still matters—both for traditional search and for LLM-driven experiences like ChatGPT.
I don’t think its importance has diminished. If anything, it’s evolving.
In March 2025, Google updated its structured data requirements for return policies. That alone is telling. Search engines don’t invest in maintaining and expanding structured data documentation unless it continues to play a meaningful role in how content is understood and surfaced.
The same applies to LLMs. Fabrice Canel, Principal Product Manager at Microsoft Bing, confirmed that schema markup helps Microsoft’s LLMs better understand web content. This isn’t just a theoretical benefit; it’s something practitioners are actively testing.
For example, Andrea Volpini shared findings showing a clear divide: websites with comprehensive structured data are more accurately represented in AI-generated responses, while those without it risk being misinterpreted or overlooked.
Taken together, this reinforces a simple point: structured data is still an important layer of communication between your content and both search engines and AI systems.
If schema is doing double duty by signaling to both crawlers and LLMs, then gaps in your schema implementation are missed opportunities to be accurately represented wherever your content is surfaced.
In practice, during audits, I repeatedly find missing, incomplete, or misaligned schema implementations. That’s what inspired me to start automating the process.
In this guide, I want to share how to audit schemas using Python and identify gaps between your implementation and your competitors’.
Step-by-step Python script + Colab notebook link
Prerequisites
You don’t need to be a developer. Basic comfort with running code in a browser is enough. The simplest way to do that is Google Colab, which lets you run Python directly in your browser with no setup required.
To get started, go to https://colab.research.google.com/ and create a new notebook for our code.
Once you have a new notebook ready, add a code cell, paste in the following lines, and click run. This installs and imports everything the script needs.
Note: Each code block below is a separate cell. Paste them in order, running each one before moving to the next.
# Run this first
!pip install requests beautifulsoup4 pandas
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
from google.colab import files
import time

How the script works
At a high level, the script goes through six simple steps to analyze and compare structured data between your site and your competitors.
Step 1: Get the list of your URLs and competitors’ URLs
This script takes a CSV file with two columns as input. The header row must use the names the code reads later: your_url and competitor_url.
- Column A (your_url) contains URLs from your website
- Column B (competitor_url) contains the competitor URLs you want to compare against
In this example, I’m using my own website and comparing the schema on my page to the schema found on competitor pages. You can access this spreadsheet here.
Code
uploaded = files.upload()
df_input = pd.read_csv(list(uploaded.keys())[0])
print("✅ File loaded successfully!")
display(df_input.head())
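If you are unsure what the input file should look like, here is a minimal sketch that builds a sample CSV in memory. The header names your_url and competitor_url match what the Step 5 code reads; the example.com and example.org URLs are placeholders, not the contents of the original spreadsheet.

```python
import csv
import io

# Placeholder URL pairs; replace them with your own pages and competitors
rows = [
    {"your_url": "https://example.com/page-a", "competitor_url": "https://example.org/page-a"},
    {"your_url": "https://example.com/page-b", "competitor_url": "https://example.org/page-b"},
]

# Write the rows as CSV text with the two headers the script expects
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["your_url", "competitor_url"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Saving that text as a .csv file gives you something you can upload in the cell above.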
Step 2: Fetch HTML
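The script requests each URL and downloads the raw HTML. Unlike a browser, it does not execute JavaScript, so schema injected client-side (for example through a tag manager) will not be picked up.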
Code
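def fetch_page(url):
    """Download a page and return it as parsed HTML, or None on failure."""
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
        return BeautifulSoup(response.text, "html.parser")
    except requests.RequestException as e:
        print(f"❌ Failed to fetch: {url} ({e})")
        return None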
Step 3: Pull raw schema data
It scans the page and pulls out any structured data (schema) added using JSON-LD.
Note: this code only works with JSON-LD schema. If a page has structured data in RDFa or microdata format, the script will not detect it.
Code
# ============================================
# STEP 3: EXTRACT JSON-LD SCHEMA
# ============================================
def extract_jsonld(soup):
    """Return a list of JSON-LD objects found in the page."""
    schemas = []
    if soup is None:
        return schemas
    scripts = soup.find_all("script", type="application/ld+json")
    for script in scripts:
        try:
            # .string is None when the tag has more than one child, so fall back to .text
            raw_json = script.string or script.text
            if not raw_json:
                continue
            data = json.loads(raw_json.strip())
            # A single script tag can hold one object or a list of objects
            if isinstance(data, list):
                schemas.extend(data)
            else:
                schemas.append(data)
        except json.JSONDecodeError as e:
            print(f"⚠️ Could not parse schema JSON: {e}")
            continue
    return schemas
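One pattern the extractor above does not unpack: some sites wrap all of their schema objects in a single top-level @graph array. The wrapper object has no @type of its own, so Step 5 would label the whole block as Unknown. Below is a minimal sketch of a helper that expands such wrappers. The function name expand_graph is hypothetical and not part of the original script, and the sample data is made up.

```python
def expand_graph(schemas):
    """Replace any top-level {"@graph": [...]} wrapper with the nodes it contains."""
    expanded = []
    for schema in schemas:
        if isinstance(schema, dict) and isinstance(schema.get("@graph"), list):
            # Keep only dict nodes; other values cannot carry an @type
            expanded.extend(node for node in schema["@graph"] if isinstance(node, dict))
        else:
            expanded.append(schema)
    return expanded


# Sample data shaped like a typical @graph block (values are made up)
sample = [
    {
        "@context": "https://schema.org",
        "@graph": [
            {"@type": "WebSite", "name": "Example"},
            {"@type": "Article", "headline": "Hello"},
        ],
    },
    {"@type": "BreadcrumbList"},
]

types = [node.get("@type") for node in expand_graph(sample)]
print(types)  # ['WebSite', 'Article', 'BreadcrumbList']
```

One way to use it would be to call expand_graph(extract_jsonld(soup)) before the schemas are grouped by type.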
Step 4: Make schemas comparable
Structured data is often nested (fields inside fields), which makes comparison messy.
So the script “flattens” it into a simpler format. For example:
- Original: author → name
- Becomes: author.name
This makes it much easier to compare fields across different websites.
Code
# ============================================
# STEP 4: FLATTEN SCHEMA
# ============================================
def flatten_schema(data, parent_key=""):
    """Return dot-separated key paths for every leaf value in a schema object."""
    items = []
    if isinstance(data, dict):
        for k, v in data.items():
            new_key = f"{parent_key}.{k}" if parent_key else k
            items.extend(flatten_schema(v, new_key))
    elif isinstance(data, list):
        # List items reuse the parent's path (no index is added)
        for item in data:
            items.extend(flatten_schema(item, parent_key))
    else:
        items.append(parent_key)
    return items
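To make the flattening concrete, here is a small worked example that runs the same function on a made-up Article object. The sample values are illustrative only. Note how the two image entries collapse into a single image field once the result is turned into a set.

```python
def flatten_schema(data, parent_key=""):
    """Return dot-separated key paths for every leaf value in a schema object."""
    items = []
    if isinstance(data, dict):
        for k, v in data.items():
            new_key = f"{parent_key}.{k}" if parent_key else k
            items.extend(flatten_schema(v, new_key))
    elif isinstance(data, list):
        for item in data:
            items.extend(flatten_schema(item, parent_key))
    else:
        items.append(parent_key)
    return items


# Made-up schema object used only for illustration
sample = {
    "@type": "Article",
    "headline": "Example headline",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "image": ["a.jpg", "b.jpg"],
}

fields = sorted(set(flatten_schema(sample)))
print(fields)
# ['@type', 'author.@type', 'author.name', 'headline', 'image']
```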
Step 5: Audit schemas
For every URL, the script:
- Reads the schema
- Identifies the schema types used (e.g., Product, Article)
- Extracts all the fields associated with each type
This also includes the core of the analysis. For each schema type, the script compares your implementation with your competitors’ and identifies:
- Common fields → what you both have
- Missing fields → what competitors have that you don’t
- Unique fields → what you have that competitors don’t
This is where the “gap analysis” happens. Note that the code below writes only the missing fields to the report; common and unique fields can be derived with the same set operations.
Code
# ============================================
# STEP 5: AUDIT + GAP ANALYSIS
# ============================================
def build_schema_map(schemas, source_label, url):
    """Group the flattened fields of each schema object by its @type."""
    results = {}
    for schema in schemas:
        if not isinstance(schema, dict):
            continue  # skip anything that is not a JSON object
        schema_type = schema.get("@type", "Unknown")
        # @type can be a list; only the first entry is used here
        if isinstance(schema_type, list):
            schema_type = schema_type[0] if schema_type else "Unknown"
        if not isinstance(schema_type, str):
            schema_type = "Unknown"
        fields = set(flatten_schema(schema))
        if schema_type not in results:
            results[schema_type] = []
        results[schema_type].append({
            "source": source_label,
            "url": url,
            "fields": fields
        })
    return results

all_results = []
for _, row in df_input.iterrows():
    your_url = row["your_url"]
    competitor_url = row["competitor_url"]
    print(f"\n🔍 Processing:\nYou: {your_url}\nCompetitor: {competitor_url}")
    # Fetch both pages, pausing between requests
    your_soup = fetch_page(your_url)
    time.sleep(2)
    competitor_soup = fetch_page(competitor_url)
    time.sleep(2)
    your_schemas = extract_jsonld(your_soup)
    competitor_schemas = extract_jsonld(competitor_soup)
    your_data = build_schema_map(your_schemas, "you", your_url)
    competitor_data = build_schema_map(competitor_schemas, "competitor", competitor_url)
    your_types = set(your_data.keys())
    comp_types = set(competitor_data.keys())
    shared_types = your_types & comp_types
    # For schema types both pages use, list the fields only the competitor has
    missing_fields_by_type = {}
    for schema_type in sorted(shared_types):
        your_fields = set().union(*(e["fields"] for e in your_data[schema_type]))
        comp_fields = set().union(*(e["fields"] for e in competitor_data[schema_type]))
        missing = comp_fields - your_fields
        if missing:
            missing_fields_by_type[schema_type] = sorted(missing)
    fields_missing_str = " | ".join(
        f"{t}: {', '.join(fields)}"
        for t, fields in missing_fields_by_type.items()
    ) if missing_fields_by_type else ""
    all_results.append(pd.DataFrame([{
        "your_url": your_url,
        "your_schema_types": ", ".join(sorted(your_types)),
        "competitor_url": competitor_url,
        "competitor_schema_types": ", ".join(sorted(comp_types)),
        "schema_types_you_are_missing": ", ".join(sorted(comp_types - your_types)),
        "schema_types_competitor_is_missing": ", ".join(sorted(your_types - comp_types)),
        "fields_you_are_missing": fields_missing_str,
    }]))
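The common, missing, and unique fields described for this step all come down to three set operations. Here is a minimal sketch for one schema type. The helper name compare_fields and the sample field sets are hypothetical; the script above only reports the "missing" part.

```python
def compare_fields(your_fields, competitor_fields):
    """Return common, missing, and unique fields for one schema type."""
    your_fields = set(your_fields)
    competitor_fields = set(competitor_fields)
    return {
        "common": sorted(your_fields & competitor_fields),   # both pages have these
        "missing": sorted(competitor_fields - your_fields),  # competitor only
        "unique": sorted(your_fields - competitor_fields),   # your page only
    }


# Made-up flattened field sets for an Article on each page
yours = {"@type", "headline", "datePublished", "publisher.name"}
theirs = {"@type", "headline", "author.name", "dateModified"}

result = compare_fields(yours, theirs)
print(result["common"])   # ['@type', 'headline']
print(result["missing"])  # ['author.name', 'dateModified']
print(result["unique"])   # ['datePublished', 'publisher.name']
```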
Step 6: Generate the output report
Finally, the script:
- Outputs a clean summary of findings
- Exports everything into a CSV file
So you can easily filter, prioritize, and take action on the gaps.
Code
# ============================================
# STEP 6: FINAL OUTPUT TABLE
# ============================================
# Guard against an empty result list, which would make pd.concat raise
if all_results:
    final_df = pd.concat(all_results, ignore_index=True)
else:
    final_df = pd.DataFrame()
print("\n✅ Analysis complete!")
display(final_df.head())
final_df.to_csv("schema_gap_analysis.csv", index=False)
print("📁 File saved as: schema_gap_analysis.csv")
Here’s the link to the complete Colab Notebook for this code.
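For filtering and prioritizing, it can help to split the fields_you_are_missing column into one entry per field. The sketch below parses the "Type: field, field | Type: field" string that Step 5 builds. The helper name explode_missing_fields is hypothetical, and the sample cell is made up.

```python
def explode_missing_fields(cell):
    """Turn 'Type: f1, f2 | Other: f3' into a list of (type, field) pairs."""
    pairs = []
    # Empty cells come back as NaN (a float) when the CSV is re-read with pandas
    if not isinstance(cell, str) or not cell.strip():
        return pairs
    for chunk in cell.split(" | "):
        schema_type, _, field_list = chunk.partition(": ")
        for field in field_list.split(", "):
            if field:
                pairs.append((schema_type, field))
    return pairs


# Made-up cell in the same format the report uses
cell = "Article: author.name, dateModified | Organization: logo"
pairs = explode_missing_fields(cell)
print(pairs)
# [('Article', 'author.name'), ('Article', 'dateModified'), ('Organization', 'logo')]
```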
Example output
Once you run the script using the example input in this sheet, you will get the following output both in the Colab notebook and in an exported file.

This shows each of your URLs compared against its competitor URL, with the detected schema types listed alongside a simple gap analysis.
Here’s a detailed breakdown of each column:
| Column | What it tells you |
|---|---|
| your_url / competitor_url | The two pages being compared |
| your_schema_types | Schema types detected on your page (e.g. Article, BreadcrumbList) |
| competitor_schema_types | Schema types detected on the competitor’s page |
| schema_types_you_are_missing | Schema types the competitor has that you don’t implement at all |
| schema_types_competitor_is_missing | Schema types you have that they don’t (useful context, but not your priority) |
| fields_you_are_missing | The most actionable column. For schema types you both implement, these are the specific fields the competitor includes that you don’t — e.g. Article: author.name, dateModified means you have Article schema but are missing those two fields |
For example, you can see in row two that my website is missing the Course schema.

Some of the insights you can get from the above table:
- The missing Course schema is a missed opportunity for rich results [learn more here].
- The missing FAQPage schema is missed SERP real estate [learn more here].
The goal is not to copy everything your competitors are doing. Instead, your priority should be finding missed opportunities that can bring value to your website and business.
Tips on customizing the script to fit your workflow
This script is a great starting point, but there’s plenty of room to customize it for your own workflow.
Compare one URL against a full SERP
Instead of manually pairing URLs, plug in a SERP scraping API to automatically pull the top 10 results for a keyword and feed them as the competitor column.
From there, you can layer in additional SERP signals like title tag length, meta description, and word count, and turn the schema audit into a broader on-page gap analysis in a single pass.
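SERP API response formats vary by provider, so the sketch below skips the API call and assumes you already have a list of result URLs for a keyword. It pairs one of your URLs with each of them in the two-column shape the script expects. The helper name pair_with_competitors and all URLs are hypothetical.

```python
def pair_with_competitors(your_url, competitor_urls):
    """Build one input row per competitor URL, skipping duplicates and your own page."""
    seen = set()
    rows = []
    for url in competitor_urls:
        if url == your_url or url in seen:
            continue
        seen.add(url)
        rows.append({"your_url": your_url, "competitor_url": url})
    return rows


# Placeholder list standing in for the top results returned by a SERP API
serp_urls = [
    "https://example.org/guide",
    "https://example.com/my-guide",  # your own page appears in the results
    "https://example.net/guide",
    "https://example.org/guide",     # duplicate result
]

rows = pair_with_competitors("https://example.com/my-guide", serp_urls)
print(len(rows))  # 2
```

Passing rows to pd.DataFrame would give a table that can stand in for df_input.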
Weight schemas by rich result eligibility
Not all schema types have the same upside. If your goal is rich results, types like Product, Recipe, Review, and VideoObject are worth more than WebSite, for example.
You can add a priority map to the script and split the missing schema output into two columns, missing_high_priority and missing_low_priority, so you immediately know what to act on versus what to log for later.
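A minimal sketch of such a priority map, assuming the high-priority set is simply the four types named above. Both the set and the helper name split_by_priority are illustrative choices; adjust the set to the rich results that matter for your site.

```python
# Assumption: these four types are treated as high priority; everything else is low
HIGH_PRIORITY_TYPES = {"Product", "Recipe", "Review", "VideoObject"}


def split_by_priority(missing_types_cell):
    """Split a comma-separated list of schema types into high and low priority."""
    types = [t.strip() for t in missing_types_cell.split(",") if t.strip()]
    high = [t for t in types if t in HIGH_PRIORITY_TYPES]
    low = [t for t in types if t not in HIGH_PRIORITY_TYPES]
    return {
        "missing_high_priority": ", ".join(high),
        "missing_low_priority": ", ".join(low),
    }


# Made-up value in the format of the schema_types_you_are_missing column
split = split_by_priority("Product, WebSite, VideoObject")
print(split)
# {'missing_high_priority': 'Product, VideoObject', 'missing_low_priority': 'WebSite'}
```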
Track changes over time
Run the script after every major Google update, or whenever you see meaningful rank movement, and save each output with a timestamp.
Over time, the CSVs become a dataset you can mine for patterns: which schema types correlate with ranking shifts, which competitors are iterating on their markup, and what the dominant schema fingerprint looks like for a given keyword category.
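One simple way to keep each run separate is to put a timestamp in the output file name. This is a minimal sketch; the helper name timestamped_filename and the name format are assumptions, not part of the original script.

```python
from datetime import datetime, timezone


def timestamped_filename(prefix="schema_gap_analysis", now=None):
    """Return a CSV file name such as schema_gap_analysis_20250301_093000.csv."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


# A fixed date is passed here only to make the example reproducible
name = timestamped_filename(now=datetime(2025, 3, 1, 9, 30, 0))
print(name)  # schema_gap_analysis_20250301_093000.csv
```

In Step 6 you could then write final_df.to_csv(timestamped_filename(), index=False) instead of using a fixed file name.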
Run large scale studies
You can use this script, with some minor modifications, to take a list of keywords and their types (e.g., transactional, commercial, branded) and analyze the most common schema types in the top 10 results by keyword type in your niche.
You’ll need two things before modifying the script for this study:
- A new input CSV with two columns: keyword and keyword_type
- A SERP API key to fetch the top 10 results per keyword (SerpAPI has a free tier that works fine for this)
Once you have those, paste in the code from above and ask ChatGPT or Claude to modify it to do the following:
- Accept a CSV input with two columns: keyword and keyword_type (e.g. transactional, commercial, branded)
- Use SerpAPI to automatically fetch the top 10 organic results for each keyword
- Visit each of those URLs and extract all JSON-LD schema types using the existing functions.
- Add a delay between each request to avoid rate limiting
- Count how frequently each schema type appears across the top 10 results, grouped by keyword type
- Export two CSV files: one raw file with every URL and its detected schema types, and one summary file counting schema type frequency by keyword type.
- Plot a graph for each keyword type, where the x-axis is the schema type, and the y-axis represents the frequency.
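The counting step in that list can be sketched without any API at all. The example below assumes the fetching and extraction have already produced one record per URL; the record layout, the helper name count_types_by_keyword_type, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def count_types_by_keyword_type(records):
    """Count how many URLs use each schema type, grouped by keyword type."""
    counts = defaultdict(Counter)
    for record in records:
        # Count each type once per URL so repeated blocks don't inflate totals
        counts[record["keyword_type"]].update(set(record["schema_types"]))
    return counts


# Made-up records: one per URL, with the schema types detected on it
records = [
    {"keyword_type": "transactional", "url": "https://example.com/a",
     "schema_types": ["Product", "BreadcrumbList"]},
    {"keyword_type": "transactional", "url": "https://example.org/b",
     "schema_types": ["Product", "Product"]},
    {"keyword_type": "branded", "url": "https://example.net/c",
     "schema_types": ["Organization"]},
]

summary = count_types_by_keyword_type(records)
print(summary["transactional"].most_common())
# [('Product', 2), ('BreadcrumbList', 1)]
```

Each Counter in summary maps directly onto one of the per-keyword-type charts: schema types on the x-axis, counts on the y-axis.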
Integrate a schema validator
This script detects schema markup but does not validate it. Validation is something you may want to add to your version of the script.
To do this, first build a reference table:
- Decide which schema types matter to you and which fields they involve, for example Person, Organization, Product, and Review.
- Once you have this list, ask your LLM of choice to create a table of both required and recommended fields for each type.
Next, ask ChatGPT, Claude, or the LLM of your choice to update the code so that it validates the selected schema types against this table.
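As a rough idea of what that validation could look like, here is a minimal sketch that checks a set of flattened fields against a reference table. The RULES table holds placeholder values for illustration only and is not an authoritative list of required or recommended properties; the helper name validate_fields is also hypothetical.

```python
# Placeholder rules for illustration only; replace with your own reference table
RULES = {
    "Article": {
        "required": {"headline"},
        "recommended": {"author.name", "datePublished", "image"},
    },
}


def validate_fields(schema_type, fields, rules=RULES):
    """Report which required and recommended fields are absent for a schema type."""
    rule = rules.get(schema_type)
    if rule is None:
        return None  # no rules defined for this type
    fields = set(fields)
    return {
        "missing_required": sorted(rule["required"] - fields),
        "missing_recommended": sorted(rule["recommended"] - fields),
    }


# Made-up flattened fields for an Article found on a page
report = validate_fields("Article", {"@type", "headline", "image"})
print(report)
# {'missing_required': [], 'missing_recommended': ['author.name', 'datePublished']}
```

The fields argument is the same kind of set that flatten_schema produces in Step 5, so the check can run alongside the gap analysis.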
Conclusion (and next steps)
Structured data’s impact is no longer limited to rich results in SERPs and search engines’ knowledge graphs. It also shapes how AI systems process and understand your content. That makes schema less of a nice-to-have and more of a foundational layer of how your content gets represented, wherever it’s surfaced.
Using a script like this to run schema audits and gap analysis helps you identify missed opportunities, catch errors, and stay competitive across both traditional search and AI-driven experiences.
Your next step: run this on your top five pages by traffic or business importance, paired with their top-ranking competitors, and see what’s missing. It’s a small time investment that can surface quick wins.