How AI helped build hreflang XML sitemaps at scale

As the use of AI tools has become more common, I’ve seen amazing examples of people building tools to perform complex processes that once required significant manual effort. I’ve also seen teams use AI just because it’s available, often with little practical benefit.

My approach is to focus on AI applications that save time and solve real problems.

Recently, I needed to coordinate the SEO setup of more than a dozen websites spanning three businesses, eight regional domains, and multiple languages: three variants of English, plus Italian, Japanese, Spanish, Thai, French, and Korean.

Historically, mapping thousands of URLs to create unified hreflang XML sitemaps would have required specialized software or days of spreadsheet work. Instead, I used Google Gemini to create a custom Python script that handles the heavy lifting.

Here’s how the project evolved from an initial idea into a highly customized automation tool, and what it taught me about using AI for technical SEO.

Where AI brings great value

I mainly use AI for practical, time-saving tasks, including:

  • Generating regex patterns when I need a quick solution without researching the syntax from scratch.
  • Creating complex spreadsheet formulas for reporting workflows that would otherwise rely on manual data entry.
  • Accelerating research and planning projects that require competitive analysis across multiple business lines.
  • Building custom automation tools for repetitive SEO and data processing tasks.

The hreflang project discussed here falls into that last category.

Mapping hreflang at scale

The challenge was clear: map thousands of URLs from more than a dozen multilingual websites to accurate hreflang XML sitemaps.

Instead of tackling the project by hand, I used Google Gemini to help build a custom Python solution.

Here’s how the process unfolded.

Phase 1: Asking for a method, not just a script

A common pitfall when using generative AI for coding is asking it to run before it knows the route. If you simply type, “Write a Python script to create a hreflang sitemap,” you’ll get a generic, brittle piece of code that breaks as soon as it meets real-world data.

Instead, I started by asking for directions. I described the situation: multiple regional domains and several years of organic growth that had left mismatched URL slugs, translated subfolders, and revision years appended to URLs.

Gemini proposed a multi-step, data-driven approach:

  • Crawl websites to collect live URLs and their metadata.
  • Use Python in Google Colab to process raw data.
  • Run a direct-matching pass first to group pages with identical slugs.
  • Use a semantic AI model (like SentenceTransformers) to match ambiguous translated pages based on their titles and URLs.
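
The direct-matching pass in that list can be sketched in a few lines. This is a minimal illustration, not the script Gemini produced: the function names, the example.* domains, and the rule of comparing only the final path segment are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse


def last_slug(url: str) -> str:
    """Return the final path segment of a URL, lowercased ('' for a homepage)."""
    path = urlparse(url).path.strip("/")
    return path.rsplit("/", 1)[-1].lower() if path else ""


def group_by_slug(urls):
    """Group URLs that share a final slug; keep groups that span 2+ hosts."""
    groups = defaultdict(list)
    for url in urls:
        groups[last_slug(url)].append(url)

    matched = {}
    for slug, members in groups.items():
        hosts = {urlparse(u).netloc for u in members}
        if slug and len(hosts) > 1:
            matched[slug] = members
    return matched


# Hypothetical sample data: two English sites share a slug, the Italian one does not.
urls = [
    "https://example.com/blog/widget-guide/",
    "https://example.co.uk/blog/widget-guide/",
    "https://example.it/blog/guida-ai-widget/",
]
clusters = group_by_slug(urls)
```

Anything left outside `clusters` (here, the Italian URL) is what would be handed to the semantic matching step.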

Phase 2: Crawling and data collection

Following the strategy, I used a crawler to crawl all the regional websites. The goal was to generate a single consolidated comma-separated values (CSV) file containing live URLs, status codes, title tags, and H1s. Screaming Frog worked well for this.

Bottom line: AI output is only as good as your crawl data (remember the old saying, “garbage in, garbage out”).

The script will fail to map an exact match if the target URL in your source data returns a 404 or a 301 redirect. Filter your CSV so that it includes only live, indexable pages before running the script.
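
A filter like this can be done before the data ever reaches the matching logic. The sketch below uses only the standard library and assumes the crawl export has columns named “Address” and “Status Code”; adjust the names to whatever your crawler actually writes. The sample rows are invented.

```python
import csv
import io


def keep_live_rows(csv_text, status_column="Status Code"):
    """Return only the rows whose HTTP status is 200."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[status_column].strip() == "200"]


# Hypothetical crawl export: one live page, one redirect, one missing page.
sample = """Address,Status Code,Title 1
https://example.com/a/,200,Page A
https://example.com/old/,301,Old page
https://example.com/gone/,404,Not found
"""

live = keep_live_rows(sample)
```

With a real export you would read the file contents first and pass the text in. A status of 200 alone does not guarantee a page is indexable (noindex tags and canonicals still matter), so treat this as a first-pass filter.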

Phase 3: Google Colab sandbox

Google Colab provides a free, cloud-based Jupyter notebook environment where you can write, paste, and run Python code without worrying about local installations or environment variables. You can access it through Google Drive. I found that the free version has enough capacity to handle this project.

I uploaded the CSV to Colab, and Gemini provided the first Python script. The script used a domain-mapping dictionary to assign language codes, cleaned up URLs, and generated an XML tree. That first version was a long way from finished.
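
To make that concrete, here is a rough sketch of what a domain-mapping dictionary and XML generation step might look like. It is not the script from the project. The domains and language codes are placeholders, and `build_sitemap` is a hypothetical helper. The output follows the standard sitemap convention of listing every alternate, including the page itself, as an `xhtml:link` element under each `url` entry.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical mapping from hostname to hreflang value.
DOMAIN_LANG = {
    "example.com": "en-us",
    "example.co.uk": "en-gb",
    "example.it": "it-it",
}


def build_sitemap(clusters):
    """Build an hreflang sitemap string from lists of equivalent URLs."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")

    for cluster in clusters:
        for url in cluster:
            node = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(node, f"{{{SITEMAP_NS}}}loc").text = url
            # Every page lists all alternates in its cluster, itself included.
            for alt in cluster:
                ET.SubElement(
                    node,
                    f"{{{XHTML_NS}}}link",
                    {
                        "rel": "alternate",
                        "hreflang": DOMAIN_LANG[urlparse(alt).netloc],
                        "href": alt,
                    },
                )
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap(
    [["https://example.com/a/", "https://example.it/a/"]]
)
```

A URL whose host is missing from `DOMAIN_LANG` raises a `KeyError` here, which is deliberate: it is better to stop than to emit an entry with no language code.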

Phase 4: Iteration (where the real work happens)

If you’re expecting the AI to deliver a flawless script on the first try, you’ll be disappointed. You’ve probably heard AI tools compared to interns, meaning you need to check their work. That is very true.

The true value of AI is in iteration. As we ran the script, we encountered a number of unmatched URLs, which left pages as orphans rather than grouping them with their international counterparts.

Here’s how I iteratively coached the AI to handle the nuances of human-managed websites.

The directory expansion problem

The US site recently reorganized its blog into topic folders, while the Mexican and Italian sites had not yet been reorganized.

I gave Gemini these specific mismatched examples. It responded by adding a URL flattener function to the script, which stripped the topic folders behind the scenes so that the underlying slugs could be matched properly.
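
A URL flattener along those lines might look like the following sketch. The `/blog/` prefix, the function name, and the rule of keeping only the first and last path segments are assumptions for illustration. The flattened value is used only as a matching key, and the real URL still goes into the sitemap.

```python
from urllib.parse import urlparse


def flatten_url(url, keep_prefix="blog"):
    """Reduce /blog/<topic folders>/<slug>/ to /blog/<slug> for matching only."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > 2 and segments[0] == keep_prefix:
        # Drop the intermediate topic folders, keep the section and the slug.
        segments = [segments[0], segments[-1]]
    return "/" + "/".join(segments)


# Hypothetical URLs: the reorganized site has a topic folder, the other does not.
reorganized = flatten_url("https://example.com/blog/manufacturing/widget-guide/")
legacy = flatten_url("https://example.mx/blog/widget-guide/")
```

Both calls produce the same key, so the two pages can be compared even though their folder structures differ.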

The overly aggressive semantic trap

To prevent the AI from combining different topics, we used logic traps. At first, they were too strict. A UK article about the manufacturing sector would fail to match its Italian counterpart because the topic wording was slightly different.

I instructed Gemini to loosen the traps for general industry terms while keeping a tight grip on important acronyms (like “SEO” vs. “SEM”). This gave the AI the breathing room it needed to match creative translations.
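
One way to picture such a trap is a small guard that runs before a semantic match is accepted. This is an assumed design, not the project’s actual logic. The acronym list and the rule (reject only when both titles contain critical acronyms and they differ) are hypothetical.

```python
import re

# Hypothetical list of acronyms that must never be confused with one another.
CRITICAL_ACRONYMS = {"seo", "sem"}


def acronyms_in(title):
    """Return the critical acronyms that appear as whole words in a title."""
    tokens = set(re.findall(r"[a-z0-9]+", title.lower()))
    return tokens & CRITICAL_ACRONYMS


def passes_acronym_guard(title_a, title_b):
    """Block a match only when both titles name different critical acronyms."""
    a, b = acronyms_in(title_a), acronyms_in(title_b)
    return not (a and b and a != b)


same_topic = passes_acronym_guard("SEO tips for manufacturers",
                                  "Consigli SEO per i produttori")
conflict = passes_acronym_guard("SEO tips", "SEM tips")
```

Titles with no critical acronyms pass straight through, which is the “loosened” behavior for general industry wording.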

The translated slug epiphany

A big breakthrough came when we were reviewing orphaned pages on the Mexican blog. For example, the Spanish URL /detras-de-escenas-historias... was a direct translation of the English /behind-the-scenes-stories... I pointed this out to Gemini.

Instead of forcing me to hard-code a huge manual mapping table, Gemini updated the script to create a “Combined Semantic Signature.” It dynamically translates key phrases in slugs, which bridged the language gap for the semantic matching model and linked dozens of orphaned pages almost instantly.
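
The article does not show how the signature is built, so the following is only a guess at the general idea: translate known phrases in the slug, then combine the result with the page title into one string for the matching model. The phrase table, function names, and separator are all hypothetical, and the example uses only the visible part of the truncated slug above.

```python
import re

# Hypothetical phrase table; a real one might be generated rather than hand-written.
PHRASES = {
    "detras de escenas": "behind the scenes",
    "historias": "stories",
}


def translate_slug(slug):
    """Replace known phrases in a hyphenated slug with their English equivalents."""
    text = slug.replace("-", " ").lower()
    # Longest phrases first so multi-word entries win over single words.
    for phrase in sorted(PHRASES, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(phrase)}\b", PHRASES[phrase], text)
    return text


def combined_signature(slug, title):
    """Join the translated slug and the title into one string for matching."""
    return f"{translate_slug(slug)} | {title.strip().lower()}"


translated = translate_slug("detras-de-escenas-historias")
```

The combined string, rather than the raw slug, would then be what gets compared against the English pages.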

The project reinforced a simple lesson: AI works best when treated as a collaborator rather than a replacement.

  • Be the strategist, let the AI be the coder: Don’t just ask for the end product. Discuss architecture, edge cases, and logic first. Treat AI like a junior developer who needs clear architectural direction.
  • Give concrete examples: If the script fails, you can’t just say, “It’s broken.” In this project, I supplied the specific URLs that failed along with the URLs that should have matched them, or groups of URLs that had been mismatched. AI needs tangible patterns to correct itself.
  • Adopt an iterative loop: Expect to run the code, identify the anomalies, and feed them back into the prompt. Each iteration makes the tool smarter.
  • Use Google Colab: You don’t need to be a Python expert to use Python for SEO. Colab bridges the technical gap, allowing you to use complex data libraries directly in your browser.

By the end of the project, we had a robust, highly customizable Python script that could process a large CSV and generate a cross-referenced XML sitemap in minutes.

AI will not replace technical SEOs anytime soon. However, SEOs who know how to work with AI to create custom, scalable, and useful tools will have a clear advantage.

Contributing writers are invited to create content for Search Engine Land and are selected for their expertise and contribution to the search community. Our contributors work under the supervision of the editorial staff, and contributions are assessed for quality and relevance to our readers. Search Engine Land is owned by Semrush. The contributor has not been asked to make any direct or indirect mention of Semrush. The opinions they express are their own.
