The Trouble With Private Company Identifiers (And What Can Save Us)

Since I started in Alternative Data nearly a decade ago, a few hard problems have persisted no matter how much capital, compute, or clever engineering the industry has thrown at them. My personal top three:

1. Translating foreign-language text into English (vs. processing it in its native language)
2. Brand-to-ticker mapping at true scale (far harder than it sounds)
3. Private company identifiers

At BattleFin’s Alt Data Discovery Day 2025 in London last week, #3 was front and center again. One VC firm mentioned they’ve taken matters into their own hands by using company URLs + LinkedIn profiles as their primary private-company identifiers.

And honestly, for VCs looking for fast-moving startups, this is chef’s kiss practical. URLs and LI profiles are often the first signals a young company emits into the digital world. But scaling that approach across the broader Alternative Data and Gen-AI ecosystem is… complicated. Before we get into that, a quick acknowledgment:

Alex Izydorczyk’s 2024 Magis essay “Unfair Data Moats & Regulatory Capture” remains one of the sharpest perspectives on the role identifiers play in market structure. I won’t recap his argument here, but it’s worth keeping in mind that identifiers affect far more than metadata - they shape the economics of data itself.

Why Private Company Identifiers Are So Hard

Public companies are easy. We have tickers, FIGIs, LEIs (increasingly common), CIKs, RICs, ISINs - pick your flavor. They all map back to the same legal entity with ~98–100% consistency.

But private companies? Wild West.

You’ve got: DUNS (legacy, expensive, patchy), BvD IDs (proprietary), Crunchbase UUIDs (limited to their universe), PitchBook IDs (excellent but closed), Clearbit domain IDs (works for digital-first companies), LinkedIn IDs (semi-stable but subject to LI’s rules). And none of them are universal. Which means: no global join key.

Gen AI companies feel this pain even more. It’s hard to build embeddings, vectors, and knowledge graphs when the underlying entity resolution is unreliable.

Why URLs Are Attractive as an Identifier

URLs are: global, human-readable, relatively stable, trivially extractable, often unique in practice, already part of many datasets.

And from a machine-learning perspective, the root domain becomes a natural anchor for: scraping, entity linking, LLM retrieval, similarity learning, and feature engineering.
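To make "root domain as anchor" concrete, here is a minimal Python sketch of the normalization step that usually comes first. The function name normalize_domain is my own, and the rules (lowercase the host, drop scheme, port, path, and a leading "www.") are assumptions rather than any standard. It deliberately does not collapse subdomains; getting a true registrable domain generally needs the Public Suffix List, which this sketch leaves out.

```python
from urllib.parse import urlsplit


def normalize_domain(url: str) -> str:
    """Return a lowercase host with scheme, port, path and leading 'www.' removed.

    Hypothetical helper for illustration; it does not reduce subdomains
    to a registrable root domain.
    """
    candidate = url.strip()
    if "://" not in candidate:
        # Without a scheme, urlsplit would treat the host as a path,
        # so prefix '//' to mark it as a network location.
        candidate = "//" + candidate
    host = urlsplit(candidate).hostname or ""  # .hostname is already lowercased
    if host.startswith("www."):
        host = host[len("www."):]
    return host


if __name__ == "__main__":
    for raw in ["https://www.KFC.com/menu?x=1", "WWW.TacoBell.com", "http://yum.com:8080/"]:
        print(raw, "->", normalize_domain(raw))
```

Even this small step matters: if two datasets store the same site in different forms, the join silently fails unless both sides are normalized the same way.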

If you’re a VC, using URLs as the primary join key is honestly brilliant.
Startups launch websites before they launch products. But… URLs crack quickly at enterprise scale.

The Brand / Umbrella Company Problem (Where URLs Break)

Let’s use an example for illustrative purposes. Take Yum! Brands. It has a master corporate site at yum.com. It also has subsidiary restaurant brands with their own sites, like kfc.com, tacobell.com and pizzahut.com (with regional domain variations).

Think about this at the top level for some of the biggest CPG brands out there and you get a headache.

From a URL-only perspective, you'd think these brands are completely unrelated entities. But in reality, they roll up to the same parent.

Now imagine using solely URLs as identifiers. How do you know KFC is part of Yum? How do you prevent double counting? How do you handle regional sites? How do you track restructurings or divestitures?

It gets messy fast. This same issue appears in private markets. Two good examples:

Cargill (cargill.com) with sub-brands cargillag.com, provimi.com (animal nutrition), purakos.com (food ingredients) etc.
Mars Inc. (mars.com) with sub-brands mms.com, kindsnacks.com, pedigree.com, whiskas.com, etc.

Now, if you’re mapping URLs as entities, do these represent unique companies? Divisions? Brands? Operating subsidiaries? Marketing wrappers? Without a hierarchy layer, URL-mapping breaks.
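As a rough illustration of what a hierarchy layer buys you, here is a minimal Python sketch that rolls domain-level values up to an ultimate parent so brands are not double counted. The parent map only contains the Yum! Brands domains mentioned above; the function names, the example.com row, and the numeric values are made up for the example and do not come from any real dataset.

```python
from collections import defaultdict

# Child domain -> parent domain, taken from the Yum! Brands example in the text.
PARENT_OF = {
    "kfc.com": "yum.com",
    "tacobell.com": "yum.com",
    "pizzahut.com": "yum.com",
}


def ultimate_parent(domain, parent_of=PARENT_OF):
    """Follow parent links until a domain with no parent is reached."""
    seen = set()
    current = domain
    # The 'seen' set guards against accidental cycles in the mapping.
    while current in parent_of and current not in seen:
        seen.add(current)
        current = parent_of[current]
    return current


def roll_up(values_by_domain, parent_of=PARENT_OF):
    """Sum per-domain values under each ultimate parent domain."""
    totals = defaultdict(float)
    for domain, value in values_by_domain.items():
        totals[ultimate_parent(domain, parent_of)] += value
    return dict(totals)


if __name__ == "__main__":
    # Illustrative numbers only.
    observed = {"kfc.com": 10, "tacobell.com": 5, "yum.com": 1, "example.com": 2}
    print(roll_up(observed))
```

The hard part in practice is not the roll-up logic but maintaining the parent map: deciding whether a domain is a brand, a division, or a separate legal entity is exactly the judgment call the questions above describe.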

So… What About LinkedIn?

LinkedIn does a better job than URLs at encoding corporate hierarchy… sometimes. You’ll often find a master company page (e.g., “Mars”) and separate LI pages for business units (“Mars Petcare”). Sometimes companies create separate LI pages for brands (“Royal Canin”), sometimes they create no brand pages at all, and sometimes they end up with multiple pages for the same company because of regional entities.

As you can see, there is no standard. But LinkedIn has two key advantages:

Stable numeric company IDs
Often includes industry, employee count, and location metadata

Which are both useful. However, LinkedIn breaks in a few key areas: 1) foreign subsidiaries may not be listed, 2) entities may have multiple duplicate pages, 3) some brands intentionally avoid LinkedIn, and 4) some companies consolidate pages every few years. So LinkedIn is a strong input, but not a universal identifier.

Why This Matters Specifically to Alternative Data + GenAI

Since URLs and LinkedIn pages each fall short on their own for private company identification, a clever individual might come to the realization: you could use both, with a combined URL + LinkedIn mapping.

And while this might not be perfect, it offers a few important things. First, it’s a semi-open, semi-standard, semi-reliable way to map companies. Not only that, but both are things LLMs can interpret natively and that most datasets already contain.

For LLMs, URLs and LinkedIn links become anchors for retrieval, keys for embeddings, identifiers for clustering, and ultimately ways to connect unstructured and structured data.

In other words, combining the two might be the best practical, not perfect, solution available today. But to make this work responsibly across the ecosystem, we need a few important things:

brand-to-parent company rollups
explicit subsidiary mappings
canonical root-domain selection rules
LinkedIn-to-URL crosswalk tables
version control as companies restructure
industry collaboration, not proprietary moats

This is where the industry tends to stall.
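To show what two items on the list above (the LinkedIn-to-URL crosswalk and version control) might look like together, here is a minimal Python sketch of a crosswalk row with validity dates and an as-of lookup. Everything in it is hypothetical: the CrosswalkRow fields, the lookup function, the "li-0001" identifier, and the .example domains are placeholders, not real LinkedIn IDs or real companies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CrosswalkRow:
    linkedin_id: str               # placeholder ID, not a real LinkedIn company ID
    domain: str                    # normalized company domain
    parent_domain: Optional[str]   # None when the entity has no known parent
    valid_from: date               # first day this mapping applies
    valid_to: Optional[date] = None  # exclusive end date; None means still current


def lookup(rows, linkedin_id, as_of):
    """Return the row valid for linkedin_id on the as_of date, or None."""
    for row in rows:
        if row.linkedin_id != linkedin_id:
            continue
        if row.valid_from <= as_of and (row.valid_to is None or as_of < row.valid_to):
            return row
    return None


# A made-up divestiture: the brand changes parent on 2023-01-01.
ROWS = [
    CrosswalkRow("li-0001", "brand.example", "oldparent.example",
                 date(2020, 1, 1), date(2023, 1, 1)),
    CrosswalkRow("li-0001", "brand.example", "newparent.example",
                 date(2023, 1, 1)),
]

if __name__ == "__main__":
    print(lookup(ROWS, "li-0001", date(2021, 6, 1)).parent_domain)
    print(lookup(ROWS, "li-0001", date(2024, 6, 1)).parent_domain)
```

Keeping the old row instead of overwriting it is the design choice that matters here: a backtest run as of 2021 should still see the parent the brand had at that time.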

Are URLs the Future Identifier?

So all of that said, here’s my honest view:

Company URLs (domains) are the best open, universal, machine-friendly identifier we have today for private companies. LinkedIn is the best metadata companion identifier. Neither is perfect; both require hierarchy layers to prevent chaos.

But unlike proprietary systems (DUNS, CUSIP, etc.), URLs and LinkedIn IDs come with lower friction, higher openness, and better alignment with how Gen-AI models understand the world. Are they the future? Probably not. But they may be the most practical bridge we have to get there.

Across Alternative Data and GenAI, the debate about private-company identifiers isn’t academic. It’s the backbone of many critical things: entity resolution, model accuracy, signal quality, compliance, customer trust, data interoperability, and more.

And until the industry aligns on a standard (open, global, and hierarchy-aware) we’re all building on slightly unstable ground. If URLs and LinkedIn IDs end up being the scaffolding we use to fill that gap, well… it wouldn’t be the worst thing.

And for now? It’s better than waiting another decade for the perfect system that still hasn’t arrived.

Chef’s kiss.
Let’s build something that works.

Doug Hopkins

(Have thoughts or want to discuss identifiers, alternative data, or partnerships?
You can reach Douglas at: doug@babbl.dev or on LinkedIn)

