The training data problem nobody talks about

There is a rough hierarchy of what makes a large language model good: architecture, scale, alignment, and data. Architecture and scale get most of the research attention. Alignment gets most of the press. Data gets treated as a solved problem: crawl the web, filter out obvious junk, train.

But it is not a solved problem. The difference between a model that produces confident, accurate, domain-specific responses and one that produces fluent but hollow text is almost always the quality of its training data. And most training pipelines have no way to distinguish between a post written by a leading expert and a post written by someone who read a few Wikipedia articles. Both end up in the same crawl dump, weighted equally.

Authority signals are a way to break that tie.

What makes training data high-quality?

The academic literature on data quality for LLMs focuses on a few proxies: perplexity filtering, deduplication, toxicity removal, and language identification. These are all useful for removing obviously bad data. None of them tell you whether the data is accurate, expert, or trustworthy.
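These proxies are mechanical enough to sketch. The block below is a toy stand-in: a hash-based exact-deduplication pass, an ASCII-ratio heuristic in place of a real language-ID model, and a placeholder word blocklist in place of a trained toxicity classifier (real pipelines would also run a perplexity model). The thresholds and blocklist are illustrative assumptions. What the sketch makes concrete is the gap this section describes: nothing in such a pipeline looks at who wrote the text.

```python
import hashlib

BLOCKLIST = {"spamword1", "spamword2"}  # placeholder for a toxicity classifier

def is_english_like(text: str) -> bool:
    """Crude language check: mostly-ASCII text passes (stand-in for real language ID)."""
    if not text:
        return False
    return sum(c.isascii() for c in text) / len(text) > 0.9

def is_clean(text: str) -> bool:
    """Crude toxicity check: no word appears on the blocklist."""
    return not (set(text.lower().split()) & BLOCKLIST)

def filter_corpus(records: list[str]) -> list[str]:
    """Dedup + language ID + blocklist. Note what is never checked:
    accuracy, expertise, or trustworthiness of the author."""
    seen: set[str] = set()
    kept = []
    for text in records:
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:  # exact-duplicate removal
            continue
        seen.add(digest)
        if is_english_like(text) and is_clean(text):
            kept.append(text)
    return kept

corpus = ["A post about immunology.", "A post about immunology.", "spamword1 junk"]
print(filter_corpus(corpus))  # → ['A post about immunology.']
```

Both immunology posts from the example below would sail through every one of these checks.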

Consider two posts about immunology. One is written by an immunologist with 20 years of research experience, 30,000 followers on X, and 80 published papers. The other is written by a health blogger paraphrasing the first. Both pass perplexity filters. Both are in English. Neither contains toxicity. They look identical to a standard data pipeline. But one is primary signal. The other is secondary noise. At scale, this distinction compounds: a model trained predominantly on secondary content learns to sound confident about things it is effectively making up.

Human authority is a direct proxy for primary signal. If you know who the verified experts are on a given topic, you can preferentially include their output and give it higher weight during training.

The dead internet problem for training pipelines

There is a second, increasingly urgent reason to care about this. A growing percentage of web content is AI-generated. LLMs trained on data that includes large amounts of AI-generated text develop compounding errors: they learn stylistic patterns of AI output, reinforce factual drift from previous model generations, and lose the grounding that comes from genuinely expert human writing.

The Human Authority Index is, by construction, an index of verified humans. Each profile corresponds to a real person with a demonstrated track record in a specific domain. Content attributed to those names and handles is, at minimum, human-produced. In an environment where distinguishing human from AI-generated content becomes harder with every model generation, a verified human signal is increasingly rare and valuable.

A practical pipeline

The approach fits naturally into any web crawl pipeline. You already have content with author names and handles attached. The three steps are: verify those authors against the authority index, tier your dataset by the results, and apply weighted sampling during training.

Step 1: verify author names and handles from your crawl

After crawling, collect all unique author names and handles from your content and pass them to Amygdala's /match/ endpoint in a single request. Names are the most natural input — "Andrej Karpathy", "Katharine Hayhoe", "Paul Krugman" — and name-based matching is coming soon as the primary way to use the API. Social media handles (X/Twitter, Instagram, YouTube, and others) are supported today. The API returns only entries that belong to verified authorities; unrecognised names or handles are simply absent from the response. No looping, no topic taxonomy required.

Python: verify crawled author names and handles
import requests

AMYGDALA_API_KEY = "amyg_..."

def match_names_and_handles(names_or_handles: list[str]) -> dict:
    """Verify a batch of author names or handles against the authority index.
    Names (e.g. 'Andrej Karpathy') are the primary input — coming soon.
    Social media handles (X/Twitter, Instagram, YouTube, and others) are supported today.
    Returns a dict keyed by name/handle — only verified authors are included."""
    resp = requests.get(
        "https://api.amygdala.eu/api/v1/match/",
        params={"handles": names_or_handles},
        headers={"Authorization": f"Bearer {AMYGDALA_API_KEY}"},
        timeout=30,  # fail fast instead of hanging on network issues
    )
    resp.raise_for_status()
    return {r["handle"]: r for r in resp.json().get("results", [])}

# your_dataset: list of {"text": str, "source_url": str, "author_name_or_handle": str}
# this is your crawled web content — each record has the author's name or handle attached

# Names are the most natural input and will be the primary way to use this API.
example_handles = [
    # name-based matching (coming soon):
    "Andrej Karpathy",
    "Yann LeCun",
    "Katharine Hayhoe",
    "Richard Thaler",
    "Michael Mann",
    "Paul Krugman",
    "Peter Hotez",
    "Demis Hassabis",
    # social media handles (supported today):
    "karpathy",
    "ylecun",
    "KHayhoe",
    "R_Thaler",
]

# In practice, extract names and handles from your crawled records:
all_names_and_handles = list({
    record["author_name_or_handle"]
    for record in your_dataset
    if record.get("author_name_or_handle")
})

# Verify all names and handles in a single API call
# entries that don't match a verified authority are simply absent from the result
verified_authors = match_names_and_handles(all_names_and_handles)

print(f"Crawled authors:  {len(all_names_and_handles):,}")
print(f"Verified authors: {len(verified_authors):,}")

Step 2: tier your dataset by verification result

With the verified author set in hand, a single pass over your dataset is enough to assign each record to a quality tier. Records whose author matched a top-ranked authority go to tier 1, other verified authors to tier 2, and everything else to the general pool.

Python: tier dataset by authority rank
# verified_authors: dict keyed by name/handle, built in previous step
# your_dataset:     list of {"text": str, "source_url": str, "author_name_or_handle": str}

# Split dataset into tiers based on authority rank
tier_1  = []  # verified authorities, rank 1-5
tier_2  = []  # verified authorities, rank 6+
general = []  # unverified authors

for record in your_dataset:
    authority = verified_authors.get(record.get("author_name_or_handle", ""))
    if authority and authority["rank"] <= 5:
        tier_1.append(record)
    elif authority:
        tier_2.append(record)
    else:
        general.append(record)

print(f"Tier 1 (top authorities): {len(tier_1):,} records")
print(f"Tier 2 (authorities):     {len(tier_2):,} records")
print(f"General:                  {len(general):,} records")

Step 3: apply tiered sampling

Rather than filtering out non-authority content entirely, use tiered sampling to over-represent authority content in each training batch. This preserves diversity while increasing the density of high-signal data. The right weights depend on your domain and the proportion of authority content in your corpus.

Python: weighted training batch sampler
import random

def weighted_sample(tier_1, tier_2, general, total: int, weights=(0.3, 0.3, 0.4)):
    """
    Sample a training batch with explicit tier weights.
    Default: 30% top authorities, 30% authorities, 40% general.
    Adjust weights based on your quality/diversity tradeoff.
    """
    n1 = int(total * weights[0])
    n2 = int(total * weights[1])
    n3 = total - n1 - n2

    sample = (
        random.sample(tier_1, min(n1, len(tier_1))) +
        random.sample(tier_2, min(n2, len(tier_2))) +
        random.sample(general, min(n3, len(general)))
    )
    random.shuffle(sample)
    return sample

batch = weighted_sample(tier_1, tier_2, general, total=100_000)
print(f"Training batch size: {len(batch):,}")

Beyond names and handles: using authority data as a quality signal

If your dataset does not include author names or handles, authority data can still be used as a quality signal in other ways:

  • Domain-level filtering. Use the authority index to identify the top sources (websites, publications, forums) where verified experts publish. Preferentially crawl and include those sources in your training set, even without per-record attribution.
  • RLHF annotator selection. Knowing who the domain experts are helps you select better human feedback providers. An annotator who is a verified authority in immunology will give more reliable feedback on immunology-related model outputs than a general annotator.
  • Evaluation set construction. Authority-produced content makes better benchmark data. Content written by verified experts sets a higher bar for model evaluation because it contains the kind of precise, domain-specific knowledge that distinguishes a genuinely capable model from a fluent but shallow one.
  • Synthetic data grounding. If you use an LLM to generate synthetic training data, grounding the generation prompt with verified expert profiles produces higher-quality synthetic output. Match the names or handles of known domain experts and inject their profiles as context for the generator.
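The first of these, domain-level filtering, can be sketched under one assumption: that each verified profile exposes a list of URLs where the author publishes. The `source_urls` field below is hypothetical, not confirmed by the matching endpoint above; the domain-counting logic is the point.

```python
from collections import Counter
from urllib.parse import urlparse

def expert_domains(verified_authors: dict, top_n: int = 50) -> list[str]:
    """Rank web domains by how many verified-expert URLs point at them.
    Assumes each profile carries a hypothetical `source_urls` list."""
    counts = Counter()
    for profile in verified_authors.values():
        for url in profile.get("source_urls", []):
            counts[urlparse(url).netloc] += 1
    return [domain for domain, _ in counts.most_common(top_n)]

# Example with two mocked profiles
mock = {
    "karpathy": {"source_urls": ["https://karpathy.github.io/post", "https://arxiv.org/abs/1"]},
    "ylecun":   {"source_urls": ["https://arxiv.org/abs/2"]},
}
print(expert_domains(mock, top_n=2))  # → ['arxiv.org', 'karpathy.github.io']
```

The resulting domain list can seed a crawl allowlist, giving you an expert-density prior even for pages with no per-record author attribution.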

What this does not solve

Authority signals are a quality filter, not a quality guarantee. A verified expert can write an off-topic or low-quality post. The authority rank reflects their standing in their domain, not the quality of every piece of content they have ever produced. Tiered sampling improves the average quality of your training data; it does not replace content-level quality filters.
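One minimal way to combine the two is to gate the tiering loop from Step 2 with a per-record content check. The `passes_content_filter` below is a placeholder, a simple word-count floor standing in for whatever perplexity or quality classifiers your pipeline already runs; the point is the ordering, in which content-level filtering applies to everyone, verified experts included.

```python
def passes_content_filter(text: str, min_words: int = 50) -> bool:
    """Placeholder content-level check: a word-count floor. A real pipeline
    would run perplexity and quality classifiers here instead."""
    return len(text.split()) >= min_words

def tier_with_content_filter(records: list[dict], verified_authors: dict):
    """Authority tiering gated per record: a verified expert's post still
    has to pass the content check to reach tier 1 or tier 2."""
    tier_1, tier_2, general = [], [], []
    for record in records:
        if not passes_content_filter(record["text"]):
            continue  # drop low-quality content regardless of who wrote it
        authority = verified_authors.get(record.get("author_name_or_handle", ""))
        if authority and authority["rank"] <= 5:
            tier_1.append(record)
        elif authority:
            tier_2.append(record)
        else:
            general.append(record)
    return tier_1, tier_2, general
```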

The approach also does not apply to domains where the Amygdala index has sparse coverage. The index is deepest in technology, science, finance, and media. Niche academic sub-fields or non-English domains may have limited coverage.

The compounding return

The case for authority-weighted training data is strongest when you zoom out. A single training run with slightly better data produces a slightly better model. But models trained on better data produce better synthetic data, which trains better successor models. The compounding effect of starting with higher-quality signal is what separates model families over multiple generations. Authority weighting is a low-friction way to move the starting point.

Try the Amygdala Authority Index

$50 in free credits. No credit card required.