The internet ran out of data. Now what?

April 24, 2026

The AI Race Just Changed — And These Are the New Oil Fields

Mining the Human Mind | Active Trader Editorial



The Machines Are Hungry. The Internet Is Empty.

For three years, the AI narrative has been about hardware. Who has the most GPUs. Who owns the most data centers. Who can cram the most transistors onto a chip. That trade worked — spectacularly — for NVDA, AMD, and the hyperscalers that rode the infrastructure wave. But the next leg of this race is being decided somewhere else entirely. In archives. In forum threads. In credit files. In 175-year-old newspaper morgues.

We’ve hit Peak Text.

That’s not hyperbole. The Epoch AI research group, one of the more credible shops tracking this, projects that tech companies will exhaust the supply of publicly available training data for AI language models by roughly the turn of the decade, somewhere between 2026 and 2032, if current training trends continue. The squeeze arrives first at the high-quality end of the spectrum: researchers have predicted that machine learning datasets may deplete all “high-quality language data” as early as 2026.

The implication for traders isn’t academic. It’s structural. The companies that spent the last decade accumulating proprietary, specialized, hard-to-replicate data are now sitting on something the entire AI industry desperately needs and can’t manufacture. Today’s largest AI models have already been trained on almost all of the free data available on the internet and in open-source datasets, and they’re running out of new material to learn from. That changes the entire value chain.



Key Takeaways

  • Researchers estimate the stock of public human-generated text at roughly 300 trillion tokens — and if trends continue, language models will completely exhaust this supply between 2026 and 2032.
  • The amount of text data consumed by large models has been doubling at extraordinary rates, making the scarcity problem structural, not cyclical.
  • Reddit’s IPO prospectus disclosed over 1 billion posts and more than 16 billion comments, with data licensing arrangements totaling $203 million in aggregate contract value entered in January 2024.
  • Reddit Q2 2025 revenue hit $500 million — a 78% year-over-year increase — with advertising revenue surging 84% and net income reaching $89 million, its most profitable quarter to date.
  • Amazon reached a multi-year agreement to pay The New York Times between $20 million and $25 million annually for the right to license the publisher’s content for AI training and Alexa product responses.
  • Equifax reported record Q1 2026 revenue of $1,648.9 million, up 14% year-over-year, with the company explicitly calling its proprietary dataset its primary competitive moat in an AI-driven world.
  • The global AI training dataset market is projected to reach $16.32 billion by 2033, growing at a 22.6% CAGR from 2026.
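The exhaustion window in the takeaways is, at heart, a compounding calculation: a fixed stock of text and a training appetite that multiplies every year. A minimal sketch of that arithmetic follows. The 300 trillion token stock comes from the estimate above; the starting training-set size and the growth multiples are hypothetical placeholders, chosen only to show how sensitive the runway is to the growth rate. They are not figures from the research.

```python
import math


def years_to_exhaustion(stock_tokens: float, current_tokens: float, annual_growth: float) -> float:
    """Years until a training set that multiplies by `annual_growth` each year
    reaches the total stock of available tokens."""
    if annual_growth <= 1.0:
        raise ValueError("annual_growth must be greater than 1.0")
    if current_tokens >= stock_tokens:
        return 0.0
    return math.log(stock_tokens / current_tokens) / math.log(annual_growth)


STOCK_TOKENS = 300e12           # ~300 trillion tokens, per the estimate cited above
ASSUMED_CURRENT_TOKENS = 15e12  # HYPOTHETICAL starting training-set size (assumption)

# HYPOTHETICAL yearly growth multiples for training-set size (assumptions)
for growth in (1.5, 2.0, 3.0):
    years = years_to_exhaustion(STOCK_TOKENS, ASSUMED_CURRENT_TOKENS, growth)
    print(f"{growth:.1f}x per year -> about {years:.1f} years of runway")
```

The point of the exercise is the shape, not the exact numbers: because the runway depends on the logarithm of the stock, even a much larger stock buys only a few extra years when consumption keeps multiplying.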

Market Context: From Big Compute to Big Context

The macro setup here matters. We’re operating in an environment where AI capex is still running at historic levels — the hyperscalers alone have committed hundreds of billions to infrastructure buildout through 2026 and beyond. The global AI sector reached new heights in 2025, with over $1.5 trillion in projected AI spending and historic valuations for leaders like OpenAI ($500 billion) and Anthropic ($183 billion). That money has to go somewhere productive. And increasingly, productivity in AI is bottlenecked not by compute — but by context. By signal. By the irreplaceable grain of human-generated information that no model can synthesize from scratch without degrading in quality.

The synthetic data market is one attempted workaround. It is estimated at $710 million in 2026, up from $510 million in 2025, with projections reaching $3.67 billion by 2031 at a 38.96% CAGR. In March 2025, Nvidia acquired synthetic data firm Gretel and folded its team into the chip giant’s growing suite of cloud-based generative AI services for developers. But here’s the part people skip: synthetic data has a ceiling. You can generate more of it, but you can’t make it richer than the underlying real-world signal it was derived from. A model trained only on synthetic data is essentially learning to mimic a mimic. Model collapse is a real risk, and the research community knows it.
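For readers who want to sanity-check the growth math, the compound annual growth rate can be recomputed from the market-size figures quoted above. This is a quick sketch using the standard CAGR formula; the function name is ours, and small differences from the published rate are expected because the endpoints are rounded.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Synthetic data market sizes quoted in the article (USD)
SIZE_2025 = 510e6
SIZE_2026 = 710e6
SIZE_2031 = 3.67e9

print(f"2025 -> 2026 growth: {cagr(SIZE_2025, SIZE_2026, 1):.1%}")
print(f"2026 -> 2031 CAGR:   {cagr(SIZE_2026, SIZE_2031, 5):.1%}")
```

Run on the rounded endpoints, the five-year rate lands within a tenth of a point of the quoted 38.96%, so the projection is at least internally consistent.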

Stanford University’s AI Index 2025 Report sounds a clear alarm that the internet’s treasure trove of training data is being rapidly depleted. Meanwhile, MIT’s Data Provenance Initiative has documented a “dramatic drop in content made available” as publishers and platforms increasingly restrict AI companies from scraping their data. That restriction is not altruistic. It’s commercial. The platforms realized what they had. And now they’re charging for it.

That’s the transition we’re tracking. From Big Compute to Big Context. From the GPU race to the data licensing war. From infrastructure plays to what you might call Proprietary Knowledge Moats — companies whose archives, communities, and databases represent decades of irreplaceable human signal that AI developers cannot obtain anywhere else.


Sector Breakdown: Where the Capital Is Rotating

The traditional media and data analytics sectors are being re-rated in real time, and institutional flows are beginning to reflect it — though the market hasn’t fully priced the structural shift yet. The licensing economy is still in its early innings. What’s clear is that the companies positioned to benefit share one defining characteristic: they hold data that is live, growing, and non-replicable.

Slight tangent, but it matters — the music industry went through exactly this dynamic in the early 2000s. The same question of who owned the underlying rights, and what those rights were worth in a new distribution environment, took nearly a decade to fully resolve. Reddit joined Yahoo, Medium, and others in backing Really Simple Licensing (RSL), which aims to standardize how AI developers license content and compensate publishers — modeled after music industry frameworks like ASCAP and BMI, creating a clearinghouse where publishers can set payment terms and attribution requirements. The analogy is apt. We’re early in the royalty negotiation phase of an industry-wide data licensing cycle.

Three sectors are at the center of this shift: social media (community-generated conversational data), premium journalism (curated, authoritative long-form content), and financial data analytics (regulated, proprietary non-public data). Each has a different monetization profile and risk structure — which matters for how you position.



Reddit (NYSE: RDDT) — The Conversational Archive

Reddit is the most direct and volatile expression of this thesis. The platform’s value to AI developers isn’t just size — it’s the nature of the content. Reddit’s data APIs provide real-time access to evolving and dynamic topics such as sports, movies, news, fashion, and the latest trends. The company believes Reddit’s massive corpus of conversational data and knowledge will continue to play a central role in training and improving large language models.

The licensing machine is already running. In February 2024, Reddit announced a content-licensing deal with Google for $60 million a year, giving Google access to real-time content from Reddit’s vast user-authored forums. A few months later, Reddit struck a similar partnership with OpenAI estimated to be worth around $70 million a year. That’s roughly $130 million annually in high-margin data licensing revenue — and it was just the opening bid.

Operationally, the business is accelerating well beyond data deals alone. Reddit’s Q2 2025 earnings showed revenue reaching $500 million — a substantial 78% year-over-year increase. Advertising revenue surged 84% year-over-year. The company also achieved net income of $89 million, its most profitable quarter to date, with daily active unique visitors increasing 21% to 110.4 million. AI data licensing agreements contributed $35 million — a 24% increase from Q2 2024. For Q3 2025, Reddit guided for revenue between $535 million and $545 million, representing 54%–56% year-over-year growth, with adjusted EBITDA projected to grow 100%–110% year-over-year.
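Two numbers fall out of the Q2 2025 figures that the headline growth rates hide: how small licensing still is relative to total revenue, and what the year-ago base must have been. A minimal sketch, using only the figures quoted above (in millions of dollars); the helper names are ours.

```python
def prior_period(current: float, yoy_growth: float) -> float:
    """Back out the year-ago value implied by a current value and its YoY growth."""
    return current / (1.0 + yoy_growth)


def share(part: float, total: float) -> float:
    """Fraction of `total` represented by `part`."""
    return part / total


# Reddit Q2 2025 figures quoted in the article (USD millions)
REVENUE = 500.0
REVENUE_YOY = 0.78
LICENSING = 35.0
LICENSING_YOY = 0.24

print(f"Licensing share of revenue: {share(LICENSING, REVENUE):.1%}")
print(f"Implied Q2 2024 revenue:    ${prior_period(REVENUE, REVENUE_YOY):.1f}M")
print(f"Implied Q2 2024 licensing:  ${prior_period(LICENSING, LICENSING_YOY):.1f}M")
```

Licensing works out to roughly 7% of the quarter’s revenue and is growing more slowly than advertising, which is why the thesis depends on licensing pricing rather than on the current run rate.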

The stock is currently trading near $153 — down hard from its all-time high of $282.95 reached in September 2025. Wall Street analyst sentiment shows 21 Buy, 10 Hold, and 1 Sell ratings from 32 analysts covering the stock, with a consensus price target near $231. The October 2025 selloff — driven by fears that ChatGPT was reducing its citation of Reddit content — appears to have been an overreaction that institutional buyers used as an entry. Reddit’s proactive strategy in securing substantial data licensing agreements has played a pivotal role in reassuring investors, with the $60 million Google deal alongside the similarly lucrative OpenAI arrangement transforming what initially appeared to be a threat into a significant new revenue stream.
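The price levels above translate into three percentages worth keeping in view: the drawdown from the high, the gain needed to get back there, and the upside implied by the consensus target. A short sketch using the quoted prices; the function names are ours.

```python
def drawdown(peak: float, current: float) -> float:
    """Fractional decline from the peak to the current price."""
    return 1.0 - current / peak


def upside(current: float, target: float) -> float:
    """Fractional gain required to move from the current price to the target."""
    return target / current - 1.0


# RDDT price levels quoted in the article (USD)
CURRENT = 153.0
ALL_TIME_HIGH = 282.95
CONSENSUS_TARGET = 231.0

print(f"Drawdown from high:     {drawdown(ALL_TIME_HIGH, CURRENT):.1%}")
print(f"Gain to reclaim high:   {upside(CURRENT, ALL_TIME_HIGH):.1%}")
print(f"Upside to consensus PT: {upside(CURRENT, CONSENSUS_TARGET):.1%}")
```

The asymmetry is the useful part: a drawdown of about 46% requires a gain of roughly 85% to reverse, while the consensus target sits about halfway back.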

Forward valuation is high by traditional media standards. Under discounted cash flow frameworks, Reddit’s free cash flow is projected to rise to roughly $4.75 billion by 2035, implying very strong long-term growth as the platform scales advertising and data licensing. The bull case hinges on dynamic pricing — Reddit executives are floating a pricing model where the social platform can be paid more as it becomes more vital to AI answers — which, if it materializes, could make current licensing revenue look like a rounding error.

The New York Times (NYSE: NYT) — The Premium Archive Play

NYT is the institutional version of this trade. Less volatility, less upside — but a more defensible moat built on 175 years of authoritative, professionally edited journalism that no language model can replicate through scraping alone. The combination of legal aggression and commercial negotiation is frankly impressive to watch.

Amazon reached a multi-year agreement to pay the New York Times between $20 million and $25 million annually for the right to license the publisher’s content for use in artificial intelligence training and Alexa product responses. Content includes “real-time display of summaries and short excerpts” of news articles and items from NYT Cooking and The Athletic. Meanwhile, News Corp’s OpenAI deal reportedly tops $250 million over five years — establishing a rate card for premium journalism that has reshuffled how publishers think about their back catalogs.

Q2 2025 earnings showed the Times pulling in $70 million in licensing and affiliate revenue. Analyst scenarios that incorporate the Amazon generative AI agreement and other high-margin partnerships project revenue rising to approximately $3.6 billion by 2029, with earnings of $528 million, supported by margin improvement and ongoing buybacks. The same scenarios cite a forward P/E of 33.3x. The dual strategy of suing OpenAI and Microsoft while licensing to Amazon reflects a sophisticated approach to IP monetization. Amazon’s interest underscores how vital trusted, high-quality data has become in differentiating AI products, especially as Amazon lags behind firms like OpenAI and Google in consumer-facing AI adoption.

Equifax (NYSE: EFX) — The Regulated Data Moat

Equifax is the least obvious name in this thesis — and probably the most structurally defensible. Unlike Reddit or NYT, EFX isn’t licensing data outward to AI developers. The moat works differently here: the company uses AI internally on data that nobody else can legally access, generating a product quality advantage that competitors simply cannot buy their way into.

About 90% of Equifax’s revenue is generated from its proprietary data, and only Equifax can access that data with AI. Let that sit. Its Workforce Solutions unit is anchored by The Work Number platform, which held about 209 million active and 813 million total employment records at December 31, 2025. That’s income and employment verification data covering nearly every working American — gathered through payroll integrations that took decades to build and cannot be replicated. This is what a genuine data moat looks like.

The financials reflect it. The company reported revenue of $1,648.9 million in Q1 2026, up 14% on a reported basis compared to Q1 2025. Diluted EPS was $1.42 per share in Q1 2026, up 34% compared to $1.06 per share in Q1 2025. Innovation is accelerating. Equifax gauges product innovation with its Vitality Index — the percentage of revenue derived from new products introduced over the current and prior three years. In Q1, the index hit 17%, outpacing the long-term target of 10%, powered by the Equifax Cloud, EFX.AI, and proprietary data.
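The Vitality Index is a simple ratio, which makes it easy to turn the quoted percentage back into dollars. The sketch below applies the definition given above to the Q1 2026 figures and rechecks the EPS growth rate. The implied new-product revenue is our back-of-envelope derivation from the 17% figure, not a number the company reported.

```python
def vitality_index(new_product_revenue: float, total_revenue: float) -> float:
    """Share of revenue from products launched in the current and prior three years."""
    return new_product_revenue / total_revenue


def yoy_growth(current: float, prior: float) -> float:
    """Year-over-year growth rate."""
    return current / prior - 1.0


# Equifax Q1 2026 figures quoted in the article
REVENUE_M = 1648.9        # USD millions
VITALITY = 0.17           # 17% Vitality Index
EPS_NOW, EPS_PRIOR = 1.42, 1.06

# Derived (not reported): new-product revenue implied by the index
implied_new_product_revenue = VITALITY * REVENUE_M

print(f"Implied new-product revenue: ${implied_new_product_revenue:.1f}M")
print(f"Index check: {vitality_index(implied_new_product_revenue, REVENUE_M):.0%}")
print(f"EPS growth:  {yoy_growth(EPS_NOW, EPS_PRIOR):.0%}")
```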

For full-year 2026, Equifax guided for revenue of $6.728 billion, representing 10.6% growth, with adjusted EPS of $8.50, up 11% year-over-year. The stock trades at a significant discount to the structural value of what it owns — particularly if the market comes to fully appreciate how much pricing power accrues to companies sitting on regulated, non-replicable data archives in an era of Peak Text.



Technical Framework

These three names trade very differently — understanding the technical structure of each matters before sizing any position.

RDDT is the high-beta expression. The stock has retraced over 45% from its September 2025 all-time high near $283, and is now consolidating in a range roughly between $120 support and $175 resistance. Volume on the recent bounce has been constructive. The 200-day moving average is overhead, which typically creates resistance on a first test — watch for price action near the $170–$180 zone on any rally attempt. RSI has rebounded from deeply oversold levels (sub-30) in March, now recovering toward mid-range. The earnings print on April 30, 2026 is the next binary catalyst. A clean beat with maintained licensing guidance could set up a meaningful re-rate.

NYT is the lower-volatility expression — essentially a media company that now has a recurring high-margin licensing revenue stream the market is still learning to value. The stock has been trading in a relatively tight band. The key technical watch is whether the licensing revenue narrative drives a meaningful multiple expansion. At current prices near $81, the risk-reward on a 12–18 month basis appears reasonably balanced, with limited downside given subscription business stability.

EFX is the institutional quality play. The stock’s technical picture is cleaner than the other two — it’s been in a steady recovery trend post-2025 mortgage-market headwinds, with Q1 2026 earnings accelerating that trend. Key support sits near the $165–$170 range established during the rate-driven selloff. The 50-day moving average is beginning to cross above the 100-day, which is a mild constructive signal. VWAP from the Q1 earnings release is a useful near-term reference level for active traders.


Scenario Modeling

Bull Case — The Data Licensing Economy Fully Reprices

Conditions required: AI developers accelerate their push for proprietary data partnerships as synthetic data quality hits its ceiling. Reddit’s Q2 2026 earnings confirm dynamic pricing on licensing contracts — meaning the per-unit value of the data increases with AI model usage. NYT wins its OpenAI/Microsoft lawsuit on at least some claims, forcing the industry to negotiate rather than litigate, creating a wave of licensing deals at escalating valuations. Equifax continues to execute on EFX.AI product launches, pushing Vitality Index above 20% and expanding operating margins. In this scenario, RDDT retests the $230–$270 range, EFX approaches $240, and NYT sees a meaningful re-rate toward $100+. The AI training dataset market — projected to reach $16.32 billion by 2033 at a 22.6% CAGR — begins pulling forward capital into the most defensible data holders.

Base Case — Slow Repricing, Licensing Revenue Compounds Quietly

The most probable path: licensing deals continue to proliferate but pricing negotiations are slow, multi-year processes. Reddit maintains its $130 million annual licensing run rate and grows it modestly. NYT converts its legal leverage into 2–3 additional licensing agreements over the next 18 months, adding $40–$60 million in recurring high-margin revenue. Equifax continues to compound at 10%–14% revenue growth, driven by the Workforce Solutions segment and AI-enabled new product cadence. RDDT oscillates between $140 and $200, EFX grinds toward $210, NYT stays range-bound but with improving quality of earnings. The data moat story builds gradually — this is a 12–24 month setup, not a 3-month trade.

Bear Case — Synthetic Data Solves the Problem, Licensing Premium Collapses

The risk: synthetic data generation technology improves faster than the market expects. New techniques enable AI researchers to make better use of the data they already have and sometimes “overtrain” on the same sources multiple times. If reinforcement learning from AI feedback (RLAIF) matures rapidly, the marginal value of proprietary human-generated text declines. Reddit faces the same structural threat it briefly experienced in Q4 2025: AI models deprioritizing its content. ChatGPT citation of Reddit content fell from a peak of 14% in September 2025 to less than 2% by October 2025, a preview of what a sustained shift could look like. In this scenario, RDDT retests the $120 support zone or lower. NYT’s licensing revenue remains a modest add rather than a transformative one. EFX is most insulated here, given the non-public, regulated nature of its data.
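One way to put the three RDDT scenarios on a common footing is to convert each target into a return from the current price, then blend them. The sketch below collapses each quoted range to its midpoint, which is a simplification. The probability weights are purely hypothetical, chosen only to respect the text’s framing of the base case as the most probable path; swap in your own before drawing any conclusion.

```python
CURRENT_PRICE = 153.0  # RDDT price quoted in the article (USD)

# Scenario targets from the text; ranges are collapsed to their midpoints
SCENARIO_TARGETS = {
    "bull": (230.0 + 270.0) / 2,  # $230-$270 retest
    "base": (140.0 + 200.0) / 2,  # $140-$200 oscillation
    "bear": 120.0,                # $120 support zone
}

# HYPOTHETICAL probabilities (assumption, not from the article)
ASSUMED_WEIGHTS = {"bull": 0.25, "base": 0.50, "bear": 0.25}


def scenario_returns(price: float, targets: dict) -> dict:
    """Return from `price` to each scenario target."""
    return {name: target / price - 1.0 for name, target in targets.items()}


def expected_price(targets: dict, weights: dict) -> float:
    """Probability-weighted target price; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(targets[name] * weights[name] for name in targets)


for name, ret in scenario_returns(CURRENT_PRICE, SCENARIO_TARGETS).items():
    print(f"{name:>4}: target ${SCENARIO_TARGETS[name]:.0f}, return {ret:+.1%}")

blended = expected_price(SCENARIO_TARGETS, ASSUMED_WEIGHTS)
print(f"Weighted price under assumed weights: ${blended:.2f}")
```

The output is only as good as the weights. The value of writing it down is that it forces an explicit view on how likely the bear case is, rather than leaving it implied.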



Active Trader Strategy Framework

A few things worth thinking through before approaching this theme tactically:

  • Position sizing matters more than entry precision here. RDDT carries meaningful event risk (Q1 2026 earnings April 30). For traders who want exposure ahead of the print, consider reduced sizing relative to a post-earnings confirmation setup. The implied move in the options market will tell you what the crowd expects — decide whether you want to own that uncertainty or wait.
  • EFX as the anchor, RDDT as the trade. Structurally, Equifax is the highest-quality expression of this theme with the most defensible data moat. RDDT is where the volatility and upside are, but the licensing thesis there is binary enough that position sizing discipline is critical. A 70/30 EFX/RDDT split within a data moat basket would tilt toward quality while preserving upside optionality.
  • Licensing revenue confirmation is the key metric to watch. For Reddit specifically, the trajectory of its “other revenue” category (data licensing) is more important than advertising for the thesis. Q2 2025 showed 24% year-over-year growth to $35 million — any acceleration there is the signal. Any contraction is the red flag.
  • Monitor the litigation calendar for NYT. In April 2025, a federal judge refused to dismiss most Times claims against OpenAI and Microsoft. A settlement or licensing resolution on those suits could be a meaningful catalyst for the stock — and for how the broader industry prices content deals.
  • Volatility expectations: RDDT implied volatility will likely expand heading into the April 30 earnings. Elevated IV reduces the attractiveness of long call structures near the money. Spreads or defined-risk strategies may be more appropriate for options traders in this window.
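The 70/30 basket idea from the list above is easy to stress-test before committing capital. Here is a minimal sketch that splits a budget by weight and computes the blended return for a given pair of outcomes. The budget and the what-if returns are hypothetical inputs for illustration only, not forecasts.

```python
def allocate(budget: float, weights: dict) -> dict:
    """Split `budget` across tickers in proportion to `weights`."""
    total = sum(weights.values())
    return {ticker: budget * w / total for ticker, w in weights.items()}


def blended_return(weights: dict, returns: dict) -> float:
    """Weighted-average return of the basket for the given per-ticker returns."""
    total = sum(weights.values())
    return sum(weights[t] / total * returns[t] for t in weights)


BASKET_WEIGHTS = {"EFX": 0.70, "RDDT": 0.30}    # the 70/30 split from the text
HYPOTHETICAL_BUDGET = 10_000.0                  # assumption: illustrative account size
WHAT_IF_RETURNS = {"EFX": 0.10, "RDDT": -0.20}  # assumption: one illustrative outcome

for ticker, dollars in allocate(HYPOTHETICAL_BUDGET, BASKET_WEIGHTS).items():
    print(f"{ticker}: ${dollars:,.0f}")

print(f"Blended return: {blended_return(BASKET_WEIGHTS, WHAT_IF_RETURNS):+.1%}")
```

In this particular what-if, a 20% hit to the RDDT sleeve is roughly offset by a 10% gain in the EFX sleeve, which is the sense in which the anchor position buys room to hold the more volatile name.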

The Bigger Picture

Here’s what’s interesting about this moment. The AI bull narrative has been almost entirely about supply — supply of chips, supply of power, supply of compute. The bottleneck was always assumed to be on the infrastructure side. But you can throw unlimited power at a model that’s consuming the same recycled text corpus and eventually you stop getting better outputs. The quality ceiling is a data ceiling.

Access to large, clean, and proprietary datasets is essential for building and improving AI models, making data a strategic asset. The market is just beginning to price that reality. The 2025–2026 period has seen the emergence of AI-specific valuation models that emphasize the value of proprietary technology, recurring data monetization, and the scalability of ML-based products.

The companies that accumulated archives, not because they anticipated this moment but simply because it was their business, are now sitting in an inadvertent position of extraordinary strategic leverage. Reddit didn’t build its platform to become an AI training ground. The New York Times didn’t publish 175 years of journalism to generate licensing fees from large language models. Equifax didn’t assemble more than 800 million employment records as an AI data moat. But here we are.

The transition from Big Compute to Big Context is underway. The question isn’t whether it’s happening. It’s whether your positioning reflects it before the broader market catches up.


For informational and educational purposes only. Not investment advice. Trading involves risk, including loss of principal.
