As part of a larger NLP project focused on analyzing shareholder communications, I needed a clean, consistent corpus of historical letters to shareholders. One obvious and valuable dataset is the collection of Chairman’s Letters from Berkshire Hathaway, which spans nearly five decades.
While the letters are publicly available on Berkshire’s website, assembling them into a single, NLP-ready dataset isn’t quite as simple as it sounds.
The code for this project is available on GitHub.
The Practical Problem
Over time, the way Berkshire publishes its letters has changed:
- Earlier letters are available as HTML pages
- Later letters are published directly as PDFs
- URL patterns and filenames are not consistent across years
For an NLP workflow, this matters. Inconsistent ingestion leads to rework later — especially when you want to:
- re-run analyses
- extend the dataset with new years
- or reuse the corpus across multiple experiments
Rather than relying on ad-hoc scraping or assumptions about URL structure, I wanted a deterministic and reproducible way to assemble the full history of letters.
The Approach
The solution was to treat this as a data ingestion problem, not a scraping problem:
- Explicitly define the source for each year (HTML or PDF)
- Convert HTML letters into PDFs for consistency
- Download canonical PDFs directly where they already exist
- Store everything locally using a single, stable naming convention
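The four steps above can be sketched as a small, idempotent ingestion script. This is a minimal illustration, not the project's actual code: the manifest entries and URL patterns are illustrative examples, and the HTML-to-PDF step assumes an external renderer (wkhtmltopdf here; any equivalent tool would do).

```python
from pathlib import Path
from urllib.request import urlretrieve
import subprocess

# Explicit per-year manifest: year -> (source_type, url).
# Entries and URL patterns are illustrative, not verified for every year.
SOURCES = {
    1977: ("html", "https://www.berkshirehathaway.com/letters/1977.html"),
    2022: ("pdf", "https://www.berkshirehathaway.com/letters/2022ltr.pdf"),
}

OUT_DIR = Path("letters")

def target_path(year: int) -> Path:
    """Single stable naming convention: letters/<year>.pdf for every year."""
    return OUT_DIR / f"{year}.pdf"

def ingest(year: int) -> Path:
    """Fetch one letter and store it under the canonical name."""
    kind, url = SOURCES[year]
    OUT_DIR.mkdir(exist_ok=True)
    dest = target_path(year)
    if dest.exists():  # idempotent: re-running skips completed years
        return dest
    if kind == "pdf":
        # Canonical PDF already exists upstream: download as-is.
        urlretrieve(url, str(dest))
    else:
        # HTML years: render the page to PDF with an external tool
        # (wkhtmltopdf assumed here for illustration).
        subprocess.run(["wkhtmltopdf", url, str(dest)], check=True)
    return dest
```

Because every year ends up at the same deterministic path, re-running the script after adding new manifest entries only fetches what is missing.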
The result is a complete, version-controlled corpus of Berkshire Hathaway Chairman’s Letters — ready for text extraction, tokenization, and downstream NLP analysis.
Why This Is Useful (Even by Itself)
Even outside of the broader NLP project, this dataset is valuable:
- All letters are available in one place
- Formatting is consistent
- The process is repeatable and extensible
- Adding future years requires minimal effort
This makes it easy to build analyses on top — whether that’s sentiment tracking, thematic shifts over time, or benchmarking new shareholder letters against historical language.
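To make "minimal effort" concrete: with an explicit per-year manifest, extending the corpus is a one-entry change, and nothing else in the pipeline moves. The dictionary and URL below are illustrative, not the project's actual manifest.

```python
# Illustrative manifest: year -> (source_type, url).
sources = {
    2022: ("pdf", "https://www.berkshirehathaway.com/letters/2022ltr.pdf"),
}

# Adding a future year is one new entry; download, naming,
# and storage logic are all unchanged.
sources[2023] = ("pdf", "https://www.berkshirehathaway.com/letters/2023ltr.pdf")
```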
What’s Next
This ingestion step is just the foundation. The next phases of the project focus on:
- extracting clean text from the documents
- structuring sections and paragraphs
- and applying NLP techniques to evaluate shareholder communications
I’ll share more as those pieces come together.