As part of a larger NLP project focused on analyzing shareholder communications, I needed a clean, consistent corpus of historical letters to shareholders. One obvious and valuable dataset is the collection of Chairman’s Letters from Berkshire Hathaway, which spans nearly five decades.
While the letters are publicly available on Berkshire’s website, assembling them into a single, NLP-ready dataset isn’t quite as simple as it sounds.
The code for this project is available on GitHub.
The Practical Problem
Over time, the way Berkshire publishes its letters has changed:
- Earlier letters are available as HTML pages
- Later letters are published directly as PDFs
- URL patterns and filenames are not consistent across years
For an NLP workflow, this matters. Inconsistent ingestion leads to rework later — especially when you want to:
- re-run analyses
- extend the dataset with new years
- or reuse the corpus across multiple experiments
Rather than relying on ad-hoc scraping or assumptions about URL structure, I wanted a deterministic and reproducible way to assemble the full history of letters.
The Approach
The solution was to treat this as a data ingestion problem, not a scraping problem:
- Explicitly define the source for each year (HTML or PDF)
- Convert HTML letters into PDFs for consistency
- Download canonical PDFs directly where they already exist
- Store everything locally using a single, stable naming convention
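The four steps above can be sketched as a small, idempotent ingestion script. This is a minimal illustration, not the project's actual code: the manifest entries and URL patterns are illustrative examples, and the HTML-to-PDF step assumes an external renderer (wkhtmltopdf here; any equivalent tool would do).

```python
from pathlib import Path
from urllib.request import urlretrieve
import subprocess

# Explicit per-year manifest: year -> (source_type, url).
# Entries and URL patterns are illustrative, not verified for every year.
SOURCES = {
    1977: ("html", "https://www.berkshirehathaway.com/letters/1977.html"),
    2022: ("pdf", "https://www.berkshirehathaway.com/letters/2022ltr.pdf"),
}

OUT_DIR = Path("letters")

def target_path(year: int) -> Path:
    """Single stable naming convention: letters/<year>.pdf for every year."""
    return OUT_DIR / f"{year}.pdf"

def ingest(year: int) -> Path:
    """Fetch one letter and store it under the canonical name."""
    kind, url = SOURCES[year]
    OUT_DIR.mkdir(exist_ok=True)
    dest = target_path(year)
    if dest.exists():  # idempotent: re-running skips completed years
        return dest
    if kind == "pdf":
        # Canonical PDF already exists upstream: download as-is.
        urlretrieve(url, str(dest))
    else:
        # HTML years: render the page to PDF with an external tool
        # (wkhtmltopdf assumed here for illustration).
        subprocess.run(["wkhtmltopdf", url, str(dest)], check=True)
    return dest
```

Because every year ends up at the same deterministic path, re-running the script after adding new manifest entries only fetches what is missing.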
The result is a complete, version-controlled corpus of Berkshire Hathaway Chairman’s Letters — ready for text extraction, tokenization, and downstream NLP analysis.
Why This Is Useful (Even by Itself)
Even outside of the broader NLP project, this dataset is valuable:
- All letters are available in one place
- Formatting is consistent
- The process is repeatable and extensible
- Adding future years requires minimal effort
This makes it easy to build analyses on top — whether that’s sentiment tracking, thematic shifts over time, or benchmarking new shareholder letters against historical language.
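To make "minimal effort" concrete: with an explicit per-year manifest, extending the corpus is a one-entry change, and nothing else in the pipeline moves. The dictionary and URL below are illustrative, not the project's actual manifest.

```python
# Illustrative manifest: year -> (source_type, url).
sources = {
    2022: ("pdf", "https://www.berkshirehathaway.com/letters/2022ltr.pdf"),
}

# Adding a future year is one new entry; download, naming,
# and storage logic are all unchanged.
sources[2023] = ("pdf", "https://www.berkshirehathaway.com/letters/2023ltr.pdf")
```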
What’s Next
This ingestion step is just the foundation. The next phases of the project focus on:
- extracting clean text from the documents
- structuring sections and paragraphs
- and applying NLP techniques to evaluate shareholder communications
I’ll share more as those pieces come together.