As part of a larger NLP project focused on analyzing shareholder communications, I needed a clean, consistent corpus of historical letters to shareholders. One obvious and valuable dataset is the collection of Chairman’s Letters from Berkshire Hathaway, which spans nearly five decades. While the letters are publicly available on Berkshire’s website, assembling them into a single, NLP-ready dataset isn’t quite as simple as it sounds. Here’s a link...