I recently finished the first version of a philosophy-oriented NLP research pipeline focused on shareholder letters.
The idea behind philo_nlp is simple: long-form corporate communication often reflects an underlying philosophy. In this first implementation, the reference philosophy is the one expressed in Warren Buffett's Berkshire Hathaway shareholder letters.
The project is split into three repositories.
Berkshire_Letters builds the Buffett reference corpus by downloading and normalizing Berkshire shareholder letters into a reusable NLP dataset.
shareholder_letters_downloader extends the same process to other companies. It downloads shareholder letters, extracts text, performs quality checks, and generates structured sentiment and keyword features.
philo_nlp consumes those structured datasets and performs the actual similarity scoring and ranking.
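To make the "quality checks" and "keyword features" steps concrete, here is a minimal sketch of what they could look like. It is an illustration under stated assumptions, not the code in shareholder_letters_downloader: the keyword list, the 200-word threshold, and the function names passes_quality_check and keyword_features are all hypothetical.

```python
import re

# Hypothetical keyword list, chosen only for illustration.
KEYWORDS = ("intrinsic value", "float", "moat", "long-term", "owner")


def passes_quality_check(text: str, min_words: int = 200) -> bool:
    """Reject extractions too short to be a full letter (assumed threshold)."""
    return len(text.split()) >= min_words


def keyword_features(text: str) -> dict[str, float]:
    """Return whole-word keyword occurrences per 1,000 words of text."""
    lowered = text.lower()
    total_words = max(len(lowered.split()), 1)  # avoid division by zero
    features = {}
    for keyword in KEYWORDS:
        # Word boundaries keep "float" from matching "floating".
        pattern = r"\b" + re.escape(keyword) + r"\b"
        hits = len(re.findall(pattern, lowered))
        features[keyword] = 1000 * hits / total_words
    return features
```

Normalizing by letter length keeps a long letter from scoring higher simply because it contains more words.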
The workflow looks like this:
Berkshire reference corpus
↓
Shareholder letter pipeline
↓
Structured NLP datasets
↓
Buffett-style scoring and ranking
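The last stage of the workflow above can be sketched as a similarity comparison between each company's feature vector and the reference profile. The example below assumes cosine similarity over keyword-style features; the post does not state which similarity measure philo_nlp actually uses, and the names cosine_similarity and rank_against_reference are hypothetical.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as having no similarity
    return dot / (norm_a * norm_b)


def rank_against_reference(
    reference: dict[str, float],
    candidates: dict[str, dict[str, float]],
) -> list[tuple[str, float]]:
    """Score every candidate against the reference, highest score first."""
    scores = {
        name: cosine_similarity(reference, features)
        for name, features in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder inputs for illustration only; these are not pipeline results.
reference_profile = {"moat": 2.0, "float": 1.0}
candidate_profiles = {
    "company_a": {"moat": 4.0, "float": 2.0},
    "company_b": {"growth": 3.0},
}
ranking = rank_against_reference(reference_profile, candidate_profiles)
```

Cosine similarity compares the direction of two vectors rather than their magnitude, so a company whose letter uses the same vocabulary in the same proportions scores highly even if the letter is shorter.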
For the first end-to-end run, I tested:
- Berkshire Hathaway
- Markel
- Brookfield
- Amazon
- Danaher
- Costco
- Apple
- Meta
- Alphabet
The pipeline successfully produced a ranked Buffett-style screening output. Markel and Brookfield ranked highly, which is directionally intuitive for a first-pass semantic comparison against the Berkshire reference profile.

The ranking itself is not the most important result. What matters more is that the entire workflow now runs end to end:
download letters
→ extract text
→ build features
→ score similarity
→ rank companies
This is not investment advice or a production investment model. It is research infrastructure designed to make shareholder-letter analysis reproducible and extensible.
Future work may include multi-year company histories, richer Buffett reference profiles, explainability, and broader universe coverage.
Repository links: