A very brief note to announce reaching a long-term goal and major milestone for marginalia search.
The search engine now indexes 106,857,244 documents!
The previous record was a bit south of seventy million. A hundred million has been a pie-in-the-sky goal for a very long time. It seemed borderline impossible to index that many documents on a PC. Turns out it's not. It's more than possible.
Twice this may even be technically doable, but it's well past the point where sheer logistics becomes painful. It's already a real headache to deal with this much data.
- The crawl takes two weeks.
- Processing the crawl data to extract keywords and features takes several days.
- Loading the processed data into the database takes another day.
- Constructing the index takes another day.
A hundred million is probably more than good enough.
The focus should instead be on improving the quality of what is indexed: making it better, faster, more relevant. Sadly, that area doesn't offer vanity goals as neat as hitting 100,000,000.