Marginalia Search

Information relating to the Marginalia Search project.

Marginalia Search is an independent DIY search engine that focuses on non-commercial content, and attempts to show you sites you perhaps weren’t aware of rather than the sort of sites you probably already knew existed.

URL:  🌎 https://search.marginalia.nu/
Git:  🌎 https://git.marginalia.nu/

You may also be interested in the 🏷️ search-engine tag.

Documents

Name                          Date
📁 ../                        2025-01-18
📄 FAQ                        2023-03-28
📄 API                        2023-03-23
📄 About Marginalia Search    2022-12-23
📄 For Webmasters             2022-10-28
📄 Privacy Considerations     2022-09-22
📄 Donate To This Project     2022-09-05

Recent Posts in 🏷️ search-engine

2024-12-26 RSS Feeds and Real Time Crawling

A while back an update went live that, with some caveats, changes the time it takes for an update on a website to be reflected in the search engine index from up to 2 months to 1-2 days, provided the website has an RSS or Atom feed. The big crawl job takes about two months, and is run partition by partition, meaning there’s typically a slice of the index that is two months stale at any given point in time.

2024-11-05 Notes on binary soup

I recently put together a small library called Slop, for intermediate on-disk data representation for the search engine, replacing a few ad-hoc formats I had in place before. This post isn’t so much an attempt to convince anyone else to use this library, as it makes trade-offs catering to a fairly niche use case, but to explore some of its design ideas, as it all came together very nicely, in the hopes that other libraries can draw ideas from it.

2024-09-30 Phrase Matching in Marginalia Search

Marginalia Search now properly supports phrase matching. This not only permits a more robust implementation of quoted search queries, but also helps promote results where the search terms occur in the document in exactly the same order as they do in the query. This is a write-up about implementing this change. This is going to be a relatively long post, as it represents about 4 months of work. I’m also happy and grateful to announce that the NLnet people reached out after the run of the grant was over and asked me if I had more work in the pipe, and agreed to fund this change as well!
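
As a rough illustration of what matching terms "in the same order" can mean, here is a minimal sketch that assumes each query term comes with a sorted array of the word positions where it occurs in a document. The class and method names are hypothetical, and this is a textbook positional check, not the engine's actual implementation.

```java
import java.util.Arrays;

// Hypothetical sketch: a generic positional phrase check, not Marginalia's code.
final class PhraseMatchSketch {

    /**
     * positions[k] holds the sorted word positions of the k-th query term
     * within one document. The phrase matches if some start position p exists
     * such that term k occurs at position p + k for every k.
     */
    static boolean containsPhrase(int[][] positions) {
        if (positions.length == 0) {
            return false;
        }
        for (int start : positions[0]) {
            boolean allFollow = true;
            for (int k = 1; k < positions.length && allFollow; k++) {
                // binarySearch requires the array to be sorted (assumed above)
                allFollow = Arrays.binarySearch(positions[k], start + k) >= 0;
            }
            if (allFollow) {
                return true;
            }
        }
        return false;
    }
}
```

For a document reading "the quick brown fox the brown quick", the term positions are the = {0, 4}, quick = {1, 6}, brown = {2, 5}. The phrase "quick brown" matches at position 1, while "quick the" does not match anywhere.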

2024-06-18 One year of solo dev, wrapping up the grant-funded work

A year ago I walked out of the office for the last time. I handed in my corpo laptop, said some good-byes, and since then I have been my own boss. This first year has been funded by an NLnet grant, which I’m in the midst of wrapping up. As of now, the work is all done, the final request for payment has been sent. There’s a similar last-day-of-school levity to both these events.

2024-05-16 Experiment in Java native calls

I’ve experimentally replaced some of the Java implementations of quicksort and binary search with calls to C++ code, and saw huge benefits for the sorting code but the same or worse performance for binary search. The Marginalia Search engine is mainly written in Java, which is a language that is good at many things, but not particularly pleasant to work with when it comes to low level systems programming. Unfortunately, a part of building an internet search engine involves database-adjacent low level programming.

2024-04-17 Query Parsing and Understanding

Been working on improving Marginalia Search query parsing and understanding. This is going to be a pretty long update, as it’s a few months’ work. Apart from cleaning up the somewhat messy query parsing code, a problem I’m trying to address is that the search engine is currently only good at dealing with fairly focused queries. They don’t need to be short, but if you try to qualify a search that is too broad by adding more terms, it often doesn’t produce anything useful.

2024-04-10 Deep Bug

The project has been haunted by a mysterious bug since sometime in February. It relates to the code that constructs the index, particularly the code that merges partial indices. In short, the search engine constructs the reverse index through successive merging of smaller indices, which reduces the overall memory requirement. You can conceptualize the reverse index itself as two files, one with offset pointers into another file, which has sorted numbers. This code runs after each partition finishes crawling and processing its data, and has a run time of about 4 hours.
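
To make the two-file picture concrete, here is a minimal sketch that uses two in-memory arrays as stand-ins for the files: `offsets` points into `docIds`, which holds one sorted run of numbers per term. The class name, the dense integer term ids, and the assumption that both partial indices cover the same set of terms are all hypothetical simplifications. The real code works on disk and is certainly more involved.

```java
import java.util.Arrays;

// Hypothetical sketch of the "offsets file + sorted numbers file" layout.
final class PartialIndexSketch {
    final int[] offsets;  // offsets[t] .. offsets[t + 1] delimits the run for term t
    final long[] docIds;  // concatenated runs, each sorted ascending

    PartialIndexSketch(int[] offsets, long[] docIds) {
        this.offsets = offsets;
        this.docIds = docIds;
    }

    /** Returns the sorted document ids recorded for the given term id. */
    long[] lookup(int term) {
        return Arrays.copyOfRange(docIds, offsets[term], offsets[term + 1]);
    }

    /** Merges two partial indices over the same term ids (an assumption). */
    static PartialIndexSketch merge(PartialIndexSketch a, PartialIndexSketch b) {
        int terms = a.offsets.length - 1;
        int[] offsets = new int[terms + 1];
        long[] merged = new long[a.docIds.length + b.docIds.length];
        int out = 0;
        for (int t = 0; t < terms; t++) {
            offsets[t] = out;
            int i = a.offsets[t], iEnd = a.offsets[t + 1];
            int j = b.offsets[t], jEnd = b.offsets[t + 1];
            // Classic two-way merge of two sorted runs
            while (i < iEnd || j < jEnd) {
                long next;
                if (j >= jEnd || (i < iEnd && a.docIds[i] <= b.docIds[j])) {
                    next = a.docIds[i++];
                } else {
                    next = b.docIds[j++];
                }
                // Skip ids already written to this term's run
                if (out == offsets[t] || merged[out - 1] != next) {
                    merged[out++] = next;
                }
            }
        }
        offsets[terms] = out;
        return new PartialIndexSketch(offsets, Arrays.copyOf(merged, out));
    }
}
```

Because each run is already sorted, the merge only reads both inputs front to back, which is what lets this kind of construction stream through data rather than hold everything in memory at once.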

2024-02-28 The Yak Shave

I set out a little over a week ago to add a service registry to Marginalia Search, primarily to reduce its dependence on docker. I would like it to be able to run on bare metal as well, which poses a problem since configuring the application manually is a bit of a headache with dozens of ports that need to be set up. It would also be desirable to be able to run multiple instances of important services in order to eliminate downtime during upgrades.

2024-02-25 Marginalia: 3 Years

It’s been three years since the inception of Marginalia Search, then a dinky experiment to find where the heck the cool Internet has gone, now my full time job. While there are always things that can be improved, it’s fair to say the search engine has never worked as well as it does right now. A great number of milestones have been reached, perhaps the biggest of all being that the search engine has moved out of my living room and into a proper enterprise server.

2024-02-07 Best SEO spam 2024 reddit

One of the great joys of working on a search engine is that you get to reverse engineer SEO spam, and overall study how it evolves over time. I’ve been noticing the search engine spam strategy of adding ‘reddit’ to page titles for a few years now, but it feels like it’s been growing a lot recently. I don’t think it’s actually working, but it’s so cute that they are trying.

2023-12-22 A Frivolous Feature

Marginalia Search very recently gained the ability to filter results by Autonomous System, not only searching by ASN but by the organization information for that AS. At a glance this seems like a somewhat frivolous feature, but it has interesting effects. Autonomous Systems are part of the Internet’s routing infrastructure. If your mental model of an IP number is that they are the phone number of the computer, this is something akin to a postal code.

2023-12-20 WARC'in the crawler

The Marginalia Crawler has seen improvements! A long term problem with the crawler design is that if for whatever reason the crawler shuts down, it has to restart from zero the fetching of whatever domains it was traversing when it terminated. This isn’t fantastic, since not only does crawling a website take a fair bit of time, it’s a nuisance for the server admins to re-crawl stuff that was already fetched, and a real liability for ending up in robots.

2023-11-07 Anchor Tags

I’ve been working on getting anchor tag keywords into the search engine, basically using link texts to complement the keywords on a webpage. The problem I’m attempting to address is that many websites don’t really describe themselves particularly well. As Steve Ballmer’s stage performance once illustrated, merely repeating a word doesn’t on its own make what you’re saying relevant to the term. Another good example of how it falls short is PuTTY’s website, which will be used as a pilot case to improve.