Mailing List Archive: Remove similar (geographic/named) results

Hi *,

I'm indexing public transport stops in a Lucene 9.7.0 index in order to get a fuzzy stop search for start/end points for a trip planner.

The documents look like this:

[
{
"id": "MARTA:485",
"name": "ASHBY STATION",
"coordinate": {
"lat": 33.756478,
"lon": -84.41723
}
},
{
"id": "MARTA:486",
"name": "ASHBY STATION",
"coordinate": {
"lat": 33.756477,
"lon": -84.417328
}
},
{
"id": "MARTA:79496",
"name": "ASHBY STATION - SOUTHBOUND",
"coordinate": {
"lat": 33.756281,
"lon": -84.417724
}
},
{
"id": "MARTA:79028",
"name": "ASHBY STATION - NORTHBOUND",
"coordinate": {
"lat": 33.756066,
"lon": -84.417371
}
}
]

When I execute a term query for "ashby", all of the above documents are returned. I would like advice on how to de-duplicate the results: at the very least, the first two results, which have identical names and are very close to each other geographically, should be aggregated. Ideally there would also be a fuzzy combination of the third and fourth results based on name similarity and geographic closeness, but that is a secondary concern.
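
To make the first requirement concrete, here is a rough sketch of the kind of post-processing I have in mind, written in plain Java with no Lucene types. Everything in it is made up for illustration: the Stop holder, the dedupe and key helpers, and the distance threshold are my assumptions, not anything Lucene provides. It walks the ranked hits in order and drops a hit when an already-kept hit has the same (case-insensitive) name and lies within a given distance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Post-search de-duplication sketch; operates on values read back from the hits. */
final class StopDedup {

  /** Hypothetical holder for the stored fields of one search hit. */
  static final class Stop {
    final String id;
    final String name;
    final double lat;
    final double lon;

    Stop(String id, String name, double lat, double lon) {
      this.id = id;
      this.name = name;
      this.lat = lat;
      this.lon = lon;
    }
  }

  /** Great-circle distance in meters (haversine formula on a spherical Earth). */
  static double distanceMeters(Stop a, Stop b) {
    double earthRadius = 6_371_000.0;
    double dLat = Math.toRadians(b.lat - a.lat);
    double dLon = Math.toRadians(b.lon - a.lon);
    double h = Math.pow(Math.sin(dLat / 2), 2)
        + Math.cos(Math.toRadians(a.lat)) * Math.cos(Math.toRadians(b.lat))
            * Math.pow(Math.sin(dLon / 2), 2);
    return 2 * earthRadius * Math.asin(Math.sqrt(h));
  }

  /** Grouping key: trimmed, lower-cased name. A fuzzier key could go here. */
  static String key(String name) {
    return name.trim().toLowerCase(Locale.ROOT);
  }

  /**
   * Keeps the first (best-ranked) hit of each group of hits that share a key
   * and lie within maxMeters of an already-kept hit. Quadratic in the number
   * of hits, which should be fine for a top-N result page.
   */
  static List<Stop> dedupe(List<Stop> rankedHits, double maxMeters) {
    List<Stop> kept = new ArrayList<>();
    for (Stop hit : rankedHits) {
      boolean duplicate = false;
      for (Stop k : kept) {
        if (key(k.name).equals(key(hit.name)) && distanceMeters(k, hit) <= maxMeters) {
          duplicate = true;
          break;
        }
      }
      if (!duplicate) {
        kept.add(hit);
      }
    }
    return kept;
  }
}
```

With a threshold of, say, 50 meters, the two "ASHBY STATION" documents above (roughly 9 meters apart) would collapse into the first one, while the northbound and southbound entries would stay separate because their names differ. The fuzzy case might be approximated by having key() strip a direction suffix, but I have not thought that through.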

I've tried to read up on collectors such as DiversifiedTopDocsCollector and on aggregation, but I'm having a hard time figuring out which approach is best and how it would slot into my current search code.

Can anyone give advice?

Many thanks.
--
Leonard Ehrenfried
mail@leonard.io - https://leonard.io