Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for structured queries (opensearch only) #815

Merged
merged 3 commits into from
Jul 9, 2024

Conversation

tobiass-sdl
Copy link
Contributor

@tobiass-sdl tobiass-sdl commented Jun 6, 2024

this adds a new function

http:https://localhost:2322/structured?city=berlin

Supported parameters are

"lang", "limit",  "lon", "lat", "osm_tag", "location_bias_scale", "bbox", "debug", "zoom", "layer", "countrycode", "state", "county", "city", "postcode", "district", "housenumber", "street"

Result format is equal to /api.

For those parameters that are also available for /api the meaning is the same.
CountryCode has to be the ISO2 code (e.g. DE instead of "Germany"). The country code must match exactly. Fuzzy matching does not make sense for 2 letter codes. In my experience the case that address is fine except for the country code is rare compared to nonsense hits in other countries.

In order to use structured queries, the new option "-structured" has to be used when importing from Nominatim. Expect 10-20% higher file size. If you run photon on an existing index folder, photon detects automatically if the data was imported with structured query support.
For data imported without "-structured" /structured?city=berlin returns a technical error as the route is not mapped.
Structured queries are only supported for OpenSearch based photon.

Known issues

  • state information is used with low priority. This can cause issues with cities that exist in several states (e.g. "Springfield" in the US). Reason is that states are not normalized - some documents have abbreviations like "NY", other spell "New York" out. To fix this either an alias list is needed or this is normalized on import (maybe even for nominatim).
  • no fine tuning of the scores yet - would require a large test set with reliable expected results and some automated way for guessing "good" scores.
  • /structured?city=some_hamlet_or_isolated_dwelling does not always work

@tobiass-sdl
Copy link
Contributor Author

  • adapt readme / documentation

Copy link
Collaborator

@lonvia lonvia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall structure is fine. I've added some smaller comments regarding style as well as comments regarding index structure.

My main concern is the handling of mismatches or missing address parts in the structured query. The lenient 'should' matching produces a lot of false positive results. This would be rather unexpected.

private static final float StreetBoost = 5.0f; // we filter streets in the wrong city / district / ... so we can use a high boost value
private static final float HouseNumberBoost = 10.0f;
private static final float HouseNumberUnmatchedBoost = 5f;
private static final float FactorForWrongLanguage = 0.1f;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nitpicking: can you make these constant upper-case.

.copyTo("collector.base", "collector." + lang)));
}

mappings.properties("name." + lang,
b -> b.text(p -> p.index(false)
b -> b.text(p -> p.index(supportStructuredQueries)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need this index? Names really should be queried only through the .ngrams or .raw indexes.


public boolean hasStreet() { return this.street != null || this.houseNumber != null; }

public boolean hasHouseNumber() { return this.houseNumber != null; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to extend all these to check for the empty string. (Or do the check in the set functions and set to null when the input string is empty.)

@@ -183,6 +184,16 @@ public DatabaseProperties recreateIndex(String[] languages, Date importDate) thr
return dbProperties;
}

private void createAndPutIndexMapping(String[] languages, boolean supportStructuredQueries)
{
if(supportStructuredQueries) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nitpicking: please leave spaces between if and the bracket. (Also in other parts of the code.)

}
}

private void AddQuery(Query clause) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style-nitpicking: functions should start with a lower-case letter.

.boost(boost)
.build()
.toQuery());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating a complex query over all languages here, you could also work with a collector field for each address part. Define a field 'state_collector' which is indexed but not stored. Then instead of setting the index for 'state.', you copy to 'state_collector'.

You'd loose the boost factor for wrong language. In my experiments it made little enough difference in accuracy for quite a bit of performance boost.

var query4QueryBuilder = new AddressQueryBuilder(lenient, language, languages)
.addCountryCode(request.getCountryCode())
.addState(request.getState(), request.hasCounty() || request.hasCityOrPostcode())
.addCounty(request.getCounty(), request.hasCityOrPostcode())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those two are missing hasStreet() in the 'hasMoreDetails' field. It might seem that it doesn't make much sense to have 'street' and 'county' in the query but no 'city' parameter, but that has never stopped users from trying anyway.

.build()
.toQuery())
.boost(boost);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the logic here. I would expect that you'd always search in the address field when hasMoreDetails is true and in the name otherwise. Right now it looks like you can also return results where all searched fields are in the address only. That seems rather confusing. For example a query with /structured?city=vaduz returns the city of Vaduz as expected but then also a lot of address results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea was that you still find the city if the street/housenumber is wrong (or not yet in openstreetmap) but I clearly messed something up. Returning houses or streets when no street is given is obviously a bad idea and I thought I handled that case (and failed) -> have to investigate.

.boost(HouseNumberBoost)
.build()
.toQuery());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: bad indent

@@ -32,12 +36,12 @@ public IndexMapping addLanguages(String[] languages) {
for (var field: ADDRESS_FIELDS) {
mappings.properties(String.format("%s.%s", field, lang),
b -> b.text(p -> p
.index(false)
.index(shouldIndexAddressField(field))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the new indexes simply use the default analyser. I wonder if 'analyser("index_raw")' and searchAnalyser("search") wouldn't be better here. At a minimum, it gives you German Umlaut-Folding (ö vs. oe).

You can set analysers unconditionally. They simply have no effect when the index is disabled.

@lonvia
Copy link
Collaborator

lonvia commented Jun 11, 2024

Sorry for creating the conflict. I found #816 while reviewing this PR. Please just rebase on master at some point. I don't mind force pushes to PR branches.

results = sendQuery(buildQuery(photonRequest, true).buildQuery(), extLimit);

if (results.hits().total().value() == 0 && photonRequest.hasStreet()) {
var street = photonRequest.getStreet();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this turned out to be easier than having an unmaintainable complex query.
Of course this could also be done on the client side...

// match only name
if (hasPostCode) {
// post code can implicitly specify a district that has the city in the address field (instead of the name)
combinedQuery = QueryBuilders.bool()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. /structured?state=Baden Württemberg&district=Plotzsägmühle&housenumber=2

Copy link
Collaborator

@lonvia lonvia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query works much better now.

You don't allow typos in the second lenient try. Is this intentional? I would have expected auto fuzziness to be enabled for it.

/structured/?city=vaduz&street=lettstrasse still returns me the bus stop "Lettstrasse". We might need a layer filter here.

The other thing that stands out is that the usually abbreviations don't work ('strasse' -> 'str'). The freetext search mitigates that somewhat by always doing a prefix search so that some abbreviations just accidentally work (doesn't help with 'road' -> 'rd', though). We don't have this here. I would defer this topic to later. Just wanted to mention it, in case somebody tries this and doesn't get results.

On the nitpicking side: there are still some function names that start with an upper-case letter.

}

mappings.properties(propertyName,
b -> b.text(p -> p.copyTo(collectors)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the 'index(false)' is gone. I'm not entirely sure what the default is, so prefer explicit statements.

mappings.properties(field + ".default", b -> b.text(p -> p
.index(false)
.copyTo("collector.default", "collector.base")));
.index(shouldIndexAddressField(field))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need an index here?

@tobiass-sdl
Copy link
Contributor Author

Luckily I got rid of the bus stops without an overly complex layer filter. There was already a filter to filter out results with house numbers if the request does not contain one. Bus stops and a few other locations passed that filter because they have no house number. Therefore fixing the filter from "no house number" to "no house number and not a house" should be sufficient.
I totally agree with ignoring abbrevations for know - that's similar to the issue with state="NY" vs state="New York".
No fuzziness for lenient=false was a bug.

@lonvia
Copy link
Collaborator

lonvia commented Jun 26, 2024

Therefore fixing the filter from "no house number" to "no house number and not a house" should be sufficient.

You may even want to put a global filter on the query "when type is house then it must have a house number". And also add a global filter for "type != other". This should consistently filter out all object that are not "address-like".

@lonvia lonvia merged commit be99d7d into komoot:master Jul 9, 2024
4 checks passed
@lonvia
Copy link
Collaborator

lonvia commented Jul 9, 2024

Looks good as a first version. We can fine-tune in follow-up PRs, if necessary.

@lonvia
Copy link
Collaborator

lonvia commented Jul 9, 2024

Oh dear, now I forgot about documentation. Do you mind adding a little bit of documentation for the new call in a separate PR?

I would suggest to do this in an extra file docs/structured.md and in the README.md just link to it (with the warning that it is optional). If the call is described directly in the main README, we will end up with repeated questions why it doesn't work on photon.komoot.io.

@tobiass-sdl
Copy link
Contributor Author

Yes, I can add some documentation. I'm off next week, not sure whether I get to it before the 20th.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants