Add libpostal address detection #23

Open
huu4ontocord opened this issue Feb 22, 2022 · 5 comments

Comments

@huu4ontocord
Owner

huu4ontocord commented Feb 22, 2022

Add a regex for basic candidate addresses, such as a \d+ followed by \s+, then \w{5,30}, a comma, and another \d+. Then check that there are no stopwords within the \w part, and feed the whole match to libpostal to check whether it is an address. Libpostal will label the components as house, road, etc. We need to check that there is a road or a similar component; "house" on its own doesn't really tell us anything, as that label is almost always assigned.
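A minimal sketch of the steps above, under a few assumptions: the name part is allowed to contain spaces (plain \w would not match multi-word street names), the stopword list is a tiny hypothetical stand-in, and "road" is the only label treated as evidence. The parser is passed in as a callable so the flow can be tried without libpostal installed; the default wraps `parse_address` from the `postal` Python bindings, which returns (value, label) pairs.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Candidate pattern from the description: digits, whitespace,
# 5-30 name characters, a comma, then more digits.
# Assumption: spaces are allowed inside the name part.
CANDIDATE_RE = re.compile(r"\b\d+\s+([\w ]{5,30}),\s*\d+\b")

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "and", "of", "is", "was", "were", "in"}

# Assumption: which parser labels count as evidence of an address.
# "house" is left out on purpose, as discussed above.
EVIDENCE_LABELS = {"road"}

Parser = Callable[[str], Iterable[Tuple[str, str]]]


def libpostal_parse(text: str) -> Iterable[Tuple[str, str]]:
    """Default parser: libpostal via the `postal` bindings (imported lazily)."""
    from postal.parser import parse_address
    return parse_address(text)


def find_addresses(text: str, parse: Parser = libpostal_parse) -> List[str]:
    """Return regex candidates that pass the stopword and parser checks."""
    found = []
    for match in CANDIDATE_RE.finditer(text):
        words = match.group(1).lower().split()
        if any(word in STOPWORDS for word in words):
            continue  # ordinary prose rather than a street name
        labels = {label for _value, label in parse(match.group(0))}
        if labels & EVIDENCE_LABELS:
            found.append(match.group(0))
    return found
```

Calling `find_addresses(text)` with no second argument uses libpostal; passing a stub parser makes the stopword and label checks easy to unit test.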

@huu4ontocord
Owner Author

This test should run after the stdnum and date tests. That way, we know the match is not a stdnum and not a date.
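A rough sketch of that ordering, assuming each test returns (start, end, label) spans. The three detectors here are hypothetical regex stand-ins, not the project's real stdnum or date tests; the point is only that a later detector cannot claim text an earlier one already took.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)
Detector = Callable[[str], List[Span]]


def _spans(pattern: str, label: str, text: str) -> List[Span]:
    return [(m.start(), m.end(), label) for m in re.finditer(pattern, text)]


# Hypothetical stand-ins for the real tests.
def detect_stdnum(text: str) -> List[Span]:
    return _spans(r"\b\d{3}-\d{10}\b", "STDNUM", text)


def detect_date(text: str) -> List[Span]:
    return _spans(r"\b\d{1,2} [A-Z][a-z]+, \d{4}\b", "DATE", text)


def detect_address(text: str) -> List[Span]:
    return _spans(r"\b\d+\s+[\w ]{5,30},\s*\d+\b", "ADDRESS", text)


def overlaps(a: Span, b: Span) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def run_in_order(text: str, detectors: List[Detector]) -> List[Span]:
    """Run detectors in priority order; earlier detectors win overlaps."""
    accepted: List[Span] = []
    for detect in detectors:
        for span in detect(text):
            if not any(overlaps(span, prev) for prev in accepted):
                accepted.append(span)
    return sorted(accepted)


text = "Signed on 22 February, 2022 at 10 Boulevard Rd, 38718."
for start, end, label in run_in_order(
    text, [detect_stdnum, detect_date, detect_address]
):
    print(label, text[start:end])
```

On its own, the address stand-in also matches "22 February, 2022", which is exactly the kind of false positive that running the date test first removes.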

@huu4ontocord
Owner Author

@paulovn

@paulovn
Collaborator

paulovn commented Feb 26, 2022

The idea looks fine to me, and using libpostal while excluding house seems a sane approach.

The exception is the base regex, which in reality is not fully language-independent. The one currently proposed is:

  • number + space + name + comma + number

That may work for US English, though it would miss addresses that include the state (e.g. 10 Boulevard Rd Los Angeles, CA 38718). In British addresses, however, the postcode mixes digits AND letters: 71 Cherry Court Southampton SO53 5PD

Other languages can diverge even more. These are a few example sequences I came up with:
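To make the two failure cases concrete, here is a small check. The patterns are my own reading of the proposal (with spaces allowed in the name part), plus two illustrative fixes: an optional two-letter state, and a simplified UK postcode shape that is an assumption and does not cover every valid postcode.

```python
import re

# The proposed sequence: number + space + name + comma + number.
BASE_RE = re.compile(r"\d+\s+[\w ]{5,30},\s*\d+")

# Illustrative variant: allow a two-letter state before the final number.
US_RE = re.compile(r"\d+\s+[\w ]{5,30},\s*(?:[A-Z]{2}\s+)?\d+")

# Simplified UK postcode shape (assumption, not the full official format).
UK_POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

us = "10 Boulevard Rd Los Angeles, CA 38718"
uk = "71 Cherry Court Southampton SO53 5PD"

print(BASE_RE.search(us))                  # None: "CA" sits between comma and number
print(BASE_RE.search(uk))                  # None: no comma, alphanumeric postcode
print(US_RE.search(us).group(0))           # the whole US example
print(UK_POSTCODE_RE.search(uk).group(0))  # SO53 5PD
```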

  • UK: number + s + name + s + street-designator + s + postcode + s + city
  • Spain: street-designator + s + name + s + comma + s + number + [comma + name/number] + s + postcode-number + s + city
  • Germany: name + street-designator + s + number + s + postcode-number + s + city (a special feature of German addresses is the lack of a space between the street name and the street designator, e.g. Hauptstraße)
  • France: number + s + street-designator + s + name + s + postcode-number + s + city

Other countries might fit these patterns; for instance, Portugal is quite similar to Spain (but Brazil puts the postcode after the city).
The point is that we would need a library of regexes by language/country. I would rather make them liberal (e.g. not require a comma) so they fit more variations.
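As a rough sketch of what such a library could look like: one liberal pattern per country code, built from shared pieces. Everything here is illustrative. The designator lists are tiny and hypothetical, the separator accepts any mix of spaces and commas, and the GB pattern accepts the city on either side of the postcode. Matches would only be candidates to hand to libpostal, not confirmed addresses.

```python
import re
from typing import Dict, List

SEP = r"[ ,]+"                                # liberal: spaces and/or commas
NUM = r"\d+[a-zA-Z]?"                         # house number, optional letter
NAME = r"[^\W\d_]+(?:[ '-][^\W\d_]+){0,3}"    # one to four words of letters
POSTCODE_NUM = r"\d{4,5}"
POSTCODE_GB = r"[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}"  # simplified shape

# Hypothetical starter library, keyed by country code.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "GB": re.compile(
        rf"{NUM}{SEP}{NAME}{SEP}(?:road|street|court|lane|avenue)"
        rf"(?:{SEP}{NAME})?{SEP}{POSTCODE_GB}(?:{SEP}{NAME})?",
        re.IGNORECASE,
    ),
    "ES": re.compile(
        rf"(?:calle|avenida|plaza|paseo){SEP}{NAME}{SEP}{NUM}"
        rf"(?:{SEP}\w+)?{SEP}{POSTCODE_NUM}{SEP}{NAME}",
        re.IGNORECASE,
    ),
    "DE": re.compile(
        # No separator between street name and designator (Hauptstraße).
        rf"[^\W\d_]+(?:straße|strasse|weg|platz|gasse){SEP}{NUM}"
        rf"{SEP}{POSTCODE_NUM}{SEP}{NAME}",
        re.IGNORECASE,
    ),
    "FR": re.compile(
        rf"{NUM}{SEP}(?:rue|avenue|boulevard|place){SEP}{NAME}"
        rf"{SEP}{POSTCODE_NUM}{SEP}{NAME}",
        re.IGNORECASE,
    ),
}


def find_candidates(text: str, country: str) -> List[str]:
    """Return candidate address strings for one country ([] if unknown)."""
    pattern = PATTERNS.get(country)
    if pattern is None:
        return []
    return [m.group(0) for m in pattern.finditer(text)]


print(find_candidates("71 Cherry Court Southampton SO53 5PD", "GB"))
print(find_candidates("Calle Mayor, 5, 28013 Madrid", "ES"))
print(find_candidates("Hauptstraße 12, 10115 Berlin", "DE"))
print(find_candidates("12 rue de Rivoli, 75001 Paris", "FR"))
```

Keeping the separator liberal means "Calle Mayor 5 28013 Madrid" (no commas) is picked up by the same ES pattern, at the cost of more false positives for the later checks to filter out.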

The Wikipedia page on addresses has a good collection of formats: https://en.wikipedia.org/wiki/Address

@huu4ontocord
Owner Author

@shamikbose see above ^^.

@shamikbose
Collaborator

Thanks, @ontocord! I will look into it this weekend.

3 participants