Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code duplication: determine an entity type by looking at the format of its id #424

Closed
wetneb opened this issue Aug 2, 2019 · 7 comments · Fixed by #498
Closed

Code duplication: determine an entity type by looking at the format of its id #424

wetneb opened this issue Aug 2, 2019 · 7 comments · Fixed by #498
Labels
good first issue Well-scoped issue that can be a good first step to get started

Comments

@wetneb
Copy link
Member

wetneb commented Aug 2, 2019

There are at least two places in our code where we try to guess the type of an entity based on the format of its id:

/**
* RReturns the entity type of the id like "item" or "property"
*
* @param id
* the identifier of the entity, such as "Q42"
* @throws IllegalArgumentException
* if the id is invalid
*/
static String guessEntityTypeFromId(String id) {
if(id.isEmpty()) {
throw new IllegalArgumentException("Entity ids should not be empty.");
}
switch (id.charAt(0)) {
case 'L':
if(id.contains("-F")) {
return JSON_ENTITY_TYPE_FORM;
} else if(id.contains("-S")) {
return JSON_ENTITY_TYPE_SENSE;
} else {
return JSON_ENTITY_TYPE_LEXEME;
}
case 'P':
return JSON_ENTITY_TYPE_PROPERTY;
case 'Q':
return JSON_ENTITY_TYPE_ITEM;
default:
throw new IllegalArgumentException("Entity id \"" + id + "\" is not supported.");
}
}

and

/**
* Returns the IRI of the primitive Type of an Property for
* {@link EntityIdValue} objects.
*
* @todo this really ought to be exposed by the wdtk-datamodel
* module and reused here. The same heuristic is implemented in {@class EntityIdValueImpl}.
* @param propertyIdValue
* @param value
*/
public String setPropertyTypeFromEntityIdValue(
PropertyIdValue propertyIdValue, EntityIdValue value) {
switch (value.getId().charAt(0)) {
case 'Q':
return DatatypeIdValue.DT_ITEM;
case 'P':
return DatatypeIdValue.DT_PROPERTY;
case 'L':
if (value.getId().contains("F")) {
return DatatypeIdValue.DT_FORM;
} else if(value.getId().contains("S")) {
return DatatypeIdValue.DT_SENSE;
}
return DatatypeIdValue.DT_LEXEME;
default:
logger.warn("Could not determine datatype of "+ propertyIdValue.getId() + ".");
logger.warn("Example value "+value.getId()+ " is not recognized as a valid entity id.");
logger.warn("Perhaps this is a newly introduced datatype not supported by this version of wdtk.");
logger.warn("Consider upgrading the library to a newer version.");
return null;
}
}

That is a problem: it is easy to forget to update one of them (it caused #407).
We should expose this functionality publicly in the wdtk-datamodel module. This raises questions:

  • where should this static method live?
  • what representation of the entity types should it return? Currently one uses the JSON strings, the other uses IRIs for datatypes. I think it should be IRIs for entity types since that is what the EntityIdValue interface offers… But then we probably need other static methods to help convert between all these formats? It's a bit of a mess :-/
@Tpt
Copy link
Collaborator

Tpt commented Aug 2, 2019

Yes, indeed it's a bit of a mess. The choice have been done at the beggining of WikidataToolkit to create URIs for each Wikibase datamodel enumeration (datatypes, entity types...), leading to this duplication. We have two options here:

  1. Drop everywhere the URIs and only returns the JSON string ids that are used by Wikibase. It has the strong advantage of not having to update the mappings each time a new value is introduced.
  2. Keep the library working as it is.

I would slightly prefer 2. because it is less disruptive for the library user. In this case we should probably have a method that returns the Datatype IRI. This method could be public in the EntityIdValue interface. We could then have package private conversion methods between the two representation in EntityIdValueImpl, just like it is done in DatatypeIdImpl for the datatype ids.

@wetneb
Copy link
Member Author

wetneb commented Aug 2, 2019

Option 2 is fine with me too.
Since the entity type IRIs are also exposed publicly in DatatypeIdValue I would say that it would be good to have a public conversion utility, transforming http:https://wikiba.se/ontology#WikibaseForm (datatype IRI) to http:https://www.wikidata.org/ontology#Form (entity type IRI) and back. Conversion to JSON strings can stay private since they are not exposed except in the pathological case of unsupported entities.

@wetneb wetneb added the good first issue Well-scoped issue that can be a good first step to get started label Mar 5, 2020
@snehasi
Copy link
Contributor

snehasi commented May 25, 2020

Hi

Is this issue still open? I was planning to work on it. Would appreciate if you could provide some pointers on it.

From first look, it seems that creating a function entityIdToType in some common module should suffice.

Then we can replace the switch cases in guessEntityTypeFromId and setPropertyTypeFromEntityIdValue with a function call to entityIdToType.

The above mentioned function could have a flag to indicate whether JSON string should be returned, and we could default to IRIs or other way round. This way we might not even need to add a public function for transforming between JSON string and IRIs since we are preserving the old functions and just abstracting out the conversion logic to a common function.

What are your thoughts about it?

@Tpt
Copy link
Collaborator

Tpt commented May 25, 2020

Hi! Yes, the issue is still open. Thank you for being interested in it!

For me entityIdToType (or guessEntityTypeFromId?) should return a DatatypeIdValue to be completely generic. And then, EntityIdValueImpl.guessEntityTypeFromId would just need to convert these types to the JSON IDs with a simple switch. We probably do not want to publicly present a method that returns JSON serialization internals.

@snehasi
Copy link
Contributor

snehasi commented May 25, 2020

HI @Tpt,

guessEntityTypeFromId sounds like a better name for the function.

Although, if we return DatatypeIdValue and do conversions in EntityIdValueImpl.guessEntityTypeFromId then again we'd run into the same problem of keeping in sync two pieces of code that need to be updated anytime a new datatype is added

Alternatively, we could have a package-private / protected function that has the switch-case logic and can return JSON or IRI based on flag.

And then we have a public function wrapper around it, that just returns IRI

protected static String guessEntityTypeFromId(String id, boolean returnJsonEntity) {
    // switch-case logic here
}

public static String guessEntityTypeFromId(Stirng id) {
    return guessEntityTypeFromId(id, false);
}

This way we can get the IRI-JSON flag without ever exposing JSON serialization internals.

@Tpt
Copy link
Collaborator

Tpt commented May 26, 2020

It sounds like a great plan! Thank you!

@snehasi
Copy link
Contributor

snehasi commented May 28, 2020

Hi @Tpt,

I've opened PR #498 with these changes, could you please take a look?

Thank you!

@Tpt Tpt closed this as completed in #498 May 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Well-scoped issue that can be a good first step to get started
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants