Paper: https://aidanhogan.com/docs/sparql-autocompletion.pdf
You will first need a Wikidata dump.
- Wikidata dump, latest truthy (the `.nt.gz` version is larger, but faster for building the index): https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.gz [~50GB] (an example download command is shown below)
- For testing you can use the internal file `SPARQLforHumans\Sample500.nt`.
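To fetch the full dump you can use, for example, `curl` (any download tool will do):
$ curl -L -O https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.gz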
Then some tools for compiling the source code.
- dotnet SDK: https://dotnet.microsoft.com/download/dotnet/thank-you/sdk-5.0.402-windows-x64-installer
- Git for Windows SDK: https://github.com/git-for-windows/build-extra/releases/latest (we need the Git for Windows SDK since an updated version of `gzip` is required to sort the large output files)
- For the `RDFExplorer` client, we will also need `node`: https://nodejs.org/en/download/ (if you are planning on running only the benchmarks, then `node` is not required).
On the Git SDK-64 console (Git for Windows SDK console), install `gzip` via pacman:
$ pacman -S gzip
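You can check that `gzip` is now available with:
$ gzip --version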
We'll now need to clone the repository and build the project. On the Git SDK-64 console:
$ git clone https://github.com/gabrieldelaparra/SPARQLforHumans.git
> Cloning into 'SPARQLforHumans'...
> [...]
$ cd SPARQLforHumans
$ dotnet build .
> [...]
> Build succeeded.
> 0 Warning(s)
> 0 Error(s)
Now we will run some tests to check that everything works.
$ cd SparqlForHumans.UnitTests/
$ dotnet test
> [...]
> Passed! - Failed: 0, Passed: 214, Skipped: 0, Total: 214, Duration: 9 s - SparqlForHumans.UnitTests.dll (netcoreapp3.1)
If any of the tests do not pass, you can create an issue and I will get in touch with you :)
Now we will run the Command Line Interface to filter and index our Wikidata dump.
$ cd ../SparqlForHumans.CLI/
$ dotnet run -- --version
> SparqlForHumans.CLI 1.0.0
For the following sections, a `Sample500.nt` sample file is provided in the root folder of the repository. To build the complete index (production), `latest-truthy.nt.gz` should be used. Note that filtering, sorting and indexing `latest-truthy.nt.gz` will take ~40 hours in total, depending on your system.
Filters an input file:
- Keeps all subjects that start with `https://www.wikidata.org/entity/`
- Keeps all predicates that start with `https://www.wikidata.org/prop/direct/`
  - and the object starts with `https://www.wikidata.org/entity/`
  - or the predicate is `label`, `description` or `alt-label`
    - and the object is a literal that ends with `@en`.
These can be changed in the `SparqlForHumans.RDF\FilterReorderSort\TriplesFilterReorderSort.IsValidLine()` method.
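For illustration only, a filter following the rules above could look roughly like the sketch below. This is a hypothetical helper, not the actual `IsValidLine()` implementation; the predicate matching in particular is simplified:

```csharp
// Hypothetical sketch of the filtering rules above; the real logic lives in
// SparqlForHumans.RDF\FilterReorderSort\TriplesFilterReorderSort.IsValidLine().
public static class TripleFilterSketch
{
    private const string EntityPrefix = "https://www.wikidata.org/entity/";
    private const string DirectPropertyPrefix = "https://www.wikidata.org/prop/direct/";

    public static bool IsValidTriple(string subject, string predicate, string @object)
    {
        // Subjects must be Wikidata entities.
        if (!subject.StartsWith(EntityPrefix))
            return false;

        // Direct properties are kept when the object is also an entity.
        if (predicate.StartsWith(DirectPropertyPrefix))
            return @object.StartsWith(EntityPrefix);

        // label, description and alt-label predicates are kept only for English literals.
        var isLabelLike = predicate.EndsWith("label") ||
                          predicate.EndsWith("description") ||
                          predicate.EndsWith("altLabel");
        return isLabelLike && @object.EndsWith("@en");
    }
}
```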
To filter run:
$ dotnet run -- -i ../Sample500.nt -f
The command for sorting is given in the console after filtering. It will add the `.filterAll.gz` suffix for the filtered output and `.filterAll-Sorted.gz` for the sorted output.
Filtering `latest` takes ~10 hours on my notebook computer (16GB RAM).
Sorting takes `Sample500.filterAll.gz` as input and outputs `Sample500.filterAll-Sorted.gz`. The sorting process gives no notifications about its status. Sorting `latest` takes ~5 hours and requires 3x the size of the filtered `.gz` file in free disk space (~40GB free for `latest`).
$ gzip -dc Sample500.filterAll.gz | LANG=C sort -S 200M --parallel=4 -T tmp/ --compress-program=gzip | gzip > Sample500.filterAll-Sorted.gz
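In this command, `LANG=C` makes `sort` compare raw bytes (faster and locale-independent), `-S 200M` caps the in-memory buffer, `--parallel=4` uses 4 threads, `-T tmp/` places temporary files in `tmp/` (create this directory first if it does not exist), and `--compress-program=gzip` compresses those temporary files to save disk space. You can adjust the buffer size and thread count to your machine.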
After filtering and sorting, we can now create our index. As a note, both `-e -p` can be given together for the sample file to generate both the Entities and Properties indexes; for a large file, it is better to do them in 2 steps.
The Entities Index will be created by default at `%homepath%\SparqlForHumans\LuceneEntitiesIndex\`:
$ dotnet run -- -i Sample500.filterAll-Sorted.gz -e
Note that building the Entities Index takes ~30 hours to complete.
If `-p` was not given above, we also need to create the Properties Index:
$ dotnet run -- -i Sample500.filterAll-Sorted.gz -p
The Properties Index will be created by default at `%homepath%\SparqlForHumans\LucenePropertiesIndex\`.
Note that building the Properties Index takes ~2 hours to complete.

Now our index is ready.
- We can now run our backend via `SparqlForHumans.Server/` using the `RDFExplorer` client.
- Or recreate the results from the paper via `SparqlForHumans.Benchmark/`.
The backend will listen to requests from a modified version of `RDFExplorer`. First we will need to get the server running:
$ cd ../SparqlForHumans.Server/
$ dotnet run
With the server running we can now start the client.
We will now need another console for this.
$ cd `path-for-the-client`
$ git clone https://github.com/gabrieldelaparra/RDFExplorer.git
$ cd RDFExplorer
$ npm install
$ npm start
Now browse to https://localhost:4200/
With the full index we can compare our results against the Wikidata Endpoint.
67 properties (`{Prop}`) have been selected to run 4 types of queries (for a total of 268):
- `?var1 {Prop} ?var2 . ?var1 ?prop ?var3 .`
- `?var1 {Prop} ?var2 . ?var3 ?prop ?var1 .`
- `?var1 {Prop} ?var2 . ?var2 ?prop ?var3 .`
- `?var1 {Prop} ?var2 . ?var3 ?prop ?var2 .`
The 268 queries are run against our Local Index and the Remote Endpoint.
- We will query for `?prop` on both (Local and Remote) and compare the results.
- Running the benchmarks takes 2~3 hours, due to the 50 second timeout when a query cannot be completed on the Wikidata Endpoint.
- The details of the runs will be stored in `benchmark.json`.
- The time results will be summarized in `results.txt`.
- The time results, for each query, will be exported to `points.csv`. Each row is a query. The `Id` of the query can be found in the `benchmark.json` file as `HashCode`.
- A qualitative comparison (`precision`, `recall`, `f1`), for each query, will be exported to `metrics.csv`. Each row is a query. This will only consider those queries that returned results on the Remote Wikidata Endpoint.
$ cd ../SparqlForHumans.Benchmark/
$ dotnet run