This repository has been archived by the owner on Dec 21, 2017. It is now read-only.

Parsing very slow on larger files #3

Open
ijdickinson opened this issue Jul 1, 2010 · 1 comment

Comments

@ijdickinson

I'm reading in a bunch of RDF files, each into its own RdfContext::Graph. The timings I'm getting are shown below. Small files load just fine; larger files take disproportionately long. One file takes about 8.5 minutes (513.3s) to load 38,855 triples. I'm running on a quad-core 64-bit Ubuntu system with 8 GB of memory and using Ruby 1.9.1, so I don't think the raw performance of the machine is an issue.

log file output:

loading concept definitions...
Initializing coins_concept with target/def/sector.nt
... parsing complete in 0.1s producing 39 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/sector
Initializing coins_concept with target/def/data-type.nt
... parsing complete in 1.6s producing 487 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/data-type
Initializing coins_concept with target/def/programme-admin.nt
... parsing complete in 0.2s producing 47 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/programme-admin
Initializing coins_concept with target/def/cga-body-type.nt
... parsing complete in 0.2s producing 47 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/cga-body-type
Initializing coins_concept with target/def/resource-capital.nt
... parsing complete in 0.1s producing 39 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/resource-capital
Initializing coins_concept with target/def/pesa-transfer.nt
... parsing complete in 0.3s producing 87 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/pesa-transfer
Initializing coins_concept with target/def/account-code.nt
... parsing complete in 20.2s producing 4711 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/account-code
Initializing coins_concept with target/def/estimate-number.nt
... parsing complete in 2.5s producing 503 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/estimate-number
Initializing coins_concept with target/def/cofog.nt
... parsing complete in 4.5s producing 1271 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/cofog
Initializing coins_concept with target/def/department-code.nt
... parsing complete in 3.3s producing 847 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/department-code
Initializing coins_concept with target/def/budget-capital-current.nt
... parsing complete in 0.3s producing 47 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/budget-capital-current
Initializing coins_concept with target/def/request-for-resources-next-year.nt
... parsing complete in 0.2s producing 63 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/request-for-resources-next-year
Initializing coins_concept with target/def/counterparty-code.nt
... parsing complete in 1.7s producing 431 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/counterparty-code
Initializing coins_concept with target/def/pesa-delivery.nt
... parsing complete in 0.1s producing 31 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/pesa-delivery
Initializing coins_concept with target/def/income-category.nt
... parsing complete in 0.5s producing 111 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/income-category
Initializing coins_concept with target/def/estimate-line.nt
... parsing complete in 2.1s producing 615 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/estimate-line
Initializing coins_concept with target/def/programme-object-group-code.nt
... parsing complete in 125.7s producing 15895 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/programme-object-group-code
Initializing coins_concept with target/def/estimates-aina.nt
... parsing complete in 0.1s producing 39 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/estimates-aina
Initializing coins_concept with target/def/estimates-capital-current.nt
... parsing complete in 2.1s producing 63 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/estimates-capital-current
Initializing coins_concept with target/def/activity-code.nt
... parsing complete in 6.0s producing 1375 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/activity-code
Initializing coins_concept with target/def/estimate-number-next-year.nt
... parsing complete in 2.4s producing 503 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/estimate-number-next-year
Initializing coins_concept with target/def/accounting-authority.nt
... parsing complete in 0.9s producing 159 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/accounting-authority
Initializing coins_concept with target/def/pesa-current-grants.nt
... parsing complete in 1.0s producing 215 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/pesa-current-grants
Initializing coins_concept with target/def/estimate-line-next-year.nt
... parsing complete in 2.8s producing 615 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/estimate-line-next-year
Initializing coins_concept with target/def/request-for-resources.nt
... parsing complete in 0.2s producing 63 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/request-for-resources
Initializing coins_concept with target/def/pesa-services.nt
... parsing complete in 0.4s producing 39 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/pesa-services
Initializing coins_concept with target/def/estimate-line-last-year.nt
... parsing complete in 2.6s producing 575 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/estimate-line-last-year
Initializing coins_concept with target/def/nac.nt
... parsing complete in 4.0s producing 951 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/nac
Initializing coins_concept with target/def/estimate-number-last-year.nt
... parsing complete in 2.5s producing 495 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/estimate-number-last-year
Initializing coins_concept with target/def/budget-boundary.nt
... parsing complete in 0.1s producing 39 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/budget-boundary
Initializing coins_concept with target/def/pesa-1.1.nt
... parsing complete in 0.1s producing 31 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/pesa-1.1
Initializing coins_concept with target/def/esa.nt
... parsing complete in 2.6s producing 543 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/esa
Initializing coins_concept with target/def/territory.nt
... parsing complete in 0.2s producing 71 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/territory
Initializing coins_concept with target/def/data-subtype.nt
... parsing complete in 2.3s producing 471 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/data-subtype
Initializing coins_concept with target/def/department-group.nt
... parsing complete in 2.1s producing 439 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/department-group
Initializing coins_concept with target/def/signage.nt
... parsing complete in 0.1s producing 31 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/signage
Initializing coins_concept with target/def/request-for-resources-last-year.nt
... parsing complete in 0.4s producing 63 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/request-for-resources-last-year
Initializing coins_concept with target/def/programme-object-code.nt
... parsing complete in 513.3s producing 38855 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/programme-object-code
Initializing coins_concept with target/def/sbi.nt
... parsing complete in 8.1s producing 455 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/sbi
Initializing coins_concept with target/def/time.nt
... parsing complete in 0.8s producing 119 triples
... indexed as https://finance.data/gov.uk/def/statistical-concept/time
Total time taken 720.9s

The files are in N-Triples format. I also tried Turtle input but gave up after waiting too long! I've tried both :list_store and :memory_store, and it doesn't make much difference.

My guess is that something in the parser loop is not scaling linearly with the size of the input file, but that's just a guess. I don't think there's anything special about the input files themselves, but am happy to provide copies if that helps with debugging.
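The non-linear scaling can be checked directly from the timings in the log. The following is a minimal sketch in plain Ruby that uses four of the logged (seconds, triples) pairs; the helper names `log_samples`, `ms_per_triple` and `growth_exponent` are made up for this illustration and are not part of RdfContext.

```ruby
# Four timings copied from the log above: [file, seconds, triples].
def log_samples
  [
    ["sector.nt", 0.1, 39],
    ["account-code.nt", 20.2, 4711],
    ["programme-object-group-code.nt", 125.7, 15895],
    ["programme-object-code.nt", 513.3, 38855],
  ]
end

# Average parse cost per triple, in milliseconds.
def ms_per_triple(seconds, triples)
  seconds * 1000.0 / triples
end

# Estimates k in "time ~ triples**k" from two samples (slope on a log-log plot).
# k close to 1 would mean linear scaling.
def growth_exponent(small, large)
  Math.log(large[1] / small[1]) / Math.log(large[2].to_f / small[2])
end

log_samples.each do |name, seconds, triples|
  puts format("%-32s %6d triples  %6.2f ms/triple",
              name, triples, ms_per_triple(seconds, triples))
end

k = growth_exponent(log_samples[1], log_samples[3])
puts format("estimated growth exponent: %.2f", k)
```

With these figures the per-triple cost rises from under 3 ms for the smallest file to about 13 ms for the largest, and the exponent estimated from account-code.nt and programme-object-code.nt comes out at roughly 1.5, which is consistent with the guess that something is scaling worse than linearly.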

Ian

@gkellogg
Owner

gkellogg commented Jul 1, 2010

The SQLite3 store will provide persistent storage, and may scale better for even larger graphs, but it is slower for smaller graphs. That would be :store => SQLite3.new(:path => "store.db"). You may also have found a memory leak within the parser. The N-Triples parser is the same as the Turtle/N3 parser, so that could be an issue. Do you have the same problem parsing large files in other serializations?

If you have a script to run through these, I'll check it out.
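A script of that kind could be as small as the sketch below. It is only an assumption about what such a harness might look like: the parse step is passed in as a block because the exact RdfContext call used to produce the log is not shown in this thread, and `count_ntriples_lines` is a stand-in that merely counts statement lines rather than parsing them. The `target/def/*.nt` glob follows the paths in the log.

```ruby
require "benchmark"

# Times one parse per file and returns [path, seconds, triple_count] rows.
# The block does the actual parsing and must return the number of triples;
# it is a placeholder for whatever parser call the real script uses.
def time_parses(paths)
  paths.map do |path|
    count = nil
    seconds = Benchmark.realtime { count = yield(path) }
    [path, seconds, count]
  end
end

# Stand-in "parser" (an assumption, not RdfContext): counts lines that are
# neither blank nor comments, i.e. one statement per line in N-Triples.
def count_ntriples_lines(path)
  File.foreach(path).count { |line| line !~ /\A\s*(#|\z)/ }
end

if __FILE__ == $PROGRAM_NAME
  files = Dir.glob("target/def/*.nt").sort
  rows = time_parses(files) { |path| count_ntriples_lines(path) }
  rows.each do |path, seconds, count|
    puts format("%s: parsing complete in %.1fs producing %d triples",
                path, seconds, count)
  end
end
```

Swapping the stand-in block for the real parse call would reproduce the log format above and make it easy to compare stores or serializations file by file.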

Also, note that the same parsers and serializers in RdfContext are also available through RDF.rb as rdf-rdfa, rdf-n3 and rdf-rdfxml. RDF.rb has a richer infrastructure for graph storage than RdfContext. I've also noticed that RDF/XML parsing is substantially faster, due to some underlying optimizations in that implementation.
