yacy edited this page Sep 14, 2010 · 4 revisions

Cider is a parser library framework for the semantic web.

The framework contains a large number of text, image and audio parsers whose results are stored in an RDF data structure provided by Jena. The resulting RDF is annotated using vocabularies such as Dublin Core, FOAF and SKOS, along with other vocabularies needed to track the origin, references and processing of the data, which is the information usually required to feed a search engine.
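To make the idea of Dublin Core annotated RDF concrete, here is a minimal sketch in plain Java that prints a few triples in N-Triples syntax. It deliberately does not use Jena or any Cider API: the class `DublinCoreTriples`, its methods and the example URL are hypothetical names chosen for illustration, and only the Dublin Core namespace and the `title`, `language` and `source` properties come from the vocabulary itself.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not part of Cider or Jena.
public class DublinCoreTriples {
    // Namespace of the Dublin Core element set.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    // Quote a value as an N-Triples literal, escaping backslashes and double quotes.
    // (Line breaks and other control characters are not handled in this sketch.)
    static String literal(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Build one line of the form: <subject> <dc:property> "value" .
    static String triple(String subject, String property, String value) {
        return "<" + subject + "> <" + DC + property + "> " + literal(value) + " .";
    }

    // Describe a parsed document (assumed fields: URL, title, language).
    static List<String> describe(String url, String title, String language) {
        List<String> triples = new ArrayList<>();
        triples.add(triple(url, "title", title));
        triples.add(triple(url, "language", language));
        triples.add(triple(url, "source", url));
        return triples;
    }

    public static void main(String[] args) {
        for (String t : describe("http://example.org/doc.html", "An example page", "en")) {
            System.out.println(t);
        }
    }
}
```

In a Jena-based setup the same statements would normally be added to a Jena model rather than assembled as strings; the string form is used here only to keep the sketch free of dependencies.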

To provide a semantic view of the parsed documents, Cider also applies content analysis, such as:

  • detecting the language
  • annotation of persons, locations and dates
  • finding references to other documents
  • detecting theme-oriented data (like architecture, biology, chemistry, computers, music, philosophy, physics etc.).

This can be done using dictionaries from other sources that provide RDF-annotated data, such as Freebase, OpenCyc, DBpedia, YAGO, MusicBrainz, GeoNames, DBLP and census data.
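The dictionary-based annotation described above can be sketched roughly as follows. This is an assumption about the general technique, not Cider's actual implementation: the class `DictionaryAnnotator`, the three-entry dictionary and the category names are all made up for illustration, and a real dictionary would be loaded from one of the RDF sources rather than written by hand.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: not part of Cider.
public class DictionaryAnnotator {
    // Tiny hand-written dictionary mapping a lower-case term to a category.
    static final Map<String, String> DICTIONARY = Map.of(
        "berlin", "location",
        "einstein", "person",
        "physics", "theme");

    // Return every dictionary term found in the text together with its
    // category, in order of first appearance.
    static Map<String, String> annotate(String text) {
        Map<String, String> found = new LinkedHashMap<>();
        // Split on anything that is neither a letter nor a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            String category = DICTIONARY.get(token);
            if (category != null) {
                found.putIfAbsent(token, category);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(annotate("Einstein taught physics in Berlin."));
    }
}
```

This sketch matches single tokens only; multi-word names and ambiguous terms would need more than a plain lookup.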

The third component of Cider is a data representation module that presents the parsed data in views suited to common search engine use cases, such as:

  • providing a ‘navigation’-view (show all data that can be used for a search navigator)
  • providing a ‘ranked bag of words’-view (list all words in context of ranking attributes for word importance)
  • providing a ‘snippet’-view (show a text snippet for a given search string)
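As a rough illustration of two of the views listed above, the following plain-Java sketch builds a ranked bag of words and a text snippet. It is not Cider's API: `ViewSketch`, `rankedBagOfWords` and `snippet` are hypothetical names, and raw word frequency is used here as a stand-in for whatever ranking attributes the real module applies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: not part of Cider.
public class ViewSketch {
    // Count word occurrences and return them ordered by descending count,
    // with ties broken alphabetically.
    static List<Map.Entry<String, Integer>> rankedBagOfWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    // Return the first case-insensitive match of the query with up to
    // `radius` characters of context on each side, or "" if there is no match.
    static String snippet(String text, String query, int radius) {
        for (int i = 0; i + query.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, query, 0, query.length())) {
                int start = Math.max(0, i - radius);
                int end = Math.min(text.length(), i + query.length() + radius);
                return text.substring(start, end);
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String text = "Cider is a parser library framework";
        System.out.println(rankedBagOfWords(text));
        System.out.println(snippet(text, "parser", 3));
    }
}
```

A production snippet view would usually cut at word boundaries and highlight the match; the character-based window keeps the sketch short.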

Read on: code examples, Eclipse how-to, applications
