% BEACON link dump format % Jakob Voß
BEACON is a data interchange format for large numbers of uniform links. A BEACON link dump consists of a set of links and a set of meta fields.
Each link consists of a source URI, a target URI, and an annotation. Common patterns in source URIs and target URIs respectively can be used to abbreviate links. This specification defines:
- two serializations of link dumps (BEACON files): a condensed line-oriented format and an XML format,
- two interpretations of link dumps: a mapping to HTML and a mapping to RDF.
The current specification is managed at https://github.com/gbv/beaconspec.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in .
The formal grammar rules in this document are to be interpreted as Augmented Backus-Naur Form (ABNF), including the ABNF core rules HTAB, LF, CR, and SP. In addition, the minus operator (-) is used to exclude line breaks and vertical bars from the rules LINE and TOKEN:
LINE = *CHAR - ( *CHAR LINEBREAK *CHAR )
TOKEN = *CHAR - ( *CHAR ( LINEBREAK / VBAR ) *CHAR )
LINEBREAK = LF / CR LF / CR ; "\n", "\r\n", or "\r"
VBAR = %x7C ; vertical bar ("|")
RDF in this document is expressed in Turtle syntax . The following namespace prefixes are used to refer to RDF properties and classes from the RDF and RDFS vocabularies , the DCMI Metadata Terms , the FOAF vocabulary , the VoID vocabulary , and the RSS 1.0 Syndication Module :
rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
rdfs: <http://www.w3.org/2000/01/rdf-schema#>
dcterms: <http://purl.org/dc/terms/>
foaf: <http://xmlns.com/foaf/0.1/>
void: <http://rdfs.org/ns/void#>
rssynd: <http://web.resource.org/rss/1.0/modules/syndication/>
The blank node :dump denotes the URI of the link dump and the blank node :targetset denotes the URI of the target dataset.
A Unicode string is normalized according to this specification by stripping leading and trailing whitespace and by replacing all WHITESPACE character sequences by a single space (SP).
WHITESPACE = 1*( CR / LF / HTAB / SP )
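Whitespace normalization can be sketched in a few lines of Python. This is only an illustration; the function name normalize_whitespace is hypothetical and not part of this specification. The sketch deliberately treats only CR, LF, HTAB, and SP as whitespace, as the WHITESPACE rule does.

```python
import re

# Only CR, LF, HTAB and SP count as WHITESPACE; other Unicode spaces are kept.
_WHITESPACE = re.compile(r"[\r\n\t ]+")


def normalize_whitespace(value: str) -> str:
    """Collapse every WHITESPACE run to one space and trim both ends."""
    return _WHITESPACE.sub(" ", value).strip(" ")


print(normalize_whitespace("  Hello \t\r\n World! "))  # prints: Hello World!
```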
The set of allowed Unicode characters in BEACON dumps is the set of valid Unicode characters from UCS which can also be expressed in XML 1.0, excluding some discouraged control characters:
CHAR = WHITESPACE / %x21-7E / %xA0-D7FF / %xE000-FFFD
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE0000-EFFFD / %xF0000-FFFFD
/ %x100000-10FFFD
Applications SHOULD exclude disallowed characters by stripping them, by replacing them with the replacement character U+FFFD, or by refusing to process. Applications SHOULD also apply Unicode Normalization Form Canonical Composition (NFC) to all strings.
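As an illustration, the character check and one of the permitted recovery strategies (replacement with U+FFFD, followed by canonical composition) might look as follows in Python. The function names is_allowed and clean_string are hypothetical; stripping the characters or refusing to process would be equally valid choices.

```python
import unicodedata


def is_allowed(ch: str) -> bool:
    """Return True if ch falls into one of the ranges of the CHAR rule."""
    cp = ord(ch)
    if ch in "\t\n\r ":
        return True
    if 0x21 <= cp <= 0x7E or 0xA0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD:
        return True
    # Supplementary planes 1 to 16, without the last two code points of each.
    return 0x10000 <= cp <= 0x10FFFF and (cp & 0xFFFF) <= 0xFFFD


def clean_string(value: str) -> str:
    """Replace disallowed characters with U+FFFD, then compose canonically."""
    replaced = "".join(ch if is_allowed(ch) else "\ufffd" for ch in value)
    return unicodedata.normalize("NFC", replaced)
```

For example, clean_string("a\x00b") yields "a\ufffdb", and a decomposed "e" followed by U+0301 is composed into U+00E9.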
A URI pattern in this specification is a URI Template, as defined in , with all template expressions being either {ID} for simple string expansion or {+ID} for reserved expansion.
A URI pattern is used to construct a URI by replacing all template expressions with an identifier value. All identifier characters in the unreserved range are copied literally. For {+ID} expressions, characters in the reserved range and character sequences matching the pct-encoded rule are copied literally as well. All other characters are copied to the URI as the sequence of pct-encoded triplets corresponding to that character’s encoding in UTF-8. The referenced character ranges are reproduced here for convenience:
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
A URI pattern is allowed to contain the broader set of characters allowed in Internationalized Resource Identifiers (IRI) . The URI constructed from a URI pattern by template processing can be transformed to an IRI by following the process defined in Section 3.2 of .
Example value    Expression   Copied as
-----------------------------------------------
path/dir         {ID}         path%2Fdir
path/dir         {+ID}        path/dir
Hello World!     {ID}         Hello%20World%21
Hello World!     {+ID}        Hello%20World!
Hello%20World    {ID}         Hello%2520World
Hello%20World    {+ID}        Hello%20World
M%C3%BCller      {ID}         M%25C3%25BCller
M%C3%BCller      {+ID}        M%C3%BCller
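The two expansion modes can be sketched with Python's standard urllib.parse.quote. The names expand and _encode are hypothetical, and the sketch assumes that the pattern contains only {ID} and {+ID} expressions. It is not a general URI Template processor.

```python
import re
from urllib.parse import quote

_UNRESERVED = "-._~"                 # ALPHA and DIGIT are never quoted
_RESERVED = ":/?#[]@!$&'()*+,;="     # gen-delims and sub-delims
_PCT = re.compile(r"(%[0-9A-Fa-f]{2})")
_EXPR = re.compile(r"\{(\+?)ID\}")


def _encode(identifier: str, reserved: bool) -> str:
    """Percent-encode an identifier for {ID} or, if reserved, for {+ID}."""
    if not reserved:
        return quote(identifier, safe=_UNRESERVED)
    # Splitting on a capturing group puts pct-encoded triplets at odd indexes;
    # these are copied literally, everything else is quoted.
    parts = _PCT.split(identifier)
    return "".join(
        part if index % 2 else quote(part, safe=_UNRESERVED + _RESERVED)
        for index, part in enumerate(parts)
    )


def expand(pattern: str, identifier: str) -> str:
    """Replace every template expression in pattern with the identifier."""
    return _EXPR.sub(lambda m: _encode(identifier, m.group(1) == "+"), pattern)


print(expand("{ID}", "Hello World!"))   # prints: Hello%20World%21
print(expand("{+ID}", "Hello World!"))  # prints: Hello%20World!
```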
A link in a link dump is a directed connection between two resources that are identified by URIs. A link is composed of three elements:
- a source URI,
- a target URI,
- an annotation.
Source URI and target URI define where a link is pointing from and to respectively. The annotation is an optional whitespace-normalized Unicode string that can be used to further describe the link or parts of it. A missing annotation is equal to the empty string. Annotations MUST match the grammar rule TOKEN. The meaning of a link can be indicated by the relation type meta field.
Link elements are given in abbreviated form of link tokens when serialized in a BEACON file. Each link is constructed from:
- a mandatory source token
- an optional annotation token
- an optional target token, which is set to the source token if missing
All tokens MUST be whitespace-normalized before further processing. The full link is then constructed as follows:
- The source URI is constructed from the prefix meta field URI pattern by inserting the source token, as defined for URI patterns above.
- The target URI is constructed from the target meta field URI pattern by inserting the target token, as defined for URI patterns above.
- The annotation is constructed from the message meta field by literally replacing every occurrence of the character sequence {annotation} by the annotation token. The resulting string MUST be whitespace-normalized after construction. Additional encoding MUST NOT be applied.
The following table illustrates construction of a link:
meta field + link token --> link element
----------------------------------------------
prefix | source | source URI
target | target | target URI
message | annotation | annotation
Constructed source URI and target URI MUST be syntactically valid.
Applications MUST ignore links with invalid URIs and SHOULD give a warning.
Note that annotation tokens are always ignored if the message meta field does not contain the sequence {annotation}. Applications SHOULD give a warning in this case.
Applications MUST NOT differentiate between equal links constructed from different abbreviations. For instance the following BEACON text file contains a single link:
#PREFIX: http://example.org/
#TARGET: http://example.com/
#MESSAGE: Hello World!
foo
The same link could also be serialized without any meta fields:
http://example.org/foo|Hello World!|http://example.com/foo
The default meta field values could also be specified as:
#PREFIX: {+ID}
#TARGET: {+ID}
#MESSAGE: {annotation}
Another possible serialization is:
#PREFIX: http://example.org/
#TARGET: http://example.com/
#MESSAGE: Hello {annotation}
foo|World!
The link line in this example is equal to:
foo|World!|foo
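The equivalence of these serializations can be checked with a small sketch of link construction. All names here (construct_link, expand, target_pattern, and so on) are hypothetical. The sketch assumes that meta field values are already whitespace-normalized, treats an empty target token as missing, and applies the rule from the prefix and target field definitions that {ID} is appended to a pattern without a template expression. It does not check the constructed URIs for syntactic validity.

```python
import re
from urllib.parse import quote

_WS = re.compile(r"[\r\n\t ]+")
_PCT = re.compile(r"(%[0-9A-Fa-f]{2})")
_EXPR = re.compile(r"\{(\+?)ID\}")
_UNRESERVED = "-._~"
_RESERVED = ":/?#[]@!$&'()*+,;="


def normalize(value: str) -> str:
    return _WS.sub(" ", value).strip(" ")


def _encode(token: str, reserved: bool) -> str:
    if not reserved:
        return quote(token, safe=_UNRESERVED)
    # Odd indexes hold pct-encoded triplets, which are copied literally.
    return "".join(
        part if index % 2 else quote(part, safe=_UNRESERVED + _RESERVED)
        for index, part in enumerate(_PCT.split(token))
    )


def expand(pattern: str, token: str) -> str:
    if not _EXPR.search(pattern):
        pattern += "{ID}"  # patterns without an expression get {ID} appended
    return _EXPR.sub(lambda m: _encode(token, m.group(1) == "+"), pattern)


def construct_link(source, annotation="", target="", *,
                   prefix="{+ID}", target_pattern="{+ID}",
                   message="{annotation}"):
    """Return (source URI, target URI, annotation) for one set of tokens."""
    source = normalize(source)
    target = normalize(target) or source  # assumption: empty means missing
    text = message.replace("{annotation}", normalize(annotation))
    return expand(prefix, source), expand(target_pattern, target), normalize(text)


meta = {"prefix": "http://example.org/", "target_pattern": "http://example.com/"}
a = construct_link("foo", message="Hello World!", **meta)
b = construct_link("http://example.org/foo", "Hello World!", "http://example.com/foo")
c = construct_link("foo", "World!", message="Hello {annotation}", **meta)
print(a == b == c)  # prints: True
```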
Multiple occurrences of equal links in one BEACON file SHOULD be ignored. It is RECOMMENDED to indicate duplicated links with a warning.
All links in a link dump share a common relation type. A relation type is either a URI or a registered link type from the IANA link relations registry. The relation type is specified by the relation meta field in BEACON files.
Some examples of relation types:
http://www.w3.org/2002/07/owl#sameAs
http://xmlns.com/foaf/0.1/isPrimaryTopicOf
http://purl.org/spar/cito/cites
describedby
replies
A link dump SHOULD contain a set of meta fields, each identified by its name built of lowercase letters a-z. Relevant meta fields for description of the source and target datasets, the link dump, and links are defined in the following. Additional meta fields, not defined in this specification, SHOULD be ignored. All meta field values MUST be whitespace-normalized. Missing meta field values and empty strings MUST be set to the field’s default value, which is the empty string unless noted otherwise. The following diagram shows which meta fields belong to which dataset:
+----------------+ +-----------------+ +-----------------+
| source dataset | ---| link dump |---> | target dataset |
|----------------| |-----------------| |-----------------|
| | ---| |---> | |
| * sourceset | | * description | | * targetset |
| | ---| * creator |---> | * name |
+----------------+ | * contact | | * institution |
| * update | | |
| * timestamp | +-----------------+
| * feed |
| * homepage |
| |
+-----------------+
| |
| * prefix |
| * target |
| * relation |
| * message |
| * annotation |
| |
+-----------------+
The set that all source URIs in a link dump originate from is called the source dataset and the set that all target URIs originate from is called the target dataset.
The source dataset can be identified by the sourceset meta field, which MUST be a URI if given. This field replaces the blank node :sourceset.
The target dataset can be identified by the targetset meta field, which MUST be a URI if given. This field replaces the blank node :targetset.
The name meta field contains a name or title of the target dataset. This field is mapped to the RDF property dcterms:title. For instance the name meta field value "ACME documents" can be mapped to this RDF triple:
:targetset dcterms:title "ACME documents" .
The institution meta field contains the name or URI of the organization or of an individual responsible for making available the target dataset. This field is mapped to the RDF property dcterms:publisher. For instance the institution meta field value "ACME" can be mapped to this RDF triple:
:targetset dcterms:publisher "ACME" .
The description meta field contains a human readable description of the link dump. This field is mapped to the dcterms:description RDF property. For instance the description meta field value "Mapping from ids to documents" can be mapped to this RDF triple:
:dump dcterms:description "Mapping from ids to documents" .
The creator meta field contains the URI or the name of the person, organization, or service primarily responsible for making the link dump. This field is mapped to the dcterms:creator RDF property. The creator is an instance of the class foaf:Agent.
For instance the following creator meta field values:
"Bea Beacon"
"http://example.org/people/bea"
can be mapped to the following RDF triples, respectively:
:dump dcterms:creator "Bea Beacon" .
:dump dcterms:creator [ a foaf:Agent ; foaf:name "Bea Beacon" ] .
:dump dcterms:creator <http://example.org/people/bea> .
<http://example.org/people/bea> a foaf:Agent .
This field SHOULD NOT contain a simple URL unless this URL is also used as URI.
The contact meta field contains an email address or similar contact information to reach the creator of the link dump. The field value SHOULD be a mailbox address as specified in section 3.4 of , for instance:
[email protected]
Bea Beacon <[email protected]>
The contact meta field is mapped to the foaf:mbox and to the foaf:name RDF properties. The domain of the contact meta field is the BEACON dump. The sample field values can be mapped to:
:dump dcterms:creator [
foaf:mbox <mailto:[email protected]>
] .
:dump dcterms:creator [
foaf:name "Bea Beacon" ;
foaf:mbox <mailto:[email protected]>
] .
The homepage field contains a URL of a website with additional information about this link dump. This field corresponds to the RDF property foaf:homepage with :dump as subject. Note that this field does not specify the homepage of the target dataset.
:dump foaf:homepage <http://example.org/about.html> .
The feed field contains a URL from which the link dump can be downloaded. This field corresponds to the RDF property void:dataDump. An example mapped to an RDF triple:
:dump void:dataDump <http://example.com/beacon.txt> .
The timestamp field contains the date of last modification of the link dump. Note that this value MAY be different from the last modification time of a BEACON file that serializes the link dump. The timestamp value MUST conform to the full-date or to the date-time production rule in . In addition, an uppercase T character MUST be used to separate date and time, and an uppercase Z character MUST be present in the absence of a numeric time zone offset. This field corresponds to the dcterms:modified property.
For instance the following valid timestamp values:
2012-05-30
2012-05-30T15:17:36+02:00
2012-05-30T13:17:36Z
can be mapped to the following RDF triples, respectively:
:dump dcterms:modified "2012-05-30" .
:dump dcterms:modified "2012-05-30T15:17:36+02:00" .
:dump dcterms:modified "2012-05-30T13:17:36Z" .
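A possible validity check for timestamp values is sketched below. The function name is_valid_timestamp is hypothetical. The regular expression is an approximation of the full-date and date-time rules with the additional uppercase T and Z requirements. As a simplification, it accepts second 60 in any minute, and year 0000 is rejected because Python's date type does not support it.

```python
import re
from datetime import date

_TIMESTAMP = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})"                      # full-date
    r"(?:T(?:[01][0-9]|2[0-3]):[0-5][0-9]:(?:[0-5][0-9]|60)"  # time
    r"(?:\.[0-9]+)?"                                          # fraction
    r"(?:Z|[+-](?:[01][0-9]|2[0-3]):[0-5][0-9]))?"            # Z or offset
)


def is_valid_timestamp(value: str) -> bool:
    """Check the syntax and verify that the calendar date exists."""
    match = _TIMESTAMP.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(group) for group in match.groups())
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True
```

The three sample values above are accepted, whereas "2012-05-30t13:17:36z" (lowercase separators), "2012-05-30T13:17:36" (no time zone), and "2012-02-30" (no such day) are rejected.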
The update field specifies how frequently the link dump is likely to change. The field corresponds to the <changefreq> element in Sitemaps XML format. Valid values are:
always
hourly
daily
weekly
monthly
yearly
never
The value always SHOULD be used to describe link dumps that change each time they are accessed. The value never SHOULD be used to describe archived link dumps. Please note that the value of this tag is considered a hint and not a command.
The RDF property of this field is rssynd:updatePeriod.
The prefix field specifies a URI pattern to construct link sources. If no prefix field is specified, the default value {+ID} is used. If the prefix field contains no template expression, the expression {ID} is appended.
The name prefix was chosen to keep backwards compatibility with existing BEACON files.
Applications MAY map the prefix field to the RDF property void:uriSpace or void:uriRegexPattern with :sourceset as RDF subject.
The target field specifies a URI pattern to construct link targets. If no target field is specified, the default value {+ID} is used. If the target field contains no template expression, the expression {ID} is appended.
Applications MAY map the target field to the RDF property void:uriSpace or void:uriRegexPattern with :targetset as RDF subject.
The message field is used as template for link annotations. The default value is {annotation}.
The relation field specifies the relation type for all links in a link dump. The field value MUST be a URI. The default relation type is rdfs:seeAlso. This field is mapped to the RDF property void:linkPredicate with subject :dump.
The annotation field specifies the RDF property between link target and link annotation. The default value is rdf:value, which has no specific meaning.
A BEACON text file is a UTF-8 encoded Unicode file, split into lines by line breaks (rule LINEBREAK). The file consists of a set of lines with meta fields, followed by a set of lines with link tokens. A BEACON text file MAY begin with a Unicode Byte Order Mark and it SHOULD end with a line break:
BEACONTEXT = [ BOM ] [ START ]
*METALINE
*EMPTY
LINKLINE *( LINEBREAK LINKLINE )
[ LINEBREAK ]
BOM = %xEF.BB.BF ; Unicode UTF-8 Byte Order Mark
The order of meta lines and of link lines, respectively, is irrelevant. At least one empty line SHOULD be used to separate meta lines and link lines. If no empty line is given, the first link line MUST NOT begin with "#".
EMPTY = *WHITESPACE LINEBREAK
The BEACON text file SHOULD start with a fixed meta field:
START = "#FORMAT:" 1*WHITESPACE "BEACON" *WHITESPACE LINEBREAK
A meta line specifies a meta field and its value. Meta field names MUST be given in uppercase letters. All meta lines with a METAFIELD that is not one of the field names defined in this specification SHOULD be ignored.
METALINE = "#" METAFIELD ":" METAVALUE LINEBREAK
METAFIELD = 1*( %x41-5A ) ; "A" to "Z"
METAVALUE = LINE
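A meta line can be recognized with a single regular expression, as in the following sketch. The function name parse_meta_line is hypothetical. The sketch returns the field name in lowercase, as used for meta field names elsewhere in this specification, together with the whitespace-normalized value. It leaves the decision whether a field is known to the caller.

```python
import re

_WS = re.compile(r"[\r\n\t ]+")
_METALINE = re.compile(r"#([A-Z]+):(.*)")


def parse_meta_line(line: str):
    """Return (field name, value) for a meta line, or None for other lines."""
    match = _METALINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        return None
    name, value = match.groups()
    return name.lower(), _WS.sub(" ", value).strip(" ")


print(parse_meta_line("#PREFIX: http://example.org/"))
# prints: ('prefix', 'http://example.org/')
```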
Each link is given on a link line with its source token, optionally followed by annotation token and target token:
LINKLINE = SOURCE [
VBAR ANNOTATION /
VBAR ANNOTATION VBAR TARGET /
VBAR TARGET
]
SOURCE = TOKEN
TARGET = TOKEN
ANNOTATION = TOKEN
The ambiguity of rule LINKLINE with one occurrence of VBAR is resolved as follows:
- If the target meta field has its default value {+ID}, and the message meta field has its default value {annotation}, and the normalized second token begins with "http:" or "https:", then the second token is used as target token.
- The second token is used as annotation token otherwise.
This way one can use two forms to encode links to HTTP URIs (given target meta field and message meta field with their default values):
foo|http://example.org/foobar
foo||http://example.org/foobar
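The disambiguation rule can be sketched as follows. The function name parse_link_line and its two flag parameters are hypothetical; the flags tell the function whether the target and message meta fields have their default values. A missing target token is returned as None.

```python
import re

_WS = re.compile(r"[\r\n\t ]+")


def parse_link_line(line, target_is_default=True, message_is_default=True):
    """Split a link line into (source, annotation, target) tokens."""
    tokens = [_WS.sub(" ", token).strip(" ") for token in line.split("|")]
    if len(tokens) > 3:
        raise ValueError("a link line holds at most three tokens")
    source, annotation, target = tokens[0], "", None
    if len(tokens) == 3:
        annotation, target = tokens[1], tokens[2]
    elif len(tokens) == 2:
        second = tokens[1]
        if (target_is_default and message_is_default
                and second.startswith(("http:", "https:"))):
            target = second      # the single extra token is a target
        else:
            annotation = second  # otherwise it is an annotation
    return source, annotation, target


print(parse_link_line("foo|http://example.org/foobar"))
# prints: ('foo', '', 'http://example.org/foobar')
```

With non-default meta fields, for example parse_link_line("foo|http://example.org/foobar", target_is_default=False), the same second token is read as an annotation.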
A BEACON XML file is a valid XML file conforming to the following schema. The file SHOULD be encoded in UTF-8. The file MUST:
- Begin with an opening <beacon> tag and end with a closing </beacon> tag.
- Specify the default namespace http://purl.org/net/example.
- Include an empty <link/> tag for each link.
- Include the source token as XML attribute source of each <link/> element.
The file MAY further:
- Specify meta fields as XML attributes to the <beacon> tag.
- Specify link tokens target and/or annotation as attributes to the <link> element.
All attributes MUST be given in lowercase. An informal schema of BEACON XML files is given in .
To process BEACON XML files, a complete and stream-processing XML parser, for instance the Simple API for XML (SAX), is RECOMMENDED in preference to parsing with regular expressions or similar error-prone methods. Additional XML attributes of <link> elements and <link> elements without source attribute SHOULD be ignored.
Note that in contrast to BEACON text files, link tokens MAY include line breaks, which MUST be removed by whitespace normalization. Furthermore source token, annotation token, and target token MAY include a vertical bar, which MUST be replaced by the character sequence %7C before further processing.
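A stream-based reader along these lines could be built on Python's xml.sax module. The sketch below is illustrative only: the names BeaconHandler and parse_beacon_xml are hypothetical, and as a simplification the handler matches element names without verifying the namespace declaration.

```python
import re
import xml.sax

_WS = re.compile(r"[\r\n\t ]+")


def _clean(token: str) -> str:
    """Whitespace-normalize a token and replace vertical bars by %7C."""
    return _WS.sub(" ", token).strip(" ").replace("|", "%7C")


class BeaconHandler(xml.sax.ContentHandler):
    """Collect meta fields and link tokens while the file is streamed."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.links = []

    def startElement(self, name, attrs):
        if name == "beacon":
            self.meta = {key: value for key, value in attrs.items()
                         if key != "xmlns"}
        elif name == "link" and "source" in attrs:
            # Unknown attributes and links without source are ignored.
            target = _clean(attrs["target"]) if "target" in attrs else None
            annotation = _clean(attrs.get("annotation", ""))
            self.links.append((_clean(attrs["source"]), annotation, target))


def parse_beacon_xml(data: bytes):
    """Parse a BEACON XML document and return (meta fields, link tokens)."""
    handler = BeaconHandler()
    xml.sax.parseString(data, handler)
    return handler.meta, handler.links
```

Each collected link is a tuple of source token, annotation token, and target token (None if absent), ready for link construction.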
An important use case of BEACON is the creation of HTML links as described below. A link dump can also be mapped to an RDF graph, so BEACON provides an RDF serialization format for a subset of RDF graphs with uniform links.
The following triples are always assumed in mappings of link dumps to RDF:
:sourceset a void:Dataset .
:targetset a void:Dataset .
:dump a void:Linkset ;
void:subjectsTarget :sourceset ;
void:objectsTarget :targetset .
Each link can be mapped to at least one RDF triple with:
- the source URI used as subject IRI,
- the relation type used as predicate,
- the target URI used as object IRI.
As RDF is not defined on URIs but on URI references or IRIs, all URIs MUST be transformed to an IRI by following the process defined in Section 3.2 of . Applications MAY reject mapping link dumps with relation type from the IANA link relations registry, for lack of official URIs. Another valid solution is to extend the RDF model by using blank nodes as predicates.
The annotation SHOULD result in an additional RDF triple, unless its value is the empty string. The additional triple is mapped with:
- the target URI used as subject IRI,
- the annotation meta field used as predicate,
- the annotation value used as literal object.
Applications MAY ignore annotations or map annotations to different kinds of RDF triples if the annotation meta field has the default value rdf:value. For instance an annotation could contain additional information about a link such as its provenance, date, or probability (reification).
Typical use cases of annotations include specification of labels and a "number of hits" at the target dataset. For instance the following BEACON file in BEACON text format:
#PREFIX: http://example.org/
#TARGET: http://example.com/
#RELATION: http://xmlns.com/foaf/0.1/primaryTopic
#ANNOTATION: http://purl.org/dc/terms/extent
abc|12|xy
is mapped to the following RDF triples:
<http://example.org/abc> foaf:primaryTopic <http://example.com/xy> .
<http://example.com/xy> dcterms:extent "12" .
This document does not specify a single mapping of links in a BEACON link dump to links in an HTML document, so the following description is non-normative.
A link in a BEACON dump can be mapped to an HTML link (<a> element) as follows:
- link source corresponds to the website in which the HTML link is included,
- link target corresponds to the href attribute,
- link annotation corresponds to the textual content.
For instance the following link, given in a BEACON text file:
http://example.com|example|http://example.org
can be mapped to the following HTML link:
<a href="http://example.org">example</a>
Note that the annotation field value may be the empty string. In practice, additional meta fields SHOULD be used to construct appropriate HTML links. For instance the meta fields
#RELATION: http://xmlns.com/foaf/0.1/isPrimaryTopicOf
#SOURCETYPE: http://xmlns.com/foaf/0.1/Person
#NAME: ACME documents
can be used to create a link such as
<span>
More information about this person
<a href="http://example.com/foo">at ACME documents</a>.
</span>
because foaf:isPrimaryTopicOf translates to "more information about", foaf:Person translates to "this person", and the target dataset’s name can be used as link label.
Programs should be prepared for malformed and malicious content when parsing BEACON files, when constructing links from link tokens, and when mapping links to RDF or HTML. Possible attacks on parsers include broken UTF-8 and buffer overflows. Link construction can result in unexpectedly long strings and in character sequences that appear harmless when analyzed as parts but are harmful once combined. Most notably, BEACON data may store strings containing HTML and JavaScript code to be used for cross-site scripting attacks on the site displaying BEACON links. Applications should therefore escape or filter all content accordingly with established libraries, such as Apache Escape Utils.
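As a non-normative illustration of such escaping, the following Python sketch builds an HTML link with the standard html.escape function. The function name html_link is hypothetical. Restricting targets to the http and https schemes, and falling back to the URI as label when the annotation is empty, are assumptions of this sketch and not requirements of this document.

```python
from html import escape
from urllib.parse import urlsplit


def html_link(target_uri: str, annotation: str = ""):
    """Return an escaped <a> element, or None for an unsupported target."""
    try:
        scheme = urlsplit(target_uri).scheme.lower()
    except ValueError:
        return None
    if scheme not in ("http", "https"):
        return None  # rejects javascript: and similar schemes
    label = annotation or target_uri
    # escape() also converts quotation marks, so the attribute stays intact.
    return f'<a href="{escape(target_uri)}">{escape(label)}</a>'


print(html_link("http://example.org", "example"))
# prints: <a href="http://example.org">example</a>
```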