Shape Index Specification

Abstract

The open web and linked data are often seen as complementary, yet this doesn't imply that data publication should be disorganized. While the web as a whole lacks a clear structure, web segments can still be structured through hypermedia descriptions and precise specifications. This document introduces the Shape Index specification, designed to facilitate data quality control and optimize query execution in decentralized unindexed networks of linked data. The shape index leverages RDF data shapes to provide explicit constraints on data publication and provide lightweight yet powerful structural and statistical information about the content of RDF datasets in decentralized networks.

Terminology

Subweb: A subset of the web, defined by a set of IRIs or URI template [[RFC6570]] (URI template are considered as sets of IRI).
RDF Data Shape: A schema definition of an RDF subgraph, independent of any specific shape language. Typically, this MAY be formalized in ShEx [[SHEX-SPEC]] or SHACL [[SHACL]]. The term is used interchangeably with the shortening shape.
Complete Index: An index that fully describes every resource within its associated subweb.
Open shape: A shape with an open world assumption, meaning that an RDF graph respecting the shape can be a subset of the shape definition given no contradiction with the shape definition with for instance negative properties or non-respecting term cardinalities.
Closed shape: A shape with a close world assumption, meaning that the shape describes fully the contain of an RDF graph.

Why a Shape Index?

This section is non-normative

A schematic representation of a shape index — A shape index annotates subwebs of documents corresponding to certain data shapes

When publishing linked data, providers face many concerns, ranging from data model and ontology design to the exposition of data through an endpoint and its distribution. When distributing data, publishers often structure it logically. For example, data about books might be located at ex:books, while personal data might be found at ex:profile, and so on.

The Solid specification [[SOLID-SPEC]], with its concept of storage units (often referred to as pods [[TAELMAN-2023]]), and its use of the LDP specification [[LDP-SPEC]], provides a structured approach to distributing linked data. The approach consists of delimiting the domain of a dataset’s publication through a tree-structured URL pattern. Such a structure allows query engines to gain some information about the limit of the subweb of publication of specific data publishers. However, it provides little information about the dataset's content. The content information could be particularly useful in traversal queries by helping query engines to infer whether the content of specific documents is relevant to the query. Additionally, such information could help make strategic decisions for query planning considering the provenance of data.

The type index specification [[TYPE-INDEX-SPEC]] provides a limited means for query engines to know the content of a dataset, or more specifically, a Solid pod [[TAELMAN-2023]]. In essence, the type index is a mapping between types and subwebs within a single Solid pod. The type index has two major limitations. First, it only requires the data provider to specify a type, without offering additional information about the properties associated with that type. This limits the ability of query engines to assess the relevance of sources and provides minimal structural or statistical information about the dataset's content. Additionally, the type index is closely tied to Solid, as it does not clearly define the subweb it operates within or whether it describes the entire dataset within that subweb. As a result, this solution is difficult to generalize beyond the Solid ecosystem and offer limited option for query planning and search space optimization.

The Shape Trees specification [[SHAPE-TREES-SPEC]] is another effort aimed at enforcing structure in decentralized datasets. It proposes an indexing system centered around RDF data shapes, addressing the issue of limited structural and statistical information found in the type index. However, the specification suffers from its complexity and the lack of a concrete definition of a subweb (the specification is somewhat tied to Solid). While this complexity may be justified given its objectives, for our purpose, query optimization using lightweight information, it becomes a burden for both data publishers and query engines.

We propose a Shape Index as a generalizable and extendable solution for decentralized RDF data publication and summarization. It can be seen as an evolution of the type index, using the concept of shapes from Shape Trees but in a simpler form. The primary focus of the Shape Index is query optimization, offering low-cost statistical and structural information to query engines while minimizing the burden on data providers. The specification may evolve to include optional, richer, and higher-cost statistical information about data sources for more advanced query optimization. Like Shape Trees, the Shape Index maps RDF data shapes to resources. However, it introduces an explicit definition of the subwebs in which the Shape Index operates. It also provides information on whether the index describes each resource within the subweb, allows for disjointed targets and subwebs. The shape index also supports negative entries to indicate what is excluded from a target or the entire subweb.

Shape Index

A shape index is a scoped mapping between sets of RDF resources and RDF data shapes. In this context, scope refers to the index being defined by a pair: the subweb (`si:subweb`) and the completeness of the index (`si:complete`). A subweb is a set of IRIs or URI template managed by the data provider, while the completeness of a shape index indicates whether every resource in the subweb is associated with a shape. To create a disjoint subweb, a data publisher MUST use the predicate si:subweb multiple times, which MUST be interpreted as a union of IRI sets (with an URI template already representing a set of IRIs). This scope characterization of the index is beneficial for two reasons. First, query optimizations related to the shape index are local, so it is helpful for query engines to know which subwebs are affected when performing traversal queries. Second, it allows data providers the flexibility to avoid describing every resource in the subweb, which can reduce the computational cost of maintaining the mapping or allow more freedom in their data publication approach. While it is OPTIONAL to describe every resource in a subweb within a shape index, it is RECOMMENDED, as this provides better opportunities for query optimization.

The core feature of the shape index is its mappings. A shape index can have multiple entries, each identified by the term `si:entry`. Every entry includes a target set, identified by the term `si:subweb`. A target MUST be a subset of the shape index's subweb, and the target subwebs MUST NOT overlap. Each target set is associated with a shape, identified by the term `si:shape`. The target is a set of RDF resources, and every triple within these resources MUST conform to the shape linked to the target through the `si:shape` annotation. The example below present an example of a shape index.

      <http://mySubweb.com/index1> a si:ShapeIndex ;
        si:subweb <http://mySubweb.com/> ;
        si:subweb "http://mySubweb.com/0/data/.*/*" ;
        si:entry [
          si:shape ex:profile#ProfileShape ;
          si:subweb <http://mySubweb.com/profile>
        ], [
          si:shape ex:movieReview#MovieReviewShape ;
          si:subweb "http://mySubweb.com/0/data/movie_review/.*"
        ] , [
          si:shape <http://mySubweb.com/readingList#ReadingListShape> ;
          si:subweb <http://mySubweb.com/0/data/personal_reading_list>
          si:subweb <http://mySubweb.com/0/data/research/paper_list>
          si:subweb "http://mySubweb.com/0/data/reading_list/*"
        ] ;
        si:complete true .

Furthermore, every set of triples within the subweb of the shape index that conforms to a shape (with a closed-world assumption) associated with a `si:shape` MUST be contained within a resource that is part of the target annotated with that shape. It is RECOMMENDED to use shapes with closed-world assumptions. If shapes with open-world assumptions are used, triples conforming to the shape SHOULD be inside a resource of the target.

When shapes nested within each other are used to bind targets (see the example below and notice that the shapes are open), a set of triples conforming to the most restrictive shape (e.g., "http://mySubweb.com/readingList#ReadingListShape" in the example) SHOULD be located within the resources of that target. In a similar scenario with closed shapes, the set of triples MUST be located in the target of the most restrictive shape. If a set of triples can be validated by both an open and a closed shape, the triples MUST be located in the target of the closed shape.

      PREFIX ex: <http://example.com/#>
      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
      PREFIX school: <http://school.example/#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>

      http://mySubweb.com/readingList#List {
        foaf:name xsd:string ;
        ex:item IRI *;
      }

      http://mySubweb.com/readingList#ReadingListShape {
        foaf:name xsd:string ;
        ex:item IRI *;
        ex:isbn IRI;
        ex:rank xsd:number;
      }

Negative Entries

The shape index allows for describing RDF data shapes that are not present within a subweb. This is indicated using the `si:excludes` property within a `si:Entry`. Shapes specified in this way MUST adhere to a closed-world assumption, ensuring that no sets of triples within the subweb target validate against these shapes. It is RECOMMENDED to use these exclusions alongside shape constraints in the index, particularly to specify when nested shapes are absent from the subweb.

      <http://mySubweb.com/index2> a si:ShapeIndex ;
        si:subweb "http://mySubweb.com/.*" ;
        si:entry [
          si:shape ex:profile#ProfileShape ;
          si:subweb <http://mySubweb.com/profile>
        ];

        si:entry [
          si:shape ex:plant#PlantShapes;
          si:excludes true ;
          si:subweb "http://mySubweb.com/.*";
        ];

        si:entry [
          si:shape ex:plant#CarShapes;
          si:excludes true ;
          si:subweb <http://mySubweb.com/toy>
        ];

Shape Index for Query Optimization

This section is non-normative

One of the main objectives of the shape index is query optimization. This section provides an overview of the current approaches used to optimize queries with the shape index. As this section is non-normative and practices can evolve, it is subject to change. Readers are encouraged to contact the maintainers to propose new approaches or suggest extensions to the shape index based on techniques that build upon this approach.

Search Space Optimization

The shape index shares some similarities with the Type Indexes [[TYPE-INDEX-SPEC]]. Data discovery techniques like the one presented by Taelman and Verborgh [[TAELMAN-2023]] can be employed with small adaptations. Taelman and Verborgh's technique involves pruning sources associated with entries that do not share predicates with the user's query. Although this approach is an non-evaluated extension of their original method, it is expected to deliver performance that is equal to or better than the results presented in their paper.

Another approach, however, built to use the shape index, has been proposed by Tam et al. [[TAM-2024]] . This approach involves translating shapes into SPARQL algebra and performing a query containment problem between the user’s query and the translated shapes to determine which sources to dereference, taking into account the subweb and the completeness of the index. This method has been shown to significantly reduce query execution time compared to the approach by Taelman and Verborgh. However, it does not consider negative entries and has not been as extensively evaluated as Taelman and Verborgh's method.

Vocabulary

This section present the vocabulary to define a Shape Index. The onthology is defined in RDF at the following document.

`si:ShapeIndex`

A Shape Index

`si:shapeIndexLocation`

Indicate that a shape index is located at the IRI at the object position of a triple using this term at the predicate position.

`si:subweb`

The section of the web that the shape index caracterizes. It can be a single IRI or URI template in the form of a string or a set of IRI and/or string of URI template. To represent a set multiple `si:subweb` predicate MUST be defined, thus the subweb does not have to be in a tree structure.

The term `` can be used as a subweb term. In that case, the IRI set is characterized by the IRI following the definition of containers from the Solid Specification [[SOLID-SPEC]] with the same usage as the mapping in the type index specification [[TYPE-INDEX-SPEC]].

`si:complete`

A flag indicating if a shape index is complete, thus, every resource inside the subweb is describe by the index.

`si:Entry`

A shape index entry mapping a shape of resources to an RDF shape.

`si:entry`

Lead to a `si:Entry` in the object position.

`si:shape`

The shape binding a `si:subweb`.

`si:excludes`

Indicate that the `si:Entry` exclude the shape from the target subweb.