Shape Index Specification

Abstract

The open web and linked data are often seen as complementary, yet this doesn't imply that data publication should be disorganized. While the web as a whole lacks a clear structure, web segments can still be structured through hypermedia descriptions and precise specifications. This document introduces the Shape Index specification, designed to facilitate data quality control and optimize query execution in decentralized unindexed networks of linked data. The shape index leverages RDF data shapes to provide explicit constraints on data publication and provide lightweight yet powerful structural and statistical information about the content of RDF datasets in decentralized networks.

Terminology

Subweb: A subweb is defined as a set of IRIs or URI templates [[RFC6570]], where URI templates are treated as sets of IRIs. While subwebs can be nested, each subweb should be understood as a single set of IRIs rather than a collection of sets of IRIs.
RDF Data Shape: A schema definition of an RDF subgraph, independent of any specific shape language. Typically, this MAY be formalized in ShEx [[SHEX-SPEC]] or SHACL [[SHACL]]. The term is used interchangeably with the shortening shape.
Complete Index: An index that only use close shapes to describe its domain.
Open shape: A shape with an open world assumption, meaning that an RDF graph respecting the shape can be a subset of the shape definition given no contradiction with the shape definition with for instance negative properties or non-respecting term cardinalities.
Closed shape: A shape with a close world assumption, meaning that the shape describes fully the contain of an RDF graph.

Why a Shape Index?

This section is non-normative

A schematic representation of a shape index — A shape index annotates subwebs of documents corresponding to certain data shapes

When publishing linked data, providers face many concerns, ranging from data model and ontology design to the exposition of data through an endpoint and its distribution. When distributing data, publishers often structure it logically. For example, data about books might be located at `ex:books`, while personal data might be found at `ex:profile`, and so on.

The Solid specification [[SOLID-SPEC]], with its concept of storage units (often referred to as pods [[SAMBRA-2016]]) and its use of the LDP specification [[LDP-SPEC]], provides a structured approach to distributing linked data. This approach defines the scope of a dataset's publication using a tree-structured URL pattern. Such a structure allows query engines to infer the boundaries of a data publisher's subweb but offers little insight into the dataset’s actual content. Content information could be valuable for traversal queries by helping query engines determine whether specific documents contain relevant data. Additionally, such information could support strategic query planning by considering data provenance.

The type index specification [[TYPE-INDEX-SPEC]] provides a limited means for query engines to understand the content of a dataset, specifically within a Solid pod [[TAELMAN-2023]]. Essentially, the type index serves as a mapping between types and subwebs within a single Solid pod. However, the type index has two major limitations. First, it only requires data providers to specify a type without offering additional details about the properties associated with that type. This limits query engines' ability to assess a source’s relevance and provides minimal structural or statistical information about the dataset's content. Second, the type index is tightly coupled to the Solid ecosystem, as it does not clearly define the subweb it operates within or whether it describes the entire dataset in that subweb. Consequently, this solution is difficult to generalize beyond Solid and offers limited options for query planning and search space optimization.

The Shape Trees specification [[SHAPE-TREES-SPEC]] is another effort to enforce structure in decentralized datasets. It introduces an indexing system based on RDF data shapes, addressing the lack of structural information in the type index. However, its primary focus is not query optimization but rather the management of interconnected resources. This focus makes it challenging to extract the necessary information for query processing. Additionally, the specification is somewhat tied to Solid, limiting its applicability beyond that ecosystem. Given its limited adoption and the added complexity it introduces for query optimization, we propose drawing inspiration from it to develop a new shape-based indexing structure.

We propose a Shape Index as a generalizable and extendable solution for decentralized RDF data publication and summarization. It can be seen as an evolution of the type index, incorporating the concept of shapes from Shape Trees but in a simplified form, specifically designed for query optimization. The primary goal of the Shape Index is to enhance query optimization by providing low-cost statistical and structural information to query engines while minimizing the burden on data providers. The specification may evolve to include optional, richer, and higher-cost statistical details for more advanced query optimization. Like Shape Trees, the Shape Index maps RDF data shapes to resources. However, it explicitly defines the subwebs in which it operates and allows for disjointed targets and subwebs. Additionally, the Shape Index supports negative entries, enabling the specification of exclusions within a target or an entire subweb.

Shape Index

A shape index is a mapping between sets of RDF resources and RDF data shapes. A subweb is a set of IRIs or URI templates managed by the data provider. To create a disjoint subweb, a data publisher MUST use the predicate si:subweb multiple times, which MUST be interpreted as a union of IRI sets (with an URI template already representing a set of IRIs). This scope characterization of the index is beneficial because query optimizations related to the shape index are local, so it is helpful for query engines to know which subwebs are affected when performing traversal queries. A data provider MUST describe every resource in a subweb within a shape index.

The core feature of the Shape Index is its mappings. A Shape Index can contain multiple entries, each identified by the term si:entry. Every entry includes a target set, denoted by the term `si:subweb`. A target MUST be a subset of the Shape Index's subweb, and target subwebs MUST NOT overlap. Subwebs represent subsections of the web controlled by a data provider, which may be restricted by constraints such as shape constraints or a Shape Index's set of constraints. Each target set is associated with a shape, identified by the term `si:shape`. The target is a set of RDF resources, and every triple within these resources MUST conform to the shape linked to the target via the `si:shape` annotation. The example below illustrates a Shape Index. A Shape Index is considered complete if every shape binding a set of resources is closed; otherwise, it is incomplete.

      <http://mySubweb.com/index1> a si:ShapeIndex ;
        si:subweb <http://mySubweb.com/> ;
        si:subweb "http://mySubweb.com/0/data/.*/.*" ;
        si:entry [
          si:shape ex:profile#ProfileShape ;
          si:subweb <http://mySubweb.com/0/data/profile>
        ], [
          si:shape ex:movieReview#MovieReviewShape ;
          si:subweb "http://mySubweb.com/0/data/movie_review/.*"
        ] , [
          si:shape <http://mySubweb.com/readingList#ReadingListShape> ;
          si:subweb <http://mySubweb.com/0/data/personal_reading_list>
          si:subweb <http://mySubweb.com/0/data/research/paper_list>
          si:subweb "http://mySubweb.com/0/data/reading_list/*"
        ] ;

Furthermore, every set of triples within the subweb of the Shape Index that conforms to a shape, under a closed-world assumption, associated with an `si:shape` MUST be contained within a resource that is part of the target annotated with that shape. It is RECOMMENDED to use shapes with closed-world assumptions. If shapes with open-world assumptions are used, triples conforming to the shape SHOULD be inside a resource of the target.

When shapes nested within each other are used to bind targets (see the example below; note that the shapes are open), a set of triples conforming to the most restrictive shape (e.g., "http://mySubweb.com/readingList#ReadingListShape" in the example) SHOULD be located within the resources of that target. In a similar scenario with closed shapes, the set of triples MUST be located in the target of the most restrictive shape. If a set of triples can be validated by both an open and a closed shape, the triples MUST be located in the target of the closed shape.

      PREFIX ex: <http://example.com/#>
      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
      PREFIX school: <http://school.example/#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>

      http://mySubweb.com/readingList#List {
        foaf:name xsd:string ;
        ex:item IRI *;
      }

      http://mySubweb.com/readingList#ReadingListShape {
        foaf:name xsd:string ;
        ex:item IRI *;
        ex:isbn IRI;
        ex:rank xsd:number;
      }

Negative Entries

The Shape Index allows for describing RDF data shapes that are not present within a subweb. This is indicated using the `si:excludes` property within a `si:Entry`. Shapes specified in this way MUST adhere to a closed-world assumption, ensuring that no sets of triples within the subweb target validate against these shapes. It is RECOMMENDED to use these exclusions alongside shape constraints in the index, especially to specify when nested shapes are absent from the subweb.

      <http://mySubweb.com/index2> a si:ShapeIndex ;
        si:subweb "http://mySubweb.com/.*" ;
        si:entry [
          si:shape ex:profile#ProfileShape ;
          si:subweb <http://mySubweb.com/profile>
        ];

        si:entry [
          si:shape ex:plant#PlantShapes;
          si:excludes true ;
          si:subweb "http://mySubweb.com/.*";
        ];

        si:entry [
          si:shape ex:plant#CarShapes;
          si:excludes true ;
          si:subweb <http://mySubweb.com/toy>
        ];

Discover Mechanism in Shape Index

How to Discover a Shape Index?

The discovery of the shape index is left to the data provider. However, one approache is RECOMMENDED.

The data provider can use the predicate `si:shapeIndexLocation` with the IRI of the shape index as its object. This triple can be placed at convenient locations within the data provider's subweb. It is RECOMMENDED to place the triple in every file inside the subweb to facilitate the discovery of the shape index.

How to Discover the Resources Associated With a Subweb?

The subweb of an Entries (`si:entry`) can be defined abstractly using a URI template. Therefore, to discover the concrete IRI, a data provider SHOULD (and it is highly RECOMMENDED) provide a way for a query engine to find the concrete IRI of the set. One approach is that when dereferencing fixed path segments located before a variable path segment in a URI template, the server SHOULD provide a resource that offers all possible concrete IRIs until another is encountered. Another approach is to provide a tree structure at the root of the URI template (similar to a file server) that lists all hosted resources.

Shape Index for Query Optimization

This section is non-normative

One of the main objectives of the shape index is query optimization. This section provides an overview of the current approaches used to optimize queries with the shape index. As this section is non-normative and practices can evolve, it is subject to change. Readers are encouraged to contact the maintainers to propose new approaches or suggest extensions to the shape index based on techniques that build upon this approach.

Search Space Optimization

The shape index shares some similarities with the Type Indexes [[TYPE-INDEX-SPEC]]. Data discovery techniques like the one presented by Taelman and Verborgh [[TAELMAN-2023]] can be employed with small adaptations. Taelman and Verborgh's technique involves pruning sources associated with entries that do not share predicates with the user's query. Although this approach is an non-evaluated extension of their original method, it is expected to deliver performance that is equal to or better than the results presented in their paper.

Another approach, however, built to use the shape index, has been proposed by Tam et al. [[TAM-2024]] . This approach involves translating shapes into SPARQL algebra and performing a query containment problem between the user’s query and the translated shapes to determine which sources to dereference, taking into account the subweb and the completeness of the index. This method has been shown to significantly reduce query execution time compared to the approach by Taelman and Verborgh. However, it does not consider negative entries and has not been as extensively evaluated as Taelman and Verborgh's method.

Vocabulary

This section present the vocabulary to define a Shape Index. The onthology is defined in RDF at the following document.

`si:ShapeIndex`

A Shape Index

`si:shapeIndexLocation`

Indicate that a shape index is located at the IRI at the object position of a triple using this term at the predicate position.

`si:subweb`

The section of the web that the shape index caracterizes. It can be a single IRI or a URI template in the form of a string or a set of IRI and/or string of a URI template. To represent a set multiple `si:subweb` predicate MUST be defined, thus the subweb does not have to be in a tree structure. Subweb can be nested however they should be interpreted as a set of IRI and not a collection of set of IRI.

The term `` can be used as a subweb term. In that case, the IRI set is characterized by the IRI following the definition of containers from the Solid Specification [[SOLID-SPEC]] with the same usage as the mapping in the type index specification [[TYPE-INDEX-SPEC]].

`si:Entry`

A shape index entry mapping a shape of resources to an RDF shape.

`si:entry`

Lead to a `si:Entry` in the object position.

`si:shape`

The shape binding a `si:subweb`.

`si:excludes`

Indicate that the `si:Entry` exclude the shape from the target subweb.