This page collects information about the Discovery, Catalog and Indexing capabilities in R2.

Searching

Term Searches

Description

Used for matching resource attributes against the specified string. The string is analyzed, so values which contain special (non-alphanumeric) characters are tokenized; term searches are therefore a poor choice for such values.

Known examples of bad strings

  • email addresses (the @ symbol tokenizes the string)
  • complex identifiers ($ and _ tokenize the string)
  • sentences and phrases (longer strings with spaces are tokenized, but wildcards can still work well)
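
The effect of analysis on such strings can be illustrated with a toy tokenizer. This is a minimal sketch only: `simple_analyze` is a hypothetical stand-in that splits on every non-alphanumeric character, and it is not the analyzer that the index actually uses.

```python
import re

def simple_analyze(text):
    """Lowercase the text and split it on every non-alphanumeric character.

    Hypothetical, simplified analyzer used only for illustration.
    """
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

# An email address is broken into several tokens, so no single token
# equals the original string any more.
email_tokens = simple_analyze("jane.doe@example.org")
# A made-up identifier containing $ and _ is split in the same way.
id_tokens = simple_analyze("CTD$SBE_37")

print(email_tokens)  # ['jane', 'doe', 'example', 'org']
print(id_tokens)     # ['ctd', 'sbe', '37']
```

Because the full string never appears as one token, an exact term match on the original value finds nothing, while a search for one of the fragments does.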

Usage (query language)

Usage (query object)

Range Searches

Description

Used for finding resources with an attribute lying within a range of values specified by the query.
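
The semantics can be sketched in memory as follows. The sketch assumes inclusive bounds and skips non-numeric values; both are assumptions, and the sample resources and field values are made up.

```python
def in_range(resources, field, low, high):
    """Return the resources whose numeric `field` lies in [low, high].

    Inclusive bounds are an assumption of this sketch.
    """
    hits = []
    for res in resources:
        value = res.get(field)
        # Only real numbers qualify; bool is excluded because it is an int subclass.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and low <= value <= high:
            hits.append(res)
    return hits

# Made-up sample data.
resources = [
    {"name": "a", "cash_balance": 250},
    {"name": "b", "cash_balance": 5000},
    {"name": "c", "cash_balance": "750"},  # a string, so it is skipped
]
range_hits = in_range(resources, "cash_balance", 0, 1000)
print([r["name"] for r in range_hits])  # ['a']
```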

Usage (query language)

Usage (query object)

Time Searches

Description

Used for finding resources with a timestamp attribute (ts_created or ts_updated) which falls between two time points.
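
A minimal in-memory sketch of the same filter is shown below. It assumes that timestamps are stored as strings holding milliseconds since the epoch; that storage format, the helper names, and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timezone

def to_millis(dt):
    """Convert an aware datetime to integer milliseconds since the epoch."""
    return int(dt.timestamp() * 1000)

def created_between(resources, start, end, field="ts_created"):
    """Return the resources whose timestamp `field` lies in [start, end]."""
    lo, hi = to_millis(start), to_millis(end)
    return [r for r in resources if lo <= int(r[field]) <= hi]

def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)

# Made-up sample data; timestamps are stored as strings (assumed format).
resources = [
    {"name": "old", "ts_created": str(to_millis(utc(2011, 12, 1)))},
    {"name": "new", "ts_created": str(to_millis(utc(2012, 4, 17)))},
]
time_hits = created_between(resources, utc(2012, 1, 1), utc(2012, 12, 31))
print([r["name"] for r in time_hits])  # ['new']
```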

Usage (query language)

Usage (query object)

Association Searches

Description

Used for finding resources which are reachable through associations from a specified resource.
The traversal only works down the association graph, not up.
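
A downward traversal of this kind can be sketched as a breadth-first walk with an optional tier limit. The graph layout (a dict from subject id to object ids), the function name, and the sample ids are hypothetical.

```python
from collections import deque

def find_associated(graph, start, limit=None):
    """Breadth-first walk down from `start`, at most `limit` tiers deep.

    `graph` maps a subject id to the ids it is associated to (assumed layout).
    """
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if limit is not None and depth >= limit:
            continue  # do not expand nodes that already sit at the tier limit
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append((child, depth + 1))
    return found

# Made-up association graph.
graph = {"site": ["platform"], "platform": ["instrument"], "instrument": ["sensor"]}
print(find_associated(graph, "site"))           # ['platform', 'instrument', 'sensor']
print(find_associated(graph, "site", limit=2))  # ['platform', 'instrument']
print(find_associated(graph, "sensor"))         # []
```

The last call returns nothing because the walk never follows associations upward.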

Usage (query language)

Usage (query object)

Ownership Searches

Description

Used for finding resources which own a particular resource.
The traversal only works up the association graph, not down.
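
One way to picture the upward direction is a reverse index over association triples. The sketch below assumes (subject, predicate, object) triples and follows owners transitively; whether the service walks more than one level up is not stated here, so that behaviour, the predicate names, and the ids are all assumptions.

```python
from collections import defaultdict

def build_owner_index(associations):
    """Map each object id to the set of subject ids that point at it."""
    owners = defaultdict(set)
    for subject, _predicate, obj in associations:
        owners[obj].add(subject)
    return owners

def find_owners(associations, resource_id):
    """Collect every resource found by walking up from `resource_id`."""
    index = build_owner_index(associations)
    result, frontier = set(), {resource_id}
    while frontier:
        parents = set()
        for node in frontier:
            parents |= index.get(node, set())
        frontier = parents - result  # only expand owners not seen before
        result |= frontier
    return result

# Made-up triples; the predicate names are hypothetical.
associations = [
    ("site", "hasPlatform", "platform"),
    ("platform", "hasInstrument", "instrument"),
]
print(sorted(find_owners(associations, "instrument")))  # ['platform', 'site']
print(sorted(find_owners(associations, "site")))        # []
```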

Usage (query language)

Usage (query object)

Illustration

Collection Searches

Description

Used for finding resources which are part of a specified collection.

Usage (query language)

Usage (query object)

Geo Distance Searches

Description

Used for finding resources which lie within a specified distance of a geometric point.
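
As an in-memory illustration, the filter can be sketched with the haversine great-circle distance. The spherical-Earth radius, the `lat`/`lon` field names, and the sample points are assumptions; the index may compute distance differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical-Earth approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_distance(resources, lat, lon, max_km):
    """Return the resources located at most `max_km` from (lat, lon)."""
    return [r for r in resources
            if haversine_km(lat, lon, r["lat"], r["lon"]) <= max_km]

# Made-up sample data with hypothetical field names.
resources = [
    {"name": "san_diego", "lat": 32.72, "lon": -117.16},
    {"name": "los_angeles", "lat": 34.05, "lon": -118.24},
    {"name": "boston", "lat": 42.36, "lon": -71.06},
]
near = within_distance(resources, 32.72, -117.16, 250)
print([r["name"] for r in near])  # ['san_diego', 'los_angeles']
```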

Usage (query language)

Usage (query object)

Geo Bounding Box Searches

Description

Used for finding resources which contain geometric points that lie inside a bounding box.
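
The containment test can be sketched as follows. The box is given by its top-left and bottom-right corners as (lat, lon) pairs, and the sketch assumes the box does not cross the antimeridian; the `points` field name and the sample data are hypothetical.

```python
def in_bounding_box(point, top_left, bottom_right):
    """True if the (lat, lon) point lies inside the box, edges included."""
    lat, lon = point
    top, left = top_left
    bottom, right = bottom_right
    return bottom <= lat <= top and left <= lon <= right

def find_in_box(resources, top_left, bottom_right):
    """Return the resources that have at least one point inside the box."""
    return [r for r in resources
            if any(in_bounding_box(p, top_left, bottom_right) for p in r["points"])]

# Made-up sample data.
resources = [
    {"name": "coastal", "points": [(33.0, -118.0), (50.0, -118.0)]},
    {"name": "offshore", "points": [(10.0, -150.0)]},
]
box_hits = find_in_box(resources, top_left=(35.0, -120.0), bottom_right=(30.0, -115.0))
print([r["name"] for r in box_hits])  # ['coastal']
```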

Usage (query language)

Usage (query object)

Summary

  • Smart searches
  • Search updates
  • Navigation
  • Is the Discovery service of utility to end users only, or is it the basis for any resource search?
  • Do we need more advanced search tools?
  • Additional triple store in OLAP?

R2C1 Services

Discovery Service

  • Provides the interface for managing Views and View resources.
  • Provides searching capabilities through request objects or through a query domain-specific language.
  • Each view has exactly one Catalog for it.
  • Each view maintains the order in which the results should be rendered.
  • Each view maintains a set of filters which will be applied after the search is processed.
Defining a View
  • Determine the fields which the view needs: name, model, serial number, date.
  • Views optimally select the correct indexes based on the fields through Catalog Management.
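
How a view combines its fields, ordering, and post-search filters can be sketched roughly as below. The `View` class, its attribute names, and the sample hits are hypothetical and do not reflect the actual View resource definition.

```python
class View:
    """Toy view: selected fields, a sort field, and post-search filters."""

    def __init__(self, fields, order_by, filters=()):
        self.fields = list(fields)
        self.order_by = order_by
        self.filters = list(filters)

    def render(self, hits):
        # Filters are applied after the search has produced its hits.
        kept = [h for h in hits if all(f(h) for f in self.filters)]
        kept.sort(key=lambda h: h[self.order_by])
        # Only the fields the view asked for are returned.
        return [{f: h.get(f) for f in self.fields} for h in kept]

# Made-up search hits.
hits = [
    {"name": "ctd-2", "model": "sbc37", "serial_number": "0002"},
    {"name": "adcp-1", "model": "wh300", "serial_number": "0100"},
    {"name": "ctd-1", "model": "sbc16", "serial_number": "0001"},
]
view = View(fields=["name", "model"], order_by="name",
            filters=[lambda h: h["model"].startswith("sbc")])
rendered = view.render(hits)
print(rendered)
```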

Catalog Service

  • Provides the interface for managing Catalogs and Catalog Resources.
  • Catalogs contain a set of key fields which define the domain of the catalog.
  • Each Catalog is aware of all the fields across all of its indexes.
  • Each Catalog is aware of the fields shared between its indexes.
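
The two kinds of field awareness correspond to a set union and a set intersection over the per-index field sets. A minimal sketch, with made-up index and field names:

```python
def all_fields(indexes):
    """Union of the fields of every index in the catalog."""
    return set().union(*indexes.values())

def shared_fields(indexes):
    """Fields that every index in the catalog has in common."""
    field_sets = [set(fields) for fields in indexes.values()]
    return set.intersection(*field_sets) if field_sets else set()

# Made-up catalog contents: index name -> fields it carries.
indexes = {
    "models_index": {"name", "model", "serial_number"},
    "resources_index": {"name", "lcstate", "ts_created"},
}
print(sorted(all_fields(indexes)))
# ['lcstate', 'model', 'name', 'serial_number', 'ts_created']
print(sorted(shared_fields(indexes)))  # ['name']
```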

Index Management Service

  • Manages Index resources and maintains the search options and metadata about the indices.
  • Provides some interface methods for interfacing with external technologies.
  • Indexes are statically defined and built in a bootstrap.
  • Each index has an optimal mapping for its context types (Resource type in most cases).
  • Each ElasticSearch index has a river and script statically defined for the purpose of the index.

R2Cx Design

High level statements

  • Metadata about HDF/Science data maintained in the resource repository
  • Find operations not in services unless exceptional logic required to construct / resolve
  • In C1 demonstrate a single thread thru the system:
    • show a view in the UI that uses a catalog that hits (multiple) indexes
    • use the Discovery:Find to demonstrate a simple query filter (the design of the queryObj)

Searches

  1. Find resources by fields and terms
    • "SEARCH 'model' IS 'sbc*' FROM 'models_index'"
    • Supports wildcard analyzed term searches.
    • Returns search metadata and resource
  2. Find resources by a range
    • "SEARCH 'cash_balance' VALUES FROM 0 TO 1000 FROM 'resources_index'"
    • Ranges apply only to numerical fields of resources, not to strings.
  3. Find associated resources
    • "BELONGS TO 'resourceid'"
    • Uses a breadth-first traversal of the resource graph.
    • "BELONGS TO 'resourceid' LIMIT 2"
    • Traverses at most 2-tiers down the association graph
  4. Compound searching, AND=Intersection, OR=Union
    • "search 'type_' is 'PlatformDevice' from 'lukes_main_view' AND belongs to 'siteid'"
    • "in 'collectionid' or belongs to 'transformid'"
    • The results do not contain metadata because they span multiple technologies (tier-2)
  5. Results can be limited using the LIMIT keyword, order can be determined using a field to order by
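
The AND=Intersection and OR=Union rule, followed by ordering and LIMIT, can be sketched on two result lists. The `_id` key, the function names, and the sample results are assumptions made for illustration.

```python
def combine(left, right, op):
    """Combine two result lists by resource id: AND intersects, OR unions."""
    by_id = {r["_id"]: r for r in left + right}
    left_ids = {r["_id"] for r in left}
    right_ids = {r["_id"] for r in right}
    if op == "AND":
        ids = left_ids & right_ids
    elif op == "OR":
        ids = left_ids | right_ids
    else:
        raise ValueError("op must be 'AND' or 'OR'")
    return [by_id[i] for i in ids]

def order_and_limit(results, order_by, limit=None):
    """Sort the results by one field, then keep at most `limit` of them."""
    ordered = sorted(results, key=lambda r: r[order_by])
    return ordered if limit is None else ordered[:limit]

# Made-up results of two sub-searches.
platforms = [{"_id": "r1", "name": "glider"}, {"_id": "r2", "name": "buoy"}]
site_members = [{"_id": "r2", "name": "buoy"}, {"_id": "r3", "name": "mooring"}]

both = order_and_limit(combine(platforms, site_members, "AND"), "name")
either = order_and_limit(combine(platforms, site_members, "OR"), "name", limit=2)
print([r["name"] for r in both])    # ['buoy']
print([r["name"] for r in either])  # ['buoy', 'glider']
```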
View by Name View by Model

Elastic Search/Lucene

  • Create indexes on subsets of resources in CouchDB. 'Tailored' indexing using _mapping
  • Need to verify rivers capabilities

Geospatial Index

  • Initially investigate GeoCouch

CouchDB Index/Views

The following queries are supported efficiently in the Resource Registry, i.e. have a pre-defined index:

  • Find Resource by its ID
  • Find Resources as object by association, predicate (optional) and type (optional) from subject Resource
  • Find Resources as subject by association, predicate (optional) and type (optional) from object Resource
  • Find Resources of Type and Lifecycle State, ordered by name
  • Find Resources of Lifecycle State and Type (optional), ordered by name
  • Find Resources of Type in Lifecycle State, ordered by name
  • Find Associations by predicate
  • Find Associations by subject and object type

Collections

  • Collections are a bin of resources.
  • Collections can be associated to any resource through hasResource
  • Collections' resources are defined at creation.
  • Collections can be found based on the collection or the contents of the collection.

Open Questions

  • How to manage join-style views?
    • put owner name in the result list of data products
      • query all owners then stitch owners into result list in memory
  • What find engines are available that can orchestrate these searches?
  • What is possible with grouping then searching within a group?
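
The in-memory stitching idea from the first question could look roughly like this. The field names (`owner_id`, `owner_name`) and the sample records are hypothetical; this is one possible approach, not a settled design.

```python
def stitch_owners(data_products, owners):
    """Attach each owner's name to the data products, joining in memory."""
    names = {o["_id"]: o["name"] for o in owners}
    # Products whose owner is unknown get owner_name=None.
    return [dict(p, owner_name=names.get(p["owner_id"])) for p in data_products]

# Made-up results of two separate queries.
owners = [{"_id": "u1", "name": "Alice"}, {"_id": "u2", "name": "Bob"}]
data_products = [
    {"_id": "dp1", "name": "ctd_parsed", "owner_id": "u1"},
    {"_id": "dp2", "name": "adcp_raw", "owner_id": "u9"},
]
stitched = stitch_owners(data_products, owners)
print([(p["name"], p["owner_name"]) for p in stitched])
# [('ctd_parsed', 'Alice'), ('adcp_raw', None)]
```

The original result dictionaries are left unmodified because `dict(p, ...)` builds a copy.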

Faceted Search (R3)

  • THIS LOOKS EASY WITH Elastic Search! Let's look at how this can be leveraged, then connect with the UI team.
  • add one dimension (owner) then another facet (LCS)
  • may be AND expressions to the filter on the query resource
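
The facet idea (count by one dimension, narrow down, then count by the next) can be sketched in memory as follows. The field names `owner` and `lcstate` and the sample resources are assumptions, and the sketch does not use any ElasticSearch API.

```python
from collections import Counter

def facet_counts(resources, field):
    """Count how many resources fall under each value of `field`."""
    return Counter(r[field] for r in resources)

def drill_down(resources, **selected):
    """Keep the resources matching every selected facet value (an AND filter)."""
    return [r for r in resources
            if all(r.get(k) == v for k, v in selected.items())]

# Made-up sample data.
resources = [
    {"name": "ctd-1", "owner": "alice", "lcstate": "DEPLOYED"},
    {"name": "ctd-2", "owner": "alice", "lcstate": "PLANNED"},
    {"name": "adcp-1", "owner": "bob", "lcstate": "DEPLOYED"},
]
by_owner = facet_counts(resources, "owner")
alice_only = drill_down(resources, owner="alice")
by_lcs = facet_counts(alice_only, "lcstate")
print(dict(by_owner))  # {'alice': 2, 'bob': 1}
print(dict(by_lcs))    # {'DEPLOYED': 1, 'PLANNED': 1}
```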

Preservation Mgmt Svc: persistenceSystem and store

  • PersistenceSystem (open to better names here)
    • Describes a repository: one couchdb cluster
  • Store is a namespace in that archive: all the dbs in couchdb
  • Service exposes operator functions such as restart
  • Service loads its resources on startup, then the stores inside are loaded
  • Associations
    • Store hasCatalog Catalog, PersistSys hasStore Store, Store hasIndex Index
    • Catalog (uses)hasIndex Index, View hasCatalog Catalog and/or View hasStore Store

Performance Evaluations

Some preliminary performance measurements:
The following are the statistics for a CouchDB query versus an ElasticSearch index search for all items. Both were run through their Python wrappers, so the numbers include the latency of both wrapper APIs.

Searching Tiers

Tier-1 Searching

The search query and its results are handled by the same entity. Example: a search is made against CouchDB and the results of the search are returned directly to the initiating client.

Tier-2 Searching

The search query and its results are processed separately from each other. Example: a search is made against multiple technologies such as ElasticSearch and CouchDB; the results have to be parsed and intersected before being returned to the initiating client.

See also

  1. Apr 17, 2012

    David Stuebe says:

    Resource model discussion and service op intent. Meeting from Apr 17, 2012
