This page collects information about Discovery, Catalog and Indexing capabilities in R2.
Used for matching resource attributes against a specified string. The string is analyzed, so attribute values that contain special (non-alphanumeric) characters are tokenized; this search is therefore a poor fit for such values.
Known examples of bad strings:
- email addresses (@ symbol tokenizes the string)
- complex identifiers ($ and _ tokenize the string)
- sentences and phrases (longer strings with spaces are tokenized but wildcards can still work well)
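To see why analyzed matching struggles with these strings, the sketch below approximates an analyzer with a plain regex tokenizer. This is an assumption for illustration only: the rules of the real ElasticSearch analyzer depend on its configuration, and `tokenize` and `analyzed_match` are hypothetical names, not part of any API described on this page.

```python
import re

def tokenize(text):
    """Split on every non-alphanumeric character and lowercase the pieces.

    Simplified stand-in for an analyzer; not the ElasticSearch implementation.
    """
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def analyzed_match(query, value):
    """True only if the query is exactly one of the value's tokens."""
    return query.lower() in tokenize(value)

email = "jane.doe@example.org"          # hypothetical attribute value
print(tokenize(email))                  # ['jane', 'doe', 'example', 'org']
print(analyzed_match(email, email))     # False: the full address is not a token
print(analyzed_match("example", email)) # True: a single token still matches
print(tokenize("$dev_id_42"))           # ['dev', 'id', '42']
```

The full email address never matches itself because, once tokenized, it no longer exists as a single term.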
Used for finding resources with an attribute lying within a range of values specified by the query.
Used for finding resources with a timestamp attribute (ts_created or ts_updated) that falls between two points in time.
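Both the value range and the time range searches reduce to an interval check on one field. The following is a minimal in-memory sketch, assuming inclusive bounds and integer timestamps; `in_range` and the sample resources are hypothetical and do not reflect how the index evaluates ranges internally.

```python
def in_range(resources, field, low, high):
    """Return resources whose numeric `field` lies in [low, high] (inclusive)."""
    hits = []
    for res in resources:
        value = res.get(field)
        # Skip missing or non-numeric values (bool is excluded explicitly
        # because it is a subclass of int in Python).
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and low <= value <= high:
            hits.append(res)
    return hits

# Hypothetical resources; ts_created is assumed to be an integer timestamp.
resources = [
    {"name": "a", "cash_balance": 250, "ts_created": 1000},
    {"name": "b", "cash_balance": 5000, "ts_created": 2000},
    {"name": "c", "cash_balance": "n/a", "ts_created": 3000},
]

by_value = in_range(resources, "cash_balance", 0, 1000)
by_time = in_range(resources, "ts_created", 1500, 3000)
print([r["name"] for r in by_value])  # ['a']
print([r["name"] for r in by_time])   # ['b', 'c']
```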
Used for finding resources which are associated from a specified resource.
Only works down, not up.
Used for finding resources which own a particular resource.
Only works up, not down.
Used for finding resources which are part of a specified collection.
Used for finding resources which are a specified distance away from a geometric point.
Used for finding resources which contain geometric points that lie inside a bounding box.
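The two geometric searches can be sketched in plain Python. This sketch assumes points are (latitude, longitude) pairs in degrees, reads the distance search as "within a given radius", and uses the haversine formula on a spherical Earth; the function names are hypothetical, and the index itself may compute distances differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical Earth is an assumption

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_distance(points, center, radius_km):
    """Points no farther than radius_km from center."""
    return [p for p in points
            if haversine_km(p[0], p[1], center[0], center[1]) <= radius_km]

def in_bounding_box(points, top_left, bottom_right):
    """Points inside the box; boxes crossing the antimeridian are not handled."""
    top, left = top_left
    bottom, right = bottom_right
    return [p for p in points if bottom <= p[0] <= top and left <= p[1] <= right]

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]  # hypothetical coordinates
print(within_distance(points, (0.0, 0.0), 120))           # [(0.0, 0.0), (0.0, 1.0)]
print(in_bounding_box(points, (5.0, -5.0), (-5.0, 5.0)))  # [(0.0, 0.0), (0.0, 1.0)]
```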
- Smart searches
- Search updates
- Is the Discovery service of utility to end users only, or is it the basis for any resource search?
- Do we need more advanced search tools?
- Additional triple store in OLAP?
- Provides the interface for managing Views and View resources.
- Provides searching capabilities through request objects or through a query domain-specific language.
- Each view has exactly one Catalog for it.
- Each view maintains an order in which the results should be rendered.
- Each view maintains a set of filters which will be applied after the search is processed.
- Determine the fields which the view needs: name, model, serial number, date.
- Views optimally select the correct indexes based on the fields through Catalog Management.
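The view behaviour described above (filters applied after the search, then a rendering order) can be sketched as a small post-processing step. `render_view`, the sample records, and their field values are hypothetical; this is an illustration under those assumptions, not the View service interface.

```python
def render_view(search_results, filters, order_by):
    """Apply the view's filters to already-fetched results, then sort them.

    filters is a list of predicates; a result is kept only if all return True.
    """
    kept = [r for r in search_results if all(f(r) for f in filters)]
    return sorted(kept, key=lambda r: r[order_by])

# Hypothetical search results using the fields a view might need.
results = [
    {"name": "ctd-2", "model": "sbc-1", "serial_number": "002", "date": "2012-03-01"},
    {"name": "ctd-1", "model": "sbc-1", "serial_number": "001", "date": "2012-01-15"},
    {"name": "adcp-1", "model": "wh-3", "serial_number": "104", "date": "2011-11-30"},
]

view = render_view(results, [lambda r: r["model"] == "sbc-1"], order_by="name")
print([r["name"] for r in view])  # ['ctd-1', 'ctd-2']
```

Keeping the filters as plain predicates means the same view definition can be applied to results from any index the catalog selects.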
- Provides the interface for managing Catalogs and Catalog Resources.
- Catalogs contain a set of key fields which define the domain of the catalog.
- Each Catalog is aware of all the fields across all of its indexes.
- Each Catalog is aware of the fields shared between its indexes.
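A catalog's knowledge of its fields amounts to a union and an intersection over the field sets of its indexes. The sketch below assumes each index is described by a set of field names; `catalog_fields` and the index contents are hypothetical.

```python
def catalog_fields(index_fields):
    """Return (all_fields, shared_fields) for a catalog.

    index_fields maps an index name to the set of field names it contains.
    """
    field_sets = list(index_fields.values())
    all_fields = set().union(*field_sets)
    # set.intersection needs at least one set, so guard the empty catalog.
    shared = set.intersection(*field_sets) if field_sets else set()
    return all_fields, shared

# Hypothetical field sets for two indexes.
indexes = {
    "models_index": {"name", "model", "serial_number"},
    "resources_index": {"name", "type_", "ts_created"},
}

all_fields, shared = catalog_fields(indexes)
print(sorted(all_fields))  # ['model', 'name', 'serial_number', 'ts_created', 'type_']
print(sorted(shared))      # ['name']
```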
- Manages Index resources and maintains the search options and metadata about the indices.
- Provides some interface methods for interfacing with external technologies.
- Indexes are statically defined and built in a bootstrap.
- Each index has an optimal mapping for its context types (Resource type in most cases).
- Each ElasticSearch index has a river and script statically defined for the purpose of the index.
- Metadata about HDF/Science data maintained in the resource repository
- Find operations not in services unless exceptional logic required to construct / resolve
- In C1 demonstrate a single thread through the system:
- show a view in the UI that uses a catalog that hits (multiple) indexes
- use the Discovery:Find to demonstrate a simple query filter (the design of the queryObj)
- Find resources by fields and terms
- "SEARCH 'model' IS 'sbc*' FROM 'models_index'"
- Supports wildcard analyzed term searches.
- Returns search metadata and resource
- Find resources by a range
- "SEARCH 'cash_balance' VALUES FROM 0 TO 1000 FROM 'resources_index'"
- Ranges apply to fields of resources which are numerical only, not strings.
- Find associated resources
- "BELONGS TO 'resourceid'"
- Uses a breadth-first traversal of the resource graph.
- "BELONGS TO 'resourceid' LIMIT 2"
- Traverses at most 2 tiers down the association graph
- Compound searching, AND=Intersection, OR=Union
- "search 'type_' is 'PlatformDevice' from 'lukes_main_view' AND belongs to 'siteid'"
- "in 'collectionid' or belongs to 'transformid'"
- The results do not contain metadata because they span multiple technologies (tier-2)
- Results can be limited using the LIMIT keyword, order can be determined using a field to order by
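The association and compound searches above can be sketched as a depth-limited breadth-first traversal combined with set operations. This sketch assumes the association graph is a dict from resource id to child ids and that LIMIT caps the number of tiers traversed; `belongs_to`, the graph, and the ids are hypothetical and are not the Discovery service implementation.

```python
from collections import deque

def belongs_to(graph, root, limit=None):
    """Breadth-first walk down the association graph from `root`.

    graph maps a resource id to the list of ids associated below it.
    limit caps how many tiers are traversed (None means no cap).
    """
    found = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if limit is not None and depth >= limit:
            continue  # do not expand nodes already at the tier limit
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append((child, depth + 1))
    return found

# Hypothetical association graph.
graph = {"site": ["platform"], "platform": ["device"], "device": ["sensor"]}
print(belongs_to(graph, "site", limit=2))  # ['platform', 'device']
print(belongs_to(graph, "site"))           # ['platform', 'device', 'sensor']

# Compound search: AND is an intersection, OR is a union of result ids.
term_hits = {"device", "sensor", "other"}  # pretend result of a term search
below_site = belongs_to(graph, "site")
and_result = [r for r in below_site if r in term_hits]
or_result = set(below_site) | term_hits
print(and_result)         # ['device', 'sensor']
print(sorted(or_result))  # ['device', 'other', 'platform', 'sensor']
```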
|View by Name||View by Model|
- Create indexes on subsets of resources in CouchDB. 'Tailored' indexing using _mapping
- Need to verify the capabilities of rivers
- Initially investigate GeoCouch
The following queries are supported efficiently in the Resource Registry, i.e. have a pre-defined index:
- Find Resource by its ID
- Find Resources as object by association, predicate (optional) and type (optional) from subject Resource
- Find Resources as subject by association, predicate (optional) and type (optional) from object Resource
- Find Resources of Type and Lifecycle State, ordered by name
- Find Resources of Lifecycle State and Type (optional), ordered by name
- Find Resources of Type in Lifecycle State, ordered by name
- Find Associations by predicate
- Find Associations by subject and object type
- Collections are a bin of resources.
- Collections can be associated to any resource through hasResource
- Collections' resources are defined at creation.
- Collections can be found based on the collection or the contents of the collection.
- How to manage join-style views?
- put owner name in the result list of data products
- query all owners then stitch owners into result list in memory
- What find engines are available that can orchestrate these searches?
- What is possible with grouping then searching within a group?
- This looks easy with ElasticSearch! Let's look at how this can be leveraged, then connect with the UI team.
- add one dimension (owner) then another facet (LCS)
- may be AND expressions to the filter on the query resource
- PersistenceSystem (open to better names here)
- Describes a repository: one CouchDB cluster
- Store is a namespace in that archive: all the dbs in couchdb
- Service exposes operator functions such as restart
- Service loads its resources on startup; the stores inside are then loaded
- Store hasCatalog Catalog, PersistSys hasStore Store, Store hasIndex Index
- Catalog (uses)hasIndex Index, View hasCatalog Catalog and/or View hasStore Store
Some preliminary performance measurements:
The following are the statistics for a CouchDB query versus an ElasticSearch index search for all items. Both were measured through their Python wrappers, so the numbers include the latency of both wrapper APIs.
The search query and its results are handled by the same entity. Example: a search is made against CouchDB and the results are returned directly to the initiating client.
The search query and its results are processed separately from each other. Example: a search is made against multiple technologies such as ElasticSearch and CouchDB, and the results have to be parsed and intersected before being returned to the initiating client.
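The intersection step for a search spanning multiple technologies can be sketched as follows. It assumes each backend has already been queried and its response parsed into a list of resource ids; `intersect_results` and the ids are hypothetical, and preserving the first list's order is a design choice of this sketch rather than documented behaviour.

```python
def intersect_results(*result_lists):
    """Keep ids present in every result list, in the order of the first list."""
    if not result_lists:
        return []
    common = set(result_lists[0]).intersection(*result_lists[1:])
    ordered, seen = [], set()
    for rid in result_lists[0]:
        if rid in common and rid not in seen:
            seen.add(rid)          # drop duplicates within the first list
            ordered.append(rid)
    return ordered

# Hypothetical, already-parsed id lists from two backends.
es_hits = ["r3", "r1", "r7"]
couch_hits = ["r1", "r2", "r3"]
print(intersect_results(es_hits, couch_hits))  # ['r3', 'r1']
```

Because only ids survive the intersection, backend-specific search metadata such as scores is not carried through in this sketch.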