Algorithm Interfaces

Here we list and briefly describe the high level algorithm interfaces which this SMQTK-Indexing package provides. There is at least one implementation available for each interface. Some implementations will require additional dependencies.

NearestNeighborsIndex

This interface defines a method to build an index from a set of smqtk_descriptors.DescriptorElement instances (NearestNeighborsIndex.build_index()) and a nearest-neighbors query function for getting a number of near neighbors to a query DescriptorElement (NearestNeighborsIndex.nn()).

Building an index requires that some non-zero number of DescriptorElement instances be passed into the NearestNeighborsIndex.build_index() method. Subsequent calls to this method should rebuild the index model, not add to it. If an implementation supports persistent storage of the index, it should overwrite the configured index.

The NearestNeighborsIndex.nn() method uses a single DescriptorElement to query the current index for a specified number of nearest neighbors. Thus, the NearestNeighborsIndex instance must have a non-empty index loaded for this method to function. If the provided query DescriptorElement does not have a set vector, this method will also fail with an exception.

This interface additionally requires that implementations define a NearestNeighborsIndex.count() method, which returns the number of distinct DescriptorElement instances are in the index.

class smqtk_indexing.interfaces.nearest_neighbor_index.NearestNeighborsIndex(*args: Any, **kwargs: Any)[source]

Common interface for descriptor-based nearest-neighbor computation over a built index of descriptors.

Implementations, if they allow persistent storage of their index, should take the necessary parameters at construction time. Persistent storage content should be (over)written build_index is called.

Implementations should be thread safe and appropriately protect internal model components from concurrent access and modification.

build_index(descriptors: Iterable[smqtk_descriptors.interfaces.descriptor_element.DescriptorElement]) None[source]

Build the index with the given descriptor data elements.

Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception to as to protect the current index.

Raises

ValueError – No data available in the given iterable.

Parameters

descriptors (collections.abc.Iterable[smqtk.representation.DescriptorElement]) – Iterable of descriptor elements to build index over.

abstract count() int[source]
Returns

Number of elements in this index.

nn(d: smqtk_descriptors.interfaces.descriptor_element.DescriptorElement, n: int = 1) Tuple[Tuple[smqtk_descriptors.interfaces.descriptor_element.DescriptorElement, ...], Tuple[float, ...]][source]

Return the nearest N neighbors to the given descriptor element.

Raises
  • ValueError – Input query descriptor d has no vector set.

  • ValueError – Current index is empty.

Parameters
  • d – Descriptor element to compute the neighbors of.

  • n – Number of nearest neighbors to find.

Returns

Tuple of nearest N DescriptorElement instances, and a tuple of the distance values to those neighbors.

remove_from_index(uids: Iterable[collections.abc.Hashable]) None[source]

Partially remove descriptors from this index associated with the given UIDs.

Parameters

uids – Iterable of UIDs of descriptors to remove from this index.

Raises
  • ValueError – No data available in the given iterable.

  • KeyError – One or more UIDs provided do not match any stored descriptors. The index should not be modified.

update_index(descriptors: Iterable[smqtk_descriptors.interfaces.descriptor_element.DescriptorElement]) None[source]

Additively update the current index with the one or more descriptor elements given.

If no index exists yet, a new one should be created using the given descriptors.

Raises

ValueError – No data available in the given iterable.

Parameters

descriptors (collections.abc.Iterable[smqtk.representation .DescriptorElement]) – Iterable of descriptor elements to add to this index.

LshFunctor

Implementations of this interface define the generation of a locality-sensitive hash code for a given DescriptorElement. These are used in smqtk_indexing.impls.nn_index.lsh.LSHNearestNeighborIndex instances.

class smqtk_indexing.interfaces.lsh_functor.LshFunctor(*args: Any, **kwargs: Any)[source]

Locality-sensitive hashing functor interface.

The aim of such a function is to be able to generate hash codes (bit-vectors) such that similar items map to the same or similar hashes with a high probability. In other words, it aims to maximize hash collision for similar items.

Building Models

Some hash functions want to build a model based on some training set of descriptors. Due to the non-standard nature of algorithm training and model building, please refer to the specific implementation for further information on whether model training is needed and how it is accomplished.

abstract get_hash(descriptor: numpy.ndarray) numpy.ndarray[source]

Get the locality-sensitive hash code for the input descriptor.

Parameters

descriptor – Descriptor vector we should generate the hash of.

Returns

Generated bit-vector as a numpy array of booleans.

HashIndex

This interface describes specialized NearestNeighborsIndex implementations designed to index hash codes (bit vectors) via the hamming distance metric function. Implementations of this interface are primarily used with the smqtk_indexing.impls.nn_index.lsh.LSHNearestNeighborIndex implementation.

Unlike the NearestNeighborsIndex interface from which this interface is very similar to, HashIndex instances are built with an iterable of numpy.ndarray and HashIndex.nn() returns a numpy.ndarray.

class smqtk_indexing.interfaces.hash_index.HashIndex(*args: Any, **kwargs: Any)[source]

Specialized NearestNeighborsIndex for indexing unique hash codes bit-vectors) in memory (numpy arrays) using the hamming distance metric.

Implementations of this interface cannot be used in place of something requiring a NearestNeighborsIndex implementation due to the speciality of this interface.

Only unique bit vectors should be indexed. The nn method should not return the same bit vector more than once for any query.

build_index(hashes: Iterable[numpy.ndarray]) None[source]

Build the index with the given hash codes (bit-vectors).

Subsequent calls to this method should rebuild the current index. This method shall not add to the existing index nor raise an exception to as to protect the current index.

Raises

ValueError – No data available in the given iterable.

Parameters

hashes – Iterable of hash vectors (boolean-valued) to build index over.

abstract count() int[source]
Returns

Number of elements in this index.

nn(h: numpy.ndarray, n: int = 1) Tuple[numpy.ndarray, Sequence[float]][source]

Return the nearest N neighbor hash codes as bit-vectors to the given hash code bit-vector.

Distances are in the range [0,1] and are the percent different each neighbor hash is from the query, based on the number of bits contained in the query (normalized hamming distance).

Raises

ValueError – Current index is empty.

Parameters
  • h – Hash code vectors (boolean-valued) to compute the neighbors of. Should be the same bit length as indexed hash codes.

  • n – Number of nearest neighbors to find.

Returns

Tuple of nearest N hash codes and a tuple of the distance values to those neighbors.

remove_from_index(hashes: Iterable[numpy.ndarray]) None[source]

Partially remove hashes from this index.

Parameters

hashes – Iterable of numpy boolean hash vectors to remove from this index.

Raises
  • ValueError – No data available in the given iterable.

  • KeyError – One or more UIDs provided do not match any stored descriptors.

update_index(hashes: Iterable[numpy.ndarray]) None[source]

Additively update the current index with the one or more hash vectors given.

If no index exists yet, a new one should be created using the given hash vectors.

Raises

ValueError – No data available in the given iterable.

Parameters

hashes – Iterable of numpy boolean hash vectors to add to this index.