Advanced usage

Notebook configuration

[1]:
import sys

import numpy as np

import commonnn
from commonnn import cluster, recipes
from commonnn import _bundle, _types, _fit

Print Python and package version information:

[2]:
# Version information
print("Python: ", *sys.version.split("\n"))

print("Packages:")
for package in [np, commonnn]:
    print(f"    {package.__name__}: {package.__version__}")
Python:  3.10.7 (main, Sep 27 2022, 11:41:38) [GCC 10.2.1 20210110]
Packages:
    numpy: 1.23.3
    commonnn: 0.0.1

Clustering initialisation

Short initialisation for point coordinates

In the Basic usage tutorial, we saw how to create a Clustering object from a list of point coordinates.

[3]:
# Three dummy points in three dimensions
points = [
    [0, 0, 0],
    [1, 1, 1],
    [2, 2, 2]
]
clustering = cluster.Clustering(points)

The created Clustering object is now ready to execute a clustering on the provided input data. In fact, this default initialisation works in the same way with any Python sequence of sequences.

[4]:
# Ten random points in four dimensions
points = np.random.random((10, 4))
clustering = cluster.Clustering(points)

Please note that this yields meaningful results only if the input data does indeed contain point coordinates. When a Clustering is initialised like this, quite a few steps are carried out in the background to ensure the correct assembly of the object. Specifically, the following things are taken care of:

  • The raw input data (here points) is wrapped into a generic input data object (a concrete implementation of the abstract class _types.InputData)

    • Prior to the wrapping, the raw data may be passed through a preparation function that returns it in a format matching the input data type

  • A generic fitter object (a concrete implementation of the abstract class _fit.Fitter) is selected and associated with the clustering

    • The fitter is equipped with other necessary building blocks

As a consequence, the created Clustering object carries a set of other objects that control how a clustering of the input data is executed. Which objects these are is determined by a recipe (defined in the recipes module). The default registered recipe is named "coordinates".

[5]:
print(clustering)
Clustering(input_data=InputDataExtComponentsMemoryview(components of 10 points in 4 dimensions), fitter=FitterExtCommonNNBFS(ngetter=NeighboursGetterExtBruteForce(dgetter=DistanceGetterExtMetric(metric=MetricExtEuclideanReduced), sorted=False, selfcounting=True), na=NeighboursExtVectorUnorderedSet, nb=NeighboursExtVectorUnorderedSet, checker=SimilarityCheckerExtSwitchContains, queue=QueueExtFIFOQueue), hierarchical_fitter=None, predictor=None)

To better understand the setup steps and the different kinds of objects involved, let's have a closer look at the default recipe for the Clustering class in the next section.

Manual custom initialisation

There are multiple ways to initialise a Clustering:

[6]:
help(cluster.Clustering.__init__)
Help on cython_function_or_method in module commonnn.cluster:

__init__(self, data=None, *, fitter=None, hierarchical_fitter=None, predictor=None, bundle_kwargs=None, recipe=None, **recipe_kwargs)
    Clustering.__init__(self, data=None, *, fitter=None, hierarchical_fitter=None, predictor=None, bundle_kwargs=None, recipe=None, **recipe_kwargs)

    Keyword args:
        data:
            The data points to be clustered. Can be one of
                * `None`:
                    Plain initialisation without input data.
                * A :class:`~commonnn._bundle.Bundle`:
                    Initialisation with a ready-made input data bundle.
                * Any object implementing the input data interface
                (see :class:`~commonnn._types.InputData` or
                :class:`~commonnn._types.InputDataExtInterface`):
                    in this case, additional keyword arguments can be passed
                    via `bundle_kwargs` which are used to initialise a
                    :class:`~commonnn._bundle.Bundle` from the input data,
                    e.g. `labels`, `children`, etc.
                * Raw input data: Takes the input data type and a preparation
                hook from the `recipe` and wraps the raw data.
        fitter:
            Executes the clustering procedure. Can be
                * Any object implementing the fitter interface (see :class:`~commonnn._fit.Fitter` or
                :class:`~commonnn._fit.FitterExtInterface`).
                * None:
                    In this case, the fitter is tried to be build from the `recipe` or left
                    as `None`.
        hierarchical_fitter:
            Like `fitter` but for hierarchical clustering (see
            :class:`~commonnn._fit.HierarchicalFitter` or
            :class:`~commonnn._fit.HierarchicalFitterExtInterface`).
        predictor:
            Translates a clustering result from one bundle to another. Treated like
            `fitter` (see
            :class:`~commonnn._fit.Predictor` or
            :class:`~commonnn._fit.PredictorExtInterface`).
        bundle_kwargs: Used to create a :class:`~commonnn._bundle.Bundle`
            if `data` is neither a bundle nor `None`.
        recipe:
            Used to assemble a fitter etc. and to wrap raw input data. Can be
                * A string corresponding to a registered default recipe (see
                    :obj:`~commonnn.recipes.REGISTERED_RECIPES`
                )
                * A recipe, i.e. a mapping of component keywords to component types
        **recipe_kwargs: Passed on to override entries in the base `recipe`. Use double
            underscores in key names instead of dots, e.g. fitter__na instead of fitter.na.

If, as in the example above, raw data is passed on initialisation without any other options, certain assumptions are made and the clustering object is created using a default recipe. For illustration, let's create a clustering object without data and explicitly silence the default recipe. This gives us a plain clustering object without any other building blocks attached.

[7]:
plain_clustering = cluster.Clustering(recipe="none")
print(plain_clustering)
Clustering(input_data=None, fitter=None, hierarchical_fitter=None, predictor=None)

Naturally, this object is not set up for the actual clustering.

[8]:
plain_clustering.fit(radius_cutoff=0.1, similarity_cutoff=2)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [8], line 1
----> 1 plain_clustering.fit(radius_cutoff=0.1, similarity_cutoff=2)

File src/commonnn/cluster.pyx:283, in commonnn.cluster.Clustering.fit()

AttributeError: 'NoneType' object has no attribute 'fit'

Starting from scratch, we need to provide some input data and associate it with the clustering.

[9]:
points = np.array([
    [0, 0, 0],
    [1, 1, 1],
    [2, 2, 2],
], dtype=float)

To do so, we first need to associate the data with a Bundle. The bundle in turn is added to the Clustering. Bundles will become important in the context of hierarchical clustering. Trying to create a bundle with our raw input data, however, will result in an error.

[10]:
_bundle.Bundle(input_data=points)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [10], line 1
----> 1 _bundle.Bundle(input_data=points)

File src/commonnn/_bundle.pyx:30, in commonnn._bundle.Bundle.__cinit__()

File src/commonnn/_bundle.pyx:59, in commonnn._bundle.Bundle.input_data.__set__()

TypeError: Can't use object of type ndarray as input data. Expected type InputData.

Input data needs to be provided in terms of a generic type to allow a clustering procedure to be executed with it. Generic types can be accessed and worked with in a universal fashion, independent of how data is actually physically stored. A good type to be created from raw data points presented as a NumPy array is _types.InputDataExtComponentsMemoryview:

[11]:
input_data = _types.InputDataExtComponentsMemoryview(points)
bundle = _bundle.Bundle(input_data=input_data)
plain_clustering._bundle = bundle

Note that this type requires a C-contiguous 2-dimensional array of 64-bit floats. Python nested sequences can be converted into this format using recipes.prepare_components_array_from_parts.

[12]:
print(plain_clustering)
Clustering(input_data=InputDataExtComponentsMemoryview(components of 3 points in 3 dimensions), fitter=None, hierarchical_fitter=None, predictor=None)
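
The idea behind a generic input data type can be illustrated with a small pure-Python sketch. It does not use commonnn: the names PointSource, ListPointSource and euclidean_distance are hypothetical, and the sketch only assumes that an input data object exposes the number of points, the number of dimensions, and access to single components. The actual interface of _types.InputData may differ.

```python
import math
from abc import ABC, abstractmethod


class PointSource(ABC):
    """Hypothetical stand-in for an abstract input data type."""

    @property
    @abstractmethod
    def n_points(self) -> int:
        """Number of points held by the object."""

    @property
    @abstractmethod
    def n_dim(self) -> int:
        """Number of dimensions per point."""

    @abstractmethod
    def get_component(self, point: int, dimension: int) -> float:
        """Return one coordinate of one point."""


class ListPointSource(PointSource):
    """Concrete implementation that stores coordinates in nested lists."""

    def __init__(self, rows):
        self._rows = [[float(x) for x in row] for row in rows]

    @property
    def n_points(self):
        return len(self._rows)

    @property
    def n_dim(self):
        return len(self._rows[0]) if self._rows else 0

    def get_component(self, point, dimension):
        return self._rows[point][dimension]


def euclidean_distance(data: PointSource, a: int, b: int) -> float:
    """Distance between points a and b, using only the interface."""
    return math.sqrt(sum(
        (data.get_component(a, d) - data.get_component(b, d)) ** 2
        for d in range(data.n_dim)
    ))


source = ListPointSource([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
print(source.n_points, source.n_dim)
print(euclidean_distance(source, 0, 1))
```

Because euclidean_distance never touches the underlying list, a second implementation backed by a different storage scheme could be swapped in without changing the function. This is the sense in which generic types decouple the clustering procedure from the physical data layout.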

Info: If you know what you are doing, you can still associate arbitrary input data to a clustering (bundle) by assigning to Bundle._input_data directly.

We are not done yet, though: clustering is still not possible because we are missing a fitter that controls how the clustering is actually carried out.

The default fitter for any common-nearest-neighbours clustering is _fit.FitterExtCommonNNBFS. To initialise this fitter, we need to pass the following building blocks as arguments:

  • neighbours_getter: A generic object that defines how neighbourhood information can be retrieved from the input data object. Needs to be a concrete implementation of the abstract class _types.NeighboursGetter.

  • neighbours: A generic object to hold the retrieved neighbourhood of one point. Filled by the neighbours_getter. Needs to be a concrete implementation of the abstract class _types.Neighbours.

  • neighbour_neighbours: Same as neighbours. The FitterExtCommonNNBFS fitter uses exactly two containers to store the neighbourhoods of two points.

  • similarity_checker: A generic object that controls how the common-nearest-neighbour similarity criterion (at least c common neighbours) is checked. Needs to be a concrete implementation of the abstract class _types.SimilarityChecker.

  • queue: A generic queuing structure needed for the breadth-first-search approach implemented by the fitter. Needs to be a concrete implementation of the abstract class _types.Queue.

So let’s create these building blocks to prepare a fitter for the clustering. Note that the neighbours getter recommended by default (_types.NeighboursGetterExtBruteForce) in turn requires a distance getter (which controls how pairwise distances between points in the input data are retrieved), which again expects us to define a metric. For the neighbours containers we choose a type that wraps a C++ vector. The similarity check will be done by a set of containment checks, and the queuing structure will be a C++ queue.

[13]:
# Choose Euclidean metric
metric = _types.MetricExtEuclidean()
distance_getter = _types.DistanceGetterExtMetric(metric)

# Make neighbours getter
neighbours_getter = _types.NeighboursGetterExtBruteForce(distance_getter)

# Make fitter
fitter = _fit.FitterExtCommonNNBFS(
    neighbours_getter,
    _types.NeighboursExtVector(),
    _types.NeighboursExtVector(),
    _types.SimilarityCheckerExtContains(),
    _types.QueueExtFIFOQueue()
)

This fitter can now be associated with our clustering. With everything in place, a clustering can finally be executed.

[14]:
plain_clustering.fitter = fitter

[15]:
plain_clustering.fit(radius_cutoff=0.1, similarity_cutoff=2)
-----------------------------------------------------------------------------------------------
#points   r         nc        min       max       #clusters %largest  %noise    time
3         0.100     2         None      None      0         0.000     1.000     00:00:0.000
-----------------------------------------------------------------------------------------------
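
To see why this run reports no clusters and only noise, the following pure-Python sketch mimics the interplay of the building blocks: a brute-force neighbour search, a check for at least c common neighbours, and a breadth-first search driven by a FIFO queue. It is a simplified illustration under stated assumptions, not the implementation of FitterExtCommonNNBFS. The function names are hypothetical, a point is not counted as its own neighbour, distances are compared with <=, and the label 0 is assumed to mark noise.

```python
import math
from collections import deque


def neighbourhoods(points, radius):
    """Brute-force search: for each point, the indices within `radius`."""
    return [
        {
            j for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        }
        for i, p in enumerate(points)
    ]


def commonnn_labels(points, radius, min_common):
    """Label points by clusters; 0 marks noise (assumed convention)."""
    nbrs = neighbourhoods(points, radius)
    labels = [0] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != 0:
            continue
        members = [start]
        seen = {start}
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in nbrs[a]:
                if b in seen:
                    continue
                # Similarity criterion: enough shared neighbours
                if len(nbrs[a] & nbrs[b]) >= min_common:
                    seen.add(b)
                    members.append(b)
                    queue.append(b)
        # Only groups of more than one point count as a cluster
        if len(members) > 1:
            current += 1
            for m in members:
                labels[m] = current
    return labels


# The three points used above are sqrt(3) apart, far beyond r = 0.1
dummy = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
print(commonnn_labels(dummy, radius=0.1, min_common=2))

# A denser toy data set: three close points and one outlier
toy = [[0.0], [0.1], [0.2], [5.0]]
print(commonnn_labels(toy, radius=0.25, min_common=1))
```

With a radius of 0.1, none of the three dummy points has any neighbour, so no pair can satisfy the similarity criterion and every point stays noise. A larger radius or a denser data set, as in the second call, is needed to obtain clusters.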

The manual way of initialising a Clustering instance described here is very flexible, as the user can cherry-pick exactly the desired types to modify the different contributing pieces. On the other hand, this approach can be fairly tedious and error-prone. The next section shows how this problem is addressed by assembling a clustering according to pre-defined schemes.

Initialisation via a builder

So far, we have seen how to assemble a Clustering instance from scratch by selecting the individual clustering components manually. At the beginning, we also saw that a Clustering can be created seemingly automatically if we just pass raw data to the constructor. To fill the gap, let’s now have a look at how a Clustering can be created via a Builder. A builder is a helper object that correctly creates a Clustering based on some preset requirements, a so-called recipe. When we try to initialise a Clustering with raw input data (that is not wrapped in a valid generic input data type), a recipes.Builder instance takes over behind the scenes. By default, a builder is associated with a certain recipe.

[16]:
builder = recipes.Builder()
print(builder.default_recipe)
coordinates

Let's look into what is actually meant by a clustering recipe. A recipe is basically a nested mapping of clustering component strings (matching the corresponding keyword arguments used on clustering/component initialisation, e.g. "input_data" or "neighbours") to the generic types (classes, not instances) that should be used in the corresponding place. A recipe could for example look like this:

[17]:
recipe = {
    "input_data": _types.InputDataExtComponentsMemoryview,
    "fitter": "bfs",
    "fitter.getter": "brute_force",
    "fitter.getter.dgetter": "metric",
    "fitter.getter.dgetter.metric": "euclidean",
    "fitter.na": ("vector", (), {"initial_size": 1000}),
    "fitter.checker": "contains",
    "fitter.queue": "fifo"
}

In this recipe, the generic type that wraps the input data is specified explicitly as a class object. Alternatively, strings can be used to specify a type in shorthand notation. Which abbreviations are understood is defined in recipes.COMPONENT_NAME_TYPE_MAP. In the fitter case, "bfs" is translated into _fit.FitterExtCommonNNBFS.

Dot notation is used to indicate nested dependencies, i.e. to define components needed to create other components. Shorthand notation is also supported for the component key, as shown with "fitter.getter", which stands in for the neighbours getter required by the fitter. Abbreviations on the key side are defined in recipes.COMPONENT_ALT_KW_MAP.

For the "fitter.na" component (one of the two neighbours containers that the fitter needs), the value in the mapping is a tuple. This is interpreted as a component string identifier, followed by an arguments tuple and a keyword arguments dictionary that are used in the initialisation of the corresponding component. Note also that the recipe defines only "fitter.na" (neighbours) and not "fitter.nb" (neighbour_neighbours), in which case the same type is used for both components. These fallback relationships are defined in recipes.COMPONENT_KW_TYPE_ALIAS_MAP. The default recipe mentioned above looks like this:

[18]:
recipes.get_registered_recipe("coordinates")
[18]:
{'input_data': 'components_mview',
 'preparation_hook': 'components_array_from_parts',
 'fitter': 'bfs',
 'fitter.ngetter': 'brute_force',
 'fitter.na': 'vuset',
 'fitter.checker': 'switch',
 'fitter.queue': 'fifo',
 'fitter.ngetter.dgetter': 'metric',
 'fitter.ngetter.dgetter.metric': 'euclidean_r'}
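
The different ways of writing a recipe value (class object, shorthand string, or tuple with arguments) and the fallback from "fitter.nb" to "fitter.na" can be sketched in plain Python as follows. This is only an illustration under assumptions: Vector, FifoQueue, TYPE_MAP, FALLBACKS and resolve are hypothetical stand-ins and do not reproduce the contents of recipes.COMPONENT_NAME_TYPE_MAP or recipes.COMPONENT_KW_TYPE_ALIAS_MAP, nor the builder's actual logic.

```python
class Vector:
    """Hypothetical neighbours container."""

    def __init__(self, initial_size=10):
        self.initial_size = initial_size


class FifoQueue:
    """Hypothetical queue type."""


# Hypothetical shorthand-to-type map and key fallbacks
TYPE_MAP = {"vector": Vector, "fifo": FifoQueue}
FALLBACKS = {"fitter.nb": "fitter.na"}


def resolve(recipe, key):
    """Create the component registered under `key` in `recipe`."""
    if key not in recipe:
        # Fall back to a related key, e.g. reuse the type of "fitter.na"
        key = FALLBACKS[key]
    value = recipe[key]

    args, kwargs = (), {}
    if isinstance(value, tuple):
        # (type or shorthand, positional arguments, keyword arguments)
        value, args, kwargs = value
    if isinstance(value, str):
        value = TYPE_MAP[value]
    return value(*args, **kwargs)


example = {
    "fitter.na": ("vector", (), {"initial_size": 1000}),
    "fitter.queue": "fifo",
}

na = resolve(example, "fitter.na")
nb = resolve(example, "fitter.nb")  # not in the recipe, uses the fallback
queue = resolve(example, "fitter.queue")
print(type(na).__name__, na.initial_size)
print(type(nb).__name__, nb.initial_size)
print(type(queue).__name__)
```

Note that the fallback in this sketch creates a second, independent instance for "fitter.nb" rather than sharing the object made for "fitter.na".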

On builder initialisation, a base recipe can be specified either as a string (if a corresponding recipe is registered) or as a mapping. Further keyword arguments are interpreted to override the base recipe.

[19]:
builder = recipes.Builder(recipe=recipe, prep='components_array_from_parts')
builder.recipe
[19]:
{'input_data': commonnn._types.InputDataExtComponentsMemoryview,
 'fitter': 'bfs',
 'fitter.neighbours_getter': 'brute_force',
 'fitter.neighbours_getter.distance_getter': 'metric',
 'fitter.neighbours_getter.distance_getter.metric': 'euclidean',
 'fitter.neighbours': ('vector', (), {'initial_size': 1000}),
 'fitter.similarity_checker': 'contains',
 'fitter.queue': 'fifo',
 'preparation_hook': 'components_array_from_parts'}
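
The way abbreviated keys are expanded and keyword overrides are merged into a base recipe can be approximated by the following sketch. ALT_KEYWORDS is a hypothetical, incomplete map inferred from the outputs shown in this tutorial (it is not recipes.COMPONENT_ALT_KW_MAP), and normalise_key and merge_recipe are illustrative helper names. The replacement of double underscores by dots follows the convention stated in the Clustering.__init__ docstring.

```python
# Hypothetical abbreviation map, inferred from the outputs in this tutorial
ALT_KEYWORDS = {
    "getter": "neighbours_getter",
    "ngetter": "neighbours_getter",
    "dgetter": "distance_getter",
    "na": "neighbours",
    "nb": "neighbour_neighbours",
    "checker": "similarity_checker",
    "prep": "preparation_hook",
}


def normalise_key(key):
    """Expand abbreviations and turn double underscores into dots."""
    parts = key.replace("__", ".").split(".")
    return ".".join(ALT_KEYWORDS.get(part, part) for part in parts)


def merge_recipe(base, **overrides):
    """Return a normalised copy of `base` updated with `overrides`."""
    merged = {normalise_key(k): v for k, v in base.items()}
    merged.update({normalise_key(k): v for k, v in overrides.items()})
    return merged


base = {
    "fitter": "bfs",
    "fitter.getter": "brute_force",
    "fitter.getter.dgetter": "metric",
    "fitter.na": "vector",
}

merged = merge_recipe(
    base,
    fitter__na="vuset",
    prep="components_array_from_parts",
)
for key, value in merged.items():
    print(f"{key}: {value}")
```

Normalising both the base recipe and the overrides before merging ensures that an override such as fitter__na replaces the entry originally written as "fitter.na" instead of creating a second, conflicting key.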

Other readily available recipes are "distances", "neighbourhoods" and "sorted_neighbourhoods". Users are encouraged to modify those to their liking or to define their own custom recipes.

Individual components are built by the builder after initialisation by calling its make_component method.

[20]:
fitter = builder.make_component("fitter")
print(fitter)
FitterExtCommonNNBFS(ngetter=NeighboursGetterExtBruteForce(dgetter=DistanceGetterExtMetric(metric=MetricExtEuclidean), sorted=False, selfcounting=True), na=NeighboursExtVector, nb=NeighboursExtVector, checker=SimilarityCheckerExtContains, queue=QueueExtFIFOQueue)

Generic input data is made from raw input data by make_input_data.

[21]:
input_data = builder.make_input_data(points)
print(input_data)
InputDataExtComponentsMemoryview(components of 3 points in 3 dimensions)

Newly defined types that should be usable in a builder-controlled aggregation need to implement a classmethod get_builder_kwargs() -> list that returns a list of the component identifiers necessary to initialise an object of that type.
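
How such a classmethod could drive the recursive assembly of nested components is sketched below with plain Python classes. The sketch makes a simplifying assumption: get_builder_kwargs is taken to return a flat list of keyword names, while the exact format expected by recipes.Builder may be richer. The classes Metric, DistanceGetter and NeighboursGetter, as well as the standalone make_component function, are hypothetical and independent of commonnn.

```python
class Metric:
    """Hypothetical leaf component without dependencies."""

    @classmethod
    def get_builder_kwargs(cls):
        return []


class DistanceGetter:
    """Hypothetical component that depends on a metric."""

    def __init__(self, metric):
        self.metric = metric

    @classmethod
    def get_builder_kwargs(cls):
        return ["metric"]


class NeighboursGetter:
    """Hypothetical component that depends on a distance getter."""

    def __init__(self, distance_getter):
        self.distance_getter = distance_getter

    @classmethod
    def get_builder_kwargs(cls):
        return ["distance_getter"]


def make_component(recipe, key):
    """Build the component under `key`, creating its dependencies first."""
    component_type = recipe[key]
    kwargs = {
        name: make_component(recipe, f"{key}.{name}")
        for name in component_type.get_builder_kwargs()
    }
    return component_type(**kwargs)


nested_recipe = {
    "neighbours_getter": NeighboursGetter,
    "neighbours_getter.distance_getter": DistanceGetter,
    "neighbours_getter.distance_getter.metric": Metric,
}

getter = make_component(nested_recipe, "neighbours_getter")
print(type(getter.distance_getter.metric).__name__)
```

Each type only declares what it needs, and the dotted keys of the recipe determine which concrete type fills each slot. A new type can therefore take part in the assembly without changes to the building function.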