Demonstration of (generic) interfaces¶
Notebook configuration¶
[1]:
import sys
import numpy as np
import commonnn
from commonnn import cluster
from commonnn import _types, _fit
Print Python and package version information:
[2]:
# Version information
print("Python: ", *sys.version.split("\n"))
print("Packages:")
for package in [np, commonnn]:
    print(f" {package.__name__}: {package.__version__}")
Python: 3.10.7 (main, Sep 27 2022, 11:41:38) [GCC 10.2.1 20210110]
Packages:
numpy: 1.23.3
commonnn: 0.0.2
Labels¶
_types.Labels is used to store cluster label assignments alongside a per-point consider indicator and meta information. It also provides a few transformation methods.
Initialise Labels as:
Labels(labels)
Labels(labels, consider=consider)
Labels(labels, consider=consider, meta=meta)
Labels.from_sequence(labels_list, consider=consider_list, meta=meta)
Labels.from_length(n, meta=meta)
Technically, Labels is not used as a generic class. A clustering, i.e. the assignment of cluster labels to points through a fitter (using several generic interfaces), uses an instance of Labels by directly modifying the underlying array of labels, a Cython memoryview that can be accessed from the C level as Labels._labels. Labels.labels provides a NumPy array view of Labels._labels.
Examples:
[3]:
# Requires labels to be initialised
labels = _types.Labels()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [3], line 2
1 # Requires labels to be initialised
----> 2 labels = _types.Labels()
File src/commonnn/_types.pyx:155, in commonnn._types.Labels.__cinit__()
TypeError: __cinit__() takes exactly 1 positional argument (0 given)
[4]:
labels = _types.Labels(np.array([1, 1, 2, 2, 2, 0]))
labels
[4]:
Labels([1, 1, 2, 2, 2, 0])
[9]:
labels = _types.Labels.from_sequence([1, 1, 2, 2, 2, 0])
labels
[9]:
Labels([1, 1, 2, 2, 2, 0])
[10]:
print(labels)
[1 1 2 2 2 0]
[11]:
labels.labels
[11]:
array([1, 1, 2, 2, 2, 0])
[12]:
labels.consider
[12]:
array([1, 1, 1, 1, 1, 1], dtype=uint8)
[13]:
labels.meta
[13]:
{}
[14]:
labels.set
[14]:
{0, 1, 2}
[15]:
labels.mapping
[15]:
defaultdict(list, {1: [0, 1], 2: [2, 3, 4], 0: [5]})
[16]:
labels.sort_by_size()
print(labels)
[2 2 1 1 1 0]
[17]:
labels.sort_by_size(member_cutoff=3)
print(labels)
[0 0 1 1 1 0]
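The relabelling seen in the two sort_by_size() calls above can be imitated in plain Python. The following is only a sketch of the observable behaviour, not the package's implementation: the helper name sort_labels_by_size is hypothetical, and it assumes that label 0 marks noise, that clusters with fewer than member_cutoff members are turned into noise, and that ties in size are broken by the old label.

```python
from collections import Counter


def sort_labels_by_size(labels, member_cutoff=None):
    """Relabel clusters so that label 1 is the largest cluster.

    Hypothetical helper, not part of commonnn. Label 0 is treated as
    noise and kept (assumption).
    """
    # Count members per cluster, ignoring noise (label 0)
    sizes = Counter(label for label in labels if label != 0)

    # Drop clusters with fewer than `member_cutoff` members (assumption)
    if member_cutoff is not None:
        sizes = {k: n for k, n in sizes.items() if n >= member_cutoff}

    # Largest cluster first; ties are broken by the old label (assumption)
    ranked = sorted(sizes, key=lambda k: (-sizes[k], k))
    new_label = {old: new for new, old in enumerate(ranked, start=1)}

    # Noise and dropped clusters have no new label and map to 0
    return [new_label.get(label, 0) for label in labels]


print(sort_labels_by_size([1, 1, 2, 2, 2, 0]))
print(sort_labels_by_size([1, 1, 2, 2, 2, 0], member_cutoff=3))
```

For the input used in this section, the sketch reproduces the two label arrays printed above. Unlike Labels.sort_by_size(), it returns a new list instead of modifying the labels in place.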
Cluster parameters¶
An instance of a _types.ClusterParameters subclass (e.g. CommonNNParameters) is used during a clustering to pass around cluster parameters.
Initialise ClusterParameters as:
ClusterParameters(fparams, iparams)
ClusterParameters.from_mapping(mapping)
…
ClusterParameters subclasses are simple classes that carry two C arrays, one for floating point parameters and one for integer parameters. The order of parameters in these arrays is important. Descriptive names are stored under ClusterParameters._fparam_names and ClusterParameters._iparam_names.
Examples:
[23]:
# Requires two sequences
_types.ClusterParameters()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [23], line 2
1 # Requires two sequences
----> 2 _types.ClusterParameters()
File src/commonnn/_types.pyx:30, in commonnn._types.ClusterParameters.__cinit__()
TypeError: __cinit__() takes exactly 2 positional arguments (0 given)
[28]:
# Consistency is not checked for required parameters
_types.CommonNNParameters([], [])
[28]:
{'radius_cutoff': 4.63809877625237e-310, 'similarity_cutoff': 93876154254304, '_support_cutoff': 0, 'start_label': 0}
[29]:
# The order of parameters matters
_types.CommonNNParameters([1], [2, 3, 4])
[29]:
{'radius_cutoff': 1.0, 'similarity_cutoff': 2, '_support_cutoff': 3, 'start_label': 4}
[30]:
# More robust initialisation via a mapping (checks required)
_types.CommonNNParameters.from_mapping({"similarity_cutoff": 2})
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In [30], line 2
1 # More robust initialisation via a mapping
----> 2 _types.CommonNNParameters.from_mapping({"similarity_cutoff": 2})
File src/commonnn/_types.pyx:54, in commonnn._types.ClusterParameters.from_mapping()
KeyError: 'radius_cutoff'
[31]:
# More robust initialisation via a mapping (provided defaults)
_types.CommonNNParameters.from_mapping({"radius_cutoff": 1, "similarity_cutoff": 2})
[31]:
{'radius_cutoff': 1.0, 'similarity_cutoff': 2, '_support_cutoff': 2, 'start_label': 1}
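To make the role of parameter order and defaults more concrete, here is a minimal pure-Python sketch of a name-based initialisation. It is not the code behind ClusterParameters.from_mapping: the function params_from_mapping is hypothetical, and the name lists and defaults (start_label defaulting to 1, _support_cutoff falling back to similarity_cutoff) are only read off the example outputs above.

```python
def params_from_mapping(mapping, fparam_names, iparam_names, defaults):
    """Build ordered float and integer parameter lists from a mapping.

    Hypothetical stand-in, not part of commonnn. A default may be a
    value or the name of another parameter to copy from.
    """
    def lookup(name):
        if name in mapping:
            return mapping[name]
        if name not in defaults:
            # Neither a value nor a default: a required parameter is missing
            raise KeyError(name)
        default = defaults[name]
        # A string default refers to another parameter
        return lookup(default) if isinstance(default, str) else default

    # The order of the name lists fixes the order within each array
    fparams = [float(lookup(name)) for name in fparam_names]
    iparams = [int(lookup(name)) for name in iparam_names]
    return fparams, iparams


# Names and defaults inferred from the example output (assumptions)
FPARAM_NAMES = ["radius_cutoff"]
IPARAM_NAMES = ["similarity_cutoff", "_support_cutoff", "start_label"]
DEFAULTS = {"_support_cutoff": "similarity_cutoff", "start_label": 1}

fparams, iparams = params_from_mapping(
    {"radius_cutoff": 1, "similarity_cutoff": 2},
    FPARAM_NAMES, IPARAM_NAMES, DEFAULTS,
)
print(fparams, iparams)
```

Passing a mapping without "radius_cutoff" raises a KeyError in this sketch, mirroring the failing example above. Looking parameters up by name avoids the silent misassignment that is possible when two bare sequences are passed positionally.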
Input data¶
Common-nearest-neighbour clustering can be done on data in a variety of input formats, with variations in the actual execution of the procedure. A typical case, for example, would be to use the coordinates of a number of points in some feature space. These coordinates may be stored in a 2-dimensional (NumPy) array, but they could also be held in a database. Instead of point coordinates, we can also begin the clustering with pre-computed pairwise distances between the points. The present implementation in the commonnn package aims to be generic and largely agnostic about the source of input data. This is achieved by wrapping the input data structure into an input data object that complies with a universal input data interface. On the Python level, the input data interface is defined through the abstract base class _types.InputData and specialised through its abstract subclasses InputDataComponents, InputDataPairwiseDistances, InputDataPairwiseDistancesComputer, InputDataNeighbourhoods, and InputDataNeighbourhoodsComputer. Valid input data types inherit from one of these abstract types and provide concrete implementations of the required methods. On the Cython level, the input data interface is universally defined through _types.InputDataExtInterface. Realisations of the interface by Cython extension types inherit from InputDataExtInterface and should be registered as a concrete implementation of one of the Python abstract base classes.
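Registering a type with an abstract base class, as mentioned in the last sentence, can be done with the standard library's abc module. The sketch below uses two hypothetical classes (InputDataSketch and ExtensionLikeInputData) rather than the commonnn types, and it does not claim to show how the package performs the registration internally.

```python
from abc import ABC, abstractmethod


class InputDataSketch(ABC):
    """Hypothetical stand-in for an abstract input data type."""

    @property
    @abstractmethod
    def n_points(self):
        """Total number of points."""


class ExtensionLikeInputData:
    """Stands in for an extension type that does not inherit from the ABC."""

    def __init__(self, points):
        self._points = list(points)

    @property
    def n_points(self):
        return len(self._points)


# Register as a "virtual subclass": no inheritance is involved, but
# isinstance() and issubclass() checks against the ABC succeed afterwards.
InputDataSketch.register(ExtensionLikeInputData)

obj = ExtensionLikeInputData([(0.0, 1.0), (2.0, 3.0)])
print(isinstance(obj, InputDataSketch))
print(InputDataSketch in ExtensionLikeInputData.__mro__)
```

Note that register() does not verify that the registered class actually implements the abstract methods, so the concrete type is responsible for providing them.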
InputData objects should expose the following (typed) attributes and methods:
data (any): If applicable, a representation of the underlying data, preferably as a NumPy array. Not strictly required for the clustering.
n_points (int): The total number of points in the data set.
meta (dict): A Python dictionary storing meta-information about the data. Used keys are, for example:
    "access_components": Can point coordinates be retrieved from the input data (bool)?
    "access_distances": Can distances be retrieved from the input data (bool)?
    "access_neighbours": Can neighbourhoods be retrieved from the input data (bool)?
    "edges": If the stored input data points actually belong to more than one data source, a list of integers can state the number of points per part.
(InputData) get_subset(indices: Container): Return an instance of the same type holding only a subset of points (as given by indices). Used by Clustering.isolate().
InputDataComponents objects should expose the following additional attributes and methods:
n_dim (int): The total number of dimensions.
(float) get_component(point: int, dimension: int): Return one component of a point with respect to a given dimension.
(NumPy ndarray) to_components_array(): Transform/return the underlying data as a 2D NumPy array.
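As an illustration of the attributes and methods listed above, the following pure-Python class mimics the shape of a components-based input data type. ListComponents is hypothetical and does not inherit from the commonnn abstract base classes. It stores nested lists instead of a NumPy array, so to_components_array() returns a list of lists here, and it assumes that indices can be iterated over.

```python
class ListComponents:
    """Hypothetical in-memory input data type backed by nested lists.

    Mirrors the attribute and method names described in the text, but is
    not part of commonnn and is not registered with its base classes.
    """

    def __init__(self, rows):
        self.data = [[float(value) for value in row] for row in rows]
        self.n_points = len(self.data)
        self.n_dim = len(self.data[0]) if self.data else 0
        self.meta = {"access_components": True}

    def get_component(self, point, dimension):
        # One coordinate of one point
        return self.data[point][dimension]

    def get_subset(self, indices):
        # New instance of the same type holding only the selected points
        return type(self)([self.data[index] for index in indices])

    def to_components_array(self):
        # A real implementation would return a 2D NumPy array;
        # a copy of the nested lists stands in for it here
        return [row[:] for row in self.data]


points = ListComponents([[0, 1], [2, 3], [4, 5]])
subset = points.get_subset([0, 2])
print(subset.n_points, subset.n_dim, subset.get_component(1, 0))
```

The subset holds the first and third point, so its second point is the original point with index 2.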
InputDataExtComponentsMemoryview¶
Examples:
[32]:
# Requires data to initialise
_types.InputDataExtComponentsMemoryview()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [32], line 2
1 # Requires data to initialise
----> 2 _types.InputDataExtComponentsMemoryview()
File src/commonnn/_types.pyx:1042, in commonnn._types.InputDataExtComponentsMemoryview.__cinit__()
TypeError: __cinit__() takes exactly 1 positional argument (0 given)
[33]:
input_data = _types.InputDataExtComponentsMemoryview(np.random.random(size=(10, 3)))
print(input_data)
InputDataExtComponentsMemoryview(components of 10 points in 3 dimensions)
[34]:
input_data.data
[34]:
<MemoryView of 'ndarray' at 0x7fd289696c20>
[36]:
input_data.to_components_array()
[36]:
array([[0.63713848, 0.59261714, 0.11073944],
[0.50103405, 0.21809186, 0.17320056],
[0.02821558, 0.87189284, 0.35115627],
[0.15125287, 0.66732633, 0.46516895],
[0.98859881, 0.76153395, 0.72389632],
[0.85745918, 0.47118309, 0.52671906],
[0.90798414, 0.52142208, 0.87590641],
[0.15672554, 0.67594873, 0.61782398],
[0.86505685, 0.13480431, 0.8690348 ],
[0.94010104, 0.34241657, 0.75876202]])
[37]:
input_data.meta
[37]:
{'access_components': True}
[38]:
input_data.n_points
[38]:
10
[39]:
input_data.n_dim
[39]:
3
Clustering¶
For more details on Clustering initialisation, refer to the Advanced usage tutorial.