{ "cells": [ { "cell_type": "markdown", "id": "consecutive-penguin", "metadata": {}, "source": [ "# Advanced usage " ] }, { "cell_type": "markdown", "id": "tutorial-emergency", "metadata": { "ExecuteTime": { "end_time": "2021-09-28T09:23:54.350238Z", "start_time": "2021-09-28T09:23:54.346679Z" } }, "source": [ "Go to:\n", " \n", " - [Notebook configuration](advanced_usage.ipynb#Notebook-configuration)\n", " - [Clustering initialisation](advanced_usage.ipynb#Clustering-initialisation)\n", " - [Short initialisation](advanced_usage.ipynb#Short-initialisation-for-point-coordinates)\n", " - [Manual custom initialisation](advanced_usage.ipynb#Manual-custom-initialisation)\n", " - [Initialisation via a builder](advanced_usage.ipynb#Initialisation-via-a-builder)" ] }, { "cell_type": "markdown", "id": "handmade-composer", "metadata": {}, "source": [ "## Notebook configuration" ] }, { "cell_type": "code", "execution_count": 1, "id": "representative-danger", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.021111Z", "start_time": "2022-11-23T10:24:39.068989Z" } }, "outputs": [], "source": [ "import sys\n", "\n", "import numpy as np\n", "\n", "import commonnn\n", "from commonnn import cluster, recipes\n", "from commonnn import _bundle, _types, _fit" ] }, { "cell_type": "markdown", "id": "surprising-antenna", "metadata": {}, "source": [ "Print Python and package version information:" ] }, { "cell_type": "code", "execution_count": 2, "id": "technical-compilation", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.027958Z", "start_time": "2022-11-23T10:24:41.023882Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python: 3.10.7 (main, Sep 27 2022, 11:41:38) [GCC 10.2.1 20210110]\n", "Packages:\n", " numpy: 1.23.3\n", " commonnn: 0.0.1\n" ] } ], "source": [ "# Version information\n", "print(\"Python: \", *sys.version.split(\"\\n\"))\n", "\n", "print(\"Packages:\")\n", "for package in [np, commonnn]:\n", " print(f\" {package.__name__}: {package.__version__}\")" ] }, { "cell_type": "markdown", "id": "respected-vampire", "metadata": {}, "source": [ "## Clustering initialisation" ] }, { "cell_type": "markdown", "id": "aggressive-nudist", "metadata": {}, "source": [ "### Short initialisation for point coordinates" ] }, { "cell_type": "markdown", "id": "productive-swing", "metadata": {}, "source": [ "In the [*Basic usage*](basic_usage.ipynb) tutorial, we saw how to create a `Clustering` object from a list of point coordinates." ] }, { "cell_type": "code", "execution_count": 3, "id": "dental-fourth", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.047338Z", "start_time": "2022-11-23T10:24:41.029556Z" } }, "outputs": [], "source": [ "# Three dummy points in three dimensions\n", "points = [\n", " [0, 0, 0],\n", " [1, 1, 1],\n", " [2, 2, 2]\n", "]\n", "clustering = cluster.Clustering(points)" ] }, { "cell_type": "markdown", "id": "premier-hammer", "metadata": {}, "source": [ "The created `Clustering` object is now ready to execute a clustering on the provided input data. In fact, this default initialisation works in the same way with any Python sequence of sequences." ] }, { "cell_type": "code", "execution_count": 4, "id": "plain-darwin", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.061818Z", "start_time": "2022-11-23T10:24:41.049074Z" } }, "outputs": [], "source": [ "# Ten random points in four dimensions\n", "points = np.random.random((10, 4))\n", "clustering = cluster.Clustering(points)" ] }, { "cell_type": "markdown", "id": "young-replica", "metadata": {}, "source": [ "Please note, that this does only yield meaningful results if the input data does indeed contain point coordinates. When a `Clustering` is initialised like this, quite a few steps are carried out in the background to ensure the correct assembly of the object. To be specific, the following things are taken care of:\n", "\n", " - The *raw* input data (here `points`) is wrapped into a generic input data object (a concrete implementation of the abstract class `_types.InputData`)\n", " - Prior to the wrapping, the raw data may be passed through a preparation function that returns it in a format matching the input data type\n", " - A generic fitter object (a concrete implementation of the abstract class `_fit.Fitter`) is selected and associated with the clustering\n", " - The fitter is equipped with other necessary building blocks\n", " \n", "In consequence, the created `Clustering` object carries a set of other objects that control how a clustering of the input data is executed. Which objects that are is controlled by a recipe (defined in the `recipes` module. The default registered recipe is named `\"coordinates\"`." ] }, { "cell_type": "code", "execution_count": 5, "id": "jewish-albuquerque", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.085858Z", "start_time": "2022-11-23T10:24:41.063634Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Clustering(input_data=InputDataExtComponentsMemoryview(components of 10 points in 4 dimensions), fitter=FitterExtCommonNNBFS(ngetter=NeighboursGetterExtBruteForce(dgetter=DistanceGetterExtMetric(metric=MetricExtEuclideanReduced), sorted=False, selfcounting=True), na=NeighboursExtVectorUnorderedSet, nb=NeighboursExtVectorUnorderedSet, checker=SimilarityCheckerExtSwitchContains, queue=QueueExtFIFOQueue), hierarchical_fitter=None, predictor=None)\n" ] } ], "source": [ "print(clustering)" ] }, { "cell_type": "markdown", "id": "geographic-bikini", "metadata": {}, "source": [ "To understand the setup steps and the different kinds of partaking objects better, lets have a closer look at the default recipe for the `Clustering` class in the next section." ] }, { "cell_type": "markdown", "id": "satisfactory-stevens", "metadata": {}, "source": [ "### Manual custom initialisation" ] }, { "cell_type": "markdown", "id": "84b61902", "metadata": { "ExecuteTime": { "end_time": "2022-11-22T15:39:09.319989Z", "start_time": "2022-11-22T15:39:09.311467Z" } }, "source": [ "There are multiple ways to initialise a `Clustering`:" ] }, { "cell_type": "code", "execution_count": 6, "id": "400408ea", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.102442Z", "start_time": "2022-11-23T10:24:41.089301Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on cython_function_or_method in module commonnn.cluster:\n", "\n", "__init__(self, data=None, *, fitter=None, hierarchical_fitter=None, predictor=None, bundle_kwargs=None, recipe=None, **recipe_kwargs)\n", " Clustering.__init__(self, data=None, *, fitter=None, hierarchical_fitter=None, predictor=None, bundle_kwargs=None, recipe=None, **recipe_kwargs)\n", " \n", " Keyword args:\n", " data:\n", " The data points to be clustered. Can be one of\n", " * `None`:\n", " Plain initialisation without input data.\n", " * A :class:`~commonnn._bundle.Bundle`:\n", " Initialisation with a ready-made input data bundle.\n", " * Any object implementing the input data interface\n", " (see :class:`~commonnn._types.InputData` or\n", " :class:`~commonnn._types.InputDataExtInterface`):\n", " in this case, additional keyword arguments can be passed\n", " via `bundle_kwargs` which are used to initialise a\n", " :class:`~commonnn._bundle.Bundle` from the input data,\n", " e.g. `labels`, `children`, etc.\n", " * Raw input data: Takes the input data type and a preparation\n", " hook from the `recipe` and wraps the raw data.\n", " fitter:\n", " Executes the clustering procedure. Can be\n", " * Any object implementing the fitter interface (see :class:`~commonnn._fit.Fitter` or\n", " :class:`~commonnn._fit.FitterExtInterface`).\n", " * None:\n", " In this case, the fitter is tried to be build from the `recipe` or left\n", " as `None`.\n", " hierarchical_fitter:\n", " Like `fitter` but for hierarchical clustering (see\n", " :class:`~commonnn._fit.HierarchicalFitter` or\n", " :class:`~commonnn._fit.HierarchicalFitterExtInterface`).\n", " predictor:\n", " Translates a clustering result from one bundle to another. Treated like\n", " `fitter` (see\n", " :class:`~commonnn._fit.Predictor` or\n", " :class:`~commonnn._fit.PredictorExtInterface`).\n", " bundle_kwargs: Used to create a :class:`~commonnn._bundle.Bundle`\n", " if `data` is neither a bundle nor `None`.\n", " recipe:\n", " Used to assemble a fitter etc. and to wrap raw input data. Can be\n", " * A string corresponding to a registered default recipe (see\n", " :obj:`~commonnn.recipes.REGISTERED_RECIPES`\n", " )\n", " * A recipe, i.e. a mapping of component keywords to component types\n", " **recipe_kwargs: Passed on to override entries in the base `recipe`. Use double\n", " underscores in key names instead of dots, e.g. fitter__na instead of fitter.na.\n", "\n" ] } ], "source": [ "help(cluster.Clustering.__init__)" ] }, { "cell_type": "markdown", "id": "df37fdb8", "metadata": {}, "source": [ "If, like in the example above, raw data is passed on initialisation without the specification of other options, certain assumptions are made and the clustering object is created using a default recipe. For illustration, lets create a clustering object without data and by explicitly silencing the default recipe. This will give us a plain clustering object without any other building blocks attached." ] }, { "cell_type": "code", "execution_count": 7, "id": "appreciated-glasgow", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.118328Z", "start_time": "2022-11-23T10:24:41.106170Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Clustering(input_data=None, fitter=None, hierarchical_fitter=None, predictor=None)\n" ] } ], "source": [ "plain_clustering = cluster.Clustering(recipe=\"none\")\n", "print(plain_clustering)" ] }, { "cell_type": "markdown", "id": "convertible-brain", "metadata": { "ExecuteTime": { "end_time": "2021-10-04T11:04:16.299058Z", "start_time": "2021-10-04T11:04:16.293867Z" } }, "source": [ "Naturally, this object is not set up for the actual clustering." ] }, { "cell_type": "code", "execution_count": 8, "id": "distinguished-receiver", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.799365Z", "start_time": "2022-11-23T10:24:41.126901Z" } }, "outputs": [ { "ename": "AttributeError", "evalue": "'NoneType' object has no attribute 'fit'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [8], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m plain_clustering\u001b[38;5;241m.\u001b[39mfit(radius_cutoff\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0.1\u001b[39m, similarity_cutoff\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m2\u001b[39m)\n", "File \u001b[1;32msrc/commonnn/cluster.pyx:283\u001b[0m, in \u001b[0;36mcommonnn.cluster.Clustering.fit\u001b[1;34m()\u001b[0m\n", "\u001b[1;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'fit'" ] } ], "source": [ "plain_clustering.fit(radius_cutoff=0.1, similarity_cutoff=2)" ] }, { "cell_type": "markdown", "id": "cognitive-apparel", "metadata": {}, "source": [ "Starting from scratch, we need to provide some input data and associate it with the clustering." ] }, { "cell_type": "code", "execution_count": 9, "id": "4090972c", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.807636Z", "start_time": "2022-11-23T10:24:41.804190Z" } }, "outputs": [], "source": [ "points = np.array([\n", " [0, 0, 0],\n", " [1, 1, 1],\n", " [2, 2, 2],\n", "], dtype=float)" ] }, { "cell_type": "markdown", "id": "c97a9586", "metadata": {}, "source": [ "To do so, we first need to associate these data with a `Bundle`. A bundle in turn is added to the `Clustering`. Bundles will become important in the context of hierarchical clustering. Trying to create a bundle with our *raw* input data, however, will result in in error." ] }, { "cell_type": "code", "execution_count": 10, "id": "a6eb273c", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.828162Z", "start_time": "2022-11-23T10:24:41.809164Z" } }, "outputs": [ { "ename": "TypeError", "evalue": "Can't use object of type ndarray as input data. Expected type InputData.", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn [10], line 1\u001b[0m\n\u001b[1;32m----> 1\u001b[0m _bundle\u001b[38;5;241m.\u001b[39mBundle(input_data\u001b[38;5;241m=\u001b[39mpoints)\n", "File \u001b[1;32msrc/commonnn/_bundle.pyx:30\u001b[0m, in \u001b[0;36mcommonnn._bundle.Bundle.__cinit__\u001b[1;34m()\u001b[0m\n", "File \u001b[1;32msrc/commonnn/_bundle.pyx:59\u001b[0m, in \u001b[0;36mcommonnn._bundle.Bundle.input_data.__set__\u001b[1;34m()\u001b[0m\n", "\u001b[1;31mTypeError\u001b[0m: Can't use object of type ndarray as input data. Expected type InputData." ] } ], "source": [ "_bundle.Bundle(input_data=points)" ] }, { "cell_type": "markdown", "id": "7418653c", "metadata": {}, "source": [ "Input data needs to be provided in terms of a generic type to allow a clustering procedure to be executed with it. Generic types can be accessed and worked with in a universal fashion, independent of how data is actually physically stored. A good type to be created from raw data points presented as a NumPy array is `_types.InputDataExtComponentsMemoryview`:" ] }, { "cell_type": "code", "execution_count": 11, "id": "570371ce", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.840559Z", "start_time": "2022-11-23T10:24:41.829805Z" } }, "outputs": [], "source": [ "input_data = _types.InputDataExtComponentsMemoryview(points)\n", "bundle = _bundle.Bundle(input_data=input_data)\n", "plain_clustering._bundle = bundle" ] }, { "cell_type": "markdown", "id": "111dc58e", "metadata": {}, "source": [ "Note that this type requires a C-continuous 2-dimensional array of 64-bit floats. Python nested sequences can be converted into this format using `recipes.prepare_components_array_from_parts`." ] }, { "cell_type": "code", "execution_count": 12, "id": "cea65cda", "metadata": { "ExecuteTime": { "end_time": "2022-11-23T10:24:41.855737Z", "start_time": "2022-11-23T10:24:41.842247Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Clustering(input_data=InputDataExtComponentsMemoryview(components of 3 points in 3 dimensions), fitter=None, hierarchical_fitter=None, predictor=None)\n" ] } ], "source": [ "print(plain_clustering)" ] }, { "cell_type": "markdown", "id": "internal-bottle", "metadata": {}, "source": [ "