3. KC Tools (module)

Clustering-related helper functions.

This module is a collection of useful functions.

kClusterLib.kcTools.centerVectors(sensor_vectors)[source]

Applies mean centering (and scaling) to SPND data.

\(\tilde{s}_{ij} = \frac{ s_{ij} - \bar{s}_j }{ \sqrt{\sigma_j^2} }\)

where \(\bar{s}_j\) and \(\sigma_j^2\) are the mean and variance of sensor \(j\).

Parameters:sensor_vectors (numpy.ndarray) – SPND data matrix. SPNDs along columns.
Returns:
  • centered_array (numpy.ndarray) – Matrix (SPNDs along columns) with elements centered and scaled.
  • mean_vector (list) – A row vector containing the sensor means.
  • sqrt_var_vector (list) – A row vector containing the \(\sqrt{\sigma^2}\) for each sensor.
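
As an illustration only, the centering and scaling described above can be sketched with NumPy. The name center_vectors_sketch is hypothetical and not part of kcTools, and the use of the population variance (ddof=0) is an assumption, since the estimator is not stated here.

```python
import numpy as np

def center_vectors_sketch(sensor_vectors):
    """Mean-center and scale each column (one column per SPND).

    Hypothetical stand-in for centerVectors; assumes ddof=0 and that
    no sensor has zero variance.
    """
    data = np.asarray(sensor_vectors, dtype=float)
    mean_vector = data.mean(axis=0)        # per-sensor mean
    sqrt_var_vector = data.std(axis=0)     # square root of per-sensor variance
    centered_array = (data - mean_vector) / sqrt_var_vector
    return centered_array, mean_vector.tolist(), sqrt_var_vector.tolist()

# Three readings (rows) from two sensors (columns)
readings = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0]])
centered, means, sqrt_vars = center_vectors_sketch(readings)
```

After this transformation every column has zero mean and unit variance, which keeps sensors with large absolute readings from dominating a distance-based clustering.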
kClusterLib.kcTools.getCentroids(mat, clusters)[source]

Returns the centroids of the clusters, given the data matrix and the cluster map.

Note

This function is obsolete now. Use Pycluster.clustercentroids instead.

kClusterLib.kcTools.separateClusters(kcluster, labels=[])[source]

Returns a list of lists containing the separated clusters, i.e. SPNDs that belong to the same cluster are put together in one list. There are as many lists as there are clusters.

Parameters:kcluster (list) – Takes a 1-D cluster list representing {SPND <–> Cluster} mapping.
Returns:cluster_result – A list of member lists, each containing the integer indexes of the SPNDs belonging to one cluster.
Return type:list (of lists)

Example

>>> from kcTools import separateClusters
>>> clusters = [ 2, 0, 1, 3, 2, 0, 1, 3,  2, 0, 1, 3]
>>> print(separateClusters(clusters))
[[1, 5, 9], [2, 6, 10], [0, 4, 8], [3, 7, 11]]
kClusterLib.kcTools.prettyPrint(kcluster, labels=[])[source]

Prints human readable format from cluster data.

Parameters:
  • kcluster (list) – cluster list representing SPNDs to Cluster mapping.
  • labels (list, optional) – list containing string names of SPNDs. If no list is passed automatic numbering is used.
Returns:Displays output on stdout.

Example
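
The following is a minimal sketch of the grouping that such a printer performs, not the library's own code. The helper name format_clusters_sketch is hypothetical, and the "id: [members]" line format and integer automatic numbering are assumptions.

```python
def format_clusters_sketch(kcluster, labels=None):
    """Build one 'cluster_id: [members]' line per cluster (hypothetical helper)."""
    if not labels:
        # Assumed form of the automatic numbering: 0, 1, 2, ...
        labels = list(range(len(kcluster)))
    groups = {}
    for label, cluster_id in zip(labels, kcluster):
        groups.setdefault(cluster_id, []).append(label)
    return ["{}: {}".format(cid, groups[cid]) for cid in sorted(groups)]

lines = format_clusters_sketch([2, 0, 1, 0], ['a', 'b', 'c', 'd'])
print("\n".join(lines))
```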

kClusterLib.kcTools.getInterClusterS(mat, clusters)[source]

Calculates the sum of intercluster distances.

Parameters:
  • mat (numpy.ndarray) – SPND data used for clustering.
  • clusters (list) – cluster map of SPND.
Returns:

interClusterDistance – Sum of the distances of the cluster means from the global mean.

Return type:

float

Example
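
A rough sketch of this quantity is given below. It is not the library's implementation: the name inter_cluster_sum_sketch is hypothetical, and both the Euclidean distance and the column-per-SPND layout are assumptions.

```python
import numpy as np

def inter_cluster_sum_sketch(mat, clusters):
    """Sum of distances between each cluster mean and the global mean.

    Assumes SPNDs along columns and Euclidean distance (both assumptions).
    """
    data = np.asarray(mat, dtype=float)
    clusters = np.asarray(clusters)
    global_mean = data.mean(axis=1)                 # mean over all SPND columns
    total = 0.0
    for cluster_id in np.unique(clusters):
        cluster_mean = data[:, clusters == cluster_id].mean(axis=1)
        total += np.linalg.norm(cluster_mean - global_mean)
    return float(total)

# Two samples (rows), four SPNDs (columns), two clusters
mat = np.array([[0.0, 0.0, 4.0, 4.0],
                [0.0, 0.0, 0.0, 0.0]])
s = inter_cluster_sum_sketch(mat, [0, 0, 1, 1])
```

Here the global mean is (2, 0) and the two cluster means are (0, 0) and (4, 0), so each contributes a distance of 2.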

kClusterLib.kcTools.count_nan(arr)[source]

Counts occurrences of numpy.nan in the passed data structure.

Parameters:arr (numpy.ndarray, list) – The data in which NaN values are to be counted.
Returns:count – Number of occurrences of NaN within the data passed.
Return type:int

Example
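
As a minimal sketch, the count can be obtained with numpy.isnan. The name count_nan_sketch is hypothetical, and the conversion to a float array is an assumption about how list input is handled.

```python
import numpy as np

def count_nan_sketch(arr):
    """Count NaN entries in a list or array (hypothetical stand-in)."""
    values = np.asarray(arr, dtype=float)
    return int(np.isnan(values).sum())

n_list = count_nan_sketch([1.0, np.nan, 3.0, np.nan])
n_matrix = count_nan_sketch(np.array([[np.nan, 2.0], [3.0, 4.0]]))
```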

kClusterLib.kcTools.custom_filter(mat)[source]

Converts negative elements to 0 (zero) in the passed data.

Parameters:mat (numpy.ndarray) – Matrix to be cleaned of negative values.
Returns:cleanMat – matrix with negative elements converted to zero
Return type:numpy.ndarray
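
One way to express this filter with NumPy is sketched below. The name custom_filter_sketch is hypothetical, and returning a new array rather than modifying the input in place is an assumption.

```python
import numpy as np

def custom_filter_sketch(mat):
    """Return a copy of mat with negative entries replaced by zero."""
    data = np.asarray(mat, dtype=float)
    return np.where(data < 0, 0.0, data)

raw = np.array([[-1.0, 2.0],
                [3.0, -0.5]])
cleaned = custom_filter_sketch(raw)
```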
kClusterLib.kcTools.removeFaultySensors(matSensor, labelSensor, minSensorOutput=0, allowedFault=25)[source]

Eliminates those SPNDs for which x% or more of the sensor readings are bad, where x is provided by the user.

Parameters:
  • matSensor (numpy.ndarray) – Data to be filtered.
  • labelSensor (list) – list of SPND labels.
  • minSensorOutput (float) – Bad-value threshold for the sensor output. A sensor value below this threshold is considered a bad value.
  • allowedFault (float) – Percentage threshold for the total number of bad values. If an SPND has this percentage (or more) of bad readings, it is eliminated, and the corresponding label is removed from the list of labels.
Returns:

  • matSensor (numpy.ndarray) – SPND data after removal of faulty SPND column(s).
  • labelSensor – label list after removal of faulty SPND label(s).
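
The filtering rule described by these parameters can be sketched as follows. The name remove_faulty_sensors_sketch is hypothetical, and the column-per-SPND layout is an assumption.

```python
import numpy as np

def remove_faulty_sensors_sketch(mat_sensor, label_sensor,
                                 min_sensor_output=0, allowed_fault=25):
    """Drop columns whose share of bad readings reaches allowed_fault percent."""
    data = np.asarray(mat_sensor, dtype=float)
    # Percentage of readings below the threshold, per SPND column
    bad_percent = 100.0 * (data < min_sensor_output).mean(axis=0)
    keep = bad_percent < allowed_fault      # "these many (or more)" are removed
    kept_labels = [lab for lab, k in zip(label_sensor, keep) if k]
    return data[:, keep], kept_labels

readings = np.array([[1.0, -1.0, 2.0],
                     [2.0, -1.0, 3.0],
                     [3.0,  5.0, 4.0],
                     [4.0,  6.0, 5.0]])
filtered, kept = remove_faulty_sensors_sketch(readings, ['a', 'b', 'c'])
```

Sensor 'b' has 50% of its readings below the threshold, so its column and its label are both dropped.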

kClusterLib.kcTools.getOptimalCluster(sensor_mat, Si_threshold=0.1, Kmax=8, **kwargs)[source]

Runs the clustering multiple times and tries to find the optimal clustering.

Parameters:
  • sensor_mat (numpy.ndarray) – SPND data matrix.
  • Si_threshold (float) – Desired ratio of intra- to inter-cluster distance. This value is a measure of the closeness of similar SPNDs in a cluster.
  • Kmax (int) – Upper limit on the number of clusters. The clustering program starts from k=2 (two clusters) and then keeps increasing k to find a smaller Si.
Returns:

  • kcluster (list) – A list containing the cluster mapping of SPNDs. The index corresponds to the SPND and the value corresponds to the cluster. For example, kcluster = [2,0,1] signifies

    \(0^{th}\ \text{SPND} \to \text{cluster } 2\)
    \(1^{st}\ \text{SPND} \to \text{cluster } 0\)
    \(2^{nd}\ \text{SPND} \to \text{cluster } 1\)
  • error (float) – Represents the sum of intra cluster distances.

  • freq (int) – Represents how many times the optimal solution was found while clustering.

Example

>>> from pylab import randn
>>> import numpy
>>> from kcTools import getOptimalCluster, prettyPrint
>>> # Let's create a random matrix
>>> a = numpy.array(randn(100)).reshape(10,10)
>>> # Perform clustering
>>> kc, er, fc = getOptimalCluster(a,0.5,5)
>>> prettyPrint(kc,['a','b','c','d','e','f','g','h','i','j'])
0: ['d', 'h']
1: ['a', 'g']
2: ['b', 'f', 'i']
3: ['c', 'e', 'j']

...

kClusterLib.kcTools.loadPcaData(**kwargs)[source]

Returns relMatrix and residueCovMat, the PCA model data.

This function searches for the stored data in the current directory or in another directory (given by the cpath argument).

Parameters:
  • fname (str, optional) – Name of the cluster to load from the current directory.
  • cpath (str, optional) – System path of directory where cluster is stored.
Returns:

  • relMatrix (ndarray) – PCA relations Matrix
  • residueCovMat (ndarray) – Data-Covariance matrix

Note

The order of arguments is to be tracked carefully.

kClusterLib.kcTools.savePcaData(relMat, residueCovMat, **kwargs)[source]

Saves the given PCA model data to disk.

Parameters:
  • relMat (ndarray) – PCA relations Matrix
  • residueCovMat (ndarray) – Data-Covariance matrix
Returns:

name – <filename>.np if the data was successfully written to the disk/directory.

Return type:

str

...

kClusterLib.kcTools.loadKCFromDisk(**kwargs)[source]

Returns clusterMap, CoVcombinedMat, spndLabels, spndMeans, spndVars from disk (saved data).

This function searches for the stored data in the current directory or in another directory (given by the cpath argument).

Parameters:
  • fname (str, optional) – Name of the cluster to load from the current directory.
  • cpath (str, optional) – System path of directory where cluster is stored.
Returns:

  • clusterMap (list) – The cluster map, if a cluster is located in the given path (or the current directory); otherwise None is returned.
  • CoVcombinedMat (numpy.ndarray) – matrix stored in memory
  • spndLabels (list) – List of string names for SPNDs
  • spndMeans (ndarray) – 1-D numpy array containing Mean of each SPND sensor.
  • spndVars (ndarray) – 1-D numpy array containing the variance of each SPND sensor.

Example

Here’s a use case of this function

>>> from kcTools import *
>>> from pylab import randn
>>> import numpy as np
>>> mat = np.array(randn(1500)).reshape(150,10)
>>> kc = np.array(range(6))
>>> labels = ['a','b','c','d','e','f','g','h','i', 'j']
>>> means = [np.mean(mat[:,i]) for i in range(10)]
>>> vars = [np.var(mat[:,i]) for i in range(10)]
>>> saveKCToDisk(kc, mat, labels, means, vars)
>>> clusterMap, CoVcombinedMat, spndLabels, spndMeans, spndVars = loadKCFromDisk()
>>> print( "kc: {}\r\n\r\nmat:{} \r\n\r\nlbl:{} \r\n\r\nmeans:{} \r\n\r\nvars:{}".format(clusterMap, CoVcombinedMat, spndLabels, spndMeans, spndVars) )
kClusterLib.kcTools.saveKCToDisk(clusterMap, CoVcombinedMat, spndLabels, spndMeans, spndVars, **kwargs)[source]

Saves a given cluster to disk.

Parameters:
  • clusterMap (list) – kcluster information to be saved to disk
  • CoVcombinedMat (ndarray) – data used for clustering (mean centered and scaled)
  • spndLabels (list) – String names for SPNDs
  • spndMeans (ndarray) – 1-D numpy array containing Mean of each SPND sensor.
  • spndVars (ndarray) – 1-D numpy array containing the variance of each SPND sensor.
  • fname (str, optional) – Name of the data file to be created and stored in the directory.
  • cpath (str, optional) – Valid system path where the cluster must be stored.
Returns:

name – <filename>.np if the data was successfully written to the disk/directory.

Return type:

str

...

kClusterLib.kcTools.getFreshCluster(**kwargs)[source]

Searches for a valid cluster on disk.

Note

There is no need to use this function directly; use loadKCFromDisk() instead.

kClusterLib.kcTools.get_datetime(fstr)[source]

Converts a valid date-time string into a datetime object.

Parameters:fstr (str) – A valid datetime string
Returns:timestamp – Valid datetime object. This allows checking the freshness of files, or checking how much time has elapsed since a particular file was created or
Return type:datetime instance
kClusterLib.kcTools.halt()[source]

Halts the execution of the program until the user chooses to proceed.

Parameters:None
Returns:
Return type:None
kClusterLib.kcTools.isSingletonCluster(cluster)[source]

Checks whether a singleton cluster exists in the passed 1-D SPND-cluster mapping array.

Parameters:cluster (list) – 1-D array, of length equal to the total number of SPNDs, containing the SPND-cluster mapping.
Returns:status – Returns True if one or more singleton cluster labels occur in cluster.
Return type:bool

Example

>>> from kcTools import isSingletonCluster
>>> clustr = [2,3,0,1,2,0,3,2,0]  # Here 1 is singleton
>>> print (isSingletonCluster(clustr))  # printing
>>> clustr = [2,3,0,2,2,0,3,2,0]  # No singleton
>>> print (isSingletonCluster(clustr))  # printing

This produces the following output:

True
False
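
For illustration, such a check can be sketched by counting how often each cluster label occurs. The name is_singleton_cluster_sketch is hypothetical and is not claimed to match the library's implementation.

```python
from collections import Counter

def is_singleton_cluster_sketch(cluster):
    """Return True if any cluster label occurs exactly once in the mapping."""
    counts = Counter(cluster)
    return any(n == 1 for n in counts.values())

with_singleton = is_singleton_cluster_sketch([2, 3, 0, 1, 2, 0, 3, 2, 0])
without_singleton = is_singleton_cluster_sketch([2, 3, 0, 2, 2, 0, 3, 2, 0])
```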
kClusterLib.kcTools.mergeSingletonCluster(kcluster, mat, **kwargs)[source]

Merges the singleton cluster (if present) with its nearest neighbour (S = 1 - abs(PC)) in the 1-D SPND-to-cluster map.

Parameters:
  • kcluster (list) – List containing the 1-D SPND to cluster-id Mapping.
  • mat (ndarray) – Matrix-like structure containing the SPND sensor values.
Returns:

cluster – Updated cluster mapping

Return type:

list

Example
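
The sketch below illustrates one possible reading of this merge step and is not the library's code. The name merge_singleton_sketch is hypothetical. It assumes that PC denotes the Pearson correlation coefficient between two SPND columns, that SPNDs lie along columns, and that a singleton simply takes over the cluster id of its nearest neighbour.

```python
import numpy as np
from collections import Counter

def merge_singleton_sketch(kcluster, mat):
    """Reassign each singleton SPND to the cluster of its nearest neighbour."""
    data = np.asarray(mat, dtype=float)
    merged = list(kcluster)
    # Dissimilarity S = 1 - |PC|, with PC assumed to be the Pearson correlation
    dist = 1.0 - np.abs(np.corrcoef(data, rowvar=False))
    np.fill_diagonal(dist, np.inf)          # a sensor is not its own neighbour
    counts = Counter(kcluster)
    for idx, cluster_id in enumerate(kcluster):
        if counts[cluster_id] == 1:         # this SPND is alone in its cluster
            nearest = int(np.argmin(dist[idx]))
            merged[idx] = merged[nearest]
    return merged

t = np.arange(6.0)
alt = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
# Columns: two trend-like sensors, two alternating sensors, one noisy trend sensor
mat = np.column_stack([t, t ** 2, alt, 2 * alt, t + 0.1 * alt])
merged = merge_singleton_sketch([0, 0, 1, 1, 2], mat)
print(merged)
```

The last sensor is the only member of cluster 2; it is most strongly correlated with the trend-like sensors, so it is moved into cluster 0.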