8. Usage example

This section walks through a quick clustering example. The steps are performed in the following order:

8.1. Demo mode

In demo mode, no clustering is performed. This mode searches for previously stored cluster data on disk (default: current directory) and displays the cluster information.

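>>> import numpy as np
>>> from kctools import loadKCFromDisk
>>> kcluster = loadKCFromDisk(debug=False)
>>> # Only display the data if a stored cluster array was actually found
>>> if isinstance(kcluster, (np.ndarray, np.generic)):
...     prettyPrint(kcluster)  # display helper, assumed to be in scope
...     exit()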

To enter demo mode, set demoFlag to True.

8.2. Data fetching from database

Database operations are defined in the db_connector module, which is also where the database backend can be changed. Here is a simple usage demo.

Example

Fetching data from the sensor_co table in the EXPeriment database:

>>> from db_connector import fetchRAWData
>>> myMat, myLabels = fetchRAWData(debug=False,
...                                idx_start=0,
...                                idx_end=1000,
...                                db_host='localhost',
...                                dbase='EXPeriment',
...                                user='analyst',
...                                table='sensor_co',
...                                u_pass='*******',
...                                )
>>> print(myMat.shape)

8.3. Elimination of faulty sensors

It is possible that some sensors are faulty or non-functional. In our work, each sensor had a minimum threshold, or sane output level. If a sensor's output stayed below this threshold for a very long time, we concluded that the sensor was faulty.
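
A minimal sketch of this rule, assuming the readings are stored with one row per sensor. The function name find_faulty_sensors, the threshold value, and the run length max_low_run are hypothetical and chosen only for illustration; the actual values used are not stated here.

```python
import numpy as np

def find_faulty_sensors(data, threshold, max_low_run):
    """Return a boolean array that is True for every sensor (row) whose
    output stays below `threshold` for at least `max_low_run`
    consecutive samples."""
    faulty = np.zeros(data.shape[0], dtype=bool)
    for i, row in enumerate(data):
        run = 0  # length of the current below-threshold streak
        for below in (row < threshold):
            run = run + 1 if below else 0
            if run >= max_low_run:
                faulty[i] = True
                break
    return faulty

# Hypothetical readings: the second sensor is stuck near zero,
# the third one only dips once.
readings = np.array([[5.0, 6.0, 5.5, 6.1, 5.8],
                     [0.1, 0.0, 0.1, 0.0, 0.1],
                     [5.2, 0.0, 5.9, 6.0, 5.7]])

faulty = find_faulty_sensors(readings, threshold=1.0, max_low_run=3)
healthy = readings[~faulty]  # keep only the rows of healthy sensors
print(faulty)
```

A single low sample does not mark a sensor as faulty; only a sustained run does, which matches the "very long time" condition above.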

8.4. Filtering of sensor data

One can pre-process the data to improve clustering performance. One example is when different sensor types show different phase (lead/lag) characteristics. Some elementary filtering can then make the sensor data compatible.
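
As one possible elementary filter, the sketch below smooths each sensor with a centered moving average. The function name moving_average and the window length are assumptions made for illustration; which filter was actually applied is not specified here.

```python
import numpy as np

def moving_average(data, window=5):
    """Smooth each row (sensor) with a centered moving average.

    Samples near the edges are averaged with implicit zeros, so the
    first and last few values are attenuated."""
    kernel = np.ones(window) / window
    # mode='same' keeps each output row the same length as the input row
    return np.array([np.convolve(row, kernel, mode='same') for row in data])

raw = np.array([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
                [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0]])

smoothed = moving_average(raw, window=3)
print(smoothed.shape)
```

A symmetric kernel with an odd window length is used so that the smoothing itself does not shift the signal in time.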

8.5. Masking

Healthy sensors can sometimes generate faulty results. Such unreasonable results can also be eliminated. This is the purpose of masking: to remove values that are not reasonable.
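
A minimal sketch of building such a mask, assuming that "reasonable" means lying inside a plausible range [lo, hi]. The function name build_mask and both bounds are hypothetical. The sketch follows the Pycluster convention in which a 0 in the mask marks the corresponding value as missing and a 1 marks it as valid.

```python
import numpy as np

def build_mask(data, lo, hi):
    """Return an integer mask with 1 where lo <= value <= hi and the
    value is finite, and 0 everywhere else."""
    valid = (data >= lo) & (data <= hi) & np.isfinite(data)
    return valid.astype(int)

# Hypothetical readings with two implausible values (999.0 and -50.0)
readings = np.array([[20.5, 21.0, 999.0],
                     [19.8, -50.0, 20.1]])

mask = build_mask(readings, lo=0.0, hi=100.0)
print(mask)
```

The resulting array has the same shape as the data matrix, so it can be passed alongside it as the mask argument of the clustering call.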

8.6. Mean centering and scaling

A mean-centered and scaled element is expressed as

\(\large {\tilde{s}_{ij}=\displaystyle \frac{s_{ij} - \bar{s}_i}{\sqrt {\sigma_i^2}}}\)

where

\(\bar{s}_i\) : Arithmetic mean of the \(i^{th}\) sensor
\({\sigma_i^2}\) : Variance of the \(i^{th}\) sensor
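
The formula above can be sketched in NumPy as follows, assuming each row of the matrix holds the readings of one sensor, the population variance is used, and no sensor is constant (which would give a zero standard deviation). The function name standardize_rows is hypothetical.

```python
import numpy as np

def standardize_rows(data):
    """Mean-center each row (sensor) and scale it to unit variance."""
    mean = data.mean(axis=1, keepdims=True)  # per-sensor arithmetic mean
    std = data.std(axis=1, keepdims=True)    # square root of the variance
    return (data - mean) / std

s = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]])

s_tilde = standardize_rows(s)
print(s_tilde.mean(axis=1))  # approximately zero for every sensor
print(s_tilde.std(axis=1))   # approximately one for every sensor
```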

8.7. Data combining

NumPy provides the function concatenate for joining two matrices. For column-wise concatenation (axis=1), the matrices must have the same number of rows.

One can do the following:

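from pylab import randn
import numpy as np

# Two random matrices with the same number of rows (5)
A = np.array(randn(10)).reshape(5, 2)
B = np.array(randn(15)).reshape(5, 3)

# axis=1 joins the matrices side by side (column-wise)
combinedMat = np.concatenate((A, B), axis=1)

print(combinedMat.shape)  # (5, 5)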

For row-wise concatenation of two matrices, change axis to 0; the matrices must then have the same number of columns.
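
A short sketch of the row-wise case, using matrix shapes chosen only for illustration:

```python
import numpy as np

# Two random matrices with the same number of columns (4)
A = np.random.randn(2, 4)
B = np.random.randn(3, 4)

# axis=0 stacks B below A (row-wise)
stacked = np.concatenate((A, B), axis=0)

print(stacked.shape)  # (5, 4)
```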

8.8. Clustering

The clustering function kcluster is provided by the Pycluster library and takes the following inputs:

Pycluster.kcluster(mat, k, mask, weights, transpose, npass, method, distance, cluster)

mat : Data to be clustered, with n rows and m columns.
k : Number of clusters.
mask : Mask marking faulty (missing) readings, with the same shape as mat. In Pycluster, a 0 at position (i, j) marks mat[i][j] as missing.
weights : Vector of weights used in the distance calculations. If rows are clustered, len(weights) must equal m; if columns are clustered, len(weights) must equal n.
transpose : Boolean flag. If 0, rows are clustered; if 1, columns are clustered.
npass : Number of times the clustering algorithm is run. If npass > 0, each run starts from a different random initial clustering.
method : A single-character string, either 'a' or 'm'. With 'a', the arithmetic mean is used to compute the cluster centers; with 'm', the median is used.
distance : A single-character string selecting the distance function, for example 'a' for the absolute Pearson correlation coefficient.
cluster : When npass == 0, these cluster assignments are used as the initial clustering. There is no way of selecting the seeds manually, though.
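
To make the 'a' distance option more concrete, the sketch below computes an absolute-Pearson-correlation distance, d = 1 - |r|, between two vectors with NumPy. It illustrates the measure only and is not Pycluster's own implementation; the function name abs_pearson_distance is hypothetical.

```python
import numpy as np

def abs_pearson_distance(x, y):
    """Distance based on the absolute Pearson correlation: 1 - |r|."""
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient
    return 1.0 - abs(r)

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([8.0, 6.0, 4.0, 2.0])    # perfectly anti-correlated with a
c = np.array([1.0, -1.0, -1.0, 1.0])  # uncorrelated with a

print(abs_pearson_distance(a, b))  # close to 0: treated as similar
print(abs_pearson_distance(a, c))  # close to 1: treated as dissimilar
```

With this measure, sensors that move in exactly opposite directions end up close together, which is worth keeping in mind when interpreting the resulting clusters.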

8.9. Saving data to disk

To save data to disk, use the utility function saveKCToDisk(). It is meant to be used together with loadKCFromDisk(), which reads the saved data back.

Here is a demo:

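from kcTools import saveKCToDisk, loadKCFromDisk
from pylab import randn
import numpy as np

# Random 150 x 10 matrix standing in for cluster data
mat = np.array(randn(1500)).reshape(150, 10)

# Write the matrix to disk, then read it back
saveKCToDisk(mat)
mymat = loadKCFromDisk()

print("mat: {}\nmymat: {}".format(mat, mymat))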