Usage example
=============

.. contents:: 
    :depth: 2

A quick walkthrough of clustering. The sections below follow the order in which the steps are performed.

Demo mode
----------
In demo mode **no** clustering is performed. This mode searches for previously stored cluster data on disk (default: the current directory) and displays the cluster information.

>>> import numpy as np
>>> from kcTools import loadKCFromDisk
>>> kcluster = loadKCFromDisk(debug=False)
>>> if isinstance(kcluster, (np.ndarray, np.generic)):
...     prettyPrint(kcluster)
...     exit()

To enter demo mode, set `demoFlag` to ``True``.


Data fetching from database  
-----------------------------
Database operations are defined in the `db connector <dbconnector.html>`_ module.
The database backend can be changed in that module. Here is a simple usage demo.

**Example**

Fetching data from the `sensor_co` table in the `EXPeriment` database:

>>> from db_connector import fetchRAWData
>>> myMat, myLabels = fetchRAWData(debug=False,
...                                idx_start=0,
...                                idx_end=1000,
...                                db_host='localhost',
...                                dbase='EXPeriment',
...                                user='analyst',
...                                table='sensor_co',
...                                u_pass='*******',
...                                )
>>> print(myMat.shape)
   

Elimination of faulty sensors
--------------------------------------------------------------
Some sensors may be faulty or non-functional.
In our work, each sensor had a minimum threshold, that is, a sane output level. If a sensor's output stayed below this threshold for a very long time, we concluded that the sensor was faulty.
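
The rule above can be sketched as follows. This is only an illustration, not the library's own routine: the function name ``find_faulty_sensors``, the parameters ``threshold`` and ``max_run``, and the layout (one row per sensor) are assumptions made for this example::

  import numpy as np

  def find_faulty_sensors(data, threshold, max_run):
      """Return indices of rows (sensors) whose output stays below
      `threshold` for more than `max_run` consecutive samples."""
      faulty = []
      for idx, row in enumerate(data):
          run = longest = 0
          for below in (row < threshold):
              # Extend the current run, or reset it on a sane reading.
              run = run + 1 if below else 0
              longest = max(longest, run)
          if longest > max_run:
              faulty.append(idx)
      return faulty

  # Three hypothetical sensors (rows) with eight samples each.
  readings = np.array([
      [5.0, 5.2, 4.9, 5.1, 5.0, 5.3, 4.8, 5.1],   # healthy
      [5.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 5.0],   # below threshold for 6 samples
      [5.0, 0.1, 5.1, 0.0, 5.2, 0.1, 5.0, 5.1],   # short dips only
  ])

  faulty = find_faulty_sensors(readings, threshold=1.0, max_run=4)
  healthy = np.delete(readings, faulty, axis=0)   # drop the faulty rows

Only the second sensor is dropped here; short dips below the threshold are left for the masking step.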


Filtering of sensor data  
--------------------------------
The data can be pre-processed to *improve* the clustering performance.
One example is when different sensor types show different phase (lead/lag) characteristics. Some *elementary filtering* can then make the sensor data compatible.
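
The text does not say which filter was used, so the following is only one possible elementary filter: a centred moving average built on ``numpy.convolve``. The function name ``moving_average`` and the window length are assumptions made for this sketch::

  import numpy as np

  def moving_average(signal, window):
      """Smooth a 1-D signal with a centred moving average.

      `window` should be odd so that the kernel is symmetric and the
      smoothed signal is not shifted in time."""
      kernel = np.ones(window) / window
      # mode='same' keeps the output as long as the input; the edges
      # are computed as if the signal were padded with zeros.
      return np.convolve(signal, kernel, mode='same')

  raw = np.array([1.0, 1.0, 1.0, 4.0, 1.0, 1.0, 1.0])
  smooth = moving_average(raw, window=3)

The spike at the centre is spread over its neighbours, while the length of the series is unchanged, so filtered and unfiltered sensors can still be combined sample by sample.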


Masking  
--------
Healthy sensors can sometimes produce faulty readings. Such unreasonable values can be excluded as well. This is the purpose of masking: to remove values that are not reasonable.
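
A minimal sketch of building such a mask with NumPy. The plausible range (``low``, ``high``) and the sample readings are hypothetical, and the sketch assumes the Pycluster convention that a ``0`` entry in the mask marks a value to be ignored::

  import numpy as np

  readings = np.array([
      [20.5, 21.0, -999.0, 20.8],
      [19.9, 500.0, 20.1, 20.3],
  ])
  low, high = 0.0, 100.0   # hypothetical range of reasonable values

  # 1 where the reading is reasonable, 0 where it should be ignored.
  mask = ((readings >= low) & (readings <= high)).astype(int)

  print(mask)

The mask has the same shape as the data matrix, so it can be passed along with it to the clustering step.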

Mean centering and scaling  
--------------------------

The mean-centered and scaled element is expressed as

:math:`\large {\tilde{s}_{ij}=\displaystyle \frac{s_{ij} - \bar{s}_i}{\sqrt {\sigma_i^2}}}`

where

	| :math:`s_{ij}`      : :math:`j^{th}` reading of the :math:`i^{th}` sensor
	| :math:`\bar{s}_i`   : Arithmetic mean of the :math:`i^{th}` sensor
	| :math:`\sigma_i^2`  : Variance of the :math:`i^{th}` sensor
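
The expression can be sketched in NumPy as follows, assuming one row per sensor and the population variance (``ddof=0``). The function name ``mean_center_and_scale`` is chosen for this example and is not the library's own routine::

  import numpy as np

  def mean_center_and_scale(data):
      """Subtract each row's mean and divide by its standard deviation."""
      mean = data.mean(axis=1, keepdims=True)
      std = data.std(axis=1, keepdims=True)   # square root of the variance
      return (data - mean) / std

  sensors = np.array([[1.0, 2.0, 3.0],
                      [10.0, 20.0, 60.0]])
  scaled = mean_center_and_scale(sensors)

After scaling, every sensor has zero mean and unit variance, whatever its original range. A sensor with constant output has zero variance and would cause a division by zero, so such sensors need to be removed beforehand.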

.. Python source reference :ref:`meancentering`

Data combining  
---------------
NumPy provides the function `concatenate` for joining two matrices. To join them column-wise (``axis=1``), both matrices must have the same number of rows.

One can do the following::

  import numpy as np

  # Two random matrices with the same number of rows
  A = np.random.randn(5, 2)
  B = np.random.randn(5, 3)

  # Join them side by side (column-wise)
  combinedMat = np.concatenate((A, B), axis=1)

  print(combinedMat.shape)


This prints::

  (5, 5)
 
For row-wise concatenation of two matrices, change ``axis`` to ``0``. The matrices must then have the same number of columns.
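
A short sketch of the row-wise case, using two arbitrary random matrices that share the same number of columns::

  import numpy as np

  A = np.random.randn(5, 2)
  C = np.random.randn(3, 2)

  # Stack C below A (row-wise)
  stacked = np.concatenate((A, C), axis=0)

  print(stacked.shape)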


Clustering  
----------
The clustering function `kcluster` is provided by the `Pycluster <http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm#pycluster>`_ library and takes the following inputs:

.. function:: Pycluster.kcluster(mat,k,mask,weights,transpose,npass,method,distance,cluster)

.. glossary::

  mat
    Data to be clustered, with `n` *rows* and `m` *columns*.
  k
    Number of clusters.
  mask
    Boolean mask of faulty readings.
  weights
    Vector of weights used for clustering (distance calculations). If rows are clustered, ``len(weights)`` must equal `m`; if columns are clustered, ``len(weights)`` must equal `n`.
  transpose
    Boolean flag. If ``0``, rows are clustered; if ``1``, columns are clustered.
  npass
    The number of times the clustering algorithm is run. If ``npass > 0``, each run of the algorithm uses different (random) initial seeds.
  method
    A single-character string, either ``'a'`` or ``'m'``. When ``'a'`` is chosen, the `arithmetic mean` is used internally by the algorithm (for distance calculations); when ``'m'`` is chosen, the `median` is used.
  distance
    A single-character string selecting the distance function, for example ``'a'`` for the absolute Pearson correlation coefficient.
  cluster
    When ``npass == 0``, these clusters are used for the initial clustering. There is no way of selecting the seeds manually, though.
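
A hedged sketch of a call, with the arguments passed positionally in the order listed above. The test data are synthetic, the distance code ``'e'`` (Euclidean) and the three return values are taken from the Pycluster documentation as best recalled, and the import is guarded so that the data preparation still runs when Pycluster is not installed::

  import numpy as np

  try:
      from Pycluster import kcluster
  except ImportError:      # Pycluster not installed: skip the call below
      kcluster = None

  rng = np.random.RandomState(0)
  # Two well-separated groups of 10 rows each, 4 columns per row.
  mat = np.vstack((rng.randn(10, 4), rng.randn(10, 4) + 10.0))
  mask = np.ones(mat.shape, dtype=int)   # assumed convention: 1 = reading is valid
  weights = np.ones(mat.shape[1])        # rows are clustered, so len(weights) == m

  if kcluster is not None:
      # k=2 clusters, cluster rows (transpose=0), 10 passes,
      # arithmetic mean ('a'), Euclidean distance ('e').
      clusterid, error, nfound = kcluster(mat, 2, mask, weights, 0, 10, 'a', 'e')
      print(clusterid)

Here ``clusterid`` is expected to hold one cluster index per row of ``mat``.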


Saving data to disk
-------------------
To save data to disk, use the utility function :func:`~kClusterLib.kcTools.saveKCToDisk()`. It is meant to be used together with :func:`~kClusterLib.kcTools.loadKCFromDisk()`, which reads the data back.

Here is a demo::

  import numpy as np
  from kcTools import saveKCToDisk, loadKCFromDisk

  # 150 x 10 matrix of random test data
  mat = np.random.randn(150, 10)

  saveKCToDisk(mat)           # write the matrix to disk
  mymat = loadKCFromDisk()    # read it back

  print("mat: {}\nmymat: {}".format(mat, mymat))
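
Printing two large matrices makes it hard to see whether the round trip succeeded. The sketch below shows an explicit equality check instead. It uses NumPy's own ``np.save`` and ``np.load`` as a stand-in, with a temporary file path chosen for this example, and does not describe how `saveKCToDisk` is implemented::

  import os
  import tempfile
  import numpy as np

  mat = np.random.randn(150, 10)

  # Hypothetical file location inside a fresh temporary directory
  path = os.path.join(tempfile.mkdtemp(), 'kcluster.npy')
  np.save(path, mat)          # write the matrix to disk
  restored = np.load(path)    # read it back

  # True when every element survived the round trip unchanged
  print(np.array_equal(mat, restored))

The same ``np.array_equal`` check can be applied to the matrices passed to and returned by the library's own save and load functions.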