Clustal Omega  1.2.3
Data Structures | Macros | Enumerations | Functions
mbed.c File Reference
#include <assert.h>
#include <float.h>
#include "mbed.h"
#include "pair_dist.h"
#include "symmatrix.h"
#include "ktuple_pair.h"
#include "tree.h"
#include "util.h"
#include "progress.h"
#include "queue.h"
#include "log.h"
#include "kmpp/KMeans.h"
Include dependency graph for mbed.c:

Data Structures

struct  bisecting_kmeans_result_t
 

Macros

#define TIMING   0
 
#define FULL_WITHIN_CLUSTER_DISTANCES   1
 
#define COMPUTE_WITHIN_SUBCLUSTER_AVERAGE   0
 
#define USE_KMEANS_LLOYDS   0
 
#define log2(x)   (log(x) / 0.69314718055994530942)
 
#define NUMBER_OF_SEEDS(n)   pow(log2(((double)n)), 2)
 
#define SEED_SELECTION   SELECT_SEEDS_BY_LENGTH
 
#define USE_EUCLIDEAN_DISTANCE   1
 
#define PRINT_CLUSTER_DISTRIBUTION   0
 
#define TRACE   0
 

Enumerations

enum  SEED_SELECTION_TYPE { SELECT_SEEDS_RANDOMLY, SELECT_SEEDS_BY_LENGTH }
 

Functions

void FreeKMeansResult (bisecting_kmeans_result_t **prResult_p)
 Free KMeans result structure. More...
 
void NewKMeansResult (bisecting_kmeans_result_t **prKMeansResult_p)
 Allocate new KMeans result. More...
 
double EuclDist (const double *v1, const double *v2, const int dim)
 Calculate the euclidean distance between two vectors. More...
 
double CosDist (const double *v1, const double *v2, const int dim)
 Calculate the cosine distance between two vectors. More...
 
int SeqToVec (double **ppdSeqVec, mseq_t *prMSeq, int *piSeeds, const int iNumSeeds, const int iPairDistType)
 convert sequences into mbed-like (distance) vector representation. Seeds (prMSeq sequence indices) have to be picked before More...
 
int SeedSelection (int *piSeeds, int iNumSeeds, int iSelectionMethod, mseq_t *prMSeq)
 Select seeds to be used from an prMSeq. More...
 
void BisectingKmeans (bisecting_kmeans_result_t **prKMeansResult_p, const int iNObjs, const int iDim, double **ppdVectors, const int iMinRequiredObjsPerCluster, const int iMaxAllowedObjsPerCluster, char ***ppcClusterSplits_p)
 Bisecting K-Means clustering. Repeatedly calls K-Means with a K of 2 until no cluster has more than iMaxAllowedObjsPerCluster. More...
 
int Mbed (tree_t **prMbedTree_p, mseq_t *prMSeq, const int iPairDistType, const char *pcGuidetreeOut, int iClustersizes, const char *pcClusterFile)
 From scratch reimplementation of mBed: Blackshields et al. (2010); PMID 20470396. More...
 

Macro Definition Documentation

#define COMPUTE_WITHIN_SUBCLUSTER_AVERAGE   0
#define FULL_WITHIN_CLUSTER_DISTANCES   1
#define log2 (   x)    (log(x) / 0.69314718055994530942)
#define NUMBER_OF_SEEDS (   n)    pow(log2(((double)n)), 2)
#define PRINT_CLUSTER_DISTRIBUTION   0
#define SEED_SELECTION   SELECT_SEEDS_BY_LENGTH
#define TIMING   0
#define TRACE   0
#define USE_EUCLIDEAN_DISTANCE   1
#define USE_KMEANS_LLOYDS   0

Enumeration Type Documentation

Enumerator
SELECT_SEEDS_RANDOMLY 
SELECT_SEEDS_BY_LENGTH 

Function Documentation

void BisectingKmeans ( bisecting_kmeans_result_t **  prKMeansResult_p,
const int  iNObjs,
const int  iDim,
double **  ppdVectors,
const int  iMinRequiredObjsPerCluster,
const int  iMaxAllowedObjsPerCluster,
char ***  ppcClusterSplits_p 
)

Bisecting K-Means clustering. Repeatedly calls K-Means with a K of 2 until no cluster has more than iMaxAllowedObjsPerCluster.

Parameters
[out]prKMeansResult_pResult of Bisecting KMeans. Will be allocated here. Caller has to free. See
See also
FreeKMeansResult()
Parameters
[in]iNObjsNumber of objects/sequences to cluster
[in]iDimDimensionality of input data
[in]ppdVectorseach row holds iDim points for this object's coordinates
[in]iMinRequiredObjsPerClusterMinimum number of objects per Cluster (inclusive)/
[in]iMaxAllowedObjsPerClusterMaximum number of objects per Cluster (inclusive). Function returns once no cluster contains more then this number of objects. Soft limit!
Note
Convoluted code. Could use some restructuring. My apologies. AW
double CosDist ( const double *  v1,
const double *  v2,
const int  dim 
)

Calculate the cosine distance between two vectors.

Parameters
[in]v1First vector with dim dimensions
[in]v2Second vector with dim dimensions
[in]dimDimensionality of v1 and v2
Returns
cosine distance as double
Note
Could probably be inlined. Also perfect for SSE
double EuclDist ( const double *  v1,
const double *  v2,
const int  dim 
)

Calculate the euclidean distance between two vectors.

Parameters
[in]v1First vector with dim dimensions
[in]v2Second vector with dim dimensions
[in]dimDimensionality of v1 and v2
Returns
euclidean distance as double
Note
Could probably be inlined. Also perfect for SSE
void FreeKMeansResult ( bisecting_kmeans_result_t **  prResult_p)

Free KMeans result structure.

Parameters
[out]prResult_pK-Means result to free
See also
NewKMeansResult()
int Mbed ( tree_t **  prMbedTree_p,
mseq_t prMSeq,
const int  iPairDistType,
const char *  pcGuidetreeOut,
int  iClustersizes,
const char *  pcClusterFile 
)

From scratch reimplementation of mBed: Blackshields et al. (2010); PMID 20470396.

Idea is a follows:

  • convert sequences into vectors of distances
  • cluster the vectors using k-means
  • cluster each of the k clusters using upgma (used cached distances from above?)
  • join the sub-clusters to create on tree (use UPGMA on k-means medoids)
Parameters
[out]prMbedTree_pCreated upgma tree. will be allocated here. use FreeMuscleTree() to free
[in]prMSeqMultiple sequences
[in]iPairDistTypeDistance measure for pairwise alignments
[in]pcGuidetreeOutPassed down to GuideTreeUpgma()
Note
: if the number of sequences is smaller than MAX_ALLOWED_SEQ_PER_PRECLUSTER then there's no need to do the subclustering first. In fact it costs some extra time. However, it's insignificant and for simplicities sake we don't do any special checks
Returns
Zero on success, non-zero on error
void NewKMeansResult ( bisecting_kmeans_result_t **  prKMeansResult_p)

Allocate new KMeans result.

Parameters
[out]prKMeansResult_pK-Means result to free
See also
NewKMeansResult()
int SeedSelection ( int *  piSeeds,
int  iNumSeeds,
int  iSelectionMethod,
mseq_t prMSeq 
)

Select seeds to be used from an prMSeq.

Parameters
[out]piSeedsWill store the indices of prMSeqs seqs used to be as seeds here. Must be preallocated.
[in]iNumSeedsNumber of seeds to be picked
[in]iSelectionMethodSeed selection method to be used
[in]prMSeqThe prMSeq structure to pick sequences from
Returns
: Non-zero on failure
int SeqToVec ( double **  ppdSeqVec,
mseq_t prMSeq,
int *  piSeeds,
const int  iNumSeeds,
const int  iPairDistType 
)

convert sequences into mbed-like (distance) vector representation. Seeds (prMSeq sequence indices) have to be picked before

Parameters
[out]ppdSeqVecPointer to preallocated matrix of size nseqs x iSeeds
[in]prMSeqSequences which are to be converted
[in]piSeedsArray of sequences in prMSeq which are to be used as seeds.
[in]iNumSeedsNumber of seeds/elements in piSeeds
[in]iPairDistTypeType of pairwise distance comparison