```python
import seaborn as sns

df0 = sns.load_dataset("iris")
df0.head()
```

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Although `pdist` from SciPy can compute a condensed (1D) self-distance matrix row by row with a customized function, that function only receives raw vectors without their keys. In this module we build functions that can take each vector's keys into account when computing the distance between two vectors, so the inputs are a `pd.DataFrame` for the data table and `pd.Series` objects for the vectors.
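As a sketch of the idea (the function names `euclidean_keyed` and `pdist_df` are hypothetical, not the module's actual API), a `pdist`-style routine over a DataFrame might look like:

```python
import itertools
import numpy as np
import pandas as pd

def euclidean_keyed(a: pd.Series, b: pd.Series) -> float:
    # rows arrive as pd.Series, so their index exposes the column names
    # ("keys"); here we simply restrict to numeric keys, but any
    # key-aware logic could go in a function like this
    keys = [k for k in a.index if isinstance(a[k], (int, float))]
    diff = a[keys].to_numpy(dtype=float) - b[keys].to_numpy(dtype=float)
    return float(np.sqrt((diff ** 2).sum()))

def pdist_df(df: pd.DataFrame, dist) -> np.ndarray:
    # condensed (1D) distance vector over all row pairs, like scipy's pdist
    rows = [df.iloc[i] for i in range(len(df))]
    return np.array([dist(a, b) for a, b in itertools.combinations(rows, 2)])
```

For `n` rows this returns `n * (n - 1) / 2` distances, in the same pair order SciPy's `pdist` uses.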
Output: the condensed distance vector for the 150 iris rows (150 · 149 / 2 = 11175 pairs):

```
array([0.53851648, 0.50990195, 0.64807407, ..., 0.6164414 , 0.64031242,
       0.76811457], shape=(11175,))
```
Compute the 1D (condensed) distance vector, like `pdist` from SciPy, but for a DataFrame with column names.
```
100%|██████████| 3/3 [00:00<00:00, 8823.92it/s]
array([2, 4, 2])
```
Parallel computing accelerates the calculation when a DataFrame contains many flattened PSSMs:
Compute the 1D distance for each row pair in a DataFrame in parallel, given a distance function.
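A minimal sketch of the parallel variant (the name `pdist_df_parallel` is hypothetical). A process pool would be the usual choice for CPU-bound distance functions; a thread pool is used here only to keep the sketch self-contained:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def pdist_df_parallel(df: pd.DataFrame, dist, max_workers: int = 4) -> np.ndarray:
    # enumerate all row-index pairs and fan the distance calls out
    # across workers; executor.map preserves pair order, so the result
    # matches the condensed ordering of the serial version
    pairs = list(itertools.combinations(range(len(df)), 2))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        out = list(executor.map(lambda ij: dist(df.iloc[ij[0]], df.iloc[ij[1]]), pairs))
    return np.array(out)
```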
Get the linkage matrix `Z` from a DataFrame of PSSMs.
```
100%|██████████| 150/150 [00:00<00:00, 539.11it/s]
CPU times: user 274 ms, sys: 5.29 ms, total: 279 ms
Wall time: 280 ms
```

```
array([[1.01e+02, 1.42e+02, 0.00e+00, 2.00e+00],
       [7.00e+00, 3.90e+01, 1.00e-01, 2.00e+00],
       [0.00e+00, 1.70e+01, 1.00e-01, 2.00e+00],
       [9.00e+00, 3.40e+01, 1.00e-01, 2.00e+00],
       [1.28e+02, 1.32e+02, 1.00e-01, 2.00e+00]])
```
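As a rough sketch of this step (the helper name `df_to_linkage` is hypothetical, and plain SciPy `pdist` stands in for the keyed distance for brevity): each row of the resulting `Z` records the two clusters merged, the merge distance, and the size of the new cluster.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def df_to_linkage(df: pd.DataFrame, method: str = "average") -> np.ndarray:
    # condensed pairwise distances over the numeric rows, then
    # agglomerative clustering; Z has shape (n_rows - 1, 4)
    condensed = pdist(df.to_numpy(dtype=float))
    return linkage(condensed, method=method)
```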
`__call__`: call the object itself as a function.
Get flat cluster assignments from the hierarchical-clustering linkage matrix `Z`.
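A minimal illustration with SciPy's `fcluster` on a toy linkage matrix; the threshold `t=2` and `criterion="maxclust"` are illustrative choices, not the module's defaults:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# toy data: two well-separated groups on a line
X = np.array([[0.0], [0.1], [5.0], [5.1]])
Z = linkage(pdist(X), method="average")

# cut the tree into a fixed number of flat clusters ("maxclust");
# a distance threshold via criterion="distance" works the same way
labels = fcluster(Z, t=2, criterion="maxclust")
```

`labels` holds one integer cluster id per original observation.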