Estimating the number of clusters automatically with DBSCAN

When we discussed the k-means algorithm, we saw that we had to supply the number of clusters as one of the input parameters. In the real world, we won’t have this information available. We can certainly sweep the parameter space to find the optimal number of clusters using the silhouette coefficient score, but this is an expensive process! A method that returns the number of clusters in our data would be an excellent solution to the problem. DBSCAN does just that for us.

Getting ready

We will perform a DBSCAN analysis using the sklearn.cluster.DBSCAN function. We will use the same…
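As a rough sketch of what such an analysis looks like, the following runs DBSCAN on synthetic blobs (made up here for illustration; the chapter’s actual dataset differs) and recovers the number of clusters without being told it up front:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data: three dense blobs (not the chapter's dataset)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# eps is the neighbourhood radius; min_samples is the density threshold
model = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = model.labels_  # -1 marks noise points

# Number of clusters found, excluding noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

Unlike k-means, the cluster count falls out of the density parameters rather than being an input.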

Measuring the Performance of Clustering Algorithms

We have built several clustering algorithms, but we haven’t yet measured their performance.

  1. In supervised learning, the predicted values are compared with the original
    labels to calculate accuracy.
  2. In contrast, in unsupervised learning, we have no labels, so we need to find a way to measure the performance of our algorithms.

Getting ready

A good way to measure a clustering algorithm is by seeing how well its clusters are separated. Are the clusters well separated? Are the data points within each cluster tight enough?

We need a metric that can quantify this behaviour. We will use a metric called the silhouette…
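A minimal sketch of how the silhouette score can drive the choice of the number of clusters (the data here is synthetic, assumed for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Score each candidate number of clusters; higher is better (range -1 to 1)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

The k with the highest score is the one whose clusters are both tight and well separated.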

Understanding Hierarchical Clustering

  1. Hierarchical clustering refers to a set of clustering algorithms that build clusters by successively splitting or merging them; the resulting hierarchy is represented as a tree.
  2. Hierarchical clustering algorithms can be either bottom-up or top-down. Now, what does this mean?
  3. In bottom-up algorithms, each data point is treated as a separate cluster
    with a single object. These clusters are then successively merged until all the clusters are merged into a single giant cluster. This is called agglomerative clustering.
  4. On the other hand, top-down algorithms start with a giant cluster and successively split these clusters until individual data points are reached…
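The bottom-up (agglomerative) variant described in steps 3 and 4 can be sketched with scikit-learn’s AgglomerativeClustering; the data below is synthetic, for illustration only:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data: three blobs
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Bottom-up clustering: each point starts as its own cluster, and the
# closest pairs are merged until only n_clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print(sorted(set(labels)))
```

The `linkage` parameter controls how inter-cluster distance is measured when deciding which clusters to merge next.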


Clustering data using the k-means algorithm


  1. Unsupervised learning is a paradigm in machine learning where we build models without relying on labelled training data.
  2. When data is labelled in some way, learning algorithms can examine it and learn to categorize points based on those labels.
  3. In the world of unsupervised learning, we don’t have this opportunity! These algorithms are used when we want to find subgroups within datasets using a similarity metric.
  4. In unsupervised learning, information is extracted from the data automatically, without prior knowledge of the content to be analyzed.
  5. In unsupervised learning, there is no…
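As a quick illustration of the points above, k-means groups unlabelled points purely by similarity; the blobs here are synthetic, standing in for real unlabelled data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabelled data: the labels returned by make_blobs are discarded
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit k-means with the number of clusters given up front as an input parameter
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.shape)  # one centroid per cluster
print(len(set(kmeans.labels_)))
```

Each point is assigned to its nearest centroid; the centroids are then recomputed, and the two steps alternate until convergence.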

A combination of different approaches leads to better results

A combination of different approaches leads to better results: this statement holds in many aspects of life, and it applies to machine-learning algorithms as well.

Stacking is the process of combining the predictions of various machine learning models, typically by training a meta-model on their outputs. The technique is due to David H. Wolpert, an American mathematician, physicist, and computer scientist.

We will learn how to implement a stacking method.

Getting ready

  1. We start by importing the libraries:
from heamy.dataset import Dataset
from heamy.estimator import Regressor
from heamy.pipeline import ModelsPipeline
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
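The imports above use heamy and the Boston housing dataset; note that `load_boston` has been removed from recent scikit-learn releases. As an alternative sketch of the same stacking idea, scikit-learn’s own `StackingRegressor` can be used on synthetic data (substituted here so the example is self-contained):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for the Boston housing set
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Base learners produce out-of-fold predictions; a meta-learner combines them
stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(n_estimators=50, random_state=0)),
                ('lr', LinearRegression())],
    final_estimator=LinearRegression(),
)
stack.fit(X_train, y_train)
mae = mean_absolute_error(y_test, stack.predict(X_test))
print(mae)
```

The meta-learner sees only the base models’ predictions, so it learns how much to trust each one.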


An interesting application of SVMs is to predict traffic, based on related data.

Getting ready

Download the data file ‘traffic_data.txt’. This dataset counts the number of cars passing by the Los Angeles Dodgers’ home stadium during baseball games. Each line in this file contains comma-separated strings formatted in the following manner:

  1. Day;
  2. Time;
  3. The opponent team;
  4. Whether or not a baseball game is going on;
  5. The number of cars passing by.

How to do it

Let’s see how to estimate the traffic.

  1. Load the data and relevant libraries:
# SVM regressor to estimate traffic
import numpy as np
from sklearn import preprocessing
from sklearn.svm import SVR
input_file = 'traffic_data.txt' …
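Since the features are comma-separated strings, they must be label-encoded before an SVR can consume them. A minimal sketch with a few made-up rows (the real file’s values differ) might look like this:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVR

# Illustrative rows in the file's format:
# day, time, opponent, game-on flag, car count
rows = [
    ['Tuesday', '00:00', 'San Francisco', 'no', '3'],
    ['Tuesday', '00:05', 'San Francisco', 'no', '8'],
    ['Wednesday', '00:00', 'Arizona', 'yes', '25'],
    ['Wednesday', '00:05', 'Arizona', 'yes', '33'],
]
data = np.array(rows)

# Label-encode each string column; purely numeric columns pass through as ints
encoded = np.empty(data.shape, dtype=int)
for i in range(data.shape[1]):
    if data[0, i].isdigit():
        encoded[:, i] = data[:, i].astype(int)
    else:
        encoded[:, i] = LabelEncoder().fit_transform(data[:, i])

# Last column (car count) is the regression target
X, y = encoded[:, :-1], encoded[:, -1]
regressor = SVR(kernel='rbf', C=10.0, epsilon=0.2).fit(X, y)
pred = regressor.predict(X)
```

In practice each column’s encoder should be kept so that new inputs can be transformed consistently at prediction time.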

We will build an SVM to predict the number of people going in and out of a building.

Download the data files building_event_binary.txt and building_event_multiclass.txt from

Getting ready

Let’s understand the data format before we start building the model. Each line in building_event_binary.txt consists of six comma-separated strings. The ordering of these six strings is as follows:

  1. Day;
  2. Date;
  3. Time;
  4. The number of people going out of the building;
  5. The number of people coming into the building;
  6. The output indicating whether or not it’s an event.

The first five strings form the input data, and our task is to predict whether or not an event is going on in the building.

Each line in building_event_multiclass.txt consists of six comma-separated strings…
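The preprocessing follows the same pattern for both files: label-encode the string fields, keep the numeric ones, and fit a classifier. A sketch with invented sample rows (the real data differs) might look like this:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Illustrative rows in the six-field format:
# day, date, time, people out, people in, event flag
rows = [
    ['Monday', '06/01/10', '09:30:00', '5', '12', 'event'],
    ['Monday', '06/01/10', '10:00:00', '2', '3', 'noevent'],
    ['Tuesday', '06/02/10', '09:30:00', '6', '14', 'event'],
    ['Tuesday', '06/02/10', '10:00:00', '1', '2', 'noevent'],
]
data = np.array(rows)

# Encode string columns to integers; numeric columns are used as-is
encoded = np.empty(data.shape, dtype=int)
for i in range(data.shape[1]):
    if data[0, i].isdigit():
        encoded[:, i] = data[:, i].astype(int)
    else:
        encoded[:, i] = LabelEncoder().fit_transform(data[:, i])

# First five fields are features; the last is the event/no-event label
X, y = encoded[:, :-1], encoded[:, -1]
clf = SVC(kernel='rbf', random_state=0).fit(X, y)
```

The multiclass file works the same way; the SVC handles more than two label values with a one-vs-one scheme internally.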


Hyperparameters are important for determining the performance of a classifier.

Getting ready

  1. In machine learning algorithms, the model’s parameters are obtained during the learning process.
  2. In contrast, hyperparameters are set before the learning process begins.
  3. Given these hyperparameters, the training algorithm learns the parameters from the data.

We will extract hyperparameters for a model based on an SVM algorithm using the grid search method.

How to do it

Let’s see how to find optimal hyperparameters:

Data file: download ‘data_multivar.txt’ from here:

  1. We start by importing the libraries:
from sklearn import svm
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import pandas as pd
import utilities

2. Then, we load the data:

input_file = 'data_multivar.txt'…

It would be nice to know the confidence with which we classify unknown data. When a new data point is classified into a known category, we can train the SVM to compute the confidence level of that output as well. A confidence level refers to the probability that the value of a parameter falls within a specified range of values.

Getting ready

We will use an SVM classifier to find the optimal separating boundary for a
dataset of points. In addition, we will measure the confidence level of the results obtained.


Download the file ‘data_multivar.txt’ from

How to do it
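A minimal sketch, on synthetic data, of the two confidence measures scikit-learn exposes for SVMs: `decision_function` gives the signed distance from the boundary, and `predict_proba` (enabled via `probability=True`) gives calibrated class probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data stands in for data_multivar.txt
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# probability=True fits an extra calibration step (Platt scaling),
# which makes predict_proba available
clf = SVC(kernel='rbf', probability=True, random_state=0).fit(X_train, y_train)

# Signed distance from the boundary: larger magnitude means higher confidence
distances = clf.decision_function(X_test[:3])
# Calibrated class-membership probabilities for the same points
probs = clf.predict_proba(X_test[:3])
print(distances)
print(probs)
```

Points far from the boundary get large distances and probabilities close to 0 or 1; points near it are the low-confidence cases.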


Tackling class imbalance

We have dealt with problems where we had a similar number of data points in all our classes. In the real world, we might not get data in such an orderly fashion. Sometimes, the number of data points in one class is much larger than in the others. If this happens, the classifier tends to become biased: the boundary won’t reflect the true nature of the data, simply because there is a big difference in the number of data points between the classes. …
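One common remedy, sketched below on synthetic imbalanced data, is scikit-learn’s `class_weight='balanced'` option, which penalizes mistakes on the rare class more heavily so the boundary is not pulled entirely toward the majority class:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Imbalanced data: roughly 95% of points belong to class 0
X, y = make_classification(n_samples=500, weights=[0.95], random_state=0)

# class_weight='balanced' reweights errors inversely to class frequency,
# so the minority class is not ignored when placing the boundary
clf = SVC(kernel='linear', class_weight='balanced', random_state=0).fit(X, y)
pred = clf.predict(X)
print(np.bincount(pred))  # predictions per class
```

Without the reweighting, a classifier on such data can score high accuracy by predicting the majority class everywhere, which is exactly the bias described above.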

Bhanu Soni

Data Scientist at NCS-IT, UK
