When we discussed the k-means algorithm, we saw that we had to provide the number of clusters as one of the input parameters. In the real world, we often don’t have this information available. We can certainly sweep the parameter space and find the optimal number of clusters using the silhouette coefficient score, but this is an expensive process! A method that returns the number of clusters in our data would be an excellent solution to the problem. DBSCAN does just that for us.

We will perform a DBSCAN analysis using the **sklearn.cluster.DBSCAN** function. We will use the same…
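As a minimal sketch of how such a call can look (the synthetic data and the `eps`/`min_samples` values below are illustrative assumptions, not the recipe's own settings), the estimator is fit on the points and the number of clusters is read off the resulting labels:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data used only for illustration
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# eps (neighbourhood radius) and min_samples are assumed values
model = DBSCAN(eps=0.8, min_samples=5).fit(X)

# Noise points are labelled -1, so exclude them when counting clusters
labels = model.labels_
num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters:", num_clusters)
```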

We have built different clustering algorithms, but haven’t measured their performance.

- **In supervised learning**, the predicted values are compared with the original labels to calculate accuracy.
- **In unsupervised learning**, in contrast, we have no labels, so we need to find a way to measure the performance of our algorithms.

A good way to evaluate a clustering algorithm is to see how well its clusters are separated. **Are the clusters well separated? Are the data points within each cluster packed tightly enough?**

We need a metric that can quantify this behaviour. We will use a metric called the **silhouette…**
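As a hedged sketch of how such a score can be computed with scikit-learn's **silhouette_score** (the k-means setup and data below are illustrative assumptions), higher values, up to 1, indicate tighter and better-separated clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data and clustering; the recipe's own dataset is introduced later
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=0)

for num_clusters in range(2, 7):
    labels = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"Clusters: {num_clusters}, silhouette score: {score:.3f}")
```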

**Hierarchical clustering** refers to a set of clustering algorithms that build tree-like clusters by successively splitting or merging them; the result is represented as a tree.

- Hierarchical clustering algorithms can be either **bottom-up or top-down.** Now, what does this mean?
- In **bottom-up algorithms**, each data point is treated as a separate cluster with a single object. These clusters are then successively merged until all of them form a single giant cluster. **This is called agglomerative clustering.**
- On the other hand, **top-down algorithms** start with a giant cluster and successively split it until individual data points are reached…
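A minimal sketch of the bottom-up (agglomerative) variant using scikit-learn's **AgglomerativeClustering**; the data and parameters are illustrative assumptions rather than the recipe's own:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative data
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Bottom-up (agglomerative) clustering with Ward linkage
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(X)
print("Cluster labels of the first ten points:", labels[:10])
```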

**Unsupervised learning** is a paradigm in machine learning where we build models **without relying on labelled training data**.

- When data is labelled in some way, learning algorithms can look at the data and learn to categorize it based on those labels.
- In the world of unsupervised learning, we don’t have this opportunity! **These algorithms are used when we want to find subgroups within datasets using a similarity metric.**
- In unsupervised learning, information is extracted from the data automatically, without any prior knowledge of the content to be analyzed.
- In unsupervised learning, there is no…

Combining different approaches leads to better results: this statement holds in many aspects of our lives, and it also applies to machine learning algorithms.

Stacking is the process of combining various machine learning algorithms. This technique is due to David H. Wolpert, an American mathematician, physicist, and computer scientist.

**We will learn how to implement a stacking method.**

- We start by importing the libraries:

`from heamy.dataset import Dataset`

`from heamy.estimator import Regressor`

`from heamy.pipeline import ModelsPipeline`

`from sklearn.datasets import load_boston`

`from sklearn.model_selection import train_test_split`

`from sklearn.ensemble import RandomForestRegressor`

`from sklearn.linear_model import LinearRegression`

`from sklearn.metrics import mean_absolute_error`

2…
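The remaining steps are elided above; as a hedged sketch of how such a stacking pipeline might be assembled with heamy's documented usage (the split ratio, model parameters, and seed are assumptions, and `load_boston` has been removed from scikit-learn 1.2 and later):

```python
# Load the Boston housing data (note: load_boston is removed in scikit-learn >= 1.2)
data = load_boston()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)

# Wrap the data in a heamy Dataset
dataset = Dataset(X_train, y_train, X_test)

# First-stage models: random forest and linear regression
model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor,
                     parameters={'n_estimators': 50}, name='rf')
model_lr = Regressor(dataset=dataset, estimator=LinearRegression, name='lr')

# Stack the first-stage models: their out-of-fold predictions become new features
pipeline = ModelsPipeline(model_rf, model_lr)
stack_ds = pipeline.stack(k=10, seed=111)

# Second-stage model trained on the stacked predictions, validated with 10-fold CV
stacker = Regressor(dataset=stack_ds, estimator=LinearRegression)
results = stacker.validate(k=10, scorer=mean_absolute_error)
```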

Download the data file ‘traffic_data.txt’ from https://github.com/appyavi/Dataset. This dataset counts the number of cars passing by the Los Angeles Dodgers home stadium during baseball games. Each line in the file contains comma-separated strings formatted in the following manner:

- Day;
- Time;
- The opponent team;
- Whether or not a baseball game is going on;
- The number of cars passing by.

Let’s see how to estimate the traffic.

- Load the data and relevant libraries:

`# SVM regressor to estimate traffic`

`import numpy as np`

`from sklearn import preprocessing`

`from sklearn.svm import SVR`

`input_file = 'traffic_data.txt'` …
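The rest of the recipe's code is elided above. A hedged sketch of how the continuation might look, assuming the string fields are label-encoded and an RBF-kernel SVR is used (the parameter values are illustrative, not the recipe's):

```python
# Read the comma-separated records into a NumPy array
X = []
with open(input_file, 'r') as f:
    for line in f.readlines():
        X.append(line.strip().split(','))
X = np.array(X)

# Encode non-numeric columns with one LabelEncoder per column
label_encoders = []
X_encoded = np.empty(X.shape)
for i, item in enumerate(X[0]):
    if item.isdigit():
        X_encoded[:, i] = X[:, i]
    else:
        encoder = preprocessing.LabelEncoder()
        X_encoded[:, i] = encoder.fit_transform(X[:, i])
        label_encoders.append(encoder)

# The last column is the car count (target); the rest are features
X_features = X_encoded[:, :-1].astype(int)
y = X_encoded[:, -1].astype(int)

# Train an RBF-kernel SVR; C and epsilon are assumed values
regressor = SVR(kernel='rbf', C=10.0, epsilon=0.2)
regressor.fit(X_features, y)
```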

Download the data files building_event_binary.txt and building_event_multiclass.txt from https://github.com/appyavi/Dataset.

Let’s understand the data format before we start building the model. Each line in **building_event_binary.txt** consists of six comma-separated strings. The ordering of these six strings is as follows:

- Day;
- Date;
- Time;
- The number of people going out of the building;
- The number of people coming into the building;
- **The output**, indicating whether or not it’s an event.

The **first five strings form the input data**, and our task is to predict whether or not an event is going on in the building.

Each line in building_event_multiclass.txt consists of six comma-separated strings…
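A hedged sketch of how the binary case might be handled, assuming the categorical fields are label-encoded and an SVC with balanced class weights is trained (the file-parsing details and parameters are assumptions, not the recipe's exact code):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.svm import SVC

input_file = 'building_event_binary.txt'

# Read the records; the date field is assumed not to be used as a feature
X = []
with open(input_file, 'r') as f:
    for line in f.readlines():
        data = line.strip().split(',')
        X.append([data[0]] + data[2:])
X = np.array(X)

# Encode string columns to integers, one LabelEncoder per non-numeric column
X_encoded = np.empty(X.shape)
for i, item in enumerate(X[0]):
    if item.isdigit():
        X_encoded[:, i] = X[:, i]
    else:
        X_encoded[:, i] = preprocessing.LabelEncoder().fit_transform(X[:, i])

X_features = X_encoded[:, :-1].astype(int)
y = X_encoded[:, -1].astype(int)

# RBF-kernel SVC; class_weight='balanced' helps if the event/no-event classes are skewed
classifier = SVC(kernel='rbf', class_weight='balanced')
classifier.fit(X_features, y)
```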

- In machine learning algorithms, various parameters are obtained during the learning process.
- In contrast, hyperparameters are set before the learning process begins.
- Given these hyperparameters, the training algorithm learns the parameters from the data.

We will find the optimal hyperparameters for an SVM-based model using the grid search method.

Let’s see how to find optimal hyperparameters:

Data file: download ‘data_multivar.txt’ from here: https://github.com/appyavi/Dataset

- We start by importing the libraries:

`from sklearn import svm`

`from sklearn import model_selection`

`from sklearn.model_selection import GridSearchCV`

`from sklearn.metrics import classification_report`

`import pandas as pd`

`import utilities`

2. Then, we load the data:

`input_file = 'data_multivar.txt'…`
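A hedged sketch of how the grid search step might proceed; the synthetic data stands in for the contents of data_multivar.txt, and the parameter grid and split are illustrative assumptions:

```python
from sklearn import svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification

# Illustrative data standing in for data_multivar.txt, which is loaded in the previous step
X, y = make_classification(n_samples=400, n_features=2, n_redundant=0, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

# Candidate hyperparameter grid (values are illustrative)
parameter_grid = [
    {'kernel': ['linear'], 'C': [1, 10, 50, 600]},
    {'kernel': ['rbf'], 'C': [1, 10, 50, 600], 'gamma': [0.01, 0.001]},
]

# Exhaustive grid search with 5-fold cross-validation
classifier = GridSearchCV(svm.SVC(), parameter_grid, cv=5, scoring='precision_weighted')
classifier.fit(X_train, y_train)

print("Best hyperparameters:", classifier.best_params_)
print(classification_report(y_test, classifier.predict(X_test)))
```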

It would be nice to know the confidence with which we classify unknown data. When a new data point is classified into a known category, we can train the SVM to compute the confidence level of that output as well. A confidence level refers to the probability that the value of a parameter falls within a specified range of values.

We will use an SVM classifier to find the best separating boundary for a dataset of points. In addition, we will measure the confidence level of the results obtained.

Download the file ‘data_multivar.txt’ from https://github.com/appyavi/Dataset

Let’s…
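A hedged sketch of the idea, assuming scikit-learn's SVC: the distance from the boundary can be read with `decision_function`, and enabling `probability=True` also makes `predict_proba` available (the data and query points below are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Illustrative 2-D dataset standing in for data_multivar.txt
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=3)

# probability=True enables Platt scaling so predict_proba can be used
classifier = SVC(kernel='rbf', probability=True, random_state=3)
classifier.fit(X, y)

new_points = np.array([[0.5, 1.0], [-1.0, -0.5]])
# Signed distance to the separating boundary: larger magnitude means higher confidence
print("Distance from boundary:", classifier.decision_function(new_points))
# Class membership probabilities for each new point
print("Confidence (probability):", classifier.predict_proba(new_points))
```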

We dealt with problems where we had a similar number of data points in all our classes. In the real world, we might not be able to get data in such an orderly fashion. Sometimes, the number of data points in one class is much larger than in the other classes. If this happens, the classifier tends to become biased: the boundary won’t reflect the true nature of the data, simply because there is a big difference in the number of data points between the classes. …
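One common way to handle this in scikit-learn, sketched here under illustrative assumptions about the data, is to pass `class_weight='balanced'` so that the SVM penalizes mistakes on the minority class more heavily:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset: roughly 90% of the points fall in one class
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           weights=[0.9, 0.1], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

# class_weight='balanced' reweights classes inversely to their frequency
classifier = SVC(kernel='linear', class_weight='balanced')
classifier.fit(X_train, y_train)

print(classification_report(y_test, classifier.predict(X_test)))
```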
