When we discussed the k-means algorithm, we saw that we had to give the number of clusters as one of the input parameters. In the real world, we often won’t have this information available. We can sweep the parameter space to find the optimal number of clusters using the silhouette coefficient score, but this is an expensive process. A method that infers the number of clusters from the data itself would be an excellent solution to the problem. DBSCAN does just that for us.
We will perform a DBSCAN analysis using the sklearn.cluster.DBSCAN class. We will use the same…
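As a rough illustration of the idea, the sketch below runs DBSCAN on synthetic blobs and counts the clusters it finds. The data, `eps`, and `min_samples` values are assumptions chosen for this example, not the settings or data file used in the article.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (an assumption for this sketch)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.3,
    random_state=0,
)

# eps and min_samples are illustrative guesses, not tuned values
model = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = model.labels_

# DBSCAN labels noise points as -1, so exclude that label when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters:", n_clusters)
```

Note that the number of clusters is never passed in; it falls out of the density parameters, which still need to be chosen with some care.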
We have built different clustering algorithms, but haven’t measured their performance.
A good way to evaluate a clustering algorithm is to see how well the clusters are separated. Are the clusters far enough apart? Are the data points within each cluster tightly grouped?
We need a metric that can quantify this behaviour. We will use a metric called the silhouette…
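A minimal sketch of how the silhouette score can be used to compare candidate cluster counts, assuming synthetic blob data and k-means as the clustering algorithm (both are choices made for this example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (an assumption for this sketch)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.5,
    random_state=0,
)

# Score each candidate number of clusters; the score lies between -1 and 1
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest score has the tightest, best-separated clusters
best_k = max(scores, key=scores.get)
print("Silhouette scores:", scores)
print("Best number of clusters:", best_k)
```

A score close to 1 means points sit close to their own cluster and far from the others; a score near 0 means the clusters overlap.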
A combination of different approaches often leads to better results. This holds in many aspects of life, and it also applies to machine learning algorithms.
Stacking is the process of combining the predictions of several machine learning models. The technique was introduced by David H. Wolpert, an American mathematician, physicist, and computer scientist.
We will learn how to implement a stacking method.
from heamy.dataset import Dataset
from heamy.estimator import Regressor
from heamy.pipeline import ModelsPipeline
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
2…
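The idea of stacking can also be sketched with scikit-learn alone. The example below swaps the heamy pipeline for scikit-learn's `StackingRegressor` and uses synthetic regression data instead of the Boston dataset; both substitutions are assumptions made so the example is self-contained, and it is not the article's own implementation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; not the dataset used in the article)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# First-level models: their cross-validated predictions feed the final estimator
base_models = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("lr", LinearRegression()),
]

# Second-level model: a linear regression trained on the base models' predictions
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

predictions = stack.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print("Mean absolute error:", mae)
```

The same two base learners imported above (a random forest and a linear regression) are used here, so the structure should map closely onto a heamy-based pipeline.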
Download the data file ‘traffic_data.txt’ from https://github.com/appyavi/Dataset. This dataset counts the number of cars passing by during baseball games at the Los Angeles Dodgers’ home stadium. Each line in this file contains comma-separated strings formatted in the following manner:
Let’s see how to estimate the traffic.
SVM regressor to estimate traffic
import numpy as np
from sklearn import preprocessing
from sklearn.svm import SVR
input_file = 'traffic_data.txt' …
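Because the input consists of strings, the non-numeric fields have to be encoded before an SVM regressor can use them. The sketch below shows that pattern on a few hypothetical rows; the column meanings, values, and SVR parameters are invented for illustration and do not reflect the actual layout of traffic_data.txt.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVR

# Hypothetical rows: (day, game_on, car_count). Not the real file format.
rows = [
    ("Mon", "no", 12), ("Mon", "yes", 40),
    ("Tue", "no", 10), ("Tue", "yes", 38),
    ("Sat", "no", 20), ("Sat", "yes", 55),
]
days = [r[0] for r in rows]
games = [r[1] for r in rows]
y = np.array([r[2] for r in rows], dtype=float)

# One encoder per string column, so each category maps to an integer
day_encoder = LabelEncoder().fit(days)
game_encoder = LabelEncoder().fit(games)
X = np.column_stack(
    [day_encoder.transform(days), game_encoder.transform(games)]
).astype(float)

# Kernel, C, and epsilon are illustrative values, not tuned ones
regressor = SVR(kernel="rbf", C=10.0, epsilon=0.2).fit(X, y)

# Encode a new data point with the same encoders before predicting
query = np.array(
    [[day_encoder.transform(["Sat"])[0], game_encoder.transform(["yes"])[0]]],
    dtype=float,
)
predicted_traffic = float(regressor.predict(query)[0])
print("Predicted traffic:", predicted_traffic)
```

Keeping the fitted encoders around matters: a new data point must be transformed with the same mapping that was used for training.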
Download the data files building_event_binary.txt and building_event_multiclass.txt from https://github.com/appyavi/Dataset.
Let’s understand the data format before we start building the model. Each line in building_event_binary.txt consists of six comma-separated strings. The ordering of these six strings is as follows:
The first five strings form the input data, and our task is to predict whether or not an event is going on in the building.
Each line in building_event_multiclass.txt consists of six comma-separated strings…
We will find the optimal hyperparameters for an SVM-based model using the grid search method.
Let’s see how to find optimal hyperparameters:
Data file: download ‘data_multivar.txt’ from here: https://github.com/appyavi/Dataset
from sklearn import svm
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import pandas as pd
import utilities
2. Then, we load the data:
input_file = 'data_multivar.txt'…
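For reference, a self-contained sketch of a grid search over SVM hyperparameters is shown below. It uses synthetic data from `make_classification` instead of data_multivar.txt and the `utilities` module, and the parameter grid is an assumption chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic two-class data (an assumption; the article loads data_multivar.txt)
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5
)

# Candidate settings to try; every combination in each dict is evaluated
parameter_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1], "C": [1, 10]},
]

# 5-fold cross-validation on the training set, scored by accuracy
search = GridSearchCV(SVC(), parameter_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validation score:", search.best_score_)

# The search refits the best model on the full training set by default
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred))
```

The cost grows with the product of the grid sizes and the number of folds, so it is worth keeping the grid coarse at first.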
It would be nice to know the confidence with which we classify unknown data. When a new data point is classified into a known category, we can train the SVM to compute the confidence level of that output as well. A confidence level refers to the probability that the value of a parameter falls within a specified range of values.
We will use an SVM classifier to find the best separating boundary between the classes in a dataset of points. In addition, we will measure the confidence level of the results obtained.
Download the file ‘data_multivar.txt’ from https://github.com/appyavi/Dataset
Let’s…
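One way to obtain confidence measures from a scikit-learn SVM is sketched below, assuming synthetic two-class data rather than data_multivar.txt. It shows both the signed distance from the boundary (`decision_function`) and calibrated class probabilities (`predict_proba`, available when `probability=True`).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic classes (an assumption for this sketch)
X, y = make_blobs(
    n_samples=200, centers=[[0, 0], [4, 4]], cluster_std=1.0, random_state=0
)

# probability=True enables predict_proba at the cost of extra training time
classifier = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

# Hypothetical query points: one near each class center, one in between
points = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]])

# Signed distance to the boundary: the sign gives the side, the size the margin
distances = classifier.decision_function(points)

# One row per point, one column per class; each row sums to 1
probabilities = classifier.predict_proba(points)

for point, distance, proba in zip(points, distances, probabilities):
    print(point, "distance:", distance, "probabilities:", proba)
```

Points far from the boundary get large distances and probabilities close to 0 or 1, while points near the boundary are the ones the classifier is least sure about.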
So far, we have dealt with problems where we had a similar number of data points in all our classes. In the real world, we might not be able to get data in such an orderly fashion. Sometimes, the number of data points in one class is much larger than the number of data points in other classes. If this happens, the classifier tends to become biased toward the larger class. The boundary won’t reflect the true nature of your data, simply because there is a big difference in the number of data points between the two classes. …
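One common way to counter this bias in scikit-learn is the `class_weight="balanced"` option, which penalizes mistakes on the rarer class more heavily. The sketch below compares it with an unweighted SVM on synthetic imbalanced data; the dataset and the choice of a linear kernel are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with roughly a 9:1 class ratio (an assumption for this sketch)
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    class_sep=0.8,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: every training point carries the same weight
plain = SVC(kernel="linear").fit(X_train, y_train)

# Balanced: class weights are inversely proportional to class frequencies
balanced = SVC(kernel="linear", class_weight="balanced").fit(X_train, y_train)

# Recall on the minority class (label 1) shows how many rare cases are caught
recall_plain = recall_score(y_test, plain.predict(X_test), pos_label=1)
recall_balanced = recall_score(y_test, balanced.predict(X_test), pos_label=1)

print("Class weights used:", balanced.class_weight_)
print("Minority recall, unweighted:", recall_plain)
print("Minority recall, balanced:", recall_balanced)
```

Balancing typically raises recall on the minority class, often at some cost in precision, so both should be checked before settling on a setting.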
Data Scientist at NCS-IT, UK