## Machine Learning Algorithms and their Applications

In this article, you will learn about some Machine Learning Algorithms and how they **work**, what the **math behind them** is, and some of their countless **applications**.

We will start by talking about Machine Learning in **general**, come up with some **definitions**, and then I am going to explain how **2 algorithms** work, what their **differences** are, **when they are used,** and what makes them so **good/bad**. We will start with Linear Regression, and then move to K-Means.

Before starting, I should mention that I will include all the equations and mathematical formulas that make up these algorithms. If you are not familiar with **Calculus** and **Linear Algebra**, you will not be able to understand most of these equations; however, I will **explain what the equations mean** and provide some **intuition** so that you can follow along **despite not knowing the mathematics.**

**What is Machine Learning?**

- Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.

*(Arthur Samuel)*

**The Universally Accepted Definition of Machine Learning**

A computer program is said to learn from experience **E** with respect to some task **T** and some performance measure **P**, if its performance on **T**, as measured by **P**, improves with experience **E.**

*(Tom Mitchell)*

**What are some examples of Machine Learning?**

The most common example is the housing price model. This algorithm, based on all the data it was given in its training phase, predicts the price of a house. The price depends on a lot of factors, like how old the house is, how many bedrooms there are, its location, the number of restrooms, etc. We use machine learning to create a mapping between all these **features** and the final prices. In this specific case, it creates a function of **n** variables, where n denotes the number of features, that returns a specific value, namely the housing price. This is a classical example of a **Regression Algorithm.**
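To make the idea concrete, here is a toy sketch of what such a trained model computes. The feature names and weight values are made up purely for illustration; a real model would learn them from data:

```python
# A (hypothetical) trained hypothesis mapping house features to a price.
# All names and numbers here are illustrative assumptions, not real estimates.

def predict_price(age_years, bedrooms, bathrooms, distance_to_center_km):
    base_price = 150_000            # the "bias" term of the model
    weights = {
        "age_years": -1_200,             # older houses tend to cost less
        "bedrooms": 25_000,              # more bedrooms raise the price
        "bathrooms": 15_000,
        "distance_to_center_km": -3_000  # farther from the center costs less
    }
    return (base_price
            + weights["age_years"] * age_years
            + weights["bedrooms"] * bedrooms
            + weights["bathrooms"] * bathrooms
            + weights["distance_to_center_km"] * distance_to_center_km)

print(predict_price(age_years=10, bedrooms=3, bathrooms=2,
                    distance_to_center_km=5))  # 228000
```

Each feature contributes linearly to the final price; learning means finding the weights that best match the observed prices.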

**Different Types of Learning Algorithms**

In Machine Learning, there are 3 types of Learning Algorithms:

**1) Supervised Learning**

**2) Unsupervised Learning**

**3) Reinforcement Learning**

All of these types of Learning Algorithms solve different types of problems and have countless applications in our everyday life.

**Supervised Learning**

We will first take a look at Supervised Learning Algorithms, define them and look at some examples. Afterward, we will get into our first Algorithm, and do a deep analysis with multiple examples and many diagrams.

However, before starting, we should first talk about notation and symbols. There are many ways to symbolize the things that we will talk about, but in this article, we will use the following notation:

Number of Training Examples: **m**

Input Variables/Features: **x**

Output Variables/Target Variables/ Labels: **y**

Hypothesis: \( h_θ(x) \)

\( i^{th} \) training example: \((x^{(i)},y^{(i)})\) (for example, the second training example will be denoted as \((x^{(2)},y^{(2)})\); we will not start our index at 0, since a specific value is reserved for the 0th elements)

**What is Supervised Learning?**

In S.L., the program is given a **training set.** A training set is a set of examples for the program to learn from. Each training example consists of **n** inputs, where **n** is the number of features our algorithm takes into consideration when making a prediction. Along with the inputs, we also provide the **labels**, which are the values that correspond to the respective inputs.

The Supervised Learning Algorithm then creates a function that we call a **hypothesis,** denoted by **h**. This hypothesis is the prediction it creates based on the inputs. After forming a hypothesis, the algorithm compares its **estimated output** with the **label** that we have provided. Based on the comparison, it adjusts its function (the hypothesis) and iterates over the training set until it reaches the accuracy that the programmer has set as a threshold.

Based on the **algorithm of our choice,** the hypothesis has a different form. For example, as we will later see, Linear Regression has the form

\(h_θ(x) = θ_0 + θ_1 x \)

for **Simple Linear Regression,** and the form

\(h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + \dots + θ_n x_n \)

for **Multivariable Linear Regression.**

The **Logistic Regression Model,** on the other hand, takes the form

\(h_θ(x) = \frac{1}{1+e^{-(θ_0 + θ_1 x_1 + \dots + θ_n x_n)}} \)

The algorithms above might seem complicated, but in reality they are very simple, and once we get to their analysis, you will find that they are very straightforward.

The **Hypothesis**, as we said is the function that our machine learning program creates.

The values **θ** are simply constants. That means that the **Simple Linear Regression** hypothesis is simply the equation for a line, the famous \(y = mx + b\).

All of these constants are **parameters** of our hypothesis, hence the notation **\(h_θ(x) \)**

It means that our hypothesis h is a **function of x, parametrized by θ.**

Our algorithm changes the values **θ** based on how close its estimated prediction is to the actual value, given in the training set.

The simplest example of such a function is one where the **training set** consists of single pairs of numbers, one value, and its associated **label. **In this case, in each example, x is simply that one value, and then we compare the value of our hypothesis to the label, which we usually notate as **y.**
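As a minimal sketch of this setup in code (the parameter values below are arbitrary, chosen only for illustration): fixing the θ values produces a concrete hypothesis that we can evaluate at any x.

```python
# A sketch of "a function of x, parametrized by theta": once theta0 and
# theta1 are fixed, h is an ordinary function of x.

def make_hypothesis(theta0, theta1):
    """Return h_theta(x) = theta0 + theta1 * x for fixed parameters."""
    def h(x):
        return theta0 + theta1 * x
    return h

h = make_hypothesis(theta0=1.0, theta1=2.0)  # the line y = 1 + 2x
print(h(3.0))  # 1.0 + 2.0 * 3.0 = 7.0
```

Training amounts to repeatedly replacing these θ values with better ones.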


**Linear Regression**

This is the first algorithm that we will look at. Linear Regression, as we said in the previous segment, is an algorithm that, based on the training set, creates **the best-fit line** for it. We give it a specific starting hypothesis, which could be **random, taken from another program,** or **anything** really.

The Algorithm then starts **training **by comparing its predictions/estimated results, to the actual **output variables y,** and makes the necessary corrections to the parameters **θ.**

In this article we will only talk about **Univariate Linear Regression,** and at the end we will mention **Multivariate Linear Regression** and see how the Gradient Descent algorithm changes when many variables are introduced. However, I will not go into detail.

Note, that almost everything we say in one algorithm applies to the rest too. At the start of each algorithm, we will however have a small revision of the previous one.

**Univariate Linear Regression**

**Univariate Linear Regression **is a Linear Regression algorithm that has only **1** input variable, as suggested by its name.

That means that a training set for U.L.R. is simply a list of pairs \((x^{(i)}, y^{(i)})\): one input value and its label per example.

The algorithm creates a function, that we call the **hypothesis,** and based on how close to **y** the estimated result is, it alters the function to match the training set.

**How do we do that?**

### The Cost Function

For our algorithm to see how close its estimated result is to the actual label, we need a function that tells us exactly that. That function is called the **Cost Function**.

There isn’t a single correct Cost Function. There are many ways to find the difference between the expected and the estimated value. However, in this article, we will use the **Mean Squared Error Function** as our Cost Function.

We denote the Cost Function as \(J(θ)\), or more specifically for U.L.R., \(J(θ_0 , θ_1)\).

I will not directly give you the function. Instead, we will try to “create” it ourselves, so that you can get some intuition.

So first of all, the main **objective** of our function should be to find how far off the labels we are. We should also try to make the value of the function as simple as possible, so that we can later use that value to alter the **thetas** (θ). In other words, we want our result to be a single value for the whole hypothesis, and not a specific value for each x. That also makes our algorithm way faster and more efficient.

The core of our function, as we have said, should calculate the difference between y and \(h_θ(x) \), however, there is a problem:

We cannot simply subtract one value from the other one, because of the **signs.**

If our hypothesis were, say, 12 and the actual result 10, subtracting **y** from **h** would yield **2**. But if our hypothesis were 8, the result of the subtraction would be **-2**.

See the problem?

Both estimates, 8 and 12, are 2 units away from **y**, but with a cost function based on \(h_θ(x) - y\), we get two very different results, with opposite signs.

There is a very simple solution to that. What we can do is to either square the difference or find its absolute value.

Both ways are good, but we will choose the first one, namely squaring the difference.

The reason is that squaring **amplifies** the difference by an order of 2. That makes the Cost Function **more sensitive** to large errors, and therefore faster. There is, however, a catch: once the difference \(h_θ(x) - y\) becomes smaller than one, the **amplification** is no longer helpful, since squaring makes a small difference even smaller.

In the real world and for the training set sizes that we will talk about, it doesn’t really matter which of these we choose. We can barely see a difference in the speed.

From the name of the **Cost Function, **you could find the rest of the formula based on logic and, well English, but nevertheless, I will continue explaining everything so that you build a strong intuition.

If you remember, we said that we want our Cost Function to produce only one value for the whole hypothesis. So what we can now do is find the average error of our hypothesis, by adding all the squared differences \((h_θ(x^{(i)})-y^{(i)})^2 \) and then dividing by the number of training examples. Just like this, we can calculate the mean (squared) error of our hypothesis. Mathematically, we write this as:

\(J(θ_0,θ_1) = \frac{1}{2m} \sum\limits_{i=1}^{m} (h_θ(x^{(i)}) - y^{(i)} )^2 \)

Where:

**m:** Number of Training Examples, \(h_θ(x^{(i)}) \): Hypothesis of the \(i^{th} \) training example, \(y^{(i)}\) : Label of \(i^{th}\) training example
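Translated directly into code, the cost function looks like this (a minimal sketch; the variable names are my own, with `xs` the inputs and `ys` the labels):

```python
# A direct translation of J(theta0, theta1) into Python.

def cost(theta0, theta1, xs, ys):
    """Mean squared error (halved) over the m training examples."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = theta0 + theta1 * x   # hypothesis for this example
        total += (h - y) ** 2     # squared difference: sign no longer matters
    return total / (2 * m)

# A line that fits the data perfectly has zero cost; a bad one does not.
xs = [1, 2, 3]
ys = [3, 5, 7]                     # generated from y = 1 + 2x
print(cost(1.0, 2.0, xs, ys))      # 0.0
print(cost(0.0, 0.0, xs, ys))      # (9 + 25 + 49) / 6 ≈ 13.83
```

Note how the result is a single number for the whole hypothesis, exactly as we wanted.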

You probably noticed that in our Cost Function we also divide by 2. If you are familiar with Multivariable Calculus, you probably understand why that 2 is there, and can already guess what we will do to change the parameters to fit the training set.

If you are not familiar with multivariable calculus, do not worry about that 2, I will soon explain its existence and why we put it there.

Before moving on, here are some examples so that you get a visual interpretation of what we just said.

It is pretty clear that the second example fits the data best and is the most accurate model of the three. That means that if we were to calculate the cost function of each of these lines, we would expect the first to have the largest value and the middle one the smallest. If we were to actually calculate them, we would find exactly that!

### Minimizing the Cost Function

We now want to find a way to minimize our cost Function J. It also needs to be fast and controllable.

In **Multivariable Calculus** and in **Vector Calculus** there is a specific operation, called taking the **Gradient** of a function. We symbolize it with either \( grad(f)\) or \( \nabla f\). This operation tells you whether you should increase or decrease the values of the input variables to climb toward the nearest local maximum, following the direction of steepest slope in the output of the function.

For someone who isn’t familiar with Calculus, this is a lot of information that seems complex and too hard to understand, and they might get discouraged by that. But the reason I am writing this article is so that people not in the field can understand the logic behind it, so I will explain what this algorithm does.

Before getting into the gradient, we should first take a step back and talk about **functions of 2 variables.**

These functions take as **input 2 variables** and **output** a **third, dependent, variable.** That means that we can see the function, call it f, as a **map** that correlates two specific inputs from one space to an output in a third space. In other words, we can see **3D** space as one **2D input space** and one **1D output space.**

Essentially, what the **gradient** of a **multivariable function** does is tell us whether we should **increase** or **decrease** the values of our input variables so that we get closer to the nearest maximum value of the function. **Gradient Descent** is essentially that, but in the opposite direction, so we plug in a **minus.** In our case, the multivariable function is the **Cost Function,** and we use G.D. to find a local minimum, rather than the global minimum.

We have seen now mathematically what the Gradient Descent is, but now we need to see how we apply it to Linear Regression.

There is a problem with applying this operation in our algorithm. There is a chance that Gradient Descent **overshoots,** and instead of landing on the local minimum, it steps slightly past it. That is why we introduce a number to our algorithm, called **the learning rate.** We usually denote it with **a**, and it is a controlled constant that **determines** how **strong** the Gradient Descent is, or how big the steps it tells us to take are.

We always select a small value for **a**, so that it doesn’t overshoot, but one big enough that it doesn’t take ages to find the nearest optimum.

*(I have used a cost function that depends on one value for demonstrative purposes; you can imagine what it would look like in 3D)*

We are now ready to compose our Gradient Descent Algorithm:

**initialize the θ with random values**

**repeat until convergence{**

**\( h_θ(x^{(i)}) = θ_0 + θ_1 x^{(i)}\)**

**\(J(θ_0,θ_1) = \frac{1}{2m}\sum\limits_{i=1}^{m} (h_θ(x^{(i)})-y^{(i)})^2\)**

**\(θ_j := θ_j - a\frac{\partial}{\partial θ_j}J( θ_0, θ_1)\)**

**(for j = 0 and j = 1)}**

That is our algorithm. Notice how we first define the cost function and then **update** the variables (:= is the **update operator**). The update is simultaneous: both gradients are computed with the same, old values of the thetas. If we were to update one theta first, then redefine the cost function, and then update the other theta, the algorithm would not work as intended and would update the values with the wrong logic. Not to mention that it would take a lot longer to finish than the algorithm above!
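Putting the pieces together, here is a runnable sketch of the loop above for Univariate Linear Regression. The learning rate, iteration count, and toy data are illustrative choices, not tuned values:

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x. The partial
# derivatives of J are used in closed form: the 1/2 in J cancels the 2
# brought down by squaring, which is exactly why that 2 was put there.

def gradient_descent(xs, ys, a=0.05, iterations=5000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0   # initialization (could be random)
    for _ in range(iterations):
        # dJ/dtheta0 and dJ/dtheta1 for the mean-squared-error cost
        grad0 = sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys)) / m
        grad1 = sum(((theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys)) / m
        # Simultaneous update: both gradients use the old theta values.
        theta0, theta1 = theta0 - a * grad0, theta1 - a * grad1
    return theta0, theta1

# Toy data generated from y = 1 + 2x; the algorithm should recover
# theta0 ≈ 1 and theta1 ≈ 2.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
t0, t1 = gradient_descent(xs, ys)
print(round(t0, 3), round(t1, 3))
```

The tuple assignment on the update line is what makes the update simultaneous; updating the two thetas in separate statements would reproduce the bug described above.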

This is actually also the algorithm and logic for Multivariate Linear Regression. The only differences are the hypothesis and the values j takes:

\( h_θ(x_1, x_2,\dots,x_n) = θ_0 + θ_1 x_1 + θ_2 x_2 +\dots+ θ_n x_n \)

Here, j takes values from 0 to n, and we call all the **x** values **features.** We also define \(x_0 = 1\) for all training examples. In M.L.R. we also perform something called **feature scaling,** where we rescale all the features so that they lie in roughly the same range of values. We do that because otherwise there can be a **huge** difference between feature values and G.D. takes way too long.
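The feature scaling step can be sketched as follows, using mean normalization (one common choice among several; the function name is my own):

```python
# Mean normalization: each feature is shifted by its mean and divided by
# its range, so all features end up spanning roughly [-0.5, 0.5].

def scale_features(columns):
    """columns: one list of raw values per feature."""
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        spread = max(col) - min(col)   # the range of this feature
        scaled.append([(v - mean) / spread for v in col])
    return scaled

# House sizes (square feet) and bedroom counts live on wildly different
# scales before normalization.
sizes = [1000, 1500, 2000]
bedrooms = [2, 3, 4]
print(scale_features([sizes, bedrooms]))  # both become [-0.5, 0.0, 0.5]
```

After scaling, no single feature dominates the gradient, so G.D. converges much faster.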

*(MVL is not in the scope of this article)*

##### This Has Been Linear Regression

### Unsupervised Learning

Unsupervised Algorithms are completely different from Supervised. In U.L. we have no **label,** no **output variable to predict**.

U.L. Algorithms take as input all the examples we have and create clusters based on them. After the clusters are created, and the algorithm is sufficiently trained, it classifies any input based on its distance from all the clusters. For our purpose, we will examine K-Means, an algorithm that does exactly that: **creates clusters** and **assigns points** to them.

K-Means is a very simple but beautiful algorithm that everyone can intuitively understand without having to know any hardcore mathematics.

### K-Means

We can describe this algorithm in 4 steps:

**1) Plot** all the points

**2) Pick K random points** (these random points are now called **cluster centroids**)

**3) Assign** all the **points to centroids** based on their **distance** from them

**4)** Set the **coordinates** of each **cluster centroid** to the **mean of the coordinates** of all its points

Then **repeat** steps **3 and 4** until **convergence**.

The first step is called **Cluster Initialization.** We pick the number of **clusters** that we want (we denote that with the variable **K**), and the algorithm picks **K** random points to be **centroids**.

After that, we move on to the next step, called **Cluster Assignment.** Now our algorithm iterates over all the points and, based on their distance to the centroids, assigns each one to the closest, creating **K** **clusters.**

The next step is a **shift** in the **coordinates** of the cluster centroids. The Algorithm sets the coordinates of the centroids, equal to the **mean coordinates** of the points in said cluster.

Finally, K-Means repeats the process until there is no change in assignments or coordinates, until everything **converges.**
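The four steps can be sketched in code. For simplicity, this toy version clusters 1-D points (real data would have n features per point, but the logic is identical):

```python
import random

# A toy K-Means on 1-D points: centroids start as K random data points,
# then the assignment and update steps repeat until nothing changes.

def k_means(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # step 2: pick K random points
    assignments = None
    while True:
        # Step 3: assign every point to its nearest centroid.
        new_assignments = [min(range(k), key=lambda j: abs(p - centroids[j]))
                           for p in points]
        if new_assignments == assignments:   # convergence: nothing changed
            return centroids, assignments
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:                      # skip a centroid with no points
                centroids[j] = sum(members) / len(members)

# Two obvious groups: one around 1, one around 10.
points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
centroids, assignments = k_means(points, k=2)
print(sorted(centroids))  # roughly [1.0, 10.0]
```

With well-separated data like this, the centroids settle on the two group means after only a few iterations.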

**The Mathematics behind K-Means**

Before concluding, I think it is only right that we take a look at the mathematics behind K-Means, and the symbols we usually use.

The algorithm takes two inputs: **K, the number of clusters,** and the **training set** \(\{x^{(1)},x^{(2)},\dots,x^{(m)}\}\), where \(x^{(i)}\in \mathbb{R}^n\) (for K-Means we drop the \(x_0=1\) convention).

*(Remember that this is a generalized version, meaning there are n features in each training example)*

##### Symbols

\(c^{(i)}\) is the **cluster** that \(x^{(i)}\) is **assigned** to.

\(μ_k\) are the **cluster centroids**

(so if we write \(μ_k := \dots\), we mean that we update its coordinates)

\(δ_{ij}\) is the **Kronecker Delta**, a mathematical symbol used in many fields. It is equal to **1** if **i = j**, and **0 otherwise**.

##### The Process

Initialize K cluster centroids \(μ_1,\dots,μ_K\)

repeat until convergence{

for i in range(**m**){

\(c^{(i)} = \arg\min\limits_{k} ||x^{(i)}-μ_k||\)

}

for j in range(**K**){

\(t_j =\sum\limits_{i=1}^{m} δ_{jc^{(i)}}\)

\( μ_j = \frac{1}{t_j}\sum\limits_{i = 1}^{m}x^{(i)}δ_{jc^{(i)}}\)

}}
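As a sketch, the update step can be written to mirror this notation exactly, with the Kronecker delta picking out the points assigned to each cluster (toy 1-D data; this assumes every cluster keeps at least one point, otherwise \(t_j\) would be zero):

```python
# The centroid update written to mirror the mathematics: t_j counts the
# points assigned to cluster j, and mu_j becomes their mean.

def kronecker_delta(i, j):
    # 1 if i == j, 0 otherwise
    return 1 if i == j else 0

def update_centroids(points, assignments, k):
    new_mu = []
    for j in range(k):
        # t_j: number of points currently assigned to cluster j
        t_j = sum(kronecker_delta(j, c) for c in assignments)
        # mu_j: mean of the points assigned to cluster j
        mu_j = sum(x * kronecker_delta(j, c)
                   for x, c in zip(points, assignments)) / t_j
        new_mu.append(mu_j)
    return new_mu

points = [1.0, 1.2, 0.8, 10.0]
assignments = [0, 0, 0, 1]          # c^(i) for each point
print([round(mu, 6) for mu in update_centroids(points, assignments, k=2)])
```

In practice one would simply average the member points directly, but the delta form makes the correspondence with the summation above explicit.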

Thank you all so much for reading and hopefully, you learned something new! If you want to learn more about machine learning, I would suggest taking a look at some of the books I have listed below. Thank you for your time and make sure to come back for our future content.

**Stavros Klaoudatos**

Best Machine Learning Book: Hands-on Machine Learning with Python

Calculus 1: Single Variable Calculus: Early Transcendentals, Volume

Calculus 2: Single Variable Calculus: Early Transcendentals, Volume II

Multivariable Calculus: Multivariable Calculus, 7th edition

Statistics: Statistics, 11th Edition

My Book: Quantum Mechanics for Starters