Wednesday, September 5, 2018

How does a machine learn?
Preamble: This article aims to explain “Machine Learning” basics in an intuitive fashion for developers.

Introduction

Today there is a huge hype about “Artificial Intelligence”, and especially about “machine learning”. This programming paradigm is very different from the classic one. Traditional programming is based on the algorithm principle: a formal description of the sequence of unitary operations to be performed in order to accomplish the desired task. “Machine Learning” (ML), on the other hand, is about using examples of an already-solved task to train a “model”, like a teacher showing examples to his students. This is considered a more powerful programming paradigm, since many tasks cannot be described as a formal sequence of unitary operations. Think about recognising a familiar face, identifying someone’s voice, extracting the sentiment from a text, detecting trolls on a forum, etc. All these tasks can be described with examples, but not with a formal sequence of unitary operations.
This covers why machine learning is so interesting. However, it does not explain how machines learn...
Spoiler!: The trick to machine learning is to transform a learning problem into an optimization problem (something engineers deal with easily). The rest of this article explains how, while clarifying what an “example” is, what a “model” is, what “training” means, and why all of this sometimes fails.

Ok, this introduction already raises some questions.

What do you mean by “example”?

It’s a pair of (input, output) information. The input side is the information you have about the task, for instance a person’s picture for an identification task. The output is the solution of that given task (i.e. the name of the person in the picture). This information is transformed into numbers in order to be fed into a “model”. Typically a picture is a set of pixels, which are numerical values. The expected output in this example, on the other hand, is a name, which is not numerical. It must therefore be transformed into a number, in this case an index into a list of all names. This list is necessary in order to show a human-readable answer at the end of the process.
Generally, when training a model, a training database is used: a large number of (input, output) pairs, enough to obtain a viable model for the targeted task.
Since the input & output may have more than one dimension, we often refer to them as input & output vectors. But for the sake of simplicity, let’s consider them as one-dimensional values (scalars), which allows an interesting graphical representation: laying the input on the x axis and the output on the y axis shows the training database as a point cloud.
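For instance, here is a minimal sketch (in Python, with made-up names and a fake image, purely for illustration) of how such a pair could be turned into numbers:

    import numpy as np

    # Hypothetical list of all known names; the model will answer with an index into it.
    names = ["Ada", "Grace", "Linus"]

    # A picture is already numerical: a grid of pixel intensities.
    picture = np.random.rand(28, 28)    # a fake 28x28 grayscale image
    input_vector = picture.flatten()    # flattened into a plain vector of numbers

    # The expected output "Grace" becomes the number 1 (its index in the list).
    expected_output = names.index("Grace")

    # One training example is simply the pair (input, output).
    example = (input_vector, expected_output)

    # At the end of the process, a numerical answer is mapped back to a readable name.
    predicted_index = 1                 # pretend the model answered 1
    print(names[predicted_index])       # -> Grace
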

So what exactly is a “model”?

Considering the above figure, the “model” is a curve that fits the point cloud. Specifically, it’s a “class of curves”, since there is not a unique solution. Consider the model as a type of curve that has the ability to fit the point cloud, with given properties.
One can see the machine learning paradigm as an interpolation problem: given a new input value (X), what would be a credible output value (Y)?
A model can be seen as an “empty brain”. Since the input & output of the task are represented as numbers, one can see the model as a mathematical formula. The curve is in fact a mathematical function, a parametric function. The parameters are values that can be estimated in order to fit the point cloud, but they are not set for the moment (we will see that later).

model_parameters(input) = output


This mathematical function is where the magic happens. It transforms the input number into a numerical output (the result) that will, at a later stage, be transformed back into usable information (for instance an index, which is mapped back to a name from a list). At this stage the model is still “empty”, so the answer is random (or undefined). A model needs to be “trained” in order to provide meaningful answers.
There are many different models ((deep) neural networks, linear regression, etc.). All are mathematical functions with different properties, more or less suited to different kinds of tasks.
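As a deliberately tiny illustration (a toy linear model, nothing you would use for face recognition), such a parametric function could look like this:

    def model(x, a, b):
        # A tiny parametric function: the whole "brain" is the pair of parameters (a, b).
        return a * x + b

    # Before training, the parameters are arbitrary, so the answer is meaningless.
    print(model(2.0, a=0.0, b=0.0))   # -> 0.0: an "empty brain" answer
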

What does it mean to “train a model with examples”?

Training a model means finding the parameters of the underlying formula so that it produces correct answers for given inputs. The output of a model is numerical, and the expected output is known (remember: the example task is already solved outside of the model), which means we can express the error a model makes as a numerical value. We are then facing a “classic” equation-solving problem:
model_parameters(input) = output
At the beginning, the model parameters are unknown and usually initialised at random, hence producing an irrelevant output at this stage. We can measure the error of the model as:
Error = model_parameters(input) - expected output
In practice we use more complex error formulas, but the main idea remains. When the error is minimized, the learning is done.
The error is basically the distance between the point cloud and the curve. The learning process is a step-by-step optimisation of the parameters of the model, in order to fit the point cloud.
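As a minimal sketch of that step-by-step optimisation, assuming the toy linear model above and a synthetic point cloud, plain gradient descent on a squared error could look like this:

    import numpy as np

    # A synthetic point cloud: y is roughly 3*x + 1, plus some noise.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=100)
    y = 3.0 * x + 1.0 + rng.normal(0, 1, size=100)

    a, b = 0.0, 0.0          # arbitrary initialisation: the model is still "empty"
    learning_rate = 0.01

    for step in range(2000):
        prediction = a * x + b
        error = prediction - y               # error = model(input) - expected output
        # Gradients of the mean squared error with respect to a and b.
        grad_a = 2 * np.mean(error * x)
        grad_b = 2 * np.mean(error)
        # Nudge the parameters in the direction that reduces the error.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b

    print(a, b)   # ends close to 3 and 1: the curve now fits the point cloud
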




The same process can be viewed as this block diagram:


The learning problem becomes an optimization problem!
Optimization is an old mathematical field, well known to engineers. Given an input/output dataset and a model (with unknown parameters), the optimization process outputs model parameter values.
Common optimization strategies are: stochastic gradient descent, Markov chain Monte Carlo (MCMC), BFGS, …
This process can be quite long and heavy, depending on the model complexity and the size of the learning database (and it takes far more than 3 learning steps).
Once a model, its parameters and an input are known, one can easily compute the output for new and previously unseen input data: it’s just evaluating a function.
This phase is very fast (compared to the learning phase).
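Continuing the toy linear example (the parameter values below simply stand in for whatever the training phase found), prediction is nothing more than a function call:

    a, b = 3.0, 1.0                    # parameter values found by the (slow) training phase
    new_input = 4.2                    # a previously unseen input
    prediction = a * new_input + b     # predicting is just evaluating the function
    print(prediction)                  # -> 13.6, a credible output for this input
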


And sometimes it fails!
One important thing the point cloud analogy helps to understand is that there is more than one solution, because more than one curve can fit a point set. The fewer training points one has, the more different solutions are possible. That’s why ML needs a lot of data to produce accurate results. As a consequence, the GAFAM companies, which hold most of the internet’s data, have a great advantage.
And sometimes the learning procedure can still fail, even when the measured error is very small. This graph shows how a model can fit the training data with a very low error, but still be totally wrong when tested on new data.
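One hypothetical way to reproduce this effect is to fit a deliberately over-flexible model (here a degree-7 polynomial) to very few noisy points: the error on the training points is essentially zero, yet the answer for a new input can be absurd:

    import numpy as np

    rng = np.random.default_rng(1)
    x_train = np.linspace(0, 5, 8)
    y_train = 3.0 * x_train + 1.0 + rng.normal(0, 0.5, size=8)

    # A degree-7 polynomial has enough parameters to pass through all 8 points.
    coeffs = np.polyfit(x_train, y_train, deg=7)

    train_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    print(train_error)              # essentially zero: the curve fits the training cloud

    # But on a new, unseen input the answer can be absurd.
    print(np.polyval(coeffs, 6.0))  # usually far from the underlying value of about 19
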


What is the main difference from classical programming?
  • ML gives a programmer the ability to train a model on a task that he is unable to do by himself (assuming he can obtain a training dataset for that task)!
A classical developer can only make programs that automate things he could do himself. It’s even more powerful than human learning: a teacher can’t teach you something he doesn’t know! A ML practitioner trains a computer to do something he can’t do himself. This is a real shift in human potential.
For instance, I built DNA analysis software while lacking any biology degree.
  • Algorithms used in classical programming are often mathematically proven, which makes them more reliable. Most of the time, bugs come from errors in the implementation (i.e. the translation of the algorithm principles into a programming language).
ML programs cannot be proven to be 100% accurate, but they tackle problems that cannot be solved at all with classical programming. One still has to handle the fact that there will be errors. Good ML practitioners are able to measure the validity of their models, and know how to handle the risks and consequences of errors (a minimal hold-out evaluation is sketched just after this list).
(One third of an introductory course on ML is about evaluating a model.)
  • The model and the optimisation phase still rely on classical programs, which is why ML practitioners are mainly computer scientists... with a slightly different profile. They are more math-friendly than typical developers. This kind of profile is quite rare compared to the large number of developers, which means they are in high demand. ML is not new, but it gained popularity quite recently, with the shift of all sorts of business activities (e-commerce, advertisement, ...) toward the web, where data collection is cheap & easy.
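As a bare-bones illustration of that evaluation habit (a hypothetical hold-out split on the same kind of toy linear data), the model is measured on points it never saw during training:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, size=100)
    y = 3.0 * x + 1.0 + rng.normal(0, 1, size=100)

    # Hold out 20% of the examples: the model never sees them during training.
    x_train, x_test = x[:80], x[80:]
    y_train, y_test = y[:80], y[80:]

    # Fit a line on the training part only.
    a, b = np.polyfit(x_train, y_train, deg=1)

    test_error = np.mean((a * x_test + b - y_test) ** 2)
    print(test_error)   # the honest measure: error on data the model has never seen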