Introduction to Machine Learning – 1
Machine learning has become one of the most popular subjects in recent years, and it looks set to stay popular for a long time. The term was once confined to academic research, but today it has practical uses in most commercial and social applications. For example, Netflix suggests new movies and series that you would love to watch, Google offers an excellent speech recognition API, and Facebook asks you to tag your friends’ faces in photos. So, what is machine learning?
Machine learning is the study of giving computers the ability to learn without being explicitly programmed. Rather than telling the computer every step to take, we give it a model that lets it make sensible decisions in situations it has not seen before. Instead of static program instructions, a learning model is developed and used to make predictions and decisions about new situations.
Machine learning emerged as a field of science that also encompasses pattern recognition (indeed, pattern recognition deserves a post of its own). When I was doing postdoctoral research in the USA in 2016, my advisor once told me that “what they now call machine learning is what they were calling pattern recognition a few years ago; it’s just rebranding”. It is therefore difficult to separate machine learning, pattern recognition, data mining and computational statistics. In most cases, they overlap and intersect.
Machine learning is used in many areas where static algorithms are not feasible, such as spam filtering, network intrusion detection, optical character recognition, biometrics (fingerprint, face and retina recognition), search engines, computer vision and many more.
Machine learning tasks can be divided into three categories according to their feedback mechanisms:
- Supervised learning: An expert gives the computer input data together with the corresponding desired outputs. The aim is to learn the relationship between input and output, which yields a generalized model of the problem. Curve fitting is the basic method for this purpose.
- Unsupervised learning: No labels (desired outputs) are given to the algorithm, which is expected to structure the input data and find the hidden patterns on its own.
- Reinforcement learning: The algorithm interacts with a dynamic environment and receives rewards and punishments (as feedback) while navigating the problem space.
In addition, in semi-supervised methods some of the output data is missing: only a small portion of the input data is labeled, and the majority remains unlabeled (no corresponding output values).
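To make the supervised case concrete, here is a minimal sketch of curve fitting with a straight line. The data (hours studied and exam scores) and the function names `fit_line` and `predict` are invented for illustration; a real problem would have noisy data and probably a richer model.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned relationship to an input the model has not seen."""
    slope, intercept = model
    return slope * x + intercept


# Made-up labelled examples: each input comes with its desired output.
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [52.0, 61.0, 70.0, 79.0, 88.0]

model = fit_line(hours_studied, exam_score)
print(predict(model, 6.0))  # prediction for an unseen input
```

The labelled pairs play the role of the expert's input and desired output, and the fitted line is the generalized model that is then queried with a new input.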
Machine learning tasks can be categorized as follows according to the desired outputs:
- Classification: Inputs are divided into two or more classes, and the algorithm should assign unseen inputs to the proper class accurately. This is accomplished with supervised learning methods. Grouping e-mails into “spam” and “not spam” is a good example of classification.
- Clustering: Grouping the input data without supervision. Unlike classification, the groups are not pre-defined.
- Regression: A kind of supervised learning in which the outputs are continuous (not discrete). For example, predicting tomorrow’s weather from previous meteorological data.
- Dimension reduction: Simplifying the inputs by mapping them to a lower-dimensional space. You can think of this procedure as compressing the data. For example, suppose we are trying to develop a model that diagnoses a disease from individuals’ blood parameters, and we have obtained 100 different blood parameters from tens of thousands of people. We then have 100-dimensional input data. Instead of processing this huge data set, we can use a reduced form that may retain about 90% of the information in the input features within only 3 dimensions. In this way we reduce the dimension of the problem from 100 to 3 while giving up only a small amount of information.
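The dimension reduction idea can be sketched on a much smaller scale than the blood-parameter example. The code below is only an illustration under stated assumptions: it uses synthetic 3-dimensional data that lies close to a single line, and it finds the best 1-dimensional direction with principal component analysis computed by power iteration. PCA is one common technique, not necessarily the one you would pick for a real problem, and the helper names are made up for this sketch.

```python
import math
import random


def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def top_component(cov, iterations=200):
    """Power iteration: dominant eigenvector and eigenvalue of cov."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j]
                     for i in range(d) for j in range(d))
    return v, eigenvalue


# Synthetic data: three features driven by one hidden factor t plus small noise.
rng = random.Random(0)
rows = []
for _ in range(200):
    t = rng.gauss(0.0, 1.0)
    rows.append([t + rng.gauss(0.0, 0.1),
                 2.0 * t + rng.gauss(0.0, 0.1),
                 -t + rng.gauss(0.0, 0.1)])

centered = center(rows)
cov = covariance(centered)
direction, variance_kept = top_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))

# Each 3-dimensional sample is replaced by a single number.
reduced = [sum(row[j] * direction[j] for j in range(3)) for row in centered]
print(f"share of variance kept by 1 of 3 dimensions: "
      f"{variance_kept / total_variance:.1%}")
```

The printed share plays the same role as the "about 90%" in the example above: it tells you how much of the spread in the original data survives after the dimensions are dropped.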
The basic goal of a machine learning task is generalization from experience: working accurately on new, unseen samples by using the experience obtained from old ones. One of the biggest problems in machine learning is input data that is missing (not covering the entire input space) or inaccurate (wrongly measured or labeled). Assume that one of the parameters of your input data ranges from 0 to 100. What happens when your model comes across a value of -100 or 1000 for this parameter?
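One simple precaution, sketched below, is to remember the range of every input parameter seen during training and to flag new inputs that fall outside it before trusting the prediction. The function names `fit_ranges` and `out_of_range` and the sample values are hypothetical; this is a sanity check, not a cure for missing training data.

```python
def fit_ranges(training_rows):
    """Record the (min, max) seen for each input parameter during training."""
    columns = list(zip(*training_rows))
    return [(min(column), max(column)) for column in columns]


def out_of_range(ranges, row):
    """Return the indices of parameters that lie outside the training range."""
    return [index
            for index, (value, (low, high)) in enumerate(zip(row, ranges))
            if not low <= value <= high]


# Made-up training data: the first parameter covers 0 to 100.
training_rows = [[0, 1.0], [50, 1.5], [100, 2.0]]
ranges = fit_ranges(training_rows)

print(out_of_range(ranges, [50, 1.5]))    # every parameter is familiar
print(out_of_range(ranges, [1000, 1.5]))  # first parameter was never seen this high
print(out_of_range(ranges, [-100, 3.0]))  # both parameters are outside the range
```

What to do with a flagged input (reject it, clip it, or collect more data) depends on the application.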
Another important problem in machine learning is measuring how much the model has learned, or in other words, when we can say “ok, finally it has learned well”. If your model is less complex than the real problem, modelling cannot succeed: no matter what you do, the training error will not converge to zero. This is called an underfit model. Conversely, if your model is more complex than the real problem, you get an overfit model, whose generalization ability is unsatisfactory. Ideally, the complexity of the model matches that of the real problem, but achieving this balance is very difficult in practice. It is therefore recommended to use a somewhat more complex model together with extra measures to prevent overfitting. For example, instead of dividing the data into two groups, train data and test data (train data is used for learning and test data for measuring the model’s generalization ability), it may be wiser to divide it into three groups: train data, validation data and test data. While learning from the train data, the validation data can be checked independently, and training can be stopped as soon as a tendency to overfit is detected, even if the training error has not yet converged to zero.
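The three-way split and the stopping rule described above can be sketched as follows. Everything here is an assumption made for illustration: the 60/20/20 proportions, the `patience` parameter, the function names, and especially the validation errors, which are invented numbers standing in for a real model's measurements.

```python
import random


def split_three_ways(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=100, patience=3):
    """Stop once the validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()           # learn from the train data only
        error = validation_error()  # checked independently of training
        if error < best_error:
            # A real implementation would also save the model parameters here.
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break                   # overfitting tendency detected
    return best_epoch, best_error


samples = list(range(10))
train, validation, test = split_three_ways(samples)

# Invented validation errors: they fall at first, then rise as overfitting sets in.
simulated_errors = iter([0.90, 0.60, 0.40, 0.35, 0.37, 0.41, 0.50, 0.60])
best_epoch, best_error = train_with_early_stopping(
    train_one_epoch=lambda: None,  # placeholder for a real training step
    validation_error=lambda: next(simulated_errors),
    max_epochs=8,
)
print(best_epoch, best_error)  # 4 0.35
```

The test set is never touched during training or stopping decisions, so it remains a fair measure of the generalization ability of the final model.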
I had planned to continue this post with brief details of the important machine learning techniques, but I realized that the subject is very wide and the post was getting unnecessarily long. I am therefore ending this post here and turning it into a series; I will give the details in the next episodes.
You can find the Wikipedia entry that I used to write this text in the references. In addition, Ethem Alpaydın’s book Introduction to Machine Learning is the first thing to read. I also recommend the free online Machine Learning class by Andrew Ng (professor at Stanford University, chief scientist of Baidu and co-founder of Coursera).
See you in the next post…
- Alpaydın, Ethem. Introduction to Machine Learning, London: The MIT Press.
- Ng, Andrew. Machine Learning, Coursera.