A branch of artificial intelligence (AI) based on the notion that machines (software applications) can learn from examples and can teach themselves how to solve specific problems without being programmed manually. Recent successes in artificial intelligence have resulted from exponential growth in computational power as well as data generation, allowing machine learning (ML) to spread to other sectors beyond computing sciences. Some of the breakthroughs in data-driven AI are already present in our day-to-day lives, in the form of spam filters, automated fraud detection in financial transactions or insurance claims, conversational agents, digital personal assistants, visual search and photo tagging, speech recognition, and recommendation systems. Other fields in which machine learning is key to decision making are medicine, astronomy, biology, chemistry, genetics, finance, politics, and industrial robotics. In the future, machines will become smarter and will continue to significantly transform our lives (see illustration). In fact, when presented with sufficient data, software applications can even learn novel things that no programmer or domain expert could teach them explicitly. See also: Artificial intelligence; Computer programming; Software
How do machines learn?
Machines are able to learn how to solve specific problems through algorithms applied to data. Some of the ideas have been around for a long time, and many of the techniques used in machine learning are based on concepts invented centuries ago. For example, Thomas Bayes’s theorem, published in 1763, is the foundation of the Naïve Bayes classification algorithm, a very powerful and scalable machine-learning technique. However, it is the newly developed computer technologies that enable learning algorithms to be applied to large volumes of data and thus achieve adequate performance. As a relatively recent subfield of computer science, ML took off in the 1950s. With computational power on the rise, it became easier than ever to process larger and larger amounts of information to transform data into knowledge, and the trend continues to this day. A major innovation, for example, was the use of graphics cards as specialized processors: implementing machine-learning algorithms on GPUs (graphics processing units) has sped up the training process by hundreds of times. See also: Algorithm
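For reference, Bayes’s theorem relates the probability of a class given the observed data to the probability of the data given the class:

```latex
P(C \mid x) = \frac{P(x \mid C)\,P(C)}{P(x)}
```

Here, P(C) is the prior probability of class C, P(x | C) is the likelihood of observing the data x under that class, and P(x) acts as a normalizing constant. The Naïve Bayes algorithm applies this rule to feature vectors under a simplifying independence assumption discussed below.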
In addition to computational resources, ML algorithms require data to learn from, and they can learn quite well if presented with considerable amounts of it. In the current “big data” era, rapidly creating and collecting large amounts of data is not difficult. However, the data are usually unstructured, and the challenge is to transform them into structured data that ML algorithms can be trained on. Creating the right dataset, such that learning algorithms can find a good signal in it, is arguably the most difficult task ML practitioners face. The task is called “feature engineering,” and it represents the process of using domain knowledge to transform raw data into features that are important to the learning problem and indicative of the pattern to discover. The examples input to the algorithm are usually represented as vectors of discrete or continuous features. The examples are also labeled, which means they are already assigned to various discrete categories (or classes) by domain experts. The process of labeling data is time-consuming and often requires human experts. In classification, which is one of the most established and popular uses of ML, the learning algorithm produces a model, called a classifier, which can essentially map examples to classes. See also: Big data
For example, in a spam-filtering application, the email messages in the training dataset are labeled as either “spam” or “non-spam.” Each labeled email constitutes a training document, and the documents are usually preprocessed to a specific format required by the machine-learning algorithm. This transformation entails several natural-language processing (NLP) steps, such as (1) tokenization (extraction of the words in the email body), (2) lemmatization (reduction of words to their root forms), (3) stop-word removal (elimination of very frequent words that are irrelevant to the classification), and (4) representation (conversion of the resulting set of words into a numerical feature vector). After applying these NLP steps to a training corpus of labeled emails, the resulting dictionary (vocabulary), representing the dimensions of the vector space, is used to transform all messages in the training corpus into feature vectors, in which each vector dimension corresponds to a separate feature or term. In the example below, the term frequency scheme is used, and each dimension corresponds to the number of times a particular term appears in the document, or zero if a term does not occur in the document. See also: Natural language processing
- Vocabulary: [bag, sunglass, expensive, low, price, . . .]
- Email: “Save up to 90% on designer sunglasses now! Select bags and sunglasses at very low prices. . . .”
- Feature vector representation: [1, 2, 0, 1, 1, . . .]
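A minimal sketch of this preprocessing pipeline, assuming plain Python with a deliberately crude lemmatizer and stop-word list (a real system would use an NLP library for these steps), reproduces the feature vector above:

```python
import re

# Toy vocabulary and stop-word list for the spam-filtering example;
# a real system would build these from the whole training corpus.
vocabulary = ["bag", "sunglass", "expensive", "low", "price"]
stop_words = {"up", "to", "on", "now", "and", "at", "very"}

def lemmatize(token):
    # Very crude plural stripping, enough for this toy example.
    if token.endswith("sses"):
        return token[:-2]                  # "sunglasses" -> "sunglass"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]                  # "prices" -> "price"
    return token

def to_term_frequency_vector(text):
    # (1) tokenization: lowercase the text and extract alphabetic words
    tokens = re.findall(r"[a-z]+", text.lower())
    # (2) lemmatization and (3) stop-word removal
    terms = [lemmatize(t) for t in tokens if t not in stop_words]
    # (4) representation: count how often each vocabulary term occurs
    return [terms.count(term) for term in vocabulary]

email = ("Save up to 90% on designer sunglasses now! "
         "Select bags and sunglasses at very low prices.")
print(to_term_frequency_vector(email))  # [1, 2, 0, 1, 1]
```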
A learning or prediction algorithm is then trained on the preprocessed labeled dataset, with the objective of creating a model that can be used to predict, or classify, new incoming messages as either “spam” or “non-spam.” This is called “generalization,” the ultimate goal of any ML algorithm. The learned classifier must subsequently be able to generalize beyond the examples it has encountered during training. Typically, the more and better data the algorithm sees, the better it becomes at generalizing to new, unseen samples. However, data alone are not enough, even in large quantities. The learning algorithm must also make some assumptions, or follow some constraints, to be able to learn a useful function (or mapping from input to desired output) that allows generalization beyond the training data. For example, the naïveté of the Naïve Bayes classifier comes from the assumption that features are independent of each other, given the class. In other words, the presence of a particular feature in a class is unrelated to the presence of any other feature, which does not hold in many real-world applications. Nevertheless, the algorithm is particularly well suited to problems in which the dimensionality of the input space is high. For text classification, despite the obviously incorrect assumption that the words in a document are independent of each other given the class of the document, the Naïve Bayes classifier has demonstrated outstanding results on numerous occasions, and it still represents a reliable baseline.
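A minimal sketch of this training-and-prediction step, assuming the scikit-learn library and reusing the toy term-frequency representation from the example above (the training vectors and labels here are purely illustrative, not a real corpus):

```python
from sklearn.naive_bayes import MultinomialNB

# Toy term-frequency vectors over the vocabulary
# [bag, sunglass, expensive, low, price] and their labels;
# a real spam filter would be trained on thousands of labeled emails.
X_train = [
    [1, 2, 0, 1, 1],   # spam: designer sunglasses at low prices
    [0, 3, 1, 0, 2],   # spam: more promotional wording
    [0, 0, 0, 0, 0],   # non-spam: contains none of the vocabulary terms
    [0, 0, 1, 0, 1],   # non-spam: mentions "expensive" and "price" once
]
y_train = ["spam", "spam", "non-spam", "non-spam"]

# Training: Naive Bayes estimates per-class term probabilities,
# under the assumption that terms are independent given the class.
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Generalization: classify a new, unseen message by its feature vector.
new_message = [[0, 1, 0, 1, 1]]
print(classifier.predict(new_message))  # ['spam']
```

The learned model assigns the new message to whichever class gives it the higher posterior probability under Bayes’s theorem.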
Other algorithms make different assumptions, for example, that the data points (examples) are linearly separable, such that if they were to be plotted in a high-dimensional space, there would be a hyperplane that acts as a boundary between the clouds of points belonging to different classes. Once the hyperplane is learned (discovered from the training data by means of function optimization), new incoming examples are plotted in the same space, and the side of the hyperplane on which they fall determines their class. Another common assumption is that similar examples have the same class, meaning that if they lie in proximity to each other in the input space, they are likely to belong to the same category. Creating a powerful, reliable prediction model is not as straightforward as it sometimes may seem. There are many intricacies to learning algorithms, and it can take numerous iterations before a good result is achieved. Probably the hardest open problem in machine learning remains that of “overfitting.” This happens when the algorithm produces a model that fits the training data too well, possibly modeling the peculiarities of the sample training data rather than the real signal, such that it fails to predict future examples reliably. Nonetheless, with good practices and powerful algorithms, spectacular advances have been recorded.
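One standard safeguard against overfitting, sketched below under the assumption that the scikit-learn library is available and with synthetic data standing in for a real feature-engineered dataset, is to hold out part of the labeled examples and compare the model’s accuracy on the training set with its accuracy on the held-out set; a large gap between the two is a symptom of overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic, illustrative dataset: 200 examples with 20 features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hold out 25% of the labeled examples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A linear classifier learns a separating hyperplane from the training data.
model = LinearSVC()
model.fit(X_train, y_train)

# The side of the hyperplane a point falls on determines its predicted class;
# comparing these two scores reveals how well the model generalizes.
print("training accuracy:", model.score(X_train, y_train))
print("held-out accuracy:", model.score(X_test, y_test))
```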
At the forefront of ML research is deep learning, a set of powerful approaches loosely modeled after the human brain and its thinking process. In deep learning, information is passed through a network, or layers of abstraction, starting with the input layer, continuing through hidden layers (which typically apply linear transformations followed by nonlinear activation functions to the incoming data), and ending with the output layer. Deep learning also has the ability to learn good data representations and automatically construct features. Recent breakthroughs in deep learning include beating a human Go champion, discovering an eight-planet solar system, and even winning against humans at poker, which, in addition to knowing the right moves, requires bluffing, a technique used to deceive opponents.
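A minimal sketch of the forward pass through such a network, assuming NumPy and randomly initialized weights (a real network would learn its weights from data by gradient-based optimization rather than use random ones):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Nonlinear activation applied after each hidden layer's linear
    # transformation; without it, stacked layers would collapse into
    # a single linear mapping.
    return np.maximum(0.0, z)

# Layer sizes: 5 input features, two hidden layers of 8 units, 2 output classes.
sizes = [5, 8, 8, 2]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    # Hidden layers: linear transformation followed by the nonlinearity.
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    # Output layer: softmax turns raw scores into class probabilities.
    logits = x @ weights[-1] + biases[-1]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

print(forward(np.array([1.0, 2.0, 0.0, 1.0, 1.0])))  # two class probabilities
```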
It is widely believed that ML agents will become so complex and powerful that they will eventually surpass most domain experts at performing specific tasks. A self-driving car will be more reliable than any human driver, as it will never take its eyes off the road or have moments of distraction. An unbiased agent that could detect disease would offer a great boost to clinicians’ accuracy. Bioinformatics and biomedical problems can be quite difficult to solve because sufficient knowledge about the phenomenon of interest simply does not exist.
Machine learning is an increasingly popular tool used to investigate the underlying mechanisms of, and interactions between, biological processes and diseases, and it is now an essential step in any biomarker discovery process. An ML system can theoretically gain a complete understanding of an area or problem, such as a human disease and all its symptoms, because it will be able to learn in real time from every bit of data that becomes available. It will continuously update and improve its knowledge and stay current, thus ultimately exceeding any medical professional’s ability to diagnose that disease. However, there is a trade-off between model accuracy and model interpretability. Although accurate results are desirable in aiding medical personnel with decision making, the disadvantage of using advanced ML is that some of the most powerful algorithms are so complex that they are still viewed as intractable “black boxes.” This represents a barrier to interpreting their results, which is crucial in biomedical fields.