By Ashish Bansal, Senior Director, Enterprise Merchant Insights Lead, Capital One
What is Machine Learning?
Tom Mitchell in his book Machine Learning provides the following definition:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
In other words, a program that improves at its task by performing more of it can be considered machine learning. Consider a program that looks at credit card transactions and detects fraud. It could be built as a set of rules that depend on the transaction amount, the location of the transaction, the time of day, and so on; some combination of these is classified as fraud. The rules were coded into software by humans analyzing previous patterns. This program classifies transactions as fraud or not fraud, but processing more transactions does not improve its accuracy. To improve it, offline analysis must be performed and new rules coded, or existing rules modified, before the algorithm gets better.
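Such a rule-based detector might look like the following sketch. The thresholds and field names here are illustrative assumptions, not a real fraud system; the point is that the logic is fixed by hand and does not change no matter how many transactions flow through it.

```python
# A minimal sketch of a hand-coded, rule-based fraud detector.
# Field names and thresholds are made up for illustration.

def is_fraud(transaction: dict) -> bool:
    """Classify a transaction as fraud using fixed, human-written rules."""
    amount = transaction["amount"]
    hour = transaction["hour"]        # hour of day, 0-23
    foreign = transaction["foreign"]  # True if made outside the home country

    # Rule 1: very large purchases made abroad are flagged.
    if amount > 5000 and foreign:
        return True
    # Rule 2: large purchases in the middle of the night are flagged.
    if amount > 1000 and hour < 5:
        return True
    return False

print(is_fraud({"amount": 6000, "hour": 14, "foreign": True}))   # True
print(is_fraud({"amount": 20, "hour": 3, "foreign": False}))     # False
```

To change this program's behavior, a human has to analyze new fraud patterns offline and edit the rules.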
In the machine learning regime, every transaction correctly or incorrectly identified as fraud results in a small improvement to the algorithm. Consequently, after processing a large number of transactions, the algorithm becomes very good at detecting fraud, and it adapts as fraud techniques change. This methodology is appropriate for many classes of problems, such as translating between languages, detecting objects in images, diagnosing diseases from X-rays, validating profiles of people, or estimating housing prices. Given a problem, it is always important to ask whether experience at the task should improve performance. Not every problem fits this description.
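The contrast with the rule-based approach can be sketched with a toy online learner: each labeled transaction nudges the model's weights a little, so the model improves as more transactions flow through it. This is a plain-Python logistic regression trained by stochastic gradient descent on synthetic data; the features and the fraud rule generating the labels are assumptions for illustration only.

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

class OnlineFraudModel:
    """Tiny logistic regression updated one labeled example at a time."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def update(self, x, label):
        """One SGD step on a single (transaction, label) pair."""
        err = self.predict_proba(x) - label  # gradient of the log-loss
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * err * xi
        self.b -= self.lr * err

random.seed(0)
model = OnlineFraudModel(n_features=2)

# Synthetic stream: features are (scaled amount, foreign flag); a
# transaction is "fraud" when their sum is high. Every example updates
# the model slightly -- no human rewrites any rules.
for _ in range(2000):
    x = [random.random(), random.choice([0.0, 1.0])]
    label = 1.0 if x[0] + x[1] > 1.2 else 0.0
    model.update(x, label)

print(model.predict_proba([0.9, 1.0]) > 0.5)  # high-risk: flagged
print(model.predict_proba([0.1, 0.0]) < 0.5)  # low-risk: passed
```

After enough examples the learned boundary approximates the pattern in the data, and it keeps adjusting as the stream continues.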
The corollary of the above is that not every problem is a machine learning problem. Many problems can be solved with rule-based systems. For example, validating form fields and the correctness of the data being entered can simply be coded as rules.
Machine thinking requires building a traditional software engineering solution to the problem using rules or other techniques, irrespective of whether a machine learning solution exists. At the very least, this provides a baseline that the machine learning solution should beat. Then iterate on the machine learning solution, starting with a simple algorithm (like linear or logistic regression) before moving to more complicated methods like Random Forests or Neural Networks/Deep Learning. The key point to remember is that a complex machine learning algorithm should provide an incremental benefit in accuracy at the task.
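This baseline-first discipline can be expressed as a simple acceptance check: score a trivial baseline on held-out data, and only accept a candidate model that beats it. The held-out examples and the two "models" below are illustrative stand-ins, not real implementations.

```python
# Sketch of the "baseline first" discipline: an ML model is accepted
# only if it beats a trivial rule-based baseline on held-out data.
# Data and models are made-up stand-ins for illustration.

def accuracy(model, data):
    """Fraction of (feature, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

# Held-out examples: (feature, label) pairs; label is 1 when feature > 0.5.
holdout = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.45, 0), (0.7, 1)]

baseline = lambda x: 0              # trivial rule: always predict one class
candidate = lambda x: int(x > 0.5)  # stand-in for a trained ML model

base_acc = accuracy(baseline, holdout)
cand_acc = accuracy(candidate, holdout)
print(f"baseline={base_acc:.2f} candidate={cand_acc:.2f}")
assert cand_acc > base_acc, "ML model must beat the rule-based baseline"
```

The same check applies when stepping up in complexity: a Random Forest or neural network should in turn beat the logistic regression it replaces.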
A lot of recent progress has been attributed to Deep Learning. The state of the art today depends on having large amounts of data available for deep learning to work effectively, and not all problems and organizations have such large data sets. In the current state of data science, 60-70 percent of the time is spent wrangling data and only 30-40 percent on modelling and tuning, so it is critical to understand the data sets before building models. Data usually suffers from quality issues, incomplete sets, imbalanced labels, or skewed distributions of data attributes. Rectifying this requires cleansing the data, imputing missing values, and normalizing the skews. This is a critical step, and managing it well is essential to getting value from machine learning.
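Two of the wrangling steps mentioned above can be sketched in a few lines: imputing missing values with the median, and taming a skewed distribution with a log transform. The sample transaction amounts are made up for illustration.

```python
import math
from statistics import median

# Raw amounts with missing values (None) and one extreme outlier,
# giving the heavy right skew common in transaction data.
raw_amounts = [12.0, 15.0, None, 18.0, 22.0, None, 9500.0]

# 1. Impute missing values with the median of the observed values
#    (the median is robust to the outlier, unlike the mean).
observed = [a for a in raw_amounts if a is not None]
fill = median(observed)
imputed = [a if a is not None else fill for a in raw_amounts]

# 2. Normalize the skew with log1p, which compresses extreme outliers.
transformed = [math.log1p(a) for a in imputed]

print(f"median fill value: {fill}")
print(f"max/min spread before: {max(imputed) / min(imputed):.0f}x, "
      f"after: {max(transformed) / min(transformed):.1f}x")
```

Real pipelines do this at scale with libraries such as pandas, but the ideas are the same: fill the gaps, then reshape the distribution so models are not dominated by outliers.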
Deep Learning and Deep Thinking
Neural networks and deep learning are dominating the airwaves at the moment, and one might feel that by not using these techniques they are missing out. Let's put this in perspective with the Kaggle 2017 State of Data Science and Machine Learning survey:
This chart shows that the top three methods in use are not deep learning methods. State-of-the-art deep learning methods require a lot of data to train today. Goodfellow et al. in their Deep Learning book note:
“As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples.”
The key phrase to note here is per category. Converting a task into a supervised learning problem, and making sure there are enough examples to train a deep learning algorithm, can be very challenging.
The key to success in applying the latest algorithms is to cast problems as supervised problems. Supervised learning is the class of machine learning problems where the desired output is known along with the input data. For example, the input could be an image and the desired output the label ‘cat’. Both of these need to be fed into a deep learning network, many thousands of such pairs in fact, for the algorithm to learn and work effectively. Another way to phrase the problem is to consider pairs of the form (A -> B), where B represents one of the labels the machine is learning to discriminate. In the case of the cat detector, B could take one of two values (cat, not cat). During training, many (A, B) pairs are provided. Once the algorithm is trained, an input of the form (A, ?) is passed in and the machine guesses the correct label.
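The (A -> B) framing can be illustrated with a toy classifier: train on labeled (A, B) pairs, then hand the model (A, ?) and let it fill in B. The 2-D features below are made-up stand-ins for image features, and the nearest-neighbor rule is a deliberately simple substitute for a deep network; only the framing carries over.

```python
# Toy illustration of the (A -> B) framing of supervised learning.
# Features are made-up 2-D stand-ins for image features.

def nearest_neighbor(train_pairs, a):
    """Guess B for a new A by copying the label of the closest training A."""
    def dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    _, label = min(train_pairs, key=lambda pair: dist(pair[0], a))
    return label

# Training data: (A, B) pairs with B in {"cat", "not cat"}.
pairs = [
    ((0.9, 0.8), "cat"), ((0.8, 0.9), "cat"),
    ((0.1, 0.2), "not cat"), ((0.2, 0.1), "not cat"),
]

# Inference: (A, ?) -> the model supplies the missing label.
print(nearest_neighbor(pairs, (0.85, 0.75)))  # cat
print(nearest_neighbor(pairs, (0.15, 0.05)))  # not cat
```

Whatever the underlying algorithm, the contract is the same: many (A, B) pairs in during training, a guessed B out at inference time.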
To be successful, start by casting your problem as a supervised learning problem. This is often the hard part of Machine Thinking. Once you can do that, there is very little to stop you from building amazing machine learning solutions and products.