In an earlier blog post, we talked about how machine learning is used in social media analytics. In this post, we’re going to review machine learning (ML) basics and examples, and explore some of the areas you might be unaware of where ML is having a significant impact.
Machine Learning — The Basics
“Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data.” (Wikipedia)
The goal of ML is pretty simple — teach computer systems to perform a task. The computer system gains experience by observing patterns from examples rather than being programmed with explicit instructions or rules.
There are two main types of machine learning: supervised and unsupervised. In supervised ML, you give the computer a set of examples with corresponding labels or answers. This is the training data. From these, it learns patterns that let it predict the answers for new examples it has not seen before.
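Here is a minimal sketch of the supervised idea in Python. It uses a nearest-centroid classifier, which is just one simple way to learn from labeled examples; the feature names, labels, and numbers are made up for illustration.

```python
from collections import defaultdict
from math import dist


def train_centroids(examples):
    """Training step: average the feature vectors that share a label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical training data: (hours watched per week, share of comedies) -> label
training_data = [
    ((1.0, 0.9), "casual"),
    ((2.0, 0.8), "casual"),
    ((9.0, 0.2), "binge"),
    ((10.0, 0.1), "binge"),
]

centroids = train_centroids(training_data)
print(predict(centroids, (8.5, 0.3)))  # a new, unlabeled example
```

The labels in `training_data` are the "answers." The model never sees a rule such as "more than five hours means binge"; it infers the boundary from the examples.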
In unsupervised ML, you provide the data without answers or labels. The algorithm directs the computer to look for interesting structure or patterns. Unsupervised ML is very useful when you’re not sure what you’re after; it’s a discovery tool. The graphic below illustrates the distinction.
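For comparison, here is a small unsupervised sketch in Python. It uses k-means clustering, one common technique among many, and the points are invented for illustration. Note that there are no labels anywhere: the algorithm only groups points that sit close together. Starting from the first k points is a simplifying assumption, and real implementations choose starting points more carefully.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Group unlabeled points into k clusters (Lloyd's algorithm)."""
    centroids = list(points[:k])  # naive start: use the first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(column) / len(cluster) for column in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters


# Hypothetical unlabeled data: two loose groups of 2-D points
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
clusters = k_means(points, k=2)
for number, cluster in enumerate(clusters):
    print(number, cluster)
```

It is up to a person to decide what each discovered group means, which is why unsupervised learning works well as a discovery tool.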
Machine learning is math, and its algorithms get much of the credit for how well it performs. But a machine learning model is only as good as the data used to train it. Adding more good training data often helps more than tweaking the algorithms.
What follows are three examples of machine learning in everyday life.
It’s possible to “beat Shazam.” But it’s not easy. The first team to do it spent three months cramming. Shazam never breaks a sweat.
There’s an entire blog devoted to the backend of Shazam. It’s a deep dive into the intricacies and complexities of machine learning with posts like “Optimizing the Shazam backend structure via Genetic Algorithms.”
The genetic algorithm is a method for solving constrained and unconstrained optimization problems that is based on natural selection, the process that drives biological evolution. The genetic algorithm repeatedly modifies a population of individual solutions.
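To make that concrete, here is a minimal Python sketch of a genetic algorithm. It is a toy and not Shazam's implementation: the objective (count the 1s in a bit string), the population size, and the mutation rate are all assumed values chosen for illustration.

```python
import random


def fitness(bits):
    """Toy objective: count the 1s (the classic "OneMax" problem)."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    population = [
        [rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Elitism: carry the best solution forward unchanged
        children = [parents[0][:]]
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)  # single-point crossover
            child = mum[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

Each pass through the loop is one "generation": the population of candidate solutions is modified by selection, crossover, and mutation. Because the best candidate is always carried forward, the best score can never get worse from one generation to the next.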
Shazam also uses ML to analyze the social media data it gathers to track user sentiment and assess product performance — such as when the app crashes.
What Do You Want to Watch Tonight?
If you watch Netflix, machine learning is serving up recommendations every time you log in. When Netflix launched, it had subscribers fill out surveys about their movie likes and dislikes. It turns out, though, that a lot of us lie. We say we like foreign films and documentaries. But our viewing behavior tells a different story.
Once Netflix launched its streaming service, it had a ton of data it could use to train machine learning algorithms for its recommendation engine. Netflix knows what you watched — or abandoned — what you searched for, how you rated it, plus the time and date you watched, and the device used.
At one time, Netflix employed as many as 800 engineers in the group responsible for creating and tuning the algorithms that now drive its recommendation engine.
Math Versus Malware
Almost everyone has some sort of malware and virus protection on their computer. If not, they usually regret it.
The Symantec Security Report from 2015 noted that there were 431 million unique forms of malware. That was up from 300 million the year before, which in turn was up 100 million from the year before that. Added together (roughly 200 + 300 + 431 = 931 million), that’s about one billion pieces of malware in three years. What about now? Safe to say, it’s more.
Traditional AV/malware protection is signature based. A signature needs to be written for each piece of malware, so a signature scanning engine can compare a threat to the signature database. Imagine a DAT database of one billion signatures. Scanning against it would impose an enormous performance hit on an organization’s network. That’s one reason the major AV vendors don’t have DAT files that large. They’re more in the range of double-digit millions. But that leaves a lot of malware unaccounted for.
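A rough Python sketch shows why signature matching struggles to keep up. It treats a signature as a SHA-256 hash of the file, which is a simplification (commercial engines use richer signature formats), and the "malware" bytes are made up.

```python
import hashlib


def signature(data: bytes) -> str:
    """Simplified signature: a SHA-256 fingerprint of the exact bytes."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(data: bytes, signature_db: set) -> bool:
    """Flag a file only if its fingerprint is already in the database."""
    return signature(data) in signature_db


# Hypothetical sample that an analyst has already seen and catalogued
known_bad = b"pretend-malicious-payload"
signature_db = {signature(known_bad)}

# The same payload with one extra byte appended
variant = known_bad + b"\x00"

print(is_known_malware(known_bad, signature_db))  # True
print(is_known_malware(variant, signature_db))    # False
```

The variant slips through because its fingerprint is new, so every variation needs its own database entry. That is how signature databases grow so quickly.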
A company called Cylance takes a different approach. Instead of virus signatures, it uses a machine learning algorithm to identify potential threats. Cylance trains its algorithm with 250 million good files and 250 million bad ones. The algorithm learns to nab threats based on features.
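As a loose illustration of learning from features rather than signatures, here is a tiny Python sketch. It is not Cylance's algorithm, which the company does not describe here. It trains a "decision stump," one of the simplest possible learners, on two hypothetical features (byte entropy and a count of suspicious imports) with invented values, and it assumes that higher values point toward bad files.

```python
def train_stump(samples):
    """Pick the single feature and threshold that best split bad from good files."""
    best = None  # (training errors, feature index, threshold)
    n_features = len(samples[0][0])
    for j in range(n_features):
        values = sorted({features[j] for features, _ in samples})
        # Candidate thresholds: midpoints between neighbouring observed values
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            errors = sum(
                (features[j] > threshold) != is_bad for features, is_bad in samples
            )
            if best is None or errors < best[0]:
                best = (errors, j, threshold)
    return best[1], best[2]


def looks_malicious(stump, features):
    """Apply the learned rule to a file the model has never seen."""
    feature_index, threshold = stump
    return features[feature_index] > threshold


# Hypothetical labeled files: (byte entropy, suspicious imports) -> is it bad?
training_files = [
    ((4.1, 0), False), ((5.0, 1), False), ((4.6, 0), False), ((5.3, 2), False),
    ((7.2, 5), True), ((7.8, 1), True), ((6.9, 6), True), ((7.5, 4), True),
]

stump = train_stump(training_files)
print(stump)
print(looks_malicious(stump, (7.0, 3)))  # a file that was never in the training set
```

The point is that the rule is learned from labeled good and bad files, so it can flag a file it has never seen, as long as that file's features resemble known threats.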
Cylance prevented the Microsoft Word RTF (CVE-2014-1761) zero-day malware threat from executing before it was ever observed in the wild. Its software discovered and quarantined this threat in March 2014, even though it did not appear on malwr.com until April, and even then, was detected by only 4 of 51 antivirus engines.
From Esoteric Research to Commercial Application
A lot of the machine learning knowledge that powers Beat Shazam, Netflix, and Cylance started in academia and private research institutions. Eventually, the research trickles down to commercial applications.
It used to be that corporate R&D focused almost exclusively on making a company’s products or methodology better. But now more companies are investing in original research.
Companies are talking about and publishing their research. And there are many open-source tools — tools that companies like Google spent many person-hours building — that can be downloaded and used for free.
It’s an exciting time in technology research. It harkens back to the days of the original Bell Labs, whose researchers are credited with many scientific breakthroughs, including the development of radio astronomy, the transistor, the laser, and the UNIX operating system. Eight Nobel Prizes have been awarded for work completed at Bell Laboratories.