Today’s Challenges
In the next three years, it is estimated that every person online will produce an average of 1.7 megabytes of data per second, for an expected total of 44 trillion gigabytes of data. This explosion is due to our ability to capture data better than ever before. The quantity and speed of data are constantly increasing, along with the new types of data now available for analysis, such as unstructured text, images, video, audio and IoT data. If we want to make complex historical or predictive decisions on this type of data, we need the right infrastructure, methodologies and processes in place. Without these, many organizations will forgo exploiting this information, because it simply cannot be done within a reasonable time or with current resources.
Traditionally, statisticians would spend weeks to months refining a single predictive model, which would then be used until it became stale because the underlying data, assumptions and other factors changed. Such a model provides the best lift a human can achieve through data wrangling, trial and error, and comparison of multiple candidate modeling techniques.
Techniques have evolved over time to speed up this process, chiefly by moving data into memory and training models in parallel. The first is Machine Learning, which allows multiple models built with various statistical methodologies to be trained in parallel; this is well suited to concurrent model training against traditional column- and row-oriented data. Deep Learning uses multi-layered neural networks to approach these problems in a more human way of thinking. While these techniques have been around for many years, they were traditionally applied to samples of column- and row-oriented data on disk, because the equipment could not handle much more, and the processes could run for hours, days, weeks or months. The advent of unstructured data as part of the model input places far greater demand on system resources, so Deep Learning is evolving to use this type of data at record processing speeds.
Machine Learning Solutions Explained
Machine learning began with pattern recognition and the idea that computers could perform tasks that were not explicitly programmed. New data causes the machine to adapt its response.
Machine learning results in decision-making and prediction. As the model becomes more refined, it becomes more accurate.
If machine learning is done correctly, the resulting model generalizes to new situations as well as to the instances it was trained on. As the input data expands, the program adapts by utilizing the additional data.
Advantages of machine learning include the ability to identify features for inclusion in the algorithms. Parameters can range from a few to thousands, and each can be assessed and optimized, as sketched below. Continuous quality improvement is therefore possible.
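As a minimal illustration of how such parameters can be assessed and optimized, the sketch below runs a cross-validated grid search and then checks the chosen model against held-out data. The library (scikit-learn), the synthetic data and the parameter values are our own assumptions, not part of the original discussion.

```python
# Illustrative sketch: assessing and optimizing model parameters
# with a cross-validated grid search (scikit-learn assumed available).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic column-and-row style data standing in for real business data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate parameter values; real projects may search far larger grids.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)  # parallel across cores
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", search.best_estimator_.score(X_test, y_test))
```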
The obvious disadvantage is that machine learning can fail for some instances. Not every combination of parameters will produce a ‘correct’ result due to unforeseen dependencies. Feature values may be determined too subjectively or may have been intentionally omitted by the Data Scientist. Data with less clear-cut valuations may have been left out entirely resulting in a less valid model. Results are subject to human interpretation and may be incorrect.
Commonly developed machine learning applications include churn prediction, sentiment analysis, recommendation engines (such as music selection or Google's website recommendations), marketing of new financial products, and micro-segmented retail offers.
Multi-core processing allows for faster results. The key capabilities are pattern recognition, anomaly detection, and prediction, with the models refining themselves in response to external stimuli, without any direct programmer intervention.
Deep Learning Solutions Explained
Machine learning means that the computer learns without specific programming. Deep Learning, in contrast, means the computer learns by simulating a neural net of decisions, emulating the way humans think, which can produce more complex results. Learning algorithms automatically tune the weights and biases of a network of artificial neurons. In the neuron analogy, each neuron combines its input values to produce a decision. In the most basic example, the inputs and the output are binary: a value of yes or no.
However, this binary approach lacks subtlety, and a small change in a weight can flip the outcome entirely. If the inputs and outputs instead take values between 0 and 1, varying the weights affects the outcome more gradually. As complexity develops, the neurons that are neither inputs nor outputs are referred to as 'hidden' layers, which allow extremely complex transformations of the data; a minimal sketch of such a forward pass follows.
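The sketch below shows one forward pass through a tiny network with a single hidden layer, using sigmoid activations so that every value lies between 0 and 1. The weights, biases and inputs are hand-picked for illustration only; in practice a learning algorithm tunes them.

```python
# Illustrative sketch: a forward pass through a tiny network with one
# hidden layer, producing a decision value between 0 and 1 (NumPy assumed).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.8, 0.2, 0.5])            # input values between 0 and 1
W_hidden = np.array([[0.4, -0.6, 0.9],   # weights into two hidden neurons
                     [-0.3, 0.8, 0.1]])
b_hidden = np.array([0.1, -0.2])         # hidden-layer biases
W_out = np.array([1.2, -0.7])            # weights into one output neuron
b_out = 0.05                             # output bias

hidden = sigmoid(W_hidden @ x + b_hidden)  # 'hidden' layer activations
output = sigmoid(W_out @ hidden + b_out)   # final decision between 0 and 1
print(output)                              # e.g. treat > 0.5 as 'yes'
```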
Forward propagation computes the network's output, and the output error is measured against the expected result. Back propagation then works backward from that error to calculate the gradient of the error with respect to each weight and bias, which is used to adjust them; a simple worked sketch follows.
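As a minimal worked sketch of these steps, the code below trains a single sigmoid neuron by repeatedly running forward propagation, measuring the squared output error, back propagating to obtain the gradient, and updating the weights. The inputs, target, learning rate and step count are our own illustrative assumptions.

```python
# Illustrative sketch: forward propagation, output error, back propagation
# and gradient descent for a single sigmoid neuron (NumPy assumed).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.8, 0.2, 0.5])    # inputs
t = 1.0                          # desired ('target') output
w = np.array([0.1, -0.1, 0.2])   # initial weights
b = 0.0                          # initial bias
lr = 0.5                         # learning rate

for step in range(100):
    y = sigmoid(w @ x + b)               # forward propagation
    error = 0.5 * (y - t) ** 2           # squared output error
    grad = (y - t) * y * (1.0 - y)       # back propagation through the sigmoid
    w -= lr * grad * x                   # gradient descent weight update
    b -= lr * grad

print("final error:", error)             # error shrinks as the weights adapt
print("final output:", sigmoid(w @ x + b))
```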
If the model is not successful, there are many ways to improve it: add more input data, augment or rearrange existing data to create more training instances, remove non-influencing feature variables, or try different algorithms, alone or in combination (one such step is sketched below).
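As one hedged illustration of removing non-influencing feature variables, the sketch below keeps only the features whose model-derived importance is above the median. The library (scikit-learn), the synthetic data and the threshold are our own assumptions.

```python
# Illustrative sketch: dropping non-influencing feature variables based
# on model-derived feature importances (scikit-learn assumed available).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data in which only a few of the 25 features are informative.
X, y = make_classification(n_samples=1000, n_features=25,
                           n_informative=5, random_state=0)

selector = SelectFromModel(RandomForestClassifier(random_state=0),
                           threshold="median")  # keep above-median features
X_reduced = selector.fit_transform(X, y)

print("Original features:", X.shape[1])
print("Retained features:", X_reduced.shape[1])
```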
Since 2008, computing technology has become more affordable, allowing organizations to build scale-out High Performance Computing environments. These environments make Deep Learning model training possible, but the architectures had to evolve to bring training time down from weeks to hours. The advantages of Deep Learning also include the elimination of (human) feature engineering, which is very time-consuming.
However, Deep Learning requires a huge amount of data and is computationally expensive to train. Early iterations of a model often lack accuracy, so a period of refinement is required, and this training time can cause significant delay, reducing the effectiveness of the results. There is no single recommended algorithm for deep learning applications. The final derived model can be complex (with hidden layers) and difficult to understand or explain. The mathematics involved is challenging as well, but the benefits can be great.
IBM’s Answer to Machine and Deep Learning
While the concepts of Machine Learning and Deep Learning have been around for many years, the computing architecture to support these methodologies has lagged. Many corporations have invested in inexpensive x86 hardware because it can scale out into clusters of many, sometimes hundreds, of servers. This provided server farms with a great many cores at a low acquisition price. However, in Destiny Corporation's research across hundreds of our clients, we found several developments in the IT industry; we will mention three for the purpose of this discussion. First, software vendors love this architecture because it let them implement core-based pricing: more cores mean a higher price for the software license. Second, data center power consumption grows with the size of the server farm. Third, more servers require more IT staff to support these massive environments.
IBM looked at the challenge in a different way, with certain objectives in mind. First, they went after the Linux x86 market by offering Linux on Power. The Power chip offers far more capability in its footprint due to its design. IBM also opened the Power design to the open community, allowing others to build software and capabilities that take advantage of this chipset, and built accelerators around the architecture to speed up processing and scale further in smaller server footprints than are available in an x86 environment. IBM can also scale out with its Linux on Power offering: the LC class servers compete with the most robust x86 offerings, with IBM guaranteeing twice the performance or half the price of x86 infrastructure. This architecture allows for overall IT cost reduction. Google decided they needed the performance of the Power chip and are building their own systems on the Power architecture to handle today's workloads.
Having covered the background on this architecture, IBM took it several steps further. They saw the need for a step change in High Performance Computing to support Machine Learning and Deep Learning model training. NVidia has been creating GPUs capable of the fast parallel calculations behind today's rich, popular video game displays, and IBM collaborated with NVidia to bring this technology into the Power design. IBM created a custom-built server with the POWER8 chip, CAPI (Coherent Accelerator Processor Interface, which connects a custom acceleration engine to the POWER8 chip) and other design advances. This server class is code-named "Minsky" and uses NVidia's NVLink interconnect bus (communication system). Late in 2017, IBM will release the POWER9 version of this architecture, which includes further design improvements and accelerators.
IBM saw the open source Deep Learning frameworks evolving and decided to bundle them into this Minsky-style architecture. They include Caffe, Torch, TensorFlow, Theano, Chainer and other packages, precompiled. IBM has also built onto these frameworks, making it simpler to work with Machine and Deep Learning. AI Vision is an end-to-end deep learning development environment for computer vision (image/video), which includes data labeling, preprocessing, training and inference, supporting high productivity and accuracy for AI solution developers. DL Insight automatically tunes hyper-parameters (from a prior data distribution) for models from input data sets. IBM Spectrum Conductor launches multiple training processes with different hyper-parameters and analyzes the model effectiveness. The Power Inference Engine enables model logic to be processed by FPGAs. The Data Science Experience utilizes a local version of IBM Watson as a front end to run models and manage analytics. IBM terms the project 'PowerAI'. To see a brief demonstration of this in action for an Advanced Driver Assist System for use in automobiles, see the following link:
https://ibm.ent.box.com/s/m7ooeoi738rs7dq9l9v0i9iir79t4xmd
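To give a concrete sense of the kind of work these bundled frameworks perform, the sketch below trains the same small network under several hyper-parameter settings, the sort of sweep that DL Insight and IBM Spectrum Conductor automate and distribute across GPUs. It uses only the open source TensorFlow/Keras API; the data set, network shape and parameter values are our own illustrative assumptions, not IBM's.

```python
# Illustrative sketch: a small manual hyper-parameter sweep over the same
# network, trained and evaluated once per setting (TensorFlow/Keras assumed).
import tensorflow as tf

# Public handwritten-digit data as a stand-in for real image data.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def build_model(learning_rate, hidden_units):
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),   # hidden layer
        tf.keras.layers.Dense(10, activation="softmax"),          # 10 classes
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Each run trains independently and could be dispatched to a separate GPU.
for lr in (0.01, 0.001):
    for units in (64, 128):
        model = build_model(lr, units)
        model.fit(x_train, y_train, epochs=1, verbose=0)
        _, acc = model.evaluate(x_test, y_test, verbose=0)
        print(f"lr={lr}, hidden_units={units}, test accuracy={acc:.3f}")
```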
Example Use Case
The National Center for Supercomputing Applications operates the Blue Waters supercomputer, with 32 processors on each of 22,400 servers. It was used to run a billion-cell model of an oil reservoir, with the goal of learning enough to extract oil from the ground efficiently. This platform gave Exxon results thousands of times faster than conventional reservoir simulations.
However, in six weeks a similar simulation was developed on extraordinarily smaller resources: thirty IBM servers with 60 IBM processors and 120 NVidia graphics processors. It consumed one tenth of the power and occupied about 25 square feet, as opposed to tens of thousands of square feet for the x86-based supercomputer. The model took less than two hours to run on this architecture.
Summary
In this discussion, we identified various trends and requirements that industries and organizations are leveraging to improve their analytic capabilities. These can be applied to business scenarios such as reducing customer churn: increasing the lift of a statistical model by just a few points can add millions of dollars to the bottom line. Researchers can now scan X-ray images as additional input data to training models that detect early stages of cancer in patients. The combination of these statistical methodologies with new styles of big data, captured on disk or streaming, and computing architecture that can handle the workload is truly a game changer in our world. Applying it takes us into the modern version of AI (artificial intelligence), where we can develop applications that think better and faster than humans do, as in the 'self-driving car' craze. New capabilities and advancements are coming very fast with this disruptive technology, which Destiny Corporation expects will affect not only businesses but the entire human race, all within the next few years. Fasten your seatbelts.