JK–Actually, technology has always developed through a virtuous cycle in which knowledge, process, tools, and ecosystem (aka the toolchain) evolve together to realize human outcomes.

JK–In the analytic pipeline world, the knowledge (statistical modeling, data engineering, programming, domain knowledge, etc.), process (collaborative workflows and governance among specialized roles), tools (regression modeling, artificial neural networks, advanced visualization, etc.), and ecosystem (big data, cloud computing, open-source software, etc.) have been co-evolving since the dawn of computing.

JK–The evolutionary leap in analytics, data science, and ML over the past 10 years has come from the operationalization of these technologies in business, industry, government, consumer, and other infrastructures. That has upped the stakes, spurred accelerating innovation, drawn many more people into the field, made workflows far more distributed and virtual, and triggered rapid maturation of repeatable, industrialized DevOps pipelines for preparing, developing, and operationalizing these assets 24×7.

JK–DevOps pipeline maturation refers to standardizing, automating, and accelerating, and all three trends apply in full force to ML pipelines.

JK–ML pipelines, as a subset of analytic and data science pipelines in general, refer to the entire set of phases, functions, and tasks needed to prepare, develop, and operationalize ML assets in production applications.
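
As a rough illustration of the three broad phases named above (prepare, develop, operationalize), here is a minimal Python sketch. Everything in it is hypothetical: the toy data, the function names `prepare`, `accuracy`, and `develop`, the `Registry` class, and the 0.9 accuracy bar are assumptions made for illustration, not part of any tool or report cited in this discussion. It also scores the model on the same rows it was tuned on, which a real pipeline would avoid by holding out a validation set.

```python
from statistics import mean

# Hypothetical toy data: (hours_of_usage, churned) pairs; None marks a missing value.
RAW = [(1.0, 0), (2.0, 0), (None, 0), (8.0, 1), (9.0, 1), (10.0, 1)]


def prepare(rows):
    """Preparation: impute missing feature values with the mean of the observed ones."""
    observed = [x for x, _ in rows if x is not None]
    fill = mean(observed)
    return [(fill if x is None else x, y) for x, y in rows]


def accuracy(threshold, rows):
    """Share of rows where the rule 'x >= threshold' matches the label."""
    return sum((x >= threshold) == bool(y) for x, y in rows) / len(rows)


def develop(rows, candidate_thresholds):
    """Modeling: pick the threshold (a stand-in for a tuned hyperparameter)
    with the best accuracy. Simplification: no separate validation set."""
    best = max(candidate_thresholds, key=lambda t: accuracy(t, rows))
    return best, accuracy(best, rows)


class Registry:
    """Operationalization: record every model version and track which one is live."""

    def __init__(self, min_accuracy):
        self.min_accuracy = min_accuracy
        self.versions = []  # list of (model, score) tuples, in promotion order
        self.live = None    # version number currently in production, if any

    def promote(self, model, score):
        """Register a new version; make it live only if it is predictive enough."""
        self.versions.append((model, score))
        version = len(self.versions)
        if score >= self.min_accuracy:
            self.live = version
        return version


if __name__ == "__main__":
    rows = prepare(RAW)
    model, score = develop(rows, [3.0, 5.0, 7.0])
    registry = Registry(min_accuracy=0.9)
    version = registry.promote(model, score)
    print(f"threshold={model} accuracy={score:.2f} version={version} live={registry.live}")
```

The registry keeps the previous live version in place when a weaker candidate is promoted, which is one simple way to keep a sufficiently predictive model in production at all times.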

In terms of “standardizing,” there is general agreement among practitioners and industry that the ML pipeline consists of the following phases and processes (table taken from a recently published Wikibon report):

Preparation
* Data discovery: discover, acquire, and ingest the data needed to build and train machine learning models.
* Data exploration: build visualizations of relationships of interest within the source data.
* Data preprocessing: build the training data set by encoding categorical variables, imputing missing values, and executing other necessary programmatic data transformations, corrections, augmentations, and annotations; ensure that the data has been cleansed and otherwise prepared to speed the downstream modeling and training processes.

Modeling
* Feature engineering: build, explore, and select feature representations that describe the predictive variables to be included in the resulting machine learning models.
* Algorithm selection: identify the statistical algorithms best suited to the learning challenge, such as making predictions, inferring abstractions, and recognizing objects in the data.
* Model training: process the model against a training-data test or validation set to determine whether it performs a machine learning task (e.g., predicting some future event, classifying some entity, or detecting some anomalous incident) with sufficient accuracy.
* Model evaluation: generate learning curves, partial dependence plots, and other metrics that illustrate comparative performance in accuracy, efficiency, and other trade-offs among key machine learning metrics.
* Hyperparameter tuning: identify the optimal number of hidden layers, learning rate (adjustments made to backpropagated weights at each iteration), regularization (adjustments that help models avoid overfitting), and other hyperparameters necessary for top model performance.

Operationalization
* Model deployment: for models that have been promoted to
production status, generate customized REST APIs and Docker images around ML models during the promotion and deployment stages, and deploy the models for execution into private, public, or hybrid multi-cloud platforms.
* Model resource provisioning: for models that have been deployed, scale the provisioning of CPU, memory, storage, and other resources up or down based on changing application requirements, resource availability, and priorities.
* Model governance: for deployed models, keep track of which model version is currently deployed; ensure that a sufficiently predictive model is always in live production status; and retrain using fresh data prior to redeployment.

JK–As regards “automating,” as discussed in my most recent Wikibon research notes (https://wikibon.com/automated-machine-learning-assessing-available-solutions/ and https://wikibon.com/automated-machine-learning-accelerating-development-deployment-statistical-models/), every one of these ML pipeline phases is increasingly being automated, and the range of tools for doing so continues to grow, evolve in sophistication, and be adopted in production ML environments. For example, as I said in a recent SiliconANGLE article regarding deep learning (DL) pipelines, “As deep learning frameworks converge, automation possibilities unfold” (https://siliconangle.com/blog/2017/10/25/deep-learning-frameworks-converge-automation-possibilities-unfold/), automation is rapidly coming to every aspect of the DL and machine learning development cycle. Across the industry, we’re seeing mainstream DL development tools starting to incorporate these features:
* Automatically generate customized REST APIs and Docker images around DL models during the promotion and deployment stages;
* Automatically deploy DL models for execution into private, public, or hybrid multi-cloud platforms;
* Automatically scale DL models’ runtime resource consumption up or down based on changing application requirements;
* Automatically retrain DL models using fresh data prior to redeploying them;
* Automatically keep track of which DL model version is currently deployed; and
* Automatically ensure that a sufficiently predictive DL model is always in live production status.

JK–The “acceleration” side of it comes both from the standardization and automation, and from how the data science/ML pipeline has been reorganized in many organizations for maximum productivity. As I said in the Wikibon report “Wrapping DevOps Around the Data Science Pipeline” (https://wikibon.com/wrapping-devops-around-data-science-pipeline/), data science professionals are increasingly automating more of their tasks within continuous DevOps workflows. The table from that report outlines how industrialized practices may be implemented in the data-science pipeline:
* Role specialization: The key role specializations in an industrialized data-science pipeline consist of statistical modelers, data engineers, application developers, business analysts, and subject-domain specialists.
Within collaborative environments, data science team members pool their skills and specialties in the exploration, development, deployment, testing, and management of convolutional neural networks, machine learning models, programming code, and other pipeline artifacts.
* Workflow repeatability: The principal workflow patterns in the data-science pipeline are those that govern the creation and deployment of deep-learning models, statistical algorithms, and other repeatable data/analytics artifacts. The primary workflow patterns fall into such categories as data discovery, acquisition, ingestion, aggregation, transformation, cleansing, prototyping, exploration, modeling, governance, logging, auditing, archiving, and so on. In a typical enterprise data-science development practice, some of these patterns may be largely automated, while others may fall toward the manual, collaborative, and agile end of the spectrum. In the context of the data-science pipeline, repeatability requires moving away from data science’s traditional focus on one-time, ad hoc, largely manual development efforts that are divorced from subsequent operational monitoring and updating, and for which downstream changes require significant and disruptive model rebuilds.
* Tool-driven acceleration: The most important platform for tool-driven acceleration in data-science pipelines is the cloud. The chief cloud-based tools include Spark runtime engines for distributed, in-memory training of machine learning and other statistical algorithms; unified workbenches for fast, flexible sharing and collaboration within data science teams; and on-demand, self-service collaboration environments that provide each DevOps pipeline participant with tools and interfaces geared to their specific tasks.
Table: Industrialized Data-Science Pipeline Practices

JK–There has long been a general industry framework for “data mining” pipelines, called CRISP-DM. Here is how it is broken out; it maps very snugly to the expanded ML pipeline that I spelled out above. The following description is taken from the Wikipedia page https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining:
Business understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.
Data understanding: The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
Data preparation: The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
Modeling: In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
Evaluation: At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Deployment: Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model, it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.

RF: The key question for enterprises is finding and bearing down on problems where there is high, identifiable value for solving them. We’re coming out of a period of exploration by da