posted on Jun 5, 2017

Agile Data Science (Part 2): PMML

The following is a guest post by Verdi March.

In part one, we described various strategies to manage the transition between model development (i.e. training) and production (i.e. operation). One of these strategies is to adopt a portable, open format to describe models. Its strength is that it fully decouples the training and operation phases, in recognition that the two phases can have markedly different technology stacks, processes, etc.

In this post, we zoom in on that strategy by demonstrating a minimal implementation based on PMML, an open and portable model format.

Overview of PMML

To put it simply, PMML (Predictive Model Markup Language) is an XML format to describe statistical and machine learning models. The ecosystem is fairly mature, with both commercial and open source implementations. Proponents of open source will be especially pleased to find that PMML is supported in Apache Spark, R, RapidMiner, etc.

By adopting PMML, the whole data science workflow would look like the following diagram.

The key steps to highlight are produce and consume, which are what the rest of this post is about.

As our example, let us consider k-means clustering to classify Iris flowers. We have chosen this simple case on purpose, so that we can fully focus on the PMML aspect without getting distracted by the technicalities of model training or general software engineering.

We picked R as our development tool, then hand-crafted the PMML file that describes the trained k-means model. In practice, data science organizations should consider automating the production of PMML models. Fortunately, such PMML producers are readily available, e.g. the pmml library for R, r2pmml (also for R), PMML model export in Apache Spark MLlib, etc.

For the deployment, we chose Augustus, the reference implementation in Python. Augustus is both a scoring engine and a PMML toolkit. As a scoring engine, it loads and executes PMML models, whereas as a toolkit, it can be used to manipulate PMML documents on the fly. For this post, we will focus on the scoring engine.

PMML Production: Finale of Model Development

Assuming that your R environment is set up, the typical way to train k-means on the Iris dataset is as follows (note: > and # denote the R prompt and comments, respectively):

> library(datasets)
> irisCluster <- kmeans(iris[, 3:4], 3)

# Print the resulting model
> irisCluster
K-means clustering with 3 clusters of sizes 50, 52, 48

Cluster means:
  Petal.Length Petal.Width
1     1.462000    0.246000
2     4.269231    1.342308
3     5.595833    2.037500
# Rest of the output snipped...

Using the petal length and width as the features, we produce a model of three clusters. So, let’s jot down the interesting details of the model:

Cluster Id	Centre of Petal Length	Centre of Petal Width
1	1.462000	0.246000
2	4.269231	1.342308
3	5.595833	2.037500

Next, we attempt to assign a descriptive name to each cluster:

# Check how species names correlate with cluster ids
> table(iris$Species, irisCluster$cluster)
            
              1  2  3
  setosa     50  0  0
  versicolor  0 48  2
  virginica   0  4 46

We notice that cluster 1 corresponds to setosa, cluster 2 to versicolor, and cluster 3 to virginica. Armed with this information, we can now update our table as follows:
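The naming step above is essentially a majority vote per cluster. As a minimal sketch in Python (it is not part of the original R session, and the helper name `name_clusters` is hypothetical), the contingency table can be turned into a cluster-to-species mapping like this:

```python
# Contingency table copied from the R output above:
# species -> number of flowers assigned to each cluster id.
contingency = {
    "setosa":     {1: 50, 2: 0,  3: 0},
    "versicolor": {1: 0,  2: 48, 3: 2},
    "virginica":  {1: 0,  2: 4,  3: 46},
}


def name_clusters(table):
    """Map each cluster id to the species contributing most of its members.

    Hypothetical helper for illustration only.
    """
    cluster_ids = sorted({cid for counts in table.values() for cid in counts})
    return {
        cid: max(table, key=lambda species: table[species].get(cid, 0))
        for cid in cluster_ids
    }


cluster_names = name_clusters(contingency)
print(cluster_names)
```

On less clean data, a majority vote may assign the same name to two clusters, so the resulting mapping is worth checking by eye before it goes into the PMML file.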

Cluster Id	Cluster Name	Centre of Petal Length	Centre of Petal Width
1	setosa	1.462000	0.246000
2	versicolor	4.269231	1.342308
3	virginica	5.595833	2.037500

With this table, we are now ready to produce our first ever PMML file (say, iris.pmml):

<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">

    <Header copyright="Copyright (c) 2017 verdi" description="KMeans cluster model for Iris dataset">
        <Timestamp>2017-06-03 19:15:38</Timestamp>
    </Header>
  
    <DataDictionary numberOfFields="2">
        <DataField name="petal_length" optype="continuous" dataType="double"/>
        <DataField name="petal_width" optype="continuous" dataType="double"/>
    </DataDictionary>
  
    <ClusteringModel modelName="iris_kmeans_model" functionName="clustering" algorithmName="Hartigan and Wong" modelClass="centerBased" numberOfClusters="3">
        <MiningSchema>
            <MiningField name="petal_length"/>
            <MiningField name="petal_width"/>
        </MiningSchema>

        <Output>
            <OutputField name="predictedSpecies" feature="predictedValue"/>
        </Output>

        <ComparisonMeasure kind="distance">
            <squaredEuclidean/>
        </ComparisonMeasure>

        <ClusteringField field="petal_length" compareFunction="absDiff"/>
        <ClusteringField field="petal_width" compareFunction="absDiff"/>

        <Cluster name="setosa" size="50" id="setosa">
            <Array n="2" type="real">1.462 0.246</Array>
        </Cluster>
        <Cluster name="versicolor" size="52" id="versicolor">
            <Array n="2" type="real">4.269231 1.342308</Array>
        </Cluster>
        <Cluster name="virginica" size="48" id="virginica">
            <Array n="2" type="real">5.595833 2.0375</Array>
        </Cluster>
    </ClusteringModel>
</PMML>
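Because PMML is plain XML, the file can be inspected with any XML library, not only with a scoring engine. The following is a minimal sketch that uses Python's standard `xml.etree.ElementTree` module to read the cluster centres back out of a trimmed, inlined copy of the document above. The helper name `read_clusters` is hypothetical, and the sketch assumes the simple layout shown here (one Array per Cluster, in the same order as the ClusteringField elements).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of iris.pmml, inlined so that the sketch is self-contained.
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
    <ClusteringModel modelName="iris_kmeans_model" functionName="clustering"
                     modelClass="centerBased" numberOfClusters="3">
        <ClusteringField field="petal_length" compareFunction="absDiff"/>
        <ClusteringField field="petal_width" compareFunction="absDiff"/>
        <Cluster name="setosa" size="50" id="setosa">
            <Array n="2" type="real">1.462 0.246</Array>
        </Cluster>
        <Cluster name="versicolor" size="52" id="versicolor">
            <Array n="2" type="real">4.269231 1.342308</Array>
        </Cluster>
        <Cluster name="virginica" size="48" id="virginica">
            <Array n="2" type="real">5.595833 2.0375</Array>
        </Cluster>
    </ClusteringModel>
</PMML>"""

# Every element lives in the PMML 4.1 namespace declared on the root element.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def read_clusters(pmml_text):
    """Return {cluster name: {field name: centre value}} (hypothetical helper)."""
    root = ET.fromstring(pmml_text)
    model = root.find("pmml:ClusteringModel", NS)
    fields = [f.get("field") for f in model.findall("pmml:ClusteringField", NS)]
    clusters = {}
    for cluster in model.findall("pmml:Cluster", NS):
        values = [float(v) for v in cluster.find("pmml:Array", NS).text.split()]
        clusters[cluster.get("name")] = dict(zip(fields, values))
    return clusters


clusters = read_clusters(PMML_TEXT)
for name in sorted(clusters):
    print(name, clusters[name])
```

A check of this kind can be handy as a sanity test on a hand-crafted file before it is handed over.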

As a side note for curious readers, R has a pmml library to automatically generate PMML (see Appendix A for a brief tutorial).

And that’s it! We have completed the model development phase, with a PMML file as the final artefact. Let us now hand over this PMML file to the production/operation/engineering/IT/… folks!

PMML Consumption: Basis for Model Production

First, install Augustus (see Appendix B). Then, executing the PMML model is as simple as the following script (say, iris.py):

#!/usr/bin/env python2

from augustus.strict import *

# Load the .pmml file and return an in-memory model
model = modelLoader.loadXml('iris.pmml')

# We are going to score two records:
#
# | record sequence | petal_length | petal_width |
# | --------------- | ------------ | ----------- |
# | 0               | 1.5          | 0.5         |
# | 1               | 4.0          | 1.2         |
#
# Augustus requires input data in columnar form. Hence, pay careful
# attention to how we pack the input data into a Python dictionary.
data = {'petal_length': [1.5, 4.0], 'petal_width': [0.5, 1.2]}

# Execute the model, i.e., feed the input data to the model to predict the
# cluster each data point belongs to.
result = model.calc(data)

# Show outputs
print 'Fields: '
result.fields.look(columnWidth=20)
print '\nOutput: '
result.output.look(columnWidth=20)

Running the script produces the following output:

$ python2 ./iris.py
Fields: 
# | petal_length         | petal_width          | iris_kmeans_model   
--+----------------------+----------------------+---------------------
0 | 1.5                  | 0.5                  | setosa              
1 | 4.0                  | 1.2                  | versicolor          

Output: 
# | predictedSpecies    
--+---------------------
0 | setosa              
1 | versicolor

Congratulations! You have successfully incorporated PMML in your data science workflow.

Notice that we did not have to write a single line of the k-means algorithm on the production side. One might have expected the bulk of our task to be loading and executing PMML. However, because the scoring engine already implements this capability, our script does each in a single line! Notice also how PMML prevents vendor lock-in: we can switch to another scoring engine and still not write a single line of the k-means algorithm. Nifty, isn’t it?
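For intuition about what the scoring engine does on our behalf, here is a rough sketch of nearest-centre assignment using the squared Euclidean distance declared in the ComparisonMeasure element. It is plain Python for illustration only; it is not how Augustus is implemented, and the names `CENTRES`, `squared_euclidean` and `predict` are hypothetical.

```python
# Cluster centres (petal_length, petal_width) copied from iris.pmml.
CENTRES = {
    "setosa": (1.462, 0.246),
    "versicolor": (4.269231, 1.342308),
    "virginica": (5.595833, 2.0375),
}


def squared_euclidean(a, b):
    """Sum of squared per-field differences between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(record, centres=CENTRES):
    """Return the name of the cluster whose centre is closest to the record."""
    return min(centres, key=lambda name: squared_euclidean(record, centres[name]))


# The same two records that iris.py scores.
records = [(1.5, 0.5), (4.0, 1.2)]
predictions = [predict(r) for r in records]
print(predictions)  # ['setosa', 'versicolor']
```

The predictions agree with the Augustus output above. The point of PMML is precisely that this logic stays inside the scoring engine rather than being rewritten for each deployment.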

With model scoring out of the way, the rest of the script handles I/O and interfacing with the model. A real application also needs features that are unfortunately perceived as “boring” or “mundane”, e.g. model versioning and provenance. In our experience, highly impactful data-driven applications treat these features as being as important as the model itself.
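One small example of such interfacing work: applications typically receive data record by record, whereas the script above passes data to the model in columnar form. A minimal sketch of the conversion, in plain Python, could look like this (the helper name `rows_to_columns` is hypothetical and is not part of Augustus):

```python
def rows_to_columns(rows, fields):
    """Turn a list of row dictionaries into a dictionary of column lists."""
    return {field: [row[field] for row in rows] for field in fields}


# The two records scored in iris.py, expressed row by row.
rows = [
    {"petal_length": 1.5, "petal_width": 0.5},
    {"petal_length": 4.0, "petal_width": 1.2},
]

data = rows_to_columns(rows, ["petal_length", "petal_width"])
print(data)
```

The resulting dictionary has the same shape as the `data` variable in iris.py. A row that lacks one of the fields raises a `KeyError`, which is probably preferable to silently scoring incomplete records.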

That’s it for the agile data science series. We hope you have gained some additional insight. Stay tuned for more data science/engineering topics from our team.

Appendix A: Automatic Conversion of R Models to PMML

Assuming that you have installed the pmml library, using it is as simple as:

# Autogenerate PMML
> library(pmml)

# Convert the irisCluster object to a PMML object
> irisPMML <- pmml.kmeans(irisCluster)

# Save the PMML object to a file
> saveXML(irisPMML, file="/home/user/irisKMeans.pmml")

Please be aware of potential incompatibilities with Augustus when using the pmml library: Augustus supports PMML up to specification 4.1, whereas newer versions of the pmml library produce models that target later versions of the specification.
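One way to catch such a mismatch early is to check the version attribute on the root PMML element before handing the file to the scoring engine. The following is a minimal sketch using Python's standard library; the helper names and the `MAX_SUPPORTED` constant are hypothetical, and the check looks only at the declared version, not at whether the document actually conforms to it.

```python
import xml.etree.ElementTree as ET

# Highest PMML specification version that Augustus supports, per the note above.
MAX_SUPPORTED = (4, 1)


def pmml_version(pmml_text):
    """Return the declared PMML version as a tuple of ints, e.g. (4, 1).

    Assumes the root element carries a dotted numeric version attribute.
    """
    root = ET.fromstring(pmml_text)
    return tuple(int(part) for part in root.get("version").split("."))


def is_supported(pmml_text, max_supported=MAX_SUPPORTED):
    """True if the declared version does not exceed the supported one."""
    return pmml_version(pmml_text) <= max_supported


# Two stub documents that differ only in their declared version.
old_doc = '<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1"/>'
new_doc = '<PMML version="4.3" xmlns="http://www.dmg.org/PMML-4_3"/>'

print(is_supported(old_doc))  # True
print(is_supported(new_doc))  # False
```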

Appendix B: Installation of Augustus-0.6 on Ubuntu

This example assumes a working directory of /home/user (note: $ and # denote Bash prompts and comments, respectively).

$ git clone https://github.com/opendatagroup/augustus.git
$ sudo apt install python-lxml python-numpy
$ export PYTHONPATH=/home/user/augustus

Read Part 1: Agile Data Science (Part 1)