Is anyone aware of good data anonymization software? Or perhaps a package for R that does data anonymization? Obviously not expecting uncrackable anonymization – just want to make it difficult.
Solved – Data anonymization software
Related Solutions
The pmml package for R (used by Rattle, which is mentioned in highBandWidth's answer) provides a fairly transparent look at how to turn a model into PMML output.
The pmml package reference manual gives an example that builds a linear model for the iris data set and then produces PMML from it:
> library("pmml")
> (iris.lm <- lm(Sepal.Length ~ ., data=iris))
> pmml(iris.lm)
This will produce the following PMML:
<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-3_2 http://www.dmg.org/v3-2/pmml-3-2.xsd">
<Header copyright="Copyright (c) 2011 user" description="Linear Regression Model">
<Extension name="user" value="user" extender="Rattle/PMML"/>
<Application name="Rattle/PMML" version="1.2.27"/>
<Timestamp>2011-08-27 23:17:42</Timestamp>
</Header>
<DataDictionary numberOfFields="5">
<DataField name="Sepal.Length" optype="continuous" dataType="double"/>
<DataField name="Sepal.Width" optype="continuous" dataType="double"/>
<DataField name="Petal.Length" optype="continuous" dataType="double"/>
<DataField name="Petal.Width" optype="continuous" dataType="double"/>
<DataField name="Species" optype="categorical" dataType="string">
<Value value="setosa"/>
<Value value="versicolor"/>
<Value value="virginica"/>
</DataField>
</DataDictionary>
<RegressionModel modelName="Linear_Regression_Model" functionName="regression" algorithmName="least squares" targetFieldName="Sepal.Length">
<MiningSchema>
<MiningField name="Sepal.Length" usageType="predicted"/>
<MiningField name="Sepal.Width" usageType="active"/>
<MiningField name="Petal.Length" usageType="active"/>
<MiningField name="Petal.Width" usageType="active"/>
<MiningField name="Species" usageType="active"/>
</MiningSchema>
<RegressionTable intercept="2.17126629215507">
<NumericPredictor name="Sepal.Width" exponent="1" coefficient="0.495888938388551"/>
<NumericPredictor name="Petal.Length" exponent="1" coefficient="0.829243912234806"/>
<NumericPredictor name="Petal.Width" exponent="1" coefficient="-0.315155173326474"/>
<CategoricalPredictor name="Species" value="setosa" coefficient="0"/>
<CategoricalPredictor name="Species" value="versicolor" coefficient="-0.72356195778073"/>
<CategoricalPredictor name="Species" value="virginica" coefficient="-1.02349781449083"/>
</RegressionTable>
</RegressionModel>
</PMML>
Source Code
The relevant source code for this linear model is in the pmml package's pmml.R and pmml.lm.R files. As with any PMML producer, it reads the model parameters (here the model is in iris.lm) and then builds up the XML nodes from the model data.
The code in pmml.lm.R is fairly straightforward: it builds up the PMML node by node.
Below are some of the queries on the model object that are used (indirectly) in pmml.lm.R:
> terms <- attributes(iris.lm$terms)
> terms$dataClasses
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
> iris.lm$xlevels
$Species
[1] "setosa" "versicolor" "virginica"
> iris.lm$coefficients
      (Intercept)       Sepal.Width      Petal.Length       Petal.Width Speciesversicolor  Speciesvirginica
        2.1712663         0.4958889         0.8292439        -0.3151552        -0.7235620        -1.0234978
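To make the "read the model, then build the nodes" idea concrete, here is a minimal base-R sketch that assembles a RegressionTable fragment from the same queries shown above. It is not the code from pmml.lm.R: the helpers xml_element and regression_table are hypothetical names invented for this illustration, the XML is built as plain strings rather than with an XML library, and the sketch assumes a model with an intercept, no interaction terms, and no NA coefficients.

```r
# Fit the same model as in the example above.
iris.lm <- lm(Sepal.Length ~ ., data = iris)

# Hypothetical helper (not part of the pmml package): format one empty XML
# element whose attributes are given as a named character vector.
xml_element <- function(tag, attrs) {
  attr_text <- paste(sprintf('%s="%s"', names(attrs), attrs), collapse = " ")
  sprintf("<%s %s/>", tag, attr_text)
}

# Hypothetical helper: build the lines of a <RegressionTable> fragment from an
# lm object. Assumes an intercept, no interactions, no NA coefficients.
regression_table <- function(model) {
  coefs   <- coef(model)
  classes <- attr(model$terms, "dataClasses")
  lines   <- character(0)

  # Numeric predictors: the coefficient name equals the variable name.
  # intersect() also drops the response, which has no coefficient.
  numeric_vars <- intersect(names(classes)[classes == "numeric"], names(coefs))
  for (v in numeric_vars) {
    lines <- c(lines, xml_element("NumericPredictor",
      c(name = v, exponent = "1",
        coefficient = format(coefs[[v]], digits = 15))))
  }

  # Factor predictors: one element per level. R names the coefficient
  # <variable><level>; the reference level has none, so it gets 0.
  for (v in names(model$xlevels)) {
    for (lev in model$xlevels[[v]]) {
      key   <- paste0(v, lev)
      value <- if (key %in% names(coefs)) coefs[[key]] else 0
      lines <- c(lines, xml_element("CategoricalPredictor",
        c(name = v, value = lev,
          coefficient = format(value, digits = 15))))
    }
  }

  c(sprintf('<RegressionTable intercept="%s">',
            format(coefs[["(Intercept)"]], digits = 15)),
    paste0("  ", lines),
    "</RegressionTable>")
}

cat(regression_table(iris.lm), sep = "\n")
```

The printed fragment should have the same shape as the RegressionTable element in the PMML listing above: three NumericPredictor entries followed by one CategoricalPredictor per Species level, with the reference level setosa given a coefficient of 0.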
Best Answer
The Cornell Anonymization Toolkit is open source. Their research page has links to associated publications.