A Julia package for general, composable machine learning at scale.
Evaluate the performance of a model and tune its hyper-parameters:
using Pkg
# Create a new environment (optional):
Pkg.activate("mlj-env", shared=true)
# Install MLJ
Pkg.add("MLJ")
# Install RDatasets (needed for demo):
Pkg.add("RDatasets")
# Use these packages:
using MLJ, RDatasets
# identify models suitable for your data
julia> X, y = @load_iris
julia> models(matching(X, y))
54-element Vector
(name = AdaBoostClassifier, package_name = MLJScikitLearnInterface, ... )
(name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
(name = BaggingClassifier, package_name = MLJScikitLearnInterface, ... )
⋮
# filter by docstring content:
julia> models("pca")
(name = PCA, package_name = MultivariateStats, ... )
(name = PCADetector, package_name = OutlierDetectionPython, ... )
# retrieve docs without code import:
julia> doc("PCA", pkg="MultivariateStats")
A model registry stores detailed metadata for over 200 models, and documentation can be searched and retrieved without loading any model code.
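The registry can also be searched with an arbitrary predicate over its metadata. A sketch, continuing with the iris X and y loaded above:

models() do m
    matching(m, X, y) &&                         # models applicable to the data
        m.prediction_type == :probabilistic &&   # that predict distributions
        m.is_pure_julia                          # and are implemented in pure Julia
end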
# choose a model and define hyperparameter ranges for tuning:
model = XGBoostRegressor()
r1 = range(model, :max_depth, lower=3, upper=10)
r2 = range(model, :gamma, lower=0, upper=10, scale=:log)
# wrap model to create self-tuning version:
tuned_model = TunedModel(model, range=[r1, r2], resampling=CV(), measure=l2)
# this both optimises and retrains on all data:
mach = machine(tuned_model, X, y) |> fit!
# predict using optimised params:
predict(mach, Xnew)
# inspect optimisation outcomes:
report(mach).best_model
plot(mach)
For improved composability, and better data hygiene, many meta-algorithms are implemented as model wrappers. In this way a model wrapped in a tuning strategy, for example, becomes a "self-tuning" model, with all data resampling (e.g., cross-validation) managed under the hood.
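Because a self-tuning model is itself just a model, estimating its generalization error with evaluate automatically gives nested resampling, with tuning repeated inside each outer fold. A sketch, reusing tuned_model from above and assuming X and y are appropriate training data for the regressor:

evaluate(tuned_model, X, y,
         resampling=CV(nfolds=5),  # outer loop, for assessment
         measure=l2)               # tuning's own CV runs inside each fold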
# create and train a pipeline model:
pipe = OneHotEncoder() |> PCA(maxoutdim=3) |> DecisionTreeClassifier()
mach = machine(pipe, X, y) |> fit!
# get actual PCA reduction dimension:
report(mach).pca.outdim
# get the tree:
fitted_params(mach).decision_tree_classifier.tree
Conventional model pipelines are available out of the box. Hyper-parameters of different pipeline components can be tuned simultaneously, but only the necessary components are retrained in each pipeline evaluation. A pipeline's training report exposes the reports of its individual components, and the same holds for learned parameters.
# create pipeline:
julia> pipe = ContinuousEncoder() |> RidgeRegressor()
DeterministicPipeline(
continuous_encoder = ContinuousEncoder(
drop_last = false,
one_hot_ordered_factors = false),
ridge_regressor = RidgeRegressor(
lambda = 1.0,
fit_intercept = true,
penalize_intercept = false,
scale_penalty_with_samples = true,
solver = nothing),
cache = true)
# define one or more hyperparameter ranges:
julia> r = range(pipe, :(ridge_regressor.lambda), lower=0.001, upper=10.0)
# create self-tuning version of pipeline:
julia> tuned_model = TunedModel(pipe, range=r, resampling=CV(), measure=l2)
Creating pipelines, or wrapping models in meta-algorithms such as iteration control, creates nested hyper-parameters. Such parameters can be optimized like any others.
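A sketch, continuing the example above: fit the self-tuning pipeline and inspect the optimal value of the nested parameter (assumes training data X, y):

mach = machine(tuned_model, X, y) |> fit!
report(mach).best_model.ridge_regressor.lambda  # optimised nested value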
# wrap data in "source nodes":
X, y = source(X), source(y)
# a normal MLJ workflow, with `fit` calls omitted:
mach1 = machine(model1, X, y)
mach2 = machine(model2, X, y)
y1 = predict(mach1, X) # a callable "node"
y2 = predict(mach2, X)
yhat = 0.5*(y1 + y2)
# train all models with one call:
fit!(yhat, acceleration=CPUThreads())
# blended prediction for new data:
yhat(Xnew)
In principle, any MLJ workflow is readily transformed into a lazily executed learning network. For example, in the code block above, fit! triggers training of both models in parallel, and the last line returns the blended prediction for new data. Mutate a hyper-parameter of model1, call fit! again, and only model1 is retrained.
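For instance, a minimal sketch, assuming model1 is a model with a lambda hyper-parameter (as a ridge regressor has):

model1.lambda = 0.1
fit!(yhat)  # only the machine associated with model1 is retrained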
Learning networks can be exported as new stand-alone model types. Internally, MLJ's pipelines and stacks are implemented using learning networks, which demonstrates their flexibility.
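A sketch of such an export, following the network composite pattern in the MLJ manual; the type name Blend is illustrative, and the network reproduces the blended predictor above:

mutable struct Blend <: DeterministicNetworkComposite
    model1
    model2
end

import MLJBase
function MLJBase.prefit(::Blend, verbosity, X, y)
    Xs, ys = source(X), source(y)
    y1 = predict(machine(:model1, Xs, ys), Xs)  # symbols refer to struct fields
    y2 = predict(machine(:model2, Xs, ys), Xs)
    (; predict = 0.5*(y1 + y2))                 # the network's prediction node
end

Instances of Blend then behave like any other deterministic model, and can be wrapped, tuned, and evaluated in the usual way.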
# choose an iterative model:
model = EvoTreeRegressor() # with iteration parameter `nrounds`
# choose iteration controls:
losses = []
controls = [Step(1), Patience(5), WithLossDo(x->push!(losses,x))]
iterated_model = IteratedModel(
model;
controls,
measure=l2,
resampling=Holdout(),
retrain=true,
)
# train on internal holdout to find `nrounds` and retrain on all data:
mach = machine(iterated_model, X, y) |> fit!
# predict on new data:
predict(mach, Xnew)
MLJ provides a rich supply of iterative model "controls", such as early stopping criteria, snapshots, and callbacks for visualization. Any model with an iteration parameter can be wrapped in such controls, the iteration parameter becoming an additional learned parameter.
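For instance, a sketch of a richer control set; each control named here is part of MLJ's iteration API, but the combination is illustrative:

losses = []
controls = [
    Step(2),                            # train two iterations per control cycle
    NumberLimit(100),                   # cap the total number of control cycles
    Patience(5),                        # stop after 5 consecutive loss increases
    InvalidValue(),                     # stop if the loss becomes NaN or Inf
    Save(),                             # periodically snapshot the machine
    WithLossDo(x -> push!(losses, x)),  # record losses, e.g., for plotting
]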