This is a Python implementation of scikit-learn estimators that use the partial_fit method for distributed learning.
Implemented methods:
- linear_model: calls SGDRegressor or SGDClassifier
- neural_network: calls MLPRegressor or MLPClassifier
- naive_bayes: calls MixedNB (a mix of GaussianNB and MultinomialNB); only works for classification tasks
- gradient_boosting: calls GradientBoostingRegressor or GradientBoostingClassifier; does not support distributed training
It has two modes:
- intermediate:
  docker run --rm --env [list of environment variables] hbpmip/python-sgd-regression:VERSION compute --mode intermediate --job-id 12
  This calls partial_fit of the scikit-learn estimator and saves the intermediate results into the job_results table. If --job-id is specified, it first loads the estimator and continues its training. If not, it starts from scratch.
- aggregate:
  docker run --rm --env [list of environment variables] hbpmip/python-sgd-regression:VERSION compute --mode aggregate --job-id 13
  In addition to the above, this mode converts the estimator into PFA. If you have only one node, calling naive_bayes with compute --mode aggregate is equivalent to running Naive Bayes in a non-distributed way.
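The load-or-start-fresh behaviour of the two modes can be sketched roughly as follows. This is a minimal illustration, not the project's code: the in-memory `job_results` dict, the JSON serialization, the toy `partial_fit` function and the `run_intermediate` helper are all assumptions made for the example. The real image stores a scikit-learn estimator in the job_results database table.

```python
import json

# Hypothetical in-memory stand-in for the job_results table: job ID -> serialized state.
job_results = {}


def partial_fit(state, rows):
    """Toy incremental update: track how many rows were seen and their running sum."""
    return {"n": state["n"] + len(rows), "total": state["total"] + sum(rows)}


def run_intermediate(rows, job_id, previous_job_id=None):
    """Update the estimator state on one node's rows and store it under job_id."""
    if previous_job_id is None:
        # No --job-id given: start from scratch.
        state = {"n": 0, "total": 0.0}
    else:
        # --job-id given: load the earlier state and continue its training.
        state = json.loads(job_results[previous_job_id])
    state = partial_fit(state, rows)
    job_results[job_id] = json.dumps(state)
    return state


# The first node starts from scratch, the second continues from the first result.
run_intermediate([1.0, 2.0, 3.0], job_id="12")
final = run_intermediate([4.0, 5.0], job_id="13", previous_job_id="12")
print(final["total"] / final["n"])  # mean over the rows of both nodes
```

The point of the sketch is only that each call needs nothing but the previously saved state and the local rows, which is what makes chaining jobs across nodes possible.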
Environment variables are:
- NODE: name of the node (machine) used for execution
- JOB_ID: ID of the job.
- IN_JDBC_DRIVER: org.postgresql.Driver
- IN_JDBC_URL: URL to the input database, e.g. jdbc:postgresql://db:5432/features
- IN_JDBC_USER: User for the input database
- IN_JDBC_PASSWORD: Password for the input database
- OUT_JDBC_DRIVER: org.postgresql.Driver
- OUT_JDBC_URL: URL to the output database, e.g. jdbc:postgresql://db:5432/woken
- OUT_JDBC_USER: User for the output database
- OUT_JDBC_PASSWORD: Password for the output database
- PARAM_variables: Name of the target variable (only one variable is supported)
- PARAM_covariables: List of covariables
- PARAM_query: Query selecting the variables and covariables to feed into the algorithm for training.
- MODEL_PARAM_type: Type of model to use, one of linear_model, neural_network or naive_bayes
Use additional MODEL_PARAM_[sklearn_parameter] variables to specify scikit-learn model parameters (e.g. MODEL_PARAM_alpha for Naive Bayes or MODEL_PARAM_learning_rate for SGDRegressor).
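One way to picture the MODEL_PARAM_ convention is a small helper that splits the environment into the model type and the keyword arguments for the estimator. The helper name `model_params_from_env` and the use of `ast.literal_eval` to convert values are assumptions for illustration; the actual parsing in the image may differ.

```python
import ast


def model_params_from_env(environ):
    """Collect MODEL_PARAM_* entries into (model type, keyword arguments)."""
    prefix = "MODEL_PARAM_"
    params = {}
    for key, raw in environ.items():
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        try:
            # Assumed conversion: "0.5" becomes 0.5, "True" becomes True.
            params[name] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Plain words such as "naive_bayes" stay strings.
            params[name] = raw
    model_type = params.pop("type", None)
    return model_type, params


example_env = {
    "NODE": "node1",
    "MODEL_PARAM_type": "naive_bayes",
    "MODEL_PARAM_alpha": "0.5",
}
print(model_params_from_env(example_env))
```

In a container the same function could be called with `os.environ` instead of the example dictionary.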
For Naive Bayes, it is enough to go over all data points once (call --mode intermediate on all nodes).
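A single pass suffices because the per-class statistics that Naive Bayes needs are simple sums that can be accumulated node by node. The sketch below is a simplified illustration with one numeric feature and hypothetical data; it is not how MixedNB or scikit-learn's GaussianNB is implemented internally.

```python
from collections import defaultdict


def update_stats(stats, rows):
    """Add one node's (label, value) rows to per-class count, sum and sum of squares."""
    for label, x in rows:
        s = stats[label]
        s[0] += 1
        s[1] += x
        s[2] += x * x
    return stats


def mean_and_variance(stats):
    """Turn the accumulated sums into the per-class mean and population variance."""
    out = {}
    for label, (n, total, total_sq) in stats.items():
        mean = total / n
        out[label] = (mean, total_sq / n - mean * mean)
    return out


# Hypothetical rows held by two nodes.
node_a = [("yes", 1.0), ("no", 4.0), ("yes", 3.0)]
node_b = [("no", 6.0), ("yes", 2.0)]

# One pass over each node, carrying the statistics from node to node.
distributed = defaultdict(lambda: [0, 0.0, 0.0])
update_stats(distributed, node_a)
update_stats(distributed, node_b)

# The same rows pooled on a single node.
pooled = update_stats(defaultdict(lambda: [0, 0.0, 0.0]), node_a + node_b)

print(mean_and_variance(distributed) == mean_and_variance(pooled))
```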
linear_model and neural_network are trained using stochastic gradient descent and require several passes over the training data in random order until convergence.
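The need for repeated passes can be seen in a small pure-Python sketch. It uses a hand-written SGD update for a one-feature linear model as a stand-in for SGDRegressor.partial_fit; the two nodes, their data, the learning rate and the number of passes are all hypothetical.

```python
import random


def sgd_pass(weights, rows, learning_rate=0.05):
    """One pass of SGD for y ~ w0 + w1 * x over one node's rows."""
    w0, w1 = weights
    for x, y in rows:
        error = (w0 + w1 * x) - y
        w0 -= learning_rate * error
        w1 -= learning_rate * error * x
    return w0, w1


rng = random.Random(0)

# Two hypothetical nodes holding noiseless samples of y = 1 + 2x.
nodes = [
    [(x, 1 + 2 * x) for x in (0.0, 0.5, 1.0)],
    [(x, 1 + 2 * x) for x in (1.5, 2.0)],
]

weights = (0.0, 0.0)
first_pass = None
for epoch in range(500):
    order = list(range(len(nodes)))
    rng.shuffle(order)  # visit the nodes in random order
    for i in order:
        rows = nodes[i][:]
        rng.shuffle(rows)  # and each node's rows in random order
        weights = sgd_pass(weights, rows)
    if epoch == 0:
        first_pass = weights  # state after a single pass over all nodes

print(first_pass, weights)
```

After one pass the weights are still far from the true values (1, 2); only after many passes do they settle, which is why a single round of --mode intermediate calls is not enough for these models.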
gradient_boosting does not support distributed training; calling it once on a single node is enough.
To build, run: ./build.sh
To test, run: captain test
To publish, run: ./publish.sh
Run unit tests: ./build.sh
WARNING: unit tests can fail nondeterministically with "AttributeError: can't set attribute" because of an error in the Titus port to Python 3.
Run integration tests:
cd tests
./test.sh