Python sgd-regression

This project wraps scikit-learn estimators that provide the partial_fit method so that they can be trained incrementally across several nodes (distributed learning).

Implemented methods:

linear_model - calls SGDRegressor or SGDClassifier
neural_network - calls MLPRegressor or MLPClassifier
naive_bayes - calls MixedNB (mix of GaussianNB and MultinomialNB), only works for classification tasks
gradient_boosting - calls GradientBoostingRegressor or GradientBoostingClassifier, does not support distributed training
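The mapping from method name to scikit-learn class listed above could be sketched as follows. This is only an illustration: the `ESTIMATORS` table and the `make_estimator` and `supports_distributed` helpers are hypothetical names, not the project's code, and naive_bayes is left out because MixedNB is not part of scikit-learn.

```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import SGDClassifier, SGDRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Hypothetical lookup table: (method name, task) -> scikit-learn class.
ESTIMATORS = {
    ("linear_model", "regression"): SGDRegressor,
    ("linear_model", "classification"): SGDClassifier,
    ("neural_network", "regression"): MLPRegressor,
    ("neural_network", "classification"): MLPClassifier,
    ("gradient_boosting", "regression"): GradientBoostingRegressor,
    ("gradient_boosting", "classification"): GradientBoostingClassifier,
}


def make_estimator(method, task, **params):
    """Instantiate the estimator for a method name and task type."""
    try:
        cls = ESTIMATORS[(method, task)]
    except KeyError:
        raise ValueError(f"unsupported combination: {method!r}, {task!r}") from None
    return cls(**params)


def supports_distributed(estimator):
    """Incremental (and hence distributed) training needs partial_fit."""
    return hasattr(estimator, "partial_fit")
```

The gradient boosting classes do not provide partial_fit, which is consistent with the note that gradient_boosting does not support distributed training.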

Usage

The image runs in one of two modes, intermediate or aggregate, selected with --mode:

docker run --rm --env [list of environment variables] hbpmip/python-sgd-regression:VERSION compute --mode intermediate --job-id 12

The intermediate mode calls partial_fit on the scikit-learn estimator and saves the intermediate result into the job_results table. If --job-id is specified, the previously saved estimator is loaded first and its training continues. If not, training starts from scratch.
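The load, continue, save cycle of the intermediate mode can be sketched in plain scikit-learn. This is a minimal illustration under assumptions: `train_step` is a hypothetical helper, and pickled bytes stand in for whatever the project actually stores in the job_results table.

```python
import pickle

import numpy as np
from sklearn.linear_model import SGDRegressor


def train_step(X, y, previous_state=None):
    """One intermediate step: resume from a saved estimator or start fresh."""
    if previous_state is None:
        estimator = SGDRegressor(random_state=0)   # no previous job: from scratch
    else:
        estimator = pickle.loads(previous_state)   # previous job: continue training
    estimator.partial_fit(X, y)
    return pickle.dumps(estimator)                 # stand-in for the saved result


rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
X_a, X_b = rng.normal(size=(50, 3)), rng.normal(size=(50, 3))

state = train_step(X_a, X_a @ w)                         # first node
state = train_step(X_b, X_b @ w, previous_state=state)   # second node resumes
model = pickle.loads(state)
```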

docker run --rm --env [list of environment variables] hbpmip/python-sgd-regression:VERSION compute --mode aggregate --job-id 13

The aggregate mode does the same and, in addition, converts the estimator into PFA (Portable Format for Analytics). If you have only one node, calling naive_bayes with compute --mode aggregate is equivalent to running Naive Bayes in a non-distributed way.

Environment variables are:

NODE: name of the node (machine) used for execution
JOB_ID: ID of the job.
IN_JDBC_DRIVER: org.postgresql.Driver
IN_JDBC_URL: URL to the input database, e.g. jdbc:postgresql://db:5432/features
IN_JDBC_USER: User for the input database
IN_JDBC_PASSWORD: Password for the input database
OUT_JDBC_DRIVER: org.postgresql.Driver
OUT_JDBC_URL: URL to the output database, e.g. jdbc:postgresql://db:5432/woken
OUT_JDBC_USER: User for the output database
OUT_JDBC_PASSWORD: Password for the output database
PARAM_variables: Name of the target variable (only one target variable is supported)
PARAM_covariables: List of covariables
PARAM_query: Query selecting the variables and covariables to feed into the algorithm for training.
MODEL_PARAM_type: Type of model to use: linear_model, neural_network or naive_bayes
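As a convenience, the docker run command line can be assembled from a dictionary of these variables. The sketch below only builds the argument list and does not run anything. `docker_command` is a hypothetical helper, and the values shown for NODE and PARAM_variables are placeholders.

```python
def docker_command(env, mode, job_id=None, version="VERSION"):
    """Build the argv for `docker run` from a dict of environment variables."""
    argv = ["docker", "run", "--rm"]
    for name, value in env.items():
        argv += ["--env", f"{name}={value}"]
    argv += [f"hbpmip/python-sgd-regression:{version}", "compute", "--mode", mode]
    if job_id is not None:
        argv += ["--job-id", str(job_id)]
    return argv


env = {
    "NODE": "node1",                                      # placeholder node name
    "IN_JDBC_DRIVER": "org.postgresql.Driver",
    "IN_JDBC_URL": "jdbc:postgresql://db:5432/features",
    "OUT_JDBC_DRIVER": "org.postgresql.Driver",
    "OUT_JDBC_URL": "jdbc:postgresql://db:5432/woken",
    "PARAM_variables": "target",                          # placeholder variable name
    "MODEL_PARAM_type": "linear_model",
}
command = docker_command(env, mode="intermediate", job_id=12)
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where Docker and the image are available.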

Model parameters

MODEL_PARAM_type specifies the type of model to use: linear_model, neural_network or naive_bayes. Use additional MODEL_PARAM_[sklearn_parameter] environment variables to set scikit-learn model parameters (e.g. MODEL_PARAM_alpha for Naive Bayes or MODEL_PARAM_learning_rate for SGDRegressor).
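One way to turn MODEL_PARAM_* variables into estimator keyword arguments is sketched below. It is an assumption about how such parsing could work, not the project's actual code. `model_params` is a hypothetical helper, and values are parsed with `ast.literal_eval` so that "0.5" becomes a float while plain words stay strings.

```python
import ast
import os

PREFIX = "MODEL_PARAM_"


def model_params(environ=None):
    """Collect MODEL_PARAM_* variables into a dict of parsed values."""
    environ = os.environ if environ is None else environ
    params = {}
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue
        key = name[len(PREFIX):]
        try:
            params[key] = ast.literal_eval(raw)   # numbers, booleans, tuples, ...
        except (ValueError, SyntaxError):
            params[key] = raw                     # keep plain strings as they are
    return params


env = {"MODEL_PARAM_type": "naive_bayes", "MODEL_PARAM_alpha": "0.5", "NODE": "node1"}
params = model_params(env)
model_type = params.pop("type")   # the remaining entries are scikit-learn parameters
```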

Convergence

Naive Bayes

For Naive Bayes it is enough to go over all data points once, i.e. to call --mode intermediate once on every node.
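The reason a single pass suffices is that Naive Bayes only accumulates per-class statistics. The sketch below illustrates this with scikit-learn's GaussianNB as a stand-in for the project's MixedNB (an assumption made to keep the example self-contained). The four-way split simulates four nodes.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Non-distributed reference: fit on all data at once.
full = GaussianNB().fit(X, y)

# Simulated distributed run: one partial_fit call per node, a single pass in total.
distributed = GaussianNB()
for X_node, y_node in zip(np.array_split(X, 4), np.array_split(y, 4)):
    distributed.partial_fit(X_node, y_node, classes=np.array([0, 1]))

# Per-class feature means agree up to floating-point error.
same_means = np.allclose(full.theta_, distributed.theta_)
# Fraction of training points on which both models predict the same class.
agreement = np.mean(full.predict(X) == distributed.predict(X))
```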

SGDRegressor, SGDClassifier, MLPRegressor and MLPClassifier

These methods are trained using Stochastic Gradient Descent and require several passes over the training data, in random order, until convergence.
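The repeated passes can be sketched as a loop that visits every node several times in shuffled order. The three-way split, the 50 epochs and the noise-free synthetic data are assumptions chosen for illustration. In practice the stopping rule would depend on the data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

# Three simulated nodes, each holding a slice of the data.
nodes = list(zip(np.array_split(X, 3), np.array_split(y, 3)))

model = SGDRegressor(random_state=0)
for epoch in range(50):                       # several passes over all data
    for i in rng.permutation(len(nodes)):     # visit the nodes in random order
        X_node, y_node = nodes[i]
        order = rng.permutation(len(X_node))  # shuffle the rows within a node
        model.partial_fit(X_node[order], y_node[order])

# Largest absolute deviation of the learned coefficients from the true ones.
max_error = np.abs(model.coef_ - true_w).max()
```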

GradientBoostingRegressor, GradientBoostingClassifier

These methods do not support distributed training. Calling them once on a single node is enough.

Build (for contributors)

Run: ./build.sh

Integration Test (for contributors)

Run: captain test

Publish (for contributors)

Run: ./publish.sh

Unit tests (for contributors)

Run: ./build.sh

WARNING: unit tests can fail nondeterministically with "AttributeError: can't set attribute" because of a bug in the Titus port to Python 3.

Run integration tests:

  cd tests
  ./test.sh
