imputr.strategy.randomforest#

Module Contents#

Classes#

RandomForestStrategy

Strategy implementation for RandomForest-based imputation.

class imputr.strategy.randomforest.RandomForestStrategy(target_column: imputr.domain.Column, feature_columns: List[imputr.domain.Column], n_estimators: int = 64, max_depth: int = 8, min_sample_split: int = 512, min_samples_leaf: int = 128, min_weight_fraction_leaf: float = 0.35, max_features: Union[str, float] = 'sqrt', max_leaf_nodes: int = 32)#

Bases: imputr.strategy._base._MultivariateStrategy

Strategy implementation for RandomForest-based imputation.

Parameters
  • target_column (Column) – The column which needs imputation.

  • feature_columns (List[Column]) – The predictor columns for the Random Forest to train on.

  • data_type (Union[str, DataType] (optional)) – The data type of the target column, given as a string or a DataType enum member.

  • n_estimators (int (optional)) – Number of decision trees used in the forest. Please refer …

  • max_depth (int (optional)) – Maximum depth of decision trees used in the forest. Please refer …

  • min_sample_split (int (optional)) – Minimum number of samples required to split an internal node of a decision tree. Please refer …

  • min_samples_leaf (int (optional)) – Minimum number of samples required at a leaf node of a decision tree. Please refer …

  • min_weight_fraction_leaf (float (optional)) – Minimum weighted fraction of the total sample weight required at a leaf node. Please refer…

  • max_features (Union[str, float] (optional)) – Maximum number of features considered when looking for the best split. Can be a fraction or an identifier such as 'sqrt'. Please refer…

  • max_leaf_nodes (int (optional)) – Maximum number of leaf nodes per decision tree. Please refer…
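The hyperparameter names above closely follow scikit-learn's random forest estimators, with one visible difference: the signature spells the split parameter `min_sample_split`, while scikit-learn's keyword is `min_samples_split`. The following is a minimal sketch of how the documented defaults could be passed to a scikit-learn estimator. The `RENAMES` table and the `to_sklearn_kwargs` helper are hypothetical and not part of imputr; how the strategy actually forwards its parameters is an assumption.

```python
from sklearn.ensemble import RandomForestRegressor

# Default values copied from the RandomForestStrategy signature.
strategy_defaults = {
    "n_estimators": 64,
    "max_depth": 8,
    "min_sample_split": 512,
    "min_samples_leaf": 128,
    "min_weight_fraction_leaf": 0.35,
    "max_features": "sqrt",
    "max_leaf_nodes": 32,
}

# Assumption: the parameters end up in a scikit-learn estimator, whose
# keyword is `min_samples_split` (plural), so that one name is translated.
RENAMES = {"min_sample_split": "min_samples_split"}


def to_sklearn_kwargs(params):
    """Hypothetical helper: map strategy parameter names to scikit-learn names."""
    return {RENAMES.get(name, name): value for name, value in params.items()}


sk_kwargs = to_sklearn_kwargs(strategy_defaults)
forest = RandomForestRegressor(**sk_kwargs)
```

Note that the defaults are conservative: with `min_weight_fraction_leaf=0.35`, every leaf must hold at least 35% of the total sample weight, so each tree can have at most two leaves.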

supported_data_types :List#
classmethod from_dict(target_column: imputr.domain.Column, feature_columns: List[imputr.domain.Column], **kwargs: Dict)#

Alternative constructor that builds the strategy from a dictionary.

Uses part of the dictionary passed to the imputer constructor.

Parameters
  • target_column (Column) – The column that the strategy imputes.

fit() → None#

Fits the Random Forest so that it is ready for imputation.

Checks the DataType of the target column to decide whether a Regressor or a Classifier is needed. The scikit-learn APIs are the same for both models, which is why the estimator class is held in a single estimator_cls variable.
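The estimator selection described for fit() can be sketched as follows. This is an illustration under assumptions, not imputr's implementation: the `DataType` enum below is a hypothetical stand-in with invented member names, and `fit_forest` is a hypothetical helper. Only the scikit-learn classes and their `fit`/`predict` methods are real.

```python
from enum import Enum

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


class DataType(Enum):
    """Hypothetical stand-in for imputr's DataType; member names are assumed."""
    CONTINUOUS = "continuous"
    CATEGORICAL = "categorical"


def fit_forest(X, y, data_type, **params):
    """Pick the estimator class from the data type, then fit it."""
    # Both classes share the same constructor keywords and fit/predict API,
    # so everything after this line is identical for either case.
    if data_type is DataType.CONTINUOUS:
        estimator_cls = RandomForestRegressor
    else:
        estimator_cls = RandomForestClassifier
    model = estimator_cls(**params)
    model.fit(X, y)
    return model


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
reg = fit_forest(X, 2.0 * X[:, 0], DataType.CONTINUOUS,
                 n_estimators=5, random_state=0)
clf = fit_forest(X, (X[:, 1] > 0).astype(int), DataType.CATEGORICAL,
                 n_estimators=5, random_state=0)
```

Keeping the class in one variable means the hyperparameters and the fitting call are written once, regardless of whether the target is numeric or categorical.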

impute_column() → pandas.Series#

Imputes all null values using the Random Forest and combines them with the non-null values.

TODO: Refactor this into a general method for better reuse.

Returns

pd.Series – The fully imputed data column.
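The predict-then-combine step can be sketched with pandas and scikit-learn as below. This is a minimal sketch, not the library's code: the standalone `impute_column` function, its arguments, and the toy data are hypothetical, and the feature columns are assumed to contain no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_column(df, target, features, model):
    """Predict the null entries of `target` and merge them with observed ones."""
    column = df[target]
    null_mask = column.isna()
    if not null_mask.any():
        return column.copy()  # nothing to impute
    predicted = pd.Series(
        model.predict(df.loc[null_mask, features]),
        index=column.index[null_mask],
        name=column.name,
    )
    # Union of observed and predicted values, restored to the original order.
    return pd.concat([column[~null_mask], predicted]).sort_index()


# Toy data: y depends on x1 and x2, with three values knocked out.
df = pd.DataFrame({"x1": np.arange(20, dtype=float),
                   "x2": np.arange(20, dtype=float) % 3})
df["y"] = 2.0 * df["x1"] + df["x2"]
df.loc[[3, 7, 15], "y"] = np.nan

# Fit only on rows where the target is observed.
observed = df["y"].notna()
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(df.loc[observed, ["x1", "x2"]], df.loc[observed, "y"])

imputed = impute_column(df, "y", ["x1", "x2"], model)
```

Observed values pass through unchanged; only the rows that were null receive model predictions.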