Core class structure

Strategy (class)

A strategy in the imputr library is an imputation method. Its name stems from the Strategy behavioral design pattern. It lets you indicate how you want the Imputer to impute a specific column.

An example of this is the RandomForestStrategy.

class imputr.strategy.RandomForestStrategy(target_column: Column, feature_columns: List[Column], n_estimators: int = 64, max_depth: int = 8, min_sample_split: int = 512, min_samples_leaf: int = 128, min_weight_fraction_leaf: float = 0.35, max_features: Union[str, float] = 'sqrt', max_leaf_nodes: int = 32)

Strategy implementation for RandomForest-based imputation.

Parameters
  • target_column (Column) – The column which needs imputation.

  • feature_columns (List[Column]) – The predictor columns for the Random Forest to train on.

  • data_type (Union[str, DataType] (optional)) – The string or enum representation of the data_type.

  • n_estimators (int (optional)) – Number of decision trees used in the forest. Please refer …

  • max_depth (int (optional)) – Maximum depth of decision trees used in the forest. Please refer …

  • min_sample_split (int (optional)) – Minimum sample split of decision trees. Please refer …

  • min_samples_leaf (int (optional)) – Minimum samples at leaves of decision trees. Please refer …

  • min_weight_fraction_leaf (float (optional)) – Minimum weight fractions of leaves of decision trees. Please refer…

  • max_features (Union[str, float] (optional)) – Max features used per decision tree. Can be fraction or identifier like sqrt. Please refer…

  • max_leaf_nodes (int (optional)) – Max number of nodes at leaves of the decision trees. Please refer…

In short, a strategy:
  • is instantiated for a single column

  • can support one or more DataTypes

  • can be univariate or multivariate

  • can be customized

Its interface is defined in the abstract _BaseStrategy class.
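
The Strategy pattern described in this section can be illustrated with a minimal sketch in plain Python. The classes below are hypothetical stand-ins: they do not reproduce imputr's actual _BaseStrategy interface, and the names BaseStrategySketch, MeanStrategySketch, supported_data_types, fit and impute are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class BaseStrategySketch(ABC):
    """Hypothetical abstract interface; NOT imputr's real _BaseStrategy."""

    # Data types the strategy claims to support (assumed attribute name).
    supported_data_types = ()

    @abstractmethod
    def fit(self, values: List[Optional[float]]) -> None:
        """Learn whatever is needed from the observed values."""

    @abstractmethod
    def impute(self, values: List[Optional[float]]) -> List[float]:
        """Return a copy of the values with missing entries filled."""


class MeanStrategySketch(BaseStrategySketch):
    """Univariate example: fills missing entries with the column mean."""

    supported_data_types = ("continuous",)

    def __init__(self) -> None:
        self._mean: Optional[float] = None

    def fit(self, values: List[Optional[float]]) -> None:
        observed = [v for v in values if v is not None]
        if not observed:
            raise ValueError("cannot fit on a column with no observed values")
        self._mean = sum(observed) / len(observed)

    def impute(self, values: List[Optional[float]]) -> List[float]:
        if self._mean is None:
            raise RuntimeError("call fit() before impute()")
        return [self._mean if v is None else v for v in values]


# One strategy instance per column, as described above.
strategy = MeanStrategySketch()
column = [1.0, None, 3.0]
strategy.fit(column)
filled = strategy.impute(column)
```

A multivariate strategy would follow the same shape but receive the feature columns in addition to the target column.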

Imputer (class)

An Imputer is the highest-level class of the library and provides the API for end users.

An example of this is the AutoImputer.

class imputr.AutoImputer(data: DataFrame, predefined_order: Optional[Dict[str, int]] = None, predefined_strategies: Optional[Dict[str, Dict]] = None, predefined_datatypes: Optional[Dict[str, Union[str, DataType]]] = None, include_non_missing: bool = False)

Automatic imputation class that implements the RandomForest strategy as main imputation method. Can be configured to implement other strategies for specific columns and a custom imputation order.

Variables
  • predefined_order (Dict[int, str] (optional)) – Dictionary of column names and their order for imputation. Keys must be consecutive integers starting from zero: 0, 1, 2, and so on.

  • strategies (Dict[str, Dict] (optional)) – Dictionary of column name and strategy kwargs.

  • predefined_datatypes (Dict[str, Union[str, DataType]] (optional)) – Dictionary that has column names as key and the data type as specified in the Column constructor as value.

Parameters
  • data (pd.DataFrame) – The dataframe which undergoes imputation.

  • predefined_order (Dict[int, str] (optional)) – Dictionary of column names and their order for imputation. Keys must be consecutive integers starting from zero: 0, 1, 2, and so on.

  • predefined_strategies (Dict[str, Dict] (optional)) – Dictionary of column name and strategy kwargs.

  • predefined_datatypes (Dict[str, Union[str, DataType]] (optional)) – Dictionary that has column names as key and the data type as specified in the Column constructor as value.

  • include_non_missing (bool (optional)) – Flag to indicate whether strategies should also be fitted for columns without missing values. Defaults to False.
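
The key rule for predefined_order can be made concrete with a small check. This is a sketch only: check_predefined_order is a hypothetical helper, not part of imputr, and it assumes the dictionary is keyed by position (Dict[int, str]) as in the parameter description above, whereas the class signature annotates it as Dict[str, int]. The column names are made up.

```python
from typing import Dict


def check_predefined_order(order: Dict[int, str]) -> None:
    """Raise ValueError unless the keys are exactly 0, 1, ..., len(order) - 1.

    Hypothetical helper for illustration; imputr's own validation may differ.
    """
    expected = list(range(len(order)))
    actual = sorted(order)
    if actual != expected:
        raise ValueError(f"order keys must be {expected}, got {actual}")


# Keys 0, 1, 2 with no gaps: accepted.
valid_order = {0: "age", 1: "income", 2: "city"}
check_predefined_order(valid_order)

# Keys starting at 1: rejected.
try:
    check_predefined_order({1: "age", 2: "income"})
    gap_rejected = False
except ValueError:
    gap_rejected = True
```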

It fits and imputes with one or more strategies for a whole table. Its behavior can be customized by the caller. It implements the _BaseImputer class.
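
To show how a table-level imputer can delegate to per-column strategies, here is a minimal sketch in plain Python. It is not imputr's implementation: TableImputerSketch, mean_fill and constant_fill are hypothetical names, the table is a plain dict of lists rather than a DataFrame, and the default of mean filling is an assumption chosen to keep the example short (the AutoImputer defaults to a RandomForest strategy).

```python
from typing import Callable, Dict, List, Optional

Values = List[Optional[float]]
FillFunction = Callable[[Values], List[float]]


def mean_fill(values: Values) -> List[float]:
    """Fill missing entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def constant_fill(values: Values) -> List[float]:
    """Fill missing entries with 0.0."""
    return [0.0 if v is None else v for v in values]


class TableImputerSketch:
    """Hypothetical table-level imputer that applies one strategy per column."""

    def __init__(
        self,
        data: Dict[str, Values],
        strategies: Optional[Dict[str, FillFunction]] = None,
        include_non_missing: bool = False,
    ) -> None:
        self.data = data
        self.strategies = strategies or {}
        self.include_non_missing = include_non_missing

    def impute(self) -> Dict[str, List[float]]:
        result: Dict[str, List[float]] = {}
        for name, values in self.data.items():
            has_missing = any(v is None for v in values)
            # Complete columns are skipped unless the caller opts in.
            if not has_missing and not self.include_non_missing:
                result[name] = list(values)
                continue
            # Caller-supplied strategy wins; otherwise use the default.
            strategy = self.strategies.get(name, mean_fill)
            result[name] = strategy(values)
        return result


table = {"a": [1.0, None, 5.0], "b": [None, 2.0, 4.0], "c": [7.0, 8.0, 9.0]}
imputer = TableImputerSketch(table, strategies={"b": constant_fill})
imputed = imputer.impute()
```

Column "a" uses the default, column "b" uses the caller's override, and column "c" is passed through because it has no missing values.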

Column (class)

The Column is the dataclass that holds the actual data as well as the metadata of a column, such as name, number of unique values, number of missing values, and DataType. It also comes with a basic set of methods that make it easier to work with an imputer. Columns are held as a list attribute of the Table dataclass.

class imputr.domain.Column(data: Series, data_type: Optional[Union[str, DataType]] = None)

Data class that encapsulates the data and imputr-specific metadata of a column.

Parameters
  • data (pd.Series) – The Pandas Series that contains the column data.

  • data_type (Union[str, DataType] (optional)) – The imputr DataType specified per string or DataType enum class.
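
The idea of bundling a column's data with its metadata can be sketched with a standard-library dataclass. ColumnSketch is hypothetical and does not mirror imputr's Column: the field names, the use of a plain list instead of a pd.Series, and the naive type inference applied when data_type is omitted are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ColumnSketch:
    """Hypothetical column container: data plus derived metadata."""

    name: str
    data: List[Optional[object]]
    data_type: Optional[str] = None
    # Derived in __post_init__, so excluded from the constructor.
    n_missing: int = field(init=False, default=0)
    n_unique: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        observed = [v for v in self.data if v is not None]
        self.n_missing = len(self.data) - len(observed)
        self.n_unique = len(set(observed))
        if self.data_type is None:
            # Assumed rule: all-numeric columns count as continuous.
            numeric = all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in observed
            )
            self.data_type = "continuous" if numeric else "categorical"


age = ColumnSketch("age", [31, None, 45, 31])
city = ColumnSketch("city", ["Oslo", None, "Rome"])
forced = ColumnSketch("code", [1, 2, 2], data_type="categorical")
```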

DataType (enum)

The DataType describes the data in a column from an imputation perspective. A strategy must explicitly support a DataType before an Imputer can use it.

class imputr.domain.DataType(value)

Enum class that represents the various data types that the library is able to impute for and with.

Currently only contains categorical and continuous. Future releases may contain specific enumerations for discrete, discrete-ordinal and a separate datetime type.

The DataTypes Imputr uses are the following:
  • Categorical

  • Continuous

  • Discrete-ordinal (future release)

  • Datetime (future release, currently mapped to Continuous)
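
Several parameters in this document accept a data type either as a string or as an enum member. The sketch below shows one way such a value could be normalized with Python's standard Enum. DataTypeSketch and resolve_data_type are hypothetical, and the string values "categorical" and "continuous" are assumptions; imputr's actual member values are not shown here.

```python
from enum import Enum
from typing import Union


class DataTypeSketch(Enum):
    """Hypothetical enum with the two currently supported data types."""

    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"


def resolve_data_type(value: Union[str, DataTypeSketch]) -> DataTypeSketch:
    """Accept either an enum member or its string form (case-insensitive)."""
    if isinstance(value, DataTypeSketch):
        return value
    # Enum lookup by value raises ValueError for unknown strings.
    return DataTypeSketch(value.strip().lower())


from_string = resolve_data_type("Continuous")
from_enum = resolve_data_type(DataTypeSketch.CATEGORICAL)
```

Looking a member up by value raises ValueError for an unrecognized string, so a misspelled data type fails early rather than being silently accepted.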