Learning from Continuous Data

All modules related to learning Bayesian Belief Networks (BBNs) from multivariate Gaussian data.

Data

The data.

class pysparkbbn.continuous.data.CondMvn(index_1: List[int], index_2: List[int], means: ndarray, cov: ndarray, zero=1e-06, cols_1: List[str] = None, cols_2: List[str] = None)

Bases: object

Conditional multivariate normal.

__init__(index_1: List[int], index_2: List[int], means: ndarray, cov: ndarray, zero=1e-06, cols_1: List[str] = None, cols_2: List[str] = None)

ctor

Parameters
  • index_1 – Index of dependent variables.

  • index_2 – Index of conditioning variables.

  • means – Means.

  • cov – Covariance matrix.

  • zero – Threshold below which to consider a probability as zero.

  • cols_1 – Names corresponding to index_1.

  • cols_2 – Names corresponding to index_2.
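
The conditioning behind this class can be sketched with NumPy and SciPy. This is an illustrative sketch of the standard Gaussian conditioning formulas (partition the means and covariance, then condition), not the library's implementation; all values and variable names are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Joint distribution over (x0, x1): illustrative means and covariance.
means = np.array([1.0, 2.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

index_1, index_2 = [0], [1]  # model x0 conditioned on x1

# Partition the covariance matrix into its four blocks.
c11 = cov[np.ix_(index_1, index_1)]
c12 = cov[np.ix_(index_1, index_2)]
c21 = cov[np.ix_(index_2, index_1)]
c22 = cov[np.ix_(index_2, index_2)]

# Partition the data point and the means.
x = np.array([0.5, 1.5])
x1, x2 = x[index_1], x[index_2]
m1, m2 = means[index_1], means[index_2]

# Conditional mean m1 + C12 C22^-1 (x2 - m2) and covariance C11 - C12 C22^-1 C21.
k = c12 @ np.linalg.inv(c22)
cond_means = m1 + k @ (x2 - m2)
cond_cov = c11 - k @ c21

# Conditional density of x1 given x2.
p = multivariate_normal.pdf(x1, mean=cond_means, cov=cond_cov)
```

With these numbers the conditional mean is 0.6 and the conditional variance is 1.36.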

equals(other: Any) bool

Checks if this is equal to other.

Parameters

other – CondMvn.

Returns

Boolean.

log_proba(x: ndarray) ndarray

Estimate the log conditional probability of the specified data point.

Parameters

x – Data point.

Returns

Log probability.

static partition_cov(cov: ndarray, index_1: List[int], index_2: List[int]) Tuple[ndarray, ndarray, ndarray, ndarray]

Partitions the covariance matrix.

Parameters
  • cov – Covariance matrix.

  • index_1 – Index.

  • index_2 – Index.

Returns

Partitioned covariance matrix.

static partition_means(means: ndarray, index_1: List[int], index_2: List[int]) Tuple[ndarray, ndarray]

Partitions the means.

Parameters
  • means – Means.

  • index_1 – Index.

  • index_2 – Index.

Returns

Partitioned means.

static partition_x(x: ndarray, index_1: List[int], index_2: List[int]) Tuple[ndarray, ndarray]

Partitions the data point.

Parameters
  • x – Data point.

  • index_1 – Index.

  • index_2 – Index.

Returns

Tuple of the partitioned data point.

pdf(x: ndarray) float

Estimate the conditional probability of the specified data point.

Parameters

x – Data point.

Returns

Probability.

class pysparkbbn.continuous.data.GaussianData(sdf: DataFrame, n_samples=6, spark=None)

Bases: object

Gaussian data.

__init__(sdf: DataFrame, n_samples=6, spark=None)

ctor

Parameters
  • sdf – Spark data frame.

  • n_samples – Number of samples.

  • spark – Spark session; required as a workaround for a bug.

drop(columns: List[str])

Drops specified columns.

Parameters

columns – List of columns.

Returns

Gaussian data.

get_cmi_par(triplets: List[Triplet]) List[Tuple[List[str], List[str], List[str], float]]

Computes conditional mutual information between triplets in parallel.

Parameters

triplets – List of triplets (of variables).

Returns

List of conditional mutual information.

get_covariance() ndarray

Gets the covariance matrix.

Returns

Covariance matrix.

get_means() ndarray

Get means.

Returns

List of means.

get_mi_par(pairs: List[Pair]) List[Tuple[List[str], List[str], float]]

Computes mutual information between the pairs of variables in parallel.

Parameters

pairs – List of pairs (of variables).

Returns

List of mutual information.

get_min_max(columns: List[str]) Dict[str, Dict[str, Union[int, float]]]

Get dictionary of min/max.

Parameters

columns – Variable names.

Returns

Dictionary of min/max associated with names.

get_min_max_for(column: str) Dict[str, Union[int, float]]

Get min/max value for specified variable.

Parameters

column – Variable name.

Returns

Dictionary of min/max.

get_mvn(columns: List[str]) Mvn

Gets a multivariate normal instance.

Parameters

columns – List of variable names.

Returns

Multivariate normal.

get_pair(x: List[str], y: List[str]) Pair

Gets a pair.

Parameters
  • x – X variables.

  • y – Y variables.

Returns

Pair.

get_pairs(col_pairs: List[Tuple[List[str], List[str]]] = None) List[Pair]

Gets list of pairs.

Parameters

col_pairs – List of column pairs.

Returns

List of pairs.

get_pairwise_columns() Generator[Tuple[List[str], List[str]], None, None]

Gets pairs of columns.

Returns

Generator of pairs of columns; each column is wrapped in a list.

get_profile() Dict[str, Dict[str, float]]

Gets profile of variables.

Returns

Dictionary; keys are variable names and values are summary stats.

get_score_par(cmvns: List[CondMvn]) List[Tuple[List[str], List[str], float]]

Computes the scores.

Parameters

cmvns – List of conditional multivariate Gaussian distributions.

Returns

List of scores.

get_triplet(x: List[str], y: List[str], z: List[str]) Triplet

Gets a triplet.

Parameters
  • x – X variables.

  • y – Y variables.

  • z – Z variables.

Returns

Triplet.

get_triplets(col_triplets: List[Tuple[List[str], List[str], List[str]]])

Gets list of triplets.

Parameters

col_triplets – List of column triplets.

Returns

List of triplets.

slice_covariance(columns: List[str]) ndarray

Slices covariance matrix according to variables.

Parameters

columns – List of variables.

Returns

Covariance matrix.
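
Slicing a covariance matrix by variable names reduces to index lookup plus block selection. A minimal sketch with NumPy, using made-up variable names and values:

```python
import numpy as np

columns = ['a', 'b', 'c']               # all variables, in covariance order
means = np.array([0.0, 1.0, 2.0])
cov = np.array([[1.0, 0.2, 0.1],
                [0.2, 2.0, 0.3],
                [0.1, 0.3, 3.0]])

# Map the requested subset of variables to their positions.
subset = ['a', 'c']
idx = [columns.index(c) for c in subset]

sliced_means = means[idx]               # means for ('a', 'c')
sliced_cov = cov[np.ix_(idx, idx)]      # 2x2 covariance block for ('a', 'c')
```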

slice_means(columns: List[str]) ndarray

Slices means vector according to variables.

Parameters

columns – List of variables.

Returns

List of means.

class pysparkbbn.continuous.data.Mvn(columns: List[str], means: ndarray, cov: ndarray, profile: Dict[str, Dict[str, float]], n_samples=10)

Bases: object

Multivariate normal distribution.

__init__(columns: List[str], means: ndarray, cov: ndarray, profile: Dict[str, Dict[str, float]], n_samples=10)

ctor

Parameters
  • columns – List of variable names.

  • means – Means vector.

  • cov – Covariance matrix.

  • profile – Dictionary of min/max for each variable.

  • n_samples – Number of samples.

get_values() Generator[Tuple[float, ...], None, None]

Gets the sampled values.

Returns

Generator of values.

pdf(x: ndarray) float

Estimate the probability of the specified data point.

Parameters

x – Data point.

Returns

Probability.

class pysparkbbn.continuous.data.Pair(X: Mvn, Y: Mvn, XY: Mvn)

Bases: object

Pair of variables.

__init__(X: Mvn, Y: Mvn, XY: Mvn)

ctor

Parameters
  • X – X variables.

  • Y – Y variables.

  • XY – XY variables.

get_mi() float

Computes the mutual information.

Returns

Mutual information.
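
For jointly Gaussian variables, mutual information has a closed form in terms of covariance determinants: I(X; Y) = 0.5 ln(det Σ_X det Σ_Y / det Σ_XY). The sketch below illustrates that formula with NumPy; it is not necessarily the library's exact computation, and the values are made up.

```python
import numpy as np

def gaussian_mi(cov, x_idx, y_idx):
    """Mutual information between Gaussian blocks X and Y of a joint covariance."""
    det = lambda idx: np.linalg.det(cov[np.ix_(idx, idx)])
    xy_idx = list(x_idx) + list(y_idx)
    return 0.5 * np.log(det(x_idx) * det(y_idx) / det(xy_idx))

# Bivariate case with correlation rho: MI reduces to -0.5 * ln(1 - rho^2).
rho = 0.5
cov = np.array([[1.0, rho],
                [rho, 1.0]])
mi = gaussian_mi(cov, [0], [1])
```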

get_partial_mi(dp: ndarray) float

Computes the partial mutual information.

Parameters

dp – Data point.

Returns

Partial mutual information.

get_values() Generator[Tuple[float, ...], None, None]

Gets the XY values.

Returns

Generator of XY values.

class pysparkbbn.continuous.data.Triplet(x_cols: List[str], y_cols: List[str], z_cols: List[str], Z: Mvn, XZ: Mvn, YZ: Mvn, XYZ: Mvn)

Bases: object

Triplet of variables.

__init__(x_cols: List[str], y_cols: List[str], z_cols: List[str], Z: Mvn, XZ: Mvn, YZ: Mvn, XYZ: Mvn)

ctor

Parameters
  • x_cols – X columns.

  • y_cols – Y columns.

  • z_cols – Z columns.

  • Z – Z variables.

  • XZ – XZ variables.

  • YZ – YZ variables.

  • XYZ – XYZ variables.

get_cmi() float

Computes the conditional mutual information.

Returns

Conditional mutual information.
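
Conditional mutual information for jointly Gaussian variables also has a determinant form: I(X; Y | Z) = 0.5 ln(det Σ_XZ det Σ_YZ / (det Σ_Z det Σ_XYZ)). A sketch with NumPy, using a constructed covariance where X and Y are marginally dependent but conditionally independent given Z; this is an illustration, not the library's implementation.

```python
import numpy as np

def gaussian_cmi(cov, x_idx, y_idx, z_idx):
    """Conditional mutual information I(X; Y | Z) for jointly Gaussian variables."""
    det = lambda idx: np.linalg.det(cov[np.ix_(idx, idx)])
    xz = list(x_idx) + list(z_idx)
    yz = list(y_idx) + list(z_idx)
    xyz = list(x_idx) + list(y_idx) + list(z_idx)
    return 0.5 * np.log(det(xz) * det(yz) / (det(z_idx) * det(xyz)))

# X = Z + e1 and Y = Z + e2 with independent unit-variance noise terms:
# X and Y are correlated marginally but conditionally independent given Z.
cov = np.array([[2.0, 1.0, 1.0],
                [1.0, 2.0, 1.0],
                [1.0, 1.0, 1.0]])  # variable order: X, Y, Z
cmi = gaussian_cmi(cov, [0], [1], [2])                       # close to 0
mi = 0.5 * np.log(cov[0, 0] * cov[1, 1]
                  / np.linalg.det(cov[np.ix_([0, 1], [0, 1])]))  # marginal MI > 0
```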

get_partial_mi(dp: ndarray) float

Computes the partial mutual information.

Parameters

dp – Data point.

Returns

Partial mutual information.

get_values() Generator[Tuple[float, ...], None, None]

Gets the XYZ values.

Returns

Generator of XYZ values.

Structure Learning

Constraint-Based

Constraint-based structure learning.

class pysparkbbn.continuous.scblearn.Ban(data: GaussianData, clazz: str, cmi_threshold=0.0001, method='pc')

Bases: BaseStructureLearner

Modified Bayesian network augmented naive Bayes (BAN). See Bayesian Network Classifiers.

__init__(data: GaussianData, clazz: str, cmi_threshold=0.0001, method='pc')

ctor

Parameters
  • data – Data.

  • clazz – Class variable.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

  • method – Either pc or tpda (default=pc).

get_network() DiGraph

Gets the network structure.

class pysparkbbn.continuous.scblearn.BaseStructureLearner(data: GaussianData)

Bases: object

Base structure learner.

__init__(data: GaussianData)

ctor

Parameters

data – Data.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.continuous.scblearn.Mwst(data: GaussianData, cmi_threshold=0.01)

Bases: BaseStructureLearner

Maximum weight spanning tree.

__init__(data: GaussianData, cmi_threshold=0.01)

ctor

Parameters
  • data – Data.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.
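
The core of the MWST idea can be sketched with networkx: treat pairwise mutual information as edge weights and keep the spanning tree that maximizes total weight. The scores below are made up for illustration; this is not the library's implementation.

```python
import networkx as nx

# Pairwise mutual information scores between variables (illustrative values).
mi = {('a', 'b'): 0.9, ('a', 'c'): 0.2, ('b', 'c'): 0.5,
      ('b', 'd'): 0.7, ('c', 'd'): 0.1}

# Build a weighted undirected graph from the scores.
g = nx.Graph()
for (u, v), w in mi.items():
    g.add_edge(u, v, weight=w)

# Keep the edges that maximize total weight while spanning all nodes.
tree = nx.maximum_spanning_tree(g, weight='weight')
```

Here the tree keeps the edges a-b, b-d, and b-c, discarding the weaker a-c and c-d links.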

class pysparkbbn.continuous.scblearn.Naive(data: GaussianData, clazz: str)

Bases: BaseStructureLearner

Naive Bayesian network. An arc is drawn from the clazz variable/node to every other node.

__init__(data: GaussianData, clazz: str)

ctor.

Parameters
  • data – Data.

  • clazz – The clazz node.

get_network() DiGraph

Gets the network structure.
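
The resulting structure is simple to picture: the class node is the sole parent of every other node. A sketch with networkx, using made-up variable names:

```python
import networkx as nx

# Variables in the data; 'cls' plays the role of the clazz node (names assumed).
columns = ['cls', 'x1', 'x2', 'x3']
clazz = 'cls'

g = nx.DiGraph()
g.add_nodes_from(columns)
# The naive structure: an arc from the class node to every other node.
g.add_edges_from((clazz, c) for c in columns if c != clazz)
```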

class pysparkbbn.continuous.scblearn.Pc(data: GaussianData, cmi_threshold=0.0001)

Bases: BaseStructureLearner

PC algorithm. See A fast PC algorithm for high dimensional causal discovery with multi-core PCs.

__init__(data: GaussianData, cmi_threshold=0.0001)

ctor

Parameters
  • data – Data.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

learn_undirected_graph() Graph

Learns an undirected graph.

Returns

Undirected graph.

class pysparkbbn.continuous.scblearn.Tan(data: GaussianData, clazz: str, cmi_threshold=0.01)

Bases: Mwst

Tree-augmented network. See Comparing Bayesian Network Classifiers.

__init__(data: GaussianData, clazz: str, cmi_threshold=0.01)

ctor.

Parameters
  • data – Data.

  • clazz – The clazz node.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.continuous.scblearn.Tpda(data: GaussianData, cmi_threshold=0.006)

Bases: Mwst

Three-phase dependency analysis (TPDA). See Learning Belief Networks from Data: An Information Theory Based Approach.

__init__(data: GaussianData, cmi_threshold=0.006)

ctor.

Parameters
  • data – Data.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

learn_undirected_graph() Graph

Learns an undirected graph.

Returns

Undirected graph.

Search-and-Scoring

Search-and-scoring structure learning.

class pysparkbbn.continuous.ssslearn.Ga(data: GaussianData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.0001}, seed=37)

Bases: object

Uses a genetic algorithm to search and score candidate networks. The algorithm is a hybrid approach: the ordering of nodes is first induced by a constraint-based algorithm (MWST, PC, or TPDA), and the ordered nodes then constrain the candidate parents of each node; later nodes cannot be parents of earlier ones. See Learning Bayesian Networks: Search Methods and Experimental Results.
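
The ordering constraint can be illustrated in a few lines: each node's legal parent sets are subsets of the nodes appearing earlier in the induced ordering, capped at max_parents. The ordering and names below are made up; this is a sketch of the constraint, not the library's search.

```python
from itertools import combinations

ordering = ['a', 'b', 'c', 'd']  # ordering induced by MWST/PC/TPDA (illustrative)
max_parents = 2

def candidate_parent_sets(node, ordering, max_parents):
    """All legal parent sets for node: subsets of earlier nodes, up to max_parents."""
    earlier = ordering[:ordering.index(node)]
    k = min(max_parents, len(earlier))
    return [set(s) for n in range(k + 1) for s in combinations(earlier, n)]

sets_d = candidate_parent_sets('d', ordering, max_parents)
```

The first node in the ordering can have no parents, while the last may draw up to max_parents from any of the earlier nodes.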

__init__(data: GaussianData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.0001}, seed=37)

ctor

Parameters
  • data – Data.

  • sc – Spark context.

  • max_parents – Maximum number of parents (default=4).

  • mutation_rate – Mutation rate (default=0.25).

  • pop_size – Population size (default=100).

  • crossover_prob – Crossover probability (default=0.5).

  • max_iters – Maximum iterations (default=20).

  • convergence_threshold – Convergence threshold; terminate when no improvement is made after this many generations (default=3).

  • ordering – Ordering method: mwst, pc or tpda (default=mwst).

  • ordering_params – Ordering parameters to the ordering method.

  • seed – Seed for random number generation (default=37).

get_network() DiGraph

Gets the network structure.