module: difficulty_based

class annotlib.difficulty_based.DifficultyBasedAnnot(X, y_true, classifiers=None, n_annotators=None, alphas=None, n_splits=5, n_repeats=10, confidence_noise=None, random_state=None)[source]

Bases: annotlib.standard.StandardAnnot

This class implements a simulation technique that quantifies the difficulty of a sample. The estimated difficulty is combined with an annotator's labelling performance to compute the probability that this annotator labels the sample correctly.

Parameters:
X: array-like, shape (n_samples, n_features)

Samples of the whole data set.

y_true: array-like, shape (n_samples)

True class labels of the given samples X.

classifiers: sklearn.base.ClassifierMixin | list of ClassifierMixin, shape (n_classifiers)

The classifiers parameter is either a single sklearn classifier supporting predict_proba or a list of such classifiers. If the parameter is not a list, the simplicity scores are estimated by a single classifier, whereas if it is a list, the simplicity scores can be estimated by different classifier types or different parametrisations. The default is a single SVM.

n_annotators: int

Number of annotators who are simulated.

alphas: array-like, shape (n_annotators)

The entry alphas[a_idx] indicates the labelling performance of annotator a_idx, which lies in the interval (-inf, inf). The following properties hold:

- alphas[a_idx] = 0: annotator with index a_idx makes random guesses,
- alphas[a_idx] = inf: annotator with index a_idx is almost always right,
- alphas[a_idx] = -inf: annotator with index a_idx is almost always wrong (adversarial).
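The exact formula annotlib uses to combine alpha with a sample's simplicity score is not shown here; as a purely illustrative sketch, a hypothetical model `p_correct` (name and formula are assumptions, not the library's API) that satisfies the three limit properties above could look like this:

```python
import numpy as np

def p_correct(alpha, beta, n_classes):
    """Hypothetical model combining labelling performance alpha with
    sample simplicity beta via tanh(alpha * beta) in (-1, 1)."""
    t = np.tanh(alpha * beta)
    chance = 1.0 / n_classes
    # alpha = 0 -> chance level; alpha -> inf -> 1; alpha -> -inf -> 0
    return np.where(t >= 0, chance + (1 - chance) * t, chance * (1 + t))
```

For three classes, alpha = 0 yields the guessing probability 1/3, a large positive alpha pushes the probability towards 1, and a large negative (adversarial) alpha pushes it towards 0.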

n_splits: int

Number of folds of the cross-validation.

n_repeats: int

Number of repeats of the cross-validation.
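The simplicity scores are estimated from cross-validated classifier predictions, which is where n_splits and n_repeats come in. The following sketch shows one plausible procedure (not necessarily annotlib's exact one): average, over repeats, the held-out probability a classifier assigns to each sample's true class, so that consistently high values mark easy samples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(C=1, probability=True, gamma='auto', random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
simplicity = np.zeros(len(X))
for train_idx, test_idx in cv.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])
    # probability the held-out classifier assigns to the true class:
    # consistently high values suggest an easy (simple) sample
    simplicity[test_idx] += proba[np.arange(len(test_idx)), y[test_idx]]
simplicity /= 2  # each sample is held out once per repeat
```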

confidence_noise: array-like, shape (n_annotators)

An entry of confidence_noise defines the interval from which the noise is uniformly drawn, e.g. confidence_noise[a_idx] = 0.2 results in drawing n_samples values from U(-0.2, 0.2) and adding this noise to the confidence scores of annotator a_idx. The default is zero noise for each annotator.
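A minimal numpy sketch of this noise model (the clipping of noisy scores to [0, 1] is an assumption for the illustration, not confirmed library behaviour):

```python
import numpy as np

rng = np.random.RandomState(42)
confidence_noise = [0.0, 0.2, 0.5]  # one interval half-width per annotator
n_samples = 100
# draw U(-c, c) noise per sample for each annotator, as described above
noise = np.column_stack(
    [rng.uniform(-c, c, size=n_samples) for c in confidence_noise]
)
confidences = rng.uniform(0, 1, size=(n_samples, 3))
noisy = np.clip(confidences + noise, 0, 1)  # keep scores in [0, 1]
```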

random_state: None | int | instance of numpy.random.RandomState

The random state used for generating class labels of the annotators.

Examples

>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.svm import SVC
>>> # load iris data set
>>> X, y_true = load_iris(return_X_y=True)
>>> # create list of SVM and Gaussian Process classifier
>>> classifiers = [SVC(C=1, probability=True, gamma='auto'), SVC(C=3, probability=True), GaussianProcessClassifier()]
>>> # set labelling performances of annotators
>>> alphas = [-3, 0, 3]
>>> # simulate annotators on the iris data set
>>> annotators = DifficultyBasedAnnot(X=X, y_true=y_true, classifiers=classifiers, n_annotators=3, alphas=alphas)
>>> # the number of annotators must be equal to the number of classifiers
>>> annotators.n_annotators()
3
>>> # query class labels of 100 samples from annotators a_0, a_2
>>> annotators.class_labels(X=X[0:100], y_true=y_true[0:100], annotator_ids=[0, 2], query_value=100).shape
(100, 3)
>>> # check query values
>>> annotators.n_queries()
array([100,   0, 100])
>>> # query confidence scores of these 100 samples from annotators a_0, a_2
>>> annotators.confidence_scores(X=X[0:100], y_true=y_true[0:100], annotator_ids=[0, 2]).shape
(100, 3)
>>> # query values are not affected by calling the confidence score method
>>> annotators.n_queries()
array([100,   0, 100])
>>> # labelling performance of annotator a_0 is adversarial (worse than guessing)
>>> annotators.labelling_performance(X=X, y_true=y_true)[0] < 1/len(np.unique(y_true))
True
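The final check above flags annotator a_0 as adversarial because its accuracy falls below the chance level 1/len(np.unique(y_true)). A self-contained numpy sketch of that comparison, using a hypothetical annotator who is correct only 10% of the time on a three-class problem:

```python
import numpy as np

rng = np.random.RandomState(0)
y_true = rng.randint(0, 3, size=300)          # 3-class ground truth
# hypothetical adversarial annotator: correct on only ~10% of samples
correct = rng.rand(300) < 0.10
# wrong answers shift the label by 1 or 2 (mod 3), so they never match
y_annot = np.where(correct, y_true, (y_true + rng.randint(1, 3, 300)) % 3)
accuracy = np.mean(y_annot == y_true)
chance = 1 / len(np.unique(y_true))           # 1/3 for three classes
assert accuracy < chance  # worse than guessing -> adversarial
```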
Attributes:
X_: numpy.ndarray, shape (n_samples, n_features)

Samples of the whole data set.

Y_: numpy.ndarray, shape (n_samples, n_annotators)

Class labels of the given samples X.

C_: numpy.ndarray, shape (n_samples, n_annotators)

Confidence scores of the annotators for labelling the given samples X.

C_noise_: numpy.ndarray, shape (n_samples, n_annotators)

The uniformly drawn noise for each annotator and each sample, e.g. C_noise_[x_idx, a_idx] indicates the noise added to the confidence score of annotator a_idx for labelling sample x_idx.

n_annotators_: int

Number of annotators.

n_queries_: numpy.ndarray, shape (n_annotators)

An entry n_queries_[a] indicates how many queries annotator a has processed.

queried_flags_: numpy.ndarray, shape (n_samples, n_annotators)

An entry queried_flags_[x_idx, a_idx] is a boolean indicating whether annotator a_idx has provided a class label for sample x_idx.

y_true_: numpy.ndarray, shape (n_samples)

The true class labels of the given samples.

alphas_: numpy.ndarray, shape (n_annotators)

The entry alphas_[a_idx] indicates the labelling performance of annotator a_idx, which lies in the interval (-inf, inf). The following properties hold:

- alphas_[a_idx] = 0: annotator with index a_idx makes random guesses,
- alphas_[a_idx] = inf: annotator with index a_idx is almost always right,
- alphas_[a_idx] = -inf: annotator with index a_idx is almost always wrong (adversarial).

betas_: numpy.ndarray, shape (n_samples)

The entry betas_[x_idx] represents the simplicity score of sample X_[x_idx], which lies in the interval [0, inf):

- betas_[x_idx] = 0: every annotator makes random guesses on sample X_[x_idx],
- betas_[x_idx] = inf: an annotator a_idx with alphas_[a_idx] > 0 is always right on sample X_[x_idx].

n_splits_: int

Number of folds of the cross-validation.

n_repeats_: int

Number of repeats of the cross-validation.

random_state_: None | int | numpy.random.RandomState

The random state used for generating class labels of the annotators.

plot_annotators_labelling_probabilities(self, figsize=(5, 3), dpi=150, fontsize=7)[source]

Creates a plot of the correct labelling probabilities for given labelling performances and estimated sample simplicity scores.

Returns:
fig: matplotlib.figure.Figure
ax: matplotlib.axes.Axes