stringcluster package
Submodules
stringcluster.base module
Module for de-duplicating arrays of strings.
- class stringcluster.base.StringCluster(ngram_size: int = 2, threshold: float = 0.8, stop_tokens: str = '[\\W_]+')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Transformer for de-duplicating an array-like container of strings.
- ngram_size
Size of ngrams to use in TfidfVectorizer.
- Type
int
- threshold
Threshold to determine similarities; only samples whose similarity score exceeds this value are flagged as similar.
- Type
float
- stop_tokens
RegEx pattern of stop tokens for use in TfidfVectorizer.
- Type
re.Pattern
- vec
Scikit-Learn TfidfVectorizer.
- Type
TfidfVectorizer
- similarity_
Array of
- Type
np.ndarray
- labels_
- Type
np.ndarray
- __init__(ngram_size: int = 2, threshold: float = 0.8, stop_tokens: str = '[\\W_]+')
Instantiate a StringCluster object.
- Parameters
ngram_size (int) – Size of ngrams to use in TfidfVectorizer; default 2.
threshold (float) – Threshold to determine similarities; default 0.8; must lie in the range [0, 1].
stop_tokens (str) – RegEx pattern of stop tokens for use in TfidfVectorizer; default r'[\W_]+'.
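How ngram_size, threshold, and stop_tokens interact can be sketched in plain Python. This is an illustration only, not the library's implementation: it compares raw character n-gram counts, whereas StringCluster relies on a TfidfVectorizer. The lowercasing step and the helper names char_ngrams, cosine, and is_similar are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Same pattern as the documented default for stop_tokens.
STOP_TOKENS = re.compile(r"[\W_]+")


def char_ngrams(text: str, ngram_size: int = 2) -> Counter:
    """Count character n-grams after lowercasing and stripping stop tokens.

    Lowercasing is an assumption of this sketch.
    """
    cleaned = STOP_TOKENS.sub("", text.lower())
    return Counter(
        cleaned[i:i + ngram_size]
        for i in range(len(cleaned) - ngram_size + 1)
    )


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def is_similar(left: str, right: str,
               ngram_size: int = 2, threshold: float = 0.8) -> bool:
    """Flag a pair as similar only when its score exceeds the threshold."""
    score = cosine(char_ngrams(left, ngram_size), char_ngrams(right, ngram_size))
    return score > threshold
```

With the defaults, "Acme Corp." and "ACME_Corp" reduce to the same n-grams and are flagged as similar, while "Acme Corp." and "Globex" share no n-grams and are not.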
- fit(X: Union[List, pandas.core.series.Series, numpy.ndarray], y: Optional[Union[List, pandas.core.series.Series, numpy.ndarray]] = None) -> stringcluster.base.StringCluster
Fit the transformer to data.
- Parameters
X (Data) – Array-like object containing duplicated strings.
y (Optional[Data]) – Optional array-like object containing ‘master list’ of values to map similar samples to.
- Returns
StringCluster – Self.
- transform(X: Union[List, pandas.core.series.Series, numpy.ndarray], y: Optional[Union[List, pandas.core.series.Series, numpy.ndarray]] = None) -> pandas.core.series.Series
Transform data.
- Parameters
X (Data) – Array-like object containing duplicated strings.
y (Optional[Data]) – Optional array-like object containing ‘master list’ of values to map similar samples to.
- Returns
pd.Series – Pandas Series of de-duplicated values.
- fit_transform(X: Union[List, pandas.core.series.Series, numpy.ndarray], y: Optional[Union[List, pandas.core.series.Series, numpy.ndarray]] = None, **fit_params) -> pandas.core.series.Series
Fit and transform the data.
- Parameters
X (Data) – Array-like object containing duplicated strings.
y (Optional[Data]) – Optional array-like object containing ‘master list’ of values to map similar samples to.
fit_params – Optional keyword arguments; accepted for compatibility only.
- Returns
pd.Series – Pandas Series of de-duplicated values.
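The role of the optional ‘master list’ y can be illustrated with a rough, standard-library sketch. Everything here is hypothetical: the function dedupe and its helpers are not part of stringcluster, raw n-gram counts stand in for TF-IDF weights, a plain list is returned instead of a pd.Series, and using the first-seen value as the representative when no master list is given is an assumption rather than documented behavior.

```python
import math
import re
from collections import Counter
from typing import List, Optional


def _profile(text: str, ngram_size: int) -> Counter:
    """Character n-gram counts of the lowercased text with stop tokens removed."""
    cleaned = re.sub(r"[\W_]+", "", text.lower())
    return Counter(
        cleaned[i:i + ngram_size]
        for i in range(len(cleaned) - ngram_size + 1)
    )


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(n * b[g] for g, n in a.items())
    norm = math.sqrt(sum(n * n for n in a.values()) * sum(n * n for n in b.values()))
    return dot / norm if norm else 0.0


def dedupe(values: List[str], master: Optional[List[str]] = None,
           ngram_size: int = 2, threshold: float = 0.8) -> List[str]:
    """Map each value to its most similar candidate if the score exceeds threshold.

    With a master list the candidates are fixed; without one, every value
    that matches no earlier candidate becomes a new candidate itself.
    """
    candidates = list(master) if master is not None else []
    result = []
    for value in values:
        profile = _profile(value, ngram_size)
        best = max(
            candidates,
            key=lambda c: _cosine(profile, _profile(c, ngram_size)),
            default=None,
        )
        if best is not None and _cosine(profile, _profile(best, ngram_size)) > threshold:
            result.append(best)
        else:
            result.append(value)
            if master is None:
                candidates.append(value)
    return result
```

Under these assumptions, dedupe(["Acme Corp", "acme corp.", "Globex"]) collapses the first two entries onto "Acme Corp", and passing master=["ACME Corp."] maps matching values onto the master spelling while leaving unmatched values unchanged.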
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params (dict) – Parameter names mapped to their values.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self (estimator instance) – Estimator instance.
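The <component>__<parameter> naming convention can be illustrated with a small toy function that separates an estimator's own parameters from those addressed to nested components. This is a sketch of the naming scheme only, not scikit-learn's code; split_nested_params and the component names tfidf and clf are hypothetical.

```python
from typing import Any, Dict, Tuple


def split_nested_params(
    params: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Dict[str, Any]]]:
    """Split parameter names on the first double underscore.

    Names without "__" belong to the estimator itself; names of the form
    <component>__<parameter> are grouped under their component.
    """
    own: Dict[str, Any] = {}
    nested: Dict[str, Dict[str, Any]] = {}
    for key, value in params.items():
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params(
    {"threshold": 0.9, "tfidf__ngram_range": (1, 2), "clf__C": 1.0}
)
```

Here own holds {"threshold": 0.9} and nested groups the remaining entries under "tfidf" and "clf". Because the split happens on the first "__" only, a deeper name such as a__b__c is handed to component a as b__c.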