stringcluster package

Submodules

stringcluster.base module

Module for de-duplicating arrays of strings.

class stringcluster.base.StringCluster(ngram_size: int = 2, threshold: float = 0.8, stop_tokens: str = '[\\W_]+')

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Transformer for de-duplicating an array-like container of strings.

ngram_size

Size of ngrams to use in TfidfVectorizer.

Type

int

threshold

Similarity threshold; only samples whose similarity score is above this value are flagged as similar.

Type

float

stop_tokens

RegEx pattern of stop tokens for use in TfidfVectorizer.

Type

re.Pattern

vec

Scikit-Learn TfidfVectorizer.

Type

TfidfVectorizer

similarity_

Array of

Type

np.ndarray

labels_

Type

np.ndarray
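
As a rough illustration of how ngram_size, stop_tokens, and threshold interact, the sketch below uses only the standard library. It is not the package's implementation: it assumes character n-grams and lowercasing, and it takes a precomputed similarity matrix rather than building one with TfidfVectorizer. The helper names char_ngrams and similar_pairs are hypothetical.

```python
import re


def char_ngrams(text, ngram_size=2, stop_tokens=r"[\W_]+"):
    """Strip stop tokens, then split into overlapping character n-grams.

    Assumption: lowercasing and character-level n-grams; the real
    transformer delegates tokenization to TfidfVectorizer.
    """
    cleaned = re.sub(stop_tokens, "", text.lower())
    return [cleaned[i:i + ngram_size]
            for i in range(len(cleaned) - ngram_size + 1)]


def similar_pairs(similarity, threshold=0.8):
    """Return index pairs (i, j), i < j, whose score is strictly above threshold."""
    n = len(similarity)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if similarity[i][j] > threshold]


print(char_ngrams("Acme, Inc."))
# ['ac', 'cm', 'me', 'ei', 'in', 'nc']

# Hypothetical pairwise similarity scores for three strings.
scores = [
    [1.00, 0.92, 0.10],
    [0.92, 1.00, 0.80],
    [0.10, 0.80, 1.00],
]
print(similar_pairs(scores))
# [(0, 1)]
```

In this sketch a score exactly equal to the threshold (0.80 here) is not flagged, which matches the "above this number" wording; whether the package itself uses a strict comparison is not stated.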

fit(X: Data, y: Optional[Data] = None)

Fit the transformer to data.

transform(X: Data, y: Optional[Data] = None)

Transform the data.

fit_transform(X: Data, y: Optional[Data] = None, **fit_params)

Fit and transform the data.

__init__(ngram_size: int = 2, threshold: float = 0.8, stop_tokens: str = '[\\W_]+')

Instantiate a StringCluster object.

Parameters
  • ngram_size (int) – Size of ngrams to use in TfidfVectorizer; default 2.

  • threshold (float) – Threshold to determine similarities; must be in the range [0, 1]; default 0.8.

  • stop_tokens (str) – RegEx pattern of stop tokens for use in TfidfVectorizer; default r'[\W_]+'.

fit(X: Union[List, pandas.core.series.Series, numpy.ndarray], y: Optional[Union[List, pandas.core.series.Series, numpy.ndarray]] = None) → stringcluster.base.StringCluster

Fit the transformer to data.

Parameters
  • X (Data) – Array-like object containing duplicated strings.

  • y (Optional[Data]) – Optional array-like object containing a ‘master list’ of values to which similar samples are mapped.

Returns

StringCluster – Self.

transform(X: Union[List, pandas.core.series.Series, numpy.ndarray], y: Optional[Union[List, pandas.core.series.Series, numpy.ndarray]] = None) → pandas.core.series.Series

Transform data.

Parameters
  • X (Data) – Array-like object containing duplicated strings.

  • y (Optional[Data]) – Optional array-like object containing a ‘master list’ of values to which similar samples are mapped.

Returns

pd.Series – Pandas Series of de-duplicated values.

fit_transform(X: Union[List, pandas.core.series.Series, numpy.ndarray], y: Optional[Union[List, pandas.core.series.Series, numpy.ndarray]] = None, **fit_params) → pandas.core.series.Series

Fit and transform the data.

Parameters
  • X (Data) – Array-like object containing duplicated strings.

  • y (Optional[Data]) – Optional array-like object containing a ‘master list’ of values to which similar samples are mapped.

  • fit_params – Optional keyword arguments; accepted for compatibility only.

Returns

pd.Series – Pandas Series of de-duplicated values.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params (dict) – Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self (estimator instance) – Estimator instance.
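
The <component>__<parameter> naming convention can be illustrated without scikit-learn. The sketch below only shows how such keys separate into top-level and per-component settings; it is not scikit-learn's implementation, and split_nested_params and the example keys are hypothetical.

```python
def split_nested_params(params):
    """Split keys of the form <component>__<parameter> from plain keys.

    Returns (top_level, nested), where nested maps each component name
    to the parameters addressed to it.
    """
    top_level, nested = {}, {}
    for key, value in params.items():
        # partition splits on the first "__" only, so deeper nesting
        # (a__b__c) stays attached to the sub-key for the component.
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(name, {})[sub_key] = value
        else:
            top_level[name] = value
    return top_level, nested


top, nested = split_nested_params({
    "threshold": 0.9,            # plain parameter of the estimator
    "vec__lowercase": False,     # hypothetical parameter of a 'vec' component
})
print(top)     # {'threshold': 0.9}
print(nested)  # {'vec': {'lowercase': False}}
```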

stringcluster.base.dedupe_companies()

Deduplicate a list of publicly traded companies.

Module contents