`chunker`

Argument chunker module

Module Contents

Classes

TopicModel

Topic modeling class.

Functions

`load_nlp_pipe`(model_name)	Download the required NLP pipeline if it is not already installed.
`get_chunk`(→ Tuple[List[int], List[str]])	Split documents of a given corpus into chunks.
`get_chunk_polarity_score`(chunks)	Compute polarity score of each chunk in the given list.
`get_chunk_topic`(chunks)	Get topic information and embedding vectors of chunks via topic modeling.
`get_chunk_rank`(arg_ids, embeds)	Compute the rank of chunks within each argument.
`get_chunk_table`(arg_ids, chunks, p_scores, topics, ranks)	Given all the measures of chunks, generate and return the chunk table as a pandas dataframe, with pre-defined column names.

chunker.load_nlp_pipe(model_name: str)[source]

Download the required NLP pipeline if it is not already installed.

Parameters:: model_name (str) – Name of the NLP pipeline. A full list of models is available at https://spacy.io/usage/models.
Returns:: The spaCy NLP model.
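
The download-if-missing behaviour can be sketched as a try-load, download, retry pattern. The helper below is hypothetical: `load_or_download` and its `load`/`download` parameters are not part of this module, and whether `load_nlp_pipe` is implemented this way is an assumption. The loader and downloader are passed in as arguments so that the sketch runs without spaCy installed.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_download(
    model_name: str,
    load: Callable[[str], T],
    download: Callable[[str], None],
) -> T:
    """Return the loaded pipeline, downloading it first if loading fails.

    Hypothetical helper, not the module's actual code.
    """
    try:
        return load(model_name)
    except OSError:
        # The model is not installed yet: fetch it once, then retry the load.
        download(model_name)
        return load(model_name)


# With spaCy installed, the call could look like this (left commented out so
# the sketch has no third-party dependency). spacy.load raises OSError when
# the model package cannot be found.
#
#   import spacy
#   from spacy.cli import download
#   nlp = load_or_download("en_core_web_sm", spacy.load, download)
```

Only a missing model triggers the download here; any other exception raised by the loader propagates unchanged.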

chunker.get_chunk(docs: List[str]) → Tuple[List[int], List[str]][source]

Split documents of a given corpus into chunks.

A chunk can be considered a meaningful clause, which may be only part of a sentence. For instance, the sentence “I like the color of this car but it’s too expensive.” is split into two chunks: “I like the color of this car” and “but it’s too expensive”. A dependency parser is used to do the splitting.

Parameters:: docs (List[str]) – The input corpus.
Returns:: A tuple of two parallel lists: the ids of the arguments that the chunks belong to (List[int]) and the chunk texts (List[str]).
Return type:: Tuple[List[int], List[str]]
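
To make the shape of the return value concrete, here is a minimal sketch that reproduces the example sentence above. It is not the module's implementation: the hypothetical `naive_get_chunk` replaces the dependency parser with a crude regex that splits before a few coordinating conjunctions, and it assumes that argument ids are zero-based positions in `docs`.

```python
import re
from typing import List, Tuple

# Split at whitespace that is directly followed by a coordinating conjunction.
# This regex is a crude stand-in for a dependency parser: it would also split
# noun phrases such as "salt and pepper", which a real parser would not.
_CLAUSE_BOUNDARY = re.compile(r"\s+(?=(?:but|and|or|so|yet)\b)", re.IGNORECASE)


def naive_get_chunk(docs: List[str]) -> Tuple[List[int], List[str]]:
    """Return parallel lists: the argument id of each chunk and the chunk text."""
    arg_ids: List[int] = []
    chunks: List[str] = []
    for arg_id, doc in enumerate(docs):  # assumption: id = position in docs
        for piece in _CLAUSE_BOUNDARY.split(doc):
            text = piece.strip().rstrip(".!?")
            if text:
                arg_ids.append(arg_id)
                chunks.append(text)
    return arg_ids, chunks


arg_ids, chunks = naive_get_chunk(
    ["I like the color of this car but it's too expensive.", "Great service."]
)
print(arg_ids)  # [0, 0, 1]
print(chunks)
# ['I like the color of this car', "but it's too expensive", 'Great service']
```

The two lists have the same length, so `arg_ids[i]` always identifies the argument that `chunks[i]` came from.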

chunker.get_chunk_polarity_score(chunks: List[str])[source]

Compute polarity score of each chunk in the given list.

The polarity score is a float in the range [-1.0, 1.0], where 0 is neutral, values above 0 are positive, and values below 0 are negative.

Parameters:: chunks (List[str]) – chunk list
Returns:: polarity scores of the given chunks
Return type:: List[float]
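
As an illustration of how the scores read, the sketch below maps each score to a coarse label according to its sign. `polarity_label` is a hypothetical helper for downstream use, not a function of this module.

```python
from typing import List


def polarity_label(score: float) -> str:
    """Map a polarity score in [-1.0, 1.0] to a coarse label (hypothetical helper)."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"polarity score out of range: {score}")
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


scores: List[float] = [0.6, 0.0, -0.25]
labels = [polarity_label(s) for s in scores]
print(labels)  # ['positive', 'neutral', 'negative']
```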

chunker.get_chunk_topic(chunks: List[str])[source]

Get topic information and embedding vectors of chunks via topic modeling.

Parameters:: chunks (List[str]) – chunk list.
Returns:: A tuple of three items: the topic ids of the chunks (List[int]), the embedding vectors of the chunks (np.ndarray), and the table of topic information (pd.DataFrame).
Return type:: Tuple[List[int], np.ndarray, pd.DataFrame]

chunker.get_chunk_rank(arg_ids: List[int], embeds: numpy.ndarray)[source]

Compute the rank of chunks within each argument.

Rank can be understood as the importance of a chunk. This function computes the relative importance of chunks within the argument they belong to. This is done by applying the PageRank algorithm, where the similarity between chunks is the cosine similarity of their embedding vectors.

Parameters:

arg_ids (List[int]) – ids of arguments that chunks belong to.
embeds (np.ndarray) – embedding vectors of chunks.

Returns:

rank of chunks

Return type:

List[float]
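
The ranking idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the module's code: the names `cosine`, `pagerank` and `sketch_chunk_rank` are hypothetical, the damping factor of 0.85 and the fixed iteration count are assumed, negative similarities are clipped to zero, and self-similarity is ignored. Nested lists are used instead of a NumPy array to keep the sketch dependency-free.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pagerank(weights: List[List[float]], damping: float = 0.85,
             iters: int = 100) -> List[float]:
    """Power iteration on a weighted graph; the returned scores sum to 1."""
    n = len(weights)
    rank = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iters):
        new_rank = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if out_sum[j] > 0:
                    incoming += rank[j] * weights[j][i] / out_sum[j]
                else:
                    # A node without outgoing weight spreads its rank evenly.
                    incoming += rank[j] / n
            new_rank.append((1.0 - damping) / n + damping * incoming)
        rank = new_rank
    return rank


def sketch_chunk_rank(arg_ids: List[int],
                      embeds: Sequence[Sequence[float]]) -> List[float]:
    """Rank each chunk relative to the other chunks of the same argument."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for idx, arg_id in enumerate(arg_ids):
        groups[arg_id].append(idx)
    ranks = [0.0] * len(arg_ids)
    for members in groups.values():
        # Edge weight = cosine similarity, clipped at 0, with no self-loops.
        weights = [[0.0 if i == j else max(0.0, cosine(embeds[i], embeds[j]))
                    for j in members] for i in members]
        for idx, score in zip(members, pagerank(weights)):
            ranks[idx] = score
    return ranks


# Argument 0 has three chunks; the middle one is similar to both neighbours.
ranks = sketch_chunk_rank([0, 0, 0, 1], [[1, 0], [1, 1], [0, 1], [1, 0]])
```

In this example the second chunk receives the highest rank within argument 0 because it is similar to both of the others, while the single chunk of argument 1 receives the whole rank mass of its argument.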

chunker.get_chunk_table(arg_ids: List[int], chunks: List[str], p_scores: List[float], topics: List[int], ranks: List[float])[source]

Given all the measures of chunks, generate and return the chunk table as a pandas dataframe, with pre-defined column names.

Parameters:

arg_ids (List[int]) – ids of arguments that chunks belong to
chunks (List[str]) – chunk text
p_scores (List[float]) – polarity score of chunks
topics (List[int]) – topic id of chunks
ranks (List[float]) – rank of chunks

Returns:

chunk table

Return type:

pd.DataFrame
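
The five inputs are parallel lists with one entry per chunk, so the table is essentially a column-wise assembly. The sketch below shows that step without pandas. The helper `build_chunk_columns` and the column names are hypothetical, since the pre-defined names are not listed here.

```python
from typing import Dict, List


def build_chunk_columns(
    arg_ids: List[int],
    chunks: List[str],
    p_scores: List[float],
    topics: List[int],
    ranks: List[float],
) -> Dict[str, list]:
    """Collect the per-chunk measures as named columns (hypothetical names)."""
    columns = {
        "arg_id": arg_ids,
        "chunk": chunks,
        "polarity_score": p_scores,
        "topic": topics,
        "rank": ranks,
    }
    # Every column must describe the same chunks, hence the same length.
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("all inputs must have one entry per chunk")
    return columns


columns = build_chunk_columns(
    [0, 0], ["I like the color", "but it's too expensive"],
    [0.5, -0.5], [1, 2], [0.6, 0.4],
)
# With pandas installed, pandas.DataFrame(columns) turns this mapping into a
# table with one row per chunk.
```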

class chunker.TopicModel[source]

Topic modeling class.

Functions are implemented based on the BERTopic model. For now, the topic model is set up with default parameters for its sub-models. Letting the user configure it further is planned as a next step.

_rd_model (UMAP): instance of the UMAP algorithm, used as the dimensionality reduction sub-model.

model (BERTopic): the topic model built from the predefined sub-models.

init_model(transformer: str = 'all-mpnet-base-v1', n_components: int = 5, min_cluster_size: int = 10, ngram_min: int = 1, ngram_max: int = 1)[source]

Initialize the topic model by indicating a number of arguments.

Parameters:

transformer (str, optional) – Name of the sentence embedding model. Defaults to “all-mpnet-base-v1”. A list of pretrained models can be found here: https://www.sbert.net/docs/pretrained_models.html.
n_components (int, optional) – Number of dimensions after reduction. Defaults to 5.
min_cluster_size (int, optional) – Minimum size of clusters for the clustering algorithm. Defaults to 10.
ngram_min (int, optional) – Lower bound of the n-gram range for topic representation. Defaults to 1.
ngram_max (int, optional) – Upper bound of the n-gram range for topic representation. Defaults to 1.
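
One plausible way these arguments map onto BERTopic sub-models is sketched below. It is an assumption, not the class's actual code: only UMAP is named above as a sub-model, so the use of HDBSCAN for clustering and scikit-learn's `CountVectorizer` for the n-gram range is inferred from BERTopic's usual setup, and the helper names `submodel_kwargs` and `build_topic_model` are hypothetical. The third-party imports sit inside `build_topic_model` so that the argument mapping can be inspected without those packages installed.

```python
from typing import Any, Dict


def submodel_kwargs(
    transformer: str = "all-mpnet-base-v1",
    n_components: int = 5,
    min_cluster_size: int = 10,
    ngram_min: int = 1,
    ngram_max: int = 1,
) -> Dict[str, Any]:
    """Map the init_model arguments onto settings for each sub-model."""
    if ngram_min > ngram_max:
        raise ValueError("ngram_min must not exceed ngram_max")
    return {
        "transformer": transformer,
        "umap": {"n_components": n_components},
        "hdbscan": {"min_cluster_size": min_cluster_size},
        "vectorizer": {"ngram_range": (ngram_min, ngram_max)},
    }


def build_topic_model(**kwargs: Any):
    """Assemble a BERTopic model from the settings above.

    Requires bertopic, umap-learn, hdbscan, scikit-learn and
    sentence-transformers; the choice of sub-models is an assumption.
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    cfg = submodel_kwargs(**kwargs)
    return BERTopic(
        embedding_model=SentenceTransformer(cfg["transformer"]),
        umap_model=UMAP(**cfg["umap"]),
        hdbscan_model=HDBSCAN(**cfg["hdbscan"]),
        vectorizer_model=CountVectorizer(**cfg["vectorizer"]),
    )


cfg = submodel_kwargs(ngram_max=2)
print(cfg["vectorizer"])  # {'ngram_range': (1, 2)}
```

Setting `ngram_max=2` lets topic representations contain two-word phrases in addition to single words.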

fit_transform_reduced(docs: List[str]) → List[int][source]

Further reduce outliers from the result of the fit_transform function.

Note that BERTopic is a clustering approach, which means that it does not work if there is nothing to cluster. Keep in mind that the input corpus should contain at least 1000 documents to get meaningful results; see this thread: https://github.com/MaartenGr/BERTopic/issues/59#issuecomment-775718747.

Parameters:: docs (List[str]) – The input corpus.
Returns:: Topics of the input docs.
Return type:: List[int]

get_topic_table() → pandas.DataFrame[source]

Get the table of topic information and return it as a pandas dataframe.

Returns:: The topic table.
Return type:: pd.DataFrame

get_doc_embeds() → numpy.ndarray[source]

Get the embeddings of the docs.

Returns:: Embeddings of the docs, with shape (n_doc, n_components).
Return type:: np.ndarray
