chunker
Argument chunker module
Module Contents
Classes
- TopicModel: Topic modeling class.
Functions
- load_nlp_pipe: Download the required spaCy NLP pipeline if it is not already available.
- get_chunk: Split documents of a given corpus into chunks.
- get_chunk_polarity_score: Compute the polarity score of each chunk in the given list.
- get_chunk_topic: Get topic information and embedding vectors of chunks via topic modeling.
- get_chunk_rank: Compute the rank of each chunk within its argument.
- get_chunk_table: Given all the measures of chunks, generate and return the chunk table as a pandas DataFrame with predefined column names.
- chunker.load_nlp_pipe(model_name: str)
Download the required spaCy NLP pipeline if it is not already available.
- Parameters:
model_name (str) – Name of the spaCy pipeline. A full list of models is available at https://spacy.io/usage/models.
- Returns:
The spacy nlp model.
- chunker.get_chunk(docs: List[str]) → Tuple[List[int], List[str]]
Split documents of a given corpus into chunks.
A chunk can be considered a meaningful clause, which may be only part of a sentence. For instance, the sentence “I like the color of this car but it’s too expensive.” is split into two chunks: “I like the color of this car” and “but it’s too expensive”. A dependency parser is used to do the splitting.
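To make the return shape concrete, here is a minimal sketch that is not the module's implementation. It swaps the dependency parser for a naive regular-expression split on a few clause markers, and it assumes that the first returned list holds, for each chunk, the index of the document (argument) it came from. The function name `naive_get_chunk` and the marker list are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical clause markers for this sketch; the real module relies on a
# dependency parser rather than a fixed word list.
_MARKERS = ("but", "because", "although", "however", "while")
_CLAUSE_RE = re.compile(r"\s+(?=(?:%s)\b)" % "|".join(_MARKERS), re.IGNORECASE)
_SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def naive_get_chunk(docs: List[str]) -> Tuple[List[int], List[str]]:
    """Split each document into clause-like chunks.

    Returns (arg_ids, chunks), where arg_ids[i] is the index of the document
    that chunks[i] was taken from (an assumption about the real return value).
    """
    arg_ids: List[int] = []
    chunks: List[str] = []
    for arg_id, doc in enumerate(docs):
        for sentence in _SENTENCE_RE.split(doc.strip()):
            # Split in front of a clause marker so the marker stays in its chunk.
            for piece in _CLAUSE_RE.split(sentence):
                piece = piece.strip().rstrip(".!?")
                if piece:
                    arg_ids.append(arg_id)
                    chunks.append(piece)
    return arg_ids, chunks
```

Calling `naive_get_chunk(["I like the color of this car but it's too expensive."])` yields two chunks that both carry argument id 0. A word-list split like this misfires on many sentences, which is the reason a dependency parser is the better tool for the job.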
- chunker.get_chunk_polarity_score(chunks: List[str])
Compute polarity score of each chunk in the given list.
The polarity score is a float in the range [-1.0, 1.0], where 0 means neutral, positive values indicate positive polarity, and negative values indicate negative polarity.
- chunker.get_chunk_topic(chunks: List[str])
Get topic information and embedding vectors of chunks via topic modeling.
- chunker.get_chunk_rank(arg_ids: List[int], embeds: numpy.ndarray)
Compute the rank of each chunk within its argument.
Rank can be understood as the importance of a chunk. This function computes the relative importance of chunks within the arguments they belong to by applying the PageRank algorithm, where the similarity between two chunks is the cosine similarity of their embedding vectors.
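The ranking idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the module's code: the helper names, the damping factor of 0.85, the clipping of negative similarities to zero, and the use of nested lists instead of a numpy array are all choices made for this sketch.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pagerank(weights: List[List[float]], damping: float = 0.85,
             max_iter: int = 200, tol: float = 1e-10) -> List[float]:
    """Weighted PageRank by power iteration; the scores sum to 1."""
    n = len(weights)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if out_sums[j] > 0:
                    incoming += scores[j] * weights[j][i] / out_sums[j]
                else:
                    # A node without outgoing weight spreads its score evenly.
                    incoming += scores[j] / n
            new_scores.append((1.0 - damping) / n + damping * incoming)
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


def rank_chunks(arg_ids: Sequence[int],
                embeds: Sequence[Sequence[float]]) -> List[float]:
    """Rank every chunk relative to the other chunks of the same argument."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for idx, arg_id in enumerate(arg_ids):
        groups[arg_id].append(idx)
    ranks = [0.0] * len(arg_ids)
    for members in groups.values():
        # Similarity graph of one argument; no self-loops, negatives clipped.
        weights = [[0.0 if i == j else max(cosine(embeds[i], embeds[j]), 0.0)
                    for j in members] for i in members]
        for idx, score in zip(members, pagerank(weights)):
            ranks[idx] = score
    return ranks
```

Because PageRank runs separately for each argument, ranks are comparable only among chunks that share an argument id, and an argument with a single chunk gives that chunk a rank of 1.0.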
- chunker.get_chunk_table(arg_ids: List[int], chunks: List[str], p_scores: List[float], topics: List[int], ranks: List[float])
Given all the measures of chunks, generate and return the chunk table as a pandas dataframe, with pre-defined column names.
- class chunker.TopicModel
Topic modeling class.
Functions are implemented on top of the BERTopic model. For now, the topic model is set up with default parameters for its sub-models. Letting the user configure it further is planned as a next step.
- _rd_model (UMAP): instance of the UMAP algorithm, used as the dimensionality reduction sub-model.
- model (BERTopic): the topic model built from the predefined sub-models.
- init_model(transformer: str = 'all-mpnet-base-v1', n_components: int = 5, min_cluster_size: int = 10, ngram_min: int = 1, ngram_max: int = 1)
Initialize the topic model with the given arguments.
- Parameters:
transformer (str, optional) – Name of the sentence embedding model. Defaults to “all-mpnet-base-v1”. A list of pretrained models can be found here: https://www.sbert.net/docs/pretrained_models.html.
n_components (int, optional) – Number of dimensions after reduction. Defaults to 5.
min_cluster_size (int, optional) – Minimum size of clusters for the clustering algorithm. Defaults to 10.
ngram_min (int, optional) – Lower bound of the n-gram range for topic representation. Defaults to 1.
ngram_max (int, optional) – Upper bound of the n-gram range for topic representation. Defaults to 1.
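As a rough sketch of how these parameters could map onto BERTopic sub-models, the snippet below groups them by sub-model and then assembles a BERTopic instance. Only UMAP and BERTopic are named in this documentation; the use of HDBSCAN for clustering and of scikit-learn's CountVectorizer for the n-gram range are assumptions (they are BERTopic's usual building blocks), and the helper names `topic_model_kwargs` and `build_topic_model` are hypothetical. The defaults mirror the init_model signature.

```python
from typing import Any, Dict


def topic_model_kwargs(n_components: int = 5, min_cluster_size: int = 10,
                       ngram_min: int = 1, ngram_max: int = 1
                       ) -> Dict[str, Dict[str, Any]]:
    """Group the init_model arguments by the sub-model they would configure."""
    if ngram_min > ngram_max:
        raise ValueError("ngram_min must not exceed ngram_max")
    return {
        "umap": {"n_components": n_components},
        # Assumption: HDBSCAN is the clustering sub-model.
        "hdbscan": {"min_cluster_size": min_cluster_size},
        # Assumption: a CountVectorizer provides the topic representation.
        "vectorizer": {"ngram_range": (ngram_min, ngram_max)},
    }


def build_topic_model(transformer: str = "all-mpnet-base-v1", **params: int):
    """Assemble a BERTopic model; heavy imports are deferred until called."""
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    kwargs = topic_model_kwargs(**params)
    return BERTopic(
        embedding_model=SentenceTransformer(transformer),
        umap_model=UMAP(**kwargs["umap"]),
        hdbscan_model=HDBSCAN(**kwargs["hdbscan"]),
        vectorizer_model=CountVectorizer(**kwargs["vectorizer"]),
    )
```

Keeping the parameter grouping separate from model construction means the mapping can be inspected without installing the embedding and clustering libraries. Calling `build_topic_model(min_cluster_size=15)` requires those packages and downloads the sentence embedding model on first use.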
- fit_transform_reduced(docs: List[str]) → List[int]
Further reduce outliers from the result of the fit_transform function.
Note that BERTopic is a clustering approach, which means that it does not work if there is nothing to cluster. Keep in mind that the input corpus should contain at least 1000 documents to get meaningful results; see this thread: https://github.com/MaartenGr/BERTopic/issues/59#issuecomment-775718747.
- get_topic_table() → pandas.DataFrame
Get the table of topic information and return it as a pandas dataframe.
- Returns:
The topic table.
- Return type:
pd.DataFrame
- get_doc_embeds() → numpy.ndarray
Get the embeddings of the docs.
- Returns:
Embeddings of the docs, with shape (n_doc, n_components).
- Return type:
np.ndarray