dialogy.base.entity_extractor package

Module contents

class EntityScoringMixin

Bases: object

Mixin for scoring and aggregation of entities over a set of transcripts.

aggregate_entities(entity_type_value_group, input_size)

Reduce entities sharing the same type and value.

Entities with the same type and value are considered identical even if their other metadata differs. Such entities form a group.
We track the transcript indices for every entity in a group.
We select the minimum of these indices, because the 0th transcript has the highest confidence.
We pick one entity per group, set its index to that minimum, and aggregate the score over the group.
The entity picked is added to a list of aggregates.

The above is repeated for every group.

Parameters: entity_type_value_group (Dict[Tuple[str, Any], List[BaseEntity]]) – A data structure that groups entities by type and value.
Returns: A list of de-duplicated entities.
Return type: List[BaseEntity]
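
The reduction described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the `Entity` dataclass is a hypothetical stand-in for BaseEntity, the `alternative_index` field name is an assumption, and the score is assumed to be the fraction of transcripts in which the entity appears.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Entity:
    """Hypothetical stand-in for BaseEntity, for illustration only."""
    type: str
    value: Any
    alternative_index: Optional[int] = None  # index of the source transcript
    score: Optional[float] = None


def aggregate(
    groups: Dict[Tuple[str, Any], List[Entity]], input_size: int
) -> List[Entity]:
    aggregates = []
    for entities in groups.values():
        # Distinct transcript indices in which this (type, value) pair occurs.
        indices = {
            e.alternative_index for e in entities if e.alternative_index is not None
        }
        # One representative per group, moved to the lowest transcript index.
        # Assumed scoring rule: share of transcripts containing the entity.
        representative = replace(
            entities[0],
            alternative_index=min(indices) if indices else None,
            score=len(indices) / input_size,
        )
        aggregates.append(representative)
    return aggregates


groups = {
    ("number", 3): [Entity("number", 3, 2), Entity("number", 3, 0)],
    ("city", "Pune"): [Entity("city", "Pune", 1)],
}
result = aggregate(groups, input_size=4)
```

Here the two `number` entities collapse into one with index 0 and score 0.5, while the single `city` entity keeps index 1 and gets score 0.25.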

apply_filters(entities)

Filter out entities whose score is less than the threshold.

Parameters: entities (List[BaseEntity]) – A list of entities.
Returns: A list of entities, at most as long as the input entities.
Return type: List[BaseEntity]

entity_consensus(entities, input_size)

Combine entities by type and value.

This issue (https://github.com/Vernacular-ai/dialogy/issues/52) points at the problem where we can return multiple identical entities, depending on the number of transcripts that contain the same body.

Parameters: entities (List[BaseEntity]) – A list of entities which may have duplicates.
Returns: A list of entities scored and unique by type and value.
Return type: List[BaseEntity]
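
Combining by type and value starts with a grouping step, which might look like the sketch below. The `Entity` tuple is a hypothetical stand-in for BaseEntity, and the sketch assumes entity values are hashable so that they can be used in dictionary keys.

```python
from collections import defaultdict
from typing import Any, Dict, List, NamedTuple, Tuple


class Entity(NamedTuple):
    """Hypothetical stand-in for BaseEntity, for illustration only."""
    type: str
    value: Any
    alternative_index: int  # index of the transcript the entity came from


def group_by_type_and_value(
    entities: List[Entity],
) -> Dict[Tuple[str, Any], List[Entity]]:
    groups: Dict[Tuple[str, Any], List[Entity]] = defaultdict(list)
    for entity in entities:
        # Entities that agree on (type, value) land in the same group,
        # regardless of which transcript produced them.
        groups[(entity.type, entity.value)].append(entity)
    return dict(groups)


entities = [
    Entity("number", 3, 0),
    Entity("number", 3, 1),  # same entity found again in another transcript
    Entity("number", 4, 2),
]
grouped = group_by_type_and_value(entities)
```

A dictionary of this shape matches the type documented for aggregate_entities, so each group can then be reduced to a single scored entity.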

static make_transform_values(transcript)

Make transcripts from a string/json-string.

remove_low_scoring_entities(entities)

Remove entities with a lower score than the threshold.

This doesn’t apply to entities with score=None.
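
A minimal sketch of this filtering rule, using a hypothetical `Entity` stand-in rather than BaseEntity. Treating a threshold of None as "no filtering" is an assumption here, not something the documentation states.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Entity:
    """Hypothetical stand-in for BaseEntity, for illustration only."""
    type: str
    value: Any
    score: Optional[float] = None


def remove_low_scoring(
    entities: List[Entity], threshold: Optional[float]
) -> List[Entity]:
    # Assumption: without a threshold there is nothing to compare against.
    if threshold is None:
        return list(entities)
    # Unscored entities (score=None) are kept; scored ones must reach the threshold.
    return [e for e in entities if e.score is None or e.score >= threshold]


entities = [
    Entity("number", 3, score=0.75),
    Entity("number", 4, score=0.25),
    Entity("city", "Pune", score=None),
]
kept = remove_low_scoring(entities, threshold=0.5)
```

With a threshold of 0.5, the entity scored 0.25 is dropped, while the entity scored 0.75 and the unscored entity are retained.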

threshold: Optional[float] = None

Value to compare against an entity’s score.

entity_scoring(presence, input_size)