
Topic Exploration of a Textual Dataset

Bunkatopics is a package designed for Topic Exploration.

Discover different examples using our Google Colab Notebooks

Theme | Google Colab Link
Visual Topic Modeling with Bunka and datasets from HuggingFace | Open In Colab

Installation via Pip

pip install bunkatopics

Installation via Git Clone

git clone https://github.com/charlesdedampierre/BunkaTopics.git
cd BunkaTopics
pip install -e .

Quick Start

Uploading Sample Data

To get started, let's load a sample of Medium articles into Bunkatopics:

from datasets import load_dataset
docs = load_dataset("bunkalab/medium-sample-technology")["train"]["title"] # 'docs' is a list of text [text1, text2, ..., textN]
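
Before fitting, it can help to sanity-check the list of texts. The helper below is a minimal sketch and not part of Bunkatopics; the name `clean_docs` is hypothetical. It drops non-string entries, blank strings, and exact duplicates while keeping the original order.

```python
def clean_docs(docs):
    """Return docs without non-strings, blanks, or exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for doc in docs:
        if not isinstance(doc, str):
            continue  # skip None or other non-text values
        text = doc.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


sample = ["AI in 2024", "  AI in 2024 ", "", None, "Intro to Kubernetes"]
print(clean_docs(sample))
```

The cleaned list could then be passed to `bunka.fit` in place of the raw titles.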

Choose Your Embedding Model

Bunkatopics integrates with Hugging Face's extensive collection of embedding models through LangChain. You can select from a wide range of models, but be mindful of their size: larger models take longer to download and to embed your documents. Please refer to the LangChain documentation for details on available models.

from bunkatopics import Bunka
from langchain_community.embeddings import HuggingFaceEmbeddings

# Choose your embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # optionally pass multi_process=True to encode on several GPUs

# Initialize Bunka with your chosen model
bunka = Bunka(embedding_model=embedding_model)

# Fit Bunka to your text data
bunka.fit(docs)

You can also use other embedding models, such as OpenAI's, thanks to the LangChain integration:

from langchain_openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(openai_api_key='OPEN_AI_KEY')
bunka = Bunka(embedding_model=embedding_model)
bunka.fit(docs)  # fit again, since this is a new Bunka instance

Once Bunka is fitted, get a list of topics:

bunka.get_topics(n_clusters=15, name_length=3)  # name_length sets the number of terms used to describe each topic

Topics are described by the most specific terms belonging to the cluster.

topic_id | topic_name | size | percent
bt-12 | technology - Tech - Children - student - days | 322 | 10.73
bt-11 | blockchain - Cryptocurrency - sense - Cryptocurrencies - Impact | 283 | 9.43
bt-7 | gadgets - phone - Device - specifications - screen | 258 | 8.6
bt-8 | software - Kubernetes - ETL - REST - Salesforce | 258 | 8.6
bt-1 | hackathon - review - Recap - Predictions - Lessons | 257 | 8.57
bt-4 | Reality - world - cities - future - Lot | 246 | 8.2
bt-14 | Product - Sales - day - dream - routine | 241 | 8.03
bt-0 | Words - Robots - discount - NordVPN - humans | 208 | 6.93
bt-2 | Internet - Overview - security - Work - Development | 202 | 6.73
bt-13 | Course - Difference - Step - science - Point | 192 | 6.4
bt-6 | quantum - Cars - Way - Game - quest | 162 | 5.4
bt-3 | Objects - Strings - app - Programming - Functions | 119 | 3.97
bt-5 | supply - chain - revolution - Risk - community | 119 | 3.97
bt-9 | COVID - printing - Car - work - app | 89 | 2.97
bt-10 | Episode - HD - Secrets - TV | 44 | 1.47
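
To give an intuition for what "most specific terms" can mean, here is a toy sketch that scores each term by how concentrated it is in one cluster compared with the whole corpus. This is an illustrative assumption, not Bunkatopics' actual scoring; the function name `specific_terms` and the tiny corpus are made up for the example.

```python
from collections import Counter


def specific_terms(clusters, cluster_id, top_n=3):
    """Rank terms of one cluster by their share of corpus-wide occurrences."""
    overall = Counter()
    per_cluster = {}
    for cid, docs in clusters.items():
        counts = Counter(word.lower() for doc in docs for word in doc.split())
        per_cluster[cid] = counts
        overall.update(counts)
    target = per_cluster[cluster_id]
    # Specificity = occurrences in this cluster / occurrences everywhere.
    scored = {term: count / overall[term] for term, count in target.items()}
    # Sort by specificity, then by in-cluster count, then alphabetically.
    ranked = sorted(scored, key=lambda t: (-scored[t], -target[t], t))
    return ranked[:top_n]


toy_clusters = {
    "crypto": ["bitcoin price news", "bitcoin blockchain guide"],
    "devices": ["phone screen guide", "phone price review"],
}
print(specific_terms(toy_clusters, "crypto", top_n=2))
```

Terms shared across clusters, such as "price" or "guide" in the toy data, score lower than terms that occur in only one cluster.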

Visualize Your Topics

Finally, let's visualize the topics that Bunka has computed for your text data:

bunka.visualize_topics(width=800, height=800, colorscale='YlGnBu')

Topic Modeling with GenAI Summarization of Topics

Explore the power of Generative AI for summarizing topics! We use Mistral AI's Mistral-7B-v0.1 model from the Hugging Face Hub through the LangChain framework.

from langchain_community.llms import HuggingFaceHub

# Define the repository ID for Mistral-7B-v0.1
repo_id = 'mistralai/Mistral-7B-v0.1'

# Use Mistral AI to summarize the topics
llm = HuggingFaceHub(repo_id=repo_id, huggingfacehub_api_token="HF_TOKEN")

# Obtain clean topic names using the generative model
bunka.get_clean_topic_name(llm=llm)
bunka.visualize_topics(width=800, height=800, colorscale='Portland')
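
The snippets in this guide pass API tokens as string literals such as "HF_TOKEN" and 'OPEN_AI_KEY'. A common alternative is to read them from environment variables so they never end up in a notebook or repository. The sketch below assumes you have exported a variable yourself; the helper name `read_secret` is hypothetical and not part of Bunkatopics or LangChain.

```python
import os


def read_secret(name):
    """Fetch a secret from the environment, failing loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Example usage (assumes HF_TOKEN was exported in your shell beforehand):
# llm = HuggingFaceHub(repo_id=repo_id, huggingfacehub_api_token=read_secret("HF_TOKEN"))
```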

You can also use a model from OpenAI thanks to the LangChain integration:

from langchain_openai import OpenAI

llm = OpenAI(openai_api_key = 'OPEN_AI_KEY')
bunka.get_clean_topic_name(llm=llm)

Finally, let's visualize the topics again. You can choose from different colorscales through the colorscale argument.

bunka.visualize_topics(width=800, height=800)

We can now access the newly named topics:

>>> bunka.df_topics_
topic_id | topic_name | size | percent
bt-1 | Cryptocurrency Impact | 345 | 12.32
bt-3 | Data Management Technologies | 243 | 8.68
bt-14 | Everyday Life | 230 | 8.21
bt-0 | Digital Learning Campaign | 225 | 8.04
bt-12 | Business Development | 223 | 7.96
bt-2 | Technology Devices | 212 | 7.57
bt-10 | Market Predictions Recap | 201 | 7.18
bt-4 | Comprehensive Learning Journey | 187 | 6.68
bt-6 | Future of Work | 185 | 6.61
bt-11 | Internet Discounts | 175 | 6.25
bt-5 | Technological Urban Water Management | 172 | 6.14
bt-9 | Electric Vehicle Technology | 145 | 5.18
bt-8 | Programming Concepts | 116 | 4.14
bt-13 | Quantum Technology Industries | 105 | 3.75
bt-7 | High Definition Television (HDTV) | 36 | 1.29
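
The percent column appears to be each topic's share of all documents, rounded to two decimals: the sizes in the table above sum to 2800, and 345 / 2800 is about 12.32%. This reading is inferred from the numbers, not from the library's source. A small sketch that reproduces the column from the sizes (the function name `add_percent` is hypothetical):

```python
def add_percent(sizes):
    """Map each topic to its share of the total size, in percent (2 decimals)."""
    total = sum(sizes.values())
    return {topic: round(100 * size / total, 2) for topic, size in sizes.items()}


# Sizes copied from the table above.
sizes = {
    "bt-1": 345, "bt-3": 243, "bt-14": 230, "bt-0": 225, "bt-12": 223,
    "bt-2": 212, "bt-10": 201, "bt-4": 187, "bt-6": 185, "bt-11": 175,
    "bt-5": 172, "bt-9": 145, "bt-8": 116, "bt-13": 105, "bt-7": 36,
}
percent = add_percent(sizes)
print(percent["bt-1"], percent["bt-7"])
```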

Manually Cleaning the Topics

Not happy with the topics yet? You can change them manually. Click on Apply changes when you are done. In the example, we renamed the topic Cryptocurrency Impact to Cryptocurrency and Internet Discounts to Advertising.

The new topics will also appear on the Map.

bunka.manually_clean_topics()

Exploring Topics on a React Front-end

Start the server to run the React application:

bunka.start_server() # A server will start on your computer at http://localhost:3000/
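
If the page does not load, another process may already be using port 3000. The check below is a generic sketch using only the standard library; it is not part of Bunkatopics, and the helper name `port_in_use` is hypothetical.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds.
        return sock.connect_ex((host, port)) == 0


if port_in_use(3000):
    print("Port 3000 is busy; stop the other process before starting the server.")
else:
    print("Port 3000 looks free.")
```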

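Using Other LLMs for Summarizing Topics

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain_community.llms import HuggingFacePipeline

# Load a local model and tokenizer.
# The model ID is an assumption for illustration; substitute any text-generation model.
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Build a Hugging Face text-generation pipeline around the model
text_generation_pipeline = transformers.pipeline(
   model=model,
   tokenizer=tokenizer,
   task="text-generation",
   do_sample=True,  # required for temperature to take effect
   temperature=0.2,
   repetition_penalty=1.1,
   return_full_text=True,
   max_new_tokens=300,
)

# Wrap the pipeline so LangChain (and Bunka) can call it as an LLM
mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Obtain clean topic names using the generative model
bunka.get_clean_topic_name(llm=mistral_llm)
bunka.visualize_topics(width=800, height=800, colorscale='Portland')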