Topic Exploration of a Textual Dataset

Bunkatopics is a package designed for Topic Exploration.

Discover different examples using our Google Colab Notebooks

Theme	Google Colab Link
Visual Topic Modeling with Bunka and datasets from HuggingFace

Installation via Pip

pip install bunkatopics

Installation via Git Clone

git clone https://github.com/charlesdedampierre/BunkaTopics.git
cd BunkaTopics
pip install -e .

Quick Start

Uploading Sample Data

To get started, let's upload a sample of Medium Articles into Bunkatopics:

from datasets import load_dataset
docs = load_dataset("bunkalab/medium-sample-technology")["train"]["title"] # 'docs' is a list of text [text1, text2, ..., textN]
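
Before fitting, it can help to check that `docs` really is a clean list of strings. The helper below is a hypothetical sketch and not part of Bunkatopics: `clean_docs` is an illustrative name, and dropping empty and duplicate titles is an assumption about what is useful for this dataset.

```python
from typing import Iterable, List


def clean_docs(raw_docs: Iterable[object]) -> List[str]:
    """Return stripped, non-empty, de-duplicated strings in first-seen order.

    Hypothetical helper for illustration; not part of the Bunkatopics API.
    """
    seen = set()
    cleaned = []
    for doc in raw_docs:
        if not isinstance(doc, str):
            continue  # skip missing values such as None
        text = doc.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


sample = ["  Intro to Kubernetes ", "", None, "Intro to Kubernetes", "Why Rust?"]
print(clean_docs(sample))
```

With the dataset above, `docs = clean_docs(docs)` would run just before `bunka.fit(docs)`.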

Choose Your Embedding Model

Bunkatopics integrates with Hugging Face's extensive collection of embedding models through langchain. You can select from a wide range of models, but be mindful of their size: larger models take longer to download and to embed your documents. Please refer to the langchain documentation for details on available models.

from bunkatopics import Bunka
from langchain_community.embeddings import HuggingFaceEmbeddings

# Choose your embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    multi_process=False,  # set to True to encode with multiple processes
)

# Initialize Bunka with your chosen model
bunka = Bunka(embedding_model=embedding_model)

# Fit Bunka to your text data
bunka.fit(docs)

You can also use other embedding models, such as OpenAI's, thanks to the langchain integration:

from langchain_openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(openai_api_key='OPEN_AI_KEY')
bunka = Bunka(embedding_model=embedding_model)

# A new Bunka instance must be fitted before topics can be computed
bunka.fit(docs)

# Get a list of topics
bunka.get_topics(n_clusters=15, name_length=5)  # name_length sets the number of terms used to describe each topic

Topics are described by the most specific terms belonging to the cluster.

topic_id	topic_name	size	percent
bt-12	technology - Tech - Children - student - days	322	10.73
bt-11	blockchain - Cryptocurrency - sense - Cryptocurrencies - Impact	283	9.43
bt-7	gadgets - phone - Device - specifications - screen	258	8.6
bt-8	software - Kubernetes - ETL - REST - Salesforce	258	8.6
bt-1	hackathon - review - Recap - Predictions - Lessons	257	8.57
bt-4	Reality - world - cities - future - Lot	246	8.2
bt-14	Product - Sales - day - dream - routine	241	8.03
bt-0	Words - Robots - discount - NordVPN - humans	208	6.93
bt-2	Internet - Overview - security - Work - Development	202	6.73
bt-13	Course - Difference - Step - science - Point	192	6.4
bt-6	quantum - Cars - Way - Game - quest	162	5.4
bt-3	Objects - Strings - app - Programming - Functions	119	3.97
bt-5	supply - chain - revolution - Risk - community	119	3.97
bt-9	COVID - printing - Car - work - app	89	2.97
bt-10	Episode - HD - Secrets - TV	44	1.47
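
In the table above, `percent` appears to be each topic's share of all documents: the sizes sum to 3,000, and 322 out of 3,000 is 10.73%. The sketch below re-derives the column in plain Python. The dictionary is copied from the table, and `topic_share` is an illustrative name rather than a Bunkatopics function.

```python
# Topic sizes copied from the table above (topic_id -> number of documents).
sizes = {
    "bt-12": 322, "bt-11": 283, "bt-7": 258, "bt-8": 258, "bt-1": 257,
    "bt-4": 246, "bt-14": 241, "bt-0": 208, "bt-2": 202, "bt-13": 192,
    "bt-6": 162, "bt-3": 119, "bt-5": 119, "bt-9": 89, "bt-10": 44,
}


def topic_share(sizes: dict) -> dict:
    """Return each topic's share of all documents, in percent (2 decimals)."""
    total = sum(sizes.values())
    return {topic: round(100 * n / total, 2) for topic, n in sizes.items()}


shares = topic_share(sizes)
print(sum(sizes.values()))  # 3000
print(shares["bt-12"])      # 10.73
print(shares["bt-10"])      # 1.47
```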

Visualize Your Topics

Finally, let's visualize the topics that Bunka has computed for your text data:

bunka.visualize_topics(width=800, height=800, colorscale='YlGnBu')

Topic Modeling with GenAI Summarization of Topics

Explore the power of Generative AI for summarizing topics! We use the Mistral 7B model from Mistral AI (mistralai/Mistral-7B-v0.1), hosted on the Hugging Face Hub, through the langchain framework.

from langchain_community.llms import HuggingFaceHub

# Define the repository ID for Mistral-7B-v0.1
repo_id = 'mistralai/Mistral-7B-v0.1'

# Use Mistral AI to summarize the topics
llm = HuggingFaceHub(repo_id=repo_id, huggingfacehub_api_token="HF_TOKEN")

# Obtain clean topic names using the generative model
bunka.get_clean_topic_name(llm=llm)
bunka.visualize_topics(width=800, height=800, colorscale='Portland')

You can also use a model from OpenAI thanks to the langchain integration:

from langchain_openai import OpenAI

llm = OpenAI(openai_api_key='OPEN_AI_KEY')
bunka.get_clean_topic_name(llm=llm)
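
The strings 'HF_TOKEN' and 'OPEN_AI_KEY' in the snippets above are placeholders for real credentials. One common pattern, sketched here as a suggestion rather than anything Bunkatopics requires, is to read them from environment variables so they never end up in a notebook. The helper name `read_secret` and the variable names `HF_TOKEN` and `OPENAI_API_KEY` are assumptions.

```python
import os


def read_secret(name: str) -> str:
    """Return the value of environment variable `name`, or fail loudly."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return value


# Usage (assumed variable names; uncomment once they are set in your shell):
# hf_token = read_secret("HF_TOKEN")
# openai_key = read_secret("OPENAI_API_KEY")
```

The returned values can then be passed as `huggingfacehub_api_token` or `openai_api_key` in place of the placeholder strings.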

Finally, let's visualize the topics again. We can choose from different colorscales.

bunka.visualize_topics(width=800, height=800)

We can now access the newly named topics:

>>> bunka.df_topics_

topic_id	topic_name	size	percent
bt-1	Cryptocurrency Impact	345	12.32
bt-3	Data Management Technologies	243	8.68
bt-14	Everyday Life	230	8.21
bt-0	Digital Learning Campaign	225	8.04
bt-12	Business Development	223	7.96
bt-2	Technology Devices	212	7.57
bt-10	Market Predictions Recap	201	7.18
bt-4	Comprehensive Learning Journey	187	6.68
bt-6	Future of Work	185	6.61
bt-11	Internet Discounts	175	6.25
bt-5	Technological Urban Water Management	172	6.14
bt-9	Electric Vehicle Technology	145	5.18
bt-8	Programming Concepts	116	4.14
bt-13	Quantum Technology Industries	105	3.75
bt-7	High Definition Television (HDTV)	36	1.29

Manually Cleaning the Topics

Not happy with the topics yet? Let's change them manually. Click on Apply changes when you are done. In the example, we changed the topic Cryptocurrency Impact to Cryptocurrency and Internet Discounts to Advertising.

The new topics will also appear on the Map.

bunka.manually_clean_topics()
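
The same renames can be pictured as a simple mapping from old to new topic names. The sketch below applies the two changes from the example to a small stand-in for the topics table. It is plain Python for illustration only: `rename_topics` is a hypothetical helper, and it neither modifies the `bunka` object nor updates the map.

```python
# Three rows copied from the df_topics_ table above, as plain dictionaries.
topics = [
    {"topic_id": "bt-1", "topic_name": "Cryptocurrency Impact"},
    {"topic_id": "bt-11", "topic_name": "Internet Discounts"},
    {"topic_id": "bt-8", "topic_name": "Programming Concepts"},
]

# The manual changes described in the example.
renames = {
    "Cryptocurrency Impact": "Cryptocurrency",
    "Internet Discounts": "Advertising",
}


def rename_topics(rows, mapping):
    """Return copies of `rows` with topic names replaced where a mapping exists."""
    return [
        {**row, "topic_name": mapping.get(row["topic_name"], row["topic_name"])}
        for row in rows
    ]


for row in rename_topics(topics, renames):
    print(row["topic_id"], row["topic_name"])
```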

Exploring Topics on a React Front-end

Start the server to run the React application:

bunka.start_server() # A server will open on your computer at http://localhost:3000/

Using Other LLMs for Summarizing Titles

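import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain_community.llms import HuggingFacePipeline

# Assumption: reuse the Mistral checkpoint referenced earlier; any causal LM from the Hub should work
model_id = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Build a local text-generation pipeline around the model
text_generation_pipeline = transformers.pipeline(
   model=model,
   tokenizer=tokenizer,
   task="text-generation",
   temperature=0.2,
   repetition_penalty=1.1,
   return_full_text=True,
   max_new_tokens=300,
)

# Wrap the pipeline so that langchain (and Bunka) can call it as an LLM
mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
# Obtain clean topic names using the generative model
bunka.get_clean_topic_name(llm=mistral_llm)
bunka.visualize_topics(width=800, height=800, colorscale='Portland')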