
Cleaning Datasets for Model Fine-Tuning

To achieve precise fine-tuning, it's crucial to control the data: keep what is relevant and discard what isn't. Bunka is a valuable tool for this task, letting you remove entire clusters of documents automatically and in a few seconds.

Theme Google Colab Link
Data Cleaning Open In Colab

Installation via Pip

pip install bunkatopics

Installation via Git Clone

git clone https://github.com/charlesdedampierre/BunkaTopics.git
cd BunkaTopics
pip install -e .
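
Either way, you can quickly confirm that the installation worked. The snippet below is a minimal sanity check using only the standard library (importlib.metadata); it is not part of the Bunka API:

# Verify that bunkatopics is installed and importable
from importlib.metadata import version

import bunkatopics

print(version("bunkatopics"))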

Quick Start

Uploading Sample Data

To get started, let's load a sample of Medium article titles into Bunkatopics:

from datasets import load_dataset
docs = load_dataset("bunkalab/medium-sample-technology")["train"]["title"]
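
docs is a plain Python list of article titles, which is all Bunka needs as input. A quick inspection (a minimal sketch, nothing Bunka-specific) confirms what was loaded:

# docs is a list of raw strings, one Medium article title per entry
print(len(docs))
print(docs[:3])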

Choose Your Embedding Model

Bunkatopics integrates seamlessly with Hugging Face's extensive collection of embedding models. You can select from a wide range of models, but be mindful of their size. Please refer to the LangChain documentation for details on the available models.

from bunkatopics import Bunka
from langchain_community.embeddings import HuggingFaceEmbeddings

# Choose your embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # add multi_process=True to enable multiprocessing

# Initialize Bunka with your chosen model
bunka = Bunka(embedding_model=embedding_model)

# Fit Bunka to your text data
bunka.fit(docs)

# Get a list of topics
df_topics = bunka.get_topics(n_clusters=15, name_length=3)  # name_length specifies the number of terms used to describe each topic

>>> df_topics
topic_id topic_name size percent
bt-12 technology - Tech - Children - student - days 322 10.73
bt-11 blockchain - Cryptocurrency - sense - Cryptocurrencies - Impact 283 9.43
bt-7 gadgets - phone - Device - specifications - screen 258 8.6
bt-8 software - Kubernetes - ETL - REST - Salesforce 258 8.6
bt-1 hackathon - review - Recap - Predictions - Lessons 257 8.57
bt-4 Reality - world - cities - future - Lot 246 8.2
bt-14 Product - Sales - day - dream - routine 241 8.03
bt-0 Words - Robots - discount - NordVPN - humans 208 6.93
bt-2 Internet - Overview - security - Work - Development 202 6.73
bt-13 Course - Difference - Step - science - Point 192 6.4
bt-6 quantum - Cars - Way - Game - quest 162 5.4
bt-3 Objects - Strings - app - Programming - Functions 119 3.97
bt-5 supply - chain - revolution - Risk - community 119 3.97
bt-9 COVID - printing - Car - work - app 89 2.97
bt-10 Episode - HD - Secrets - TV 44 1.47
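
Assuming get_topics returns a pandas DataFrame with the columns shown above (which the output suggests), you can inspect it programmatically before deciding which clusters to drop. A minimal sketch, where the 3% threshold is an arbitrary example rather than a Bunka default:

# Flag small topics as candidates for inspection or removal
small_topics = df_topics[df_topics["percent"] < 3]
print(small_topics[["topic_id", "topic_name", "size"]])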

Topic Modeling with GenAI Summarization of Topics

Explore the power of Generative AI for summarizing topics! We use Mistral AI's 7B model from the Hugging Face Hub through the LangChain framework.

from langchain.llms import HuggingFaceHub

# Define the repository ID for Mistral-7B-v0.1
repo_id = 'mistralai/Mistral-7B-v0.1'

# Use Mistral AI to summarize the topics (replace HF_TOKEN with your Hugging Face API token)
llm = HuggingFaceHub(repo_id=repo_id, huggingfacehub_api_token="HF_TOKEN")

# Obtain clean topic names using the generative model
bunka.get_clean_topic_name(generative_model=llm)
bunka.visualize_topics(width=800, height=800, colorscale='Portland')

Finally, let's visualize the topics again. You can choose from different colorscales.

bunka.visualize_topics(width=800, height=800)
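
To keep the map for later, you can export it, assuming visualize_topics returns a Plotly figure (an assumption based on the interactive output, not confirmed by this page):

# Assumption: visualize_topics returns a plotly.graph_objects.Figure
fig = bunka.visualize_topics(width=800, height=800)
fig.write_html("bunka_topics.html")  # open the interactive map in any browser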

>>> bunka.df_topics_
topic_id topic_name size percent
bt-1 Cryptocurrency Impact 345 12.32
bt-3 Data Management Technologies 243 8.68
bt-14 Everyday Life 230 8.21
bt-0 Digital Learning Campaign 225 8.04
bt-12 Business Development 223 7.96
bt-2 Technology Devices 212 7.57
bt-10 Market Predictions Recap 201 7.18
bt-4 Comprehensive Learning Journey 187 6.68
bt-6 Future of Work 185 6.61
bt-11 Internet Discounts 175 6.25
bt-5 Technological Urban Water Management 172 6.14
bt-9 Electric Vehicle Technology 145 5.18
bt-8 Programming Concepts 116 4.14
bt-13 Quantum Technology Industries 105 3.75
bt-7 High Definition Television (HDTV) 36 1.29

Removing Data Based on Topics for Fine-Tuning Purposes

You can build a customized dataset by excluding topics that do not align with your interests. In this example, we removed the topics related to advertising and high-definition television, as these clusters mostly contain promotional content that we prefer to keep out of the model's training data.

>>> bunka.clean_data_by_topics()

>>> bunka.df_cleaned_
doc_id content topic_id topic_name
873ba315 Invisibilize Data With JavaScript bt-8 Programming Concepts
1243d58f Why End-to-End Testing is Important for Your Team bt-3 Data Management Technologies
45fb8166 This Tiny Wearable Device Uses Your Body Heat... bt-2 Technology Devices
a122d1d2 Digital Policy Salon: The Next Frontier bt-0 Digital Learning Campaign
1bbcfc1c Preparing Hardware for Outdoor Creative Technology Installations bt-5 Technological Urban Water Management
79580c34 Angular Or React ? bt-8 Programming Concepts
af0b08a2 Ed-Tech Startups Are Cashing in on Parents’ Insecurities bt-0 Digital Learning Campaign
2255c350 Former Google CEO Wants to Create a Government-Funded University to Train A.I. Coders bt-6 Future of Work
d2bc4b33 Applying Action & The Importance of Ideas bt-12 Business Development
5219675e Why You Should (not?) Use Signal bt-2 Technology Devices
... ... ... ...
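
Since df_cleaned_ is a pandas DataFrame, the filtered corpus can be written straight to disk for the fine-tuning step. A minimal sketch, where the JSON-lines format and file name are one reasonable choice rather than part of the Bunka API:

# Export the cleaned documents as JSON lines, a format most fine-tuning tools accept
cleaned = bunka.df_cleaned_[["content"]]
cleaned.to_json("cleaned_corpus.jsonl", orient="records", lines=True)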