Dec 3, 2025 by Enrico Migliorini
https://cylab.be/blog/460/setting-up-a-llm-via-pythons-huggingface
Ah, Artificial Intelligence. THE contentious topic of the moment, dividing society and being debated by minds much more articulate than mine. Still, notwithstanding its potential for abuse and misuse, non-generative AI can be extremely useful.
Machine learning systems can be revolutionary for those looking to compute proteins’ structures, fight cancer, remove landmines and neutralise hostile threats, among many other potential uses.
A full outline of what modern AI is, how it relates to classical Artificial Intelligence and Machine Learning tools, and its different potentials, is far too large for these margins. I will, instead, tell you how to download, personalise and implement a Large Language Model for text classification via the Python HuggingFace library.
Well, I could write out the necessary steps for a simple classifier here, but it would not be much better than the official tutorial. So, instead, I will try to give you a few pointers on what you might want to do after you have gone through that tutorial.
The most basic tool of the computer scientist is the humble search query pointing you to StackOverflow. This still works for HuggingFace. However, be EXTREMELY careful, as the libraries are updated very fast, and solutions that were highly rated only a few months ago might already be outdated and deprecated. Supplement such answers with the rather labyrinthine official documentation. The same caveat applies to this article, which may be outdated by the time you read it.
If you want to use a GPU to accelerate your training (and you really should), download the right version of PyTorch: the CUDA version if you have an NVIDIA GPU, or the ROCm version if you have an AMD one. Test whether it works by opening the Python console and running
>>> import torch
>>> torch.cuda.is_available()
If it returns True, hardware acceleration should be enabled. However, if you do not see the speedup in your code, you may have to manually set your datasets to use acceleration. After you create your dataset, use
dataset.set_format("torch", device="cuda")
and that should be all you need.
Instead of loading a dataset online, you might want to use your own data. How would you go about formatting it?
You may simply load it from a .csv file
from datasets import load_dataset
dataset = load_dataset("csv", data_files="data.csv")
or transform a dictionary into a dataset
from datasets import Dataset
dataset_dict = {"data": data_as_a_list, "labels": labels_as_a_list}
dataset = Dataset.from_dict(dataset_dict)
or use a generator function, which is very efficient if you need to apply some transformation to a large dataset
from datasets import Dataset

def return_dataset(data: list, labels: list):
    # You may apply any transformation here.
    for idx in range(len(data)):
        row_dict = {
            "data": data[idx],
            "label": labels[idx],
        }
        yield row_dict

dataset = Dataset.from_generator(
    return_dataset, gen_kwargs={"data": data_as_a_list, "labels": labels_as_a_list}
)
You might have to adjust your dataset a bit to ensure that it works well with the infrastructure. Frustratingly, sometimes tokenizers might even change the level of nesting of the parameters, forcing you to adjust them manually. Ensure that what you feed to the model is consistent with what it requires (i.e. you are passing tokenised inputs as a list of tokens, rather than a list of lists of tokens).
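As an illustration, here is a minimal sketch of a typical tokenisation step, assuming a Dataset with a "text" column and the bert-base-uncased checkpoint (both placeholders for whatever you actually use), with a quick check of what the model will receive:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncation and padding give every example the same, flat shape.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)
# Inspect one example: input_ids should be a flat list of token ids,
# not a list of lists, before it is fed to the model.
print(tokenized[0]["input_ids"][:10])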
Better to ensure you have a good training set
Do you have an imbalanced dataset with far more examples of one class than the other? Do you want to implement your own loss algorithm? If so, you will want to subclass Trainer. In my case, I wanted to use custom weights for an imbalanced dataset with far more negatives than positives, so that the classifier would not “cut corners” by simply classifying everything as negative. To do that, I had to use weights that penalise false negatives much more than false positives.
To do this, I installed scikit-learn from pip and used the following code
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer

# Weight each class inversely to its frequency in the training labels.
class_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(labels), y=labels
)

class CustomTrainer(Trainer):
    def compute_loss(
        self, model, inputs, return_outputs=False, num_items_in_batch=None
    ):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits").float()
        # Cross-entropy with per-class weights, so mistakes on the rare
        # class cost more than mistakes on the common one.
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=torch.tensor(
                [class_weights[0], class_weights[1]], device="cuda"
            ).float()
        )
        loss = loss_fct(
            logits.view(-1, self.model.config.num_labels).float(), labels.view(-1)
        )
        return (loss, outputs) if return_outputs else loss
You can adapt the loss to any algorithm you might want.
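To give an idea of how the subclass slots in, here is a rough sketch, where model, training_args and the tokenised datasets are placeholders for whatever you have already built:
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
)
trainer.train()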
The documentation will recommend TextClassificationPipeline for classifying text snippets, and that works well enough. However, you might want more granular control over the output, or perhaps you want to do sequence classification, which inexplicably does not have its own pipeline. No worries.
Assuming that you have the data you want to predict set up in a HuggingFace dataset, you can just use your trainer with
trainer.predict(test_data)
and get your predictions delivered to you in the form of an array. In the two-label text classification case (positive or negative), you will want to convert each prediction to a human-readable form via
outcome = 'NEGATIVE' if p[0] > p[1] else 'POSITIVE'
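Put together, a small sketch (assuming a two-label classifier and a test dataset called test_data, both placeholders) of turning the raw predictions into labels might look like this:
predictions = trainer.predict(test_data)
# predictions.predictions holds one row of raw scores (logits) per example.
for p in predictions.predictions:
    outcome = 'NEGATIVE' if p[0] > p[1] else 'POSITIVE'
    print(outcome)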
Remember that if you have saved your model using save_pretrained(), you can load it back via from_pretrained() and create a trainer from it.
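For instance, a minimal sketch (the directory name is a placeholder, and training_args is assumed to already exist):
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer

# Save the fine-tuned model and its tokenizer to a local directory.
model.save_pretrained("my-classifier")
tokenizer.save_pretrained("my-classifier")

# Later, load them back and wrap the model in a new Trainer.
model = AutoModelForSequenceClassification.from_pretrained("my-classifier")
tokenizer = AutoTokenizer.from_pretrained("my-classifier")
trainer = Trainer(model=model, args=training_args)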
This blog post is licensed under
CC BY-SA 4.0