machine learning

You shall know a word by the company it keeps — so choose your prompts wisely

Post

In 1957, linguist J. R. Firth observed that 'you shall know a word by the company it keeps'. That principle — words that co-occur share meaning — is the foundation on which all of generative AI was built, from early Latent Semantic Analysis to today's trillion-parameter Transformers. This post traces the lineage with three interactive LSA-to-PCA visualisations in R (Reuters newswire, State of the Union addresses and IMDb reviews), showing where simple co-occurrence models succeed, where they fail and why scale alone turned a modest insight into the technology behind ChatGPT. It then examines why LLMs are optimised for fluency rather than truth — hallucinations are a structural consequence, not a bug to be patched — and argues that careful prompt engineering is the best tool we have for steering a fundamentally heuristic machine.
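Firth's principle can be sketched in a few lines of code. The post's visualisations are in R, but the idea is language-agnostic; the toy corpus, window size and word choices below are illustrative assumptions, not the post's actual data. Counting which words appear near each other, then comparing the resulting count vectors, is enough to separate pet-related words from finance-related ones:

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus (illustrative only — the post uses Reuters, SOTU and IMDb text).
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the dog ate the bone",
    "stocks fell on the market",
    "the market rallied as stocks rose",
]

def cooccurrence(sentences, window=2):
    """Count, for each word, the words appearing within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

vecs = cooccurrence(corpus)
print(cosine(vecs["cat"], vecs["dog"]))     # higher: shared company ("the", "chased")
print(cosine(vecs["cat"], vecs["stocks"]))  # lower: no shared context in this corpus
```

LSA goes one step further, applying a truncated SVD to exactly this kind of count matrix so that words sharing *indirect* company also end up close together — the step the post's LSA-to-PCA visualisations make interactive.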

Secure, private transcription at scale with Whisper and GitHub Copilot

Post

A case study on achieving secure, private transcription at scale with Whisper and GitHub Copilot: a practical application of AI in research environments that keeps data-privacy and security standards intact.

Secure and scalable speech transcription for local and HPC

Post

A production-ready local transcription workflow built on OpenAI's Whisper models. It addresses the limitations of cloud-based services through complete data sovereignty, unlimited scale, reproducible processing and advanced quality control, while remaining GDPR-compliant.

Secure and scalable speech transcription for local and HPC

Publication

A production-ready, local transcription workflow using OpenAI's Whisper, designed for security, scalability on HPC, and advanced quality control. It overcomes the privacy and reproducibility limitations of cloud-based services, offering a robust alternative for academic and enterprise use.