OpenAI | Pablo Bernabeu

You shall know a word by the company it keeps — so choose your prompts wisely

Post

In 1957, linguist J. R. Firth observed that 'you shall know a word by the company it keeps'. That principle, that words occurring in similar contexts tend to share meaning, is the foundation on which today's large language models were built, from early Latent Semantic Analysis to trillion-parameter Transformers. This post traces the lineage with three interactive LSA-to-PCA visualisations in R (Reuters newswire, State of the Union addresses and IMDB reviews), showing where simple co-occurrence models succeed, where they fail and why scale alone turned a modest insight into the technology behind ChatGPT. It then examines why LLMs are optimised for fluency rather than truth (hallucinations are a structural consequence, not a bug to be patched) and argues that careful prompt engineering is the best tool we have for steering a fundamentally heuristic machine.

Secure and scalable speech transcription for local and HPC

Post

A production-ready local transcription workflow that uses OpenAI's Whisper models to address the limitations of cloud-based solutions, offering complete data sovereignty, scalability from a single machine to HPC, reproducible processing and advanced quality control, while maintaining GDPR compliance.

Secure and scalable speech transcription for local and HPC

Publication

A production-ready, local transcription workflow using OpenAI's Whisper, designed for security, scalability on HPC, and advanced quality control. It overcomes the privacy and reproducibility limitations of cloud-based services, offering a robust alternative for academic and enterprise use.