Large Models Meet Big Data: Spark and LLMs in Harmony | by Naser Tamimi | Dec, 2023


DATA ENGINEERING & GENERATIVE AI

A step-by-step guide to using Apache Spark with large language models

The image is generated by Midjourney.

Generative AI, including Large Language Models (LLMs), is revolutionizing many aspects of human life. Over the past five years, it has evolved from a research project into a real-life application for many people. As a data engineer interested in Generative AI, I have often asked myself what this technology brings to my work and to data engineering applications. Some uses of Gen AI and LLMs are already common among engineers, such as coding assistants (copilots) and help with documentation. Here, however, I evaluate some of the more specialized uses of Gen AI and LLMs for data engineering. If you are interested in this topic, please read this article and follow me on Medium and LinkedIn for more articles about other use cases.

It is no secret that data engineers love structured, well-abstracted data. But the world is full of unstructured and disorganized data that also requires their attention. Transformations on unstructured data are often complicated and sometimes impossible with traditional tools. Historically, text (e.g., comments, reviews, conversations) has been one of the most challenging kinds of unstructured data. Simple transformations on text were never a big deal, but more complicated transformations can extract more information from it and produce richer data sets.

Examples of complicated text transformations include extracting names and objects from a text, running sentiment analysis on a review or comment, masking sensitive information (e.g., private data, user data) in stored texts, translating from one language to a standard language, summarizing text, and so on. The good news is that LLMs can now do all of these transformations. Therefore, I believe that one of the hundreds of applications of LLMs in data engineering is to act as transform functions for complicated data such as text.
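To make the "transform function" idea concrete, here is a minimal sketch in plain Python. It is not the code used later in this article: the model call is replaced by a hypothetical stub (`stub_sentiment_model`) based on keyword matching, and the helper `add_column` and the sample `reviews` are also invented for illustration. The point is only the shape of the operation: a function from text to a value, applied to one column of every row.

```python
from typing import Callable, Dict, List

# A "transform function" maps one input text to one output value.
TextTransform = Callable[[str], str]


def stub_sentiment_model(text: str) -> str:
    """Hypothetical stand-in for an LLM call (NOT a real model).

    A real pipeline would send `text` to a language model here. This stub
    uses a tiny keyword lookup so the example runs without dependencies.
    """
    lowered = text.lower()
    if any(word in lowered for word in ("great", "love", "excellent")):
        return "positive"
    if any(word in lowered for word in ("bad", "terrible", "broken")):
        return "negative"
    return "neutral"


def add_column(
    rows: List[Dict[str, object]],
    source: str,
    target: str,
    transform: TextTransform,
) -> List[Dict[str, object]]:
    """Return new rows with `target` set to transform(row[source]).

    The input rows are left unchanged, mirroring how a data frame
    transformation returns a new data set instead of mutating the old one.
    """
    return [{**row, target: transform(str(row[source]))} for row in rows]


# Illustrative sample data (invented for this sketch).
reviews = [
    {"id": 1, "review": "I love this product, it works great."},
    {"id": 2, "review": "Arrived broken and support was terrible."},
    {"id": 3, "review": "It is a blue notebook."},
]

labeled = add_column(reviews, "review", "sentiment", stub_sentiment_model)
for row in labeled:
    print(row["id"], row["sentiment"])
```

In a distributed engine, a function of this shape would typically be registered as a user-defined function and applied to a text column, so that each worker runs the model on its own share of the rows. Swapping the stub for a real model changes the body of the transform, not the structure around it.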

In this article, I will demonstrate this ability of LLMs with Apache Spark, a powerful distributed data processing system. More specifically, I am going to use a small LLM…


