From Data Ingestion to Real-Time Queries: Powering Microservices with Scala and Apache Pinot

May 2, 2023

Scala is a modern, statically typed programming language that runs on the JVM and combines the object-oriented and functional programming paradigms. It has gained much popularity in recent years, especially in the big data world, where frameworks such as Apache Spark and Apache Kafka are written largely in Scala.

Apache Pinot, on the other hand, is an open-source distributed real-time analytics engine designed for low-latency OLAP queries on large-scale datasets. It is popular for real-time monitoring, anomaly detection, and personalization use cases.

In this blog post, we will explore how to develop applications using the Scala language and leverage Apache Pinot for real-time analytics.

Getting Started with Scala

Scala can be downloaded and installed from the Scala website. Once Scala is installed on your machine, you can use an integrated development environment (IDE) such as IntelliJ IDEA or Eclipse to write and run Scala code.

Scala has a simple and intuitive syntax similar to other programming languages such as Java and Python. Here is an example of a simple Scala program that prints "Hello, world!" to the console:

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
  }
}

This program defines an object named "HelloWorld" that has a "main" method. The "main" method takes an array of strings as its input and prints "Hello, world!" to the console.

Getting Started with Apache Pinot

As noted above, Apache Pinot is built for low-latency OLAP queries on large-scale datasets, including data that is still arriving.

Here are some steps to get started with Apache Pinot:

1. Download and install Apache Pinot on your machine or cluster.

2. Use the Pinot web UI or the admin tools to add a schema and a table, then load data into the table.

3. Write queries in Pinot SQL to query data stored in Pinot.

The Pinot web UI lets you manage schemas and tables, and its Query Console provides an editor for writing and testing Pinot SQL queries. Once a table has a schema and some data, you can query it with Pinot SQL. Here is an example of a Pinot SQL query that selects the average price of products by category:

SELECT AVG(price), category FROM products GROUP BY category

Developing Applications Using Scala and Apache Pinot

Now that we've covered the basics of Scala and Apache Pinot, let's explore how to develop applications using Scala and Apache Pinot.

Here are some steps to get started:

1. Use the Apache Pinot Java client library to connect to Pinot from Scala code.

2. Write Scala code to retrieve data from Pinot using Pinot SQL queries.

3. Use Scala's functional programming features to process and analyze data retrieved from Pinot.

The Apache Pinot Java client library provides a simple interface for connecting to Pinot from Java or Scala code. To use it in your Scala code, first add the following dependency to your build.sbt file, replacing x.y.z with the Pinot release you are running:

libraryDependencies += "org.apache.pinot" % "pinot-java-client" % "x.y.z"
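
Step 3 above is easiest to see on rows that have already been fetched. The following is a minimal sketch, not part of the Pinot client: ArticleViews and ArticleStats are hypothetical names, and the sample rows stand in for whatever a query returns.

```scala
// Hypothetical row type; the field names are assumptions for this sketch.
final case class ArticleViews(articleId: String, category: String, pageViews: Long)

object ArticleStats {
  // Total page views per article, highest first, truncated to n entries.
  def topN(rows: Seq[ArticleViews], n: Int): Seq[(String, Long)] =
    rows
      .groupBy(_.articleId)
      .map { case (id, group) => id -> group.map(_.pageViews).sum }
      .toSeq
      .sortBy { case (id, views) => (-views, id) } // ties broken by id for stable output
      .take(n)

  // Average page views per category.
  def averageByCategory(rows: Seq[ArticleViews]): Map[String, Double] =
    rows
      .groupBy(_.category)
      .map { case (category, group) =>
        category -> group.map(_.pageViews).sum.toDouble / group.size
      }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ArticleViews("a1", "sports", 120L),
      ArticleViews("a2", "politics", 300L),
      ArticleViews("a1", "sports", 80L),
      ArticleViews("a3", "sports", 50L)
    )
    topN(rows, 2).foreach { case (id, views) => println(s"$id: $views") }
    println(averageByCategory(rows))
  }
}
```

Because both functions take and return plain immutable collections, they can be tested without a running Pinot cluster and reused for any query that yields the same row shape.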

Suppose we want to build a microservice that retrieves and displays the top 10 most popular articles from a news website. Here's how we can use Scala and Pinot to build this microservice:

a) Data Ingestion: We first need to ingest data from the news website into Pinot. This can be done using Pinot's built-in data ingestion tools or by using custom data ingestion scripts written in Scala. The data can be stored in Pinot's columnar storage format, which is designed to optimize query performance.

import org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher

// The input file location, the table name ("news_articles"), and the schema
// location are declared in a job spec YAML file rather than set in code.
val jobSpecFile = "/path/to/ingestion-job-spec.yaml"

// Load the spec, then run the batch ingestion job it describes.
// Note: the argument list of getSegmentGenerationJobSpec differs between Pinot
// releases, so check it against the version on your classpath.
val spec = IngestionJobLauncher.getSegmentGenerationJobSpec(
  jobSpecFile, null, new java.util.HashMap[String, AnyRef](), System.getenv())
IngestionJobLauncher.runIngestionJob(spec)

b) Querying Data: Once the data is ingested, we can write a Scala application that uses the Pinot Java client library to query it. A simple query retrieves the top 10 articles by popularity score, computed here as the total number of page views.

import org.apache.pinot.client.{ConnectionFactory, ResultSet, ResultSetGroup}

// Connect through a broker; "localhost:8099" is an assumed address for a local setup.
val connection = ConnectionFactory.fromHostList("localhost:8099")

// The Java client takes the query as Pinot SQL text.
val query =
  """SELECT article_id, SUM(page_views) AS popularity_score
    |FROM news_articles
    |GROUP BY article_id
    |ORDER BY SUM(page_views) DESC
    |LIMIT 10""".stripMargin

val resultSetGroup: ResultSetGroup = connection.execute(query)
val resultSet: ResultSet = resultSetGroup.getResultSet(0)

// Rows are addressed by index: column 0 is article_id, column 1 is the score.
for (row <- 0 until resultSet.getRowCount) {
  val articleId = resultSet.getString(row, 0)
  val popularityScore = resultSet.getDouble(row, 1) // read as a double to be safe with SUM results
  println(s"Article $articleId has a popularity score of $popularityScore")
}
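
The same top-10 query can be written as Pinot SQL text, the form used in the earlier products example. The helper below is a hypothetical sketch (TopArticlesQuery is not part of any Pinot library) that assembles such a query and rejects anything that is not a simple identifier, so that table and column names coming from configuration cannot alter the query's structure.

```scala
object TopArticlesQuery {
  // Accept only simple identifiers so that names cannot smuggle extra SQL into the query.
  private val Identifier = "[A-Za-z_][A-Za-z0-9_]*".r

  private def checked(name: String): String =
    if (Identifier.pattern.matcher(name).matches()) name
    else throw new IllegalArgumentException(s"Not a simple identifier: $name")

  // Builds a "top N groups by summed metric" query as SQL text.
  def topN(table: String, groupColumn: String, metricColumn: String, limit: Int): String = {
    require(limit > 0, "limit must be positive")
    val t = checked(table)
    val g = checked(groupColumn)
    val m = checked(metricColumn)
    s"SELECT $g, SUM($m) AS total FROM $t GROUP BY $g ORDER BY SUM($m) DESC LIMIT $limit"
  }
}
```

For example, TopArticlesQuery.topN("news_articles", "article_id", "page_views", 10) produces SELECT article_id, SUM(page_views) AS total FROM news_articles GROUP BY article_id ORDER BY SUM(page_views) DESC LIMIT 10.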

c) Building the Microservice: We can then use the Play Framework, a popular web application framework for Scala, to build a microservice that retrieves and displays the top 10 articles. The microservice can use the query from step b to retrieve the top 10 articles from Pinot and return them to the caller.

import javax.inject.Inject
import play.api.libs.json.Json
import play.api.mvc.{AbstractController, ControllerComponents}
import org.apache.pinot.client.ConnectionFactory

class NewsController @Inject()(cc: ControllerComponents) extends AbstractController(cc) {

  // One connection shared across requests; the broker address is an assumption.
  private val connection = ConnectionFactory.fromHostList("localhost:8099")

  private val topArticlesQuery =
    "SELECT article_id, SUM(page_views) AS popularity_score FROM news_articles " +
      "GROUP BY article_id ORDER BY SUM(page_views) DESC LIMIT 10"

  def topArticles = Action { implicit request =>
    val resultSet = connection.execute(topArticlesQuery).getResultSet(0)

    // Turn each result row into a small JSON object.
    val articles = (0 until resultSet.getRowCount).map { row =>
      Json.obj(
        "articleId" -> resultSet.getString(row, 0),
        "popularityScore" -> resultSet.getDouble(row, 1)
      )
    }

    // Returning JSON is an assumption; render an HTML view instead if a page is needed.
    Ok(Json.toJson(articles))
  }
}

d) Real-time Updates: To make the microservice more responsive, we can use Pinot's real-time ingestion feature to ingest new data in real time as it becomes available. This allows us to update the popularity score of articles in real time and provide up-to-date results to users.
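
Pinot's real-time tables typically consume records from a stream such as Apache Kafka. As a rough sketch of the producing side, the code below serializes a page-view event into one JSON object per message. PageViewEvent and PageViewJson are hypothetical names; article_id and page_views mirror the columns used in the queries above, while event_time is an assumed time column. Publishing the string to the stream is left out.

```scala
// Hypothetical event type for this sketch.
final case class PageViewEvent(articleId: String, pageViews: Long, eventTimeMillis: Long)

object PageViewJson {
  // Escape quotes, backslashes, and common whitespace control characters
  // so the value stays a valid JSON string literal.
  private def escape(s: String): String =
    s.flatMap {
      case '"'  => "\\\""
      case '\\' => "\\\\"
      case '\n' => "\\n"
      case '\r' => "\\r"
      case '\t' => "\\t"
      case c    => c.toString
    }

  // One JSON object per event, usable as the value of a stream message.
  def toJson(e: PageViewEvent): String =
    s"""{"article_id":"${escape(e.articleId)}","page_views":${e.pageViews},"event_time":${e.eventTimeMillis}}"""
}
```

Whatever field names are chosen here have to match the columns declared in the table's schema, otherwise the ingested rows will not line up with the queries.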

e) Scaling the Microservice: Finally, we can deploy the microservice to a cloud provider such as AWS or Google Cloud Platform and run more instances of it as traffic grows. On the data side, Pinot's distributed architecture lets the cluster grow by adding nodes. Together, this allows the system to handle increasing traffic and growing volumes of data over time.

In summary, by using Scala and Pinot, we can build a powerful and efficient microservice that retrieves and displays the top 10 most popular articles from a news website. The microservice is scalable, responsive, and can handle large volumes of data in real time.

Advantages of Using Scala and Apache Pinot

High-level abstractions: Scala's higher-order functions, closures, case classes, and pattern matching make it easy to write concise, readable, and maintainable code. Pinot, for its part, exposes a SQL interface that is familiar to developers used to working with relational databases.

Functional programming: Scala supports a functional style built on immutable values and side-effect-free functions, which makes code easier to test, maintain, and reuse. Pinot query results are tabular rows, so they fit naturally into Scala collection operations such as map, filter, and groupBy.

Real-time analytics: Pinot is designed for low-latency OLAP queries on large-scale datasets, which makes it a good fit for real-time monitoring, anomaly detection, and personalization use cases. Used together, Scala and Pinot let developers build applications that process and analyze large volumes of data in real time.

Integration with other frameworks: Scala is widely used in the big data world and integrates well with popular frameworks such as Apache Spark and Apache Kafka, which makes it easier to combine several data processing and analytics tools in one application. Pinot also integrates with other big data systems such as Apache Hadoop and Apache Flink, making it easier to build end-to-end data processing pipelines.

Scalability: Pinot is designed to scale horizontally, and new nodes can be added to a cluster as data volumes and query load grow. Scala services that keep no local state can be scaled out in the same way by running more instances, so the whole application can grow with its data.

In summary, developing applications using Scala and Pinot provides several advantages, including high-level abstractions, functional programming, real-time analytics, integration with other frameworks, and scalability. By leveraging the strengths of both Scala and Pinot, developers can build powerful and efficient data processing and analytics applications.

For your business needs, you can contact us at hello@fusionpact.com