Are you interested in running large language models (LLMs) on your laptop for local development?
Curious how LLMs can enhance Kafka Streams for state store analysis?
This walkthrough is for you.
Goals
- Set up a local LLM for application development
- Integrate the LLM with Kafka Streams for state store analysis
When using a local LLM, your performance will depend on your hardware. I’m using an M1 Max MacBook Pro.
Setting Up a Local LLM
Introduction
My first attempt was deploying Ollama inside a Docker container. While I expected some slowness, I hoped to tune Docker to access the GPU, but this wasn't straightforward. So I shifted to a local installation: I installed Ollama via Homebrew, downloaded the model, and used two hostnames to access it:
- localhost for development on the laptop
- host.docker.internal for accessing localhost services from within Docker containers
While I prefer a complete Docker Compose environment, I was happy as long as I could access the LLM from an application running directly on my laptop or from a Docker container.
Installation
All steps below assume macOS with Homebrew installed.
Install Ollama
Installation is straightforward with Homebrew.
brew install ollama
Download a Model
The hardest part of downloading a model is selecting one. Once selected, you download it like a Docker image:
ollama pull deepseek-r1:latest
Start Ollama Server
You can run Ollama manually or as a service.
Manually
ollama serve
As a service
brew services start ollama
brew services stop ollama
I recommend manual execution, as the console log is very insightful. It is also easy to change settings via environment variables. Here's a simple way to increase the context length:
OLLAMA_CONTEXT_LENGTH=16535 ollama serve
Model Selection
Browse available models at ollama.com/library.
I wanted a model that could run on my laptop and handle language reasoning. I recommend deepseek-r1 for its reasoning capabilities and local compatibility. My goal was to integrate an LLM with Kafka Streams and develop everything locally; once that works, migrating to a cloud environment becomes simpler.
After pulling multiple models, use ollama list to view them.
Endpoint Selection
Ollama endpoint
While I started with the standard Ollama chat endpoint (/api/chat), I also experimented with the OpenAI endpoint (/v1/chat/completions).
This endpoint offers more tuning options I wanted to explore, such as top_p and temperature to control creativity for my use case.
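For reference, here are the endpoint URLs I worked with (a small sketch; the port and paths are Ollama defaults, and the constant names are mine):

// Ollama's native chat endpoint (Ollama listens on port 11434 by default)
static final String OLLAMA_CHAT = "http://localhost:11434/api/chat";
// OpenAI-compatible endpoint exposed by Ollama
static final String OPENAI_CHAT = "http://localhost:11434/v1/chat/completions";
// from inside a Docker container, swap localhost for host.docker.internal
static final String OLLAMA_CHAT_FROM_DOCKER = "http://host.docker.internal:11434/api/chat";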
Endpoint configuration
Java records and Jackson for serialization are an easy and powerful way to quickly work with a REST API from a Java application.
record Message(String role, String content) {}
record OllamaRequest(
        String model,
        Boolean stream,
        List<Message> messages,
        BigDecimal top_p,       // sampling options exposed by the OpenAI-compatible endpoint
        BigDecimal temperature
) {}
record OllamaResponse(
        String model,
        String created_at,
        Message message,
        Boolean done,
        String done_reason,
        long total_duration,
        long load_duration,
        long prompt_eval_count,
        long prompt_eval_duration,
        long eval_count,
        long eval_duration
) {}
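To show how these records translate into a request body, here is a minimal sketch, assuming Jackson 2.12+ (for record support) and NON_NULL inclusion so that unset tuning options stay out of the JSON:

import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

var objectMapper = new ObjectMapper()
        .setSerializationInclusion(JsonInclude.Include.NON_NULL); // omit the null top_p/temperature fields
var request = new OllamaRequest(
        "deepseek-r1",
        false, // stream=false: wait for one complete response
        List.of(new Message("user", "say hello")),
        null,  // top_p: model default
        null   // temperature: model default
);
String json = objectMapper.writeValueAsString(request);
// {"model":"deepseek-r1","stream":false,"messages":[{"role":"user","content":"say hello"}]}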
HTTP Client
Use whatever HTTP client you are already using in your Java applications; I've been quite happy with Apache HttpComponents HttpClient.
Integrating the LLM with Kafka Streams
Overview
With Ollama running and a model downloaded, it was time to integrate with Kafka Streams.
Kafka Streams excels at stateful stream processing, but it is not designed for non-stream computation inside stream threads—doing so hurts performance.
Kafka Streams exposes state stores via interactive queries:
- streams.store() (IQv1)
- streams.query() (IQv2)
If your state store is analytical in nature, it is a great data source for AI. Do not execute inquiries to the model from within the stream thread (topology).
While code fragments here may make things look complex, the coding is simple. The hard part is learning how to write, test, and refine prompts to improve analysis.
Accessing the State Stores
To provide context in prompts to the LLM, include attributes for the products that make the top 5. Global stores are handy for this.
Use the IQv1 API to access any global store; there is currently no IQv2 API for global stores. For this walkthrough, global stores hold the product information. The analysis can run on a schedule or via user interaction; for demonstration, it is scheduled to run at the end of each window.
// Product is the global store's value type; it exposes the attributes() accessor used later
var products = new HashMap<String, Product>();
var store = streams.store(StoreQueryParameters.fromNameAndType("product", QueryableStoreTypes.<String, Product>keyValueStore()));
try (var iterator = store.all()) {
    while (iterator.hasNext()) {
        var next = iterator.next();
        products.put(next.key, next.value);
    }
}
The data we want the LLM to analyze resides in tumbling window state stores.
Use the IQv2 API to access a specific window from the state store. While the IQv1 API is still available, use the IQv2 API when possible. For a tumbling window, use a range query with the start/end aligned to the tumbling window definition.
Create a top-N list for easy consumption when writing the prompt.
List<ProductAnalytic> pa = new ArrayList<>();
Window current = currentWindow(windowSize);
var request = StateQueryRequest.inStore("TUMBLING-aggregate-purchase-order")
        .withQuery(WindowRangeQuery.withWindowStartRange(current.start(), current.end()));
var result = streams.query(request);
for (var partitionResult : result.getPartitionResults().values()) {
    if (partitionResult.isSuccess()) {
        try (var iterator = partitionResult.getResult()) {
            while (iterator.hasNext()) {
                var next = iterator.next();
                Windowed<String> windowedKey = next.key; // the key carries the window bounds along with the record key
                ProductAnalytic value = next.value.value(); // the store holds ValueAndTimestamp, so value() unwraps the record
                pa.add(value);
            }
        }
    }
}
// sort descending by quantity for the top-N list
pa.sort(Comparator.comparingLong(ProductAnalytic::getQuantity).reversed());
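The currentWindow helper is not shown above; here is a minimal sketch of what it could look like, assuming epoch-aligned tumbling windows (the Kafka Streams default) and a small Window record of my own rather than Kafka Streams' Window class, so start() and end() return the Instant values the range query expects:

import java.time.Duration;
import java.time.Instant;

record Window(Instant start, Instant end) {}

// compute the bounds of the tumbling window containing "now"
static Window currentWindow(Duration windowSize) {
    long sizeMs = windowSize.toMillis();
    long startMs = (System.currentTimeMillis() / sizeMs) * sizeMs;
    return new Window(Instant.ofEpochMilli(startMs), Instant.ofEpochMilli(startMs + sizeMs));
}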
Writing a prompt is not difficult; writing an effective prompt is the challenge. Once orchestration is in place, collect your top-N datasets and construct the prompt. Expect weak outputs at first, and improve them as you learn how to communicate with your model effectively. If you want to build on context from past responses, ensure that process is part of your model testing and analysis.
var sb = new StringBuilder();
sb.append("predict trends on purchasing, considering previous window purchases\n");
sb.append("top-5 quantity for current window:\n");
sb.append(String.format("window [%s, %s)\n", current.start(), current.end()));
pa.subList(0, Math.min(pa.size(), 5)).forEach(p -> {
    var product = products.get(p.getSku());
    sb.append(String.format("sku=%s, qty=%d, attrs=%s\n", p.getSku(), p.getQuantity(), product.attributes()));
});
observer.inquire(sb.toString());
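How the analysis is triggered is up to you. Here is a minimal sketch of scheduling it to fire as each window closes; analyzeCurrentWindow is a hypothetical method wrapping the query-and-prompt code above, and windowSize must match the topology's window definition:

import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

var windowSize = Duration.ofSeconds(30); // assumption: must match the tumbling window in the topology
var scheduler = Executors.newSingleThreadScheduledExecutor();
long sizeMs = windowSize.toMillis();
// the first run fires right after the current window closes, then once per window
long initialDelayMs = sizeMs - (System.currentTimeMillis() % sizeMs);
scheduler.scheduleAtFixedRate(() -> analyzeCurrentWindow(), initialDelayMs, sizeMs, TimeUnit.MILLISECONDS);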
The Observer retains past responses and system messages to provide context without resending earlier requests, keeping input tokens lower.
You can further compress or summarize past responses to reduce input tokens.
// fields used by the constructor and inquire(); Message and the request/response records are defined above
private static final String SYSTEM = "system";
private final String ollamaUrl = "http://localhost:11434/api/chat"; // default Ollama chat endpoint
private final ObjectMapper objectMapper = new ObjectMapper();
private final List<Message> messages = new ArrayList<>();
private final CloseableHttpClient httpClient;

public Observer() {
    httpClient = HttpClients.custom()
            .setDefaultRequestConfig(RequestConfig.custom()
                    .setResponseTimeout(Timeout.ofSeconds(60))
                    .build())
            .build();
    messages.add(new Message(SYSTEM, "the time windows are small for demo purposes, so don't worry about that; it is correct."));
    messages.add(new Message(SYSTEM, "focus on trends, predict future purchases"));
    messages.add(new Message(SYSTEM, "You are a concise assistant. Do not include any <think> sections or internal reasoning in your responses. Only return direct answers or summaries."));
}
public void inquire(String prompt) throws IOException {
    List<Message> request = new ArrayList<>(this.messages);
    request.add(new Message("user", prompt));
    HttpPost post = new HttpPost(ollamaUrl);
    post.setHeader("Accept", "application/json");
    post.setHeader("Content-Type", "application/json");
    post.setEntity(new StringEntity(
            objectMapper.writeValueAsString(
                    new OllamaRequest(
                            "deepseek-r1",
                            false,  // stream=false: wait for one complete response
                            request,
                            null,   // top_p: model default
                            null    // temperature: model default
                    )
            ), ContentType.APPLICATION_JSON)
    );
    OllamaResponse response = httpClient.execute(post, (ClassicHttpResponse resp) -> {
        final int status = resp.getCode();
        if (status >= 200 && status < 300) {
            return objectMapper.readValue(resp.getEntity().getContent(), OllamaResponse.class);
        }
        throw new RuntimeException("status=" + status);
    });
    // store the response so it is provided as context for the next inquiry; trim/filter as needed
    messages.add(response.message());
}
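The trim/filter step can be as simple as keeping the system messages plus the most recent assistant replies. A minimal sketch, assuming it lives inside Observer (the helper name and cutoff are mine):

// keep every system message, but only the last N assistant responses,
// to cap the input tokens sent with each inquiry
private List<Message> trimmedHistory(int lastN) {
    var system = messages.stream().filter(m -> SYSTEM.equals(m.role())).toList();
    var assistant = messages.stream().filter(m -> "assistant".equals(m.role())).toList();
    var trimmed = new ArrayList<>(system);
    trimmed.addAll(assistant.subList(Math.max(0, assistant.size() - lastN), assistant.size()));
    return trimmed;
}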
Walkthrough
Here is a walkthrough of my example. It showcases the importance of making your LLM conversations replayable so you can tune and verify.
Summary
The architecture highlights are:
- The prompt is a short write-up with the time window and top-5 products, including all product attributes.
- The model is deepseek-r1 running in Ollama.
- Kafka Streams uses IQv1 and IQv2 APIs to read state stores from outside the topology.
- The LLM receives current information and past responses.
- Only historical responses were provided in the next conversation, not the historical requests.
Actual Interaction
Orders are created once every second, with a publisher that randomly selects products from the product list.
./publisher/run.sh
After some analysis, restart the publisher with a bias toward selecting tablets more frequently than phones and computers.
./publisher/run.sh --attribute-name productType --attribute-value tablets --percentage 0.80
Response 1
- Work is needed to reduce thinking content in responses; short term, remove <think>…</think> sections for the next query.
- Little perceived value in this response.
Message[role=assistant, content=<think>
Okay, the user is asking to predict trends in purchasing based on previous window data. They provided five entries with SKUs and their quantities within a time window from 2025-10-22T11:55:30Z to 2025-10-22T11:56:00Z, which is very small. Since the assistant's focus is on trends without worrying about the window size for demos, I should treat this as a micro-window analysis.
Looking at the SKUs and their quantities (3,5,6,6,6), there's a slight upward trend from 3 to higher numbers in subsequent items. The products span different categories—computers, tablets, phones. The user might be interested in overall purchasing patterns across these categories or perhaps specific ones.
Considering product types: computers and tablets each have two SKUs (73 with qty=3, then both 100 and 338 are tablets at qty=6). Phones also have one SKU (362) at qty=6. Maybe the user wants to know if there's a category-wise trend? Tablets and phones show higher quantities than computers here.
Categorizing SKUs: The assistant can group by productType or other attributes. For example, storageType includes SSD, NVMe SSD, etc., but maybe focusing on core specs like RAM size (8GB vs 16GB) would help. Higher RAM items have more quantity? But the computer with 8GB has qty=3, while tablets and phones here have higher quantities.
Release dates: The latest product is released in April 2025 (sku=100), others are older. Maybe newer products see higher purchases? Or maybe not, since the window is so recent that only one new product was added after Oct 2024.
The user's deeper need might be understanding which factors influence purchase decisions to forecast future trends, especially for inventory or marketing. They might want actionable insights like focusing on RAM size upgrades or newer products in phones/tablets. But since the data is limited and from a tiny window, extrapolating much beyond that isn't reliable without more context.
</think>
Based on these very recent purchases (a micro-window), trends observed are:
1. Higher quantities purchased (`qty=5` or `6`) for items with larger RAM size (`ramSize=12GB`, `16GB`) and higher review counts/scores (`reviewCount >=900`, `reviewScore>=4.1`).
2. Items from newer product lines (like tablets) might be seeing slightly more purchase volume in this period.
3. The specific CPU model (`M2`) continues to influence purchases within computers and tablets seen here.
Response 2
- While data is random, product attributes show patterns, and the model appears to detect similarities.
- The response provided a prediction of future trends, indicating my system request “focus on trends, predict future purchases” was understood.
Based on these recent purchases (a very short time window), the observed trend is:
* **Higher quantities purchased:** `qty` ranges from 2 to 7, with most entries being higher than average.
* **Dominant product type:** Most items are computers (`productType=computers`).
* **Key attributes driving purchases:** High RAM size (16GB or more), SSD storage (UFS/NVMe), recent release date, high review counts/scores.
**Prediction for the next similar small time window:**
Based on this limited data, we can predict that in the immediate future:
* Computers with large amounts of RAM (>= 16GB) and fast SSDs will likely continue to be popular.
* Newer models within computers might see higher purchase quantities.
Response 3
- The model provided concrete observations, but did not provide a prediction.
- Indicates that I need to work on my system messages and see if I can better communicate with the LLM.
In this short time window (10 seconds), purchase activity remains dominated by SKUs from the previous top-5 list.
* **SKU 103** ("Apple MacBook Pro") continues to be a massive driver of volume (~248k+).
* **SKU 105** ("Samsung Galaxy S Series Tablet - M2 Chip") leads this period with over 297k purchases.
* **SKU 129** ("Dell XPS Laptop - New Model") debuts strongly, securing the third position (~50k+).
Producer Changed
- The producer was redeployed with 80% favoritism for tablets.
Response 4
- A prediction is provided; the change in product distribution is not reflected.
- The formatting is inconsistent; this is why I’m considering fine-tuning a model instead to see if I can get more consistent responses.
**Observed Trends in Current Window:**
1. **Dominance of Electronics:** Top purchases are exclusively from `tablets` and `computers`, alongside `phones`.
2. **Lower Purchase Volume for Computers:** The highest quantity (`qty`) recorded is `4`, significantly lower than the previous window's peak (which was higher, e.g., around `50k+`). Only one computer SKU appears in this top-5.
3. **Higher Quantities for Tablets and Phones:** Tablet SKUs have quantities ranging from `1` to `2`. Phone has a quantity of `4`.
4. **Consistent Specs (Computers & Tablets):** Items purchased are still characterized by high RAM (`ramSize=16GB`, `ramType=LPDDR5`), SSD/NVMe/UFS storage, and positive reviews.
5. **Stable Top Category Presence:** Electronics remain the focus of top purchase activity.
**Predicted Trend for Future:**
Based on these observations:
* **Continued Focus on Tablets/Phones:** These categories are expected to maintain high purchase volumes in future short time windows due to their presence and performance here.
* **Potential Continued Low Volume for Computers:** While consistent specs might persist, the lower volume observed suggests computers may not be as impactful drivers of purchases in this very short timeframe compared to the previous window or longer periods. However, tracking across more data points is recommended.
* **Review Score/Count Remains Important:** High scores and counts continue to correlate with purchased items (like SKUs 0379, 0244).
Response 5
- Super short response.
- Again, response formats are wildly different, even though the inputs are uniform.
Based on this current top-5 (all tablets) with similar attributes (M2/A16/A17 chips, high storage/RAM), the trend remains focused on **tablets** from specific brands like **Apple** and one **Samsung**, driven by strong performance specifications. Purchase quantities are lower than previous windows but still reflect sustained interest in these products during short time periods.
Response 6
- The LLM provides the insight I hoped it would see: “High demand is expected to continue for tablets.”
- It also suggests “high review counts/scores will continue to be popular.” This could be due to the sample skew or inference from limited data.
Based on the top-5 purchased SKUs from the current very short time window:
1. **Dominant Product Type:** Tablets significantly dominate this purchase period (4 out of 5 items).
2. **Key Attributes:** Items feature high SSD storage, LPDDR5 RAM, and good review scores/counts.
3. **Performance Specs:** High CPU specs (e.g., `cpuCores=10`, `gpuCores=12`) are associated with higher quantities (`qty=2`). Lower quantities (`qty=1` or `3`) correspond to items with slightly lower RAM (`ramSize=8GB`), older release dates, or different screen ratios.
4. **Stable Tablet Focus:** This reinforces the trend observed previously that tablets remain a primary focus for high purchase volume in short time windows.
**Predicted Trend:**
Based *strictly* on these recent purchases:
* High demand is expected to continue for **tablets**, particularly those with **high performance (e.g., many CPU/GPU cores) and large amounts of RAM/SSD storage.**
* Newer tablet models (`released`=2025-01-28, 2025-04-13) are likely to maintain strong purchase interest.
* Items with **high review counts/scores** will continue to be popular.
Conclusion
There’s a great opportunity to use Kafka Streams and LLMs together. The LLM is not for stream processing. Leverage operations outside of the stream threads for your LLM interactions.
Takeaways
- It is possible to use local LLMs for developer experience, exploration, and functional/integration tests.
- Leverage the OpenAI-compatible REST API for flexibility in model selection.
- Put in the effort to alter and replay conversations when experimenting with the best way to interact with LLMs.
- Reduce responses by learning the model and pruning output (for example, removing <think>…</think> sections).
- Capture your data so you can alter prompts and repeat.
- The better you know the data trends yourself, the easier it is to validate whether the LLM is seeing what you expect—and to notice when it finds something you did not expect.
Next Steps
This is just a start to get you thinking about how you can leverage the data in your Kafka Streams applications with LLMs. My next considerations are:
- Combining state stores across Kafka Streams instances for LLM analysis.
- Performing tokenization analysis to reduce token spend.
- Exploring fine-tuning a model with product information and comparing the base model with the tuned model.
Reach out
Please contact us if you have ideas or suggestions.
