Real-Time Equity Streaming and Clustering Pipeline
We built an Apache Kafka and PySpark MLlib pipeline that streams daily OHLCV bars into 55 per-ticker topics, then clusters a single ticker's price series with KMeans and projects the result to 2D with PCA.
Streaming topics: 55
Window: Apr to Sep 2023
Chosen K: 4
Silhouette @ K=4: 0.74
A streaming layer for 55 tickers, plus a clustering experiment on top
We bootstrapped a single-broker Apache Kafka 3.3.1 cluster with ZooKeeper from a notebook and registered 55 per-ticker topics (stock_AAPL, stock_AMZN, and so on for the US large-cap universe). A KafkaProducer streamed daily OHLCV data (April to September 2023, fetched with yfinance) into those topics so that downstream consumers could subscribe per ticker without re-fetching.
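To make the per-ticker topic layout concrete, here is a minimal sketch of how a producer could publish bars, assuming the kafka-python client. Only the stock_<TICKER> naming comes from the project; the helper names (topic_for, encode_bar, publish_bars, make_producer), the JSON encoding, and the localhost:9092 broker address are illustrative assumptions, not the notebook's actual code.

```python
import json

TOPIC_PREFIX = "stock_"  # matches the stock_AAPL / stock_AMZN naming


def topic_for(ticker: str) -> str:
    """Map a ticker symbol to its per-ticker topic name."""
    return f"{TOPIC_PREFIX}{ticker.upper()}"


def encode_bar(bar: dict) -> bytes:
    """Serialize one daily OHLCV bar as UTF-8 JSON (assumed wire format)."""
    return json.dumps(bar, sort_keys=True).encode("utf-8")


def publish_bars(producer, ticker: str, bars: list) -> int:
    """Send every bar to the ticker's topic and return how many were sent."""
    topic = topic_for(ticker)
    for bar in bars:
        producer.send(topic, value=encode_bar(bar))
    producer.flush()  # block until buffered records have been handed off
    return len(bars)


def make_producer(bootstrap_servers: str = "localhost:9092"):
    """Build a kafka-python producer.

    The import is deferred so the helpers above can be used and tested
    without a running broker. The address is a hypothetical default.
    """
    from kafka import KafkaProducer

    return KafkaProducer(bootstrap_servers=bootstrap_servers)
```

With a broker running, publish_bars(make_producer(), "AAPL", bars) would write each bar to stock_AAPL. Passing the producer in as an argument keeps the publishing logic independent of the broker, which is convenient in a notebook.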
Scope: the streaming layer carries all 55 tickers, but the K-means experiment runs on a single ticker's price series. Wiring the Kafka source into the clustering pipeline is the next step.
K = 4 picked from three independent signals
The PySpark MLlib pipeline goes VectorAssembler → KMeans (seed=1) → PCA (k=2) → ClusteringEvaluator. We swept K from 2 to 10 and looked at three complementary criteria together:
- Silhouette score (cluster cohesion + separation)
- Within-cluster sum of squares, the classic Elbow test
- 2D PCA visualisation, to eyeball how clusters fragment
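Silhouette peaks at K = 3 and stays close at K = 4 (0.756 vs 0.740); the Elbow curve shows its largest drop between K = 2 and K = 3, followed by diminishing returns; and the PCA grid shows that K = 4 separates the price regimes cleanly without over-fragmenting. With the two metrics nearly tied between K = 3 and K = 4, the PCA view tipped the choice: we picked K = 4.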
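The sweep can be sketched in PySpark roughly as follows. This is an outline under stated assumptions rather than the notebook's code: the Open / High / Low feature columns are inferred from the 3D scatter described later, and the function names sweep_k and project_2d are hypothetical. The MLlib classes and parameters themselves (VectorAssembler, KMeans, PCA, ClusteringEvaluator, summary.trainingCost) are standard Spark 3.x API.

```python
K_VALUES = list(range(2, 11))  # the sweep range: K = 2 through 10
FEATURE_COLS = ["Open", "High", "Low"]  # assumed input columns


def sweep_k(df, k_values=K_VALUES, feature_cols=FEATURE_COLS, seed=1):
    """Return {k: (silhouette, wcss)} for a Spark DataFrame of prices."""
    # Imports are deferred so this module loads without Spark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    features = assembler.transform(df).cache()
    evaluator = ClusteringEvaluator(
        featuresCol="features",
        predictionCol="prediction",
        metricName="silhouette",  # squared Euclidean distance by default
    )

    results = {}
    for k in k_values:
        model = KMeans(k=k, seed=seed, featuresCol="features").fit(features)
        silhouette = evaluator.evaluate(model.transform(features))
        wcss = model.summary.trainingCost  # within-cluster sum of squares
        results[k] = (silhouette, wcss)
    return results


def project_2d(features_df):
    """Add a 2-component PCA projection (column pca_features) for plotting."""
    from pyspark.ml.feature import PCA

    pca = PCA(k=2, inputCol="features", outputCol="pca_features")
    return pca.fit(features_df).transform(features_df)
```

Collecting both metrics in one pass keeps the silhouette and Elbow curves aligned on the same fitted models, so the two charts compare like with like.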
Two interactive views of why K = 4
The silhouette and elbow curves below are rebuilt from the notebook outputs as interactive charts. Hover over any point for the exact value at that K. Amber rings mark K = 4, the chosen partition.
K selection (1 of 2)
Silhouette score across K
Higher is better.
How to read it: silhouette measures cluster cohesion plus separation. The global peak is at K = 3 (0.76) and K = 4 sits close behind (0.74, amber marker). Either is defensible from this metric alone, which is why we cross-checked against Elbow and the PCA grid before committing.
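For readers who want to see what the score computes, here is a from-scratch sketch on a tiny synthetic dataset (the four points are made up for illustration and are not project data). It uses plain Euclidean distance for readability, whereas Spark's ClusteringEvaluator defaults to squared Euclidean, so absolute values will differ from the chart.

```python
from collections import defaultdict
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette over all points, using plain Euclidean distance."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Synthetic toy data: two tight pairs, far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # about 0.90
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # negative
```

A score near 1 means points sit much closer to their own cluster than to the nearest other one; a negative score means the labels cut across the natural groups.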
K selection (2 of 2)
Within-cluster sum of squares (Elbow)
Lower means tighter clusters; the bend marks the natural K.
How to read it: WCSS always falls as K grows; what matters is where the curve bends. The drop from K = 2 to K = 3 is the biggest single step, and the curve flattens from K = 4 (amber marker) onwards. Beyond K = 4 each new cluster buys very little, which is the signature of over-fragmenting.
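The "biggest single step" reading can be made explicit by computing WCSS and the drop gained at each increase in K. The sketch below uses a tiny synthetic dataset (made up for illustration, not project data), and the helper names wcss and marginal_drops are hypothetical.

```python
from collections import defaultdict


def wcss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        centroid = [
            sum(p[d] for p in members) / len(members) for d in range(dims)
        ]
        total += sum(
            (p[d] - centroid[d]) ** 2 for p in members for d in range(dims)
        )
    return total


def marginal_drops(wcss_by_k):
    """Map each K to the WCSS reduction gained over the previous K."""
    ks = sorted(wcss_by_k)
    return {k: wcss_by_k[prev] - wcss_by_k[k] for prev, k in zip(ks, ks[1:])}


# Synthetic toy data: two tight pairs, far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
one_cluster = wcss(toy_points, [0, 0, 0, 0])   # 101.0
two_clusters = wcss(toy_points, [0, 0, 1, 1])  # 1.0
drops = marginal_drops({1: one_cluster, 2: two_clusters})  # {2: 100.0}
```

Applied to the sweep, the largest entry in the drops mapping corresponds to the steepest segment of the Elbow curve, and a run of small entries afterwards is the flattening described above.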
What the clusters actually look like
Two notebook outputs don't translate cleanly to interactive charts. The grid below shows the K-means assignment for K = 2 through 6, and the 3D scatter shows the final K = 4 partition in raw Open / High / Low price space.


Frameworks and infrastructure
Source code on GitHub.