ML Systems · High-Dimensional Data · Scalable Inference

Stellamaris
Nakacwa

Applied Research Scientist adapting computational intelligence (AI/ML) to real-world messiness: from efficient neural retrieval under network constraints, to training and inference for document layout intelligence, to sensor data modelling with advanced modelling methods. Eight years across national labs, academic research programs, and applied systems.

8
Years of research & engineering across
national labs, academia, applied ML
6+
Research outputs — 6 published works
+ 1 paper in preparation
3
Benchmarking retrieval architecture
under real-world constraints

Selected Work

Research & Projects

Three areas of sustained inquiry, each asking a variation of the same question: how do you make ML systems that actually work on messy, high-dimensional, real-world data, at inference time, under real constraints?

01 — Thesis (04.2026)

Algorithm Resilience Under System

"Is embedding algorithm architecture inherently resilient to traffic constraints — or does resilience have to be designed in, at a cost?"

Stress-tests HNSW and IVF approximate nearest neighbor (ANN) architectures across a controlled degradation ladder (baseline → bandwidth-constrained → high-latency → packet loss) that mirrors cloud throttling, cross-region lag, and edge node failures. The core contribution is a diagnostic sensitivity framework: HNSW is latency-sensitive because its traversal is sequential, while IVF is bandwidth-sensitive because it makes fewer, larger transfers. This gives engineers a principled vocabulary for architecture-to-environment matching. Energy per query is tracked as a first-class metric alongside recall and latency; the research asks which design decisions make retrieval systems inherently robust or fragile, and at what cost. In short: most research asks how to make retrieval faster. My thesis asks which retrieval architectures are inherently fragile to network conditions, and what that fragility costs in energy.
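The latency-versus-bandwidth distinction can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the thesis benchmark: it assumes index data is fetched over the network, models a query as a number of sequential round trips that each move a fixed payload, and ignores packet loss. Every name and number here (`AccessPattern`, `Network`, the hop counts, payload sizes, and link speeds) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    rtt_s: float           # round-trip time in seconds
    bandwidth_bps: float   # usable bandwidth in bytes per second


@dataclass(frozen=True)
class AccessPattern:
    round_trips: int       # sequential remote fetches per query
    bytes_per_trip: int    # payload moved by each fetch


def query_time(pattern: AccessPattern, net: Network) -> float:
    """Each fetch pays one round trip plus the payload transfer time."""
    per_trip = net.rtt_s + pattern.bytes_per_trip / net.bandwidth_bps
    return pattern.round_trips * per_trip


# Hypothetical access patterns: many small hops vs. a few large list scans.
HNSW_LIKE = AccessPattern(round_trips=200, bytes_per_trip=4_096)
IVF_LIKE = AccessPattern(round_trips=4, bytes_per_trip=8_000_000)

# Hypothetical network conditions for three rungs of a degradation ladder.
BASELINE = Network(rtt_s=0.0005, bandwidth_bps=1.25e9)
HIGH_LATENCY = Network(rtt_s=0.05, bandwidth_bps=1.25e9)
LOW_BANDWIDTH = Network(rtt_s=0.0005, bandwidth_bps=1.25e7)


def slowdown(pattern: AccessPattern, degraded: Network) -> float:
    """Query time under degradation relative to the baseline network."""
    return query_time(pattern, degraded) / query_time(pattern, BASELINE)


if __name__ == "__main__":
    for name, pattern in (("HNSW-like", HNSW_LIKE), ("IVF-like", IVF_LIKE)):
        print(
            f"{name}: high-latency x{slowdown(pattern, HIGH_LATENCY):.1f}, "
            f"low-bandwidth x{slowdown(pattern, LOW_BANDWIDTH):.1f}"
        )
```

With these made-up parameters, the HNSW-like pattern slows down far more on the high-latency link and the IVF-like pattern far more on the bandwidth-constrained link, which is the qualitative behaviour the sensitivity framework describes. Real measurements would depend on index configuration, caching, and batching.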

Neural Retrieval Architecture · Network Resilience · FAISS · H100, NSF ACCESS · Parallel Systems
Thesis forthcoming →

02 — Publication (In Preparation)

Differentiable ML Evaluation: A Numerical Conditioning Approach

"What if metric instability isn't a statistics problem — it's a numerical conditioning problem?"

Co-authored paper reframing binary classifier evaluation as a well-posedness problem. We extend the binormal ROC model into a unified differentiable manifold linking ROC, PR, and F₁ simultaneously, enabling threshold optimization via Brent's method, golden-section search, and RK4 on a smooth, analytically grounded surface. Four of five optimizers converged to identical optima within 10⁻⁶ tolerance. Bootstrap experiments show a >40% reduction in threshold variance under smoothing. Newton's method diverged, exposing the non-convex structure of empirical F₁ and the necessity of bounded search.
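A minimal sketch of the underlying idea, not the paper's implementation: under a binormal model (negative and positive scores each normally distributed), F₁ becomes a smooth function of the threshold, so bracketed or bounded one-dimensional optimizers can be applied directly. The distribution parameters, prevalence, bracket, and function names below are hypothetical; the optimizers are SciPy's `minimize_scalar` methods `brent`, `golden`, and `bounded`.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Hypothetical binormal parameters, chosen only for illustration.
MU_NEG, SD_NEG = 0.0, 1.0   # score distribution of the negative class
MU_POS, SD_POS = 2.0, 1.0   # score distribution of the positive class
PREVALENCE = 0.3            # assumed fraction of positives


def binormal_f1(t):
    """Analytic F1 at threshold t: predict positive when score > t."""
    tpr = norm.sf(t, loc=MU_POS, scale=SD_POS)  # true positive rate
    fpr = norm.sf(t, loc=MU_NEG, scale=SD_NEG)  # false positive rate
    # F1 = 2TP / (2TP + FP + FN), written in terms of rates and prevalence.
    return 2 * PREVALENCE * tpr / (
        PREVALENCE * tpr + (1 - PREVALENCE) * fpr + PREVALENCE
    )


def neg_f1(t):
    return -binormal_f1(t)


def optimise_threshold(method):
    """Maximise F1 with one of SciPy's scalar minimisers."""
    if method == "bounded":
        res = minimize_scalar(neg_f1, bounds=(-4.0, 6.0), method="bounded")
    else:
        # Brent and golden-section take a bracket (a, b, c) with f(b) lowest.
        res = minimize_scalar(neg_f1, bracket=(-4.0, 1.0, 6.0), method=method)
    return float(res.x)


thresholds = {m: optimise_threshold(m) for m in ("brent", "golden", "bounded")}

if __name__ == "__main__":
    for method, t in thresholds.items():
        print(f"{method:8s} threshold={t:.6f}  F1={binormal_f1(t):.6f}")
```

Because the search is restricted to a bracket or interval on a smooth surface, the three methods should land on essentially the same threshold here. An unbounded Newton iteration has no such safeguard.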

Numerical Analysis · ML Evaluation · ROC / PR Curves · Optimization · Binormal Model
Preprint forthcoming →

03 — Applied ML Engineering

AIPDF2Table: Structured Extraction from Unstructured Documents

"How do you turn a decade of PDF reports into a queryable knowledge base?"

Built a production document intelligence system that extracts structured table data from heterogeneous PDF corpora, the kind of documents where layout, encoding, and schema vary unpredictably. The key research contribution is the disambiguation layer: handling merged cells, multi-header tables, and rotated layouts without a fixed template. Now used in real document pipelines.

Document Intelligence · NLP · Information Extraction · Python
View on GitHub →

04 — Program Leadership

YouthMappers: Open Geospatial Data at Scale

"How do you build AI/ML-ready datasets for domains where none exist?"

Expanded the open-source science curriculum and research program for YouthMappers through training, projects, and the creation of open geospatial datasets from diverse sources for policy and ML research. Developed pedagogical frameworks for ML + GIS methods and trained students at scale. The outputs (datasets, methods, and trained practitioners) are in active use in the research community. The work directly addressed training-data limitations and supported new ML/DL model development.

Open Data Ecosystems · Data-centric AI · Spatial Intelligence · Training Data Development
Program overview →

Ongoing Series

Parallel & Distributed Systems Notes

A chapter-by-chapter study of parallel computing applied to ML systems & large-scale data structures. 3 entries published, ongoing.

View Series →

Dynamic Resource

AI/ML Systems Research Links

Automatically curated papers and news on embedding systems, geospatial ML, and parallel computing. Updated weekly.

Explore Links →

Scholarly Work

Publications

Peer-reviewed contributions across energy systems informatics, geological data infrastructure, and applied learning frameworks.

2024

WELLBASE: A Standardized Data Infrastructure for Well Log Analytics

Geological, Oil & Gas data systems · Peer-reviewed

View →
2024

ROKBASE: Rock Sample Database for Imaging DL Applications

Imaging systems · Peer-reviewed

View →
2023

Lite Learning: A Lightweight Framework for Model Training in Resource-Constrained Environments

Model Training research · Peer-reviewed

View →
2025

Numerical Smoothing of Noisy Evaluation Surfaces: A Classical Approach to Robust ML Threshold Optimization

Nakacwa S., Luis P. · Harrisburg University · In preparation

Preprint forthcoming
2025

Thesis: Architecture Resilience Under Network Degradation: A Controlled Benchmarking Study of Embedding Retrieval Systems

Harrisburg University · In preparation

Forthcoming

Background

Experience

Eight years across mission-critical research, academic ML, and applied systems engineering, always working on the same class of problem from different vantage points: how do intelligent systems behave when the environment they were designed for stops cooperating?

Leidos / NETL 2023 – Present

AI/ML R&D Science and Engineering

Applies embedding methods and computer vision to categorize mineral regimes and establish a computational basis for energy resource evaluation. Designed mathematical models and data pipelines to reconstruct fragmented oil and gas records from disparate sources into a unified, queryable national science asset. Develops inference systems that make inaccessible document archives (PDFs, scanned reports, legacy formats) machine-readable, recovering decades of domain knowledge for advanced research.

Harrisburg University 2024 – 2026

Graduate Research

Computing Systems and Algorithms: benchmarking the robustness of retrieval algorithm architectures. Strengthening AI/ML and computing systems knowledge.

West Virginia University 2021 – 2022

Graduate Research Assistant

Research project on ML training data and large data system design. Addressed data losses caused by database schema variation in open-source data.

YouthMappers 2017 – 2023

Program Director, Regional Program Training

Expanded open science learning curricula for GIS and ML applications. Increased open geospatial dataset creation for communities. Program outputs support policy and AI/ML research.

Research Product 2024

Applied ML Research - Software Development

Produced a document intelligence utility for structured extraction from heterogeneous PDF corpora. Focused on software architecture and design for applying transformer models.

Policy Research - LANDnet 2018 – 2020

Geospatial Subject Matter Expert

Mapped and advised on schema variations and adaptations for land data mapping and national record digitalisation. Also covered data management, data processing, and software development.

Exploratory Work

Experiments & Projects

Algorithmic tests, benchmark studies, and data models: work that extends existing knowledge to real-world events.

Behavioral Signal Extraction from Mobility Data

Treats COVID-19 lockdown periods as a natural experiment — asking what population movement signals reveal about how policy propagates through behavior, and whether real-time observation changes the answer.

GitHub →

Infrastructure Adoption as a Spatial Problem

Models U.S. EV charging growth as a pattern spreading across geography, asking where adoption moves next and what the shape of that curve reveals about energy transition timelines.

GitHub →

Air Quality Geospatial Pipeline

Scrapy-based pipeline collecting EPA AirNow data for geospatial air quality analysis.

GitHub →

Time Series Forecasting Compendium

Comparative study of classical statistical forecasting models (ARIMA, Prophet) and deep learning forecasting models, examining where each architectural class breaks down and whether failure modes are predictable from the structure of the algorithm.

GitHub →

Document Intelligence - OCR

Early tests in recovering structured data from documents never intended for machines. A precursor to the AIPDF2Table tool, a research table extraction and processing pipeline for PDF and multi-format document corpora.

GitHub →

GPU Modernization of Floyd-Warshall

Extends and benchmarks the Floyd-Warshall all-pairs shortest-path algorithm on CPUs versus modern GPUs, examining where the architectural assumptions change. Part of the PCAM methodology study; results and benchmarks are in the repo.
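For reference, a minimal NumPy sketch of Floyd-Warshall (not the repository's code). The outer loop over the intermediate vertex `k` is inherently sequential, while the update of every `(i, j)` pair for a fixed `k` is data-parallel, which is the part a GPU kernel would typically spread across threads. The function name and the example graph are illustrative, and the sketch assumes a zero diagonal and no negative cycles.

```python
import numpy as np


def floyd_warshall(weights):
    """All-pairs shortest paths on a dense weight matrix.

    `weights[i][j]` is the edge weight from i to j, `np.inf` where no edge
    exists, and 0 on the diagonal.
    """
    dist = np.array(weights, dtype=float)  # work on a copy
    n = dist.shape[0]
    for k in range(n):  # sequential dependency: step k needs step k - 1
        # Data-parallel step: relax every (i, j) through vertex k at once
        # by broadcasting column k against row k.
        via_k = dist[:, k, None] + dist[None, k, :]
        np.minimum(dist, via_k, out=dist)
    return dist


# Illustrative 4-vertex directed graph.
INF = np.inf
GRAPH = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]

shortest = floyd_warshall(GRAPH)

if __name__ == "__main__":
    print(shortest)
```

The vectorized inner step maps naturally to one GPU kernel launch per `k`. The `n` sequential launches and the memory traffic of each full-matrix pass are the usual places where CPU-era assumptions about the algorithm stop holding.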

GitHub →

Get In Touch

Open to Research Collaborations
and Senior AI/ML Roles

I'm interested in research scientist and senior AI/ML engineering roles at organizations advancing knowledge or products in geometric data, earth systems, infrastructure-scale ML, or efficient retrieval.

Connect on LinkedIn · ResearchGate