Technology

Analysis Chart Examples:

Time Analysis

[Chart: Parallel Time Run; x-axis: cores (count)]
Analysis: Time vs Cores (what’s happening • what it shows • why it matters • website phrasing)
What’s happening
Processing time drops as core count increases — but not linearly. This is the classic diminishing-returns shape you expect when only part of the work can be parallelized and the remaining portion is effectively serial (or overhead-limited).
What this graph shows
  • At 1 core, you see the full end-to-end pathway execution time.
  • As cores increase, variable-time work can be discretized across processing elements.
  • The static portion of the pathway remains, so the curve never approaches zero.
  • Result: big early gains, then tapering improvement as additional cores mostly reduce the remaining variable-time fraction.
Why it matters
  • It demonstrates realistic scaling behavior (not idealized linear speedup).
  • It sets up the TALP story: separating static vs variable execution time enables prediction and control.
  • This time curve is the foundation for energy/carbon optimization (energy is driven by power × time).
Website-ready phrasing
TALPs separate static and variable execution time. Only variable-time work is discretized across cores, producing a predicted parallel time curve with realistic scaling behavior.
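The static/variable split described above can be sketched as a simple model. This is an illustrative sketch only, not MPT's actual prediction model; the constants `T_STATIC` and `T_VAR` are assumed example values, not measured data.

```python
# Illustrative sketch (assumed constants, not measured data): only the
# variable-time portion of a pathway is divided across cores, so the
# predicted curve approaches the static floor but never reaches zero.
T_STATIC = 2.0   # seconds of static (serial/overhead) pathway time
T_VAR = 18.0     # seconds of discretizable variable-time work

def predicted_time(cores: int) -> float:
    """Predicted parallel time: static floor plus divided variable work."""
    return T_STATIC + T_VAR / cores

# Big early gains, then tapering improvement toward the static floor.
curve = {c: predicted_time(c) for c in (1, 2, 4, 8, 16)}
```

The diminishing-returns shape falls out of the model directly: each doubling of cores halves only the remaining variable-time fraction.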

Space Analysis

[Chart: Parallel Space Run; x-axis: cores (count)]
Analysis: Space vs Cores (what’s happening • what it shows • why it matters • website phrasing)
What’s happening
Memory allocation stays flat as you increase cores. Parallelization changes scheduling and execution, but this TALP does not require additional memory replication to scale for this workload.
What this graph shows
  • Predicted memory usage remains ~constant across core counts.
  • Parallel execution does not introduce a growing per-core memory tax here.
  • This suggests the working set is dominated by a fixed footprint (or shared structures), rather than per-thread buffers.
Why it matters
  • Many parallel approaches increase memory pressure (buffers, queues, per-thread copies). This shows “scale without bloat.”
  • Flat space supports higher core counts without hitting cache/RAM constraints early.
Website-ready phrasing
This TALP’s predicted memory footprint stays flat as core count increases — parallel execution doesn’t require extra memory for the same workload.

Energy Analysis

[Chart: Serial Energy Run; x-axis: cores (count)]
Analysis: Energy vs Cores (what’s happening • what it shows • why it matters • website phrasing)
What’s happening
Total energy drops sharply early, then tapers. Energy is driven by power × time. Even if instantaneous power rises as cores activate, faster completion can reduce the total energy spent.
What this graph shows
  • Large early energy reductions because runtime collapses quickly with additional cores.
  • Later cores provide smaller gains as static time and overhead dominate the remaining runtime.
  • Small bumps reflect step changes in platform states or normal measurement/model noise.
Why it matters
  • The goal isn’t “max cores,” it’s minimum energy for the same output.
  • This supports selecting an operating point that minimizes cost and carbon while meeting performance constraints.
Website-ready phrasing
Total energy is predicted from power × time. Even as power rises with more cores, faster completion reduces overall energy — enabling an explicit energy-minimizing operating point.
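The power × time relationship can be made concrete with a sketch. The stepwise power bands and time constants below are assumed illustration values, not platform measurements.

```python
# Illustrative sketch (assumed values): energy = power x time. Power rises
# in discrete bands as core groups activate, but total energy can still
# fall because completion time shrinks faster.
T_STATIC, T_VAR = 2.0, 18.0

def predicted_time(cores: int) -> float:
    return T_STATIC + T_VAR / cores  # seconds

def predicted_power(cores: int) -> float:
    """Assumed stepwise power bands (watts) as core groups activate."""
    if cores <= 2:
        return 40.0
    if cores <= 8:
        return 65.0
    return 90.0

def predicted_energy(cores: int) -> float:
    return predicted_power(cores) * predicted_time(cores)  # joules
```

In this toy model the energy minimum lands at 8 cores, not the maximum: beyond that point the next power band costs more than the remaining time savings, which is exactly the "energy-minimizing operating point" argument.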

Power Analysis

[Chart: Parallel Power Run; x-axis: cores (count)]
Analysis: Power vs Cores (what’s happening • what it shows • why it matters • website phrasing)
What’s happening
Instantaneous power rises in steps as additional cores (and supporting subsystems) become active. Platforms tend to change power consumption in discrete “bands,” not smoothly.
What this graph shows
  • Step changes correspond to activation thresholds (core groups, frequency states, scheduling behavior).
  • Higher power is expected when using more processing elements — that’s the “finish sooner” trade.
Why it matters
  • Power alone isn’t the optimization target — energy is.
  • This explains why energy can drop even while power rises: the time integral shrinks.
Website-ready phrasing
Power increases stepwise as more processing elements activate. TALPs combine this with the time curve to optimize total energy, not just instantaneous watts.

Speedup Analysis

[Chart: Time Efficiency (Speedup); x-axis: cores (count)]
Analysis: Speedup (T1/Tcore) (what’s happening • what it shows • why it matters • website phrasing)
What’s happening
Speedup increases with core count but remains sublinear. That’s normal: the non-parallelizable (static) portion of the pathway, synchronization, and overhead prevent perfect linear scaling.
What this graph shows
  • A direct view of “how much faster” relative to the 1-core baseline.
  • Where the curve flattens is where extra cores stop paying off meaningfully.
  • This matches real production behavior more closely than idealized scaling.
Why it matters
  • Speedup bridges the performance story and the energy story (time reduction is what enables energy reduction).
  • It supports a simple claim: same code, faster completion — without forcing developers into a new parallel programming model.
Website-ready phrasing
Speedup is the predicted performance multiplier (T1/Tcore). Gains taper naturally as static work and overhead dominate — showing realistic, production-grade behavior.
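Speedup follows directly from the static/variable time model: it is bounded above by T1 / T_static. A minimal sketch, using the same assumed constants as the time example:

```python
# Illustrative sketch (assumed constants): speedup S(c) = T1 / Tc is
# sublinear by construction, because the static portion never shrinks.
T_STATIC, T_VAR = 2.0, 18.0

def predicted_time(cores: int) -> float:
    return T_STATIC + T_VAR / cores

def speedup(cores: int) -> float:
    """Predicted performance multiplier relative to the 1-core baseline."""
    return predicted_time(1) / predicted_time(cores)

# Hard ceiling as cores grow: T1 / T_STATIC = 10x here, never linear.
```

Where the curve flattens against that ceiling is exactly where extra cores stop paying off.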

Free-up Analysis

[Chart: Space Efficiency (Free-up); x-axis: cores (count)]
Analysis: Free-up (S1/Score) (what’s happening • what it shows • why it matters • website phrasing)
What’s happening
Free-up stays ~1.0, meaning memory usage does not decrease with more processing elements for this TALP. Parallelization improves time/energy here, not memory.
What this graph shows
  • If Free-up > 1, memory per run would drop as cores increase.
  • If Free-up ≈ 1 (this chart), memory is essentially unchanged.
  • If Free-up < 1, parallel execution would increase memory usage (not the case here).
Why it matters
  • It prevents overclaiming: not every workload yields memory savings.
  • It increases credibility by showing the analytics explain which resource improves for a given TALP.
Website-ready phrasing
Free-up measures how memory allocation changes with parallelism. A flat line means this TALP keeps the same memory footprint while improving time and energy.

Power-up Analysis

[Chart: Power-up; x-axis: cores (count)]
Analysis: Power-up (P1/Pcore) (what’s happening • what it shows • why it matters • website phrasing)
What’s happening
Power-up decreases in steps as cores increase. This indicates improved normalized power behavior as work is distributed — often reflecting real platform “efficiency bands” as scheduling and hardware states change.
What this graph shows
  • Normalized power behavior relative to the 1-core baseline.
  • Stepwise changes indicate thresholds where the platform shifts how it runs (core groups, frequency, or resource allocation).
  • Useful for explaining why “more cores” isn’t just “more watts” — it can also be more efficient execution.
Why it matters
  • It provides the missing link between raw power and energy outcomes.
  • Combined with the time curve, it helps identify the best operating point rather than chasing peak core count.
Website-ready phrasing
Power-up shows normalized power behavior as cores increase. Stepwise changes reflect real system efficiency thresholds — and help explain where energy savings come from.

Green-up Analysis

[Chart: Green-up; x-axis: cores (count)]
Analysis: Green-up (E1/Ecore) (what’s happening • what it shows • why it matters • website phrasing)
What’s happening
Green-up rises with core count, representing an energy-efficiency multiplier relative to the single-core baseline. Higher values mean less energy used to complete the same work.
What this graph shows
  • A direct “efficiency multiplier” view of the energy story: how much better energy usage gets as parallelism increases.
  • Where the curve plateaus indicates where more cores deliver limited additional efficiency.
  • The most business-friendly metric because it’s already normalized to the baseline.
Why it matters
  • This is the investor-friendly output: same output, less energy.
  • It maps cleanly to cost and carbon impact when paired with the right workload scenario.
Website-ready phrasing
Green-up is the energy-efficiency multiplier (E1/Ecore). Higher is better: you use less energy to complete the same workload as parallelism increases.
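The three normalized metrics (Free-up, Power-up, Green-up) share one shape: baseline value divided by the value at a given core count. A sketch on assumed example series (the numbers below are illustration values, not measurements):

```python
# Illustrative sketch: the X1/Xcore normalization behind Free-up, Power-up,
# and Green-up, applied to assumed example series indexed by core count.
def ratio_metric(baseline: float, at_cores: float) -> float:
    """Generic X1/Xcore ratio against the 1-core baseline."""
    return baseline / at_cores

space = {1: 512.0, 8: 512.0}     # MB: flat footprint across cores
power = {1: 40.0, 8: 65.0}       # W: rises as more cores activate
energy = {1: 800.0, 8: 276.25}   # J: falls because time shrinks faster

free_up = ratio_metric(space[1], space[8])     # ~1.0: same memory
power_up = ratio_metric(power[1], power[8])    # < 1.0: more watts drawn
green_up = ratio_metric(energy[1], energy[8])  # > 1.0: less total energy
```

Reading the three together tells the whole story for one TALP: which resource improves, which stays flat, and which is deliberately traded away.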

Note: Let's identify what we want to show here. It should be a TS/HTML animation on medium-speed repeat.

How Parallelism Is Created: CUDA vs OpenCL vs TALPs

Same serial algorithm. Three fundamentally different transformations (with CC + TALP extraction artifacts).

Serial Algorithm (same starting point)

function f(inputs) {
  if (mode) { … } else { … }
  for (i = 0; i < N; i++) {
    for (j = 0; j < M(i); j++) {
      y[i] = g(y[i], x[j])
    }
  }
}

Non-loop branch: if (mode)

Variable-time loop: M(i) depends on input attributes

Dependency example: updates to y[i] may require coordination

CUDA

Kernel + grid + explicit memory & sync

What you change

  • Choose thread mapping (iteration → thread)
  • Rewrite loops as GPU kernel(s)
  • Manage GPU memory transfers
  • Launch grid + synchronize

CUDA kernel artifact

Loop → tid
Sync/atomics
__global__ void k(...) {
  tid = blockIdx.x*blockDim.x
        + threadIdx.x;
  i = tid;
  if (i < N) { … }
}

Execution model

Grid → Blocks → Threads → GPU

Tuning: occupancy, shared memory, coalescing

Outcome

Parallelism: you mapped work to threads. Correctness: sync/atomics.

OpenCL

Kernel + NDRange + context/queue/buffers

What you change

  • Write kernel + choose NDRange geometry
  • Create context + command queue
  • Create buffers + explicit transfers
  • Enqueue kernel + barriers

OpenCL kernel artifact

Loop → gid
Queue/barriers
kernel void k(...) {
  gid = get_global_id(0);
  i = gid;
  if (i < N) { … }
}

Execution model

Host API → Queue → Device

NDRange → Work-groups → Work-items

Outcome

Parallelism: NDRange schedule. Correctness: memory + barrier discipline.

TALPs (TALPified Algorithm)

Functional decomposition → TALPs → graph → discretized instances

TALPify: extraction pipeline

  • Functional decomposition (all functions)
  • Cyclomatic complexity (CC) highlights hotspots
  • Extract execution pathways (TALPs) per function
  • Catalog inputs (dims/ranges) → build TALP graph

Functional decomposition tree (CC)

CC guides focus
  app():    CC 8
  parse():  CC 12
  update(): CC 18
  solve():  CC 34 (hotspot)

Functional decomposition exposes all functions and their cyclomatic complexity so TALP extraction can prioritize the highest-branching hotspots.

CC = branching pressure. Extract per function + pathway.
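Cyclomatic complexity can be approximated as "decision points + 1". A toy counting sketch over a flat token stream (real tools work on the AST/control-flow graph; the keyword set and token lists below are assumed examples):

```python
# Illustrative sketch: cyclomatic complexity as decision points + 1,
# counted over a toy token stream. The keyword set is an assumption.
DECISION_KEYWORDS = {"if", "for", "while", "case", "&&", "||"}

def cyclomatic_complexity(tokens: list) -> int:
    """Count branching constructs; a straight-line body scores 1."""
    return 1 + sum(1 for t in tokens if t in DECISION_KEYWORDS)

# A branch-heavy function scores higher and becomes the extraction hotspot.
solve_tokens = ["if", "for", "for", "if", "&&", "while", "if"]
parse_tokens = ["if", "for"]
```

Higher CC means more distinct execution pathways through the function, which is why CC ranks where TALP extraction should focus first.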

TALP selection table (catalog)

Function  | TALP | Inputs (dims / ranges)
solve()   | #2   | x[0..N], y[0..N], mode
update()  | #1   | a, b (scalar), cfg

TALP graph + discretization (what becomes parallel)

TALP-1: mode=true · TALP-2: mode=false (the non-loop conditional splits pathways)
Block A → Loop L1 → Block B → Loop L2 → Block C
Loop L2 is variable-time (input-driven)

Parallelization: discretize variable-time loops

Change loop start/end per TALP instance (slicing iteration space).

Discretize:
  Instance 0: i = 0..k0
  Instance 1: i = k0+1..k1
Cross-communication occurs only if a dependency rule triggers; analytics determine the changed loop start/end.

Outcome

Parallelism: discretized loop instances. CC + pathway objects enable automation.
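The "change loop start/end per TALP instance" transformation above can be sketched as a slicing function. Even partitioning is an assumption here; the real planner chooses bounds from the prediction analytics.

```python
# Illustrative sketch: slice an iteration space 0..n-1 into per-instance
# inclusive (start, end) bounds. Even partitioning is assumed; the actual
# planner is analytics-driven.
def discretize(n: int, instances: int) -> list:
    """Return inclusive (start, end) iteration bounds per TALP instance."""
    base, extra = divmod(n, instances)
    bounds, start = [], 0
    for k in range(instances):
        size = base + (1 if k < extra else 0)  # spread the remainder
        bounds.append((start, start + size - 1))
        start += size
    return bounds

# Each instance runs the same loop body; only start/end change.
```

Every iteration lands in exactly one instance, so the transformation preserves semantics as long as the dependency rule does not trigger cross-communication.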

Technology

TALPs: parallel performance, without parallel programming

Use investor-safe language first, bridge the concept, then go deep for developers.

1. Investor-safe definition

Time-Affecting Linear Pathways (TALPs) are execution paths within software that determine total runtime. Massively Parallel’s technology automatically identifies independent TALPs and executes them concurrently.

  • Software-only approach on existing CPUs
  • Works with real, existing codebases
  • Enables scalable concurrent execution
2. Bridging sentence

TALPs are not tasks or threads — they are the time-critical execution pathways already present in your software.

This is the key reframing: most systems talk about splitting work. TALPs focus on how time flows through the program, and where it can safely flow in parallel.

Conceptual time overlap

[Diagram: Sequential (one long time-dominant pathway) vs Concurrent TALPs (TALP A and TALP B overlap in time)]
3. Developer-safe definition

A TALP is a linear chain of dependent operations whose execution time contributes to overall latency. Independent TALPs can be scheduled concurrently while preserving program semantics.

Developer implications

  • Parallelism emerges from pathway independence (not manual task creation)
  • Synchronization only at necessary merge points
  • Works beyond loop-level parallelism (control flow + call graphs)

Conceptual TALP form (illustrative)

TALP A: compute a[i] from x[i]
TALP B: compute b[i] from y[i]
JOIN(A, B) -> TALP C: fuse(a[i], b[i])
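The conceptual form above can be sketched with ordinary concurrency primitives: independent TALPs A and B overlap in time, and synchronization happens only at the merge point. The element-wise bodies below are assumed stand-ins for real pathway work, and this is a concurrency-semantics illustration, not MPT's runtime.

```python
# Illustrative sketch: run independent TALPs A and B concurrently, then
# join into TALP C. Function bodies are assumed stand-ins.
from concurrent.futures import ThreadPoolExecutor

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

def talp_a(xs):           # TALP A: compute a[i] from x[i]
    return [v * v for v in xs]

def talp_b(ys):           # TALP B: compute b[i] from y[i]
    return [v + 1 for v in ys]

def talp_c(a, b):         # JOIN(A, B) -> TALP C: fuse(a[i], b[i])
    return [ai + bi for ai, bi in zip(a, b)]

with ThreadPoolExecutor() as pool:
    fa = pool.submit(talp_a, x)  # A and B are independent pathways
    fb = pool.submit(talp_b, y)  # so they may overlap in time
    result = talp_c(fa.result(), fb.result())  # sync only at the merge
```

The key property is that correctness depends only on the join, not on any ordering between A and B.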

Want the full “code → TALPification” flow (analysis, pathway discovery, execution-aware optimization)?

Technology

Why Existing Parallel Models Fall Short

Most parallel approaches require developers to restructure software around a parallel model—loops, kernels, or explicit tasks. That works for some workloads, but it breaks down across real applications: irregular control flow, evolving codebases, and end-to-end performance.

TALPs (Time-Affecting Linear Pathways) take a different approach: instead of asking you to define parallel work, we identify the time-critical execution pathways already present in your code and run independent pathways concurrently.

Traditional models start with structure

They assume you can express the program as a set of loops, kernels, or tasks. If your software doesn’t naturally fit, you rewrite until it does.

Real software starts with time

Performance bottlenecks often come from execution pathways that span functions, branches, and calls—not a single loop you can “pragma.”

TALPs start with time, not structure

TALPification makes those pathways explicit and executes independent TALPs concurrently, synchronizing only where dependencies require merge points.

TALPs vs common parallel models

A practical comparison of what each model is good at—and why TALPs are designed to generalize across real codebases.

OpenMP / Pragmas

Common approach

Best for

Regular loops and clearly parallel regions.

Why it falls short

Requires developers to annotate code and reason about correctness, granularity, and data-sharing. It’s often loop-centric and struggles with irregular control flow and application-wide optimization.

TALP advantage

TALPs are discovered automatically across functions and control flow. Concurrency emerges from pathway independence—not manual directives—so you parallelize more than loops and reduce maintenance risk.

GPU Offload (CUDA / OpenCL)

Common approach

Best for

Highly data-parallel kernels with predictable memory access.

Why it falls short

Demands algorithm refactoring and a separate programming model. Performance depends heavily on memory movement and kernel structure, and many real-world codes don’t map cleanly without significant rewrite.

TALP advantage

TALPs focus on time-critical pathways in the existing program and enable concurrent execution on CPUs today. You get scalable parallelism without re-architecting code around accelerators.

Task Frameworks (TBB / Cilk / OpenMP Tasks)

Common approach

Best for

Applications already structured as tasks or pipelines.

Why it falls short

You still have to design the task decomposition and dependency structure. Too-coarse tasks waste cores; too-fine tasks add overhead. The model is explicit and can be brittle as code evolves.

TALP advantage

TALPs make the dependency/merge points implicit in the program’s time structure. The system extracts concurrency and can tune pathway granularity without forcing developers to rewrite software as a task graph.

Manual Threads (pthreads / std::thread)

Common approach

Best for

Low-level control in expert hands.

Why it falls short

Hard to get right, expensive to maintain, and easy to regress. Locking, races, false sharing, and portability issues consume engineering time and limit scaling.

TALP advantage

TALPs replace manual concurrency management with automatic pathway discovery and scheduling. You preserve program semantics while letting the system exploit concurrency safely and repeatedly as the code changes.

What “better” means with TALPs

  • No requirement to restructure code into loops/kernels/tasks
  • Concurrency discovered across control flow and call graphs
  • Synchronization only at necessary merge points
  • Parallelization that remains valid as the code evolves

The one-line differentiation

Tasks and threads describe work you create. TALPs describe time that already exists.

That’s why TALPs can generalize beyond “easy” parallelism and unlock concurrency across real applications without forcing a new programming model.

TECHNOLOGY

TALPs Demo Flow

A canonical walkthrough: import code safely, decompose into real execution pathways, generate predictive analytics, then transparently parallelize and execute toward explicit goals.

Step 01: CONTEXT

Login & Hardware Context

Anchor every prediction and optimization to the real machine.

WHAT HAPPENS

  • User signs in to the TALP system UI

  • Hardware + core configuration is explicit (not abstract)

Step 02: TRUST

Repo Import (Clone-Only)

Safety + auditability: work happens on a cloned artifact.

WHAT HAPPENS

  • Select repo item + dataset + output artifact name

  • Original source is not mutated

Step 03: DISCOVERY

Auto Decomposition & Profiling

Turn “code” into selectable execution pathways + real measurements.

WHAT HAPPENS

  • Functional decomposition + call structure

  • Runs: serial, standard parallel, persistent parallel

  • Dataset variables + ranges are characterized

Step 04: PREDICTION

Predictive Model Generation

Predict behavior before spending compute.

WHAT HAPPENS

  • Predict time + space for specific input ranges

  • Extend to energy/cost/carbon accounting

Step 05: TRANSPARENCY

Transparent Source Diff

Nothing is hidden—every change is inspectable.

WHAT HAPPENS

  • Side-by-side original vs TALP-augmented source

  • Visual highlight of changed vs unchanged logic

Step 06: STRUCTURE

Functional Decomposition + Complexity

Structure drives parallelism; complexity surfaces hotspots.

WHAT HAPPENS

  • Function tree + cyclomatic complexity annotations

  • Guides where parallel structure is most valuable

Step 07: PATHWAYS

TALP Decomposition (Execution Pathways)

Programs don’t run “the code”—they run one pathway.

WHAT HAPPENS

  • Select a TALP (pathway) explicitly

  • See exact blocks, order, and variables on that path

Step 08: VALIDATION

Input Variable Analysis & Test Generation

Ranges → cases → correctness + scaling behavior checks.

WHAT HAPPENS

  • Track pathway-driving input attributes

  • Auto-generate tests across valid ranges

Step 09: CONTROL

Goal-Driven Execution

Optimize toward performance or energy.

WHAT HAPPENS

  • User selects goal + constraints (e.g., max cores)

  • System chooses core count (not always “max”)

  • Run + report deltas (time/energy/cost)


KEY IDEA

TALPs make execution pathways explicit, generate predictive analytics from real runs, and use those predictions to choose parallel strategies and resource levels aligned to performance or sustainability goals.
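The "choose resource levels aligned to goals" step can be sketched as a constrained search over core counts: minimize predicted energy subject to a time budget. Model constants and power bands below are assumed illustration values, not measured data.

```python
# Illustrative sketch: goal-driven core selection. Pick the core count
# that minimizes predicted energy while meeting a time budget.
# Constants and power bands are assumed examples.
T_STATIC, T_VAR = 2.0, 18.0

def predicted_time(c):
    return T_STATIC + T_VAR / c

def predicted_power(c):
    return 40.0 if c <= 2 else (65.0 if c <= 8 else 90.0)

def predicted_energy(c):
    return predicted_power(c) * predicted_time(c)

def choose_cores(max_cores: int, time_budget: float) -> int:
    """Lowest-energy core count whose predicted time meets the budget."""
    feasible = [c for c in range(1, max_cores + 1)
                if predicted_time(c) <= time_budget]
    return min(feasible, key=predicted_energy)
```

In this toy model, given 16 cores and a 5-second budget, the chosen operating point is 8 cores, which illustrates the "system chooses core count (not always max)" behavior from step 09.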

TECHNOLOGY

Code → TALPification

MPT transforms existing code into predictable, parallel execution through a time-centric pipeline: structural analysis, live predictive analytics, TALP-based transformation, and cost-aware optimization.

01

Functional Decomposition

Turn a codebase into a complete, navigable structure of functions and relationships.

  • Extracts every function and call relationship across the project.

  • Highlights hotspots instantly with cyclomatic complexity per function.

  • Surfaces dependency variables and control structure to guide optimization.

  • Creates the structural model used by the rest of the pipeline.

WHAT YOU GET

A full function tree + maintainability signals (complexity) + dependency context—without manual tracing.

Next: TALP definition & why it matters (separate section/component)

This section describes the end-to-end pipeline (analysis → prediction → transformation). A separate section should define TALPs and explain why time-centric pathways matter.

TALPs: Time-Affecting Linear Pathways

Parallelism is derived from execution pathways + prediction polynomials (not pragmas, kernels, or task graphs).

Canonical TALP pipeline (analysis → planning → execution)

  • Source Code: C/C++/Rust/Go/…; no annotations required.
  • Functional Decomposition: extract functions + complexity; reveal hotspots instantly.
  • TALP Extraction: identify execution pathways, including variable-time loops.
  • TALP Analyzer: generate prediction polynomials; time / space / value analytics.
  • Execution Planner: discretization + resource selection, per dataset + per target environment; objective: speed, energy, or both.
  • TALP Execution Engine: run discretized TALPs in parallel with correctness + cross-communication rules, on CPU / GPU / multi-server.
  • Results + Validation: predicted vs measured; speedup + energy + cost; repeatable + explainable.

No #pragma omp, no CUDA kernels: parallelism is derived from pathways + analytics. Dataset-aware planning chooses cores/partitions to hit speed, energy, and cost targets.

TALPs vs OpenMP vs CUDA/OpenCL (differentiation diagram)

Same goal (parallel speedup). Different source of truth for “how to parallelize.”

  • OpenMP (manual): parallelism declared by the programmer; fork/join regions + scheduling hints; correctness depends on human intent; performance varies by compiler/runtime.
  • CUDA / OpenCL (manual): programmer writes kernels + manages memory; explicit grid/block/thread mapping; data movement is a first-class concern; great when you can rewrite for GPU.
  • TALPs (derived): parallelism derived from execution pathways; prediction polynomials guide execution planning; dataset-aware + architecture-aware decisions; automatic discretization + runtime orchestration.

Source of parallelism: pragmas + programmer decisions (OpenMP) · kernels + explicit mapping (CUDA/OpenCL) · pathways + analytics/polynomials (TALPs).
Typical artifact: #pragma omp parallel for + schedule clauses, reductions, etc. (OpenMP) · __global__ kernel(...) + buffers, copies, launch parameters (CUDA/OpenCL) · TALP graph + analytics → execution plan → parallel runtime (TALPs).

A Program is Many Pathways (conceptual “what is a TALP” diagram)

Conditionals select the pathway; loops determine how time varies for that pathway.

  • Function: process(data) with if (mode == A) … / else if (mode == B) … / else …; same function, different pathway.
  • TALP-1 (Pathway A): includes loops as part of the pathway; variable-time loops depend on input attributes.
  • TALP-2 (Pathway B): different sequence of code blocks; different time/space behavior.
  • TALP-3 (Pathway C): still linear end-to-end for that pathway.
  • Each pathway mixes static loops and variable-time loops.

Prediction Polynomials → Execution Planning (“why it matters” diagram)

TALPs turn “how long will this take?” into an actionable plan for parallel execution.

  • TALP Analyzer output: a prediction polynomial (time vs input attribute) for each pathway; time / space / value models; built from measured execution behavior; reusable + explainable.
  • Execution Planner: discretize variable-time loops; choose cores/partitions per target; objective: speed / energy / cost; output: a concrete execution plan (cores, energy, cost).
  • Runtime: run the plan on the target (CPU/GPU); enforce correctness; measure + validate; feed results back.
  • Feedback loop: predicted vs measured → refine analytics.

TALPs: Time-Affecting Linear Pathways (Derived + Selected)

Same semantics. Different execution plan. Optimized for target hardware, expected data behavior, and goals.

  • Input: original code (canonical semantics) → analyze.
  • TALPification engine: proprietary functional decomposition → derive TALP candidates (time-affecting linear pathways) → predictive analytics (machine-level performance model) → configure execution plan (architecture + goals + constraints).
  • Target inputs: target architecture (CPU/ISA → core-count options; memory topology → cache/NUMA); optimization goals (max speed, min energy, balanced); expected data behavior (structure, e.g. sparsity; locality; distributions); constraints + parameters.
  • Candidate TALPs (evaluated): TALP A, locality-optimized path (predicted time, predicted energy, locality); TALP B, parallel plan with N cores (speedup curve, diminishing returns); TALP C, efficiency plan, cost-aware (energy/cost, throughput per watt, target budget).
  • Selection: the plan that best fits hardware + data + goals.
  • Output: TALPified code (same semantics, different execution) plus the selected execution plan: time restructured, chosen core count, locality/scheduling, configuration.

TALPs aren't more threads. They're linear pathways discovered and selected to reshape time on a real machine.