Technology | Massively Parallel

Technology

Analysis Chart Examples:

Time Analysis

[Chart: Parallel Time Run; x-axis: core count]

Analysis: Time vs Cores (what’s happening • what it shows • why it matters • website phrasing)

What’s happening

Processing time drops as core count increases — but not linearly. This is the classic diminishing-returns shape you expect when only part of the work can be parallelized and the remaining portion is effectively serial (or overhead-limited).

What this graph shows

At 1 core, you see the full end-to-end pathway execution time.
As cores increase, variable-time work can be discretized across processing elements.
The static portion of the pathway remains, so the curve flattens toward the static time instead of approaching zero.
Result: big early gains, then tapering improvement as additional cores mostly reduce the remaining variable-time fraction.
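
The shape described above can be sketched in a few lines, assuming the simplest possible model: static time is paid once, and variable time divides evenly across cores with no added overhead. The function name and the 2 s / 8 s split are illustrative assumptions, not measured values.

```python
def predicted_time(cores: int, static_s: float, variable_s: float) -> float:
    """Static time is paid once; variable time is split evenly across cores."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return static_s + variable_s / cores


# Hypothetical pathway: 2 s of static work, 8 s of variable-time work.
for n in (1, 2, 4, 8, 16):
    print(n, predicted_time(n, static_s=2.0, variable_s=8.0))
```

In this sketch, going from 1 to 2 cores saves 4 s, while going from 8 to 16 cores saves only 0.5 s; the curve flattens toward the 2 s static time.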

Why it matters

It demonstrates realistic scaling behavior (not idealized linear speedup).
It sets up the TALP story: separating static vs variable execution time enables prediction and control.
This time curve is the foundation for energy/carbon optimization (energy is driven by power × time).

Website-ready phrasing

TALPs separate static and variable execution time. Only variable-time work is discretized across cores, producing a predicted parallel time curve with realistic scaling behavior.

Space Analysis

[Chart: Parallel Space Run; x-axis: core count]

Analysis: Space vs Cores (what’s happening • what it shows • why it matters • website phrasing)

What’s happening

Memory allocation stays flat as you increase cores. Parallelization changes scheduling and execution, but this TALP does not require additional memory replication to scale for this workload.

What this graph shows

Predicted memory usage remains ~constant across core counts.
Parallel execution does not introduce a growing per-core memory tax here.
This suggests the working set is dominated by a fixed footprint (or shared structures), rather than per-thread buffers.

Why it matters

Many parallel approaches increase memory pressure (buffers, queues, per-thread copies). This shows “scale without bloat.”
Flat space supports higher core counts without hitting cache/RAM constraints early.

Website-ready phrasing

This TALP’s predicted memory footprint stays flat as core count increases — parallel execution doesn’t require extra memory for the same workload.

Energy Analysis

[Chart: Serial Energy Run; x-axis: core count]

Analysis: Energy vs Cores (what’s happening • what it shows • why it matters • website phrasing)

What’s happening

Total energy drops sharply early, then tapers. Energy is driven by power × time. Even if instantaneous power rises as cores activate, faster completion can reduce the total energy spent.

What this graph shows

Large early energy reductions because runtime collapses quickly with additional cores.
Later cores provide smaller gains as static time and overhead dominate the remaining runtime.
Small bumps reflect step changes in platform states or normal measurement/model noise.
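
As a rough illustration of energy = power × time, the sketch below multiplies a predicted runtime by an average power draw for each core count. Both tables are hypothetical numbers chosen for the example, not values read from the chart.

```python
# Hypothetical per-core-count predictions (illustrative, not measured data).
time_s = {1: 10.0, 2: 6.0, 4: 4.0, 8: 3.0}      # runtime in seconds
power_w = {1: 20.0, 2: 24.0, 4: 30.0, 8: 45.0}  # average power in watts


def energy_joules(cores: int) -> float:
    """Energy = average power x runtime for the given core count."""
    return power_w[cores] * time_s[cores]


energy = {n: energy_joules(n) for n in time_s}
for n in sorted(energy):
    print(f"{n} cores: {power_w[n]} W x {time_s[n]} s = {energy[n]} J")
```

With these made-up values, power rises at every step, yet energy falls from 200 J at 1 core to 120 J at 4 cores. It then climbs to 135 J at 8 cores, once the runtime gains no longer offset the extra power.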

Why it matters

The goal isn’t “max cores,” it’s minimum energy for the same output.
This supports selecting an operating point that minimizes cost and carbon while meeting performance constraints.

Website-ready phrasing

Total energy is predicted from power × time. Even as power rises with more cores, faster completion reduces overall energy — enabling an explicit energy-minimizing operating point.

Power Analysis

[Chart: Parallel Power Run; x-axis: core count]

Analysis: Power vs Cores (what’s happening • what it shows • why it matters • website phrasing)

What’s happening

Instantaneous power rises in steps as additional cores (and supporting subsystems) become active. Platforms tend to change power consumption in discrete “bands,” not smoothly.

What this graph shows

Step changes correspond to activation thresholds (core groups, frequency states, scheduling behavior).
Higher power is expected when using more processing elements — that’s the “finish sooner” trade.

Why it matters

Power alone isn’t the optimization target — energy is.
This explains why energy can drop even while power rises: the time integral shrinks.

Website-ready phrasing

Power increases stepwise as more processing elements activate. TALPs combine this with the time curve to optimize total energy, not just instantaneous watts.

Speedup Analysis

[Chart: Time Efficiency; x-axis: core count]

Analysis: Speedup (T1/Tcore) (what’s happening • what it shows • why it matters • website phrasing)

What’s happening

Speedup increases with core count but remains sublinear. That’s normal: the non-parallelizable (static) portion of the pathway, synchronization, and overhead prevent perfect linear scaling.

What this graph shows

A direct view of “how much faster” relative to the 1-core baseline.
Where the curve flattens is where extra cores stop paying off meaningfully.
This matches real production behavior more closely than idealized scaling.
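
A minimal sketch of the speedup calculation, using hypothetical predicted times. The efficiency helper (speedup divided by core count) is a standard companion metric added here for illustration; it is not one of the charts on this page.

```python
def speedup(t1: float, tn: float) -> float:
    """Speedup relative to the 1-core baseline: T1 / Tn."""
    return t1 / tn


def efficiency(t1: float, tn: float, cores: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(t1, tn) / cores


# Hypothetical predicted runtimes in seconds, keyed by core count.
times = {1: 10.0, 2: 6.0, 4: 4.0, 8: 3.0, 16: 2.5}
for n, t in times.items():
    print(n, round(speedup(times[1], t), 2), round(efficiency(times[1], t, n), 2))
```

In this example 16 cores give a 4.0x speedup at an efficiency of 0.25, which is the flattening the chart describes.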

Why it matters

Speedup bridges the performance story and the energy story (time reduction is what enables energy reduction).
It supports a simple claim: same code, faster completion — without forcing developers into a new parallel programming model.

Website-ready phrasing

Speedup is the predicted performance multiplier (T1/Tcore). Gains taper naturally as static work and overhead dominate — showing realistic, production-grade behavior.

Free-up Analysis

[Chart: Space Efficiency; x-axis: core count]

Analysis: Free-up (S1/Score) (what’s happening • what it shows • why it matters • website phrasing)

What’s happening

Free-up stays ~1.0, meaning memory usage does not decrease with more processing elements for this TALP. Parallelization improves time/energy here, not memory.

What this graph shows

If Free-up > 1, memory per run would drop as cores increase.
If Free-up ≈ 1 (this chart), memory is essentially unchanged.
If Free-up < 1, parallel execution would increase memory usage (not the case here).
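
The three cases above can be expressed as a small helper. This is a sketch only: the function names and the 5% tolerance used to decide what counts as "approximately 1" are assumptions for illustration.

```python
def free_up(s1_bytes: float, sn_bytes: float) -> float:
    """Free-up = memory at 1 core / memory at n cores (S1 / Sn)."""
    return s1_bytes / sn_bytes


def interpret(ratio: float, tol: float = 0.05) -> str:
    """Map a Free-up ratio onto the three cases; tol is an assumed threshold."""
    if ratio > 1 + tol:
        return "memory per run drops"
    if ratio < 1 - tol:
        return "memory per run grows"
    return "memory essentially unchanged"


print(interpret(free_up(512.0, 512.0)))   # flat footprint, as in this chart
print(interpret(free_up(512.0, 256.0)))   # footprint halves
print(interpret(free_up(512.0, 1024.0)))  # footprint doubles
```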

Why it matters

It prevents overclaiming: not every workload yields memory savings.
It increases credibility by showing the analytics explain which resource improves for a given TALP.

Website-ready phrasing

Free-up measures how memory allocation changes with parallelism. A flat line means this TALP keeps the same memory footprint while improving time and energy.

Power-up Analysis

[Chart: Power-up; x-axis: core count]

Analysis: Power-up (P1/Pcore) (what’s happening • what it shows • why it matters • website phrasing)

What’s happening

Power-up decreases in steps as cores increase. This indicates improved normalized power behavior as work is distributed — often reflecting real platform “efficiency bands” as scheduling and hardware states change.

What this graph shows

Normalized power behavior relative to the 1-core baseline.
Stepwise changes indicate thresholds where the platform shifts how it runs (core groups, frequency, or resource allocation).
Useful for explaining why “more cores” isn’t just “more watts” — it can also be more efficient execution.

Why it matters

It provides the missing link between raw power and energy outcomes.
Combined with the time curve, it helps identify the best operating point rather than chasing peak core count.

Website-ready phrasing

Power-up shows normalized power behavior as cores increase. Stepwise changes reflect real system efficiency thresholds — and help explain where energy savings come from.

Green-up Analysis

[Chart: Green-up; x-axis: core count]

Analysis: Green-up (E1/Ecore) (what’s happening • what it shows • why it matters • website phrasing)

What’s happening

Green-up rises with core count, representing an energy-efficiency multiplier relative to the single-core baseline. Higher values mean less energy used to complete the same work.

What this graph shows

A direct “efficiency multiplier” view of the energy story: how much better energy usage gets as parallelism increases.
Where the curve plateaus indicates where more cores deliver limited additional efficiency.
The most business-friendly metric because it’s already normalized to the baseline.
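
Because energy is average power × time, the three normalized metrics are linked: Green-up (E1/Ecore) equals Speedup (T1/Tcore) multiplied by Power-up (P1/Pcore). The sketch below checks that identity with hypothetical numbers; the function name and input values are illustrative.

```python
def ratios(t1: float, p1: float, tn: float, pn: float):
    """Return (speedup, power_up, green_up) relative to the 1-core baseline."""
    speedup = t1 / tn                  # T1 / Tn
    power_up = p1 / pn                 # P1 / Pn
    green_up = (p1 * t1) / (pn * tn)   # E1 / En, with E = P x T
    return speedup, power_up, green_up


# Hypothetical: 10 s at 20 W on 1 core, 4 s at 30 W on n cores.
s, p, g = ratios(t1=10.0, p1=20.0, tn=4.0, pn=30.0)
assert abs(g - s * p) < 1e-12  # Green-up = Speedup x Power-up
print(round(s, 3), round(p, 3), round(g, 3))
```

Here power rises (Power-up is below 1), but the speedup is large enough that Green-up still ends up above 1.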

Why it matters

This is the investor-friendly output: same output, less energy.
It maps cleanly to cost and carbon impact when paired with the right workload scenario.

Website-ready phrasing

Green-up is the energy-efficiency multiplier (E1/Ecore). Higher is better: you use less energy to complete the same workload as parallelism increases.

How Parallelism Is Created: CUDA vs OpenCL vs TALPs

Same serial algorithm. Three fundamentally different transformations (with CC + TALP extraction artifacts).

Serial Algorithm (same starting point)

function f(inputs) {
  if (mode) { … } else { … }
  for (i = 0; i < N; i++) {
    for (j = 0; j < M(i); j++) {
      y[i] = g(y[i], x[j]);
    }
  }
}

Non-loop branch: if (mode)

Variable-time loop: M(i) depends on input attributes

Dependency example: updates to y[i] may require coordination when iterations that write the same y[i] run on different workers

CUDA

Kernel + grid + explicit memory & sync

What you change

• Choose thread mapping (iteration → thread)
• Rewrite loops as GPU kernel(s)
• Manage GPU memory transfers
• Launch grid + synchronize

CUDA kernel artifact

Loop → tid • Sync/atomics

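__global__ void k(...) {
  // one thread per outer-loop iteration i
  int tid = blockIdx.x * blockDim.x
          + threadIdx.x;
  int i = tid;
  if (i < N) { … }  // guard: skip threads past the end of the range
}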

Execution model

Grid → Blocks → Threads → GPU

Tuning: occupancy, shared memory, coalescing

OpenCL

Kernel + NDRange + context/queue/buffers

What you change

• Write kernel + choose NDRange geometry
• Create context + command queue
• Create buffers + explicit transfers
• Enqueue kernel + barriers

OpenCL kernel artifact

Loop → gid • Queue/barriers

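kernel void k(...) {
  // one work-item per outer-loop iteration i
  size_t gid = get_global_id(0);
  size_t i = gid;
  if (i < N) { … }  // guard: skip work-items past the end of the range
}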

Execution model

Host API → Queue → Device

NDRange → Work-groups → Work-items

TALPs (TALPified Algorithm)

Functional decomposition → TALPs → graph → discretized instances

TALPify: extraction pipeline

• Functional decomposition (all functions)
• Cyclomatic complexity (CC) highlights hotspots
• Extract execution pathways (TALPs) per function
• Catalog inputs (dims/ranges) → build TALP graph

Functional decomposition tree (CC)

CC guides focus

app(): CC 8

parse(): CC 12

update(): CC 18

solve(): CC 34 (hotspot)

Functional decomposition exposes all functions and their cyclomatic complexity so TALP extraction can prioritize the highest-branching hotspots.

CC = branching pressure • Extract per function + pathway

TALP selection table (catalog)

Function | TALP | Inputs (dims / ranges)
solve() | | x[0..N], y[0..N], mode
update() | | a, b (scalar), cfg

TALP graph + discretization (what becomes parallel)

TALP-1: mode=true • TALP-2: mode=false • non-loop conditional splits pathways

Block A → Loop L1 → Block B → Loop L2 → Block C

Loop L2 variable-time (input-driven)

Parallelization: discretize variable-time loops

Change loop start/end per TALP instance (slicing iteration space).

Discretize

Instance 0: i = 0..k0

Instance 1: i = k0+1..k1

Cross-comm only if dependency rule triggers • Analytics • Change loop start/end
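
One way to picture the slicing step is the sketch below, which splits the outer iteration space into contiguous start/end ranges (end inclusive, matching "i = 0..k0"). Placing the boundaries so that each slice carries a similar share of the per-iteration cost M(i) is an assumption made for this example, not a description of how TALP instances are actually sized; `discretize` and `costs` are hypothetical names.

```python
def discretize(costs: list[int], instances: int) -> list[tuple[int, int]]:
    """Split iterations 0..len(costs)-1 into at most `instances` contiguous
    (start, end) slices, end inclusive, with roughly equal summed cost."""
    total = sum(costs)
    slices: list[tuple[int, int]] = []
    start = 0
    running = 0
    for i, cost in enumerate(costs):
        running += cost
        # Cumulative cost the current slice should reach before it closes.
        target = total * (len(slices) + 1) / instances
        is_last_slice = len(slices) == instances - 1
        if not is_last_slice and running >= target:
            slices.append((start, i))
            start = i + 1
    if start < len(costs):
        slices.append((start, len(costs) - 1))  # last slice takes the rest
    return slices


# Hypothetical M(i): later iterations do four times the inner-loop work.
m = [1, 1, 1, 1, 4, 4, 4, 4]
print(discretize(m, 2))  # [(0, 5), (6, 7)]
```

An equal-count split of this example would give the two instances costs of 4 and 16; the cost-aware split gives 12 and 8.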

Technology

TALPs: parallel performance, without parallel programming

Use investor-safe language first, bridge the concept, then go deep for developers.

Investor-safe definition

Time-Affecting Linear Pathways (TALPs) are execution paths within software that determine total runtime. Massively Parallel’s technology automatically identifies independent TALPs and executes them concurrently.

Software-only approach on existing CPUs
Works with real, existing codebases
Enables scalable concurrent execution

Bridging sentence

TALPs are not tasks or threads — they are the time-critical execution pathways already present in your software.

This is the key reframing: most systems talk about splitting work. TALPs focus on how time flows through the program, and where it can safely flow in parallel.

Conceptual time overlap

Developer-safe definition

A TALP is a linear chain of dependent operations whose execution time contributes to overall latency. Independent TALPs can be scheduled concurrently while preserving program semantics.

Developer implications

Parallelism emerges from pathway independence (not manual task creation)
Synchronization only at necessary merge points
Works beyond loop-level parallelism (control flow + call graphs)

Conceptual TALP form (illustrative)

TALP A: compute a[i] from x[i]
TALP B: compute b[i] from y[i]
JOIN(A, B) -> TALP C: fuse(a[i], b[i])
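
The conceptual form above maps onto a familiar fork/join shape. The sketch below uses Python's standard `concurrent.futures` purely to show the dependency structure; the bodies of `talp_a`, `talp_b`, and `talp_c` are placeholder arithmetic, and nothing here represents how TALPs are actually scheduled.

```python
from concurrent.futures import ThreadPoolExecutor


def talp_a(x):
    """Hypothetical pathway A: a[i] depends only on x[i]."""
    return [v * 2 for v in x]


def talp_b(y):
    """Hypothetical pathway B: b[i] depends only on y[i]."""
    return [v + 1 for v in y]


def talp_c(a, b):
    """Pathway C fuses both results, so it must wait for A and B."""
    return [ai + bi for ai, bi in zip(a, b)]


def run(x, y):
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa = pool.submit(talp_a, x)  # A and B share no data,
        fb = pool.submit(talp_b, y)  # so they may overlap in time
        # JOIN: the only synchronization point in this flow.
        return talp_c(fa.result(), fb.result())


print(run([1, 2, 3], [10, 20, 30]))  # [13, 25, 37]
```

In standard CPython builds, threads do not speed up pure-Python arithmetic like this; the example is about where the merge point sits, not about measured gains.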

Technology

Why Existing Parallel Models Fall Short

Most parallel approaches require developers to restructure software around a parallel model: loops, kernels, or explicit tasks. That works for some workloads, but it breaks down in real applications, where control flow is irregular, codebases keep evolving, and what matters is end-to-end performance.

TALPs (Time-Affecting Linear Pathways) take a different approach: instead of asking you to define parallel work, we identify the time-critical execution pathways already present in your code and run independent pathways concurrently.

Traditional models start with structure

They assume you can express the program as a set of loops, kernels, or tasks. If your software doesn’t naturally fit, you rewrite until it does.

Real software starts with time

Performance bottlenecks often come from execution pathways that span functions, branches, and calls—not a single loop you can “pragma.”

TALPs start with time, not structure

TALPification makes those pathways explicit and executes independent TALPs concurrently, synchronizing only where dependencies require merge points.

TALPs vs common parallel models

A practical comparison of what each model is good at—and why TALPs are designed to generalize across real codebases.

OpenMP / Pragmas

Common approach

Best for

Regular loops and clearly parallel regions.

Why it falls short

Requires developers to annotate code and reason about correctness, granularity, and data-sharing. It’s often loop-centric and struggles with irregular control flow and application-wide optimization.

TALP advantage

TALPs are discovered automatically across functions and control flow. Concurrency emerges from pathway independence—not manual directives—so you parallelize more than loops and reduce maintenance risk.

GPU Offload (CUDA / OpenCL)

Common approach

Best for

Highly data-parallel kernels with predictable memory access.

Why it falls short

Demands algorithm refactoring and a separate programming model. Performance depends heavily on memory movement and kernel structure, and many real-world codes don’t map cleanly without significant rewrite.

TALP advantage

TALPs focus on time-critical pathways in the existing program and enable concurrent execution on CPUs today. You get scalable parallelism without re-architecting code around accelerators.

Task Frameworks (TBB / Cilk / OpenMP Tasks)

Common approach

Best for

Applications already structured as tasks or pipelines.

Why it falls short

You still have to design the task decomposition and dependency structure. Too-coarse tasks waste cores; too-fine tasks add overhead. The model is explicit and can be brittle as code evolves.

TALP advantage

TALPs make the dependency/merge points implicit in the program’s time structure. The system extracts concurrency and can tune pathway granularity without forcing developers to rewrite software as a task graph.

Manual Threads (pthreads / std::thread)

Common approach

Best for

Low-level control in expert hands.

Why it falls short

Hard to get right, expensive to maintain, and easy to regress. Locking, races, false sharing, and portability issues consume engineering time and limit scaling.

TALP advantage

TALPs replace manual concurrency management with automatic pathway discovery and scheduling. You preserve program semantics while letting the system exploit concurrency safely and repeatedly as the code changes.

What “better” means with TALPs

No requirement to restructure code into loops/kernels/tasks
Concurrency discovered across control flow and call graphs
Synchronization only at necessary merge points
Parallelization that remains valid as the code evolves

The one-line differentiation

Tasks and threads describe work you create. TALPs describe time that already exists.

That’s why TALPs can generalize beyond “easy” parallelism and unlock concurrency across real applications without forcing a new programming model.

TECHNOLOGY

TALPs Demo Flow

A canonical walkthrough: import code safely, decompose into real execution pathways, generate predictive analytics, then transparently parallelize and execute toward explicit goals.

Step 01

CONTEXT

Login & Hardware Context

Anchor every prediction and optimization to the real machine.

WHAT HAPPENS

User signs in to the TALP system UI
Hardware + core configuration is explicit (not abstract)

Step 02

TRUST

Repo Import (Clone-Only)

Safety + auditability: work happens on a cloned artifact.

WHAT HAPPENS

Select repo item + dataset + output artifact name
Original source is not mutated

Step 03

DISCOVERY

Auto Decomposition & Profiling

Turn “code” into selectable execution pathways + real measurements.

WHAT HAPPENS

Functional decomposition + call structure
Runs: serial, standard parallel, persistent parallel
Dataset variables + ranges are characterized

Step 04

PREDICTION

Predictive Model Generation

Predict behavior before spending compute.

WHAT HAPPENS

Predict time + space for specific input ranges
Extend to energy/cost/carbon accounting
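
As a hedged illustration of how a time prediction might be derived from a handful of profiling runs, the sketch below fits T(n) = static + variable / n by ordinary least squares. The model form, the function name, and the sample measurements are all assumptions for the example; the actual predictive model is not specified here.

```python
def fit_static_variable(samples):
    """Least-squares fit of T(n) = static + variable / n.

    samples: list of (cores, measured_seconds) with at least two
    distinct core counts.
    """
    xs = [1.0 / n for n, _ in samples]  # regress time on 1/n
    ys = [t for _, t in samples]
    k = len(samples)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variable = sxy / sxx
    static = mean_y - variable * mean_x
    return static, variable


# Hypothetical measurements from three profiling runs.
measured = [(1, 10.0), (2, 6.0), (4, 4.0)]
static, variable = fit_static_variable(measured)
print(round(static, 3), round(variable, 3))
print("predicted time at 16 cores:", round(static + variable / 16, 3))
```

Fitting from a few small runs and then extrapolating is what lets a prediction be made before the larger run is paid for.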

Step 05

TRANSPARENCY

Transparent Source Diff

Nothing is hidden—every change is inspectable.

WHAT HAPPENS

Side-by-side original vs TALP-augmented source
Visual highlight of changed vs unchanged logic

Step 06

STRUCTURE

Functional Decomposition + Complexity

Structure drives parallelism; complexity surfaces hotspots.

WHAT HAPPENS

Function tree + cyclomatic complexity annotations
Guides where parallel structure is most valuable

Step 07

PATHWAYS

TALP Decomposition (Execution Pathways)

Programs don’t run “the code”—they run one pathway.

WHAT HAPPENS

Select a TALP (pathway) explicitly
See exact blocks, order, and variables on that path

Step 08

VALIDATION

Input Variable Analysis & Test Generation

Ranges → cases → correctness + scaling behavior checks.

WHAT HAPPENS

Track pathway-driving input attributes
Auto-generate tests across valid ranges
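
A simple way to turn cataloged ranges into test cases is boundary sampling, sketched below. The sampling strategy (low edge, midpoint, high edge of each range, combined exhaustively) and the names `boundary_values` and `generate_cases` are assumptions for illustration, not the system's actual generator.

```python
from itertools import product


def boundary_values(lo: int, hi: int) -> list[int]:
    """Low edge, midpoint, and high edge of an inclusive integer range."""
    return sorted({lo, (lo + hi) // 2, hi})


def generate_cases(ranges: dict[str, tuple[int, int]]) -> list[dict[str, int]]:
    """Build one test case per combination of sampled values."""
    names = list(ranges)
    grids = [boundary_values(*ranges[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*grids)]


# Hypothetical pathway-driving inputs: loop bound N and a 0/1 mode flag.
cases = generate_cases({"N": (1, 1000), "mode": (0, 1)})
print(len(cases))  # 6
print(cases[0])    # {'N': 1, 'mode': 0}
```

Including both values of `mode` ensures each non-loop branch is exercised, while the spread of `N` values probes scaling behavior.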

Step 09

CONTROL

Goal-Driven Execution

Optimize toward performance or energy.

WHAT HAPPENS

User selects goal + constraints (e.g., max cores)
System chooses core count (not always “max”)
Run + report deltas (time/energy/cost)
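
The selection step can be sketched as a constrained minimum over predicted operating points. The function name, the deadline constraint, and the prediction table are hypothetical; the sketch only shows why the chosen core count is not always the maximum.

```python
from typing import Optional


def choose_cores(
    predictions: dict[int, tuple[float, float]],
    max_cores: int,
    deadline_s: float,
) -> Optional[int]:
    """predictions maps cores -> (predicted_time_s, predicted_energy_j).

    Returns the feasible core count with the lowest predicted energy,
    or None if no option satisfies the constraints.
    """
    feasible = {
        n: (t, e)
        for n, (t, e) in predictions.items()
        if n <= max_cores and t <= deadline_s
    }
    if not feasible:
        return None
    return min(feasible, key=lambda n: feasible[n][1])


# Hypothetical predictions: cores -> (seconds, joules).
predictions = {1: (10.0, 200.0), 2: (6.0, 144.0), 4: (4.0, 120.0), 8: (3.0, 135.0)}
print(choose_cores(predictions, max_cores=8, deadline_s=5.0))  # 4
print(choose_cores(predictions, max_cores=8, deadline_s=3.5))  # 8
```

With a 5 s deadline the energy goal picks 4 cores even though 8 are allowed; tightening the deadline to 3.5 s forces the choice up to 8.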

KEY IDEA

TALPs make execution pathways explicit, generate predictive analytics from real runs, and use those predictions to choose parallel strategies and resource levels aligned to performance or sustainability goals.

TECHNOLOGY

Code → TALPification

MPT transforms existing code into predictable, parallel execution through a time-centric pipeline: structural analysis, live predictive analytics, TALP-based transformation, and cost-aware optimization.

Functional Decomposition

Turn a codebase into a complete, navigable structure of functions and relationships.

Extracts every function and call relationship across the project.
Highlights hotspots instantly with cyclomatic complexity per function.
Surfaces dependency variables and control structure to guide optimization.
Creates the structural model used by the rest of the pipeline.
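
To make the "cyclomatic complexity per function" idea concrete, the sketch below ranks the functions in a Python source string using the standard `ast` module. It uses a common approximation (1 plus the number of decision points) and is not the analyzer described on this page; `rank_functions` and the sample source are hypothetical.

```python
import ast

# Node types counted as one decision point each (an approximation).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func: ast.FunctionDef) -> int:
    """Approximate CC as 1 + number of decision points in the function."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, DECISION_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of and/or adds one more branch.
            score += len(node.values) - 1
    return score


def rank_functions(source: str) -> list[tuple[str, int]]:
    """Return (function name, CC) pairs, highest complexity first."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    scored = [(f.name, cyclomatic_complexity(f)) for f in funcs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


SAMPLE = '''
def parse(x):
    return x

def solve(x, mode):
    if mode and x:
        for v in x:
            if v > 0:
                pass
    return x
'''
print(rank_functions(SAMPLE))  # [('solve', 5), ('parse', 1)]
```

Sorting by the score puts the highest-branching function first, which is the "hotspot" ordering the decomposition view relies on.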

WHAT YOU GET

A full function tree + maintainability signals (complexity) + dependency context—without manual tracing.

Next: TALP definition & why it matters (separate section/component)

This section describes the end-to-end pipeline (analysis → prediction → transformation). A separate section should define TALPs and explain why time-centric pathways matter.