Job Lineage Tracking: Debugging Multi-Step Pipelines

by Alex Johnson 53 views

Debugging complex, multi-step job pipelines can be a nightmare without proper visibility. When a workflow fails, identifying the root cause becomes a time-consuming and frustrating process. This article proposes a solution: Job Lineage Tracking, a system designed to track parent-child relationships between jobs, providing a clear view of job pipelines and simplifying debugging.

The Problem: Debugging Multi-Step Job Pipelines

Currently, when a complex workflow encounters an issue, there's a significant lack of insight into the job execution flow. We are unable to determine which parent job triggered specific child jobs, the number of jobs spawned at each stage, the precise location of the failure within the pipeline, the completion status of individual jobs, and the overall job dependency tree. This lack of visibility makes debugging a challenging and time-intensive task.

Consider a current production example: the Kino Krakow Scraper failure. In this scenario, 1,111 ShowtimeProcessJobs failed with a :movie_not_ready error. The existing blind spots make it impossible to answer crucial questions such as:

  1. Did the SyncJob initiate all 108 expected MoviePageJobs?
  2. Did the 70 completed MoviePageJobs successfully spawn MovieDetailJobs?
  3. Do the 115 "available" MovieDetailJobs originate from the same import run?
  4. Why are certain MovieDetailJobs never executing?
  5. Which specific ShowtimeProcessJobs failed and for which MoviePageJobs?

Our current debugging process involves manual SQL queries, joining jobs by timestamp and arguments, and often relying on educated guesses to establish relationships. This approach is inefficient and prone to errors, leading to incorrect assumptions and prolonged debugging cycles. For instance, in issue #2290, we initially assumed a UUID bug when the actual issue might have been different, highlighting the risks of debugging in the dark.

Proposed Solution: Job Lineage Tracking

To address these challenges, we propose implementing Job Lineage Tracking. This system will track the parent-child relationships between jobs, enabling us to visualize and query job pipelines effectively. The core concept involves including a parent reference in the metadata when scheduling child jobs. For example:

# In SyncJob when scheduling MoviePageJob:
MoviePageJob.new(
  %{movie_slug: slug, ...},
  meta: %{parent_job_id: job.id}  # ← Track parent
)

# In MoviePageJob when scheduling MovieDetailJob:
MovieDetailJob.new(
  %{movie_slug: slug, ...},
  meta: %{parent_job_id: job.id}  # ← Track parent
)

By adding this metadata, we can construct queries to trace job trees. A sample SQL query demonstrates how this can be achieved:

-- Recursive query to get all jobs spawned by SyncJob #12345
WITH RECURSIVE job_tree AS (
  SELECT id, worker, state, args, 1 as depth
  FROM oban_jobs 
  WHERE id = 12345  -- Root SyncJob
  
  UNION ALL
  
  SELECT j.id, j.worker, j.state, j.args, jt.depth + 1
  FROM oban_jobs j
  JOIN job_tree jt ON j.meta->>'parent_job_id' = jt.id::text
)
SELECT 
  depth,
  worker,
  state,
  COUNT(*) as count
FROM job_tree
GROUP BY depth, worker, state
ORDER BY depth, worker;

The results of such queries can provide valuable insights. For instance, the following result set reveals that MovieDetailJobs are scheduled but not executing:

depth | worker                    | state     | count
------+---------------------------+-----------+-------
1     | SyncJob                   | completed | 1
2     | MoviePageJob              | completed | 70
2     | MoviePageJob              | available | 63
2     | MoviePageJob              | executing | 3
3     | MovieDetailJob            | available | 115  ← STUCK
3     | ShowtimeProcessJob        | discarded | 1111 ← FAILING

This level of visibility allows us to quickly identify bottlenecks and failure points within the pipeline.

Implementation Phases

To implement Job Lineage Tracking, we propose a phased approach, starting with immediate debug queries and progressing towards a comprehensive UI dashboard.

Phase 0: Immediate Debug Queries (0.5 hour) ⚑ URGENT

This phase involves no code changes and focuses on using SQL queries to debug the current Kino Krakow Scraper failure. This is an urgent step to gain immediate insights into the issue. Sample queries include:

-- 1. Find the most recent SyncJob
SELECT id, state, inserted_at 
FROM oban_jobs 
WHERE worker LIKE '%SyncJob'
ORDER BY inserted_at DESC 
LIMIT 1;

-- 2. Manually trace children by timestamp + source_id
-- (because we don't have parent_job_id yet)
SELECT 
  worker,
  state,
  COUNT(*) as count
FROM oban_jobs
WHERE args->>'source_id' = '13'  -- Kino Krakow source
AND inserted_at > '2025-11-18 13:00:00'
GROUP BY worker, state;

The purpose of this phase is to understand the current failure while laying the groundwork for the tracking system.

Phase 1: Metadata Tracking (1-2 hours) 🎯 MVP

This phase focuses on adding parent tracking to existing jobs. The files to modify include:

  • lib/eventasaurus_discovery/sources/kino_krakow/jobs/sync_job.ex
  • lib/eventasaurus_discovery/sources/kino_krakow/jobs/movie_page_job.ex

Changes involve adding the parent_job_id to the metadata when scheduling child jobs, as demonstrated in the following code snippet:

# sync_job.ex - when scheduling MoviePageJob
MoviePageJob.new(
  args,
  queue: :scraper_detail,
  meta: %{parent_job_id: job.id}  # ← ADD THIS
)

# movie_page_job.ex - when scheduling MovieDetailJob
MovieDetailJob.new(
  args,
  queue: :scraper_detail,
  meta: %{parent_job_id: job.id}  # ← ADD THIS
)

# movie_page_job.ex - when scheduling ShowtimeProcessJob
ShowtimeProcessJob.new(
  args,
  queue: :scraper,
  meta: %{parent_job_id: job.id}  # ← ADD THIS
)

The immediate benefit of this phase is the ability to query job relationships from the next import run onwards.

Phase 2: Query Helpers (1 hour) πŸ“Š

This phase involves creating a helper module for common queries. A new file, lib/eventasaurus_app/oban/job_lineage.ex, will be created with the following structure:

defmodule EventasaurusApp.Oban.JobLineage do
  @moduledoc """
  Helper functions for querying job parent-child relationships.
  """
  
  import Ecto.Query
  alias EventasaurusApp.Repo
  
  @doc "Get all descendants of a job (recursive)"
  def get_job_tree(root_job_id) do
    # Recursive CTE query
  end
  
  @doc "Get direct children of a job"
  def get_children(parent_job_id) do
    # Simple meta->>'parent_job_id' query
  end
  
  @doc "Get ancestors of a job (trace back to root)"
  def get_ancestors(child_job_id) do
    # Recursive CTE query upward
  end
  
  @doc "Get pipeline statistics (counts per stage)"
  def get_pipeline_stats(root_job_id) do
    # Job tree with COUNT(*) per worker/state
  end
end

This module will provide functions to get job trees, children, ancestors, and pipeline statistics. Usage examples include:

# In IEx or controller
JobLineage.get_pipeline_stats(sync_job_id)
# => %{
#   "MoviePageJob" => %{completed: 70, available: 63},
#   "MovieDetailJob" => %{available: 115},
#   "ShowtimeProcessJob" => %{discarded: 1111}
# }

Phase 3: Simple UI Dashboard (2-3 hours) 🎨

This phase focuses on creating a simple UI dashboard to visualize job lineage. A new route, /admin/job-lineage/:job_id, will be added, along with a controller (lib/eventasaurus_web/controllers/admin/job_lineage_controller.ex) and a view (lib/eventasaurus_web/templates/admin/job_lineage/show.html.heex). The dashboard will provide a text-based job tree visualization, such as:

SyncJob #12345 (completed)
β”œβ”€ MoviePageJob (136 scheduled)
β”‚  β”œβ”€ Completed: 70
β”‚  β”œβ”€ Available: 63 ⚠️
β”‚  └─ Executing: 3
β”œβ”€ MovieDetailJob (115 scheduled)
β”‚  β”œβ”€ Available: 115 ⚠️ STUCK
β”‚  └─ Completed: 0
└─ ShowtimeProcessJob (1111 scheduled)
   β”œβ”€ Discarded: 1111 ❌
   └─ Completed: 0

Key features will include job counts by state at each pipeline stage, highlighting stuck stages, links to the Oban dashboard for individual job details, and sample job arguments for debugging.

Phase 4: Enhanced Visualization (Future) πŸš€

This phase outlines potential future enhancements, including interactive graph visualizations (using D3.js), real-time updates via Phoenix LiveView, historical tracking in a dedicated table, exporting job trees to JSON/CSV, and comparing job runs side-by-side.

Use Cases

Job Lineage Tracking has several critical use cases:

1. Debug Kino Krakow Scraper (Current)

Previously, debugging the Kino Krakow Scraper involved manual SQL queries and guesswork. With Job Lineage, we can use the JobLineage.get_pipeline_stats(sync_job_id) function in IEx to instantly identify where the pipeline is stuck.

2. Debug Any Multi-Step Job

This system can be applied to any multi-step job, such as Unsplash refresh coordinators, venue enrichment pipelines, and report generation processes.

3. Monitor Job Execution

The dashboard provides insights into the number of jobs spawned at each stage, bottlenecks, jobs that never executed, and failure propagation through the pipeline.

Benefits

Job Lineage Tracking offers significant benefits:

βœ… Visibility - See the entire job pipeline at a glance. βœ… Debugging - Trace failures back to the source job. βœ… Monitoring - Identify bottlenecks and stuck stages. βœ… Reusable - Works for any multi-step job workflow. βœ… Simple - Uses existing Oban metadata, requiring no new tables for the MVP. βœ… Immediate - Job relationships can be queried immediately after Phase 1.

Relationship to Other Issues

vs. Issue #2261 (Job Execution Monitoring)

While #2261 focuses on batch-level metrics and historical trends, Job Lineage Tracking provides insight into pipeline structure and relationships. They are complementary; #2261 answers β€œDid the job succeed?” while Job Lineage answers β€œWhat jobs ran? What’s the flow?”

vs. Issue #2291 (Debugging Instrumentation)

#2291 focuses on logging and error context within jobs, while Job Lineage addresses relationships between jobs. Both are needed for comprehensive observability; #2291 for job internals and Job Lineage for job relationships.

Success Criteria

The success of this project will be measured by the ability to:

  • Trace ShowtimeProcessJob back to the originating SyncJob.
  • See how many MoviePageJobs were spawned by the SyncJob.
  • Identify which stage in the pipeline is stuck or failing.
  • Query the job tree with a single function call.
  • Visualize the job pipeline in the admin dashboard.
  • Use this system for ANY multi-step job.

Technical Notes

Why metadata over a dedicated table?

Using job.meta is simpler and faster for Phases 1-3. A dedicated table can be considered for historical tracking in Phase 4.

Why recursive CTEs?

PostgreSQL supports WITH RECURSIVE for tree queries, requiring no additional dependencies and being efficient for job trees typically 3-4 levels deep.

Backward compatibility:

Existing jobs without parent_job_id will work fine, with only new jobs getting tracking. No migration is needed for Phases 1-2.

Related Files

To modify (Phase 1):

  • lib/eventasaurus_discovery/sources/kino_krakow/jobs/sync_job.ex
  • lib/eventasaurus_discovery/sources/kino_krakow/jobs/movie_page_job.ex

To create (Phase 2):

  • lib/eventasaurus_app/oban/job_lineage.ex

To create (Phase 3):

  • lib/eventasaurus_web/controllers/admin/job_lineage_controller.ex
  • lib/eventasaurus_web/templates/admin/job_lineage/show.html.heex

Next Steps

Immediate (to debug current failure):

  1. Implement Phase 0 queries to understand the current state.
  2. Implement Phase 1 metadata tracking.
  3. Re-run Kino Krakow import with tracking enabled.
  4. Use Phase 2 queries to diagnose the actual issue.

Short-term:

  1. Implement Phase 2 query helpers.
  2. Implement Phase 3 UI dashboard.

Long-term:

  1. Phase 4 enhancements as needed.

This initiative is CRITICAL due to its prerequisite nature for debugging complex job workflows, providing HIGH impact by offering reusable infrastructure for all multi-step jobs, and requires LOW effort for the MVP (Phases 0-2: ~3 hours) and MEDIUM effort for the full system (Phases 0-3: ~6 hours).

In conclusion, Job Lineage Tracking is a crucial step towards improving the observability and debuggability of multi-step job pipelines. By tracking parent-child relationships and providing tools for querying and visualizing job flows, we can significantly reduce debugging time and improve the reliability of our systems.

For more information on job management and pipeline observability, consider exploring resources like Oban's official documentation, a robust job processing library for Elixir applications.