Job Lineage Tracking: Debugging Multi-Step Pipelines
Debugging complex, multi-step job pipelines can be a nightmare without proper visibility. When a workflow fails, identifying the root cause becomes a time-consuming and frustrating process. This article proposes a solution: Job Lineage Tracking, a system designed to track parent-child relationships between jobs, providing a clear view of job pipelines and simplifying debugging.
The Problem: Debugging Multi-Step Job Pipelines
Currently, when a complex workflow encounters an issue, there is little insight into the job execution flow. We cannot determine:
- which parent job triggered specific child jobs,
- how many jobs were spawned at each stage,
- where exactly the failure occurred within the pipeline,
- the completion status of individual jobs, or
- the overall job dependency tree.
This lack of visibility makes debugging a challenging and time-intensive task.
Consider a current production example: the Kino Krakow Scraper failure. In this scenario, 1,111 ShowtimeProcessJobs failed with a :movie_not_ready error. The existing blind spots make it impossible to answer crucial questions such as:
- Did the SyncJob initiate all 108 expected MoviePageJobs?
- Did the 70 completed MoviePageJobs successfully spawn MovieDetailJobs?
- Do the 115 "available" MovieDetailJobs originate from the same import run?
- Why are certain MovieDetailJobs never executing?
- Which specific ShowtimeProcessJobs failed and for which MoviePageJobs?
Our current debugging process involves manual SQL queries, joining jobs by timestamp and arguments, and often relying on educated guesses to establish relationships. This approach is inefficient and prone to errors, leading to incorrect assumptions and prolonged debugging cycles. For instance, in issue #2290, we initially assumed a UUID bug when the actual issue might have been different, highlighting the risks of debugging in the dark.
Proposed Solution: Job Lineage Tracking
To address these challenges, we propose implementing Job Lineage Tracking. This system will track the parent-child relationships between jobs, enabling us to visualize and query job pipelines effectively. The core concept involves including a parent reference in the metadata when scheduling child jobs. For example:
# In SyncJob when scheduling MoviePageJob:
MoviePageJob.new(
  %{movie_slug: slug, ...},
  meta: %{parent_job_id: job.id}  # ← Track parent
)

# In MoviePageJob when scheduling MovieDetailJob:
MovieDetailJob.new(
  %{movie_slug: slug, ...},
  meta: %{parent_job_id: job.id}  # ← Track parent
)
By adding this metadata, we can construct queries to trace job trees. A sample SQL query demonstrates how this can be achieved:
-- Recursive query to get all jobs spawned by SyncJob #12345
WITH RECURSIVE job_tree AS (
  SELECT id, worker, state, args, 1 AS depth
  FROM oban_jobs
  WHERE id = 12345  -- Root SyncJob

  UNION ALL

  SELECT j.id, j.worker, j.state, j.args, jt.depth + 1
  FROM oban_jobs j
  JOIN job_tree jt ON j.meta->>'parent_job_id' = jt.id::text
)
SELECT
  depth,
  worker,
  state,
  COUNT(*) AS count
FROM job_tree
GROUP BY depth, worker, state
ORDER BY depth, worker;
The results of such queries can provide valuable insights. For instance, the following result set reveals that MovieDetailJobs are scheduled but not executing:
depth | worker             | state     | count
------+--------------------+-----------+-------
    1 | SyncJob            | completed |     1
    2 | MoviePageJob       | completed |    70
    2 | MoviePageJob       | available |    63
    2 | MoviePageJob       | executing |     3
    3 | MovieDetailJob     | available |   115  ← STUCK
    3 | ShowtimeProcessJob | discarded |  1111  ← FAILING
This level of visibility allows us to quickly identify bottlenecks and failure points within the pipeline.
Implementation Phases
To implement Job Lineage Tracking, we propose a phased approach, starting with immediate debug queries and progressing towards a comprehensive UI dashboard.
Phase 0: Immediate Debug Queries (0.5 hour) ⚡ URGENT
This phase involves no code changes and focuses on using SQL queries to debug the current Kino Krakow Scraper failure. This is an urgent step to gain immediate insights into the issue. Sample queries include:
-- 1. Find the most recent SyncJob
SELECT id, state, inserted_at
FROM oban_jobs
WHERE worker LIKE '%SyncJob'
ORDER BY inserted_at DESC
LIMIT 1;

-- 2. Manually trace children by timestamp + source_id
--    (because we don't have parent_job_id yet)
SELECT
  worker,
  state,
  COUNT(*) AS count
FROM oban_jobs
WHERE args->>'source_id' = '13'  -- Kino Krakow source
  AND inserted_at > '2025-11-18 13:00:00'
GROUP BY worker, state;
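Because the failing jobs are discarded, Oban's `errors` column (an array of the error recorded on each attempt) can also be inspected directly. A possible follow-up query, assuming the job args carry a `movie_slug` key as the earlier examples suggest:

```sql
-- 3. Sample the errors recorded on discarded ShowtimeProcessJobs
SELECT id, args->>'movie_slug' AS movie_slug, errors
FROM oban_jobs
WHERE worker LIKE '%ShowtimeProcessJob'
  AND state = 'discarded'
ORDER BY inserted_at DESC
LIMIT 10;
```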
The purpose of this phase is to understand the current failure while laying the groundwork for the tracking system.
Phase 1: Metadata Tracking (1-2 hours) 🎯 MVP
This phase focuses on adding parent tracking to existing jobs. The files to modify include:
- lib/eventasaurus_discovery/sources/kino_krakow/jobs/sync_job.ex
- lib/eventasaurus_discovery/sources/kino_krakow/jobs/movie_page_job.ex
Changes involve adding the parent_job_id to the metadata when scheduling child jobs, as demonstrated in the following code snippet:
# sync_job.ex - when scheduling MoviePageJob
MoviePageJob.new(
  args,
  queue: :scraper_detail,
  meta: %{parent_job_id: job.id}  # ← ADD THIS
)

# movie_page_job.ex - when scheduling MovieDetailJob
MovieDetailJob.new(
  args,
  queue: :scraper_detail,
  meta: %{parent_job_id: job.id}  # ← ADD THIS
)

# movie_page_job.ex - when scheduling ShowtimeProcessJob
ShowtimeProcessJob.new(
  args,
  queue: :scraper,
  meta: %{parent_job_id: job.id}  # ← ADD THIS
)
The immediate benefit of this phase is the ability to query job relationships from the next import run onwards.
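Once Phase 1 ships, the metadata can be spot-checked in IEx without any helper module. A quick sanity check against the `Oban.Job` schema (the worker-name pattern is illustrative):

```elixir
import Ecto.Query

# Fetch the most recently inserted MovieDetailJob and inspect its meta
EventasaurusApp.Repo.one(
  from j in Oban.Job,
    where: like(j.worker, "%MovieDetailJob"),
    order_by: [desc: j.inserted_at],
    limit: 1,
    select: %{id: j.id, meta: j.meta}
)
```

If tracking is working, the returned `meta` map should contain a `"parent_job_id"` key pointing at the spawning MoviePageJob.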
Phase 2: Query Helpers (1 hour) 🔍
This phase involves creating a helper module for common queries. A new file, lib/eventasaurus_app/oban/job_lineage.ex, will be created with the following structure:
defmodule EventasaurusApp.Oban.JobLineage do
  @moduledoc """
  Helper functions for querying job parent-child relationships.
  """

  import Ecto.Query

  alias EventasaurusApp.Repo

  @doc "Get all descendants of a job (recursive)"
  def get_job_tree(root_job_id) do
    # Recursive CTE query
  end

  @doc "Get direct children of a job"
  def get_children(parent_job_id) do
    # Simple meta->>'parent_job_id' query
  end

  @doc "Get ancestors of a job (trace back to root)"
  def get_ancestors(child_job_id) do
    # Recursive CTE query upward
  end

  @doc "Get pipeline statistics (counts per stage)"
  def get_pipeline_stats(root_job_id) do
    # Job tree with COUNT(*) per worker/state
  end
end
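The simplest of these, `get_children/1`, can be implemented directly against the `Oban.Job` schema with a JSON fragment. A sketch, assuming jobs live in the default `oban_jobs` table and `Repo` is the application repo:

```elixir
@doc "Get direct children of a job"
def get_children(parent_job_id) do
  # meta->>'parent_job_id' returns text, so compare against the id as a string
  Oban.Job
  |> where([j], fragment("?->>'parent_job_id' = ?", j.meta, ^to_string(parent_job_id)))
  |> Repo.all()
end
```

The recursive functions would wrap the `WITH RECURSIVE` SQL shown earlier via `Repo.query/2`, since Ecto has no first-class recursive-CTE DSL.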
This module will provide functions to get job trees, children, ancestors, and pipeline statistics. Usage examples include:
# In IEx or controller
JobLineage.get_pipeline_stats(sync_job_id)
# => %{
#      "MoviePageJob" => %{completed: 70, available: 63},
#      "MovieDetailJob" => %{available: 115},
#      "ShowtimeProcessJob" => %{discarded: 1111}
#    }
Phase 3: Simple UI Dashboard (2-3 hours) 🎨
This phase focuses on creating a simple UI dashboard to visualize job lineage. A new route, /admin/job-lineage/:job_id, will be added, along with a controller (lib/eventasaurus_web/controllers/admin/job_lineage_controller.ex) and a view (lib/eventasaurus_web/templates/admin/job_lineage/show.html.heex). The dashboard will provide a text-based job tree visualization, such as:
SyncJob #12345 (completed)
├─ MoviePageJob (136 scheduled)
│  ├─ Completed: 70
│  ├─ Available: 63 ⚠️
│  └─ Executing: 3
├─ MovieDetailJob (115 scheduled)
│  ├─ Available: 115 ⚠️ STUCK
│  └─ Completed: 0
└─ ShowtimeProcessJob (1111 scheduled)
   ├─ Discarded: 1111 ❌
   └─ Completed: 0
Key features will include job counts by state at each pipeline stage, highlighting stuck stages, links to the Oban dashboard for individual job details, and sample job arguments for debugging.
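A minimal controller action for the new route could simply delegate to the Phase 2 helpers and hand the results to the template. A sketch (the `use EventasaurusWeb, :controller` macro and render assigns follow common Phoenix conventions and are assumptions about this codebase):

```elixir
defmodule EventasaurusWeb.Admin.JobLineageController do
  use EventasaurusWeb, :controller

  alias EventasaurusApp.Oban.JobLineage

  def show(conn, %{"job_id" => job_id}) do
    root_id = String.to_integer(job_id)

    # Stats feed the per-stage counts; the tree feeds the text visualization
    stats = JobLineage.get_pipeline_stats(root_id)
    tree = JobLineage.get_job_tree(root_id)

    render(conn, :show, stats: stats, tree: tree)
  end
end
```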
Phase 4: Enhanced Visualization (Future) 🚀
This phase outlines potential future enhancements, including interactive graph visualizations (using D3.js), real-time updates via Phoenix LiveView, historical tracking in a dedicated table, exporting job trees to JSON/CSV, and comparing job runs side-by-side.
Use Cases
Job Lineage Tracking has several critical use cases:
1. Debug Kino Krakow Scraper (Current)
Previously, debugging the Kino Krakow Scraper involved manual SQL queries and guesswork. With Job Lineage, we can use the JobLineage.get_pipeline_stats(sync_job_id) function in IEx to instantly identify where the pipeline is stuck.
2. Debug Any Multi-Step Job
This system can be applied to any multi-step job, such as Unsplash refresh coordinators, venue enrichment pipelines, and report generation processes.
3. Monitor Job Execution
The dashboard provides insights into the number of jobs spawned at each stage, bottlenecks, jobs that never executed, and failure propagation through the pipeline.
Benefits
Job Lineage Tracking offers significant benefits:
- ✅ Visibility - See the entire job pipeline at a glance.
- ✅ Debugging - Trace failures back to the source job.
- ✅ Monitoring - Identify bottlenecks and stuck stages.
- ✅ Reusable - Works for any multi-step job workflow.
- ✅ Simple - Uses existing Oban metadata, requiring no new tables for the MVP.
- ✅ Immediate - Job relationships can be queried immediately after Phase 1.
Relationship to Other Issues
vs. Issue #2261 (Job Execution Monitoring)
While #2261 focuses on batch-level metrics and historical trends, Job Lineage Tracking provides insight into pipeline structure and relationships. They are complementary; #2261 answers βDid the job succeed?β while Job Lineage answers βWhat jobs ran? Whatβs the flow?β
vs. Issue #2291 (Debugging Instrumentation)
#2291 focuses on logging and error context within jobs, while Job Lineage addresses relationships between jobs. Both are needed for comprehensive observability; #2291 for job internals and Job Lineage for job relationships.
Success Criteria
The success of this project will be measured by the ability to:
- Trace ShowtimeProcessJob back to the originating SyncJob.
- See how many MoviePageJobs were spawned by the SyncJob.
- Identify which stage in the pipeline is stuck or failing.
- Query the job tree with a single function call.
- Visualize the job pipeline in the admin dashboard.
- Use this system for ANY multi-step job.
Technical Notes
Why metadata over a dedicated table?
Using job.meta is simpler and faster for Phases 1-3. A dedicated table can be considered for historical tracking in Phase 4.
Why recursive CTEs?
PostgreSQL supports WITH RECURSIVE for tree queries, requiring no additional dependencies and being efficient for job trees typically 3-4 levels deep.
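The same mechanism works in the other direction for `get_ancestors/1`: start from the child and join upward through `parent_job_id`. A sketch (the starting job id is illustrative):

```sql
-- Trace a job back to its root SyncJob via parent_job_id
WITH RECURSIVE ancestors AS (
  SELECT id, worker, state, meta
  FROM oban_jobs
  WHERE id = 99999  -- Failing child job

  UNION ALL

  SELECT p.id, p.worker, p.state, p.meta
  FROM oban_jobs p
  JOIN ancestors a ON p.id::text = a.meta->>'parent_job_id'
)
SELECT id, worker, state FROM ancestors;
```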
Backward compatibility:
Existing jobs without parent_job_id will work fine, with only new jobs getting tracking. No migration is needed for Phases 1-2.
Related Files
To modify (Phase 1):
- lib/eventasaurus_discovery/sources/kino_krakow/jobs/sync_job.ex
- lib/eventasaurus_discovery/sources/kino_krakow/jobs/movie_page_job.ex
To create (Phase 2):
lib/eventasaurus_app/oban/job_lineage.ex
To create (Phase 3):
- lib/eventasaurus_web/controllers/admin/job_lineage_controller.ex
- lib/eventasaurus_web/templates/admin/job_lineage/show.html.heex
Next Steps
Immediate (to debug current failure):
- Implement Phase 0 queries to understand the current state.
- Implement Phase 1 metadata tracking.
- Re-run Kino Krakow import with tracking enabled.
- Use Phase 2 queries to diagnose the actual issue.
Short-term:
- Implement Phase 2 query helpers.
- Implement Phase 3 UI dashboard.
Long-term:
- Phase 4 enhancements as needed.
This initiative is CRITICAL: it is a prerequisite for debugging complex job workflows, and it has HIGH impact as reusable infrastructure for all multi-step jobs. The effort is LOW for the MVP (Phases 0-2: ~3 hours) and MEDIUM for the full system (Phases 0-3: ~6 hours).
In conclusion, Job Lineage Tracking is a crucial step towards improving the observability and debuggability of multi-step job pipelines. By tracking parent-child relationships and providing tools for querying and visualizing job flows, we can significantly reduce debugging time and improve the reliability of our systems.
For more information on job management and pipeline observability, consider exploring the official documentation for Oban, the robust job processing library for Elixir that powers these pipelines.