Titian: Data Provenance Support in Spark.

Interlandi M, Shah K, Tetali SD, Gulzar MA, Yoo S, Kim M, Millstein T, Condie T - Proceedings VLDB Endowment (2015)

Bottom Line: Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

View Article: PubMed Central - HTML - PubMed

Affiliation: University of California, Los Angeles.

ABSTRACT

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds, orders of magnitude faster than alternative solutions, while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

No MeSH data available.


Figure 4: LineageRDD methods for traversing through the data lineage in both backward and forward directions. The native Spark compute method is used to plug a LineageRDD instance into the Spark dataflow (described in Section 4).

Mentions: Figure 4 lists the transformations that LineageRDD supports. The goBackAll and goNextAll methods compute the full trace backward and forward, respectively: given some result record(s), goBackAll returns all initial input records that contributed, through the series of transformations, to those result record(s); goNextAll returns all final result records to which a given input record(s) contributed. A single step backward or forward is supported by goBack and goNext, respectively.
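The backward/forward tracing semantics described above can be illustrated with a small self-contained sketch. This is plain Python, not Titian's actual Scala API; it assumes a simplified model in which each stage records a mapping from output record IDs to the set of input record IDs that produced them, so that a full trace (goBackAll/goNextAll) is just the composition of single steps (goBack/goNext) across stages:

```python
# Toy lineage model: stage_lineage[i] maps each output record of stage i
# to the set of input records that contributed to it. Stage 0 reads the
# initial input; the last stage produces the final results.
stage_lineage = [
    {"a1": {"in1", "in2"}, "a2": {"in3"}},   # stage 0: input -> intermediate
    {"out1": {"a1", "a2"}, "out2": {"a2"}},  # stage 1: intermediate -> result
]

def go_back(records, stage):
    """One backward step: inputs of `stage` that produced `records`."""
    return set().union(*(stage_lineage[stage].get(r, set()) for r in records))

def go_back_all(records):
    """Full backward trace: initial input records behind `records`."""
    for stage in reversed(range(len(stage_lineage))):
        records = go_back(records, stage)
    return records

def go_next(records, stage):
    """One forward step: outputs of `stage` that any of `records` fed."""
    return {out for out, ins in stage_lineage[stage].items() if ins & records}

def go_next_all(records):
    """Full forward trace: final results that `records` contributed to."""
    for stage in range(len(stage_lineage)):
        records = go_next(records, stage)
    return records

print(go_back_all({"out1"}))  # {'in1', 'in2', 'in3'}
print(go_next_all({"in3"}))   # {'out1', 'out2'}
```

In Titian itself these traversals run as native Spark jobs over distributed lineage tables captured at stage boundaries, rather than over in-memory dictionaries as in this sketch.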

