# Reproduction of "Scalability! But at what COST!"

This post summarizes the results of the experiments in the COST paper that I reproduced for my CPSC 508, Graduate Operating Systems class assignment earlier in the year (January 2019).

## Goal

The goal of this post is to reproduce and verify the results of the experiments presented in the COST paper.

## Summary of the paper

In the COST paper, the authors provide a new, unbiased metric for measuring scalability of distributed graph processing systems. The metric provided, called COST i.e. Configuration that Outperforms a Single Thread, measures the hardware configuration required by systems to outperform a single-threaded implementation of the same task.

The goal of this paper is to measure the real scalability of distributed graph processing systems without rewarding systems that bring substantial but parallelizable overheads. The authors additionally aim to show that single-threaded implementations can significantly outperform “scalable” distributed frameworks and that they should be used as benchmarks for measuring scalability.

The hypotheses of this paper is two-fold. Firstly, the distributed graph-processing frameworks don’t use a good baseline while measuring their scalability and that this scalability comes at a cost of the overheads that these systems introduce. Secondly, the single-threaded implementation of the same tasks performed by the system are a great baseline for measuring the true scalability of these systems.

## Experimental Setup

All experiments were carried out on an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz with 32GB RAM and 250GB SSD. The machine was running Ubuntu 16.04 LTS. For producing the results for the experiment, I used the following tools :

- Rust implementations of the graph algorithms provided by the authors to re-run all the single-thread experiments.
- WebGraph to convert uk-2007-05 dataset into the format required by the rust implementation

The versions of cargo, rust, and java for pre-processing steps and are presented in the following Table

Language/Software | Version |
---|---|

Cargo | 1.3.0 |

Rustc | 1.3.1 |

Java | 1.8.0 u191 |

**Table 1 : Versions of various softwares and languages used in the study.**

## Preprocessing

Before, one can run the experiments shown in the original paper, one must complete a bunch of tasks to process the datasets into the correct format required by the provided code of COST. The timing results for each of these tasks are shown in Table 2 below . The tasks are listed in the order they must be performed.

Preprocessing Task | twitter_rv | uk-2007-05 |
---|---|---|

Convert to Input Format | 0s | 16946s |

Convert to Vertex Order | 156s | 402s |

Convert to Hilbert Order | 147s | 276s |

Finding Max Vertex ID (Vertex) | 1.8s | 14.9s |

Finding Max Vertex ID (Hilbert) | 1.5s | 13.3s |

Total | 306.3 | 17652.2 |

**Table 2 : Computation Time (in s) for various preprocessing tasks for Twitter and uk datasets.**

The twitter_rv dataset is already available in the Input format required by the COST framework to convert the dataset to vertex and hilbert order. Thus, the preprocessing time required for converting this dataset to input format was labelled as 0.

The uk-2007-05 dataset is only available in binary format which is readable by WebGraph. The conversion of uk-2007-05 dataset takes a long time as the code I wrote is not fully optimized to convert the dataset into the format required. The code is a modification of the BreadthFirst Example provided by the WebGraph developers. The modified code prints every edge to stderr and the script running the code pipes the output on stderr to a file, thus the code pays a double-write penalty. A faster implementation would be to directly write to the output file in a buffered fashion.

## Reproduced Experiment Results

Name | twitter_rv | uk-2007-05 |
---|---|---|

nodes | 41,652,230 | 105,896,555 |

edges | 1,468,365,182 | 3,738,733,648 |

size | 5.76GB | 14.72GB |

txt format size | 25GB | 63GB |

encoded size | 5.8GB | 14.71GB |

**Table 3 shows the size of the datasets used in the experiments. This table is a reproduction of Table 1 from the original paper**

Scalable System | cores | twitter_rv | uk-2007-05 |
---|---|---|---|

GraphChi~\cite{kyrola2012graphchi} | 2 | 3160s | 6972s |

Stratosphere~\cite{ewen2012spinning} | 16 | 2250s | - |

X-Stream~\cite{roy2013x} | 16 | 1488s | - |

Spark~\cite{gonzalez2014graphx} | 128 | 857s | 1759s |

Giraph~\cite{gonzalez2014graphx} | 128 | 596s | 1235s |

GraphLab~\cite{gonzalez2014graphx} | 128 | 249s | 833s |

GraphX~\cite{gonzalez2014graphx} | 128 | 419s | 462s |

Single Thread (SSD) | 1 | 300s | 651s |

Single Thread (RAM) | 1 | 275s | - |

Single Thread Reproduced (SSD) | 1 | 289.6s | 119.7s |

**Table 4 compares the performance of the scalable systems with that of a single-threaded implementation of 20 iterations of the pagerank algorithm. This table is a reproduction of Table 2 from the original paper. Only the rows labeled as “Reproduced” were reproduced.**

Scalable System | cores | twitter_rv | uk-2007-05 |
---|---|---|---|

Stratosphere~\cite{ewen2012spinning} | 16 | 950s | - |

X-Stream~\cite{roy2013x} | 16 | 1159s | - |

Spark~\cite{gonzalez2014graphx} | 128 | 1784s | >=8000s |

Giraph~\cite{gonzalez2014graphx} | 128 | 200s | >=8000s |

GraphLab~\cite{gonzalez2014graphx} | 128 | 242s | 714s |

GraphX~\cite{gonzalez2014graphx} | 128 | 251s | 800s |

Single Thread (SSD) | 1 | 153s | 417s |

Vertex Reproduced (SSD) | 1 | 93s | 58.8s |

Hilbert Reproduced (SSD) | 1 | 30s | 57s |

**Table 5 compares the performance of the scalable systems with that of a single-threaded implementation of label propagation algorithm. I have added the results from using both orderings of vertices as well to the table.This table is a reproduction of Table 3 from the original paper. Only the rows labeled as “Reproduced” were reproduced.**

Scalable System | cores | twitter_rv | uk-2007-05 |
---|---|---|---|

GraphLab~\cite{gonzalez2014graphx} | 128 | 249s | 833s |

GraphX~\cite{gonzalez2014graphx} | 128 | 419s | 462s |

Vertex Order (SSD) | 1 | 300s | 651s |

Vertex Order (RAM) | 1 | 275s | - |

Hilbert Order (SSD) | 1 | 242s | 256s |

Hilbert Order (RAM) | 1 | 110s | - |

Vertex Order Reproduced (SSD) | 1 | 289.6s | 119.7s |

Hilbert Order Reproduced (SSD) | 1 | 84s | 141.6s |

**Table 6 compares the performance of the scalable systems with that of a single-threaded implementation of 20 iterations of the pagerank algorithm for different orderings of vertices. This table is a reproduction of Table 4 from the original paper. Only the rows labeled as “Reproduced” were reproduced.**

Scalable System | cores | twitter_rv | uk-2007-05 |
---|---|---|---|

GraphLab~\cite{gonzalez2014graphx} | 128 | 242s | 714s |

GraphX~\cite{gonzalez2014graphx} | 128 | 251s | 800s |

Single thread (SSD) | 1 | 153s | 417s |

Union-Find (SSD) | 1 | 15s | 30s |

Vertex LabelProp Reproduced (SSD) | 1 | 93s | 58.8s |

Hilbert LabelProp Reproduced (SSD) | 1 | 30s | 57s |

Vertex Union-Find Reproduced (SSD) | 1 | 24.2s | 11.744s |

Hilbert Union-Find Reproduced (SSD) | 1 | 5.9s | 13.528s |

**Table 7 compares the performance of the scalable systems with that of single-threaded implementations of label propagation and union-find algorithms. I have added the results from using both orderings of vertices as well to the table. This table is a reproduction of Table 3 from the original paper. Only the rows labeled as “Reproduced” were reproduced.**

## Comparison with the Original

Here is the comparison of the reproduced results with the original results.

Label_Prop :

Scalable System | uk | |
---|---|---|

Original SSD | 153s | 417s |

New Hilbert SSD | 30s | 57s |

New Vertex SSD | 93s | 58.5s |

PageRank :

Scalable System | uk | |
---|---|---|

Original SSD | 300s | 651s |

New Vertex SSD | 289.6s | 119.7s |

New Hilbert SSD | 84s | 141.6s |

Union_Find :

Scalable System | uk | |
---|---|---|

Original SSD | 15s | 30s |

New Vertex SSD | 24.2s | 11.744s |

New Hilbert SSD | 5.9s | 13.528s |

The reproduced results are clearly better than the original results. The primary reasoning behind this is that since the original paper was published, the hardware and the program toolchain has improved so that we see this increase in performance. Additionally, as the original results were produced on a Mac which does not have access to Big Pages that Linux does have access to.

After I brought this to the author’s attention via a github issue, the author did reproduce the results on the original Macbook used for the experiments, as well as a powerful desktop with and without Big Pages. The results were as follows for the twitter dataset :

Algorithm | Vertex Ordering | Laptop | Vaastav | Desktop | Desktop+ |
---|---|---|---|---|---|

pagerank20 | vertex | 440.58s | 289.6s | 253.19s | 174.31s |

pagerank20 | hilbert | 141.72s | 84.0s | 70.44s | 63.50s |

label_prop | vertex | 158.34s | 93.0s | 86.68s | 73.24s |

label_prop | hilbert | 57.11s | 30.0s | 27.99s | 26.09s |

union_find | vertex | 34.34s | 24.2s | 26.56s | 21.43s |

union_find | hilbert | 12.58s | 5.9s | 7.14s | 6.69s |

The author was very kind to give me a shoutout on Twitter(No pun intended) :

That third column contains the measurements from @vaastav05, who went out to see if the COST measurements could be reproduced (Science!) and reached the surprising conclusion that the measurements have *improved* in the past four years.https://t.co/QQelcj0tAm

— Frank McSherry (@frankmcsherry) June 15, 2019

## Bugs Found

I found a couple of issues while preprocessing the graphs.

- The stats binary provided by the authors had an overflow bug where the variable reporting the number of edges reporting was overflowing as the rust compiler was inferring the type of that variable to be a 32-bit signed integer. Because of this issue, the stats binary wasreporting the number of edges for uk-2007-05 dataset to be -556233648 instead of its correct value of 3738733648 . I fixed this (very simple) issue and submitted a Pull Request. The PR was merged on Monday morning 21st January, 2019 at about 2.00 am.
- All the iterators implicitly assume that the vertex IDs are limited by the max value of an unsigned 32-bit integer (~4.2Billion). For the datasets in question, the provided code would work perfectly, however if a dataset contains a vertex ID greater than the max value it would result in overflow errors and result in incorrect results being produced. Due to time constraints and unfamiliarity of rust, I was unable to provide a fix for this issue, however I have notified the authors of the problem by creating an issue on github. The author replied to me and encouraged me to explore this in a fork. Sadly, I haven’t had the time to fix this bug yet!

## Conclusion

In conclusion, COST is a great piece of scientific research not only in terms of contribution but also due to the fact that it is highly reproducible.