Apache Spark: For Those Who Starred Spark on GitHub, What Other Projects Did They Star?

Published in Analytics Vidhya, Feb 26, 2021

Recently, I wrote a tool called “universe-lite”: https://github.com/GuandataOSS/universe-lite.

It is a lightweight ELT & ETL tool based on DuckDB and Apache Parquet, with seamless integration of Python and Java plugins. It describes ETL steps in a plain config file (Typesafe Config format).

Since I already have a hammer, I want to find some nails to practice on.

In my daily work, I use Apache Spark a lot. So my first idea was a question: for those who starred the Apache Spark GitHub repo, which other repos are they interested in? And are there some great projects that I haven’t learned about yet?

So, let’s start. The goal is very clear, but the road is tough. Still, “there are more solutions than problems”: after solving several unexpected issues, I finally got all the data, a dataset with 12,115,030 rows. (I will talk later about the “universe-lite” tool and how to use it to get GitHub data.)

1. Trend of Spark’s star count

The bar chart shows the number of new stars added each month (left y-axis).
The line chart shows the cumulative total of stars (right y-axis).
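The two series in such a chart can be derived from the star timestamps alone. The following is a minimal sketch, assuming each star is available as an ISO date string; the function name `monthly_and_cumulative` and the sample dates are hypothetical and not taken from the real dataset.

```python
from collections import Counter
from itertools import accumulate

def monthly_and_cumulative(starred_dates):
    """Return (months, new_per_month, running_total) from ISO date strings."""
    # Group by the "YYYY-MM" prefix of each date.
    per_month = Counter(d[:7] for d in starred_dates)
    months = sorted(per_month)
    new_counts = [per_month[m] for m in months]   # bar chart series
    totals = list(accumulate(new_counts))         # line chart series
    return months, new_counts, totals

# Hypothetical sample input, for illustration only.
sample = ["2014-01-03", "2014-01-20", "2014-02-11",
          "2014-04-02", "2014-04-30", "2014-04-30"]
months, new_counts, totals = monthly_and_cumulative(sample)
for row in zip(months, new_counts, totals):
    print(row)
```

Note that this sketch skips months without any new stars; a real chart would probably fill those gaps with zeros.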

2. What other projects did the same people star?

The top 5 projects are:

tensorflow (10684)
kubernetes (7152)
elasticsearch (6768)
moby (5786) (formerly named docker)
react (5624)

All of them are very successful and popular open source projects.
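A ranking like this boils down to a co-occurrence count. Here is a minimal sketch, assuming the data is a list of (user, repo) pairs; the function name `co_starred_counts`, the user names, and the sample pairs are hypothetical.

```python
from collections import Counter

def co_starred_counts(stars, target):
    """Count the other repos starred by users who also starred `target`.

    `stars` is an iterable of (user, repo) pairs.
    """
    pairs = set(stars)  # drop duplicate (user, repo) rows
    fans = {user for user, repo in pairs if repo == target}
    return Counter(repo for user, repo in pairs
                   if user in fans and repo != target)

# Hypothetical sample input, for illustration only.
stars = [
    ("ann", "apache/spark"), ("ann", "tensorflow/tensorflow"),
    ("bob", "apache/spark"), ("bob", "tensorflow/tensorflow"),
    ("bob", "apache/kafka"),
    ("cat", "tensorflow/tensorflow"),  # cat did not star Spark
]
counts = co_starred_counts(stars, "apache/spark")
print(counts.most_common(5))
```

On 12 million rows, the same logic would more likely be expressed as a SQL self-join or a group-by in an engine such as DuckDB; the in-memory version above only illustrates the idea.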

3. What other “Spark-related” projects are starred?

The top projects listed in section 2 are all popular open source projects. But what I really want to know about are the “Spark-related” projects.

So, how can I tell whether a project is Spark-related or not? My first try was to check whether the project’s “name” or “description” contains the keyword “spark”. But the result was not very promising.

Then another way came to mind: filter the data in the table above and keep only the rows where “percent” is above 20%. This means that a very high percentage of that project’s followers have also starred Spark, so those projects are very likely to be part of the Spark ecosystem.
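To make the filter concrete, here is a minimal sketch. It assumes that “percent” means the number of a project’s stargazers who also starred Spark, divided by that project’s total star count. The function name `likely_ecosystem`, the repo names, and all numbers are hypothetical.

```python
def likely_ecosystem(co_counts, total_stars, threshold=0.20):
    """Keep repos where the share of stargazers who also starred Spark
    exceeds `threshold`, sorted by the shared count (descending).

    co_counts:   {repo: number of its stargazers who also starred Spark}
    total_stars: {repo: total number of stargazers}
    """
    rows = []
    for repo, shared in co_counts.items():
        total = total_stars.get(repo, 0)
        if total and shared / total > threshold:
            rows.append((repo, shared, shared / total))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Hypothetical numbers, for illustration only.
co_counts = {"big_general_repo": 1000, "small_related_repo": 300}
total_stars = {"big_general_repo": 20000, "small_related_repo": 1000}

related = likely_ecosystem(co_counts, total_stars)
print(related)
```

In this toy input, the large general-purpose repo shares more stargazers with Spark in absolute terms but only 5% of its own audience, so it is filtered out, while the smaller repo at 30% is kept.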

The result table is:

The top 5 projects are:

kafka (5382)
flink (4955)
hadoop (4473)
scala (3693)
akka (3236)

Yes, these are just as expected.

4. Among people who starred Spark, how is the total number of starred projects distributed?

My GitHub account has starred about 700 projects. I want to know how many projects other people have starred.

Based on this chart:

there are 3634 people who starred fewer than 10 projects on GitHub
most people starred fewer than 1000 projects
the people who gave out the most stars starred nearly 50K projects
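A distribution like this can be produced by bucketing each person’s total star count. The following is a minimal sketch; the bucket edges, the function name `bucket_counts`, and the sample values are assumptions chosen for illustration, not the bins used in the chart.

```python
from bisect import bisect_right
from collections import Counter

def bucket_counts(stars_per_user, edges=(10, 100, 1000, 10000)):
    """Count users per bucket: <10, 10-99, 100-999, 1000-9999, >=10000."""
    labels = ([f"<{edges[0]}"]
              + [f"{lo}-{hi - 1}" for lo, hi in zip(edges, edges[1:])]
              + [f">={edges[-1]}"])
    # bisect_right gives the index of the bucket each value falls into.
    hist = Counter(labels[bisect_right(edges, n)] for n in stars_per_user)
    return {label: hist.get(label, 0) for label in labels}

# Hypothetical sample input, for illustration only.
sample = [3, 9, 10, 700, 999, 1000, 50000]
dist = bucket_counts(sample)
for label, count in dist.items():
    print(label, count)
```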

Ending Words

Data is valuable, and charts can talk. I will bring more data stories later, so stay tuned!