Pause, reflect and Plan ahead!

June 27, 2018

Hey all! I am sorry for being gone for so long.

Welcome to Week-7 of Google Summer of Code. First of all, I’d like to thank my mentors, Jesus and Valerio, for passing me in the first evaluations! I am very happy about it. They’ve been really great and supportive throughout, and I’d like to thank them for that.

Since I skipped the last week, this post might be a little longer than usual.

In Brief

Task-1 [COMPLETED]:

This task was to make manuscripts use the elasticsearch_dsl library wherever it queries Elasticsearch for data. The PR for this has been merged into the main repo by Jesus, finishing this task.

Task-2 [UNDER-PROGRESS]:

This task was to open issues for the metrics still missing in Manuscripts. There is actually a great discussion going on in the mailing lists about the how-tos of the metrics. Once the community comes to a conclusion, I think I’ll have a better understanding of what needs to be done. Till then, only the metrics under Code Development remain, which I’ll (hopefully) be adding this week. More about this below.

NOTE: There is going to be a meeting regarding the GMD metrics on Thursday at 11 AM CDT where the above discussion will be continued. So please be there if you have any suggestions or if you want to just follow along.

Task-3 [UNDER-PROGRESS]:

This task is still underway. In it, we had to redesign the functions calculating the metrics.

Task-4 [UNDER-PROGRESS]:

I was able to create notebooks for Altair and Seaborn. They are statistical visualisation libraries, which means the plots depend on how varied the data is and what we are trying to plot. They need data in a certain format known as “tidy data”. Read about tidy data here. It’s quite an interesting read. I’ll explain more about tidy data and about Altair & Seaborn below.

Task-5 [UNDER-PROGRESS]:

This task is to make PRs for the metrics which can already be calculated. I was able to make a PR calculating the time to first response duration.

Extra:

I was able to solve another old issue regarding the hardcoded start date in manuscripts: the default start date for any data source was 2015-01-01. This PR calculates the minimum date across all the data sources being queried and uses it as the start date if one isn’t provided.
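The idea behind that PR can be sketched in a few lines. This is a minimal illustration, not the actual manuscripts code; the function name and inputs are hypothetical:

```python
from datetime import datetime

def earliest_start_date(min_dates, default=datetime(2015, 1, 1)):
    """Pick the earliest creation date reported across the queried
    data sources, falling back to the old hardcoded default when no
    data source reports a date (hypothetical helper, for illustration)."""
    dates = [d for d in min_dates if d is not None]
    return min(dates) if dates else default

# Minimum dates reported by three (made-up) data sources:
sources = [datetime(2016, 3, 1), datetime(2014, 7, 15), None]
print(earliest_start_date(sources))  # -> 2014-07-15 00:00:00
```
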

Let’s dive in deeper!


In Detail

Tasks-2 & 5:

Since Task-5 was derived from the issues created in Task-2, it’ll be appropriate to describe them both together. These tasks mainly require adding code to Grimoire-elk and Perceval.

A considerable amount of my time this week went into understanding both of these code bases. (That was fun!)

Perceval is used to fetch data from different data sources such as git, GitHub, mailing lists, gerrit and so on.

Grimoire-elk, using perceval, fetches this data and saves it as raw indices in Elasticsearch; Perceval’s job is then done. After that, Grimoire-elk queries Elasticsearch to retrieve this raw data, enriches it by calculating the fields needed to compute the metrics, and saves this enriched data back into Elasticsearch in a second index (with possibly a similar name). This is, in short, what happens when we run the command:

$ p2o.py --enrich --index <raw index name> --index-enrich <enriched index name> -e <URI of where elasticsearch is running> --no_inc --debug <data source> <corresponding parameters for the data source as given to perceval>

The raw index name is where the data fetched from perceval will be stored. The enriched index name is where the processed data will be stored after it’s enriched by grimoire-elk.


A bit about how PERCEVAL works

Perceval supports a variety of data sources and has a specific set of functions for each of them to fetch the data. To query any data source, we give the command:

$ perceval <datasource> [-options] [arguments for that data source]

All the classes for the data sources are derived from a common Backend class.

Here, we were dealing with the GitHub data source to calculate the metrics under Code Development, since almost all of these metrics depend on it. The general command to fetch from the GitHub data source is:

$ perceval github <user/org-name> <repo-name> --token <github token>

So, by default, perceval fetches all the items (issues and PRs in the repository) as if they were issues. This behaviour is defined by the category parameter that is passed to the fetch function when querying the repo for data. We can change this behaviour and fetch the data about pull requests (code reviews, commits made in that PR and such) by passing the category parameter to the perceval command:

$ perceval github --category pull_requests <user/org-name> <repo-name> --token <github token>

The issue and PR data that is fetched has a data field, apart from the other metadata fields, which contains the data that we need and which is enriched by grimoire-elk.

The data fetched under issues has the following fields: [‘assignee’, ‘assignee_data’, ‘assignees’, ‘assignees_data’, ‘author_association’, ‘body’, ‘closed_at’, ‘comments’, ‘comments_data’, ‘comments_url’, ‘created_at’, ‘events_url’, ‘html_url’, ‘id’, ‘labels’, ‘labels_url’, ‘locked’, ‘milestone’, ‘node_id’, ‘number’, ‘pull_request’, ‘reactions’, ‘reactions_data’, ‘repository_url’, ‘state’, ‘title’, ‘updated_at’, ‘url’, ‘user’, ‘user_data’]

And the data fetched under pull_requests has the following fields: [’_links’, ‘additions’, ‘assignee’, ‘assignees’, ‘author_association’, ‘base’, ‘body’, ‘changed_files’, ‘closed_at’, ‘comments’, ‘comments_url’, ‘commits’, ‘commits_data’, ‘commits_url’, ‘created_at’, ‘deletions’, ‘diff_url’, ‘head’, ‘html_url’, ‘id’, ‘issue_url’, ‘labels’, ‘locked’, ‘maintainer_can_modify’, ‘merge_commit_sha’, ‘mergeable’, ‘mergeable_state’, ‘merged’, ‘merged_at’, ‘merged_by’, ‘merged_by_data’, ‘milestone’, ‘node_id’, ‘number’, ‘patch_url’, ‘rebaseable’, ‘requested_reviewers’, ‘requested_reviewers_data’, ‘requested_teams’, ‘review_comment_url’, ‘review_comments’, ‘review_comments_data’, ‘review_comments_url’, ‘state’, ‘statuses_url’, ‘title’, ‘updated_at’, ‘url’, ‘user’, ‘user_data’]

As we can see, the latter one can provide us with a lot more insights.


Now, back to grimoire-elk. Grimoire-elk is a little more complex than perceval. It has a bunch of utilities, such as the p2o.py script we saw earlier, which can be used to perform certain tasks. Like perceval, grimoire-elk also has a specific set of functions for each data source. These functions are responsible for enriching the raw data stored in Elasticsearch and creating enriched indices out of it.

For github, the caveat is that there are no functions in the enriched/github.py file that can parse the pull_requests raw data fetched from perceval using the --category flag, as we saw above. We need to change this and add functions that process the raw pull request data so that the metrics under Code Development can be calculated.

We can modify the get_rich_item function to treat pull requests and issues as separate items, or as the same kind of item with the PR-only fields set to None for plain issues.
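The second option could look roughly like this. To be clear, this is a sketch of the idea and not grimoire-elk’s actual enrichment code; the field names are a small, made-up subset:

```python
# PR-only fields (made-up subset) that plain issues don't have.
PR_ONLY_FIELDS = ['additions', 'deletions', 'changed_files', 'merged_at']

def get_rich_item(raw_item):
    """Sketch: enrich a raw perceval item, tagging it as an issue or a
    pull request and setting PR-only fields to None for issues."""
    data = raw_item['data']
    rich = {
        'id': data['id'],
        'created_at': data['created_at'],
        'closed_at': data.get('closed_at'),
    }
    if raw_item.get('category') == 'pull_request':
        rich['item_type'] = 'pull_request'
        for field in PR_ONLY_FIELDS:
            rich[field] = data.get(field)
    else:
        rich['item_type'] = 'issue'
        for field in PR_ONLY_FIELDS:
            rich[field] = None  # not available for plain issues
    return rich

issue = {'category': 'issue',
         'data': {'id': 1, 'created_at': '2018-06-01T00:00:00Z', 'closed_at': None}}
print(get_rich_item(issue)['additions'])  # -> None
```
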

This is what I will be focusing on this coming week.

I was able to create a PR adding the time_to_first_attention field to the github data source, though. This field calculates the first time a reaction was made to a PR/issue by a user other than the one who created that PR/issue.

Other than that, the discussion about the GMD metrics is still going on, and I think we’ll make good progress and come to definite conclusions in tomorrow’s meeting. That is it for Tasks-2 & 5.


Task-4:

Task 4 was to play with visualizations. We are planning to create interactive visualizations in Jupyter Notebooks, using Plotly and Altair, so users can also dive in deeper using the visuals and get a better understanding of the metrics. You might have seen the Plotly notebook that I created the previous week. This week I focused on the Altair and Seaborn (work in progress) notebooks.


Important Note: Initially we were thinking of converting these interactive graphs and charts into png/jpeg/pdf files so that they could be directly included in the PDF report generated by manuscripts. But I think that won’t be possible because:

  • Plotly does not actually provide offline users with the functionality to directly export a visualization as an image or PDF file. Read about it here. What Plotly does provide is an HTML file. That HTML file contains the JS functions and data used to create the interactive visualisation. We can use this HTML file and take a screenshot of the generated visualization, as done by a Plotly user. But this technique is a bit hacky, will not always work (I’ve tried and I’ve failed), and requires users to download additional libraries like selenium and PhantomJS, as well as the Chrome or gecko drivers, which actually open the HTML file to take the screenshot.

  • Altair does have a function to export a chart as an image, using a similar approach to the one described above, and it works brilliantly. But this too requires users to download the Chrome/gecko drivers and selenium. This is still doable, if we think about it, because it’s not hacky. Read about how to do this here.

  • The reason I think all interactive visualisation libraries require these extra libraries (I also tried Bokeh to see if it was different) is that, unlike Matplotlib, they create visualisations using JS functions and data in JSON format. Converting that into static images requires extra processing and hence cannot be done easily.

All in all, we might end up using pandas’ built-in visualisations and the classic Matplotlib, or even Seaborn, for the static visualisations, and limit Altair and Plotly to the notebooks and HTML pages for interactive visualisations. I hope to discuss this more in today’s meeting.


Altair and Seaborn require data in a specific format, known as tidy data. So, what is tidy data exactly? The usual human-readable data has parameters as column names and the values for those parameters as rows. Example:

date				opened_issues		closed_issues		remaining_issues
2016-01-01 00:00:00+00:00		1			1			NaN
2016-02-01 00:00:00+00:00		2			1			1.0
2016-03-01 00:00:00+00:00		8			4			4.0
2016-04-01 00:00:00+00:00		1			1			0.0
2016-05-01 00:00:00+00:00		0			0			0.0
2016-06-01 00:00:00+00:00		3			1			1.0
2016-07-01 00:00:00+00:00		2			2			0.0
2016-08-01 00:00:00+00:00		1			0			0.0
2016-09-01 00:00:00+00:00		7			5			2.0
2016-10-01 00:00:00+00:00		12			7			2.0
2016-11-01 00:00:00+00:00		7			4			1.0
2016-12-01 00:00:00+00:00		4			6			1.0
....

We can see that each row depends on the column headings to tell us which parameter we are looking at: extra context is needed to distinguish between rows and fields.

Let’s have a look at the same data in tidy form:

Date				issue_type		count
2016-01-01T00:00:00+00:00	opened_issues		1.0
2016-01-01T00:00:00+00:00	closed_issues		1.0
2016-02-01T00:00:00+00:00	opened_issues		2.0
2016-02-01T00:00:00+00:00	closed_issues		1.0
2016-02-01T00:00:00+00:00	remaining_issues	1.0
2016-03-01T00:00:00+00:00	opened_issues		8.0
2016-03-01T00:00:00+00:00	closed_issues		4.0
2016-03-01T00:00:00+00:00	remaining_issues	4.0
2016-04-01T00:00:00+00:00	opened_issues		1.0
2016-04-01T00:00:00+00:00	closed_issues		1.0
2016-04-01T00:00:00+00:00	remaining_issues	0.0
2016-05-01T00:00:00+00:00	opened_issues		0.0
2016-05-01T00:00:00+00:00	closed_issues		0.0
2016-05-01T00:00:00+00:00	remaining_issues	0.0
2016-06-01T00:00:00+00:00	opened_issues		3.0
2016-06-01T00:00:00+00:00	closed_issues		1.0
2016-06-01T00:00:00+00:00	remaining_issues	1.0
....

The table above shows what tidy data looks like. Even though some fields may repeat across successive rows, each row is unique by itself; that is the emphasis here.

Each row contains the exact information required to plot it, which makes plotting the data easier. The data in all the Altair plots was first converted into tidy format and then plotted.
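The wide-to-tidy conversion above is exactly what pandas’ melt does. A small sketch on a made-up slice of the table (not the actual notebook code):

```python
import pandas as pd

# Hypothetical slice of the wide-format table shown above.
wide = pd.DataFrame({
    'date': ['2016-01-01', '2016-02-01', '2016-03-01'],
    'opened_issues': [1, 2, 8],
    'closed_issues': [1, 1, 4],
})

# melt() turns the column names into values of an 'issue_type' column,
# giving one (date, issue_type, count) observation per row.
tidy = wide.melt(id_vars='date', var_name='issue_type', value_name='count')
print(tidy)
```

Once the data is in this shape, Altair and Seaborn can map `issue_type` to colour and `count` to the y-axis without any further reshaping.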

You can view the Altair plots here.

Altair is a fun library; we’ll just have to preprocess the data a bit to take full advantage of it.


For week-8 I am in for a spin!

  • Modify the Query class in new_functions so as to use the minimum number of auxiliary variables.
  • Implement Altair interactive and static visualizations for some of the GMD metrics so as to test them properly.
  • Create the whole tool chain for manuscripts2 with the modified new classes, such that we can implement metrics and visualisations and create an unstructured report with the necessary static charts.