If you are happy that you are growing, clap your hands!!

July 11, 2018

I am trying to get into Stoicism and last night I read this quote:

“Nothing is slower than the true birth of a man.”

Well, this week I understand perfectly what it means.

Hello and welcome to Week-9 of Google Summer of Code!

In Brief

Last week, we left off with the following tasks.

Task-2 [Under Progress]:

Try to participate in the discussions regarding GMD and other metrics. This task is moving a little slowly, but only a few metrics remain to be properly defined, so it can wait a little longer until we set up the remaining infrastructure.

Task-3 [Completed]:

Yes, Task-3, in which I was creating new functions to calculate the metrics, has been completed. Jesus merged the pull request and we can finally move on from this.

<Rant> Although this PR was merged, I still had to improve the tests as they were… well, not that great. I’ve been reading about Test Driven Development and I find it quite fascinating. In TDD, we write the tests first and then create the functions that pass those tests. Here, though, I had already written the code, so I had to write tests which checked whether the functions worked properly or not. I’ll expand below on how I was able to write tests which I think are pretty good. </Rant>

Related PRs: The Pull request adding tests for manuscripts2

Task-5 [Almost Done]:

We have made progress in adding more fields to the enriched indices, which can be used to calculate the GMD metrics. This task was to add code to grimoire-elk. My mentor, Valerio, made the initial PR which added the infrastructure for me to add the enriched fields for the Pull Requests category of the GitHub backend in grimoire-elk. More on this below.

Related PRs:

Task-7-B [Under Progress]:

This task is to create the reports for the metrics that manuscripts currently produces, using the new functions. I read the manuscripts code base again, trying to figure out how to get the different metrics and generate CSV files from them. I am hoping to discuss this with Jesus and Valerio and finish this task by Friday.

In Detail

I’ll start by describing the tests that I wrote:

Tests

Initially, the tests that I added only checked whether the functions created the appropriate aggregations, fields, and so on. They didn’t actually call the methods defined in the Query class.

I had to try running them using actual data fetched from Elasticsearch. GrimoireLab consists of a lot of tools, so I looked into Perceval’s tests: Perceval also has to fetch data from the internet using different backend APIs, so I figured there might be something useful there. Perceval uses httpretty, which taps into Python’s built-in socket module. We can register the URL to be queried and the data to be returned from it using httpretty.register_uri, and it will mimic the actual URL. Perceval uses it brilliantly, but I couldn’t, because the data I was dealing with had to be processed by Elasticsearch first (aggregations), and mimicking that data would have been another devil to deal with.
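To make the mocking idea concrete: the pattern is to intercept the network call and hand back canned data instead of hitting the real API. Since httpretty is a third-party library, here is the same pattern sketched with the standard library’s unittest.mock; fetch_issues and the payload are hypothetical, purely for illustration:

```python
import json
import urllib.request
from unittest import mock

# Canned payload standing in for a real GitHub API response
CANNED = json.dumps([{"number": 1, "state": "open"}]).encode()

def fetch_issues(url):
    # hypothetical helper: download and parse a JSON API response
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# Patch the network call so the "download" returns our canned bytes,
# analogous to what httpretty.register_uri does for a registered URL.
with mock.patch("urllib.request.urlopen") as fake:
    fake.return_value.__enter__.return_value.read.return_value = CANNED
    issues = fetch_issues("https://api.github.com/repos/org/repo/issues")

print(issues)  # the canned data; no network involved
```

httpretty goes one level deeper (the socket layer), which is why Perceval’s tests can exercise real request-building code, but the test-time shape is the same: register a URL, return prepared data.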

I then thought of using Elasticsearch directly, but that would have failed the continuous integration that Travis CI runs. I contacted Jesus (when in doubt, Jesus will help you out :P) and he explained that Travis CI allows setting up Elasticsearch and that I should look into Mordred. That hit the spot for me.
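For reference, enabling Elasticsearch in a Travis build comes down to declaring it as a service in .travis.yml. A minimal, illustrative fragment along the lines of what the Travis CI docs describe (the sleep and its duration are just a common workaround to let the service boot; exact values vary per project):

```yaml
# Minimal .travis.yml fragment: start Elasticsearch as a build service
services:
  - elasticsearch

before_script:
  # give Elasticsearch a few seconds to start before the tests query it
  - sleep 10
```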

The next problem came when I had to insert data for testing purposes into Elasticsearch. At first I tried the helpers class that elasticsearch-py provides, which lets us bulk index the data. I used the mappings for an enriched GitHub index, but the data kept throwing an error because some of the fields could not be aggregated. When I removed the mappings, I couldn’t access the data and got an empty response from Elasticsearch.
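The aggregation errors make sense in hindsight: in an Elasticsearch mapping, analysed "text" fields cannot be used in aggregations by default, while "keyword" fields hold exact values and aggregate fine. A tiny, hypothetical mapping fragment showing the distinction (the field names are invented, not the real enriched schema):

```python
# Hypothetical index mapping: "text" fields are analysed and not
# aggregatable by default; "keyword" fields can feed terms aggregations.
mapping = {
    "mappings": {
        "properties": {
            "message": {"type": "text"},         # full-text search only
            "author_name": {"type": "keyword"},  # safe to aggregate on
        }
    }
}

print(mapping["mappings"]["properties"]["author_name"]["type"])
```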

I was stuck here for quite a while, but thanks to Task-5 I had been reading the source code of grimoire-elk, so I decided to use the methods that gelk itself uses to insert data into Elasticsearch: the feed_backend and enrich_backend functions. They worked like a charm!!

Currently, the tests fetch data from a git repo and analyse it to test the functions. I couldn’t query the issues data as it would require a GitHub API token for smooth functioning.

All in all, this took quite a bit of my time this week, but the tests work now so I am glad.

Task-5:

This task was to add methods to calculate GitHub Pull Requests data and enrich it so that GMD-Code Development metrics, along with some others, can be calculated.

In this task, we decided to create separate raw and enriched indices for the GitHub Pull Requests data. I did create a PR for this task, but it was highly inefficient and would have used a separate backend (github_prs) to calculate the data. Valerio made a better suggestion with a PR which used far less code and is highly efficient. We decided that I would work on adding the pull request enriched fields on top of that. This PR adds the code to calculate the enriched fields for Pull Requests data.


NOTE:

  • This task taught me that I should never be attached to my code and that there is almost always a better method to do what you are doing.
  • It also taught me that I should be more careful when reading the source code of projects. For example, the solution that Valerio proposed used about 5x fewer lines than the solution I proposed.

There are some metrics still remaining in Code development which I’ll discuss today with my mentors:

  • Code Reviews: What is the number of code reviews?
  • Code Review Efficiency: What is the ratio of merged code changes to abandoned code change requests?
  • Maintainer Response to Merge Request Duration: What is the duration of time for a maintainer to make a first response to a code merge request?
  • Code Review Iteration: What is the number of iterations that occur before a merge request is accepted or declined?
  • Pull Request Comment Duration: The difference between the timestamp of the pull request creation date and the most recent comment on the pull request.
  • Pull Request Comment Diversity: Number of distinct people discussing each pull request.
  • Pull Request Comments: Number of comments on each pull request.
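To make a couple of these concrete, here is a rough sketch of how Comment Duration and Review Efficiency could be computed from pull request records. The field names and values are invented for illustration; the real enriched indices use different fields:

```python
from datetime import datetime, timezone

# Hypothetical PR records; the real enriched schema differs.
prs = [
    {"created_at": "2018-07-01T10:00:00Z",
     "last_comment_at": "2018-07-03T10:00:00Z", "merged": True},
    {"created_at": "2018-07-02T09:00:00Z",
     "last_comment_at": "2018-07-02T21:00:00Z", "merged": False},
]

def parse(ts):
    """Parse an ISO-8601 UTC timestamp of the form used above."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

# Pull Request Comment Duration: creation date vs. most recent comment, in hours
durations = [
    (parse(p["last_comment_at"]) - parse(p["created_at"])).total_seconds() / 3600
    for p in prs
]

# Code Review Efficiency: merged changes / abandoned change requests
merged = sum(p["merged"] for p in prs)
abandoned = sum(not p["merged"] for p in prs)
efficiency = merged / abandoned if abandoned else None

print(durations)   # [48.0, 12.0] hours
print(efficiency)  # 1.0
```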

Once these are finalised, I’ll add the code for the corresponding fields and Task-5 can be marked as complete.

Task-7-B:

Here, I had to start by creating the infrastructure to generate the reports using the newly added functions. I couldn’t progress on this as much as I would have liked to, but I plan on getting this done by Friday. At least the CSV files.

Earlier I was thinking that we could use a similar structure of defining a different class for each metric, as Manuscripts does currently, but that might defeat the whole purpose of creating the new functions to calculate these metrics. We need a better approach than the one currently used: that method is clever, but it still requires a lot of repetition, which we want to avoid.
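One direction we could explore (purely a sketch of the idea, not the agreed design): instead of one class per metric, keep a table mapping metric names to the new query functions and drive the CSV generation from that table, so adding a metric is one entry rather than one class. The metric functions below are hypothetical stand-ins:

```python
import csv
import io

# Hypothetical metric functions standing in for the real manuscripts2
# query functions; each takes an index name and returns a value.
METRICS = {
    "opened_issues": lambda index: 42,
    "closed_issues": lambda index: 30,
}

def metrics_csv(index):
    """Render every registered metric for an index as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["metric", "value"])
    for name, fn in sorted(METRICS.items()):
        writer.writerow([name, fn(index)])
    return out.getvalue()

print(metrics_csv("git_enriched"))
```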

This task might also lead us to change the manuscripts2 design proposed earlier.


All in all, I struggled a lot this week, but I learned and grew as well.


Tasks for week-10:

  • We have added the functionality to get the enriched data for Pull Requests in grimoirelab-elk. However, the comments for each of the PRs are not being fetched here. They are available when we fetch issues and PRs together, but not for PRs alone, so we’ll have to look into that. I am opening an issue in grimoirelab-elk to discuss this. The comments are important since some of the GMD code development metrics depend on PR comments. (Still Task-5)
  • The second is to find a way to calculate the number of code review iterations, i.e. the number of times changes are requested and new code is added by the contributor submitting the PR (New Task-8). I am opening an issue in Perceval for the same.
  • The third task (Task-7-B) is to make a PR generating the CSV files produced in the report and to follow up on it.