We are there.

August 11, 2018

Welcome to the final week of Google Summer of Code-18!

This is the last in the series of blog posts that I’ve been writing about my progress during GSoC-18.

I am grateful to my mentors, Jesus M. Gonzalez-Barahona and Valerio Cosentino, for guiding and helping me throughout the summer.

This blog post also serves as the final report of the work that I’ve done during GSoC. The second part of this post is divided into the tasks that I performed and the issues, PRs and files that were created to complete those tasks.

For those of you who are reading this blog for the first time: Bold text is often a link to something important, so please click on it :)


Summary

Show me what you got!

  • The main functionality that I’ve added to the Manuscripts project is the manuscripts2 module. It is an iteration over manuscripts. Manuscripts2 adds the functionality to query Elasticsearch using chainable functions and classes. These functions allow the user to focus on calculating the metrics rather than thinking about how to fetch the data. To learn more about manuscripts2, please read the README.md for the module.

  • An example of the final report generated using manuscripts2 can be found here.

  • Wrote tests for manuscripts and manuscripts2, which did not exist earlier. These tests help us keep manuscripts functioning properly.

  • Refactored the esquery.py and metrics.py files in manuscripts to use elasticsearch_dsl instead of querying Elasticsearch directly. This exercise gave me a good look at the inner workings of manuscripts and at what we needed to improve.

  • Other than this, I researched different ways we can analyse and visualise the GMD metrics and add them to the report. These changes are not completely included in manuscripts yet, but I’ll be working on them with Jesus in the coming months.
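To give a feel for what "chainable" means in the first point above, here is a minimal sketch in plain Python. It is not the manuscripts2 API: the class `ChainableQuery`, its method names, and the field names `grimoire_creation_date` and `author_uuid` are assumptions made up for this illustration. Each method returns the object itself, so calls can be strung together, and the final call produces an Elasticsearch-style request body.

```python
import json


class ChainableQuery:
    """Hypothetical chainable builder (illustration only, not manuscripts2)."""

    def __init__(self, index):
        self.index = index
        self.filters = []
        self.aggregations = {}

    def since(self, field, date):
        # Keep only documents whose `field` is on or after `date`.
        self.filters.append({"range": {field: {"gte": date}}})
        return self  # returning self is what makes the calls chainable

    def is_term(self, field, value):
        # Keep only documents where `field` equals `value`.
        self.filters.append({"term": {field: value}})
        return self

    def unique_count(self, field):
        # Ask for the number of distinct values of `field`.
        self.aggregations["unique_" + field] = {"cardinality": {"field": field}}
        return self

    def to_dict(self):
        # Assemble the request body; size 0 asks for aggregations only.
        return {
            "size": 0,
            "query": {"bool": {"filter": self.filters}},
            "aggs": self.aggregations,
        }


# Assumed field names, chosen only for the example.
body = (
    ChainableQuery("git")
    .since("grimoire_creation_date", "2018-01-01")
    .unique_count("author_uuid")
    .to_dict()
)
print(json.dumps(body, indent=2))
```

The point of the pattern is that the person computing a metric reads and writes one fluent expression, while the details of the request body stay inside the builder.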

There are some smaller contributions too, which are covered below as we go task by task.

What remains to be done?

  • Experimentation with Markdown and Notebooks so as to provide the user with a more hands-on approach to look at the metrics.

  • Extend the Manuscripts project to create reports for the GMD metrics using manuscripts2 (the module was created keeping those metrics in mind). This Notebook does a decent job in calculating the metrics but we still need to be able to generate a PDF report for the GMD metrics.

  • We also plan to add interactive HTML pages to manuscripts when a report is created. These pages will contain visualizations using Altair and Plotly libraries.

  • Currently, manuscripts2 supports the 3 basic data sources: git, github_issues, github_prs. I need to add support for gerrit, stackexchange, mailinglists, issue tracking systems (its) and jira. But, since the whole initial infrastructure for the reports has been created, adding these data sources will be just a matter of replicating the necessary code for each of them.

  • Make manuscripts2 mordred compatible. We need to add functionality to take as input a config file containing project data and create a report using that file.

  • Add functionality to create Project reports. Currently, we can create reports for individual repositories only.

The good thing about these tasks is that the foundation for most of them has already been laid. We will just have to build on top of that.


Tasks Undertaken

At the start, Jesus had the brilliant idea of dividing the project into separate tasks. This helped me keep track of what was to be done and how things progressed each week.

Task 1 [COMPLETED]:

Related blog posts: Week-2, Week-3, Week-4, Week-5.

Task 1-A was to convert the functions in the manuscripts/esquery.py file into elasticsearch_dsl based functions. These functions were using the requests library to query Elasticsearch.

Task 1-B was to update the code in manuscripts/metrics/metrics.py because of the changes in esquery.py.

Related issues:

Related Pull Requests:

Task 2 [PARTIALLY COMPLETE]:

Related blog posts: Week-2, Week-3, Week-4, Week-5, Week-7

The major theme for this task was to go through the list of GMD metrics, figure out which metrics were missing or had definitions that were unclear or needed improvement, and try to make them more structured. The main objective was to get the community to talk about the missing definitions. I opened issues for the metrics that were unclear to me, and we got to see some quite interesting discussions on how these metrics should be further defined.

Task-5 was created to submit pull requests for most of the issues related to this task.

Related Notebooks:

Related issues:

  • wg-gmd#1: Better and complete descriptions of the metrics under GMD. This issue describes in brief the problem of weak and unclear definitions of metrics.
  • wg-gmd#5: GMD: issue resolution efficiency - What are the abandoned issues?. Possible PR: wg-gmd#12.
  • wg-gmd#6: How to calculate open issue age. The conclusion from the discussion on this issue was to calculate the open issue age in the Notebook mentioned above.
  • wg-gmd#7: How to calculate closed issue resolution duration.
  • wg-gmd#8: First response to issue duration. Closed by grimoirelab-elk#383.
  • wg-gmd#9: New contributors metrics. Closed by wg-gmd#13.
  • wg-gmd#10: How to calculate new contributing organizations?
  • wg-gmd#11: What are sub-projects?
  • wg-gmd#14: Who can be considered a maintainer of a project/repo?
  • grimoirelab-elk#364: general discussion about which metrics need more attention. Closed by grimoirelab-elk#399 and grimoirelab-elk#401.
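The discussions on open issue age and closed issue resolution duration (wg-gmd#6 and wg-gmd#7) come down to simple date arithmetic once a definition is chosen. The sketch below shows one possible reading, not the definition agreed on in those issues: the age of an open issue is the time from its creation until "now", and the resolution duration of a closed issue is the time from its creation until it was closed. The sample data and the field names `created_at` and `closed_at` are assumptions for the example.

```python
from datetime import datetime
from statistics import mean

# Made-up sample issues; closed_at is None while an issue is still open.
issues = [
    {"created_at": "2018-07-01T00:00:00", "closed_at": None},
    {"created_at": "2018-07-22T00:00:00", "closed_at": "2018-07-25T12:00:00"},
    {"created_at": "2018-08-01T00:00:00", "closed_at": None},
    {"created_at": "2018-06-01T00:00:00", "closed_at": "2018-06-11T00:00:00"},
]


def parse(timestamp):
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")


def days_between(start, end):
    return (end - start).total_seconds() / 86400


def open_issue_ages(issues, now):
    # Age in days of every issue that has not been closed yet.
    return [
        days_between(parse(issue["created_at"]), now)
        for issue in issues
        if issue["closed_at"] is None
    ]


def resolution_durations(issues):
    # Days from creation to close for every closed issue.
    return [
        days_between(parse(issue["created_at"]), parse(issue["closed_at"]))
        for issue in issues
        if issue["closed_at"] is not None
    ]


now = datetime(2018, 8, 11)
ages = open_issue_ages(issues, now)
durations = resolution_durations(issues)
print("open issue ages (days):", ages, "mean:", mean(ages))
print("resolution durations (days):", durations, "mean:", mean(durations))
```

Whether to report the mean, the median or a distribution, and whether reopened issues count, are exactly the kind of questions the issues above were opened to settle.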

Related Pull Requests:

  • grimoirelab-elk#419: Add the enrich_pull_requests studies to grimoirelab-elk.
  • grimoirelab-elk#383: Add the time to first response field to the issues being enriched in grimoirelab-elk.
  • wg-gmd#12: Adds detailed definition of Issue Resolution Efficiency metric.
  • wg-gmd#13: Refine metrics about new contributors.
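As an illustration of the "time to first response" idea, here is a minimal sketch. It is not the code from grimoirelab-elk#383: the data layout, the field names, and the choice to ignore comments written by the issue's own author are all assumptions made for this example.

```python
from datetime import datetime


def parse(timestamp):
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")


def time_to_first_response(issue):
    """Days from issue creation to the first comment by someone else.

    Returns None when nobody other than the author has commented.
    """
    created = parse(issue["created_at"])
    responses = [
        parse(comment["created_at"])
        for comment in issue["comments"]
        if comment["user"] != issue["user"]  # assumption: skip the author's own comments
    ]
    if not responses:
        return None
    return (min(responses) - created).total_seconds() / 86400


# Made-up sample issue.
issue = {
    "user": "alice",
    "created_at": "2018-07-01T09:00:00",
    "comments": [
        {"user": "alice", "created_at": "2018-07-01T10:00:00"},
        {"user": "bob", "created_at": "2018-07-02T21:00:00"},
        {"user": "carol", "created_at": "2018-07-03T09:00:00"},
    ],
}

response_days = time_to_first_response(issue)
print(response_days)
```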

Task 3 [COMPLETED]:

Related blog posts: Week-2, Week-3, Week-4, Week-5, Week-7, Week-8, Week-9

This task was the basis of the manuscripts2 module. In this task, I had to experiment with creating chainable functions able to calculate the metrics. I first started off with calculating the GMD metrics in notebooks which are linked below. These metrics were calculated using the elasticsearch_dsl module. Then the common parts were converted into functions and with further iterations, we were able to create the manuscripts2 module.

Related Notebooks:

Related issues:

Related Pull Requests:

Task 4 [COMPLETED]:

Related blog posts: Week-4, Week-5, Week-7

This, I think, was the most interesting of all the tasks. I had to look at different visualization modules (Altair, Plotly, Seaborn) and figure out how we could use their power to take a better look at the metrics. The idea was to come up with visualisations that compare the different metrics and contrast certain fields, which would lead to some interesting analysis.

Related Notebooks:

Task 5 [COMPLETED]:

Related blog posts: Week-5, Week-7, Week-8, Week-9, Week-10, Week-11, Week-12, Week-13.

This is the task I am most proud of since, in this, I had to understand the internals of Perceval, ELK and Mordred. It was fun! This task was created once we had worked out which fields were required to calculate the metrics pointed out in Task-2. In this task, we had to figure out how to add those fields to the enriched indices via grimoirelab-elk. Most of the GMD metrics depend on the GitHub Issues and GitHub Pull Requests data that is fetched by Perceval. We had to figure out how to separate this data for each of the data sources and how to extract the required fields from that data.

Jesus, Valerio and I concluded that we should create a study that generates the additional fields we wanted in the pull-requests-only index.

Related issues:

Related Pull Requests:

Task 6 [COMPLETED]:

Related Blog Posts: Week-8

This task was to find out the full potential of the Altair Library. This notebook does some justice to the Altair library. I was able to create interactive HTML pages using Altair. The idea is to integrate these visualisations by adding the HTML pages and the Notebooks to the reports generated for the GMD metrics.

Related Notebooks:

Related HTML files:

Task 7 [COMPLETED]:

Related Blog Posts: Week-8, Week-9, Week-10, Week-11, Week-12, Week-13, Week-14.

A

This task was to play with static visualisations that can be created for the CHAOSS metrics that were being calculated using manuscripts. I created a PDF as a demonstration of what can be visualised.

Related Notebooks:

B

The second part of Task 7 was to start on the final PDF that can be generated using the manuscripts2 module and the visualisations that I was able to create in Task 7A. For this task, I was stuck for a while. Jesus helped me out big time by suggesting the current structure of the functions to calculate the metrics. You can find the related PR and commits here.

In this task, I struggled a lot with testing. Most of the complex tests had to be written for the functions created in this task.

Related issues:

Related Pull Requests:

Together, the above PRs add the functionality to create the different sections of the reports (activity, community, overview, process) for the git, github_prs and github_issues data sources. These PRs also add tests for the data sources as well as for the PDF reports being generated.

Task 8 [ONGOING]:

In this task, I had to figure out a way to calculate the number of code review iterations on a pull request created by the submitter. Since it covers only one metric (num review iterations), it was a low priority task and has been set aside for now, but I will be working on it in the coming weeks (after GSoC).

Miscellaneous

Pull Requests:

Issues:

Behind the scenes

The beautiful code that you see in manuscripts required a fair bit of experimentation. I was not able to capture the experimentation process completely, but if you wish to see some hacks and ugly code, you can find it in the gsoc-manuscripts repository.


Experience

I think this is the most that I’ve grown in a short period of time. Time flew by fast and I got to learn from great people and got to be a part of a wonderful and supportive community. I won’t lie, it has been difficult. When this started, I knew nothing about linters or how branches in Git work. I’ve struggled my fair share but in the end, it has been worth it.

Working for CHAOSS also led me to submit a proposal to PyCon India 2018. I will be talking (if my proposal gets selected) about what CHAOSS does, the tools that are used to analyse communities and how you can participate.

I will continue to contribute to CHAOSS and also look for other open source communities which need help.

Chao!