We are there.

August 11, 2018

Welcome to the final week of Google Summer of Code-18!

This is the last in the series of blog posts that I’ve been writing about my progress during GSoC-18.

I am grateful to my mentors, Jesus M. Gonzalez-Barahona and Valerio Cosentino, for guiding and helping me throughout the summer.

This blog post also serves as the final report of the work that I’ve done during GSoC. The second part of this post is divided into the tasks that I performed and the issues, PRs and files that were created to complete those tasks.

For those of you who are reading this blog for the first time: Bold text is often a link to something important, so please click on it :)

Summary

Show me what you got!

The main functionality that I’ve added to the Manuscripts project is the manuscripts2 module. It is an iteration on manuscripts. Manuscripts2 adds the functionality to query Elasticsearch using chainable functions and classes. These functions allow the user to focus on calculating the metrics rather than on how to fetch the data. To learn more about manuscripts2, please read the README.md for the module.
An example of the final report generated using manuscripts2 can be found here.
Wrote tests for manuscripts and manuscripts2, which were not present earlier. These tests help us keep manuscripts working properly.
Refactored the esquery.py and metrics.py files in manuscripts so that they use elasticsearch_dsl instead of querying Elasticsearch directly. This exercise gave me a good look at the inner workings of manuscripts and at what we needed to improve.
Other than this, I researched the different ways we can analyse and visualise the GMD metrics and add them to the report. These changes are not completely included in manuscripts yet, but I’ll be working on them with Jesus in the coming months.

There are some smaller contributions that will be covered below when we go task by task.

What remains to be done?

Experimentation with Markdown and Notebooks so as to provide the user with a more hands-on approach to look at the metrics.
Extend the Manuscripts project to create reports for the GMD metrics using manuscripts2 (the module was created keeping those metrics in mind). This Notebook does a decent job of calculating the metrics, but we still need to be able to generate a PDF report for the GMD metrics.
We also plan to add interactive HTML pages to manuscripts when a report is created. These pages will contain visualizations using Altair and Plotly libraries.
Currently, manuscripts2 supports the 3 basic data sources: git, github_issues, github_prs. I need to add support for gerrit, stackexchange, mailinglists, issue tracking systems (its) and jira. But, since the whole initial infrastructure for the reports has been created, adding these data sources will be just a matter of replicating the necessary code for each of them.
Make manuscripts2 Mordred-compatible. We need to add functionality to take a config file containing project data as input and create a report using that file.
Add functionality to create Project reports. Currently, we can create reports for individual repositories only.

The good thing about these tasks is that the foundation for most of them has already been laid. We will just have to build on top of that.

Tasks Undertaken

Jesus had the brilliant idea at the start that we divide the project into different tasks that were to be completed. This helped me in keeping track of what was to be done and how things progressed each week.

Task 1 [COMPLETED]:

Related blog posts: Week-2, Week-3, Week-4, Week-5.

Task 1-A was to convert the functions in manuscripts/esquery.py file into elasticsearch_dsl based functions. These functions were using the requests library to query elasticsearch.

Task 1-B was to update the code in manuscripts/metrics/metrics.py because of the changes in esquery.py.

Related issues:

grimoirelab-manuscripts#57: Describe in brief what Task-1A was about.
grimoirelab-manuscripts#60: Describe what changes had to be made to metrics.py and the tests so that all the code used elasticsearch_dsl only.

Related Pull Requests:

grimoirelab-manuscripts#58: Make esquery.py use elasticsearch_dsl module and add tests for the functions in the file.
grimoirelab-manuscripts#63: Update metrics.py file to use the modified esquery.py file. Cleans up code.
grimoirelab-manuscripts#64: Update tests for specific functions in metrics.py.

Task 2 [PARTIALLY COMPLETE]:

Related blog posts: Week-2, Week-3, Week-4, Week-5, Week-7

The major theme of this task was to figure out which of the GMD metrics were missing, had unclear definitions, or needed improvement, and to try to make them more structured. The main objective was to get the community talking about the missing definitions of the metrics. I opened issues for the metrics that were unclear to me, and they led to some quite interesting discussions on how these metrics should be defined further.

Task-5 was created to submit pull requests for most of the issues related to this task.

Related Notebooks:

GMD-metrics-from-scratch.ipynb

Related issues:

wg-gmd#1: Better and complete descriptions of the metrics under GMD. This issue describes in brief the problem of weak and unclear definitions of metrics.
wg-gmd#5: GMD: issue resolution efficiency - What are the abandoned issues? Possible PR: wg-gmd#12.
wg-gmd#6: How to calculate open issue age. The conclusion from the discussion on this issue was to calculate the open issue age in the Notebook mentioned above.
wg-gmd#7: How to calculate closed issue resolution duration.
wg-gmd#8: First response to issue duration. Closed by grimoirelab-elk#383.
wg-gmd#9: New contributors metrics. Closed by wg-gmd#13.
wg-gmd#10: How to calculate new contributing organizations?
wg-gmd#11: What are sub-projects?
wg-gmd#14: Who can be considered a maintainer of a project/repo?
grimoirelab-elk#364: General discussion about which metrics need more attention. Closed by grimoirelab-elk#399 and grimoirelab-elk#401.
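
Two of the metrics discussed in the issues above, open issue age (wg-gmd#6) and first response to issue duration (wg-gmd#8), come down to date arithmetic once the right timestamps are available. The sketch below only illustrates that arithmetic. The record layout and the names `created_at`, `closed_at` and `first_response_at` are assumptions made for this example, not the fields produced by grimoirelab-elk, and the authoritative definitions are the ones settled in the linked discussions.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def open_issue_age_days(issue, now):
    """Days an issue has been open; None if it is already closed."""
    if issue.get("closed_at") is not None:
        return None
    return (now - issue["created_at"]).total_seconds() / SECONDS_PER_DAY


def first_response_days(issue):
    """Days from creation to the first response; None if nobody responded."""
    response = issue.get("first_response_at")
    if response is None:
        return None
    return (response - issue["created_at"]).total_seconds() / SECONDS_PER_DAY


utc = timezone.utc
# Hypothetical issue records, invented for this example.
issues = [
    {"created_at": datetime(2018, 8, 1, tzinfo=utc),
     "closed_at": None,
     "first_response_at": datetime(2018, 8, 2, 12, tzinfo=utc)},
    {"created_at": datetime(2018, 7, 1, tzinfo=utc),
     "closed_at": datetime(2018, 7, 5, tzinfo=utc),
     "first_response_at": None},
]
now = datetime(2018, 8, 11, tzinfo=utc)

ages = [open_issue_age_days(issue, now) for issue in issues]
responses = [first_response_days(issue) for issue in issues]
print(ages)       # [10.0, None]
print(responses)  # [1.5, None]
```

Returning `None` for "not applicable" keeps closed issues and unanswered issues out of any average computed later, instead of counting them as zero.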

Related Pull Requests:

grimoirelab-elk#419: Add the enrich_pull_requests studies to grimoirelab-elk.
grimoirelab-elk#383: Add the time to first response field to the issues being enriched in grimoirelab-elk.
wg-gmd#12: Adds detailed definition of Issue Resolution Efficiency metric.
wg-gmd#13: Refine metrics about new contributors.

Task 3 [COMPLETED]:

Related blog posts: Week-2, Week-3, Week-4, Week-5, Week-7, Week-8, Week-9

This task was the basis of the manuscripts2 module. In this task, I had to experiment with creating chainable functions able to calculate the metrics. I started off by calculating the GMD metrics in notebooks, which are linked below. These metrics were calculated using the elasticsearch_dsl module. Then the common parts were converted into functions and, with further iterations, we were able to create the manuscripts2 module.
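
To make the idea of chainable functions concrete, here is a minimal sketch in plain Python. It is not the manuscripts2 API: the `QuerySketch` class, its method names and the default field name are assumptions made for illustration. Each method records one piece of the request and returns `self`, so calls can be strung together, and `to_dict()` emits a request body in the Elasticsearch query DSL.

```python
import json


class QuerySketch:
    """Hypothetical chainable query builder (not the real manuscripts2 classes)."""

    def __init__(self, index):
        self.index = index
        self._filters = []  # filter clauses collected so far
        self._aggs = {}     # named aggregations

    def where(self, field, value):
        # Exact-match filter on a single field.
        self._filters.append({"term": {field: value}})
        return self

    def since(self, start, field="grimoire_creation_date"):
        # Keep only items dated on or after `start`.
        self._filters.append({"range": {field: {"gte": start}}})
        return self

    def by_period(self, field="grimoire_creation_date", period="month"):
        # Bucket the matching items per period. Recent Elasticsearch
        # versions spell this key "calendar_interval" instead of "interval".
        self._aggs["timeseries"] = {
            "date_histogram": {"field": field, "interval": period}
        }
        return self

    def to_dict(self):
        # size=0: only the aggregations are wanted, not the documents.
        body = {"size": 0, "query": {"bool": {"filter": list(self._filters)}}}
        if self._aggs:
            body["aggs"] = dict(self._aggs)
        return body


body = (
    QuerySketch("github_issues")
    .where("state", "closed")
    .since("2018-01-01", field="closed_at")
    .by_period(field="closed_at")
    .to_dict()
)
print(json.dumps(body, indent=2))
```

The point of the design is that the caller describes the metric ("closed issues since January, per month") and never writes the nested query dictionary by hand.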

Related Notebooks:

Related issues:

grimoirelab-manuscripts#59: Issue for discussion related to the new functions that had to be created.
grimoirelab-manuscripts#62: Issue for discussing how the chainable functions should be created and what classes were needed.

Related Pull Requests:

grimoirelab-manuscripts#67: Add basic classes and chainable methods to generate a query.
grimoirelab-manuscripts#71: Add tests for these chainable methods.
grimoirelab-manuscripts#73: Structure the classes for the sections and metrics included in the report.
grimoirelab-manuscripts#74: Include the independent functions get_aggs and get_timeseries in the Query method.

Task 4 [COMPLETED]:

Related blog posts: Week-4, Week-5, Week-7

This, I think, was the most interesting of all tasks. I had to look at different visualization modules (Altair, Plotly, Seaborn) and figure out how we could use their power to take a better look at the metrics. The idea was to find visualisations that compare the different metrics and contrast certain fields, which would lead to some interesting analysis.

Related Notebooks:

Task 5 [COMPLETED]:

Related blog posts: Week-5, Week-7, Week-8, Week-9, Week-10, Week-11, Week-12, Week-13.

This is the task I am most proud of since, in this, I had to understand the internals of Perceval, ELK and Mordred. It was fun! This task was created once we had worked out which fields were required to calculate the metrics pointed out in Task-2. In this task, we had to figure out how to add the fields to the enriched indices via grimoirelab-elk. Most of the GMD metrics depend on the GitHub Issues and GitHub Pull Requests data that is fetched by Perceval. We had to figure out how to separate this data for each of the data sources and how to extract the required fields from that data.

Jesus, Valerio and I decided to create a study that generates the additional fields we wanted in the pull-requests-only index.

Related issues:

grimoirelab-elk#394: Get pull requests only data and calculate the respective metrics from that data. Closed by grimoirelab-elk#399 and grimoirelab-elk#401.
grimoirelab-elk#405: Get all the comments for pull requests.

Related Pull Requests:

grimoirelab-elk#383: Add code to ELK to calculate the duration for first response to issue.
grimoirelab-elk#399: Add the functionality to enrich the data from the issues and pull requests categories separately (by Valerio).
grimoirelab-elk#398: An attempt to add the functionality above; it was not merged.
grimoirelab-elk#401: Add fields to the pull-requests-only index. These fields could be calculated from the raw data that was in the pull requests only index.
grimoirelab-elk#419: Add the enrich_pull_requests study to the grimoirelab-elk repository.
grimoirelab-sirmordred#90: Add the functionality of micro-mordred to enrich the indices and run studies. This PR helped me write the code to calculate the remaining fields for pull-requests-only index. It is yet to be merged into the mordred codebase (by Valerio).
grimoirelab-sirmordred#191: Add enrich_pull_requests as one of the studies that could be run using mordred.
grimoirelab-sirmordred#194: Update config.py for the same.
grimoirelab-sirmordred#193: Update tests for the same.

Task 6 [COMPLETED]:

Related Blog Posts: Week-8

This task was to find out the full potential of the Altair Library. This notebook does some justice to the Altair library. I was able to create interactive HTML pages using Altair. The idea is to integrate these visualisations by adding the HTML pages and the Notebooks to the reports generated for the GMD metrics.

Related Notebooks:

Related HTML files

Task 7 [COMPLETED]:

Related Blog Posts: Week-8, Week-9, Week-10, Week-11, Week-12, Week-13, Week-14.

A

This task was to play with static visualisations that can be created for the CHAOSS metrics that were being calculated using manuscripts. I created a PDF as a demonstration of what can be visualised.

Related Notebooks:

Static Visualisations

B

The second part of Task 7 was to start on the final PDF that can be generated using the manuscripts2 module and the visualisations that I was able to create in Task 7A. I was stuck on this task for a while. Jesus helped me out big time by suggesting the current structure of the functions to calculate the metrics. You can find the related PR and commits here.

In this task, I struggled a lot with testing. Most of the complex testing had to be done for the functions created in this task.

Related issues:

grimoirelab-manuscripts#81: Configure data for running the tests. Closed by grimoirelab-manuscripts#82 and grimoirelab-manuscripts#84.

Related Pull Requests:

grimoirelab-manuscripts#73: Add the structure to the data source files that were to be used to calculate the sections of the report generated.
grimoirelab-manuscripts#79: Update tests because of change in the Perceval repository. This was while the data for the tests was dynamically fetched from the online perceval report.
grimoirelab-manuscripts#80: Add the overview section of the report for the git data source.
grimoirelab-manuscripts#82: Add the functionality to use frozen data from the enriched indices so that the different data sources could be tested properly. This PR changed the way testing had been done for Manuscripts until then. (By Valerio)
grimoirelab-manuscripts#84: Update the test base class proposed in the above PR by converting the setUp functions into setUpClass functions to save time.
grimoirelab-manuscripts#85: Update test data for git data source.
grimoirelab-manuscripts#86: Update test_git.py to use test base class proposed in grimoirelab-manuscripts#82.
grimoirelab-manuscripts#88: Add overview section of the report for the github_issues data source.
grimoirelab-manuscripts#90: Add overview section of the report for the github_prs data source.
grimoirelab-manuscripts#91: Clean and Rearrange the test data into specific folders.
grimoirelab-manuscripts#92: Add project activity section of the report.
grimoirelab-manuscripts#93: Add project activity section of the report for the github_issues data source.
grimoirelab-manuscripts#94: Add project activity section of the report for the github_prs data source.
grimoirelab-manuscripts#95: Add project activity section of the report for the git data source.
grimoirelab-manuscripts#97: Add get_list method to Query class.
grimoirelab-manuscripts#99: Add project community section to the report and the corresponding metrics to the git data source.
grimoirelab-manuscripts#100: Add project community section to the report.
grimoirelab-manuscripts#101: Change field for timeseries aggregations in ClosedIssues metrics class. ClosedIssues date-histogram aggregations should be made on the closed_at field.
grimoirelab-manuscripts#102: Add functions to calculate the project process section of the report for github_issues data source.
grimoirelab-manuscripts#103: Add functions to calculate the project process section of the report for github_prs data source.
grimoirelab-manuscripts#104: Add the project process section of the report.
grimoirelab-manuscripts#106: Add the latex template of the report and add functions to generate the actual PDF report.
grimoirelab-manuscripts#107: Separate github_issues and github_prs latex template data for Overview section of the report.
grimoirelab-manuscripts#108: Configure bin/manuscripts2 script to take the required arguments for the report as input.
grimoirelab-manuscripts#109: Update README.md file in manuscripts2 with the instructions of how to generate the report and how to use manuscripts2 properly.
grimoirelab-manuscripts#110: The report also had to index the data for each of the data sources. This PR sets up module-level test fixtures which index the required data for all the data sources once, at the start of the test run. This decreased the amount of time it took to run the tests.
grimoirelab-manuscripts#113: Add basic tests for the functions generating the report.

All the above PRs combined add the functionalities to create the different sections of the reports (activity, community, overview, process) for the git, github_prs and github_issues data sources. These PRs also add tests for the data sources as well as the PDF reports being generated.

Task 8 [ONGOING]:

In this task I had to figure out a way to calculate the code review iterations on a pull request created by the submitter. Since it was a low-priority task covering a single metric (the number of review iterations), it has been set aside for now, but I will be working on it in the coming weeks (after GSoC).

Miscellaneous

Pull Requests:

grimoirelab-manuscripts#115: Add the Notebooks containing GMD analysis (Visualisations and ways to calculate the GMD metrics) to Manuscripts Project Repository.
grimoirelab-manuscripts#55: Add the functionality to show the help for manuscripts when no parameter is passed. This closes grimoirelab-manuscripts#55.
grimoirelab-manuscripts#69: Add the functionality to set the default start date for manuscripts to the minimum date in all the data sources given. This closes grimoirelab-manuscripts#48.

Issues:

grimoirelab-manuscripts#49: Show help for manuscripts.
grimoirelab-manuscripts#48: Change the default start date of the report generated if no start date is given.
grimoirelab-manuscripts#56: Restructure the GitHub issues class in manuscripts.

Behind the scenes

The beautiful code that you see in manuscripts required a fair bit of experimentation. I was not able to capture the experimentation process completely, but if you wish to see some hacks and ugly code, you can find it in the gsoc-manuscripts repository.

Experience

I think this is the most that I’ve grown in a short period of time. Time flew by fast and I got to learn from great people and got to be a part of a wonderful and supportive community. I won’t lie, it has been difficult. When this started, I knew nothing about linters or how branches in Git work. I’ve struggled my fair share but in the end, it has been worth it.

Working for CHAOSS also led me to submit a proposal to PyCon India 2018. I will be talking (if my proposal gets selected) about what CHAOSS does, the tools that are used to analyse communities and how you can participate.

I will continue to contribute to CHAOSS and also look for other open source communities which need help.

Chao!