Come with me.. and you'll be.. in a world of visualizations, tests and documentation
June 13, 2018
NOTE: Bold text, apart from headings, links to files/images/notebooks. Kindly click on the links for better explanations.
This week (WEEK 5) in Manuscripts: A quest for better reporting of your projects, we will look at some visualizations depicting the various Metrics. We will also see a lot of tests and some documentation describing the new_functions created.
In-Brief
Introduction to the week. We’ll look at some tasks and their status.
Task 1 [Completed]:
This task was to clean up the esquery.py file and use elasticsearch_dsl to get aggregations and query elasticsearch. This task has dragged on for too long, but it is finally finished. All the tests pass. The last PR that I added for this task splits some tests into smaller, more manageable tests.
Something to always remember from this task: If you do:
test_dict = OrderedDict({"a": 1, "b": 2, "c": 3, "d": 4})
This might still not give you an ordered dict in the form that you want. On Python versions where plain dicts do not guarantee insertion order (before 3.7), the dictionary that you’ve passed to OrderedDict is built first and can reorder itself, so you’ll get a different order from the one you hoped for. Do this instead:
test_dict = OrderedDict()
test_dict['a'] = 1
test_dict['b'] = 2
test_dict['c'] = 3
test_dict['d'] = 4
And you’ll surely get what you want.
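Another option, not part of the original PR, is to pass a list of (key, value) pairs. A list is always ordered, so OrderedDict sees the items in exactly the order they were written:

```python
from collections import OrderedDict

# A list of pairs keeps its order, unlike a dict literal on older Pythons.
test_dict = OrderedDict([("a", 1), ("b", 2), ("c", 3), ("d", 4)])

print(list(test_dict.keys()))
```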
Task 2 [Ongoing]:
Originally this task was about trying to calculate the GMD metrics and creating Issues for the ones which have not yet been defined properly. I created this Notebook tracking all the metrics that can be calculated currently and calculated them using the new_functions that you will read about below.
The discussion on the definitions of some metrics is still ongoing. The committee is coming to conclusions on how the Metrics should be calculated, and Jesus (my mentor) is creating PRs to add the final definitions to the Metrics under discussion. The issues related to this task are as follows:
- wg-gmd#10.
- wg-gmd#9. Jesus will be creating a PR adding the definitions which can then be finalised.
- wg-gmd#8.
- wg-gmd#7. I was able to calculate this; you can find this metric in the notebook.
- wg-gmd#6. This too can be found in the notebook.
- wg-gmd#5. After a very fruitful discussion, we came to the conclusion that this can be best described by the formula: issues_closed / (issues_opened + issues_backlog). Jesus created a PR related to this issue, which is under review.
That is the progress of this task, for now.
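As a rough illustration of the wg-gmd#5 formula, here is a minimal sketch in plain Python. The function name closure_ratio and the sample numbers are made up for this example, and returning None for an empty denominator is my own assumption, not something the committee has decided:

```python
def closure_ratio(issues_closed, issues_opened, issues_backlog):
    """Hypothetical helper: issues_closed / (issues_opened + issues_backlog)."""
    denominator = issues_opened + issues_backlog
    if denominator == 0:
        # Assumption: with no opened or backlog issues the ratio is undefined.
        return None
    return issues_closed / denominator


# Made-up numbers, only to show the call.
ratio = closure_ratio(issues_closed=30, issues_opened=20, issues_backlog=40)
print(ratio)
```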
Task 3 [Complete: Version-1]:
This task was to complete the first acceptable version of the new functions and classes that can be used to calculate the GMD and other Metrics. We will be expanding on this below.
Related Issues: issue#59, issue#62
PRs Created: PR#67
Task 4 [Ongoing: More research required]:
This task was the dragon that I wanted to slay for a very long time. I love visualisations and for this task I had to look into Plotly (which is awesome, btw), Seaborn and Altair.
I created this notebook for Plotly; please have a look. I expand on this task below.
Task 5 [Ongoing: Studying the code to create PRs]:
For this task, I have to create PRs for the Metrics which are not present in the enriched indices, for manuscripts. I am still in the middle of reading the grimoire-elk code and understanding the workflow. I have to create PRs for the Metrics whose definitions are pretty clear and simple.
Now, let’s dive in!!
In-Detail
Task 3:
Readme: to understand the classes better.
In Task 3, I had to create a PR to Manuscripts: adding the new_functions into it. I had a lot of testing and reordering to do for the new functions.
The idea behind these functions is to implement all the functionality that the user will require and have the user calculate the metrics in a minimalistic manner. We have focused on chainability of methods and objects which makes the code look beautiful too! Link to file
The basic class is EQCC. It is an acronym for Elasticsearch Query Connect and Compute because this class provides the Querying, Connection and Computation of aggregations from the required elasticsearch Index.
This class takes in an Index object which contains the details of the Index to be queried.
Analogy:
Think of a Search object (which fetches the required data from elasticsearch) as a stack of aggregations. This stack of aggregations has some general properties or filters which can be added around the stack as Query objects. Inside the stack, we have aggregations which have to be fetched from the index in elasticsearch. We can add the aggregations one by one and nest them too, by popping the last added aggregation, placing it as a child aggregation under a covering (parent) aggregation and putting this parent aggregation back into the stack itself. We can then repeat this process so that we get a doubly nested aggregation. That is the general idea behind this class.
The aggregations variable, an OrderedDict, inside an EQCC object is what keeps track of the aggregations that are added.
The queries variable, a dict, contains the filters (properties of the stack) which have to be applied when querying elasticsearch.
Different aggregations such as get_terms, get_sum, get_percentiles can be added to the stack to be calculated.
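To make the stack analogy a bit more concrete, here is a toy sketch in plain Python. It is not the actual EQCC code: the class name AggregationStack, its methods and the field names are all made up for illustration, and it only builds a dict shaped like an Elasticsearch request body instead of talking to a server.

```python
from collections import OrderedDict


class AggregationStack:
    """Toy model of the stack analogy. NOT the real EQCC class."""

    def __init__(self):
        self.queries = {}                  # filters wrapped around the stack
        self.aggregations = OrderedDict()  # aggregations, in insertion order

    def add_query(self, field, value):
        self.queries[field] = value
        return self  # returning self keeps the calls chainable

    def add(self, name, agg_type, field):
        # Push a new aggregation on top of the stack.
        self.aggregations[name] = {agg_type: {"field": field}}
        return self

    def nest_under(self, name, agg_type, field):
        # Pop the most recently added aggregation ...
        child_name, child = self.aggregations.popitem(last=True)
        # ... and push it back as a child of the new parent aggregation.
        self.aggregations[name] = {
            agg_type: {"field": field},
            "aggs": {child_name: child},
        }
        return self

    def to_dict(self):
        filters = [{"term": {f: v}} for f, v in self.queries.items()]
        return {
            "query": {"bool": {"filter": filters}},
            "aggs": dict(self.aggregations),
        }


# Hypothetical field names; nesting twice gives a doubly nested aggregation.
body = (
    AggregationStack()
    .add_query("pull_request", "false")
    .add("total_lines", "sum", "lines_changed")
    .nest_under("by_author", "terms", "author_name")
    .nest_under("by_repo", "terms", "repository")
    .to_dict()
)
print(body)
```

Every method returns self, which is what allows the chained style shown in the usage part.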
I’ve explained the working of these functions and the EQCC object in the README.md. Please have a look at it for more details. This Readme covers most of the technical functionality that the classes provide.
PullRequests and Issues are the subclasses that have been created using the EQCC class as their base. These classes currently differ only in their fixed queries: {"pull_request": "true"} and {"pull_request": "false"}. We will be adding more class-specific functions to them once the definitions for more Metrics are made clear.
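The subclass idea can be sketched like this. MetricBase is a hypothetical stand-in for EQCC, and the fixed_queries attribute is my own naming for this example, so the real classes may be organised differently:

```python
class MetricBase:
    """Hypothetical stand-in for the EQCC base class."""

    fixed_queries = {}

    def __init__(self):
        # Copy, so instances never share or mutate the class-level dict.
        self.queries = dict(self.fixed_queries)


class PullRequests(MetricBase):
    fixed_queries = {"pull_request": "true"}


class Issues(MetricBase):
    fixed_queries = {"pull_request": "false"}


print(PullRequests().queries)
print(Issues().queries)
```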
Moving on to the tests
The previous version of new_functions was sloppy, as it was still in its initial stage, but I worked on the functions and tested them thoroughly to find edge cases and loopholes. Tests have been added for the basic functions, and I will be adding more tests for the complex functions built from these basic functions. The complex functions will have to be tested using specific data from git or github data sources.
Apart from this, some implementations have been added into the PR, in the form of this notebook. In it, I use the new_functions to calculate the metrics that Manuscripts currently produces.
And that is Task 3: Version-1. Once this PR is reviewed and accepted, we can start integrating the code into manuscripts and change the report.py file to use these functions to create the reports.
Task 4:
Okay, here is the deal with visualisations. They need data, and not all data can be visualised using all the design patterns. I was successful in creating some diagrams using Plotly but not so lucky with Seaborn and Altair.
Seaborn and Altair are statistical visualisation libraries which depend on the variation of data and different categories that the values are divided into.
The problem that I faced when trying to use Seaborn and Altair is that the data generated by the currently available Metrics is very simple, follows the same pattern and is often single-valued. For example: the opened and closed issues, the number of commits created by authors or by period, the total number of lines changed by authors, the trend of open and closed PRs and Issues, and so on. We can repeat the same methods, such as bar graphs and point plots, to plot that data, but that is of less use and will suck the fun out of the whole visualisation process. So I think we need to add more variance to the data generated by the Metrics, convert it where needed, and simultaneously look for more patterns by which the data can be plotted.
Using Plotly, I was able to create:
- Scatter plots showing the distribution of Open, Closed and Opened Issues per month.
- Number of Closed and still Open issues segregated by authors.
- Bar and line graphs showing the open issue age for each issue still open.
- Moving average per week for time_to_close_days of the closed issues.
- Also another scatter plot showing the time_to_close_days over all the issues created.
- A scatter plot with a sliding window showing the commit distribution per day from start to current commit.
- A pie chart showing the number of contributors and the lines changed/added/removed by them, showing the impact made by each contributor on the project.
- And finally a comparison of the number of lines added vs lines removed per author at the end of the notebook.
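For the moving average in the list above, the smoothing step can be done before handing the values to any plotting library. Below is a minimal sketch in plain Python. The function name moving_average and the sample values are made up, and the notebook itself may compute the average differently:

```python
from collections import deque


def moving_average(values, window):
    """Simple moving average over the last `window` points of `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    recent = deque(maxlen=window)  # holds at most the `window` latest values
    running_sum = 0.0
    for value in values:
        if len(recent) == window:
            running_sum -= recent[0]  # this value is about to be evicted
        recent.append(value)
        running_sum += value
        # Early points average over fewer than `window` values.
        averages.append(running_sum / len(recent))
    return averages


# Made-up time_to_close_days values, ordered by closing date.
time_to_close_days = [2, 4, 6, 8]
smoothed = moving_average(time_to_close_days, window=2)
print(smoothed)
```

The smoothed list has the same length as the input, so it can be plotted against the same dates as the raw values.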
This week, I will be working towards creating variation in the current data and looking at more types of patterns that we can plot the data in.
Tasks for week-6
- Task-1: Jesus will be merging the last PR finishing this task.
- Task-2: Discuss more about the metrics and advance on what can be advanced.
- Task-3: Improve on the comments for the new functions and have the first implementation ready by this week.
- Task-4: Work on creating more interesting visualisations using Altair, if possible.
- Task-5: Create PRs in grimoirelab-elk/perceval for the missing metrics in the enriched index.
- Task-6: Create visualisations using Plotly for GMD metrics.
And that’s all folks!!
Adios.