A week full of hiccups

May 30, 2018

Hello and welcome to the third week of Google Summer of Code. I hit a couple of speed bumps this week.

Highlights:

Task-1:

Finish cleaning up the esquery.py file, add tests and make the whole manuscripts project use the elasticsearch_dsl module.

Progress: I was able to complete writing tests for esquery.py and make the comments more detailed. The changes can be found here. This is part of the PR that I made last week.

I was also able to update the functions in metrics.py so that they, too, use only the elasticsearch_dsl module instead of querying Elasticsearch directly. The PR for that is ready, but I haven’t opened it yet because it will have to go on top of the previous PR, which is yet to be merged. This issue tracks the problem.

Task-2:

Add more metrics to the Jupyter Notebook. Implement the metrics that can be calculated directly from the enriched indices. Open issues for the metrics whose data has yet to be added to the enriched index, and start working on the PRs.

Progress: I did create some issues asking for clarification on the metric definitions: new contributing orgs, new contributors, first response to issue duration. Apart from this, I updated the Notebook and added some more metrics, such as closed issue resolution duration and new contributors. I also added tracking information about the metrics.

I was unable to open any issues about the Review metrics under Code Development, which require additional data to be added to the enriched index. For that we’ll have to update gelk to fetch that data, parse it and store it in the enriched index. I’ll be working on this in week-4, hopefully!

Task-3:

Create chainable functions and objects which can be used to segregate the metrics on the basis of users, period and organizations. Create a preliminary Python file which will eventually contain the functions and classes to calculate the metrics and do the analysis.

Progress: I was asked to implement the chainable functions for authors, organizations and periods (month, year, week or any other interval that the Elasticsearch date_histogram aggregation supports) inside the notebook. I did some initial analysis for open and closed issues in the Notebook and then implemented them under a Metric class in the new_functions.py file. I was actually only asked to create a simple design of how the classes and functions should look, but I went ahead and created the Metric class and the functions to see the practical difficulties that might arise when implementing the metrics in a similar manner to what I’ve done in the Notebook. I’ll expand on this further below.

The issue tracking this task is here.

Task-4:

Experiment with Visualizations: Plotly, Seaborn and Altair.

Progress: I wasn’t able to complete this task. I did look into using Plotly to create bar graphs, but Plotly requires user authentication to work seamlessly; otherwise it keeps throwing authentication errors, which I don’t think the users of this application will be happy with.


Details

Task-1:

So, in Task-1, I added tests for all the functions in the esquery.py file. Basically, for each function, I created the ideal dictionary (query) that should be returned by that function and tested it against the output of the function. If it passed the test, that meant the function was working as expected.

NOTE: Always try to write the test before writing the function it corresponds to, so you’ll know exactly what kind of function you need to make, what it will return and which parameters it requires.

I am mainly using assertDictEqual to compare the different dictionaries. The good thing is that all the functions are working successfully. I also updated the comments with details about what each function takes as input and what it returns. We are using elasticsearch_dsl objects in each function, and thus all the objects are either Aggregation, Query or Search objects from the elasticsearch_dsl module.
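
As an illustration of the approach, here is a minimal sketch of what one of these tests looks like. The query-building function below is a made-up stand-in, not one of the real functions; the real ones live in esquery.py and their tests in test_esquery.py.

import unittest

def get_query_match(field, value):
    # hypothetical stand-in for a query-building function in esquery.py
    return {"query": {"match": {field: value}}}

class TestEsQuery(unittest.TestCase):
    def test_get_query_match(self):
        # the "ideal" dictionary that the function should return
        expected = {"query": {"match": {"state": "closed"}}}
        self.assertDictEqual(get_query_match("state", "closed"), expected)

if __name__ == "__main__":
    unittest.main()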

The second subtask was to make the whole manuscripts project use elasticsearch_dsl. This line inside the get_agg function in esquery.py returns a JSON object. This is because the function get_metrics_data uses that JSON object to query Elasticsearch directly, which is unacceptable. It should instead use an elasticsearch_dsl Search object, on which we can call .execute() to get the results. Those results can then be converted into a dict, and the rest of the file will work as it is. For example:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import A, Q, Search

es = Elasticsearch()
s = Search(using=es, index="<index-name>")

q1 = Q("match", field=value)
q2 = Q("match", field2=value2)
q = q1 & q2  # AND the two queries together
s = s.query(q)

# aggregation type, field and any other aggregation-specific parameters
agg = A("<aggregation-type>", field=value)
s.aggs.bucket("<aggregation-name>", agg)

# execute the query:
response = s.execute()
aggregations = response.aggregations.to_dict()

And then we can just use the aggregations like a dictionary object.

The PR for the second subtask is ready but has not been made yet, because it will come on top of the first PR, as described above.

All in all, with this task, the esquery.py file now has tests, no longer queries Elasticsearch directly, and the whole manuscripts project uses the elasticsearch_dsl module.

Task-2:

In this task, I had to add more metrics to the Notebook and look at opening issues and PRs against grimoirelab-elk for the metrics that are not directly available in the enriched index. I analysed the raw index that is created when data about a GitHub repo is indexed into Elasticsearch and found that there is little to no correlation between issues and their corresponding PRs. The metrics under Community Growth depend heavily on the data available about the PRs and the reviews on those PRs.

We will have to add the code to add this data to the enriched indices. I plan on working on this in week-4.

Apart from that, I added some more metrics: new contributors, open issue age, closed issue resolution duration (with a moving average aggregation added, as asked for in this issue) and contributing organizations.
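
For reference, here is a rough sketch of how a moving average can be attached to the resolution duration calculation with elasticsearch_dsl. The index name and the closed_at / time_to_close_days fields are assumptions for the sake of the example, not necessarily the exact fields in the enriched index.

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Q, Search

es = Elasticsearch()
s = Search(using=es, index="<github-index>")
s = s.query(Q("match", state="closed"))

# bucket the closed issues per month, average their resolution time,
# and smooth that average with a moving_avg pipeline aggregation
s.aggs.bucket("closed_per_month", "date_histogram",
              field="closed_at", interval="month") \
      .metric("avg_duration", "avg", field="time_to_close_days") \
      .pipeline("duration_moving_avg", "moving_avg",
                buckets_path="avg_duration")

response = s.execute()
buckets = response.aggregations.closed_per_month.buckets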

I also refined how I was calculating the metrics and worked on creating different filters that can be applied to them. The analysis is in the Notebook.

I added a progress tracking file to see which metrics have been implemented and which ones still need a PR before they can be calculated.

Task-3:

I was able to create the chainable by_authors, by_organizations and by_period functions. These functions are part of the Metric class implemented in the new_functions.py file. Because these functions had to be chainable, I created an object and made them chainable methods of that object. The Metric class is just an initial idea and will be modified later. These functions give the metric count with respect to the authors, organizations and period of analysis. I’ve shown these functions in action in the Notebook for open and closed issues.

The new classes for the metrics will also be created in this new_functions.py file.

While creating the functions and testing them in the notebook, I came to understand the difficulties that will arise in implementing these functions. Here is a little description of what the functions do:

  • Metric class: This is the base class which will act as the parent to the other metric classes that have to be calculated (see the sketch after this list).
  • add_query: is a function which adds the given key-value pair as a query to the Search object.
  • add_inverse_query: is the same as add_query, just adds an inverse query instead of a normal one.
  • show_queries shows which query filters have been added to the Search object.
  • increment_parent: increments the aggregation_id, or parent id. When multiple aggregations have to be fetched from Elasticsearch, giving names to each of them is not possible. So, for that we use counts from 0 up to the number of aggregations applied minus 1. Each time an aggregation is applied to the Search object, this count is increased by one for the next aggregation.
  • get_results: executes the query in the Search object and returns a list of aggregations (as pandas DataFrame) that were applied to the Search object.
  • by_authors: the first chainable method which is applied to the Search object and used with get_results() to get the DataFrame for that aggregation.
  • by_organizations: similar to by_authors, returns aggregations based on the organizations.
  • by_period: gives date_histogram aggregations.
  • is_open: adds a “state”:“open” query filter to the search object.
  • is_closed: adds a “state”:“closed” query filter to the search object.
  • buckets_to_df: takes in a list of buckets and returns a DataFrame with keys equal to the datetime objects (if available) or the keys in the buckets.
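
To make the design above concrete, here is a rough sketch of how such a chainable class could look. It is only an illustration, not the actual code in new_functions.py: the field names (author_name, author_org_name, created_at), the _add_agg helper and the way get_results turns buckets into DataFrames are simplifying assumptions.

import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch_dsl import A, Q, Search

class Metric:
    """Rough sketch of a chainable metric object (illustrative only)."""

    def __init__(self, index, es=None):
        self.es = es if es else Elasticsearch()
        self.search = Search(using=self.es, index=index)
        self.parent_id = 0  # numeric ids used as aggregation names

    def add_query(self, key_val):
        # add the given key-value pair as a query filter, e.g. {"item_type": "issue"}
        self.search = self.search.query(Q("match", **key_val))
        return self

    def is_open(self):
        return self.add_query({"state": "open"})

    def is_closed(self):
        return self.add_query({"state": "closed"})

    def _add_agg(self, agg):
        # attach the aggregation under the current numeric id and bump the counter
        self.search.aggs.bucket(str(self.parent_id), agg)
        self.parent_id += 1
        return self

    def by_authors(self, field="author_name"):
        return self._add_agg(A("terms", field=field))

    def by_organizations(self, field="author_org_name"):
        return self._add_agg(A("terms", field=field))

    def by_period(self, field="created_at", period="month"):
        return self._add_agg(A("date_histogram", field=field, interval=period))

    def get_results(self):
        # execute the query and return one DataFrame per aggregation added
        response = self.search.execute()
        aggs = response.aggregations.to_dict()
        return [pd.DataFrame(aggs[str(i)]["buckets"]).set_index("key")
                for i in range(self.parent_id)]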

Example: Closed issues by organizations for perceval_github

import new_functions as nf
issues = nf.Metric(index=github_index)
issues.add_query({"item_type":"issue"}) # add a query to get the issues
issues.is_closed() # add a query to make state:closed

print(issues.by_organizations().get_results()[0])

OUTPUT:

                    doc_count   value
key                                  
others                      44     44
Bitergia                    39     39
@Bitergia                   22     22
GNUmedia                     2      2
Geeky Engineer               2      2
@amrita-university           1      1
CMU                          1      1
Samsung                      1      1
T-Systems Iberia             1      1

A lot of improvement is still required in these functions. Some of the difficulties that I’ll face while implementing the classes:

  • Different aggregations have different bucket names and value types which have to be parsed and converted into DataFrames.
  • The chainable functions need to work on all the metrics, but different metrics will have different field names.
  • Not all functions can be applied to all the metrics, so errors need to fail gracefully.
  • Multiple aggregations with multiple child aggregations? We need a method of keeping track of the nested aggregation names.
  • How should results be represented?
  • How do I know which bucket is for which result if we are going to use numbers for aggregation names?

Task-4:

I wasn’t able to test any of the visualization tools I wanted to test.

I managed to look into how Plotly works and found that it requires users to authenticate themselves against the Plotly server before they can create visualizations. I plan on working on this task this week.


I actually wasn’t able to do much this week because of a family emergency, but I plan on completing most of the remaining tasks from this week in week-4.

Till then, adios!


Tasks for week-4:

  • Task-1: Make a PR splitting the tests into smaller, more manageable tests in test_esquery.py.
  • Task-2: Add issues for the metrics that are missing from the notebook and for which code needs to be added to grimoirelab-elk. Create a first version of the Notebook with all the metrics that are present in the enriched index.
  • Task-3: Use the new functions and the Metric class implemented in the new_functions.py file to calculate the current metrics that manuscripts produces. This will act as a testing ground for the new functions and classes and will help define the structure of these classes further.
  • Task-4: Work on visualisations using Plotly, Seaborn and Altair. Create a notebook for each of them showing the visualisations.
  • Task-5: Add code in grimoirelab-elk for the metrics which are not readily available.