I see light at the end of this tunnel

August 3, 2018

This is the third-to-last week. I am running on Red Bull, coffee, and the occasional meal in between. This is so exciting!

Hello everyone. Welcome to Google Summer of Code-18: Week-12’s blog. As I write this, only 12 days remain. Here is where we stand as of now:

In Brief

When you last read this blog, 3 tasks remained.

Task-5 [Progressing..still!]:

Task: To check which metrics still need the relevant fields to be generated so that those metrics can be calculated.

Progress: The difficulty in this task was that some of the relevant fields we needed from the github_pull_requests index were not available to us. To read about why, please head over to Week-10’s blog and read Task-5 under the In Detail section.

So, during week-10, Jesus, Valerio, and I decided that we would need to use studies to calculate the remaining metrics. I hadn’t been able to grasp the full concept of studies until now (there are still some rough edges, but I’ll make it work). So, in short, we are on track and I am going to use studies to calculate the remaining fields. More on this below.

Relevant PRs: grimoirelab-sirmordred/pull/190, by Valerio.

Task-7-B [On Track]:

Task: To create reports using the functions and classes in manuscripts2

Progress: I am on track with this task. We should have a working version of reports using manuscripts2 by next week. (In case you are wondering, I am working on the tests too which are taking up majority of the time.)

Relevant PRs:

Task-8 [Low priority]:

To be honest, I’ve completely ignored this task this week. It involves calculating just one metric, code review iterations, so I’ll deal with it later.

Let’s go a bit deeper into the tunnel.


In Detail

Task-5:

So, continuing from above, we decided that we’ll use studies to get the remaining github_prs metrics. The problem was that I had no clue how studies worked, but luckily Valerio helped me out. We discussed this over email and he asked me to try this command:

p2o.py --studies-list enrich_areas_of_code --enrich --index git_raw --index-enrich git -e http://localhost:9200 --no_inc --debug git https://github.com/chaoss/grimoirelab-perceval.git

What this command should have done was run the enrich_areas_of_code study on the git index and generate extra fields related to areas of code. It should also have created a git-aoc_enriched index containing this data. But there was a problem with p2o.py, which Valerio figured out:

I have checked the code of p2o.py. The problem is that it doesn’t accept ad-hoc params for any study. Furthermore, if you try to execute multiple studies, there is no way to separate the params for different studies.

To remedy that, he is proposing a micro-mordred via a PR to grimoirelab-sirmordred, which allows us to run multiple studies and enrich the indices properly without much hassle. From the description of the PR:

This code proposes a tiny version of mordred to execute raw and/or enrich tasks on a given backend (defined in a cfg file passed as input). The micro-mordred overcomes the current limitations of the p2o script in ELK, which is not able to execute multiple studies for the input backend and requires constant changes to align its logic with the ELK one. The micro-mordred isn’t supposed to require constant changes, since there is no gap between its logic and the mordred one.

I am in the midst of using this functionality to run studies on the github pull_requests data. The study will use the github_issues raw index and the github_prs raw index to produce a github_prs enriched index containing only data related to the pull requests of a GitHub repository.
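As a rough, hypothetical illustration of why issues and pull requests can be split at all (the real study in grimoirelab-elk is far more involved, and `split_prs_from_issues` is a name I made up for this sketch): GitHub’s REST API reports pull requests as issues carrying an extra `pull_request` key, so at its core the split is a filter:

```python
def split_prs_from_issues(raw_items):
    """Separate pull requests from plain issues in raw GitHub data.

    GitHub's REST API represents pull requests as issues that carry
    an extra 'pull_request' key, so the split is a simple filter.
    """
    prs = [item for item in raw_items if "pull_request" in item]
    issues = [item for item in raw_items if "pull_request" not in item]
    return prs, issues

raw = [
    {"number": 1, "title": "A bug"},
    {"number": 2, "title": "A fix", "pull_request": {"url": "..."}},
]
prs, issues = split_prs_from_issues(raw)
```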

NOTE: Using micro-mordred will throw an error if you already have an ES index named git, because it tries to create an alias named git for the enriched index, and ES does not allow an alias to share its name with an existing index.


Task-7-B:

Moving on: first, I am sorry (Georg) that I couldn’t write an explicit report on how we have structured the reports this time. Here are some details:

We are still following a similar pattern: a separate file for each data source.

/manuscripts2
	|- /metrics
		|- git.py
		|- github_issues.py
		|- github_prs.py

These files contain the metric classes that we use to calculate the various metrics. Let’s have a look at how the github_issues.py file is structured. (I know the black and white code looks hideous, but I am working on making it beautiful. Please bear with me, kthnx.)

from manuscripts2.elasticsearch import Issues, calculate_bmi
from manuscripts2.utils import get_prev_month

Importing the Issues class, from which all the queries will be derived:

(The description will be in blue, thank you.)

class GitHubIssuesMetrics():
    """Root of all metric classes based on queries to a github
    enriched issues index.
    This class is not intended to be instantiated, but to be
    extended by child classes that will populate self.query with real
    queries.

    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    """

    def __init__(self, index, start, end):
        self.query = Issues(index)
        self.start = start
        self.end = end
        self.query.since(self.start).until(self.end)

    def timeseries(self, dataframe=False):
        """Obtain a time series from the current query."""
        return self.query.get_timeseries(dataframe=dataframe)

    def aggregations(self):
        """Obtain a single valued aggregation from the current query."""
        return self.query.get_aggs()

This class is the root of all the metrics being calculated. We define a self.query variable, an instance of the Issues class, which allows us to query all issue-related data. The aggregations() method returns a single-valued aggregation. The timeseries() method returns a time series for an aggregation (e.g. cardinality) over the given duration of time, at an interval (month, quarter, year) that is set by the Query class in the manuscripts2/report.py file for all the metrics.
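The chaining style used by Issues (each method records a filter and returns self) can be mimicked with a tiny stand-in class. ToyQuery below is my own illustrative sketch, not the real manuscripts2 API:

```python
class ToyQuery:
    """Minimal stand-in for the chainable query objects in manuscripts2."""

    def __init__(self, index):
        self.index = index
        self.filters = {}

    def since(self, start, field="grimoire_creation_date"):
        self.filters[field + "__gte"] = start
        return self  # returning self is what enables chaining

    def until(self, end, field="grimoire_creation_date"):
        self.filters[field + "__lte"] = end
        return self

# the same .since(...).until(...) shape used in the real classes
q = ToyQuery("github_issues").since("2018-01-01").until("2018-08-01")
```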

class OpenedIssues(GitHubIssuesMetrics):
    """Class for computing opened issues metrics.

    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    """

    def __init__(self, index, start, end):
        super().__init__(index, start, end)
        self.id = "opened"
        self.name = "Opened tickets"
        self.desc = "Number of opened tickets"
        self.query.get_cardinality("id").by_period()

This class represents all the metrics that have to be calculated for the issues that were opened. It inherits from GitHubIssuesMetrics, so by default it has the timeseries and aggregations methods. When we initialize the class, we set the required aggregations on the self.query variable. When we call aggregations or timeseries in the report, we get the required values, which can then be used to create the report.

Each class has 3 extra variables, namely id, name, and desc, which help us name the files and label the figures when creating CSV files and images for the reports.

class ClosedIssues(GitHubIssuesMetrics):
    """Class for computing closed issues metrics.

    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    """

    def __init__(self, index, start, end):
        super().__init__(index, start, end)
        self.id = "closed"
        self.name = "Closed tickets"
        self.desc = "Number of closed tickets"
        self.query.is_closed()\
                  .since(self.start, field="closed_at")\
                  .until(self.end, field="closed_at")
        self.query.get_cardinality("id").by_period()

Here, as you can see, we modify the self.query variable because closed issues require the range to be set on a different date field (the default is grimoire_creation_date). We can use timeseries and aggregations to calculate the metrics here too.

class DaysToCloseMedian(GitHubIssuesMetrics):
    """Class for computing the metrics related to median values
    for the number of days to close a github issue.

    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    """

    def __init__(self, index, start, end):
        super().__init__(index, start, end)
        self.id = "days_to_close_ticket_median"
        self.name = "Days to close tickets (median)"
        self.desc = "Number of days needed to close a ticket (median)"
        self.query.is_closed()
        self.query.get_percentiles("time_to_close_days")

    def aggregations(self):
        """Get the single valued aggregations for current query
        with respect to the previous time interval."""

        prev_month_start = get_prev_month(self.end, self.query.interval_)
        self.query.since(prev_month_start)
        agg = super().aggregations()
        if agg is None:
            agg = 0  # None is because NaN in ES. Let's convert to 0
        return agg

    def timeseries(self, dataframe=False):
        """Get the date histogram aggregations.
        :param dataframe: if true, return a pandas.DataFrame object
        """

        self.query.by_period()
        return super().timeseries(dataframe=dataframe)

Here, the single-valued aggregations have to be calculated for the previous interval only, so we modify the start date accordingly.
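The “previous interval” trick can be sketched in plain Python. This is my own simplified stand-in for get_prev_month, assuming a month interval (the real helper also handles other intervals):

```python
from datetime import datetime

def prev_month_start(end):
    """Return the first day of the month before `end` (month interval only)."""
    year, month = end.year, end.month - 1
    if month == 0:  # January rolls back to December of the previous year
        year, month = year - 1, 12
    return datetime(year, month, 1)

# an end date in August 2018 yields July 1st as the start of the window
start = prev_month_start(datetime(2018, 8, 3))
```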

class BMI():
    """The Backlog Management Index measures efficiency dealing with tickets.

    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    """
    def __init__(self, index, start, end):
        self.start = start
        self.end = end
        self.id = "bmi_tickets"
        self.name = "Backlog Management Index"
        self.desc = "Number of tickets closed out of the opened ones in a given interval"
        self.closed = ClosedIssues(index, start, end)
        self.opened = OpenedIssues(index, start, end)

    def aggregations(self):
        """Get the aggregation value for BMI with respect to the previous
        time interval."""

        prev_month_start = get_prev_month(self.end,
                                          self.closed.query.interval_)
        self.closed.query.since(prev_month_start,
                                field="closed_at")
        closed_agg = self.closed.aggregations()
        self.opened.query.since(prev_month_start)
        opened_agg = self.opened.aggregations()
        if opened_agg == 0:
            bmi = 1.0  # if no submitted issues/prs, bmi is at 100%
        else:
            bmi = closed_agg / opened_agg
        return bmi

    def timeseries(self, dataframe=False):
        """Get BMI as a time series."""

        closed_timeseries = self.closed.timeseries(dataframe=dataframe)
        opened_timeseries = self.opened.timeseries(dataframe=dataframe)
        return calculate_bmi(closed_timeseries, opened_timeseries)

For BMI, the case is a little different. BMI is the ratio of closed issues to opened issues, so it requires two completely different queries to generate the aggregations. Instead of inheriting from the GitHubIssuesMetrics class, we create closed and opened variables, which are instances of the classes discussed above. The methods are kept similar to those of GitHubIssuesMetrics for consistency and ease of use. Here too, the aggregations have to be calculated for the previous interval, so we set the required start date and use the returned values to calculate the metric. A similar approach is used in the timeseries method.
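Numerically, BMI is just the closed-to-opened ratio with a guard for intervals where nothing was opened; a minimal sketch:

```python
def backlog_management_index(closed, opened):
    """BMI = closed tickets / opened tickets for an interval.

    A value > 1 means the backlog shrank; < 1 means it grew.
    With nothing opened there is no backlog to manage, so report 1.0.
    """
    if opened == 0:
        return 1.0
    return closed / opened

bmi = backlog_management_index(closed=30, opened=40)  # 0.75
```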

def overview(index, start, end):
    """Compute metrics in the overview section for enriched github issues
    indexes.

    Returns a dictionary. Each key in the dictionary is the name of
    a metric, the value is the value of that metric. Value can be
    a complex object (eg, a time series).

    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    :return: dictionary with the value of the metrics
    """

    results = {
        "activity_metrics": [OpenedIssues(index, start, end),
                             ClosedIssues(index, start, end)],
        "author_metrics": [],
        "bmi_metrics": [BMI(index, start, end)],
        "time_to_close_metrics": [DaysToCloseMedian(index, start, end)],
        "projects_metrics": []
    }
    return results

Earlier, each data source had a primary class, as seen in manuscripts/metrics/github_prs.py. This class contained a method called get_section_metrics, which returned a dict whose keys represented the different sections of the report and whose values contained the classes used to actually calculate the metrics.

We are not using that method anymore. Instead, each section of the report (overview, project_activity, project_overview, and so on) will have its own function, which will be called in the report; the necessary classes will be extracted from the dict returned by that function. The aggregations and timeseries methods of these classes will then be called to generate the required metrics.
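Roughly, the report side then only needs the section function itself. Here is a sketch with a dummy metric class standing in for the real ones (all names here are illustrative, not the actual manuscripts2 API):

```python
class DummyMetric:
    """Stand-in for a metric class exposing id and aggregations()."""

    def __init__(self, id_, value):
        self.id = id_
        self._value = value

    def aggregations(self):
        return self._value

def overview_section(index, start, end):
    # Each report section gets its own function returning a dict of
    # metric instances, grouped by the subsection they belong to.
    return {
        "activity_metrics": [DummyMetric("opened", 120),
                             DummyMetric("closed", 90)],
    }

def render_section(section):
    # The report walks the dict and calls aggregations() on each metric.
    return {m.id: m.aggregations()
            for metrics in section.values() for m in metrics}

result = render_section(overview_section(None, None, None))
```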

This is the general format in which the metrics will be calculated. Next week we’ll take a look at how the report is generated. Stay tuned for more ugly looking code :P Ciao!