All hands on deck!

July 18, 2018

I am very happy to inform y’all that I’ve passed my second evaluation. Jesus and Valerio helped me a ton in this phase so I’d like to thank them. I’d also like to thank all of the CHAOSS community for their active participation in defining and discussing the details of the remaining metrics.

Aaaaaaaaaand we are in the final phase of GSoC. This is now or never people!! Welcome to Google Summer of Code-18 Week-10.

In Brief

Task-5 [Almost Done]:

For the remainder of this task, we are trying to figure out how to calculate the metrics that depend on Pull Request comment data. These are the metrics under Code-Development in GMD. More on this below.

Task-7.B [Under Progress]:

This task is to create the CHAOSS reports using the newly created functions. I’ve been working on improving the tests and trying to figure out how to create the different classes. I created a test PR with an initial method to generate the CSV files. Right now I have finished the overview section of the Reports and I am writing tests for it. I’ll make a PR with these changes once I am done. More on this below.

Task-8 [Research Phase]:

This is a new task in which we have to figure out a way to calculate the code review iteration on the Pull Request created by the submitter. Right now, I have no clue how to calculate it.

Extras:

A while back, Jesus commented on one of the PRs suggesting that we make the get_aggs and get_timeseries functions a part of the Query class. This is the right approach and it makes calculating the metrics easy. I’ve created a PR for the same.
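
To illustrate why that helps, here is a toy, self-contained sketch (not the real manuscripts2 code) of a Query class with get_aggs and get_timeseries as methods; since every setter returns self, a metric becomes one readable chain of calls.

class Query:
    """Toy stand-in for the manuscripts2 Query class."""

    def __init__(self, index):
        self.index = index
        self.params = {"filters": [], "aggs": None, "period": None}

    def since(self, start):
        self.params["filters"].append(("since", start))
        return self          # returning self is what enables chaining

    def until(self, end):
        self.params["filters"].append(("until", end))
        return self

    def get_cardinality(self, field):
        self.params["aggs"] = ("cardinality", field)
        return self

    def by_period(self, field=None):
        self.params["period"] = field or "default_date_field"
        return self

    def get_aggs(self):
        # the real method runs the elasticsearch query and returns the aggregation value
        return self.params

    def get_timeseries(self, dataframe=False):
        # the real method returns the result as a time series (optionally a DataFrame)
        return self.params


# with the methods on Query, a metric is one readable chain:
closed = Query("github_issues").since("2018-01-01").until("2018-07-18")
print(closed.get_cardinality("id").by_period(field="closed_at").get_timeseries())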

I also revamped the tests to use actual data from elasticsearch in testing all the functions. The PR is yet to be merged.

In Detail

Task-5

Pull Request comments data:

Here are the facts: when we get the data from Perceval without specifying a category (issues, by default), the data related to both Issues and Pull Requests is fetched. All the items (issues and pull requests) are treated the same, and thus the data specific to Pull Requests (review_comments, merged_by, reviewers and so on) is not fetched. To tackle this problem, we added the functionality to calculate enriched data specifically for Pull Requests (by specifying category=pull_requests in Perceval when fetching the data). The problem here is that when we fetch data for pull_requests, Perceval does not fetch the comments on the pull requests: it fetches the review comments and the number of comments, but not comments_data.
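
For reference, here is a rough sketch of fetching both categories with Perceval’s GitHub backend; the exact category string and the token argument are assumptions from memory and may differ between Perceval versions.

from perceval.backends.core.github import GitHub

# the token is consumed by the GitHub API requests; newer Perceval versions
# expect a list of tokens instead of a single string
repo = GitHub(owner="chaoss", repository="grimoirelab-perceval",
              api_token="<github-token>")

# default category (issues): pull requests come back as plain issues, without
# review_comments, merged_by, reviewers, ...
for item in repo.fetch():
    print(item["data"]["number"])

# pull request category: PR-specific fields, but no comments_data
for item in repo.fetch(category="pull_request"):
    print(item["data"]["number"])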

As Valerio pointed out in the discussion on the related issue, adding the functionality in Perceval to fetch comments for pull_requests will deplete the token (used to query the GitHub data source) faster, and that is a valid concern.

The metrics dependent on Pull Requests comments are:

  • Maintainer Response to Merge Request Duration: the response can be a comment or a review on the Pull Request. If we have data from both the issue and pull_request categories, then we can take the minimum of time_to_first_attention and time_to_merge_request_response from the two indices, respectively (see the sketch after this list).
  • Pull Request Comment Duration: We can calculate this from the issues index.
  • Pull Request Comment Diversity: same as above.
  • Pull Request Comments: This too can be calculated directly from the issues index.
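
Here is a small, hypothetical pandas sketch of the “minimum of the two durations” idea for the first metric; the column names are assumptions and the real enriched indices may name these fields differently.

import pandas as pd

# durations (in days) taken from the issues index and the pull_requests index,
# both keyed by the pull request number (column names are hypothetical)
issues_idx = pd.DataFrame(
    {"time_to_first_attention": [2.0, 5.5, None]}, index=[101, 102, 103])
pulls_idx = pd.DataFrame(
    {"time_to_merge_request_response": [3.5, 1.0, 4.0]}, index=[101, 102, 103])

combined = issues_idx.join(pulls_idx, how="outer")

# first maintainer response = earliest of (first comment, first review response)
combined["maintainer_response"] = combined[
    ["time_to_first_attention", "time_to_merge_request_response"]].min(axis=1)

print(combined)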

The only problem with the last three metrics is that they are pull_request-specific metrics being calculated from the issues index. That is a problem because we might want all the pull_request-specific data in the pull_request ES index.

Apart from the above, two more metrics might need some more discussion.

Their definitions are not given yet. After today’s meeting and a discussion with Jesus and Valerio (and any other community members present in the meeting), I can decide how to proceed.

Task-7.B

Here we are primarily focused on the CHAOSS metrics. I’ll start with the GMD metrics by the end of this week, once the CHAOSS metrics are done (tests included). I wrote briefly about how reports are structured in Manuscripts in week-4’s post. We know that the Reports are divided into:

  • overview
  • com_channels (communication channels)
  • project activity
  • project community
  • project process

Using different classes from each data source, we calculate the metrics. Please read the week-4 post to get the general idea.

In manuscripts (not manuscripts2, which contains the new functions), we use the __get_config method in report.py to get all the sections from all the data sources and create a configuration dictionary containing the classes from each data source. These classes can be instantiated with various parameters, and the functions generating the report use them to calculate the metrics under each section.
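
For illustration only, the configuration dictionary has roughly this shape; the class names below are stand-ins for the real per-metric classes defined in each data-source module.

class ClosedIssues: ...
class OpenedIssues: ...
class Authors: ...

config = {
    "overview": {
        "activity_metrics": [ClosedIssues, OpenedIssues],
        "author_metrics": [Authors],
        "bmi_metrics": [],
        # ...
    },
    "project_activity": {
        "metrics": [OpenedIssues, ClosedIssues],
    },
    # com_channels, project_community, project_process ...
}

# report.py later instantiates these classes with the elasticsearch index and
# the date range, and calls their helpers to fill in each section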

We are using a similar approach here. Each data source will have a file of its own and a <DataSource>Metrics class containing the different sections of the report.


For Example:

For the GitHub Issues data source, we call the get_section_metrics function of the IssuesMetrics class to get the configuration and sections of the report.

In the Report class in manuscripts2/report.py, we have a __get_config function which fetches the different sections of the Report.

This method is different from what manuscripts currently uses: rather than creating a class for each of the metrics that we have to calculate (as is done in manuscripts currently), we use instances of the items that we have to get information about. Here an item can be one of Open-PullRequests, Closed-PullRequests, Open-Issues, Closed-Issues, Commits and so on, and all these items inherit the properties of the Query class.

In the get_section_metrics method of each <DataSource>Metrics class, we set up these items and call them with the required parameters. Continuing with the example above, the github_issues.py file looks like this:

from manuscripts2.derived_classes import Issues

DATAFRAME = True
INDEX_NAME = "github_issues"


def metric(metric_, name, id_):
    """Wrap a calculated metric with the details needed by the report."""

    return {
        "metric": metric_,
        "name": name,
        "id": id_,
        "index": "github_issues"
    }


class IssuesMetrics():

    def __init__(self, index, start_date, end_date, interval=None):

        # items for the tickets opened and the tickets closed, restricted to
        # the requested date range; nothing is queried at this point
        self.opened_issues = Issues(index).since(start_date).until(end_date)
        self.closed_issues = Issues(index).since(start_date).until(end_date)
        self.closed_issues.is_closed()

    def get_section_metrics(self):

        # the elasticsearch queries happen here: each report section maps to
        # lists of metric() dicts built from the opened/closed issue items
        return {
            "overview": {
                "activity_metrics": [metric(self.closed_issues.get_cardinality("id")
                                                              .by_period(field="closed_at")
                                                              .get_timeseries(dataframe=DATAFRAME),
                                            "Closed tickets", "closed"),
                                     metric(self.opened_issues.get_cardinality("id")
                                                              .by_period()
                                                              .get_timeseries(dataframe=DATAFRAME),
                                            "Opened tickets", "opened")],
                "author_metrics": [],
                "bmi_metrics": [metric((self.closed_issues.get_cardinality("id")
                                                          .get_aggs(),
                                        self.opened_issues.get_cardinality("id")
                                                          .get_aggs()),
                                       "Backlog Management Index", "bmi_tickets")],
...

As we can see, the items here are opened_issues and closed_issues, which represent the tickets opened and the tickets closed, respectively. The date range for them has been set. Now, using these item instances, we calculate the different sections of the report.

In activity_metrics, we get the number of tickets closed and the number of tickets opened in the given time period.

You should’ve also noticed the metric function. To generate the report, we will require some information about each metric being calculated, so each item in the activity_metrics list is a dictionary containing name, id, index and, finally, metric keys. This dict is used to get both the metric itself and information about it.

The good thing about this design is that no metric will be calculated until an instance of the IssuesMetrics class is created by passing it the corresponding index information. Once the instance is created, we can call the get_section_metrics method to calculate all the metrics for that data source.

This also reduces the number of classes that we have to create to get the metrics (earlier it was one class per metric).


Now, in the Report class, when we call the __get_config method, all the metrics that have to be calculated will be generated. We pass the index, start date and end date to the respective class for each data source and call the get_section_metrics function, which queries elasticsearch for all the data.

Then using the config dictionary, we can calculate the different sections of the Report as done in manuscripts.
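
Putting it together, here is a simplified, hedged sketch (not the exact manuscripts2 code) of how the Report class could merge the sections returned by each <DataSource>Metrics instance into the config dictionary and then walk one section.

class Report:

    def __init__(self, metrics_instances):
        # e.g. [IssuesMetrics(index, start_date, end_date), ...]: one instance
        # per data source, built from modules like github_issues.py above
        self.metrics_instances = metrics_instances

    def __get_config(self):
        # merge the per-data-source section dictionaries into one config dict
        config = {}
        for source in self.metrics_instances:
            for section, groups in source.get_section_metrics().items():
                section_cfg = config.setdefault(section, {})
                for group, entries in groups.items():
                    section_cfg.setdefault(group, []).extend(entries)
        return config

    def create_overview(self):
        config = self.__get_config()
        for entry in config["overview"]["activity_metrics"]:
            # each entry carries the calculated metric plus its name, id, index
            print(entry["name"], entry["id"])

Here each entry in a section is the dict built by the metric function above, so the report code has both the calculated metric and its name, id and index at hand when writing out the section files.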

This is the current approach and it might need some improvement. As I write this, I am working on the tests for the overview section and will make a PR which adds the functionality for generating all the files (CSV and image) for the Overview section, along with tests for them.