I can almost taste it

August 8, 2018

Control room!! This is the reporting ship: All tasks are complete. I repeat: All tasks are complete.

Welcome to the start of the final week of Google Summer of Code. I am happy to tell you that we have the first version of the reports done: the PDF looks wonderful and everything is in order.

In Brief

Last week we were left with only 2 tasks.

Task-5 [DONE]:

Task: To check which metrics still need the relevant fields to be generated so that those metrics can be calculated.

Progress: Last week Valerio created a PR that added a micro mordred capable of running studies and enriching indices. Building on that functionality, I implemented an enrich_pull_requests study on the grimoirelab-elk repo. This study allows us to create a pull requests index containing all the data related to the pull requests in a repository.

Related PRs:

Task-7-B [DONE (tests remain)]:

Task: To create reports using the functions and classes in manuscripts2

Progress: You might have seen the PDF report. I am still cleaning up the code (I actually needed a break, otherwise this would’ve been done earlier) and I’ll be submitting a PR later this evening integrating the report-generation code into the main manuscripts repo.

Related PRs:

These PRs add the remaining sections of the report. With them, we can generate all the data needed to produce a report for the git, github_issues and github_prs data sources.

Moving on...

In Detail

Task-5:

Continuing from above, I was able to figure out how the studies code works and create a study to enrich the pull-requests-only index. This is amazing to me because I couldn’t do it earlier. I think the secret is that you have to sit with the code for a while, turn the knobs and read the manual carefully. “Intuition works, but tests actually confirm that everything is fine.” Thanks to Valerio and the PR that he made. Okay, here is how it works:

The study takes as input the name of the github_issues raw index. This index contains the data about comments on pull requests, which perceval does not fetch when only pull requests data is fetched. We already know the name of the pull-requests-only enriched index, so we fetch the id_in_repo of all the pull requests in it. Remember, this index contains only pull request data, so every item in it is a pull request.

Once we have the ids of all these pull requests, for each item in the pull requests index we fetch the corresponding item from the issues index using that id. This gives us the pull requests as they appear in the github issues raw index, and from that data we can derive the required fields. These fields, computed as sketched after the list below, are:

  • Time to first reaction to the PR: this can be a comment or a review. The pull-requests-only index already has a value for this field. We check whether there was a comment on the PR and compute the time difference between that comment and the time when the PR was created. The first review time and the first comment time are then compared, and the lesser of the two is set as the new value.
  • Num comments: these are fetched directly from the github issues raw index.
  • Comment diversity: the number of people discussing the PR. This can be calculated by counting the unique users who commented on the PR.
  • Comment duration: the time difference between when the PR was created and the date of the last comment on it.
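
To make this concrete, here is a minimal sketch of the matching loop described above, assuming plain dict-like items. The helper and field names are illustrative assumptions, not the actual grimoirelab-elk code:

def enrich_pull_requests_sketch(pr_items, issues_by_id):
    """Sketch: fill PR-only items with fields taken from the issues index.

    pr_items: items from the pull-requests-only enriched index.
    issues_by_id: items from the github_issues raw index, keyed by id_in_repo.
    """
    for pr in pr_items:  # every item here is a pull request
        issue = issues_by_id.get(pr["id_in_repo"])
        if issue is None:
            continue

        # num comments: taken directly from the issues data
        pr["num_comments"] = issue["num_comments"]

        # comment diversity: unique users who commented on the PR
        pr["comment_diversity"] = len(set(issue["commenters"]))

        # time to first reaction: the lesser of first review and first comment
        firsts = [t for t in (pr.get("time_to_first_review"),
                              issue.get("time_to_first_comment"))
                  if t is not None]
        if firsts:
            pr["time_to_first_attention"] = min(firsts)

        # comment duration: creation date to the last comment date
        pr["comment_duration"] = issue["last_comment_date"] - issue["created_at"]

    return pr_items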

By using this study, we can get the complete data for the pull requests of a repo. The configuration file showing how to use this study can be found here. This study can be used once the PR adding micro mordred is merged into SirMordred (https://github.com/chaoss/grimoirelab-sirmordred).

And with that, we now have almost all the fields (except num code iterations) that can be used to calculate all the metrics.

Task-7-B:

This task was to create the grimoirelab-manuscripts reports using the new functions (manuscripts2). The first successful report was generated at 16:27 IST, 3 minutes before today’s meeting.

In the last blog post we saw how the data source files (git, github_prs) have been structured. These files contain the classes which are used to calculate the Metrics.

Now, each section of the Report is a function in the data source file. That function uses the Metrics classes to compute the appropriate metric values. The four major sections of the report, taking the example from the previous post, are as follows:

In manuscripts2/metrics/github.py:

def overview(index, start, end):
    """Compute metrics in the overview section for enriched github issues
    indexes.
    Returns a dictionary. Each key in the dictionary is the name of
    a metric, the value is the value of that metric. Value can be
    a complex object (eg, a time series).
    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    :return: dictionary with the value of the metrics
    """

    results = {
        "activity_metrics": [OpenedIssues(index, start, end),
                             ClosedIssues(index, start, end)],
        "author_metrics": [],
        "bmi_metrics": [BMI(index, start, end)],
        "time_to_close_metrics": [DaysToCloseMedian(index, start, end)],
        "projects_metrics": []
    }

    return results

This function returns the Overview section of the report.

In manuscripts2/report.py:

def get_sec_overview(self):
        """
        Generate the "overview" section of the report.
        """

        logger.debug("Calculating Overview metrics.")

        data_path = os.path.join(self.data_dir, "overview")
        if not os.path.exists(data_path):
            os.makedirs(data_path)

        overview_config = {
            "activity_metrics": [],
            "author_metrics": [],
            "bmi_metrics": [],
            "time_to_close_metrics": [],
            "projects_metrics": []
        }

        for ds in self.data_sources:
            metric_file = self.ds2class[ds]
            metric_index = self.get_metric_index(ds)
            overview = metric_file.overview(metric_index, self.start_date, self.end_date)
            for section in overview_config:
                overview_config[section] += overview[section]

        overview_config['activity_file_csv'] = "data_source_evolution.csv"
        overview_config['efficiency_file_csv'] = "efficiency.csv"

The overview function from each of the data source files is called and all the metrics for that section are added to a dictionary.

        # ACTIVITY METRICS
        metrics = overview_config['activity_metrics']
        file_name = overview_config['activity_file_csv']
        file_name = os.path.join(data_path, file_name)

        csv = "metricsnames, netvalues, relativevalues, datasource\n"

        for metric in metrics:
            (last, percentage) = get_trend(metric.timeseries())
            csv += "{}, {}, {}, {}\n".format(metric.name, last,
                                             percentage, metric.DS_NAME)
        create_csv(file_name, csv)

This creates a CSV file containing the trend of the different activity metrics (relative to the previous time interval): the number of PRs submitted and closed, the number of issues opened and closed, and the number of commits made in this time period compared to the previous one.
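get_trend itself is not shown in this post. As a rough sketch of the idea, assuming the timeseries dict carries one aggregated value per time period (oldest first), it returns the last period’s value and its percentage change over the previous one:

def get_trend_sketch(timeseries):
    # Sketch only: assumes timeseries is a dict with a 'value' list,
    # one aggregated number per time period, oldest first.
    last = timeseries['value'][-1]
    prev = timeseries['value'][-2]
    if last == 0:
        percentage = -100 if prev > 0 else 0
    else:
        percentage = int((last - prev) / last * 100)
    return (last, percentage)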

        # AUTHOR METRICS
        """
        Git Authors:
        -----------
        Description: average number of developers per month by quarters
        (so we have the average number of developers per month during
        those three months). If the approach is to work at the level of month,
        then just the number of developers per month.
        """

        author = overview_config['author_metrics']
        if author:
            authors_by_period = author[0]
            title_label = file_label = authors_by_period.name + ' per ' + self.interval
            file_path = os.path.join(data_path, file_label)
            csv_data = authors_by_period.timeseries(dataframe=True)
            # generate the CSV and the image file displaying the data
            self.create_csv_fig_from_df([csv_data], file_path, [authors_by_period.name],
                                        fig_type="bar", title=title_label, xlabel="time_period",
                                        ylabel=authors_by_period.id)

Calculate the number of authors per time period (month here) for the range of time we are calculating the metrics for.

create_csv_fig_from_df is a very cool function, which I describe below.

        # BMI METRICS
        bmi = []
        bmi_metrics = overview_config['bmi_metrics']
        csv = ""
        for metric in bmi_metrics:
            bmi.append(metric.aggregations())
            csv += metric.id + ", "

        # Time to close METRICS
        ttc = []
        ttc_metrics = overview_config['time_to_close_metrics']
        for metric in ttc_metrics:
            ttc.append(metric.aggregations())
            csv += metric.id + ", "

        # generate efficiency file
        csv = csv[:-2] + "\n"
        csv = csv.replace("_", "")
        bmi.extend(ttc)
        for val in bmi:
            csv += "%s, " % str_val(val)
        if csv[-2:] == ", ":
            csv = csv[:-2]

        file_name = os.path.join(data_path, 'efficiency.csv')
        create_csv(file_name, csv)
        logger.debug("Overview metrics generation complete!")

The last part of the function gets the BMI and time-to-close metric values for the previous time intervals. The queries for these values are configured in the data source file, and the aggregations() function generates those values here.
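As an illustration, with a BMI metric and a DaysToCloseMedian metric (the metric ids and values below are made up), the generated efficiency.csv would look something like:

bmi, daystoclosemedian
0.97, 3.21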

These metrics are calculated by calling the timeseries() or aggregations() methods of the Metric classes. The values returned from these methods are stored in files which are then used when the report is created using LaTeX.
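For instance, a hypothetical snippet tying these together (the Index constructor is an assumption here; the metric class is one of those shown above):

# Hypothetical usage; the exact index object is an assumption.
index = Index("github_issues_enriched")
closed = ClosedIssues(index, start_date, end_date)

ts = closed.timeseries(dataframe=True)  # per-period values, used for CSVs and figures
agg = closed.aggregations()             # a single aggregated value, used in efficiency.csv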

Other sections are as follows:

def project_activity(index, start, end):
    """Compute the metrics for the project activity section of the enriched
    github issues index.
    Returns a dictionary containing a "metric" key. This key contains the
    metrics for this section.
    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    :return: dictionary with the value of the metrics
    """

    results = {
        "metrics": [OpenedIssues(index, start, end),
                    ClosedIssues(index, start, end)]
    }

    return results


def project_community(index, start, end):
    """Compute the metrics for the project community section of the enriched
    github issues index.
    Returns a dictionary containing "author_metrics", "people_top_metrics"
    and "orgs_top_metrics" as the keys and the related Metrics as the values.
    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    :return: dictionary with the value of the metrics
    """

    results = {
        "author_metrics": [],
        "people_top_metrics": [],
        "orgs_top_metrics": [],
    }

    return results


def project_process(index, start, end):
    """Compute the metrics for the project process section of the enriched
    github issues index.
    Returns a dictionary containing "bmi_metrics", "time_to_close_metrics",
    "time_to_close_review_metrics" and patchsets_metrics as the keys and
    the related Metrics as the values.
    time_to_close_title and time_to_close_review_title contain the file names
    to be used for time_to_close_metrics and time_to_close_review_metrics
    metrics data.
    :param index: index object
    :param start: start date to get the data from
    :param end: end date to get the data upto
    :return: dictionary with the value of the metrics
    """

    results = {
        "bmi_metrics": [BMI(index, start, end)],
        "time_to_close_metrics": [DaysToCloseAverage(index, start, end),
                                  DaysToCloseMedian(index, start, end)],
        "time_to_close_review_metrics": [],
        "patchsets_metrics": []
    }

    return results

The function create_csv_fig_from_df

def create_csv_fig_from_df(self, data_frames=[], filename=None, headers=[], index_label=None,
                               fig_type=None, title=None, xlabel=None, ylabel=None, xfont=20,
                               yfont=20, titlefont=30, fig_size=(10, 15), image_type="png"):
        """
        Joins all the datafarames horizontally and creates a CSV and an image file from
        those dataframes.
        :param data_frames: a list of dataframes containing timeseries data from various metrics
        :param filename: the name of the csv and image file
        :param headers: a list of headers to be applied to columns of the dataframes
        :param index_label: name of the index column
        :param fig_type: figure type. Currently we support 'bar' graphs
                         default: normal graph
        :param title: display title of the figure
        :param filename: file name to save the figure as
        :param xlabel: label for x axis
        :param ylabel: label for y axis
        :param xfont: font size of x axis label
        :param yfont: font size of y axis label
        :param titlefont: font size of title of the figure
        :param fig_size: tuple describing size of the figure (in centimeters) (H x W)
        :param image_type: the image type to save the image as: jpg, png, etc
                           default: png
        :returns: creates a csv having name as "filename".csv and an image file
                  having the name as "filename"."image_type"
        """

        if not data_frames:
            logger.error("No dataframes provided to create CSV")
            sys.exit(1)
        assert len(data_frames) == len(headers)
        dataframes = []

        for index, df in enumerate(data_frames):
            df = df.rename(columns={"value": headers[index]})
            dataframes.append(df)
        res_df = pd.concat(dataframes, axis=1)

        if "unixtime" in res_df:
            del res_df['unixtime']
        if not index_label:
            index_label = "Date"

        # Create the CSV file:
        csv_name = filename + ".csv"
        res_df.to_csv(csv_name, index_label=index_label)
        logger.debug("file: {} was created.".format(csv_name))

        # Create the Image:
        image_name = filename + "." + image_type
        figure(figsize=fig_size)
        plt.subplot(111)

        if fig_type == "bar":
            ax = res_df.plot.bar(figsize=fig_size)
            ticklabels = res_df.index
            ax.xaxis.set_major_formatter(matplotlib.ticker.FixedFormatter(ticklabels))
        else:
            plt.plot(res_df)

        if not ylabel:
            ylabel = "num " + " & ".join(headers)
        if not xlabel:
            xlabel = index_label

        plt.title(title, fontsize=titlefont)
        plt.ylabel(ylabel, fontsize=yfont)
        plt.xlabel(xlabel, fontsize=xfont)
        plt.grid(True)
        plt.savefig(image_name)
        logger.debug("Figure {} was generated.".format(image_name))

The function create_csv_fig_from_df does what the name says. We pass it a list of dataframes (timeseries data generated from the different metrics). This data is usually related or meant for comparison (time to close days median vs time to close days average). Each dataframe has two columns: a Date column, which is also the index, and a value column. The column names are changed using the values from the headers variable, and the dataframes are then concatenated into one dataframe containing all the values and sharing the same Date index.

In the second part of this function, the concatenated dataframe is used to create a CSV file with the file name provided earlier (with a .csv extension).

In the third part of this function, the same concatenated dataframe is used to generate an image file (PNG by default). The CSV and image files are stored in the folder for that section.
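As a usage sketch, mirroring the author-metrics call shown earlier (the two metric objects here are hypothetical stand-ins for any pair of related timeseries):

# Hypothetical metrics; any two related timeseries work the same way.
median_df = days_to_close_median.timeseries(dataframe=True)
average_df = days_to_close_average.timeseries(dataframe=True)

self.create_csv_fig_from_df(
    [median_df, average_df],                   # one value column per dataframe
    os.path.join(data_path, "days_to_close"),  # produces .csv and .png files
    headers=["days_to_close_median", "days_to_close_average"],
    fig_type="bar", title="Days to close: median vs average",
    xlabel="time_period", ylabel="days")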

The good thing about the new reports is that they use a lot of already available code (pandas, elasticsearch_dsl), which allows us to focus on the functionality rather than on the implementation.

We are reusing the old LaTeX format and directory structure. I’ve tweaked the format a bit so that the images and CSV files are rendered properly. I’ll make a PR adding the method to generate the report by today.


The final tasks that remain are:

  • Add a good test suite for the reports
  • Add documentation on how the reports can be generated
  • Add the Notebooks for GMD and visualizations, so users can take a more hands-on approach to looking at the GMD metrics
  • Update setup.py so that manuscripts2 can be used properly

And that is it. I’ll be writing a final report containing all the details about the project under GSoC and summarizing the report generation.

Ciao!