How do you Debug? I use a lot of *print* statements and patience
June 7, 2018
Welcome to Week 4 of Google Summer of Code. One of the emails from Google said that time would move pretty fast. Yep, 4 weeks over!
Let’s look at the tasks to be completed this week.
Brief introduction
TASK-1: I created the last PR, which completes the migration of manuscripts to the elasticsearch_dsl module. The main purpose of this PR was to change esquery.py and metrics.py so that the functions do not query elasticsearch directly. Specifically, it removes json.loads from get_agg in esquery.py and changes get_metrics_data to use Search.execute() to get data from elasticsearch.
I also created PR#64, which cleans up the tests a bit.
TASK-2: We are still discussing how to define and calculate some metrics, but apart from that, all the metrics that can be calculated from the enriched index have been calculated. They can be found here in the Notebook. I also opened this issue to add the implementations of some metrics into grimoirelab-elk so that they can be added to the index during enrichment. You can follow the discussions here:
- Issue resolution efficiency
- New Contributors
- First response to issue duration: I’ll make a PR for this soon.
- Other metrics under Code Development
There is still some discussion to be held before we start making PRs and closing these issues though.
TASK-3: In task-3 I had to implement the current reports that manuscripts produces using the new functions that I’ve been working on. Using these new functions, I created a Notebook which analyses the aima-python GitHub repo. I’ve tried to calculate all the metrics using the report PDF as a reference.
This task was particularly interesting because I got an in-depth view of the report.py file. I’ve written about the analysis below. I was able to implement all the current metrics for GitHub and Git in the notebook, so check it out! The new functions still need improvements, and tests have to be written for them.
TASK-4: This task was to visualise the metrics using Plotly, Seaborn and Altair.
TASK-5: This task was to make PRs for the issues that were opened under TASK-2 for Code Development Metrics.
Oh boy, Here I go Debugging again!
Tasks 1 and 2 need no further explanation, I think. So I am starting with TASK-3.
NOTE: Please install sortinghat and set up MySQL or MariaDB (>10.1) when enriching the indices using the p2o.py script. This is because grimoire-elk calibrates the IDs of the authors using sortinghat and adds some specific fields to the enriched index. Those fields won’t be there if we do not use sortinghat, and we will end up with an incomplete analysis.
Instructions to install sortinghat.
Usage.
I learned this the hard way: I got stuck here because I thought that there was a problem in the enrichment itself.
TASK-3
Here, I had to recreate the reports that manuscripts currently produces, using the new_functions.py library. We will analyse parts of the report.py file and see how the reports are being generated.
Introduction to data sources being used and the related classes:
So, all the metrics are derived from the Metrics class in the metrics.py file. Each data source (github, git) has a base class, for example the GithubPRs class. It has a get_section_metrics function which returns a configuration dict describing the different sections that the report is divided into. These sections are:
- overview
- com_channels (communication channels)
- project activity
- project community
- project process
You can see these in the file.
Now, the file github_prs, which contains the GithubPRs class, has a few other classes too, which derive themselves from a common base class (GitHubPRsMetrics in the SubmittedPR example further down).
Other classes have this class as their parent class. Each class in the file represents a metric, and since the base class is derived from the Metrics class in the metrics.py module, each of them has all the functionality that the Metrics class provides.
So, in the get_section_metrics function inside the GithubPRs class, the returned dict lists these classes (containing fields for the metrics they represent) so that the metrics can be calculated from them. In case that is not clear, this is what the returned dict looks like (from get_section_metrics):
return {
    "overview": {
        "activity_metrics": [ClosedPR, SubmittedPR],
        "author_metrics": [],
        "bmi_metrics": [BMIPR],
        "time_to_close_metrics": [DaysToClosePRMedian],
        "projects_metrics": [Projects]
    },
    "com_channels": {
        "activity_metrics": [],
        "author_metrics": []
    },
    "project_activity": {
        "metrics": [SubmittedPR, ClosedPR]
    },
    "project_community": {
        "author_metrics": [],
        "people_top_metrics": [],
        "orgs_top_metrics": [],
    },
    "project_process": {
        "bmi_metrics": [BMIPR],
        "time_to_close_metrics": [],
        "time_to_close_title": "Days to close (median and average)",
        "time_to_close_review_metrics": [DaysToClosePRAverage, DaysToClosePRMedian],
        "time_to_close_review_title": "Days to close review (median and average)",
        "patchsets_metrics": []
    }
}
And the SubmittedPR class will look something like this:
class SubmittedPR(GitHubPRsMetrics):
    id = "submitted"
    name = "Submitted reviews"
    desc = "Number of submitted code review processes"
    FIELD_NAME = 'id'
    FIELD_COUNT = 'id'
    filters = {"pull_request": "true"}
This follows the current design that manuscripts uses.
All the data sources have a similar structure.
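To make the pattern concrete, here is a minimal sketch of how a declarative metric class can be turned into a query. This is not the actual manuscripts code: the Metrics base class below is a stripped-down stand-in, build_query is a hypothetical helper, and the created_at date field is an assumption. It only assembles an elasticsearch-style request body as a plain dict and never talks to a server.

```python
class Metrics:
    """Stand-in base class (hypothetical, not the real manuscripts Metrics)."""

    id = None
    name = None
    desc = None
    FIELD_COUNT = None  # field whose distinct values are counted
    filters = {}        # term filters applied to every query

    @classmethod
    def build_query(cls, date_field, start, end):
        """Assemble an elasticsearch-style request body from class attributes."""
        must = [{"term": {field: value}} for field, value in cls.filters.items()]
        must.append({"range": {date_field: {"gte": start, "lt": end}}})
        return {
            "size": 0,  # only the aggregation is needed, not the documents
            "query": {"bool": {"must": must}},
            "aggs": {cls.id: {"cardinality": {"field": cls.FIELD_COUNT}}},
        }


class GitHubPRsMetrics(Metrics):
    """Shared parent: every pull request metric inherits this filter."""
    filters = {"pull_request": "true"}


class SubmittedPR(GitHubPRsMetrics):
    id = "submitted"
    name = "Submitted reviews"
    desc = "Number of submitted code review processes"
    FIELD_COUNT = "id"


# "created_at" is an assumed date field name, used only for illustration.
query = SubmittedPR.build_query("created_at", "2018-01-01", "2018-06-01")
print(query)
```

The point of the design is that a new metric only needs a handful of class attributes, and everything else comes from the parent classes.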
report.py
Now, coming back to the main file that generates the reports. We have mappings relating the base classes we saw above to their corresponding indices in elasticsearch.
After the initialization and the required calibration of data sources, we generate the main configuration dictionary, which describes which metrics will be part of which sections, using the get_config function. This is the configuration dict that is generated when we use the git and github data sources to create the reports:
{
    'overview': {
        'activity_metrics': [<class 'manuscripts.metrics.github_prs.ClosedPR'>,
                             <class 'manuscripts.metrics.github_prs.SubmittedPR'>,
                             <class 'manuscripts.metrics.github_issues.Closed'>,
                             <class 'manuscripts.metrics.github_issues.Opened'>,
                             <class 'manuscripts.metrics.git.Commits'>],
        'author_metrics': [<class 'manuscripts.metrics.git.Authors'>],
        'bmi_metrics': [<class 'manuscripts.metrics.github_prs.BMIPR'>,
                        <class 'manuscripts.metrics.github_issues.BMI'>],
        'time_to_close_metrics': [<class 'manuscripts.metrics.github_prs.DaysToClosePRMedian'>,
                                  <class 'manuscripts.metrics.github_issues.DaysToCloseMedian'>],
        'projects_metrics': [<class 'manuscripts.metrics.github_prs.Projects'>,
                             <class 'manuscripts.metrics.github_issues.Projects'>,
                             <class 'manuscripts.metrics.git.Projects'>],
        'activity_file_csv': 'data_source_evolution.csv',
        'efficiency_file_csv': 'efficiency.csv'
    },
    'com_channels': {
        'activity_metrics': [], 'author_metrics': []
    },
    'project_activity': {
        'metrics': [<class 'manuscripts.metrics.github_prs.SubmittedPR'>,
                    <class 'manuscripts.metrics.github_prs.ClosedPR'>,
                    <class 'manuscripts.metrics.github_issues.Opened'>,
                    <class 'manuscripts.metrics.github_issues.Closed'>,
                    <class 'manuscripts.metrics.git.Commits'>,
                    <class 'manuscripts.metrics.git.Authors'>],
        'ds1_metrics': [<class 'manuscripts.metrics.github_prs.SubmittedPR'>,
                        <class 'manuscripts.metrics.github_prs.ClosedPR'>],
        'ds2_metrics': [<class 'manuscripts.metrics.github_issues.Opened'>,
                        <class 'manuscripts.metrics.github_issues.Closed'>],
        'ds3_metrics': [<class 'manuscripts.metrics.git.Commits'>,
                        <class 'manuscripts.metrics.git.Authors'>]
    },
    'project_community': {
        'author_metrics': [<class 'manuscripts.metrics.git.Authors'>],
        'people_top_metrics': [<class 'manuscripts.metrics.git.Authors'>],
        'orgs_top_metrics': [<class 'manuscripts.metrics.git.Organizations'>]
    },
    'project_process': {
        'bmi_metrics': [<class 'manuscripts.metrics.github_prs.BMIPR'>,
                        <class 'manuscripts.metrics.github_issues.BMI'>],
        'time_to_close_metrics': [<class 'manuscripts.metrics.github_issues.DaysToCloseAverage'>,
                                  <class 'manuscripts.metrics.github_issues.DaysToCloseMedian'>],
        'time_to_close_title': 'Days to close (median and average)',
        'time_to_close_review_metrics': [<class 'manuscripts.metrics.github_prs.DaysToClosePRAverage'>,
                                         <class 'manuscripts.metrics.github_prs.DaysToClosePRMedian'>],
        'time_to_close_review_title': 'Days to close review (median and average)',
        'patchsets_metrics': []
    }
}
Each entry in the lists in the dictionary is a class that has the fields required to calculate its metric. Since each one is derived from the Metrics class, it has all the functionality needed to access elasticsearch and get the results. This is really smart programming, I must say. As we can see, the generated report is divided into parts that correspond to the top-level keys of the dict above. There are five keys, but com_channels is empty for git and GitHub, which leaves four parts with content.
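Comparing the per-data-source dict with the combined one suggests that the lists of each data source are simply concatenated section by section. Here is a minimal sketch of that idea. It is my own illustration and not the real get_config: merge_section_metrics is a hypothetical name, plain strings stand in for the metric classes, and extras such as the ds1_metrics keys or the CSV file names are left out.

```python
def merge_section_metrics(per_source_sections):
    """Merge the section dicts of several data sources into one config.

    List values (the metric classes) are concatenated in data source order.
    Any other value, such as a title string, is kept from the first data
    source that defines it.
    """
    config = {}
    for sections in per_source_sections:
        for section, entries in sections.items():
            merged = config.setdefault(section, {})
            for key, value in entries.items():
                if isinstance(value, list):
                    merged.setdefault(key, []).extend(value)
                else:
                    merged.setdefault(key, value)
    return config


# Strings stand in for the metric classes to keep the sketch self-contained.
github_prs = {
    "overview": {"activity_metrics": ["ClosedPR", "SubmittedPR"], "author_metrics": []},
    "project_process": {"time_to_close_title": "Days to close (median and average)"},
}
git = {
    "overview": {"activity_metrics": ["Commits"], "author_metrics": ["Authors"]},
}

config = merge_section_metrics([github_prs, git])
print(config["overview"]["activity_metrics"])  # ['ClosedPR', 'SubmittedPR', 'Commits']
```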
OVERVIEW:
Activity metrics: we have to get the trend for these:
- Closed PRs
- Open PRs
- Issues Open
- Issues Closed
- Commits created
Authors per interval selected: the average number of developers per month, by quarters (so we have the average number of developers per month during those three months). If the analysis works at the level of months, then it is just the number of developers per month.
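The quarterly average is easier to see with numbers. The sketch below is only an illustration with made-up commits and hypothetical helper names. It counts the distinct authors in each month and then averages those counts over the three months of each quarter.

```python
from collections import defaultdict
from datetime import date


def monthly_author_counts(commits):
    """Count distinct authors per (year, month) from (date, author) pairs."""
    authors = defaultdict(set)
    for day, author in commits:
        authors[(day.year, day.month)].add(author)
    return {month: len(names) for month, names in authors.items()}


def quarterly_average(monthly_counts):
    """Average the monthly author counts over the three months of a quarter."""
    totals = defaultdict(int)
    for (year, month), count in monthly_counts.items():
        quarter = (month - 1) // 3 + 1
        totals[(year, quarter)] += count
    # A month without any author counts as zero, so always divide by three.
    return {quarter: total / 3 for quarter, total in totals.items()}


# Made-up data: 2 authors in January, 1 in February, 3 in March.
commits = [
    (date(2018, 1, 5), "alice"),
    (date(2018, 1, 20), "bob"),
    (date(2018, 2, 3), "alice"),
    (date(2018, 3, 9), "alice"),
    (date(2018, 3, 9), "carol"),
    (date(2018, 3, 30), "bob"),
]

monthly = monthly_author_counts(commits)
quarterly = quarterly_average(monthly)
print(quarterly)  # {(2018, 1): 2.0}
```

Note that this is different from counting the distinct authors of the whole quarter, which would give 3 here.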
BMI metrics: a little introduction to BMI (Backlog Management Index). Here, BMI measures the efficiency of closing issues and PRs relative to how many are created.
- BMI of PRs: closed PRs / submitted PRs in total, and a trend showing the same ratio per interval (month, week, year) over the given range of time.
- BMI for issues: same as PRs but for issues.
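As a quick worked example of the ratio, here is a minimal sketch with made-up monthly counts. The function names are mine and are not taken from manuscripts. A value above 1 means more items were closed than submitted in that interval.

```python
def bmi(closed, submitted):
    """Ratio of closed to submitted items; None when nothing was submitted."""
    if submitted == 0:
        return None
    return closed / submitted


def bmi_trend(closed_per_interval, submitted_per_interval):
    """BMI for every interval that appears in the submitted counts."""
    return {
        interval: bmi(closed_per_interval.get(interval, 0), submitted)
        for interval, submitted in sorted(submitted_per_interval.items())
    }


# Made-up counts of PRs per month.
submitted = {"2018-03": 10, "2018-04": 8, "2018-05": 0}
closed = {"2018-03": 5, "2018-04": 10}

trend = bmi_trend(closed, submitted)
total = bmi(sum(closed.values()), sum(submitted.values()))

print(trend)  # {'2018-03': 0.5, '2018-04': 1.25, '2018-05': None}
print(total)  # 15 closed / 18 submitted
```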
Time to close metrics:
- Median for Days to close a PR.
- Median for Days to close an issue.
Project metrics: what are they? The field for project does not exist inside the enriched index. There is an issue discussing this metric here.
COMM CHANNELS? There are none for github and git. All the communication takes place through the PRs and the Issues.
ACTIVITIES:
Here, under the project_activity key, we have 4 key-value pairs, as you can see. Here we calculate the number of commits made per month (in the given range) and the number of contributors per month (in the given range).
COMMUNITY: In community, we calculate the active contributors per month and the most active contributors in the previous period of analysis (month/quarter). We also calculate the most active organizations.
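A ranking like "most active contributors in the previous period" boils down to counting items per author inside a date window. Here is a minimal sketch of that using collections.Counter. The records, the field names and the top_by helper are all made up for illustration and do not reflect the fields of the enriched index.

```python
from collections import Counter
from datetime import date


def top_by(records, key, start, end, limit=3):
    """Rank the values of `key` by number of records with start <= date < end."""
    counts = Counter(
        record[key] for record in records if start <= record["date"] < end
    )
    return counts.most_common(limit)


# Made-up commits; the last one falls outside the analysed month.
records = [
    {"author": "alice", "org": "Acme", "date": date(2018, 5, 2)},
    {"author": "alice", "org": "Acme", "date": date(2018, 5, 10)},
    {"author": "alice", "org": "Acme", "date": date(2018, 5, 21)},
    {"author": "bob", "org": "Initech", "date": date(2018, 5, 3)},
    {"author": "bob", "org": "Initech", "date": date(2018, 5, 18)},
    {"author": "carol", "org": "Acme", "date": date(2018, 5, 25)},
    {"author": "carol", "org": "Acme", "date": date(2018, 4, 28)},
]

start, end = date(2018, 5, 1), date(2018, 6, 1)
top_authors = top_by(records, "author", start, end, limit=2)
top_orgs = top_by(records, "org", start, end, limit=1)

print(top_authors)  # [('alice', 3), ('bob', 2)]
print(top_orgs)     # [('Acme', 4)]
```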
PROCESS:
Under Process, we calculate the BMI, that is, the ratio of total closed / total created, for issues and PRs per month in the given range of analysis.
We also calculate the average and median of the time_to_close days for the PRs that were created in each month.
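To show what the per-month average and median of the days to close look like, here is a small sketch over made-up PRs given as (created, closed) date pairs. The helper name is mine, and PRs that are still open are simply skipped, which is an assumption on my side.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, median


def days_to_close_by_month(pull_requests):
    """Group closed PRs by creation month and summarise their days to close."""
    durations = defaultdict(list)
    for created, closed in pull_requests:
        if closed is None:  # still open, so there is no duration yet
            continue
        durations[created.strftime("%Y-%m")].append((closed - created).days)
    return {
        month: {"average": mean(days), "median": median(days)}
        for month, days in sorted(durations.items())
    }


# Made-up PRs: April has one slow PR that took 30 days.
pull_requests = [
    (date(2018, 4, 1), date(2018, 4, 3)),
    (date(2018, 4, 10), date(2018, 4, 14)),
    (date(2018, 4, 20), date(2018, 5, 20)),
    (date(2018, 5, 2), date(2018, 5, 3)),
    (date(2018, 5, 15), None),
]

stats = days_to_close_by_month(pull_requests)
print(stats["2018-04"])  # average 12, median 4
```

The April numbers also show why reporting both values is useful: a single slow PR pulls the average up to 12 days while the median stays at 4.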
Some useful functions to look at:
- sec_overview: This function gets the fields from the config dict and then calls the get_trend function of the classes. It creates a CSV file containing these calculated metrics.
- create_csv_eps: This function takes in 2 metrics, generates a time series for them for the given fields, and then displays these in a bar graph.
- sec_project_activity: the same as sec_overview, but it also creates bar graphs for the metrics.
- sec_project_community: the same as sec_project_activity, but for the commit and author related metrics.
- sec_project_process: calculates the BMI related data for PRs and creates bar graphs plotting that data.
- create_pdf: generates the final report.
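To give a feel for what a function like sec_overview does, here is a rough sketch that computes a trend for each metric and writes the result as CSV. It is only an illustration under my own assumptions: the trend is taken as the percentage change between the last two intervals, and the column names, helper names and numbers are made up. The real get_trend and CSV layout in manuscripts may well differ.

```python
import csv
import io


def trend(last, previous):
    """Percentage change between the last two intervals (None if undefined)."""
    if previous == 0:
        return None
    return round((last - previous) / previous * 100, 1)


def overview_csv(metrics):
    """Write one row per metric: name, last value and trend percentage."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["metric", "last", "trend_percentage"])
    for name, (previous, last) in metrics.items():
        writer.writerow([name, last, trend(last, previous)])
    return buffer.getvalue()


# Made-up values: (previous interval, last interval) for each metric.
metrics = {"Closed PRs": (20, 25), "Commits": (40, 30)}

report = overview_csv(metrics)
print(report)
```

Writing to an in-memory buffer keeps the sketch free of side effects. Swapping the buffer for an open file would produce an actual CSV file.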
The exact analysis that is being generated in the report is shown in the Notebook using the new functions.
This was all for week-4.
In week-5 of GSoC, our hero (me) will be facing the daunting mountains of bar graphs and pie charts while he tries to make PRs for the issues that make sense to him. He will be facing the following challenges:
- Task1: It’s done, son!
- Task2: Help finalize how to calculate the metrics still missing.
- Task3: Create a PR in manuscripts adding a file containing the new functions and one or two implementations of them.
- Task4: Finally work on the visualization part of manuscripts and experiment with Plotly, Seaborn and Altair.
- Task5: Create PRs in grimoirelab-elk for the metrics that still need to be added to the enriched indices. Only for the metrics that are properly defined and have no active discussion about how to calculate them.
Will our hero succeed in completing these tasks? What challenges will he face? Will he finally defeat the ultimate dragon (looking at you, Task#4)? Find out next week.
Adios!