Week-11

July 25, 2018

If you have a good testing infrastructure, you can do anything with your code. Everything becomes easy because you know that as long as the tests are passing, everything is A-Okay. The only downside is that testing is hard, man.

Welcome to Google Summer of Code-18: Week-11.

In Brief

Task-5 [Almost Done (still..)]:

Task: Figure out how to calculate the metrics that depend on Pull Request comment data, along with any others that remain.

Progress: Last week, during our weekly GSoC meeting, Jesus, Valerio and I discussed what can be done to calculate the metrics related to the pull_requests category. The discussion is going on in this issue, and as Jesus pointed out in his last comment:

The assumption is that we will have two raw (Perceval) indexes:

github, the current GitHub index, from the GitHub issues API.
githubpr, the new GitHub PR index, from the GitHub Pull Requests API.

Now, with those, we would be producing a new enriched index, githubpr, with all the data needed to deal with PRs. That data would come from the two previous raw indexes, and yes, it may contain duplicated data with the current github enriched index.

I think at some point we would do the same with a new githubissues index, which will carry data only on issues, deprecating the current github enriched index.

During our discussion on IRC, Valerio suggested that I use studies to perform the enrichment, so that the extra data related to the PRs can be added to the githubpr index. Right now, I am reading about how studies work.

I also read about how aliases work, and from what I’ve read, we can create an alias over the raw github and githubpr indices and then, through that alias, pick the data that we want to enrich and generate the enriched githubpr index. I hope to discuss this further today and reach a conclusion ASAP.
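To make the alias idea concrete, here is a minimal sketch using the official elasticsearch-py client. The index names, the alias name, and the data.pull_request field are my illustrative assumptions, not the final GrimoireLab setup:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Point one alias at both raw indices, so a single query can see the
# issue data and the PR data together.
es.indices.update_aliases(body={
    "actions": [
        {"add": {"index": "github_raw", "alias": "github_all_raw"}},
        {"add": {"index": "githubpr_raw", "alias": "github_all_raw"}},
    ]
})

# Through the alias, pick only the items we want to enrich; here,
# items that carry PR data, via an exists query (field name assumed).
pr_items = es.search(index="github_all_raw", body={
    "query": {"exists": {"field": "data.pull_request"}}
})
```

The appeal of this approach is that the enrichment code can query one logical index and stay oblivious to how the raw data is split between the two APIs.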

Task-7-B [Under Progress]:

Task: To generate the reports using manuscripts2

Progress: After a lot of trial and error, and some major help from Jesus in the form of this PR explaining exactly how the reports should be structured (thanks again, Jesus!), I submitted another PR for the OVERVIEW section of the metrics for the git data source. The tough part was figuring out how to write the tests for these functions. More on this below.

Task-8 [Researching]:

This week too, I couldn’t make much progress on this task.

In Detail

Task-7-B:

Jesus suggested that we have a separate session on how to structure the reports, because the structure I had previously suggested, based on the old manuscripts code, was too much like JS and not at all object oriented. After the discussion I had some clarity about what I needed to do, and I made a PR with the changes. But I still couldn’t follow exactly what Jesus was trying to explain, so he made a PR showing exactly what he had in mind. The idea is to have a class for each of the metrics being calculated, and each section of the report (overview, communication, etc.) will have a separate function which will be called as needed; a rough sketch of this follows below.
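To illustrate the shape of this design, here is a hedged sketch. The class, function, and field names are mine, not the final manuscripts2 code:

```python
from elasticsearch import Elasticsearch


class Commits:
    """One class per metric: this one counts commits in the git index."""
    name = "commits"

    def __init__(self, es, index):
        self.es = es
        self.index = index

    def aggregations(self):
        # Aggregate instead of fetching hits; using "hash" as the commit
        # id field is my assumption about the enriched index.
        body = {"size": 0,
                "aggs": {"commits": {"cardinality": {"field": "hash"}}}}
        result = self.es.search(index=self.index, body=body)
        return result["aggregations"]["commits"]["value"]


def overview(es, index):
    """One function per report section: collect the OVERVIEW metrics."""
    metrics = [Commits(es, index)]  # more metric classes plug in here
    return {metric.name: metric.aggregations() for metric in metrics}


# Usage: overview(Elasticsearch(["http://localhost:9200"]), "git_enriched")
```

The nice property of this design is that adding a new data source mostly means writing new metric classes; the section functions stay the same.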

I worked on top of that PR and submitted an initial structure for the OVERVIEW section of the report for the git data source. I believe that once that structure is finalised, the rest of the data sources can be added in a similar fashion. I plan on describing the new structure once it is settled, so that might come in next week’s blog post, or, if I am feeling ambitious, in a bonus post describing exactly what we are doing.

Tests: Let’s come to the tests now. For testing the reports, we need enriched data for the various data sources we support, and that data is queried by manuscripts2’s functions through elasticsearch. I am focusing on the git, github_issues and github_prs data right now.

For testing the functions, we need to mimic that data and create the whole infrastructure required to generate the report. I was initially confused about what to do, as the tests for elasticsearch.py fetch the git data directly from the repository using perceval, and the tests for the report could not be created in a similar manner. For these tests, we need data that will not change as time passes and that we can control ourselves.

Then I realised that CHAOSS has a lot of tools, and one of them, grimoirelab-elk, actually does something like this. It creates raw and enriched indices using frozen raw data (which is not fetched over the internet) in the form of JSON files: it uploads the raw data to elasticsearch, then uses that raw index to create an enriched index, which it also saves into elasticsearch.

I used a similar approach: I took all the raw data for the perceval git repository and used it to write the tests for the reports, as sketched below. The downside of this technique is that the tests have to do twice the amount of work: first upload the raw data, then enrich it and upload the enriched data into elasticsearch. This works for now, but it is inefficient.
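Roughly, the current two-step setup looks like this. The file paths and index names are placeholders, and the toy enrich_item() helper only stands in for what grimoirelab-elk’s enrichers do in the real tests:

```python
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])


def enrich_item(raw):
    # Toy stand-in for grimoirelab-elk's real enrichment, which does much
    # more; here we just flatten the fields a report query would need.
    commit = raw["data"]
    return {"hash": commit["commit"], "author_date": commit["AuthorDate"]}


# Step 1: bulk-upload the frozen raw items (perceval output saved to a
# JSON file, so the tests never touch the network).
with open("tests/data/git_raw.json") as f:
    raw_items = json.load(f)
helpers.bulk(es, ({"_index": "test_git_raw", "_type": "items",
                   "_source": item} for item in raw_items))

# Step 2, the doubled work: enrich the raw items and upload the result
# into the index that the report functions actually query.
helpers.bulk(es, ({"_index": "test_git_enriched", "_type": "items",
                   "_source": enrich_item(item)} for item in raw_items))
```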

So, during a separate session, we also talked about getting frozen enriched data and using it directly. I had tried this before but wasn’t successful: I kept getting an error due to the mappings of the data. On the bright side, Valerio recently created a Pull Request which uploads only enriched data into elasticsearch to run the tests. I think this is the solution to the testing problems I’ve been facing.
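If that works out, the tests could shrink to a single step along these lines. This is only a sketch under my assumptions: the file, index, and field names are illustrative, and creating the index with an explicit mapping up front is my guess at what was missing when I hit the mapping errors:

```python
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

# Create the index with an explicit mapping first, so elasticsearch does
# not mis-guess field types (my suspicion about the mapping errors).
es.indices.create(index="test_git_enriched", body={
    "mappings": {"items": {"properties": {
        "grimoire_creation_date": {"type": "date"},
        "hash": {"type": "keyword"},
    }}}
})

# One step instead of two: upload the frozen *enriched* items directly.
with open("tests/data/git_enriched.json") as f:
    enriched_items = json.load(f)
helpers.bulk(es, ({"_index": "test_git_enriched", "_type": "items",
                   "_source": item} for item in enriched_items))
```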

All in all, progress is slow, but we are on track with this task.


That is all for now. This week I am going to focus on finishing the reports (once the final structure is decided) and creating a PR for the enriched githubpr data.