Reproducing "A Large-Scale Study of Programming Languages and Code Quality in GitHub: A Reproducibility Study"

This is an attempt to reproduce the data processing and analysis from this paper, based on its associated GitHub repo; the paper is itself a reproducibility study of an earlier paper. The notebook was created by processing all of the .Rmd (R markdown) files in the repo's root directory into one file (done here) and then importing that file into Nextjournal. The result has data and library dependencies filled in and formatting issues corrected; any code changes required to get the processing running and the results displayed are annotated.
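A minimal sketch of that concatenation step, assuming the .Rmd files sit in the repository root and that plain concatenation is sufficient (file names and output path are illustrative):

# concatenate all .Rmd files in the repository root into a single file
rmd_files <- list.files(".", pattern = "\\.Rmd$", full.names = TRUE)
combined  <- unlist(lapply(rmd_files, readLines))
writeLines(combined, "combined.Rmd")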

1. Building the Artifact of the original paper (original_artifact.Rmd)

The following commands extract the contents of the original artifact into the original directory alongside the archive. The original artifact contains two folders: the R_code folder, which holds the original R scripts as they were given to us, and the sqlDump folder, which contains the data.

cd original
tar -zxf artifact.tgz
cd ..

After the original artifact is extracted, the SQL data dump must be combined to produce the two files the rest of the paper is concerned with, namely the _everything.csv file containing the entire dataset and the _newSha.csv file containing the filtered data. The original artifact contains the rebuild.sh script to produce these files, which can be found in the original/sqlDump/DATA directory:

cd original/sqlDump
./rebuild.sh
cd ../..

2. A Large Scale Study of Programming Languages and Code Quality in Github --- Repetition (repetition.Rmd)

Initially we load the implementation and initialize the environment, making sure that all outputs of this script will be stored in the artifact/repetition subfolder:

# load the file containing the actual implementation details
knitr::opts_chunk$set(echo = FALSE)
source("implementation.R")
initializeEnvironment("./artifact/repetition")

2.1. Data collection (chapter 2.2 of the FSE manuscript)

We start with data acquisition. The artifact obtained from the original authors contained two CSV files. The file everything contains one row per unique commit and filename pair. The file newSha contains one row per unique commit and language pair. Thus, one row of newSha summarizes one or more rows of everything. By looking at the code of the artifact we have determined that the contents of the newSha file were used for the analysis.

everything <- loadEverything()
newSha <- loadNewSha()

2.1.1. Number of projects analyzed

The study claims that 729 projects were used and that each one of them had more than 27 commits in any language used:

newSha_project_names = sort(unique(newSha$project))
number_of_projects <- length(newSha_project_names)
check(number_of_projects==729) 
projects_with_too_few_commits = newSha %>% group_by(project, language) %>% dplyr::summarize(n = n()) %>% filter(n <= 27)
check(nrow(projects_with_too_few_commits) == 0)

Let's see how many projects there are in everything:

# There are projects in the larger file that are missing from the smaller file, with no explanation in the paper
everything_project_names <-  sort(unique(everything$project))
assert_that(length(everything_project_names)==877)
# some projects are not used in the paper
projects_not_in_paper    <-  setdiff(everything_project_names, newSha_project_names)
number_of_projects_not_included <- length(projects_not_in_paper)
check(number_of_projects_not_included == 148)
# some commits are not used
everything %>% filter(project %in% projects_not_in_paper) -> everything_unused
# and the rest is used
everything %>% filter( project %in% newSha_project_names)  -> everything_used
assert_that(nrow(everything_used)==4956282)

OK, that's more; let's check whether the discrepancy is explained by too few commits in the missing projects:

number_of_large_projects_not_included <- nrow(everything %>% dplyr::filter(project %in% projects_not_in_paper) %>%  group_by(project,tag) %>% dplyr::summarize (n=n()) %>% filter(n>27))
check(number_of_large_projects_not_included == 101)
out("numberOfProjectsIncluded", number_of_projects )
out("numberOfProjectsNotIncluded", number_of_projects_not_included)
out("numberOfLargeProjectsNotIncluded", number_of_large_projects_not_included)

We found data for 148 projects that have been ignored in the study. Although the study filtered out projects with fewer than 28 commits, 101 of the ignored projects have 28 commits or more.

2.1.2. Sizes of the code analyzed

The study claims that 63M SLOC were analyzed and uses the difference between insertions and deletions as reported by git to calculate this. Although this metric is not precise, we follow it:

sloc <- sum(newSha$insertion) - sum(newSha$deletion)
sloc_mio <- round(sloc / 1000000, 1)
check(sloc_mio == 63) # what the paper claims
out("slocMio", sloc_mio)
check(sloc_mio == 80.7)

In fact, there are 80.7 million lines of code.

2.1.3. Claim: 29K authors

number_authors <- length(unique(everything_used$author))
check(number_authors / 1000 == 29) # what the paper says
check(number_authors == 47297)
number_committers <- length(unique(everything_used$committer))
check(number_committers == 29082)
out("numberCommitters", round(number_committers/1000, 0))
out("numberAuthors", round(number_authors/1000, 0))

everything %>% filter(committer == "Linus Torvalds") -> lt
everything %>% filter(author == "Linus Torvalds") -> lt2
out("linusCommitter", nrow(lt))
out("linusAuthor", nrow(lt2))

There are 29K committers, but the study is interested in active developers and not just people with commit rights. There are 47K developers.

2.1.4. Claim: 1.5M commits

number_of_commits <- length(unique(newSha$sha))
check(number_of_commits == 1478609)
out("numberCommitsMio", round(number_of_commits/1000/1000,1))

There are 1.5m commits.

2.1.5. Claim: 17 languages

number_of_languages = length(unique(newSha$language))
check(number_of_languages == 17)

The authors have indeed operated with 17 languages.

2.1.6. Project-Language pairs with fewer than 20 commits are ignored

The authors claim that they excluded project-language pairs with fewer than 20 commits. Let's see if there are any such pairs in newSha:

newSha %>% group_by(project, language) %>% dplyr::summarize(n=n()) %>% filter(n < 20) -> too_small_to_keep
check(nrow(too_small_to_keep) == 0)

There are none, so the data has already been cleaned.

2.1.7. Claim: 564,625 buggy commits.

newSha %>% filter(isbug == 1) -> bugFixes
numberOfBugFixes <-  length(unique(bugFixes$sha))
check(numberOfBugFixes == 530777)
out("numberOfBugFixes", numberOfBugFixes)

The number of bug fixes is close but not quite matching. We have about 30K fewer.

2.1.8. everything to newSha summary

Let's now summarize the part of everything that covers the projects used in newSha, to check for any further discrepancies w.r.t. newSha. First we select only the columns we will need and convert discrete data into factors to limit the memory footprint and speed up execution:

# The data frames have redundancy. Delete columns we can derive and remove columns we don't use. Turn character columns into factors.
everything %>% dplyr::select(tag,         # language assigned by authors
                           author,             # name of author
                           committer,          # name of committer (not same as author)
                           commit_date,        # when was the commit made
                           commit_age,         # ???
                           insertion,          # number of lines inserted
                           deletion,           # number of lines deleted
                           sha,                # commit id
                           project,            # project name
                           file_name,           # commit filename
                           domain,
                           isbug
                          )  -> everything_clean

# turn discrete data into factors  
everything_clean %>% mutate(language = as.factor(tag),
                               committer = as.factor(committer),
                               project = as.factor(project)
                               ) -> everything_clean

Next, there are some redundancies in everything, such as rows that differ only in the github_language column but describe the same file (we found 23210 such rows). Since we are not interested in the github_language column any more, we can simply collapse such redundancies:
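As a quick cross-check of that number, the redundancy can be recounted roughly as follows (a sketch; it assumes the duplicated rows differ only in the columns we have already dropped):

# rows that repeat the same sha/file_name/project combination in everything_clean
nrow(everything_clean) - nrow(dplyr::distinct(everything_clean, sha, file_name, project))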

# Keep only one row per unique combination of the columns we are interested in
everything_clean %>% distinct(sha,file_name, project,.keep_all=TRUE)  -> everything_clean

To ease our life down the road, we also rename the language factor levels to use the same names as the newSha version, and order the factor alphabetically for better visualization:

# sanity: use the same labels as in newSha
everything_clean$language <-  
  revalue(everything_clean$language, c( "c"="C",
                                        "clojure"="Clojure",
                                        "coffeescript"="Coffeescript",
                                        "cpp"="C++","csharp"="C#",
                                        "erlang"="Erlang",
                                        "go"="Go",
                                        "haskell"="Haskell",
                                        "java"="Java",
                                        "javascript"="Javascript",
                                        "objc"="Objective-C",
                                        "perl"="Perl",
                                        "php"="Php",
                                        "python"="Python",
                                        "ruby"="Ruby",
                                        "scala"="Scala",
                                        "typescript"="Typescript"))

# sanity: levels in alphabetical order
everything_clean$language <- 
    factor(everything_clean$language, levels=sort(levels(everything_clean$language)))

Finally, let's summarize everything_clean per language:

checkSha = everything_clean %>% 
    dplyr::select(-(file_name)) %>% 
    group_by(sha) %>% 
    mutate(files = n()) %>% 
    distinct(sha, .keep_all = T) %>% 
    dplyr::select(language, project, sha, files, committer, author, commit_age, insertion, deletion, isbug, domain)

Let's see the number of unique commits found in newSha and the version constructed from everything:

newSha_distinct = unique(newSha$sha) 
checkSha_distinct = unique(checkSha$sha) 
in_both = intersect(newSha_distinct, checkSha_distinct)
out("notInNewShaRaw", length(checkSha_distinct) - length(in_both))
out("notInEverythingRaw", length(newSha_distinct) - length(in_both))

This is promising. What we can do now is take the author field from checkSha and append it to newSha, which lacks the author column (the original paper used committers instead, as we discovered above).

stuff = checkSha %>% dplyr::select(sha, author)
newSha = inner_join(newSha, stuff, "sha")

Now, in order to properly compare newSha with the version we obtained from everything, we should remove from the latter any project-language pairs that have fewer than 20 commits:

checkSha = checkSha %>%
    group_by(project, language) %>%
    mutate(n = n()) %>%
    filter(n >= 20) %>%
    dplyr::select(-(n));

And redo the verification that it contains everything newSha does:

newSha_distinct = unique(newSha$sha) 
checkSha_distinct = unique(checkSha$sha) 
in_both = intersect(newSha_distinct, checkSha_distinct)
out("notInNewSha", length(checkSha_distinct) - length(in_both))
out("notInEverything", length(newSha_distinct) - length(in_both))

OUCH. So there are now commits that are in newSha but not in the summarized everything. Furthermore, we should compare based on project + commit, since a commit may be shared by multiple projects:

newSha_distinct = unique(paste0(newSha$project, newSha$sha))
checkSha_distinct = unique(paste0(checkSha$project, checkSha$sha)) 
in_both = intersect(newSha_distinct, checkSha_distinct)
out("notInNewShaWithProject", length(checkSha_distinct) - length(in_both))
out("notInEverythingWithProject", length(newSha_distinct) - length(in_both))

OUCH^2. It would seem there is a lot of duplication in newSha that is not present in everything. Spoiler alert: more on duplication in newSha in the re-analysis section.

As the last thing in the data collection section, let us give newSha the same treatment we gave to everything, i.e. add factors, remove unused columns, etc.:

# newSha also has redundancies
newSha %>% dplyr::select(language,
                  typeclass,
                  langclass, 
                  memoryclass,
                  compileclass,
                  project,
                  sha,
                  files,
                  committer,
                  author,
                  commit_age,
                  commit_date,
                  insertion,
                  deletion,
                  isbug,
                  #bug_type,
                  domain,
                  btype1,
                  btype2
                ) %>%
            mutate(language = as.factor(language),
                   typeclass = as.factor(typeclass),
                   langclass = as.factor(langclass),
                   memoryclass = as.factor(memoryclass),
                   compileclass = as.factor(compileclass),
                   project = as.factor(project),
                   #committer = as.factor(committer),
                   #bug_type = as.factor(bug_type),
                   domain = as.factor(domain)
            ) -> newSha

# sanity: levels in alphabetical order
newSha$language <- 
    factor(newSha$language, levels=sort(levels(newSha$language)))

2.2. Categorizing Languages (section 2.3)

While we did not have the code that categorizes the languages into groups, the classes are well described in the paper and the classification data is available in the newSha file. As a first step we thus verify that the categorization in newSha conforms to the categories explained in the paper. First, we bring the class format to a single-column form compatible with what we use further down in the study.

We reclassify hybrid memory model languages (Objective-C) to Unmanaged, which is what the paper later suggests (the claim is that while Objective-C can use both, unmanaged is the default case).

newSha$typeclass = revalue(newSha$typeclass, c("strong" = "Str", "weak" = "Wea"))
newSha$langclass = revalue(newSha$langclass, c("functional" = "Fun", "proc" = "Pro", "script" = "Scr"))
newSha$memoryclass = revalue(newSha$memoryclass, c("unmanaged" = "Unm", "managed" = "Man", "hybrid" = "Unm"))
newSha$compileclass = revalue(newSha$compileclass, c("dynamic" = "Dyn", "static" = "Sta"))

newSha$combinedOriginal = paste(newSha$langclass, newSha$compileclass, newSha$typeclass, newSha$memoryclass)

Now, classify the languages according to the classification as described in the FSE paper:

newSha$combined = classifyLanguageAsOriginal(newSha$language)

Make sure that the original and our recreated classification is identical:

check(identical(newSha$combinedOriginal, newSha$combined))

So no errors in this section. Let's remove the combined column:

newSha = newSha %>% dplyr::select(-(combined))

Next, the paper actually excludes TypeScript from the language category analysis, saying that TypeScript does not properly fit into the categories, so we assign it a special category (Other) so that we can easily remove it later. Finally, we convert the combinedOriginal column into a factor.

newSha$combinedOriginal[newSha$language == "Typescript"] = "Other"
newSha = newSha %>% mutate(combinedOriginal = as.factor(combinedOriginal))

2.2.1. Better Language Classification

Although the language categorization in the data is the same as in the paper (modulo the hybrid memory class for Objective-C and the categories given for Typescript), it is wrong. We have reclassified the obvious mistakes and marked as Other the two languages that do not properly fit the categories (Scala and Objective-C). Furthermore, we categorize Typescript as scripting, dynamic, weak, and managed, since it fully conforms to this category:

Category | Per Paper | Per Reality
----------------------------------|--------------------------------|-------------
Functional-Static-Strong-Managed | haskell scala | haskell
Functional-Dynamic-Strong-Managed | clojure erlang | (empty)
Proc-Static-Strong-Managed | java go cs | java go c#
Proc-Static-Weak-UnManaged | c cpp | C C++
Script-Dynamic-Weak-Managed | coffee js perl php | Python Perl Ruby JavaScript Php CoffeeScript Typescript
Script-Dynamic-Strong-Managed | python ruby | (empty)
Functional-Dynamic-Weak-Managed | (empty) | clojure erlang
Other | Typescript | Scala Objective-C
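For readers without access to implementation.R, a classifier matching the "Per Reality" column above can be sketched as follows (the category string format mirrors the abbreviations used elsewhere in this notebook and is otherwise an assumption):

# sketch only: the actual classifyLanguageCorrected lives in implementation.R
correctedCategories <- c(
    "C" = "Pro Sta Wea Unm", "C++" = "Pro Sta Wea Unm",
    "C#" = "Pro Sta Str Man", "Go" = "Pro Sta Str Man", "Java" = "Pro Sta Str Man",
    "Haskell" = "Fun Sta Str Man",
    "Clojure" = "Fun Dyn Wea Man", "Erlang" = "Fun Dyn Wea Man",
    "Coffeescript" = "Scr Dyn Wea Man", "Javascript" = "Scr Dyn Wea Man",
    "Perl" = "Scr Dyn Wea Man", "Php" = "Scr Dyn Wea Man",
    "Python" = "Scr Dyn Wea Man", "Ruby" = "Scr Dyn Wea Man",
    "Typescript" = "Scr Dyn Wea Man",
    "Scala" = "Other", "Objective-C" = "Other")
classifyLanguageCorrectedSketch <- function(language) unname(correctedCategories[as.character(language)])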

newSha$combinedCorrected = as.factor(classifyLanguageCorrected(newSha$language))

2.3. Identifying Project Domain (2.4)

The artifact available to us does not contain the code for the identification, and the paper does not specify it in enough detail for us to repeat it from scratch. The data, however, contain the domain column, which corresponds to the domain assigned in this section. We will use it in the subsequent analysis.
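As a quick orientation, the distribution of projects over these domains can be inspected directly from the domain column (a minimal sketch; the domain labels are whatever the artifact provides):

# number of projects per domain as recorded in the artifact
newSha %>% distinct(project, domain) %>% group_by(domain) %>% dplyr::summarize(projects = n())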

2.4. Categorizing Bugs (2.5)

The artifact again did not contain any code towards this goal, nor did we find the results of such categorization anywhere in the data. We will revisit this in RQ4.
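The closest related information in the data are the btype1 and btype2 columns, which we will use in RQ4 below; a quick look at their raw label counts:

# raw label counts in the two bug-type columns of newSha
table(newSha$btype1)
table(newSha$btype2)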

2.5. Saving the processed data

This is not an actual part of the repetition, but we save newSha and checkSha for future use:

write.csv(newSha, paste0(WORKING_DIR, "/Data/newSha.csv"))
write.csv(checkSha, paste0(WORKING_DIR, "/Data/checkSha.csv"))

2.6. Are some languages more defect prone than others? (chapter 3, RQ 1)

Get the data, summarize the commit information per project and language, and log-transform according to the paper (in the code we obtained, log10 and log are used for different variables of the model):

newSha$combined = newSha$combinedOriginal
newSha$devs = newSha$committer
X = summarizeByLanguage(newSha)
Y = logTransform(X, log10, log)
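For readers without access to implementation.R, a rough sketch of what logTransform produces follows; which variables receive log10 versus log, and the handling of zero bug counts, are assumptions inferred from how the transformed columns are used later in this notebook:

# sketch only: the real summarizeByLanguage and logTransform live in implementation.R
logTransformSketch = function(X, f1 = log10, f2 = log) {
    X %>% mutate(ldevs = f1(devs),
                 ltins = f1(tins),
                 lcommits = f1(commits),
                 lbcommits = f1(ifelse(bcommits == 0, 0.5, bcommits)),  # 0 replaced with 0.5, see boxplots below
                 lmax_commit_age = f2(max_commit_age))
}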

2.6.1. Fit the Negative Binomial Regression

The Negative Binomial model is fit on the original scale of bcommits. However, the predictors were log-transformed to avoid influential observations (as was done in the study). First we verify that the weighted contrasts conform to the original:

contr.Weights(Y$language)

We see a diagonal of ones, and the last line (Typescript) contains the weighted contrasts; this looks good.
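For reference, a weighted-contrasts matrix of this shape can be sketched as follows (the actual contr.Weights is defined in implementation.R; the exact weighting is an assumption):

# sketch only: identity columns for all but the last level; the last level's row
# carries negative weights proportional to the level frequencies
contr.WeightsSketch <- function(fac) {
    k <- nlevels(fac)
    n <- as.vector(table(fac))
    m <- rbind(diag(k - 1), -n[1:(k - 1)] / n[k])
    dimnames(m) <- list(levels(fac), levels(fac)[1:(k - 1)])
    m
}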

Let's fit the model, and then the releveled model to get the parameter for the last language:

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language, contrasts = list(language = contr.Weights(Y$language)), data=Y)
nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language_r, contrasts = list(language_r = contr.Weights(Y$language_r)), data=Y)
# combine them into single result table
result = combineModels(nbfit, nbfit_r, Y$language)
result

Let's now juxtapose this with the FSE paper's results. The ok column contains TRUE if the original made a claim at some significance level and we support that claim at the same level, FALSE if the original made a claim and we either do not support it or support it only at a worse significance level, and NA if there was no claim in the original to begin with.
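A minimal sketch of that comparison logic, assuming the baseline stores the original paper's p-values and that the usual 0.05/0.01/0.001 levels are used (the actual checkPValues is defined in implementation.R):

# sketch only: does our p-value support the original claim at the same level?
supportsClaim <- function(fse_pv, our_pv, levels = c(0.001, 0.01, 0.05)) {
    if (is.na(fse_pv)) return(NA)              # no claim made in the original
    claimed <- levels[levels >= fse_pv]
    if (length(claimed) == 0) return(NA)       # original reported no significance
    our_pv <= min(claimed)                     # TRUE if supported at the same level or better
}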

juxt = merge(result, baselineFSE_RQ1(), by = 0, all = T, sort = F)
juxt$ok = checkPValues(juxt, "FSE_pv", "pVal")
juxt

Outside of the control variables, the only value we were not able to repeat is PHP. We believe this is a simple typo in the original paper (the authors themselves have noticed this in the CACM reprint of the paper), so we have created an updated baseline:

juxt = merge(result, baselineFSE_RQ1_fixed(), by = 0, all = T, sort = F)
juxt$ok = checkPValues(juxt, "FSE_pv", "pVal")
juxt

Better. We were able to repeat RQ1 convincingly.

2.6.2. Deviance Analysis

The authors also display numbers for a deviance analysis. They find that log commits is the single largest contributing factor, with the languages second at less than 1%:

anova(nbfit)

Well, this looks different indeed. This is because we use a different order of the control variables, and anova performs type-I tests, so order matters. If we change the order to correspond to the original manuscript, we can replicate:

nbfit_order_fixed = glm.nb(bcommits~lcommits+lmax_commit_age+ltins+ldevs+language, contrasts = list(language = contr.Weights(Y$language)), data=Y)
anova(nbfit_order_fixed)

However, the proper way would be to use the Anova function, which is order independent as it performs type-II tests:

Anova(nbfit)

2.7. Which Language Properties Relate to Defects (chapter 3, RQ 2)

First, let's drop Typescript, which now belongs to the Other category since we are not interested in it:

newSha$combined = newSha$combinedOriginal
newSha$devs = newSha$committer
X = summarizeByLanguage(newSha)
Y = logTransform(X, log10, log)
Y = Y %>% filter(combined != "Other")
Y = droplevels(Y)

Let's check the weights first, as we did in RQ1:

contr.Weights(Y$combined)

Fit the model, similarly to RQ1:

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+combined, contrasts = list(combined = contr.Weights(Y$combined)), data=Y)
nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+combined_r, contrasts = list(combined_r = contr.Weights(Y$combined_r)), data=Y)
# combine them into single result table
result = combineModels(nbfit, nbfit_r, Y$combined)
result

And let's juxtapose this against the original FSE paper:

juxt = merge(result, baselineFSE_RQ2(), by = 0, all = T, sort = F)
juxt$ok = checkPValues(juxt, "FSE_pv", "pVal")
juxt

Using the original classification, we can repeat the RQ2 results properly. The only difference is the large coefficient we see for commits. This is because, apparently, the FSE paper used natural logarithms everywhere this time, so we do the same below.

Now let's see whether anything changes when we use the updated classification (and natural logarithms):

newSha$combined = newSha$combinedCorrected
Xcor = summarizeByLanguage(newSha)
Ycor = logTransform(Xcor, log, log)
Ycor = Ycor %>% filter(combined != "Other")
Ycor = droplevels(Ycor)
nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+combined, contrasts = list(combined = contr.Weights(Ycor$combined)), data=Ycor)
nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+combined_r, contrasts = list(combined_r = contr.Weights(Ycor$combined_r)), data=Ycor)
# combine them into single result table
resultReclassified = combineModels(nbfit, nbfit_r, Ycor$combined)
juxt = merge(resultReclassified, baselineFSE_RQ2(), by = 0, all = T, sort = F)
juxt$ok = checkPValues(juxt, "FSE_pv", "pVal")
juxt

We can observe that the log commits coefficient now corresponds to the original value due to the log change. Furthermore, we see that the Pro Sta Str Man and Scr Dyn Wea Man categories are no longer significant, and we cannot say anything about the Fun Dyn Str Man category since there are no such languages (but these almost exclusively went to Fun Dyn Wea Man, which is significant under the reclassification).

To summarize, we were able to replicate RQ2 properly. However, if we correct the errors in the classification of languages into categories, the claims become much weaker: notably, procedural managed languages and scripting languages are no longer significantly more likely to contain bugs.

Finally, create the table we will use in the paper:

output_RQ2_table(result, resultReclassified)

2.8. Does Language Defect Proneness Depend on Domain (chapter 3, RQ3)

The artifact we obtained contained no code to answer this question, so we will attempt to figure the code out ourselves.

Let's try to recreate the heatmaps from RQ3 of the original paper. First we prepare the data:

langs = length(unique(Y$language))
domains = length(unique(Y$domain))

ratio = matrix(0, langs, domains)
projects = matrix(0, langs, domains)
commits = matrix(0, langs, domains)
bcommits = matrix(0, langs, domains)

for (i in 1:nrow(Y)) {
    r = Y[i,]
    langIndex = as.numeric(r$language)
    domainIndex = as.numeric(r$domain)
    projects[langIndex, domainIndex] = projects[langIndex, domainIndex] + 1
    commits[langIndex, domainIndex] = commits[langIndex, domainIndex] + r$commits
    bcommits[langIndex, domainIndex] = bcommits[langIndex, domainIndex] + r$bcommits
}

ratio = bcommits / commits

Now do the heatmap:

heatmap.2(ratio, dendrogram = "none", cellnote = round(ratio,2), trace = "none", labRow = levels(Y$language), labCol = levels(Y$domain), Rowv = NA, Colv = NA, notecol = "red", col = gray(256:0 / 256 ))

Visual comparison (which is all we have, since no raw data for the heatmaps was in the artifact either) does confirm that this is very similar to the first heatmap in RQ3 of the original paper. After that, the paper argues that outliers needed to be removed, where outliers are projects with bug densities below the 10th percentile or above the 90th percentile. We therefore remove these and create the second heatmap, first creating the data in the same way as before:

ratioVect = Y$bcommits / Y$commits
q10 = quantile(ratioVect, 0.1)
q90 = quantile(ratioVect, 0.9)
Y__ = Y[(ratioVect > q10) & (ratioVect < q90),]

ratio_ = matrix(0, langs, domains)
projects_ = matrix(0, langs, domains)
commits_ = matrix(0, langs, domains)
bcommits_ = matrix(0, langs, domains)

for (i in 1:nrow(Y__)) {
    r = Y__[i,]
    rr = r$bcommits / r$commits 
    langIndex = as.numeric(r$language)
    domainIndex = as.numeric(r$domain)
    projects_[langIndex, domainIndex] = projects_[langIndex, domainIndex] + 1
    commits_[langIndex, domainIndex] = commits_[langIndex, domainIndex] + r$commits
    bcommits_[langIndex, domainIndex] = bcommits_[langIndex, domainIndex] + r$bcommits
}

ratio_ = bcommits_ / commits_

How much data did we shed?

cat(paste("Records:          ", round(100 - nrow(Y__) / nrow(Y) * 100, 2), "%\n"))
cat(paste("Buggy commits:    ", round(100 - sum(Y__$bcommits) / sum(Y$bcommits) * 100, 2), "%\n"))
cat(paste("Commits:          ", round(100 - sum(Y__$commits) / sum(Y$commits) * 100, 2), "%\n"))

OK, that is fairly substantial. But that is what happens when you cut at the 10th and 90th percentiles; it should be roughly 20% of the data.

The heatmap:

heatmap.2(ratio_, dendrogram = "none", cellnote = round(ratio_,2), trace = "none", labRow = levels(Y__$language), labCol = levels(Y__$domain), Rowv = NA, Colv = NA, notecol = "red", col = gray(256:0 / 256 ))

So, this looks different. In the original paper, C is strongest in Code Analyzer, while we see C being strongest in Database. Also, Perl is very strong in Database, but we see no such thing in the paper.

Note: the paper ignores the Other category and instead shows the overall results, but this has no effect on the actual domains.

Frustrated by the fact that we cannot replicate the second heatmap, we considered other means by which we could support or refute the claim of the original RQ3, namely that there is no general relationship between domain and language defect proneness. To check this, we can use the same GLM as in RQ1 and RQ2, but this time with domains. If we do not find a significant correlation, then, since we have already found a significant correlation for languages, it would follow that domains have no general effect.

Let's first summarize the data again:

newSha$combined = newSha$combinedCorrected
newSha$devs = newSha$committer
X = summarizeByLanguage(newSha)
Y = logTransform(X, log, log)
# discard the other domain
Y = Y %>% filter(domain != "Other")
Y = droplevels(Y)

And fit the model, this time with domains instead of languages or language categories. However, we do two extra things: we change the contrasts to zero-sum contrasts, which make more sense, and we adjust for multiple hypotheses using the Bonferroni method; see re-analysis.Rmd or the paper for more details on why.

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+domain, contrasts = list(domain = "contr.sum"), data=Y)
nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+domain_r, contrasts = list(domain_r = "contr.sum"), data=Y)
# combine them into single result table
result = combineModels(nbfit, nbfit_r, Y$domain, pValAdjust = "bonferroni")
result

Looking at the results above, none of the domains is significant at level 0.01, so we can say that our test supports the claim made by the original authors, and we move on to the last RQ. But before doing that, we generate a LaTeX table from our result, since this table goes into the paper:

result$pVal = lessThanPvCheck001(round(result$pVal, 2))
for (i in 1:nrow(result)) {
    out(paste0("rqIIIname",as.roman(i)), rownames(result)[[i]])
    out(paste0("rqIIIcoef", as.roman(i)), result$coef[[i]])
    out(paste0("rqIIIpv", as.roman(i)), result$pVal[[i]])
}

2.9. What is the relation between language and bug category? (chapter 3, RQ4)

There was no column in the dataset with the same bug type labels as Table 5 in the paper. Our closest approximation to the authors' bug categories uses the Algo, Concurrency, Memory, and Programming labels from btype1 and the Security, Performance, and Failure labels from btype2, and it classifies a commit as Other when it is labeled as such in both btype1 and btype2.

btype_list <- mapply(c, newSha$btype1, newSha$btype2, SIMPLIFY=F)

nb_fun <- function(l) {
  !("NB" %in% l)
}

btype_justbugs <- Filter(nb_fun, btype_list)
numjustbugs    <- length(btype_justbugs)
check(numjustbugs == 559186)

# Make sure btype1 != btype2
# This will rule out list("Other", "Other")
different_btypes <- function(l) {
  length(unique(l)) == 2
}

# rows with one btype listed as "Other" are
# not overlapping
other_fun <- function(l) {
  !("Other" %in% l)
}

overlap    <- Filter(other_fun, btype_justbugs)
overlap    <- Filter(different_btypes, overlap)
numoverlap <- length(overlap)
check(numoverlap == 7417)

percent_overlap <- round((numoverlap / numjustbugs)*100, digits=2)

It is unclear whether the bug categories are meant to be mutually exclusive--but the sum of their percent counts is 104.44%.

Of the 559186 buggy commits in newSha, 1.33% of them have two different bug categories assigned to them, neither of which is "Other".

We attempted to recreate the numbers in Table 5 with the unmodified data in newSha.

T5_bug_type <- c("Algorithm", "Concurrency", "Memory", "Programming", "Security", "Performance", "Failure", "Unknown")
bugtype_count <- c(606, 11111, 30437, 495013, 11235, 8651, 21079, 5792)
percent_count <- c(0.11, 1.99, 5.44, 88.53, 2.01, 1.55, 3.77, 1.04)
# Calculate highest possible value for each category in our data set

btype_algo <- nrow(newSha[newSha$btype1 == "Algo",])
btype_conc <- nrow(newSha[newSha$btype1 == "Concurrency",])
btype_mem  <- nrow(newSha[newSha$btype1 == "Memory",])
btype_prog <- nrow(newSha[newSha$btype1 == "Programming",])
btype_sec  <- nrow(newSha[newSha$btype2 == "Security",])
btype_perf <- nrow(newSha[newSha$btype2 == "Performance",])
btype_fail <- nrow(newSha[newSha$btype2 == "Failure",]) # There is no "FAILURE" category in bug_type
btype_unknown <- nrow(newSha[newSha$btype1 == "Other" & newSha$btype2 == "Other",])

our_bugtype_count <- c(btype_algo, btype_conc, btype_mem, btype_prog, btype_sec, btype_perf, btype_fail, btype_unknown)

btype_total = sum(newSha$isbug)
our_percent = round((our_bugtype_count / btype_total)*100, digits=2)
factordiff = mapply(function(X, Y) { round((Y / X), digits=4) }, bugtype_count, our_bugtype_count)
countdiff   = mapply(function(X, Y) { Y - X }, bugtype_count, our_bugtype_count)

bugpercenttotal_known <- sum(percent_count[1:7])
check(bugpercenttotal_known == 103.4)
bugpercenttotal <- sum(percent_count)
check(bugpercenttotal == 104.44)

t5_bug_categories <- data.frame(T5_bug_type, btype_count=bugtype_count, percent=percent_count, our_btype_count=our_bugtype_count, our_percent, diff_factor=factordiff, count_difference=countdiff)
t5_bug_categories

These differences in the composition of the data shared with us versus that reported by the authors make the data incomparable.

remove(WORKING_DIR)

2.10. Summary

We were mostly able to verify the basic data collection steps, although we found discrepancies in some of the reported numbers, as well as discrepancies in both directions between the summarization of the unique project-commit-file data in everything and the project-commit-language data in newSha. Overall, the data in the artifact seem reasonably complete to attempt to repeat the analysis.

We were not able to repeat the project domain and bug category classifications, since the artifact contains no code for these and the description provided in the paper was not detailed enough for us to repeat them from scratch.

The paper made four claims:

We were able to completely replicate the first claim (language bug proneness) if we assume a typo in one significance level. We found minor inconsistencies in the paper which have no effect on the claim.

We were able to replicate the second claim (language categories); however, we found mistakes in the language classification. When we corrected them, the original claim was only partially valid.

We were not able to replicate the third claim (language domains) because the data cleaning was not completely described in the paper, and our cleaned data differed significantly from the original. We have, however, used another method to the same end.

We were not able to replicate the fourth claim (bug categories). The input data for this problem were not available, and no detailed description of how to recreate them was available in the artifact or the paper.

3. R Notebook (re-analysis.Rmd)

This is our attempt at re-analysis. We only concentrate on RQ1.

First, because this Rmd file can be executed by an external script to allow the automatic generation of all possible permutations of our re-analysis cleaning steps, we define variables that conditionally enable features of this document. By default they are all set to TRUE:

if (! exists("REMOVE_DUPLICATES"))
    REMOVE_DUPLICATES = T
if (! exists("REMOVE_TYPESCRIPT"))
    REMOVE_TYPESCRIPT = T
if (! exists("REMOVE_V8"))
    REMOVE_V8 = T
if (! exists("UNCERTAINTY"))
    UNCERTAINTY = T
if (! exists("USE_AUTHORS_INSTEAD_OF_COMMITTERS"))
    USE_AUTHORS_INSTEAD_COMMITTERS = T
# sanity check prints
cat(paste("REMOVE_DUPLICATES:              ", REMOVE_DUPLICATES, "\n"))
cat(paste("REMOVE_TYPESCRIPT:              ", REMOVE_TYPESCRIPT, "\n"))
cat(paste("REMOVE_V8:                      ", REMOVE_V8, "\n"))
cat(paste("UNCERTAINTY:                    ", UNCERTAINTY, "\n"))
cat(paste("USE_AUTHORS_INSTEAD_COMMITTERS: ", USE_AUTHORS_INSTEAD_COMMITTERS, "\n"))

Next, we set up the working directory. It can also be provided by an external script; if it is not provided, we verify that all features are enabled and set it to the default, i.e. artifact/re-analysis, where the paper expects the graphs to be found.

if (! exists("WORKING_DIR")) {
    if (! REMOVE_DUPLICATES & REMOVE_TYPESCRIPT & REMOVE_V8 & UNCERTAINTY & USE_AUTHORS_INSTEAD_COMMITTERS)
        stop("If non-default flags are submitted to the re-analysis notebook, WORKING_DIR must be set.")
    WORKING_DIR = "./artifact/re-analysis"
}

Now initialize the script's environment:

# load the file containing the actual implementation details
knitr::opts_chunk$set(echo = FALSE)
source("implementation.R")
initializeEnvironment()

3.1. Data Collection

We reuse the cleaning and optimization passes from the repetition and just load the result of that:

data = read.csv("./artifact/repetition/Data/newSha.csv")
initial_data = data
initial_number_of_commits = length(unique(data$sha))
initial_number_of_rows = nrow(data)
everything = loadEverything()

3.2. Data Cleaning

3.2.1. Removing Duplication

First, we remove any duplicates, i.e. commits that are present in multiple projects. We remove all occurrences of commits that have duplicates, since we do not know which one should stay and different picks may bias the later analysis.

if (REMOVE_DUPLICATES) {
    sha_proj = data %>% dplyr::select(sha, project) %>% group_by(sha, project) %>% dplyr::summarize(n = n())
    duplicates_sha = sha_proj %>% group_by(sha) %>% dplyr::summarize(n = n()) %>% filter(n > 1)
    # this is the number of commits that are present in multiple projects
    num_duplicates_sha = nrow(duplicates_sha)
    check(num_duplicates_sha == 27450)
    # how many projects are affected? 
    dup_repos = data %>% filter(sha %in% duplicates_sha$sha)
    dup_repos = unique(dup_repos$project)
    num_dup_repos = length(dup_repos)
    check(num_dup_repos == 33)
    # determine how many commits are there in the affected projects in total
    commits_by_dup_repos = data %>% filter(project %in% dup_repos) %>% group_by(sha)
    num_commits_by_dup_repos = nrow(commits_by_dup_repos)
    # since we can't do better, exclude all duplicate commits from the dataset
    data = data %>% filter(! sha %in% duplicates_sha$sha);
    # some reporting for the paper
    out("numberOfProjectsWithDuplicates", num_dup_repos)
    out("numDuplicateCommits", num_duplicates_sha)
    out("percentageDuplicateCommits", round(num_duplicates_sha/initial_number_of_commits * 100,2))
    out("percentageDuplicateRowsLost", round(100 - nrow(data) / initial_number_of_rows * 100, 2))
    hist(duplicates_sha$n, breaks = 100)
}

By removing the duplicates we may have created project-language pairs that have fewer than 20 commits. The original study excluded such pairs, and we see no reason not to repeat this. Let's find any such rows and delete them:

if (REMOVE_DUPLICATES) {
    num_rows_before_cleanup = nrow(data)
    not_keep = data %>% group_by(project, language) %>% dplyr::summarize(n = n()) %>% filter(n < 20)
    keep = data %>% group_by(project, language) %>% dplyr::summarize(n = n()) %>% filter(n >= 20)
    data = keep %>% inner_join(data, by=c("project", "language"))
    out("smallProjectCommits", nrow(not_keep))
    out("percentageDuplicationReadjustmentRowsLost", round(100 - nrow(data)/num_rows_before_cleanup * 100, 2))
}

3.2.2. Removing Typescript

The Typescript language was released on October 1, 2012, and is the smallest language in the study. However, its first commit in the study is dated 2003. What is happening here is that the .ts extension is also used for translation files, not Typescript, which the original study did not detect.

if (REMOVE_TYPESCRIPT) {
    ts = data %>% filter(language == "Typescript")
    ts_rows_num = nrow(ts)
    ts_proj_num = length(unique(ts$project))
    check(ts_proj_num == 53)
    out("initialNumTSProjects", ts_proj_num)
    ts_commits_num <- length(unique(ts$sha))  
    check(ts_commits_num == 10105)
    out("initialNumTSCommits", ts_commits_num)
    ts_dates <- sort(unique(ts$commit_date))  # "2003-03-01"
    check(ts_dates[1] == "2003-03-21")
    out("tsFirstCommit", ts_dates[1])
}

We now remove from TS what we manually confirmed to be translations and not actual TypeScript:

if (REMOVE_TYPESCRIPT) {
  translation_projects <- c("mythtv", "hw", "SpeedCrunch", "qBittorrent", "mumble", "pokerth", "hydrogen", "unetbootin",
                          "tiled", "goldendict", "zf2", "Cockatrice", "tagainijisho", "focuswriter", "LibreCAD", 
                          "razor-qt", "qupzilla", "tomahawk", "ppcoin", "mirall", "MuseScore", "shotcut")
  ts = ts %>% filter(! project %in% translation_projects)
  check(length(unique(ts$sha)) == 4456)
  ts = ts %>% 
      filter(!str_detect(project, "coin")) %>%  # remove all the *coin  projects
      filter(!str_detect(project, "Coin")) %>%  # remove all the *Coin  projects
      filter(!str_detect(project, "change"))  # remove all the *change  projects
  # TODO why are these in remaining??? 
  remaining_translations <- c("antimicro", "identifi", "octopi", "pcbsd", "subsurface", "ttrss", "wps_i18n")
  ts = ts %>% filter(!project %in% remaining_translations)
  real_ts_num_proj <- length(unique(ts$project)) 
  check(real_ts_num_proj == 16)
  out("realTSProjNum", real_ts_num_proj)
  real_ts_num_commit <- length(unique(ts$sha))
  check(real_ts_num_commit == 3782)
  out("realTSCommitsNum", real_ts_num_commit)
  real_ts_proj <- unique(ts$project)
}

Another issue with Typescript is that a lot of its files are only type definitions, containing no actual Typescript code, only headers for Javascript functions and classes defined elsewhere. We should exclude these as well:

if (REMOVE_TYPESCRIPT) {
  # typescript-node-definitions, DefinitelyTyped, tsd
  # DEPRECATED: TSD is deprecated, please use Typings and see this issue for more information.

  tdefs1 = ts %>% filter(project=="typescript-node-definitions") 
  tdefs2 = ts %>% filter(project=="DefinitelyTyped") 
  tdefs3 = ts %>% filter(project=="tsd")
  sts <- length(unique(ts$sha))
  st1 <- length(unique(tdefs1$sha))
  st2 <- length(unique(tdefs2$sha))
  st3 <- length(unique(tdefs3$sha))

  ratio <- round((st1 + st2 + st3)/sts*100, 1)
  check(ratio == 34.6)
  out("ratioOfTypeDefTSCommits", ratio)
  
  ts = ts %>% filter(project !="typescript-node-definitions") %>% filter(project !="DefinitelyTyped") %>% filter(project !="tsd")
}

So what are we left with?

if (REMOVE_TYPESCRIPT) {
   ts_valid_rows_num = nrow(ts)
   ts_valid_commits = length(unique(ts$sha))
   ts_valid_projects = length(unique(ts$project))
   out("tsValidRows", ts_valid_rows_num)
   out("tsValidCommits", ts_valid_commits)
   out("tsValidProjects", ts_valid_projects)
}

We are down to only 13 projects out of the initial 53, and given that Typescript was the smallest language to begin with and how many of its commits we had to remove with minimal effort, the only safe thing we can do is to exclude Typescript from the analysis completely:

if (REMOVE_TYPESCRIPT) {
    num_rows_before_ts = nrow(data)
    data = data %>% filter(language != "Typescript")
    out("percentageTypescriptRowsLost", round(100 - nrow(data) / num_rows_before_ts * 100, 2))
    data$language = factor(as.character(data$language))
}

3.3. Removing V8

The V8 project is plagued with errors and inaccuracies:

if (REMOVE_V8) {
    v8 = everything %>% filter(project == "v8")
    out("vCCommits", length(unique(v8[v8$tag == "c",]$sha)))
    out("vCppCommits", length(unique(v8[v8$tag == "cpp",]$sha)))
    out("vJavascriptCommits", length(unique(v8[v8$tag == "javascript",]$sha)))
    out("vPythonCommits", length(unique(v8[v8$tag == "python",]$sha)))
}

The paper ignores all of V8's C++ files (the .cc and .h file extensions are not tracked) and classifies V8 as a JavaScript project. This is obviously wrong, and since V8 is one of the larger projects, it may skew the analysis substantially. To avoid this, we remove V8 from the data:
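To see how V8 was tagged in the raw data, the everything rows for V8 can be broken down by file extension (a sketch; tools::file_ext is used here purely for illustration):

if (REMOVE_V8) {
    # distribution of file extensions among the v8 rows of everything
    v8 %>% mutate(ext = tools::file_ext(file_name)) %>%
        group_by(ext) %>% dplyr::summarize(files = n()) %>% arrange(desc(files))
}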

if (REMOVE_V8) {
    num_rows_before_v8 = nrow(data)
    data = data %>% filter(project != "v8")
    out("percentageVEightRowsLost", round(100 - nrow(data) / num_rows_before_v8 * 100, 2))
}

3.3.1. Summary

Let's summarize the final numbers after all cleaning:

newShaNum <- length(unique(data$sha))
ratioSha <- round(100 - (newShaNum/initial_number_of_commits*100),2)
out("finalNumSha", newShaNum)
out("finalNumShaMio", round(newShaNum/1000/1000,1))
out("ratioReducedSha", ratioSha)
out("ratioReducedShaRows", round(100 - nrow(data) / nrow(initial_data) * 100, 2))

f_number_of_projects <- length(unique(data$project))
check(f_number_of_projects == 719) 
out("finalNumberOfProjectsIncluded", f_number_of_projects)

f_sloc <- sum(data$insertion) - sum(data$deletion)
f_sloc_mio <- round(f_sloc / 1000000, 1)
check(f_sloc_mio == 58.2)
out("finalSlocMio", f_sloc_mio)

f_number_authors <- length(unique(data$author))
check(f_number_authors == 46204)
out("finalNumberAuthors", round(f_number_authors/1000,0))

f_bugFixes = data %>% filter(isbug == 1)
f_numberOfBugFixes <-  length(unique(f_bugFixes$sha))
out("finalNumberOfBugFixes", f_numberOfBugFixes)

Let's see how much data we shed in different categories:

cat(paste("Commits (unique):     ", round(100 - length(unique(data$sha)) / length(unique(initial_data$sha)) * 100, 2), "%\n"))
cat(paste("Commits (rows):       ", round(100 - nrow(data) / nrow(initial_data) * 100, 2), "%\n"))
cat(paste("Buggy commits (rows): ", round(100 - sum(data$isbug) / sum(initial_data$isbug) * 100, 2), "%\n"))
cat(paste("Projects:             ", round(100 - length(unique(data$project)) / length(unique(initial_data$project)) * 100, 2), "%\n"))
cat(paste("SLOC:                 ", round(100 - sum(data$insertion - data$deletion) / sum(initial_data$insertion - initial_data$deletion) * 100, 2), "%\n"))
cat(paste("Languages:            ", round(100 - length(unique(data$language)) / length(unique(initial_data$language)) * 100, 2), "%\n"))
cat(paste("Authors:              ", round(100 - length(unique(data$author)) / length(unique(initial_data$author)) * 100, 2), "%\n"))

3.4. Summarization

Summarize the data for the graphs and modelling as we did in the repetition:

data$combined = data$combinedOriginal # does not matter
if (USE_AUTHORS_INSTEAD_COMMITTERS) {
    data$devs = data$author
} else {
    data$devs = data$committer
}
X = summarizeByLanguage(data)
Y = logTransform(X, log, log)

3.5. Graphs

Let's do some graphs:

3.5.1. Commits vs. Bugfixes.

data %>% group_by(language) %>% summarize(commits = n(), bugfixes = sum(isbug)) -> df

ggplot(data=df, aes(x=commits, y=bugfixes)) + 
  geom_smooth(method='lm',size=.5, formula = y ~ x)  +
  geom_point()  + 
  geom_text_repel(aes(label=language), segment.color = 'grey50', fontface = 'bold', size = 5) + 
  theme_light() +
  theme(text = element_text(size = 15)) +
  labs(y="Bug-fixing commits", x = "Commits") +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) + 
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                  labels = trans_format("log10", math_format(10^.x))) -> p

ggsave(paste(WORKING_DIR, "/Figures/commits.pdf", sep = ""))
print(p)

3.5.2. Bug rate vs project age for selected languages

X %>% 
  dplyr::select(project, language, commits, bcommits,max_commit_age) %>% 
  mutate(br=round(bcommits/commits*100,1)) -> df

most_commits <- .9 * sum(X$commits)
df %>% arrange(desc(commits)) -> df

# find the number of projects that together comprise over 90% of the entire dataset
accumulatedCommits = 0
i = 0
while (accumulatedCommits <= most_commits) {
    i = i + 1
    accumulatedCommits = accumulatedCommits + df[i, 3]
}
# verify that we have the correct number
check(sum(df[1:i,3]) > most_commits) 
check(sum(df[1:i-1,3]) <= most_commits) 

df[1:i,] ->most_df

most_df %>% filter(language %in% c("C", "C++","Clojure","Haskell","Objective-C","Scala")) -> most_df2

br_mean <- most_df2 %>% group_by(language) %>% summarise(br=mean(br))
age_mean <- most_df2 %>% group_by(language) %>% summarise(age=mean(max_commit_age))

ggplot(data=most_df2, aes(x=max_commit_age,y=br,color=language)) + geom_point() +
  geom_hline(aes(yintercept=br, color='black'), size=.25, br_mean) +
  geom_vline(aes(xintercept=age), size=.25, age_mean) +
   theme_light() +
  theme(text = element_text(size=20)) +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                  labels = trans_format("log10", math_format(10^.x))) +
                labs(x="Project age (days)", y="Bug rate (percent of commits)") +
                theme(legend.position = "none") +
                facet_wrap(~language, ncol=2) -> p

X %>% 
  mutate(br=round(bcommits/commits*100,1)) %>% 
  group_by(language) %>% 
  summarize(mbr = mean(br), mage = mean(max_commit_age)) %>% 
  arrange(desc(mbr)) -> mbr

ggsave(paste(WORKING_DIR, "/Figures/age.pdf", sep = ""), width=8, height=8)
print(p)

3.5.3. Developer behavior

data %>% group_by(author) %>% summarize(n=n())  %>% arrange(desc(n)) -> devs

most_commits_bydev <- .9*sum(devs$n)

# find the number of developers accounting for <= 90% of the entire dataset
accumulatedCommits = 0
d = 0
while (accumulatedCommits <= most_commits_bydev) {
    d = d + 1
    accumulatedCommits = accumulatedCommits + devs[d, 2]
}

# verify that we have the correct number
check(sum(devs[1:d-1,2]) < most_commits_bydev) 
check(sum(devs[1:d,2]) > most_commits_bydev)

devs_prolific <- devs$author[1:d]

# Show the number of languages the most prolific authors committed for
data %>% filter(author %in% devs_prolific) -> devs_prolific
devs_prolific = devs_prolific %>% group_by(author,language) %>% summarize(n=n()) %>% summarize(l=n())  %>% arrange(desc(l))
devs_prolific
hist(devs_prolific$l)

3.6. Modelling

3.6.1. Basic analyses & graphs of the dataset

dimnames(X)[[2]]
pairs(X[,3:7], gap=0, pch='.')

Let's log transform the data:

Y = logTransform(X)
pairs(Y[,2:6], gap=0, pch='.')

3.6.2. Language Frequencies

Explore the relative frequencies of the languages, i.e. how many projects in the dataset use each of them:

sort(table(Y$language))

The dataset contains 17 languages (16, if we remove TypeScript). They were represented by unevenly distributed numbers of projects (from 23 for Perl to 199 for Javascript).

Explore the relationship between the languages and the other measurements:

# (devs, tins, max_commit_age, commits, bcommits)
boxplot(split(Y$lbcommits, Y$language), las=2, main="Log(bugged commits, 0 replaced with 0.5)")
par(mfrow=c(1,2), oma=c(1,1,1,1), mar=c(5,1,2,1))
boxplot(split(Y$ldevs, Y$language), las=2, main="Log(devs)")
boxplot(split(Y$ltins, Y$language), las=2, main="Log(tins)")
boxplot(split(Y$lmax_commit_age, Y$language), las=2,
        main="Log(max_commit_age)")
boxplot(split(Y$lcommits, Y$language), las=2, main="Log(commits)")

The distribution of bug-commits varied between the languages, but so did the distributions of the other measurements. It is difficult to see direct associations between languages and bugs from bivariate plots.

3.6.3. Fitting the Negative Binomial

Let's now fit the negative binomial and compare to the original paper; this is the same code as in the repetition:

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language, contrasts = list(language = contr.Weights(Y$language)), data=Y)
nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language_r, contrasts = list(language_r = contr.Weights(Y$language_r)), data=Y)
# combine them into single result table
resultWeighted = combineModels(nbfit, nbfit_r, Y$language)
juxtWeighted = merge(resultWeighted, baselineFSE_RQ1(), by = 0, all = T, sort = F)
juxtWeighted$ok = checkPValues(juxtWeighted, "FSE_pv", "pVal")
juxtWeighted

And we see that merely cleaning the data invalidated several of the language claims made by the original paper.

3.6.4. Fitting the zero-Sum Contrasts Negative Binomial

Zero-sum contrasts are preferred to weighted contrasts for their better stability. Let's see what they look like:

contr.sum(length(levels(Y$language)))

And let's fit the model:

nbfit = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language, contrasts = list(language = contr.sum), data=Y)
nbfit_r = glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language_r, contrasts = list(language_r = contr.sum), data=Y)
# combine them into single result table
resultZeroSum = combineModels(nbfit, nbfit_r, Y$language)
juxtZeroSum = merge(resultZeroSum, baselineFSE_RQ1(), by = 0, all = T, sort = F)
juxtZeroSum$ok = checkPValues(juxtZeroSum, "FSE_pv", "pVal")
juxtZeroSum

Some invalidations remain.

3.6.5. Fit Negative Binomial regression without languages and compare the full and the reduced models

nbfit_reduced <- glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits, data=Y)
summary(nbfit_reduced)

Comparing the two nested models (the reduced model vs. the full model) using a likelihood-ratio test:

anova(nbfit_reduced, nbfit)
cat("AIC, full:", AIC(nbfit), "\n")
cat("AIC, reduced:", AIC(nbfit_reduced), "\n")
cat("BIC, full:", BIC(nbfit), "\n")
cat("BIC, reduced:", BIC(nbfit_reduced), "\n")

The difference between the models is borderline.

3.6.6. Adjusting for Multiple Hypothesis

The original paper simply compares the p-values against thresholds, which is not the correct way to proceed. In the presence of multiple hypotheses, the p-values must be adjusted. There are various ways to adjust them, notably Bonferroni and FDR (Benjamini & Hochberg). The FDR adjustment is more permissive and Bonferroni is the more conservative one. What we can do now is revisit the juxtaposed tables and add extra columns for the FDR and Bonferroni adjustments:

# take the pValues for the predictions only, not for the control variables, since those are ignored by the adjustments
pValWeighted = juxtWeighted$pVal[6:(5 + length(unique(Y$language)))]
# create the empty vectors with NAs instead of pValues
fdrWeighted = rep(NA, nrow(juxtWeighted))
bonfWeighted = rep(NA, nrow(juxtWeighted))
# update the relevant parts of the vectors with the adjusted pValues
fdrWeighted[6:(5+ length(pValWeighted))] = round(p.adjust(pValWeighted, "fdr"), 3)
bonfWeighted[6:(5+ length(pValWeighted))] = round(p.adjust(pValWeighted, "bonferroni"), 3)
# add the columns to the juxtaposed tables
juxtWeighted$pVal_fdr = fdrWeighted
juxtWeighted$pVal_bonf = bonfWeighted
# now remove the old ok column and add the ok, ok_fdr and ok_bonf columns
juxtWeighted = juxtWeighted %>% dplyr::select(-(ok))
# add the columns
juxtWeighted$ok = checkPValues(juxtWeighted, "FSE_pv", "pVal")
juxtWeighted$ok_fdr = checkPValuesLevel(juxtWeighted, "FSE_pv", "pVal_fdr", 0.01)
juxtWeighted$ok_bonf = checkPValuesLevel(juxtWeighted, "FSE_pv", "pVal_bonf", 0.01)
juxtWeighted

With the exception of the last language (Coffeescript), the FDR and Bonferroni are the same and they actually invalidate some of the original predictions. Let's look at the zero-sum contrasts version now:

# take the pValues for the predictions only, not for the control variables, since those are ignored by the adjustments
pValZeroSum = juxtZeroSum$pVal[6:(5 + length(unique(Y$language)))]
# create the empty vectors with NAs instead of pValues
fdrZeroSum = rep(NA, nrow(juxtZeroSum))
bonfZeroSum = rep(NA, nrow(juxtZeroSum))
# update the relevant parts of the vectors with the adjusted pValues
fdrZeroSum[6:(5+ length(pValZeroSum))] = round(p.adjust(pValZeroSum, "fdr"), 3)
bonfZeroSum[6:(5+ length(pValZeroSum))] = round(p.adjust(pValZeroSum, "bonferroni"), 3)
# add the columns to the juxtaposed tables
juxtZeroSum$pVal_fdr = fdrZeroSum
juxtZeroSum$pVal_bonf = bonfZeroSum
# now remove the old ok column and add the ok, ok_fdr and ok_bonf columns
juxtZeroSum = juxtZeroSum %>% dplyr::select(-(ok))
# add the columns
juxtZeroSum$ok = checkPValues(juxtZeroSum, "FSE_pv", "pVal")
juxtZeroSum$ok_fdr = checkPValuesLevel(juxtZeroSum, "FSE_pv", "pVal_fdr", 0.01)
juxtZeroSum$ok_bonf = checkPValuesLevel(juxtZeroSum, "FSE_pv", "pVal_bonf", 0.01)
juxtZeroSum

The situation is fairly similar here.

3.6.7. Statistical significance vs practical significance, for languages

Since the number of observations is large, statistical significance can be driven by the sample size without being meaningful in practice.

Here we contrast model-based confidence intervals and prediction intervals, on the log scale and on the original scale:

numLanguages = length(unique(Y$language))

# Create a new data structure for prediction
newY <- NULL
for (i in 1:numLanguages) {
  newY <- rbind(newY, 
                data.frame(language=rep(levels(Y$language)[i], 100),
                ldevs=rep(median(Y$ldevs), 100), 
                lcommits=seq(from=min(Y$lcommits), to=max(Y$lcommits), length=100),
                ltins=rep(median(Y$ltins), 100),
                lmax_commit_age=rep(median(Y$lmax_commit_age), 100)))
}
newY$commits <- exp(newY$lcommits)

# Make predictions
pr_nbfit <- predict(nbfit, type="response", newdata=newY, se.fit=TRUE)
newY$pr_mean <- pr_nbfit$fit
newY$pr_se <- pr_nbfit$se.fit

Consider the languages with the most predicted bugs (C++) and the fewest predicted bugs (Clojure). Compute the log confidence interval and the log prediction interval for C++ and Clojure, then translate the intervals to the original scale.

axfont  = 32 # size of axis title font
ticfont = 26 # size of axes' font
legfont = 28 # size of legend title
legtext = 26 # size of legend text
ptitle  = 30 # size of plot title letters

getConfInterval<-function(df,lang) {
  df %>% 
    filter(language==lang)  %>%
    mutate(language = lang,
           x = lcommits, y = log(pr_mean), 
           yhigh = log(pr_mean + qnorm(1-0.01/numLanguages) * pr_se),
           ylow =  log(pr_mean - qnorm(1-0.01/numLanguages) * pr_se)) %>%
    dplyr::select(language, x, y, ylow, yhigh)
}
dfCI <- rbind(getConfInterval(newY, "C++"), getConfInterval(newY, "Clojure"))

getPredInterval<-function(df,lang) {
  df %>% 
    filter(language==lang) %>% 
    mutate(language = lang,
           x = lcommits, y = log(pr_mean), 
           yhigh = log(qnbinom(1-0.01/numLanguages, mu= pr_mean, size= nbfit$theta) ),
           ylow = log(qnbinom(0.01/numLanguages, mu=pr_mean, size=nbfit$theta))) %>%
    dplyr::select(language, x, y, ylow, yhigh)
}
dfPI <- rbind(getPredInterval(newY, "C++"), getPredInterval(newY, "Clojure"))

plotIt <- function(df) {
  ggplot(data = df, aes(x=x,y=y,color=language)) + geom_line() +
    geom_ribbon(aes(ymin=ylow, ymax=yhigh), linetype=2, alpha=0.1) +
    labs(x="log of commits", y="log of bug-fixing commits") + 
    theme_light() +
    theme(axis.title = element_text(size=axfont),
          axis.text = element_text(size=ticfont),
          plot.title = element_text(hjust = 0.5, size = ptitle))
}

plotIt(dfCI) + theme(legend.position = "none") + ggtitle("(a)") -> p1
plotIt(dfPI) + theme(legend.title = element_text(size=legfont),
                      legend.text = element_text(size=legtext)) +
                      guides(colour = guide_legend(nrow = 1)) + 
                      ggtitle("(b)") -> p2

# Grab the legend to display it separately
tmp    <- ggplot_gtable(ggplot_build(p2))
l      <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[l]]
legend

p2 + theme(legend.position = "none") -> p2

getConfInterval <- function(df,lang) {
  df %>% 
    filter(language==lang)  %>%
    mutate(language = lang,
           x = commits,
           y = pr_mean, 
           yhigh = pr_mean + qnorm(1-0.01/numLanguages) * pr_se,
           ylow =  pr_mean - qnorm(1-0.01/numLanguages)* pr_se ) %>%
    dplyr::select(language, x, y, ylow, yhigh)
}
dfCI <- rbind(getConfInterval(newY, "C++"), getConfInterval(newY, "Clojure"))

getPredInterval<-function(df,lang) {
  df %>% 
    filter(language==lang) %>% 
    mutate(language = lang,
           x = commits, y = pr_mean, 
           yhigh = qnbinom(1-0.01/numLanguages, mu= pr_mean, size= nbfit$theta) ,
           ylow =qnbinom(0.01/numLanguages, mu=pr_mean, size=nbfit$theta)) %>%
    dplyr::select(language, x,y,ylow,yhigh)
}
dfPI <- rbind(getPredInterval(newY, "C++"), getPredInterval(newY, "Clojure"))

plotIt <- function(df) {
  ggplot(data = df, aes(x=x,y=y,color=language)) + geom_line() +
    geom_ribbon(aes(ymin=ylow, ymax=yhigh), linetype=2, alpha=0.1) +
    theme_light() +
    theme(axis.title = element_text(size=axfont),
          axis.text = element_text(size=ticfont),
          plot.title = element_text(hjust = 0.5, size = ptitle),
          legend.position = "none") +
    labs(x="commits", y="bug-fixing commits")
}
plotIt(dfCI) + xlim(0,800) + ylim(0,400) + ggtitle("(c)") -> p3
plotIt(dfPI) + xlim(0,800) + ylim(0,620) + ggtitle("(d)") -> p4
pdf(paste(WORKING_DIR, "/Figures/intervals.pdf", sep = ""), width=32, height=8)
figs <- grid.arrange(arrangeGrob(p1, p2, p3, p4, nrow=1), arrangeGrob(legend), nrow=2, heights=c(4,1))
dev.off()
# Plot predicted means for all the languages
par(mfrow=c(1,2))
# Log scale
plot(log(pr_mean)~lcommits,
     main="log(Exp. values), all languages",
     data=newY[newY$language==levels(Y$language)[1],],
     type="l", 
     ylab="log(bugged commits)")
for (i in 2:numLanguages) {
  lines(log(pr_mean)~lcommits,
     data=newY[newY$language==levels(Y$language)[i],])
}

# Original scale
plot(pr_mean~commits,
     main="Exp. values, all languages",
     data=newY[newY$language==levels(Y$language)[1],],
     type="l", 
     xlim=c(0, 800),
     ylim=c(0,400),
     ylab="bugged commits")
for (i in 2:numLanguages) {
  lines(pr_mean~commits,
     data=newY[newY$language==levels(Y$language)[i],])
}

3.6.8. Add uncertainty in labels

if (UNCERTAINTY) {
    fp <- 0.36
    fn <- 0.11
    # Function to get parameter values
    getParams <- function(Ystar) {
        # Fit NB with standard contrasts
        nbfit <- glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language, 
                           contrasts = list(language = "contr.sum"), data=Ystar)
        s <- summary(nbfit)$coefficients
      
        # Fit the releveled model with standard contrasts, 
        # to get the parameter for Scala
        nbfit_r <- glm.nb(bcommits~lmax_commit_age+ltins+ldevs+lcommits+language_r, 
                         contrasts = list(language_r = "contr.sum"), data=Ystar)
        s_r <- summary(nbfit_r)$coefficients
    
        # Return params, including Scala
        out <- c(s[,1], s_r[6, 1])
        names(out) <- c(dimnames(s)[[1]][1:5], levels(Ystar$language))
        out
    }
    numBootstrapIterations = 10000
    # Perform sampling
    set.seed(1)
    paramNum <- numLanguages + 5
    paramsLang <- matrix(rep(NA,paramNum*numBootstrapIterations), nrow=paramNum)
    for (i in 1:numBootstrapIterations) {
        # Sample rows
        Ystar <- Y[sample(1:nrow(Y), replace = T),]
      
        # Adjust for tp and fp:
        # Reduce bcommits by prob fp
        # Increase non-bugged commits by prob fn
        tmp <- rbinom(n=nrow(Ystar), size=Ystar$bcommits, prob=1-fp) +
          rbinom(n=nrow(Ystar), size=(Ystar$commits-Ystar$bcommits), prob=fn)
        Ystar$bcommits <- tmp
      
        # Get parameters
        paramsLang[,i] <- getParams(Ystar)
    }
}

To determine whether the bootstrapped values are statistically significant, we check whether the x and 1-x quantiles have the same sign. If they do, the result is significant at that level; if they do not, it is not. As with p-values, we can compare using different quantiles: the conservative choice is 0.01 divided by the number of languages tested, while the less conservative option is simply 0.01.
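A minimal sketch of this check on a single synthetic bootstrap sample (hypothetical draws; 16 stands in for the number of languages):

set.seed(1)
boot <- rnorm(10000, mean = 0.05, sd = 0.03)   # pretend bootstrap draws of one coefficient
q <- quantile(boot, probs = c(0.01, 1 - 0.01, 0.01 / 16, 1 - 0.01 / 16), na.rm = TRUE)
sig     <- sign(q[[1]]) == sign(q[[2]])        # significant at the 0.01 level
sigCons <- sign(q[[3]]) == sign(q[[4]])        # significant at the conservative 0.01 / numLanguages level
c(sig = sig, sigCons = sigCons)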

if (UNCERTAINTY) {
    paramsRowNames = getModelRowNames(nbfit, Y$language)
    result = data.frame(
        row.names = paramsRowNames,
        coef = rep(NA, length(paramsRowNames)),
        se = rep(NA, length(paramsRowNames)),
        sig = rep(NA, length(paramsRowNames)),
        sigCons = rep(NA, length(paramsRowNames))
    )
    
    
    par(mfrow=c(2,4))
    quant = 0.01
    quantCons = 0.01 / numLanguages
    for ( i in 1:length(paramsRowNames)) {
        result$coef[[i]] = round(mean(paramsLang[i,]), digits = 2)
        result$se[[i]] = round(sd(paramsLang[i,]), digits = 2)
        qsigns = sign(quantile(paramsLang[i,], probs = c(quant, 1-quant, quantCons, 1-quantCons), na.rm = T))
        result$sig[[i]] = qsigns[[1]] == qsigns[[2]]
        result$sigCons[[i]] = qsigns[[3]] == qsigns[[4]]
        hist(paramsLang[i,],
             xlab=paramsRowNames[i], 
             main=paste("mean =", round(mean(paramsLang[i,]), digits=2))
        )
        abline(v=0, col="red", lwd=2)
    }
}
if (UNCERTAINTY) {
    result
}

Finally, let's juxtapose this to the baseline as usual:

if (UNCERTAINTY) {
    juxtBootstrap = merge(result, baselineFSE_RQ1(), by = 0, all = T, sort = F)
    juxtBootstrap$ok = checkSignificance(juxtBootstrap, "FSE_pv", "sigCons")
    juxtBootstrap
}

3.7. More graphs: Bug rate over time

filter_proj <- function(df,proj,lang) {
   data %>% 
    ungroup() %>%
    filter(project == proj, language == lang) %>% 
    dplyr::select(commit_age, isbug) %>%
    arrange(commit_age) -> bt
  bt %>% 
    group_by(commit_age,isbug) %>% 
    summarize(n = n()) %>% 
    spread(key = "isbug",value = "n") -> bt2
  bt2 %>% 
    mutate(br = round(`0`/(`0`+`1`),2), 
           month = as.integer(commit_age/30)) -> bt3
  bt3  %>% 
    na.omit() %>% 
    group_by(month) %>% 
    summarize(n = n(), brs = sum(br), brm = brs/n) %>%
    mutate(name= paste0(proj," ",lang)) -> prj
  rbind(df, prj)
}

filter_proj(NULL, "linux","C") %>%
filter_proj("mono","C#") %>%
filter_proj("homebrew","Ruby") %>%
filter_proj("WordPress","Php") %>%
filter_proj("salt","Python") %>%
filter_proj("mythtv","C++") %>%
filter_proj("jenkins","Java") %>%
filter_proj("akka","Scala") %>%
filter_proj("rabbitmq-server","Erlang") %>%
filter_proj("brackets","Javascript") %>%
filter_proj("yi","Haskell") %>%
filter_proj("overtone","Clojure")  ->prj

prj %>% ggplot( aes(x = month, y = brm)) +
        geom_point(size=.4) +
        theme_light() +
        geom_smooth(method='lm',size=.5, formula = y ~ x) +
        facet_wrap( ~ name,ncol=3) + labs(x="",y="") +
        xlab("Project lifetime (months)") + ylab("Percent bug-labeled commits") +
        theme(strip.text.x = element_text(size = 14), text = element_text(size=13)) +
        theme(text = element_text(size=20))

        
ggsave(paste(WORKING_DIR, "/Figures/bugspermonth.pdf", sep = ""), width = 8, height = 8, dpi = 100)

3.8. Commits for top-5 projects by language

data %>% group_by(language, project) %>% summarize(n = n()) %>% arrange(desc(n)) %>% arrange(desc(language)) %>% top_n(5, wt = n) -> projsize_ordered2
projsize_ordered2

projsize_ordered2 %>% ggplot(aes(x = reorder(factor(project), -n), y = n)) + 
  geom_bar(stat="identity") + 
  facet_grid(. ~ language, scales = "free") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Project (by language)", y = "Number of Commits") +
  scale_y_continuous(labels = scales::comma)

3.9. Authors

everything %>% group_by(author, sha) %>% summarize(n=n()) %>% group_by(author) %>% summarise(commits = n()) -> auth
auth %>% arrange(desc(commits))
tot <- sum(auth$commits)
check( tot == 1485049)
top <- sort(auth$commits, dec=T)[1:453]
sum(top)/tot ## 46 %

everything %>% group_by(author, project) %>% summarize(n=n()) %>% 
  group_by(author) %>% summarise(proj = n())   %>% arrange(desc(proj)) -> authproj

summary(authproj)  ## mean 1.2

3.10. Writing the results

Finally, let's write the results of our analyses into CSV files so that they can be picked up and analyzed later:

weighted = data.frame(
    coef = juxtWeighted$coef,
    se = juxtWeighted$se,
    pVal = round(juxtWeighted$pVal,3),
    pVal_fdr = juxtWeighted$pVal_fdr,
    pVal_bonf = juxtWeighted$pVal_bonf,
    row.names = juxtWeighted$Row.names
)
write.csv(weighted, paste0(WORKING_DIR, "/Data/languages_weighed.csv"))
zeroSum = data.frame(
    coef = juxtZeroSum$coef,
    se = juxtZeroSum$se,
    pVal = round(juxtZeroSum$pVal,3),
    pVal_fdr = juxtZeroSum$pVal_fdr,
    pVal_bonf = juxtZeroSum$pVal_bonf,
    row.names = juxtZeroSum$Row.names
)
write.csv(zeroSum, paste0(WORKING_DIR, "/Data/languages_zeroSum.csv"))
# only store bootstrap if we have actually created it
if (UNCERTAINTY) {
    bootstrap = data.frame(
        coef = juxtBootstrap$coef,
        se = juxtBootstrap$se,
        sig = juxtBootstrap$sig,
        sigCons = juxtBootstrap$sigCons,
        row.names = juxtBootstrap$Row.names
    )
    write.csv(bootstrap, paste0(WORKING_DIR, "/Data/languages_bootstrap.csv"))
}
remove(WORKING_DIR)

4. Analysis of the permutations of re-analysis flags (permutations.Rmd)

Rscript --vanilla generate_permutations.R

When the permutations have been generated, we must analyze them. Clear the flags and set the working directory to the permutations root:

source("implementation.R")

flags = c(
  REMOVE_DUPLICATES = 0, 
  REMOVE_TYPESCRIPT = 0,
  REMOVE_V8 = 0,
  USE_AUTHORS_INSTEAD_COMMITTERS = 0,
  UNCERTAINTY = 0
)

WORKING_DIR = "./artifact/permutations"

First, get the permutation results calculated previously:

  • est = value of the estimate

  • se = standard error

  • pv = NA if the result was not significant, otherwise the p-value threshold

For the bootstrap, we get est and se, plus sig and sigCons, which are boolean vectors telling us whether the results were significant.

loadPermutation = function(b, name, permutation) {
    lWeighted = read.csv(paste0(permutation, "/Data/languages_weighed.csv"))
    rownames(lWeighted) = lWeighted$X
    lWeighted = lWeighted %>% dplyr::select(-X)
    lZeroSum = read.csv(paste0(permutation, "/Data/languages_zeroSum.csv"))
    rownames(lZeroSum) = lZeroSum$X
    lZeroSum = lZeroSum %>% dplyr::select(-X)
    bootstrapFilename = paste0(permutation, "/Data/languages_bootstrap.csv")
    if (file.exists(bootstrapFilename)) {
        lBootstrap = read.csv(bootstrapFilename)
        rownames(lBootstrap) = lBootstrap$X
        lBootstrap = lBootstrap %>% dplyr::select(-X)
    } else {
        lBootstrap = NULL
    }
    # and create the result list
    list(
        name = name,
        weighted = lWeighted,
        zeroSum = lZeroSum,
        bootstrap = lBootstrap
    )
}

Let's now look at all permutations and load them:

loadAllPermutations = function(baseDir, b = baseline) {
    dirs = list.dirs(baseDir, recursive = F)
    result = list()
    for (d in dirs) {
        pName = strsplit(d, "/")
        pName = pName[[1]][[length(pName[[1]])]]
        if (nchar(pName) != length(flags))
            next()
        tryCatch({
            result[[pName]] = loadPermutation(b, pName, d)
        }, error = function(e) {
            print(e)
            cat("Permutation ", pName, "FAIL\n")
        })
    }
    cat("Total valid permutations: ", length(result))
    result
}
permutations = loadAllPermutations(WORKING_DIR)

Let's now create the skeleton for the RQ1 table in the paper:

result = baselineFSE_RQ1()
result = mergeDataFrames(result, baselineCACM_RQ1())
result = result %>% dplyr::select(FSE_coef, FSE_pv, CACM_coef, CACM_pv)

repetition = permutations[["00000"]]$weighted %>% dplyr::select(repetition_coef = coef, repetition_pv = pVal)
result = mergeDataFrames(result, repetition)

cleaned = permutations[["11111"]]$weighted %>% dplyr::select(clean_coef = coef, clean_pv = pVal, adjusted_fdr = pVal_fdr, adjusted_bonf = pVal_bonf)
result = mergeDataFrames(result, cleaned)

zeroSum = permutations[["11111"]]$zeroSum %>% dplyr::select(zeroSum_coef = coef, zeroSum_pv = pVal_bonf)
result = mergeDataFrames(result, zeroSum)

bootstrap = permutations[["11111"]]$bootstrap %>% dplyr::select(bootstrap_coef = coef, bootstrap_sig = sigCons)
result = mergeDataFrames(result, bootstrap)

result

And output it to the rq1 table:

output_RQ1_table_repetition(result)
output_RQ1_table_reanalysis(result)
remove(WORKING_DIR)

5. Missing Commits (missing_commits.Rmd)

This separate notebook deals with our analysis of commits not accounted for in the artifact obtained from the original authors.

# load the file containing the actual implementation details
knitr::opts_chunk$set(echo = FALSE)
source("implementation.R")
initializeEnvironment("./artifact/missing-commits")

Load the data processed by the repetition and summarize them.

data = read.csv("./artifact/repetition/Data/newSha.csv")
data$combined = data$combinedOriginal # does not matter
data$devs = data$committer # does not matter
data = summarizeByLanguage(data)

First, we download the projects. This gets us the projects' metadata, all commits, and all unique files changed in the commits. Out of 728 projects, we downloaded 618 (causes: network failures during download, projects going private, or projects being deleted) and analyzed 513 (node.js, which we used to download the projects, segfaulted on several of them). The commits reported for the study were then analyzed; for each project, we record the list of commits used in the study.

Total records: 1578165. Total projects: 729. Multi-commits: 46526. Unique commits: 1531639.

Multi-commits are commits that have multiple languages.

The downloaded projects were then matched to the study's projects. Since the study records project names without repository owners, matches could be ambiguous. We end up with 423 matched projects. One name, dogecoin, matched two different projects. For each project, we looked at all commits and classified them as:

  • valid (i.e. present in the study and in the project)

  • irrelevant (i.e. present in the project, but not relevant to the study since they do not change any file in the studied languages)

  • missing (present in the project, but not in the study, while changing at least one file in a studied language)

  • invalid (present in the study, not present in the project)

This data has been obtained by running the commits-verifier tool; a small sketch of the classification logic follows.
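The classification itself is just set arithmetic; the following sketch uses hypothetical inputs (study_commits and project_commits, not the actual commits-verifier implementation) to show the idea:

# study_commits: one row per commit used in the study
# project_commits: one row per commit found in the repository, flagged if it touches a studied language
study_commits   <- data.frame(sha = c("a", "b", "c"))
project_commits <- data.frame(sha = c("a", "b", "d", "e"),
                              touches_studied_language = c(TRUE, TRUE, TRUE, FALSE))
valid      <- intersect(study_commits$sha, project_commits$sha)
invalid    <- setdiff(study_commits$sha, project_commits$sha)
missing    <- setdiff(project_commits$sha[project_commits$touches_studied_language], study_commits$sha)
irrelevant <- setdiff(project_commits$sha[!project_commits$touches_studied_language], study_commits$sha)
sapply(list(valid = valid, irrelevant = irrelevant, missing = missing, invalid = invalid), length)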


5.1. Results on missing commits


First some verification and data cleanup -- if we matched a project wrongly, then we would see all invalid commits.

mc = read.csv("./input_data/missing-commits.csv")
mc %>% filter(invalid > 0) %>% arrange(desc(invalid))

There are four projects with invalid counts larger than 1. In the case of "framework" and "hw", the numbers are large enough to be worth setting them aside. The case of "generator-mobile", where we have only invalid commits, suggests a badly matched project. We ignore it. Lastly, "DefinitelyTyped" has 8 mismatched commits -- but since it is one of the TypeScript projects that contains no code, we can safely ignore it.

mc %>% filter(invalid <= 1) -> missing_commits

valid_sum <- sum(missing_commits$valid)
check(valid_sum == 426845)
missing_sum <- sum(missing_commits$missing)
check(missing_sum == 106411)
ratio_missing <- round(missing_sum/(missing_sum+valid_sum)*100,2)
check(ratio_missing == 19.95)
out("MissingCommitsThousands", round(missing_sum/1000,0))
out("MissingCommitsRatio", ratio_missing)

In total, we have seen 426k commits in the projects we have cross-checked. There were 106k missing commits (19.95%).

  • The number of commits per project is skewed towards very few valid commits

  • Invalid commits are in almost every project, and there are projects that are almost entirely missing

Projects with the highest ratio of missing commits:

missing_commits %>% mutate(ratio = round(missing/(missing+valid)*100,2)) %>% arrange(desc(ratio))
  • V8 is high on the list -- the 12th most incomplete project (around 70% of commits are missing).

ggsave() disabled due to a strange font error —MPD

data %>% group_by(language) %>% summarize(commits = sum(commits)) -> commits_by_lang
commits_by_lang[commits_by_lang$language == "C", 2] <- commits_by_lang[commits_by_lang$language == "C", 2] +
                                                        commits_by_lang[commits_by_lang$language == "C++", 2]
commits_by_lang %>% filter(language != "C++") -> commits_by_lang
commits_by_lang %>% mutate(missing = 0) -> c
c$language <- as.character(c$language)

c[c$language == "C", 3] <- sum(missing_commits$cpp)  #C++ and C together
c[1, 1] <- "C/C++"
c[c$language == "C#",3] <-  sum(missing_commits$cs)
c[c$language == "Objective-C",3] <-  sum(missing_commits$objc)
c[c$language == "Go",3] <-  sum(missing_commits$go)
c[c$language == "Coffeescript",3] <-  sum(missing_commits$coffee)
c[c$language == "Javascript",3] <-  sum(missing_commits$js)
c[c$language == "Ruby",3] <-  sum(missing_commits$ruby)
c[c$language == "Typescript",3] <-  sum(missing_commits$ts)
c[c$language == "Php",3] <-  sum(missing_commits$php)
c[c$language == "Python",3] <-  sum(missing_commits$python)
c[c$language == "Perl",3] <-  sum(missing_commits$perl)
c[c$language == "Clojure",3] <-  sum(missing_commits$clojure)
c[c$language == "Erlang",3] <-  sum(missing_commits$erlang)
c[c$language == "Haskell",3] <-  sum(missing_commits$haskell)
c[c$language == "Scala",3] <-  sum(missing_commits$scala)
c[c$language == "Java",3] <-  sum(missing_commits$java)

c %>% mutate(ratio = round(missing/(commits+missing)*100,0)) %>% arrange(desc(ratio)) %>% as.data.frame -> ratio_missing

ggplot(data = ratio_missing, aes(x = reorder(language, ratio), y = ratio)) + 
    geom_bar(stat="identity") +
    xlab("") + ylab("Percentage missing commits") +
    annotate("text", x = "Perl", y = 20, label = paste(ratio_missing[ratio_missing$language=="Perl",4], "%", sep = ""), color = "white") +
    theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
    coord_cartesian(ylim=c(0, 20)) -> p

#ggsave(paste(WORKING_DIR, "/Figures/ratio_missing.pdf", sep = ""), p, width=5, height=2, units="in", scale=1.5)
print(p)
out("PerlMissingRatio", ratio_missing[ratio_missing$language=="Perl",4])

Perl is the outlier here, then Erlang, Go, PHP, and JavaScript.

remove(WORKING_DIR)

6. Commits Survey (commit_survey.Rmd)

source("implementation.R")
initializeEnvironment("./artifact/commit_survey")

6.1. Get developers' labels into one combined dataframe

library(dplyr)
library("fst")
library("irr")

inputPath = "./input_data/commit_survey"
# read all files in a directory
answers = list.files(inputPath)
data = lapply(answers, function(x) { 
    cat(x, "\n")
    read.csv(paste(inputPath, x, sep = "/"), header = F, sep =",") 
    })
i = 1
while (i <= length(data)) {
    d = data[[i]]
    d$category = as.character(d$V2)
    d$category[is.na(d$category)] = 0
    d$category[d$category != 0] = 1
    data[[i]]$category = as.numeric(d$category)
    i = i + 1
}

# if this is set to T, pulls data from the extra developers we had available for the survey as well - but these finished later, or did not finish at all so we only used their subset. 
if (F) {
    inputPathExtra = "./input_data/commit_survey_extras"
    answersExtra = list.files(inputPathExtra)
    dataExtra = lapply(answersExtra, function(x) { read.csv(paste(inputPathExtra, x, sep = "/"), header = F, sep =",") })
    i = 1
    while (i <= length(dataExtra)) {
        d = dataExtra[[i]]
        d$category = as.character(d$V2)
        d$category[is.na(d$category)] = 0
        d$category[d$category != 0] = 1
        dataExtra[[i]]$category = as.numeric(d$category)
        i = i + 1
    }
    
    # join the single dataframe per file into a single dataframe with all of them
    
    
    data = rbind(data, dataExtra)

}
separateDevs = data
data = do.call("rbind", data)
# now aggregate according to the commit
#data$category = as.character(data$V2)
#data$category[is.na(data$category)] = 0
#data$category[data$category == "0"] = 0
#data$category[data$category != 0] = 1
data$category = as.numeric(data$category)
data$test = rep(1, nrow(data))
sanity_check = aggregate(test ~ V1, data, sum) # we expect the sanity check to be three for all commits
if(!all(sanity_check$test == 3)) cat("Error -- Not all commits have 3 votes\n")
dd = data
data = aggregate(category ~ V1, data, sum)
data$score = data$category / sanity_check$test # divide by the number of votes per commit (3)
#data$score = sapply(data$category, function(x) mean(sapply(x, function (x) if ( x == 0) 0 else 1)))
data$label = ifelse(data$score>=0.5, 1, 0) # This will never == 0.5 because we will have 3 votes
significant <- data[sapply(data$category, length) > 1,]
discord <- significant[(significant$score != 0) & (significant$score != 1), ]

6.2. Compare developers' labels to study's labels

stripGHUrl = function(what) {
    cat(what)
    substr(what, nchar(what) - 39, nchar(what))
}
ourLabelsPath <- "./input_data/petrs_commits"
buggycommits <- read.csv(paste(ourLabelsPath, "buggy_commits.csv", sep="/"), header=F)
buggycommits$V1 <- as.character(buggycommits$V1)
#buggycommits$V1 = sapply(buggycommits$V1, stripGHUrl)
buggycommits$ours = buggycommits$V3
buggycommits$paper = 1
nonbuggycommits <- read.csv(paste(ourLabelsPath, "non_buggy_commits.csv", sep="/"), header=F)
nonbuggycommits$V1 <- as.character(nonbuggycommits$V1)
#nonbuggycommits$V1 = sapply(nonbuggycommits$V1, stripGHUrl)
nonbuggycommits$ours = as.numeric(!nonbuggycommits$V3)
nonbuggycommits$paper = 0
allcommits = rbind(buggycommits, nonbuggycommits)

data$V1 = as.character(data$V1)
allcommits$V2 = as.character(allcommits$V2)
# Calculate disagreement

getPaperLabels <- function() {
    labels = rep(0, nrow(data))
    for (r in 1:nrow(data)) {
        df2row = allcommits[allcommits$V1 == data$V1[[r]],]
        labels[r] = df2row$paper;
    }
    labels    
}

getOurLabels <- function() {
    labels = rep(0, nrow(data))
    for (r in 1:nrow(data)) {
        df2row = allcommits[allcommits$V1 == data$V1[[r]],]
        labels[r] = df2row$ours;
    }
    labels    
}

data$ours = getOurLabels()
data$paper = getPaperLabels()

data$agreePaper = as.numeric(data$label == data$paper)
data$agreeOurs = as.numeric(data$label == data$ours)

#data$disagreeOurs = getDisagreementLabels(data, allcommits)
#data$disagreePaper = getDisagreementLabelsPaper(data, allcommits)

write.csv(data, file=paste0(WORKING_DIR, "/Data/commit_survey_results.csv"))
falsePositives <- nrow(data[data$paper == 1 & data$label == 0,])
falseNegatives <- nrow(data[data$paper == 0 & data$label == 1,])

out("commitsFalsePositives",  paste0(round(falsePositives / nrow(buggycommits) * 100, 1),"\\%"))
out("commitsFalseNegatives", paste0(round(falseNegatives / nrow(nonbuggycommits) * 100, 1), "\\%"))

#cat(paste("Percentage of false positives (Paper vs Devs): ", falsePositives / nrow(buggycommits), "\n"))
#cat(paste("Percentage of false negatives: (Paper vs Devs)", falseNegatives / nrow(nonbuggycommits), "\n\n"))
unanimous_falsePositives <- nrow(data[data$paper == 1 & data$score == 0,])
out("commitsUnanimousFalsePositives", paste0(round(unanimous_falsePositives / falsePositives * 100, 1), "\\%"))
#cat("Percentage of false positives upon which developers unanimously agreed: ", unanimous_falsePositives / falsePositives)
PfalsePositives <- nrow(data[data$paper == 1 & data$ours == 0,])
PfalseNegatives <- nrow(data[data$paper == 0 & data$ours == 1,])

cat(paste("Percentage of false positives (Paper vs Us): ", PfalsePositives / nrow(buggycommits), "\n"))
cat(paste("Percentage of false negatives (Paper vs Us): ", PfalseNegatives / nrow(nonbuggycommits), "\n\n"))
PDfalsePositives <- nrow(data[data$label == 1 & data$ours == 0,])
PDfalseNegatives <- nrow(data[data$label == 0 & data$ours == 1,])

cat(paste("Percentage of false positives (Devs vs Us): ", PDfalsePositives / nrow(data[data$label == 1,]), "\n"))
cat(paste("Percentage of false negatives (Devs vs Us): ", PDfalseNegatives / nrow(data[data$label == 0,]), "\n"))

Let's look at how certain the raters are:

cat(paste("3 votes not a bug:                 ", length(data$score[data$score == 0]), "\n"))
cat(paste("Likely not a bug (1 vote for bug): ", length(data$score[data$score == 1/3]), "\n"))
cat(paste("Likely a bug (2 votes for bug):    ", length(data$score[data$score == 2/3]), "\n"))
cat(paste("3 votes for a bug:                 ", length(data$score[data$score == 1]), "\n"))

Now let's break this down by buggy and non-buggy commits as determined by the paper:

buggy = data[data$paper == 1,]
cat("Paper labelled as buggy\n")
cat(paste("3 votes not a bug:                 ", length(buggy$score[buggy$score == 0]), "\n"))
cat(paste("Likely not a bug (1 vote for bug): ", length(buggy$score[buggy$score == 1/3]), "\n"))
cat(paste("Likely a bug (2 votes for bug):    ", length(buggy$score[buggy$score == 2/3]), "\n"))
cat(paste("3 votes for a bug:                 ", length(buggy$score[buggy$score == 1]), "\n"))
cat("Paper labelled as non-buggy\n")
nonbuggy = data[data$paper == 0,]
cat(paste("3 votes not a bug:                 ", length(nonbuggy$score[nonbuggy$score == 0]), "\n"))
cat(paste("Likely not a bug (1 vote for bug): ", length(nonbuggy$score[nonbuggy$score == 1/3]), "\n"))
cat(paste("Likely a bug (2 votes for bug):    ", length(nonbuggy$score[nonbuggy$score == 2/3]), "\n"))
cat(paste("3 votes for a bug:                 ", length(nonbuggy$score[nonbuggy$score == 1]), "\n"))

And now compared against our own labels:

buggyp = data[data$ours == 1,]
cat("Peta labelled as buggy\n")
cat(paste("3 votes not a bug:                 ", length(buggyp$score[buggyp$score == 0]), "\n"))
cat(paste("Likely not a bug (1 vote for bug): ", length(buggyp$score[buggyp$score == 1/3]), "\n"))
cat(paste("Likely a bug (2 votes for bug):    ", length(buggyp$score[buggyp$score == 2/3]), "\n"))
cat(paste("3 votes for a bug:                 ", length(buggyp$score[buggyp$score == 1]), "\n"))
cat("Peta labelled as non-buggy\n")
nonbuggyp = data[data$ours == 0,]
cat(paste("3 votes not a bug:                 ", length(nonbuggyp$score[nonbuggyp$score == 0]), "\n"))
cat(paste("Likely not a bug (1 vote for bug): ", length(nonbuggyp$score[nonbuggyp$score == 1/3]), "\n"))
cat(paste("Likely a bug (2 votes for bug):    ", length(nonbuggyp$score[nonbuggyp$score == 2/3]), "\n"))
cat(paste("3 votes for a bug:                 ", length(nonbuggyp$score[nonbuggyp$score == 1]), "\n"))

Now look at the developers and calculate their ratio of bugs found:

for (d in separateDevs) {
    cat(sum(as.numeric(d$category)) / length(d$category), "\n")
}

We can also compute, for each developer, how often they are in opposition to the rest:

devOps <- function(index) {
    score = data$score
    names(score) = data$V1
    d = separateDevs[[index]]
    dc = d$category
    names(dc) = d$V1
    bugX = 0
    nonBugX = 0
    resBug = 0
    resNonBug = 0
    for (i in d$V1) {
        s = score[[i]]
        if (dc[[i]] == 0) {
            if (s == 2/3)
                nonBugX = nonBugX + 1
            else if (s == 1/3)
                resNonBug = resNonBug + 1
        } else {
            if (s == 1/3)
                bugX = bugX + 1
            else if (s == 2/3)
                resBug = resBug + 1
        }
    }
    c(bugX, nonBugX, resBug, resNonBug) / length(d$V1)
}
for (i in 1:10)
    print(devOps(i))

6.3. Calculating Cohen's Kappa

# the number of people in the survey
numDevs = length(separateDevs)
# now calculate a dataframe where we have aggregated commits and for each commit we have the category assigned by all of the developers (NA if the developer did not see the commit)
ratings = NULL
tmp = list()
for (n in (1:numDevs)) {
    x = cbind(as.character(separateDevs[[n]]$V1), separateDevs[[n]]$category)
    colnames(x) = c("url", paste0("x",n))
    if (n == 1) {
        ratings = x
    } else {
        ratings = merge(ratings, x, by = c("url"), all = T)
    }
}
# create matrix of only the ratings w/o the commit urls
#ratings_matrix = as.matrix(subset(ratings, select = - c(url)))

Now that we have the matrix, we should calculate Cohen's Kappa to determine the inter-rater agreement. Because Cohen's Kappa is only defined for 2 raters, we use Light's Kappa, which is the average over all pairwise Cohen's Kappas. The fact that different commits were reviewed by different reviewers complicates this further. We therefore create a submatrix for each pair of developers, remove rows with NAs, calculate the kappas, and then report the results and their distribution:
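As a quick illustration of the building block (made-up ratings, not survey data), irr::kappa2 on a two-rater matrix works like this:

library(irr)  # for kappa2
toyRatings <- cbind(rater1 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1),
                    rater2 = c(1, 0, 1, 0, 0, 0, 1, 0, 1, 1))
k <- kappa2(toyRatings)
k$value    # Cohen's kappa for this pair of raters
k$p.value  # p-value of the test that agreement exceeds chance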

kvalue = c()
kpvalue = c()
ksize = c()
kfirst = c()
ksecond = c()
n = 1
for (first in (1:(numDevs - 1))) {
    for (second in ((first + 1):numDevs)) {
        tmp = cbind(ratings[colnames(ratings)[[first + 1]]], ratings[colnames(ratings)[[second + 1]]])
        tmp = na.omit(tmp)
        if (nrow(tmp) != 0) {
            k = kappa2(tmp)
            kvalue[[n]] = k$value
            kpvalue[[n]] = k$p.value
            ksize[[n]] = nrow(tmp)
            kfirst[[n]] = first
            ksecond[[n]] = second
            n = n + 1
        } 
    }
}
summary(kvalue)
summary(kpvalue)
summary(ksize)
out("commitsKappaMedian", round(median(kvalue), 3))
out("commitsKappaMin", round(min(kvalue), 3))
out("commitsKappaMax", round(max(kvalue), 3))
out("commitsKappaPValMedian", round(median(kpvalue), 3))
out("commitsKappaPValThird", round(quantile(kpvalue, 0.75), 3))

Ok, all of our kappas are positive. Most of the p-values are also very small.

boxplot(kvalue)
boxplot(kpvalue)
sort(kpvalue)

7. Threats To Validity (threats_to_validity.Rmd)

# load the file containing the actual implementation details
knitr::opts_chunk$set(echo = FALSE)
source("implementation.R")
initializeEnvironment("./artifact/threats-to-validity")

7.1. Tests in the files

data = read.csv("./artifact/repetition/Data/newSha.csv")
files = loadEverything()

Let's see how many of the files are tests. First, restrict the files dataset to only the commits that are in the data we analyze:

files = files %>% filter(sha %in% unique(data$sha))
files %>% filter(str_detect(file_name,"Test|test")) -> tests
out("ratioTestsFiles",round(nrow(tests)/nrow(files)*100,1))
files %>% filter(!str_detect(file_name,"Test|test")) -> notests
out("testFilesCommitted", nrow(tests))
out("ratioTestFilesCommittedOverAll", round(nrow(tests)/nrow(files)*100,1))