Characterization of source code defects by data mining conducted on GitHub

Péter Gyimesi, Gábor Gyimesi, Zoltán Tóth, R. Ferenc

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

Coding errors are unavoidable in software systems because of frequent source changes, tight deadlines, and inaccurate specifications, so tools that help find these errors are important. One way to support bug prediction is to analyze the characteristics of previous errors and use them to identify unknown ones. This paper aims to characterize known coding errors. Source code hosting services such as GitHub are rapidly growing in popularity. They provide a variety of services, the most important of which are version control and bug tracking. Version control systems store all versions of the source code, and bug tracking systems provide a unified interface for reporting errors. Bug reports can be used to identify the faulty source code parts and those that were previously fixed, so the bugs can be characterized by static source code metrics or by other quantitatively measured properties of the gathered data. We chose GitHub as the basis for data collection and selected 13 Java projects for analysis. The result is a database that characterizes the bugs of the examined projects and can be used, among other things, to improve the automatic detection of software defects.
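The idea of using bug reports to locate the source changes that fixed them can be illustrated with a minimal sketch. It is not the authors' tooling: it assumes that fixing commits mention the issue number with one of GitHub's closing keywords (for example "Fixes #12"), and the function name `link_commits_to_bugs`, the `(sha, message)` input shape, and the sample data are all hypothetical.

```python
import re
from collections import defaultdict

# GitHub closing keywords followed by an issue number, e.g. "Fixes #42".
# Assumption: fixing commits reference their bug report this way.
ISSUE_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.IGNORECASE
)


def link_commits_to_bugs(commits, bug_issue_ids):
    """Map each known bug issue id to the commit hashes that reference it.

    commits: iterable of (sha, message) pairs (hypothetical input shape).
    bug_issue_ids: set of issue numbers that the tracker labels as bugs.
    """
    links = defaultdict(list)
    for sha, message in commits:
        for match in ISSUE_REF.finditer(message):
            issue_id = int(match.group(1))
            # Ignore references to issues that are not bug reports.
            if issue_id in bug_issue_ids:
                links[issue_id].append(sha)
    return dict(links)


# Invented sample data, for illustration only.
commits = [
    ("a1b2c3", "Fixes #12: null check in parser"),
    ("d4e5f6", "Refactor build scripts"),
    ("0a9b8c", "closes #7 and resolves #12"),
]
print(link_commits_to_bugs(commits, {12}))
```

Once each bug is linked to its fixing commits, the files those commits touch could be measured with static source code metrics before and after the fix, which is the kind of characterization the abstract describes.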

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Verlag
Pages: 47-62
Number of pages: 16
Volume: 9159
ISBN (Print): 9783319214122
DOI: https://doi.org/10.1007/978-3-319-21413-9_4
Publication status: Published - 2015
Event: 15th International Conference on Computational Science and Its Applications, ICCSA 2015 - Banff, Canada
Duration: Jun 22 2015 to Jun 25 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9159
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15th International Conference on Computational Science and Its Applications, ICCSA 2015
Country: Canada
City: Banff
Period: 6/22/15 to 6/25/15

Keywords

  • Bug database
  • Data mining
  • GitHub

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Gyimesi, P., Gyimesi, G., Tóth, Z., & Ferenc, R. (2015). Characterization of source code defects by data mining conducted on GitHub. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9159, pp. 47-62). Springer Verlag. https://doi.org/10.1007/978-3-319-21413-9_4

@inproceedings{9f5b8064298240ff927456c7e1365a80,
title = "Characterization of source code defects by data mining conducted on GitHub",
keywords = "Bug database, Data mining, GitHub",
author = "P{\'e}ter Gyimesi and G{\'a}bor Gyimesi and Zolt{\'a}n T{\'o}th and R. Ferenc",
year = "2015",
doi = "10.1007/978-3-319-21413-9_4",
language = "English",
isbn = "9783319214122",
volume = "9159",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "47--62",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN
T1 - Characterization of source code defects by data mining conducted on GitHub
AU - Gyimesi, Péter
AU - Gyimesi, Gábor
AU - Tóth, Zoltán
AU - Ferenc, R.
PY - 2015
Y1 - 2015
KW - Bug database
KW - Data mining
KW - GitHub
UR - http://www.scopus.com/inward/record.url?scp=84949035043&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84949035043&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-21413-9_4
DO - 10.1007/978-3-319-21413-9_4
M3 - Conference contribution
AN - SCOPUS:84949035043
SN - 9783319214122
VL - 9159
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 47
EP - 62
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
PB - Springer Verlag
ER -