Modeling Software Changes

Todd L. Graves
Alan Karr
National Institute of Statistical Sciences, USA
Audris Mockus
Bell Laboratories, USA


Industrial-size software systems are expensive to maintain and improve. To improve the quality of the software and to decrease maintenance costs, it is essential to understand the maintenance process. Because the data associated with software development are highly structured, sequential, and discrete, it is difficult to find essential relationships and to make predictions. This paper proposes a collaborative tool with which several experts in different domains can explore and interpret the data from a large software development process.

Keywords: Software development, version control, developer profiles.

1. Introduction

Controlling maintenance costs of large software systems is a key industrial problem. Software development activity is documented by source code, version control databases, bug reports, requirements and design documents. Such data contain few numbers and many hierarchies, categories, and textual records. While these data are fundamental to software research, methods for their analysis are largely absent from the statistical literature.

In this paper we propose two modeling tools, an interactive table and a profile view combined with textual annotation, that allow detection of trends and outliers in software change data. The tools are Web-based, which streamlines access to both the tools and the data, and they have simple interfaces so that software engineers with no statistical expertise can use them.

We analyze a version control database of a large real-time software system to answer basic methodological questions, such as how to characterize changes to the source code, how to describe developers in terms of those changes, and what models are appropriate for software change.

Particular directions of the investigation included determining time trends of the changes and obtaining work profiles of individual developers. Knowing when changes happen (time of day, closeness to deadlines) can help to allocate computing resources better and to recognize abnormal or extreme change patterns. The assessment of developer profiles is important since it is established in the empirical software literature (see CURTIS (1981)) that individual developers can differ by an order of magnitude in certain bug-fixing tasks. Hence, it is of interest to know what characteristics make some developers much more efficient than others.

2. Version Control Database

We use SCCS (see ROCHKIND (1975)) and ECMS (Extended Change Management System, similar to SABLIME) version control and maintenance records from a multi-million line real-time software system that was developed over more than a decade. The code is dynamic and constantly changing: modifications are submitted daily by the thousands of engineers involved in the project. Our data contain the complete change history, including every modification made during the project, as well as many related statistics.

The source code for this system is partitioned into subsystems; each subsystem is partitioned into modules, and modules are partitioned into files. A change (delta) consists of a set of lines added to and deleted from one file by one developer. Each delta lists the date and time when the change was submitted to the version control system, as well as the numbers of lines added, deleted, and unmodified by that change. Sets of related changes are grouped into Maintenance Requests (MRs).

We selected a representative subsystem for our analysis. The subsystem contains approximately 2M lines, 3000 files, and 100 modules. It has been changed 132405 times over the course of the last decade. There were 27150 MRs in the dataset, each having on average 4 deltas. Furthermore, the MRs were classified according to the type of maintenance activity into three classes: fault fixes, new feature development, and code improvement.
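The delta and MR records described in this section can be pictured with a small data model. The following Python sketch is only an illustration: the field names, the `group_by_mr` and `mean_deltas_per_mr` helpers, and the sample records are all hypothetical and are not taken from SCCS, ECMS, or the analyzed dataset.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Delta:
    """One change to one file by one developer (field names are hypothetical)."""
    mr_id: str           # Maintenance Request this delta belongs to
    developer: str
    subsystem: str
    module: str
    file: str
    submitted: datetime  # when the delta reached the version control system
    added: int           # lines added by the change
    deleted: int         # lines deleted by the change
    unmodified: int      # lines left untouched by the change


def group_by_mr(deltas):
    """Group a list of deltas by their Maintenance Request identifier."""
    mrs = defaultdict(list)
    for d in deltas:
        mrs[d.mr_id].append(d)
    return dict(mrs)


def mean_deltas_per_mr(deltas):
    """Average number of deltas per MR (0.0 for an empty list)."""
    mrs = group_by_mr(deltas)
    return len(deltas) / len(mrs) if mrs else 0.0


# Tiny made-up sample, not data from the analyzed system.
sample = [
    Delta("MR1", "dev_a", "sub", "mod1", "a.c", datetime(1995, 3, 1, 14, 5), 10, 2, 300),
    Delta("MR1", "dev_a", "sub", "mod1", "b.c", datetime(1995, 3, 1, 14, 9), 4, 0, 120),
    Delta("MR2", "dev_b", "sub", "mod2", "c.c", datetime(1995, 3, 2, 23, 40), 80, 5, 900),
]
print(mean_deltas_per_mr(sample))  # 1.5
```

Keeping the subsystem, module, and file on every delta makes it straightforward to aggregate along either the subsystem-module-file hierarchy or the delta-MR hierarchy.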

3. Analysis and Tools

The goal of the analysis was to describe and interpret time trends of the various quantities and to investigate developer profiles. To accomplish these tasks, the data were represented as interactive tables and histograms (see EICK ET AL (1997)) accessible through a standard Web browser. This architecture was used to share the results with other researchers working on a related project as well as with the developers and their managers responsible for the maintenance of the analyzed software system. It also allowed developers and experts in software and statistics to perform analyses individually and then provide feedback.


The time trends of the change data are presented in a number of hierarchies and time scales. Hierarchies include subsystem-module-file, developer-manager-organization, and line-delta-MR. Time scales include hour of the day, day of the week, month of the year, and yearly data. Figure 1 shows hourly time trends. Since all the developers worked in a single time zone, there is a clear hourly trend of activity with one peak before lunch and one after. There is an obvious decrease in activity at night and during lunch time. The column with averages of added lines shows that the largest changes are made just before midnight.

Figure 1: This table shows numbers of changes for different hours of the day. The first two columns show hours since midnight and numbers of changes, and the third through fifth columns contain average (over all changes for that hour) numbers of added, deleted, and unchanged lines.
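A table with the columns of Figure 1 can be computed by a simple aggregation over the delta records. The Python sketch below is a minimal illustration under assumed inputs: the `hourly_table` function and the flattened `(hour, added, deleted, unchanged)` tuples are hypothetical, and the sample values are made up.

```python
from collections import defaultdict


def hourly_table(changes):
    """Return rows (hour, n_changes, mean_added, mean_deleted, mean_unchanged).

    `changes` is an iterable of (hour, added, deleted, unchanged) tuples,
    a hypothetical flattened form of the delta records.
    """
    sums = defaultdict(lambda: [0, 0, 0, 0])  # count, added, deleted, unchanged
    for hour, added, deleted, unchanged in changes:
        acc = sums[hour]
        acc[0] += 1
        acc[1] += added
        acc[2] += deleted
        acc[3] += unchanged

    rows = []
    for hour in sorted(sums):
        n, a, d, u = sums[hour]
        rows.append((hour, n, a / n, d / n, u / n))
    return rows


# Made-up sample: two changes at 14:00 and one at 23:00.
changes = [(14, 10, 2, 300), (14, 4, 0, 120), (23, 80, 5, 900)]
rows = hourly_table(changes)
for row in rows:
    print(row)

# Sorting by a column, here by mean added lines, largest first.
by_mean_added = sorted(rows, key=lambda r: r[2], reverse=True)
print(by_mean_added[0][0])  # hour with the largest average change: 23
```

Hours with no changes simply do not appear in the result; a display that needs all 24 rows would have to fill them in with zero counts.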

The interactive tables show numeric and textual data with the variable names across the top and values for each observation in row-ordered cells. Three representations of data values are possible, depending on available screen space: as textual numeric digits, as thin bars with lengths proportional to the values, and as a combination of these two, with the digits overplotted on the bars. The rows of the table can be sorted to show correlations among the variables. The scrollbar on the left side of the view controls the available screen space and scrolls the table.

When are developers most productive?

There were 509 developers over a 12-year period. The data for each developer represent the numbers of changes submitted per hour. To investigate developer profiles we used ProfileView, an interactive visualization of multiple developers (see Figure 2). The figure shows each developer as a small icon. The icon can be described as a clock with 24 hours. Zero hour is at the top and the other hours continue clockwise at 15-degree intervals. For each hour we plot a point whose distance from the center is proportional to the value of the variable for that hour. The points are then connected by lines, forming a starlike shape.

Figure 2: A display of the 120 developers most similar (in their change profile) to the developer at the bottom left. The icons represent the trace that a 24 hour clock would draw with the end of its hour hand if the length of that pointer represents numbers of changes made during the particular hour.
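The geometry of a single icon follows directly from the description of the 24-hour clock: hour h lies at an angle of 15h degrees clockwise from the top, at a distance proportional to the number of changes in that hour. The Python sketch below computes the vertices of one star under that reading. The function name `star_vertices`, the `scale` parameter, and the sample counts are hypothetical; this is not the ProfileView implementation.

```python
import math


def star_vertices(counts, scale=1.0):
    """Return 24 (x, y) vertices of a developer's star icon.

    counts[h] is the number of changes submitted during hour h.
    Hour 0 points straight up and later hours follow clockwise,
    15 degrees apart, with the y axis pointing up.
    """
    if len(counts) != 24:
        raise ValueError("expected one count per hour of the day")
    vertices = []
    for hour, value in enumerate(counts):
        angle = math.radians(15 * hour)  # measured clockwise from the top
        r = scale * value
        vertices.append((r * math.sin(angle), r * math.cos(angle)))
    return vertices


# Made-up profile: 10 changes at midnight, 4 at 06:00, none elsewhere.
counts = [0] * 24
counts[0] = 10
counts[6] = 4
vertices = star_vertices(counts)
print([(round(x, 6), round(y, 6)) for x, y in (vertices[0], vertices[6])])
```

Connecting consecutive vertices, and the last back to the first, gives the starlike outline. A burst of changes just before midnight shows up as a long spike near the top of the icon.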

Figure 2 shows a number of important features. First, there is a great difference in the total numbers of changes each developer made over the considered period. This is reflected in the different sizes of the star icons representing each developer. Another striking feature is that different developers have distinct working patterns: some submit changes only during normal business hours, while others have more flexible schedules. Looking more closely at a number of productive developers, it becomes obvious that in addition to the normal activity there is a peak of activity just before midnight (it appears as a tail at the top of the star, or as a panhandle). Those changes were probably made under deadline pressure. Further investigation revealed that after-hours changes make up a much higher proportion of changes on weekends than on weekdays. This supports the hypothesis that late-night changes are made to meet closing deadlines.
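The weekday versus weekend comparison can be expressed as a pair of proportions. The Python sketch below shows one way to compute them from submission timestamps. The paper does not define "after hours", so the 8:00 to 18:00 business window used here is an assumption, as are the function name `after_hours_share` and the made-up sample timestamps.

```python
from datetime import datetime


def after_hours_share(timestamps, start_hour=8, end_hour=18):
    """Return (weekday_share, weekend_share) of after-hours changes.

    A change counts as after hours when its hour falls outside
    [start_hour, end_hour); this window is an assumption, not the
    paper's definition. A share is None when there are no changes
    of that kind.
    """
    counts = {"weekday": [0, 0], "weekend": [0, 0]}  # [after_hours, total]
    for ts in timestamps:
        kind = "weekend" if ts.weekday() >= 5 else "weekday"  # 5 = Sat, 6 = Sun
        counts[kind][1] += 1
        if not (start_hour <= ts.hour < end_hour):
            counts[kind][0] += 1

    def share(kind):
        after, total = counts[kind]
        return after / total if total else None

    return share("weekday"), share("weekend")


# Made-up sample: four weekday changes and two weekend changes.
timestamps = [
    datetime(1995, 3, 1, 10, 0),   # Wednesday, business hours
    datetime(1995, 3, 1, 23, 30),  # Wednesday, after hours
    datetime(1995, 3, 2, 9, 15),   # Thursday, business hours
    datetime(1995, 3, 2, 14, 0),   # Thursday, business hours
    datetime(1995, 3, 4, 22, 0),   # Saturday, after hours
    datetime(1995, 3, 5, 11, 0),   # Sunday, business hours
]
print(after_hours_share(timestamps))  # (0.25, 0.5)
```

The result depends on the chosen window, so any comparison of this kind should state the business hours it assumes.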

4. Discussion

The investigation presents tools and methods that allow several experts to represent and analyze software change data. Such an approach is essential in problems related to the maintenance of large software systems, where input from several parties is necessary for proper interpretation of the results. As a result of the investigation we were able to answer a set of specific questions about time trends and about developer change profiles in the industrial-size dataset.

The combination of a written document with interactive analysis tools available through ubiquitous Internet browser interfaces proved essential in obtaining significant results. Part of the success came from the ability to get immediate and critical feedback about the analysis directions from other group members. Although the interactive table is a simplistic modeling tool, it proved to be very effective in describing and analyzing large two-dimensional relations (time vs. some other variable). Its simplicity is a significant advantage since the table can be used effectively without special training. The complex structure of the software data also required analysis of three-dimensional relationships (developer vs. time vs. another variable, module vs. time vs. another variable, etc.), which was done with the ProfileView.

The daily work patterns suggest that although most of the changes are made during normal business hours, the largest changes are made late at night, probably by developers working under time pressure, which potentially compromises the quality of the code. The profiles of individual developers indicate varying work patterns, with some workers adhering to the business day cycle and some having a more flexible schedule. Some developers are equally likely to submit changes during any hour of the day or night. Knowledge of the distribution of developer activity over the day helps to plan for adequate computing resources. Study of the profiles revealed that a very large number of the most productive developers exhibit an unexpected sharp increase in activity shortly before midnight. This indicates that a substantial number of changes might be submitted under time pressure.


References

Curtis, B. (1981).
Substantiating programmer variability. Proceedings of the IEEE, 69(7):846.
Eick, S.G., Mockus, A., Graves, T.L., and Karr, A.F. (1997).
WEB-Based Text Visualization. Proceedings of SoftStat '97, 3-10.
Rao, R. and Card, S.K. (1994).
Table lens: Merging graphical and symbolic representations in an interactive focus plus context visualization for tabular information. Proceedings of ACM Conference on Human Factors in Computing Systems (CHI'94), 318-322, Boston, MA.
Rochkind, M.J. (1975).
The Source Code Control System. IEEE Trans. on Software Engineering, SE-1(4):364-370.