Web-based Analysis of Large-Scale Software Systems

Thomas Ball, Stephen G. Eick, and Audris Mockus

February 26, 1997

1 Introduction

The software development process is complex and multifaceted, especially in large-scale projects, leaving a trace of many different documents. A plethora of hand-written and automatically generated documents define the software process, requirements for a system, its architecture, status and details of its implementation, testing, etc. Often these documents are kept in disparate databases and data formats. In large-scale projects, the source code of a system and changes to the code are recorded automatically in a version control system, a form of documentation that is quite voluminous and rich in detail. While all the documents are related in well-defined ways, it is hard to explore these relationships because of the different interfaces to each of the data sources.

To help describe and understand different aspects of software evolution simultaneously, we have developed a number of tools for examining changes to documents and visualizing system artifacts such as source code and source version history. To integrate the tools and to provide simple, unified access, we implemented them using standard web infrastructure: common gateway interface (CGI) scripts to process and retrieve the data, HTML and JavaScript [Goodman, 1996] to provide hypertext and form-based interactions, and Java [Flanagan, 1996] applets to provide interactive graphical views. All of the applications run on a standard Netscape browser and follow web user interface and style conventions. We will briefly describe three of these tools: the Internet Difference Engine, SeeSoft, and Live Documents.

Through our examination of these tools we will explore three issues:

2 Internet Difference Engine

Large software projects keep a substantial amount of documentation about the project online. Many projects have invested substantial effort in converting their project documentation to HTML. Having a standard access method (the web browser) to all project documentation is a great asset, especially for documentation that changes frequently (e.g., project status reports, bug reports, meeting notes, performance results, etc.) and/or is produced and consumed by participants distributed geographically across several locations.

However, it is often the case that such documentation is not placed under version control, either to reduce overhead or because it is assumed that only the latest version is of interest. The Internet Difference Engine (IDE) is a tool that can help manage such distributed and changing documentation by tracking changes to a project's web pages.

IDE consists of three components: an HTML comparison tool, HtmlDiff; a URL tracking tool, w3newer, that determines when the content of a page has changed; and a database of HTML pages and their version history. The core of the engine is HtmlDiff, which computes the differences between two HTML pages and presents their commonalities and differences in a merged HTML page, with graphical icons to highlight the differences. The tracking tool w3newer periodically checks the selected URLs against versions stored in its version control database.
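The idea of a merged page can be illustrated with a minimal sketch. It is not HtmlDiff's actual algorithm, which is not described here: it assumes a simple tokenization into tags and words, uses Python's standard difflib as a stand-in comparison engine, and omits the arrow icons. The names tokenize and merge_diff are hypothetical.

```python
import difflib
import re

def tokenize(html):
    """Split HTML into tags and whitespace-delimited words (a simplification)."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)

def merge_diff(old_html, new_html):
    """Merge two pages into one token stream.

    Deleted material is struck out and new material is set in strong italic.
    Assumption: changed runs contain only text; a real tool must also keep
    the merged markup well-formed when tags themselves change.
    """
    old, new = tokenize(old_html), tokenize(new_html)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(old[i1:i2])
        if op in ("delete", "replace"):
            out.append("<strike>" + " ".join(old[i1:i2]) + "</strike>")
        if op in ("insert", "replace"):
            out.append("<strong><i>" + " ".join(new[j1:j2]) + "</i></strong>")
    return " ".join(out)

merged = merge_diff("<p>Visualization Home Page</p>",
                    "<p>Department Projects Page</p>")
```

Comparing token sequences rather than raw characters keeps each tag intact in the merged output, which is one plausible reason to tokenize before differencing.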

Here is a small example of HtmlDiff run on two old versions of the Software Production Research Department's home page. Below are two abbreviated versions of the page, which are the input to HtmlDiff. Under them is the output of HtmlDiff. Green arrows point at new material (in strong, italic font) while red arrows point to deleted material (struck out). In this example, someone has "personalized" our home page, and has replaced the "Visualization Home Page" by a number of department projects.

As of 6/14/1995

Software Production Research Department

As of 7/19/1995

Software Production Research Department

Output of HtmlDiff on above two pages

HtmlDiff: Here is the first difference. There are 3 differences on this page. [red arrow] is old. [green arrow] is new.

Software Production Research Department

It is worth noting that HTML serves at least two different purposes: as a documentation language (a hypertext formatting language) and as a scripting language, either to create applications out of multiple Java applets or to act as a framework for embedded applications such as VRML or our Live Documents application. HtmlDiff can also be used to show differences between versions of HTML scripts.

While IDE tracks the changes to a set of web documents, much of the software project data is already tracked by legacy version control systems. Version control may be used when it is necessary to recreate an older version of the system (to back out a change), or to coordinate the efforts of multiple programmers working on a system simultaneously. Typical examples of documents kept under version control are source code and design documents. The next two applications analyze version control data.

3 Analyzing Version Control Data

Analyzing software change is difficult because the software and the creation and maintenance processes are both complicated and interdependent. We are interested in the basic methodological questions concerning quantitative analysis of software maintenance and development such as:

Version control systems are an excellent source of data for analyses of software projects because they provide automated and consistent reporting over the lifetime of virtually any software project. Version control systems capture data such as lines deleted/inserted to make a change, the time the change was made, who made the change, an abstract describing the reason for the change, groups of related changes, etc.
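As a rough illustration of the kind of record described above, the following sketch models one change and aggregates over a set of changes. The field names, the ChangeRecord type, and the helper function are assumptions made for this example; they do not reflect the schema of any particular version control system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ChangeRecord:
    """One change as a version control system might report it (hypothetical schema)."""
    developer: str        # who made the change
    timestamp: datetime   # when the change was made
    lines_inserted: int
    lines_deleted: int
    abstract: str         # short description of the reason for the change
    group_id: str         # ties together related changes

def lines_touched_by_developer(records):
    """Sum inserted plus deleted lines for each developer."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec.developer] += rec.lines_inserted + rec.lines_deleted
    return dict(totals)

sample = [
    ChangeRecord("dev1", datetime(1996, 5, 1, 14, 5), 10, 2, "fix timer", "g1"),
    ChangeRecord("dev2", datetime(1996, 5, 1, 23, 50), 4, 0, "add field", "g1"),
    ChangeRecord("dev1", datetime(1996, 5, 2, 9, 30), 1, 1, "rename flag", "g2"),
]
totals = lines_touched_by_developer(sample)
```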

We describe two complementary tools for analyzing such data: SeeSoft and Live Documents.

4 SeeSoft

SeeSoft incorporates several reduced representations for text:

Figure 1: The SeeSoft text view showing code age according to a rainbow color scale. Proprietary information has been blurred in the figures.

SeeSoft displays such as Figure 1 are interactive, and employ techniques from dynamic statistical graphics [Becker et al., 1987]. For example, a user may turn lines on or off by brushing the mouse over the color scale (see Figure 2) to reduce the visual complexity of the display and focus attention on an area of interest. As the mouse touches any line of the code, the line itself and statistic values corresponding to it appear on screen. (This form of dynamic identification is similar to that used in the S language [Becker et al., 1988] for identifying points in a scatterplot.) The statistic used to color lines is user-selected (in Figure 1 it is the age of the code), as is the color palette (Rainbow in Figure 1). Other important options include Browser, which opens a window (see Figure 2) that shows in a readable font the text beneath its controller (shown in the middle of the third file).
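The coloring and brushing behavior can be sketched as follows. This is an illustration under stated assumptions, not SeeSoft's implementation: the rainbow scale is assumed to run from red at the low end of the statistic to blue at the high end, deactivated lines are assumed to be drawn in gray, and the function names are hypothetical.

```python
import colorsys

def rainbow_color(value, lo, hi):
    """Map a statistic in [lo, hi] to an RGB triple on a red-to-blue rainbow."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    hue = t * (2.0 / 3.0)  # hue 0 is red, hue 2/3 is blue
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))

def color_lines(stats, active=lambda s: True, off=(128, 128, 128)):
    """Color one line per statistic value; brushed-off lines get the 'off' color."""
    lo, hi = min(stats), max(stats)
    return [rainbow_color(s, lo, hi) if active(s) else off for s in stats]

ages = [0, 5, 10]  # hypothetical code ages, one per line
all_on = color_lines(ages)
# Brushing over the middle of the color scale deactivates those lines.
ends_only = color_lines(ages, active=lambda age: age in (0, 10))
```

Passing the brushing state as a predicate over the statistic mirrors the interaction described above, where the user turns lines on or off by sweeping over ranges of the color scale rather than over individual lines.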

The web interface to SeeSoft, like many Web interfaces to databases, consists of four principal components: a large version control database maintained on a server, a CGI-bin access program written in Perl [Schwartz, 1993], a JavaScript interface running in a Web browser, and the SeeSoft applet.

The visual display, illustrated in Figure 2, includes three Netscape HTML frames [Musaciano and Kennedy, 1996]. The left and bottom frames allow the user to select a subsystem and module for display, to select from two different versions of the code, and to control whether the SeeSoft applet is displayed in the frame or in its own window. The code and statistics for the selected subsystem and module are retrieved by a Perl script running as a CGI-bin command [Gundavaram, 1996].
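The server side of this arrangement can be sketched in a few lines. The original access program is a Perl CGI script whose interface is not given here, so everything below is an assumption for illustration: the parameter names subsystem and module, the tab-separated response format, and the in-memory FAKE_DB that stands in for the version control database.

```python
from urllib.parse import parse_qs

# Hypothetical stand-in for the version control database:
# (subsystem, module) -> list of (age_in_days, source_line).
FAKE_DB = {
    ("ui", "menu"): [(3, "int x;"), (400, "return x;")],
}

def handle_query(query_string):
    """Build a CGI-style response (headers, blank line, body) for one request."""
    params = parse_qs(query_string)
    key = (params.get("subsystem", [""])[0], params.get("module", [""])[0])
    lines = FAKE_DB.get(key)
    if lines is None:
        return ("Status: 404 Not Found\r\n"
                "Content-Type: text/plain\r\n\r\nunknown module\n")
    # One line of output per source line: the statistic, a tab, then the text.
    body = "".join(f"{age}\t{text}\n" for age, text in lines)
    return "Content-Type: text/plain\r\n\r\n" + body

response = handle_query("subsystem=ui&module=menu")
```

Sending each statistic alongside its source line lets the client color the display without a second round trip, which matters over a low-bandwidth connection.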

Figure 2: SeeSoft running within Netscape Navigator as an applet with a browser window showing the source code text in one of the files. The color of each line is tied to its version with the middle versions deactivated.

5 Live Documents

To characterize how software changes over time, we designed a framework of applets called Live Documents. Live Documents replace static figures and tables in documents with interactive applets, allowing the writer or reader to customize the document to get a different view of the data. This architecture simplifies sharing analysis results with others working on a related project. The abundance and complex structure of the version control data requires expertise in different domains to do the analysis. Live Documents provides an environment where domain experts, software experts, and organizational experts can perform their individual analyses and share them with others in a highly interactive fashion.

The Live Documents framework can be described in terms of five layers:

  1. Author, creator of the document.
  2. HTML - a text formatting and scripting language that glues together a textual description with interactive tools.
  3. A set of interactive applets providing different views of the version control data as well as control mechanisms.
  4. A set of Java classes to provide a common look and feel as well as a mechanism for applets to communicate with one another and share data.
  5. A set of CGI scripts interfacing to the version control system on a server.

The author of a live document composes a standard HTML document and, in addition to tables and images, can add graphical views designed to illustrate various points of interest. Views have controls (which the author may selectively add) that allow the reader to explore hypotheses that the author did not choose to explore or did not have enough expertise to analyze. The interactive applets allow readers to analyze the data presented to confirm (or reject) the document's claims, as well as to pursue their own hypotheses.

Live Documents can employ applets as sophisticated as SeeSoft. However, the applets reflecting the "true spirit" of live documents tend to be simpler. Such applets are not stand-alone applications but rather components of the web document. Several applets can be combined to provide different views of the same data (data sharing). The views may be synchronized so that they all respond to a user action (e.g., highlighting subsets of the data and menu selections). This makes the set of web pages appear as a single application. Live document applets tend to contain a minimal set of built-in controls so as not to overwhelm the reader. Required controls can be added by the author in the appropriate places within the document.
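One common way to obtain synchronized views over shared data is a shared selection model that notifies every registered view, as sketched below. This is an assumption about how such synchronization could be arranged, not a description of the Java classes in the framework; SharedSelection and View are hypothetical names.

```python
class SharedSelection:
    """Holds the currently selected record ids and notifies subscribed views."""

    def __init__(self):
        self._selected = frozenset()
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def select(self, ids):
        """Replace the selection and push it to every subscribed view."""
        self._selected = frozenset(ids)
        for callback in self._listeners:
            callback(self._selected)

class View:
    """A minimal view that highlights whatever the shared model selects."""

    def __init__(self, name, model):
        self.name = name
        self.highlighted = frozenset()
        model.subscribe(self._on_selection)

    def _on_selection(self, selected):
        self.highlighted = selected

model = SharedSelection()
table = View("table", model)
profiles = View("profiles", model)
model.select({"dev1", "dev7"})  # a selection made in either view
```

Because the views depend only on the shared model and not on each other, an author can add or remove views in a document without changing the remaining ones.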

5.0.1 Live Documents in Action

To investigate developer profiles we used an interactive visualization of multiple developers (see Figure 3). The version control data in this example contains information on changes to the code made by 509 developers over a 12-year period. Each case in the analyses corresponds to one developer. The data for each case represent the number of changes submitted per hour as well as averages of added, deleted, and unchanged lines over changes submitted during that hour.
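The per-developer case described above can be derived from raw change timestamps roughly as follows. The sketch covers only the change counts and the average of added lines; the input format (pairs of timestamp and lines added) and the function name hourly_profile are assumptions for illustration.

```python
from datetime import datetime

def hourly_profile(changes):
    """Summarize one developer's changes by hour of day.

    changes: iterable of (timestamp, lines_added) pairs (assumed format).
    Returns (counts, avg_added), two lists indexed by hour 0..23.
    """
    counts = [0] * 24
    added = [0] * 24
    for timestamp, lines_added in changes:
        counts[timestamp.hour] += 1
        added[timestamp.hour] += lines_added
    # Hours with no changes get an average of 0.0 rather than a division error.
    avg_added = [added[h] / counts[h] if counts[h] else 0.0 for h in range(24)]
    return counts, avg_added

changes = [
    (datetime(1996, 5, 1, 14, 5), 10),
    (datetime(1996, 5, 2, 14, 40), 30),
    (datetime(1996, 5, 2, 23, 59), 4),
]
counts, avg_added = hourly_profile(changes)
```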

Figure 3 shows a snapshot from a browser window including a textual description, a table applet, a profile view applet, and three control applets (appearing as choice widgets). The table applet shows developer names (left field) and the number of changes they made during each hour of the day (the second field from the left shows the midnight to 1am interval, followed by the 1am to 2am interval, and so on). The length of the bars represents the number of changes. The scrollbar on the left allows the user to zoom in to see names of individuals and numbers of changes in textual form and to zoom out to get an overview of all developers.

The profile view shows each developer as a small icon. The icon can be described as a 24-hour clock. Hour zero is at the top, and the other hours follow clockwise at 15-degree (360/24) intervals, the spacing of half-hour marks on a conventional 12-hour dial. The value for each hour is represented as an offset along a ray starting at the center of the clock. All values are connected by a line, forming a star-like shape.
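The geometry of such an icon can be worked out directly: hour h lies at an angle of h * 360/24 = 15h degrees clockwise from the top, at a distance from the center proportional to the value for that hour. The sketch below computes the vertices of the star shape. Scaling each icon by the developer's own peak value is an assumption, as is the name star_glyph.

```python
import math

def star_glyph(values, radius=1.0):
    """Return one (x, y) vertex per hour for a clock-like profile icon.

    Hour 0 points straight up and hours advance clockwise; the y axis points
    up. Values are scaled so the busiest hour reaches `radius` (assumption).
    """
    peak = max(values) or 1  # avoid dividing by zero for an empty profile
    points = []
    for hour, value in enumerate(values):
        angle = math.radians(hour * 360.0 / len(values))  # 15 degrees per hour
        r = radius * value / peak
        points.append((r * math.sin(angle), r * math.cos(angle)))
    return points

profile = [0] * 24
profile[0] = 4   # changes made between midnight and 1am
profile[6] = 2   # changes made between 6am and 7am
vertices = star_glyph(profile)
```

Connecting consecutive vertices, and the last back to the first, yields the closed outline. A screen coordinate system with y pointing down would need the sign of the y component flipped.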

The yellow (highlighted) icons correspond to the yellow records in the table. The user can select (by dragging the mouse) any subset of the icons or records, and the selection is automatically reflected in both views.

Figure 3: A table listing developers (sorted by the number of changes submitted at 14 o'clock). The bottom view shows the 120 developers most similar (in their change profile) to the developer at the bottom left. Each icon represents the trace that the hour hand of a 24-hour clock would draw with its tip if the length of the hand represented the number of changes made during each hour. Notice the peak close to midnight (appearing as a pan handle) common to many developers.

6 Summary

We have described several web-based visualization tools for examining changes to on-line documents, source code, and version control data. In the past, the means of access to documentation sources associated with large-scale systems have been as diverse as the sources themselves, and usually required direct access to the hosts where the data resides. We implemented our tools using standard web infrastructure to unify both data access and user interface, relieving the user of the need to log in to different hosts and run different software packages.

The main issue in porting the visualization tools to the web platform was to accommodate a client-server model with a relatively low-bandwidth connection, as compared to local disk access. To make the tools more "web compliant" their architecture needed the following enhancements:

We found that after implementing the above-mentioned adjustments we gained a number of advantages over non-web applications (in addition to integration). Web access (typing a single URL or pointing and clicking, with no software package to install) removed the problem of gaining access to the right system and/or installing and administering the relevant package on the client. A standard user interface (Netscape's web browser) minimizes the effort of learning to use the visualization tools. These two factors are essential in increasing developer and manager productivity by allowing more widespread use of software visualization tools.

From a tool developer's perspective there are also significant advantages. The main advantage is portability across hardware and operating system platforms, which drastically reduces development and maintenance effort and costs. Java is a high-level language with graphics (AWT), network, and multitasking capabilities. Porting the visualization tools from C++ to Java cut the amount of source code (in number of lines) roughly in half, and the effort as well, although the latter estimate is very subjective. Applications like Live Documents are especially easy to create by scripting with HTML or JavaScript and using the base component set of applets.

Finally, it is worth noting that many of the described applications can be used outside the domain of analyzing software projects. In particular, the Internet Difference Engine can track an arbitrary set of web pages, SeeSoft can be used to visualize regular text, not necessarily software code, and Live Documents can be used to analyze any complex database, not only version control data.

References


Ball and Douglis, 1996
Ball, T. and Douglis, F. (1996). An internet difference engine and its applications. In Proceedings of the COMPCON 1996 Conference.

Ball and Eick, 1996
Ball, T. A. and Eick, S. G. (1996). Software visualization in the large. IEEE Computer, 29(4):33-43.

Becker et al., 1988
Becker, R. A., Chambers, J. M., and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole, Pacific Grove, CA.

Becker et al., 1987
Becker, R. A., Cleveland, W. S., and Wilks, A. R. (1987). Dynamic graphics for data analysis. Statistical Science, 2:355-395.

Eick et al., 1997
Eick, S., Mockus, A., Graves, T., and Karr, A. (1997). Web based text visualization. In SoftStat'97 Proceedings.

Eick, 1994
Eick, S. G. (1994). Graphically displaying text. Journal of Computational and Graphical Statistics, 3(2):127-142.

Flanagan, 1996
Flanagan, D. (1996). Java in a Nutshell. O'Reilly & Associates, Sebastopol, CA.

Goodman, 1996
Goodman, D. (1996). JavaScript Handbook. IDG Books Worldwide, Inc., Foster City, CA.

Gundavaram, 1996
Gundavaram, S. (1996). CGI Programming on the World Wide Web: On-the-Spot Information. O'Reilly & Associates, Sebastopol, CA.

Musaciano and Kennedy, 1996
Musaciano, C. and Kennedy, B. (1996). HTML: The Definitive Guide. O'Reilly & Associates, Sebastopol, CA.

Schwartz, 1993
Schwartz, R. L. (1993). Learning Perl. O'Reilly & Associates, Sebastopol, CA.