FAQ

From Open Source Software Research
Revision as of 03:37, 13 October 2014 by Gmadey (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search
Top Links
• Query the Archive
• Schema Browser
• Research Data
• Making Queries
• Resources
• Papers
• Contact
• FAQ
• Schemas
• All tables
• Finding data
• ER diagrams

How to get access to the data and tools in the SourceForge Research Data Archive (SRDA)?

Q. How can a researcher obtain access to the data and tools in the SourceForge Research Data Archive (SRDA)?

A. All requests for data must be submitted in writing (e-mail) to the Notre Dame PI, (Greg Madey). Only academic and scholarly researchers are eligible to receive the data. To receive the data, a short questionnaire and agreement must be completed, signed and returned. Additional information is here.

How to login to the SourceForge Research Data Archive (SRDA) and access the data?

Q. How do I login to the SourceForge Research Data Archive (SRDA) and access the data?

A. See the attached tutorial with screen shots here.

How to cite the SourceForge Research Data Archive (SRDA)?

Q. How should the SourceForge Research Data Archive (SRDA) be cited in technical and scholarly papers?

A. Several recommended citations are suggested: two workshop papers describing SRDA and one for the WWW location of SRDA.

  • Matthew Van Antwerp and Greg Madey, "Advances in the SourceForge Research Data Archive (SRDA)", The 4th International Conference on Open Source Systems, IFIP 2.13 - (WoPDaSD 2008), Milan, Italy, September 2008. (paper) (slides) (BibTeX citation)
  • Yongqin Gao, Matthew Van Antwerp, Scott Christley and Greg Madey, "A Research Collaboratory for Open Source Software Research", In the Proceedings of the 29th International Conference on Software Enginering + Workshops (ICSE-ICSE Workshops 2007), International Workshop on Emerging Trends in FLOSS Research and Development (FLOSS 2007), Minneapolis, MN, May 2007. (paper)
  • Greg Madey, ed., The SourceForge Research Data Archive (SRDA). University of Notre Dame. (your access date) <http://srda.cse.nd.edu/>

Dumps available for local processing?

Q. Are there dumps available for local processing, or are we limited to using the web-form by the contract? If there was a way to either get full scripting client access to the db, or for registered users to download dumps and use them locally I think that would be great. That said we need to do a better job of this at OSSmole too. --JamesHowison (3 Jan 2008)

A. Unfortunately, our contract with SourceForge does not allow us to give out the actual data sets. We are working on providing Web Service access, which should ameliorate most of the difficulties associated with the web form. We welcome any input on preferences from the researcher perspective.

Long running queries?

Q. I have some queries which run for quite a while, presumably as they involve large joins. Does the server continue to run them after the web session times out, eventually creating the results file in the directory? Could one simply keep checking for a new results file and download it once the query has completed?

At the moment I've had to write an agent to fill out the form and divide the query up but it is painfully slow and undoubtedly a larger load on your server ... --JamesHowison (3 Jan 2008)

A. It is hard to tell exactly what happens to a query when the session times out. Most long running queries are computationally infeasible and will cause the SQL thread to run out of memory in which case the thread will need to be killed. Presumably, if a query will not cause the PostgreSQL thread to run out of memory, it should finish even after the session times out. The session timeout has been raised to 30 minutes, so this should ameliorate the problem. Additionally, web services should ease the querying process. I believe the timeout there is completely controlled by the client. The example code I placed on the web services wiki page sets the timeout to one hour (3600 seconds), but if a query may take longer, that parameter can be enlarged.

Charset?

It is hard to deal with the textual data in tables like User's realnames and Artifacts/Forums, when one doesn't know the charset. Even more confusing, I (James Howison) think it is possible that different tables are in different charsets.

I ran a few tests by putting some high-range text [1] into a bug tracker a while back which Sourceforge serves as utf8 and it looks fine (assuming that you are using a good unicode font). If you pull the 'details' column from sf1107.artifact the resulting file will be interpreted as utf8 (because it doesn't have a charset header and utf8 is the default for xml parsers, also utf8 can be guessed well from the byte stream, so editors like TextMate will open it as utf8).

However don't simply assume that all the text in the database is in utf8. For example if you pull the realname for user_id=387136 (or 419100) you'll find the resulting file looks ok if interpreted as latin1 (ISO-8859-1) but wrong if interpreted as utf8. (utf8 and ASCII are byte compatible, but utf8 and latin1 above the ASCII range (eg é sadly are not). For now I check whether the file looks like utf8 and if so I add the utf8 header, if not I add the latin1 header (in ruby:)

results = File.read('downloaded-xml-file.xml')

charset = if results.isutf8 then "utf-8" else "ISO-8859-1" end
    
xml = "<?xml version='1.0' encoding='#{charset}'?>\n" + results

doc = REXML::Document.new(xml)
rows = doc.elements.to_a( "//row" )

This seems to work, but given that there may be other encodings (eg the kml project has lots of russian characters in its Trackers), it isn't ideal (perl has Encode::Guess which might help).

Does postgresql have the charset information for the text fields? Sometimes even if a field says a particular charset it actually stores text in a different one (remember it's all just a byte stream) and you have to be quite careful getting it out (thinking about interaction with the expected charset of your client and even terminal!).

Note that the schemaspy or mysqladmin-type access would tell users this kind of info, normal postgres client access would make it moot (assuming that the charsets are properly setup). Failing that, ideally the xml version would specify the charset at the top of the file and would ensure that it converted from the charset of a particular field into the charset of the xml file (which would ideally be utf8). Ie when pulling users.realname the server would convert the field from latin1 to utf8 before putting it into the xml file.

You might also have to decode html entities (notice in the example tracker item above how the ampersand is stored in details as "&amp;". --JamesHowison

Date Format

All of the date fields in the SourceForge database do not use the standard SQL date or timestamp data type; instead, dates are stored as integers. The date values are based upon the time functions on UNIX operating systems that measure the number of seconds from the epoch (midnight GMT, January 1st, 1970). The date values are monotonic integers, so you can directly compare two dates together to determine whether one date is before or after another date. However, the high resolution of the date values means that two dates will rarely be the same value, so extra processing must be performed for determining if two dates are "the same time". Likewise, additional processing must be performed to pull out more human accessible date information like months, days, and years. The following is a fragment of Java code which shows how to convert a SourceForge date value into a Java Date field and perform some simple manipulations on the date.

// SourceForge date, probably read from database
int sfDate;

// Create Java Date class from SourceForge date
// Java Date is milliseconds so multiply by 1000
java.util.Date actDate = new java.util.Date(sfDate * 1000);

// Create Java Calendar class to manipulate date
Calendar cal = Calendar.getInstance();
cal.setTime(actDate);

// Add 30 days to the date
cal.add(Calendar.DATE, 30);

// print out the date
DateFormat dd = DateFormat.getDateInstance();
System.out.println("Activity date: " + dd.format(actDate))


Alternatively, you can convert seconds since the Unix epoch to human-readable format with the following command:

$ date -d '1970-01-01 UTC 1137273860 sec'

This will yield the following output:

Sat Jan 14 16:24:20 EST 2006

Here is a simple shell script which you can put in /usr/local/bin and make executable:

#!/bin/bash

date -d "1970-01-01 UTC $1 sec"

I call mine 'convert' and I use it like so:

$ convert 1137273860
Sat Jan 14 16:24:20 EST 2006

The following function implements the UNIX epoch conversion in JavaScript using an HTML form:

<HTML>
<HEAD>
<script language="JavaScript">
  function convert()
  {
    var theDate = new Date(document.form1.timeStamp.value * 1000);
    dateString = theDate.toGMTString();
    arrDateStr = dateString.split(" ");
    document.form1.numMonth.value = getMonthNum(arrDateStr[2]);
    document.form1.numDay.value = arrDateStr[1];
    document.form1.numYear.value = arrDateStr[3];
    document.form1.numHour.value = arrDateStr[4].substr(0,2);
    document.form1.numMinute.value = arrDateStr[4].substr(3,2);
    document.form1.numSecond.value = arrDateStr[4].substr(6,2);
  }
  function getMonthNum(abbMonth)
  {
    var arrMon = new Array("Jan","Feb","Mar","Apr","May","Jun",
          "Jul","Aug","Sep","Oct","Nov","Dec");
    for(i=0; i<arrMon.length; i++)
    {
      if(abbMonth == arrMon[i])
        return i+1;
    }
  }
</script>
</HEAD>
<BODY>
<form name=form1>
  <tabl border=0>
  <tr>
    <th>Timestamp:</th>
     <td valign=bottom rowspan=2>
      <input type=button name=switcher value="convert" onClick="convert();"></td>
    <th>Month:</th><th> </th>
    <th>Day:</th><th> </th>
    <th>Year:</th><th> </th>
    <th>Hour:</th><th> </th>
    <th>Minute:</th><th> </th>
    <th>Second:</th><th> </th>
  </tr>
  <tr>
    <td align=center><input type=text size=14 maxlength=11 name=timeStamp></td>
    <td><input type=text size=4 maxlength=2 name=numMonth></td><th>/</th>
    <td><input type=text size=4 maxlength=2 name=numDay></td><th>/</th>
    <td><input type=text size=4 maxlength=4 name=numYear></td><th>@</th>
    <td><input type=text size=4 maxlength=2 name=numHour></td><th>:</th>
    <td><input type=text size=4 maxlength=2 name=numMinute></td><th>:</th>
    <td><input type=text size=4 maxlength=2 name=numSecond></td><th>GMT</th>
  </tr>
  </table>
</form>
</BODY>
</HTML>

Where are project attributes like Operating System, Development Status, Programming Language, etc. stored in the database?

That information is encoded within the trove categories. This includes the trove_cat table which are all the trove categories and the trove_group_link table which are the trove categories associated with each group.

For example, if you go to the summary page for a project on Sourceforge, and look at the link for (say) status, part of the URL has a query part

form_cat=15

or some number. That number is the trove category number, so this database query

select * from trove_cat where trove_cat_id = 15;

would return the record for that category:

 trove_cat_id |  version   | parent | root_parent | shortname |             fullname             |           description            | count_subcat | count_subproj |                          fullpath                           |  fullpath_ids
--------------+------------+--------+-------------+-----------+----------------------------------+----------------------------------+--------------+---------------+-------------------------------------------------------------+----------------
           15 | 2001051101 |     14 |          13 | gpl       | GNU General Public License (GPL) | GNU General Public License (GPL) |            0 |             0 | License :: OSI Approved :: GNU General Public License (GPL) | 13 :: 14 :: 15

while the database design is quite flexible, it is essentially storing a whole taxonomy tree in a table; it makes it somewhat difficult to do queries. There are only a small number of roots so you can form a query where the root_parent is the one you are interested in.


select * from trove_cat where trove_cat_id in (select distinct root_parent from trove_cat);

trove_cat_id |  version   | parent | root_parent |     shortname     |       fullname       |                               description                               | count_subcat | count_subproj |       fullpath       | fullpath_ids
--------------+------------+--------+-------------+-------------------+----------------------+-------------------------------------------------------------------------+--------------+---------------+----------------------+--------------
          199 | 2000032101 |      0 |           0 | os                | Operating System     | What operating system the program requires to run, if any.              |            0 |             0 | Operating System     | 199
          225 | 2000032101 |      0 |           0 | environment       | Environment          | Run-time environment required for this program.                         |            0 |             0 | Environment          | 225
          160 | 2000032001 |      0 |           0 | language          | Programming Language | Language in which this program was written, or was meant to support.    |            0 |             0 | Programming Language | 160
            6 | 2000031601 |      0 |           0 | developmentstatus | Development Status   | An indication of the development status of the software or resource.    |            0 |             0 | Development Status   | 6
            1 | 2000031601 |      0 |           0 | audience          | Intended Audience    | The main class of people likely to be interested in this resource.      |            0 |             0 | Intended Audience    | 1
          274 | 2000102401 |      0 |           0 | natlanguage       | Natural Language     | The oral/written language for the development and use of this software. |            0 |             0 | Natural Language     | 274
           18 | 2000031601 |      0 |           0 | topic             | Topic                | Topic categorization.                                                   |            0 |             0 | Topic                | 18
           13 | 2000031601 |      0 |           0 | license           | License              | License terms under which the resource is distributed.                  |            0 |             0 | License              | 13
(8 rows)


What research projects have used the Sourceforge Research Archive (SRDA)?

Q. Who are the users of the data? Can I learn more about their projects and papers they published?

A. See a list of research projects that have used the Sourceforge Research Archive (SRDA) here.

Where else can I get data similar to that found at SRDA?

Q. Where else can I get data similar to that found at SRDA?

A. Multiple sources of data are available including:

Let us know of other sources.

How do you eliminate newlines and carriage returns from text fields?

Q. How do you eliminate newlines and carriage returns from text fields?

A. substitute the carriage return and the newline character inside the column with a space. For example:

SELECT a.artifact_id, REPLACE(REPLACE(a.summary, CHR(10),' '), CHR(13),' ')
FROM sf0606.artifact as a
WHERE a.group_artifact_id=497423 ORDER BY a.open_date


How do you retrieve CVS data

Q. How do you retrieve CVS data?

A. Some instructions on how to do this are provided at Finding_data#CVS_Data