Finding data

From Open Source Software Research
Revision as of 22:36, 20 October 2013 by Gmadey (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

The goal of this page is to have all "real info" mapped to the corresponding table(s) in the database for obtaining that data. Some tables are inaccurate and certain data is distributed throughout the database in difficult-to-query ways. If you've wrestled with the database and discovered where certain information is stored or crafted a complex SQL query for obtaining certain data, please document it here.

Downloads

To get the total number of downloads for a particular project, you must NOT use the stats_project_all table. It is inaccurate for all but a few schemas (see the note at the bottom of that table's page for more info). The correct table to query is stats_groupid_alltime_agg (although there are also some caveats here as well).

The stats_groupid_alltime_agg table was eliminated from the sf data dumps starting with schema sf0209. In sf0109 there were 11 tables having titles beginning with "stats_." Starting with sf0209 there are only 6 tables having titles beginning with "stats_." Overall there are 5 less tables in the sf0209 through sf0509 sf data dumps than in the sf0109 sf data dump, and the 5 tables that were eliminated had titles beginning with "stats_." I rechecked the sf0509 stats_project_all, the stats_multi_rank_history_byday and the top_group tables, but none of them or any combination of them appears to produce the correct downloads totals. I looked at every table in the sf0509 database and found no other tables with a "downloads" field. Where have all the downloads gone???

It appears that there is no perfectly accurate count of downloads for schemas after sf0109. I was able to produce a reasonably accurate count of downloads for schema sf0509 by adding the stats_groupid_alltime_agg figures from sf0109 to the sum of the stats_multi_rank_history_byday figures for sf0509, for each project. This sum appears to understate the number of downloads by approximately 2 weeks for the sf0509 schema. Data for the sf0509 schema were collected on 05/15/09, and the downloads totals described here are short of the downloads totals shown on the Sourceforge.net for 05/15/09 by about two weeks. I checked several projects manually.

Developers

To obtain correct number of developers, again, you must NOT use stats_project_all. Correct data can be extracted from the user_group table as follows:

SELECT count(*) FROM sf0507.user_group WHERE group_id = group_id

where group_id is the group_id found in the groups table for the particular project you are examining. Unfortunately, developer data is not stored directly in any table that SourceForge provides to us (except for the aforementioned inaccurate stats_project_all data). Planned work for the archive includes creating a view or new table for each schema with the above query result stored for each group_id.

Determining Precise Date of Schema

SourceForge does not always create their monthly dumps at the same time every month (although they are usually around the 15th of the month). Furthermore, obviously all months are not the same length. It may be beneficial to determine with reasonable precision, the time and day when a particular schema was made.

SELECT date FROM sf0408.forum WHERE true order by date desc limit 1 

This will return a unix timestamp. This assumes the forums are active enough that the margin of error by using this method is small. This is just one method for determining a timestamp close to when the dump was exactly made. There may be other methods for determining this with greater precision.

Operating System, Development Status, Programming Language, etc.

Strictly speaking, all of this information is stored in the trove_group_link table. However, it is unreadable because the information is stored as numbers. These numbers are codes which correspond to entries in the trove_cat table. A query for obtaining all trove catalog information on a certain project (translated to the corresponding trove_cat entry which is human readable):

SELECT b.* FROM sf0208.trove_group_link a, sf0208.trove_cat b WHERE a.group_id = 235 and a.trove_cat_id=b.trove_cat_id

This will grab all of the rows for project with group_id 235 from the trove_group_link table, and then take the trove_cat_id's and look them up in the trove_cat table, which is readable and fairly self-explanatory. Note that this has all of the information that a project provides about itself, including OS, (natural) language, programming language, intended audience, etc. If you only want one of those categories, you can add another restriction to the WHERE clause specifying the root_parent (category) you want. For example:

SELECT b.* FROM sf0208.trove_group_link a, sf0208.trove_cat b WHERE a.group_id = 235 and a.trove_cat_id=b.trove_cat_id and b.root_parent=199

will only return the operating system related entries. Here are the highest levels of the tree (root_parent of 0):

Table "trove_cat" (selected columns)
trove_cat_id shortname fullname description
1 audience Intended Audience The main class of people likely to be interested in this resource.
6 developmentstatus Development Status An indication of the development status of the software or resource.
13 license License License terms under which the resource is distributed.
18 topic Topic Topic categorization.
160 language Programming Language Language in which this program was written, or was meant to support.
199 os Operating System What operating system the program requires to run, if any.
225 environment Environment Run-time environment required for this program.
274 natlanguage Natural Language The oral/written language for the development and use of this software.

If you can remember the shortname, you can do a query like this:

SELECT b.*
FROM sf0208.trove_group_link a, sf0208.trove_cat b
WHERE a.group_id = 235 and a.trove_cat_id=b.trove_cat_id and b.root_parent = (select trove_cat_id from sf0208.trove_cat where shortname='os')

and it will do the same as the previous query.

There is more information on this here.


Example: finding all bugs for a project

A query that displays all the bugs for a project:

SELECT a.artifact_id, g.group_name, a.details
FROM sf0609.artifact a, sf0609.artifact_group_list ag, sf0609.groups g
WHERE a.group_artifact_id=ag.group_artifact_id AND ag.group_id=g.group_id AND g.group_name='JHotDraw'

The above query returns all bugs for the 'JHotDraw' project. If the project developers deleted some of their bug reports, you shall have to run this query on several dumps and stitch the data.

Example: finding all tracker reports for a project

Tracker reports include bug reports, feature requests, support requests and patch submissions. The following query returns the total number of tracker reports for each group_id. Note that many projects do not have any tracker reports.

SELECT ag.group_id, count(*) 
FROM sf0509.artifact a, sf0509.artifact_group_list ag 
WHERE ag.group_artifact_id=a.group_artifact_id GROUP BY ag.group_id

CVS Data

You can access the CVS data via the SRDA query page. Here are the tables available (though some are empty - we're still working out the kinks): (savannah is GNU Savannah, sf is SourceForge, berlios is BerliOS Developer)

berlios_cvs_groups
berlios_cvs_main
berlios_cvs_revisions
berlios_cvs_symbolic_names
berlios_cvs_user_group
berlios_cvs_users
savannah_cvs_groups
savannah_cvs_main
savannah_cvs_revisions
savannah_cvs_symbolic_names
savannah_cvs_user_group
savannah_cvs_users
sf_cvs_groups
sf_cvs_main
sf_cvs_revisions
sf_cvs_symbolic_names
sf_cvs_user_group
sf_cvs_users
Cvs er diagram.png

See the attached to the right an ER diagram with the details of each table. FYI, this is only up to date through about March of 2008.

These are in the schema cvs, so you must prefix them with cvs when you query the database. e.g.

SELECT count(*) FROM cvs.sf_cvs_revisions

How to find a project's open source license?

The open source license used by a project can be found in the groups table.

See the bottom of the page:

license: is the project's license type