Up until June 2011, The Signpost (Weber State University’s student news organization), was hosted by College Publisher — A content management system designed for university newspapers like The Signpost. In June 2011, the paper made the jump to WordPress.
The first order of business was to import past stories from College Publisher into our new WordPress hosted site. Easy right? Get a SQL dump from CP, format the columns for the WP database, upload the new SQL file. Bada-boom bada-bing. If only…
After waiting nearly two weeks for a response from College Publisher, they provided us with a temporary FTP site which contained the archived file export:
Working with large Excel files can be difficult, working with nearly 4 GB excel files can be damn near impossible — especially when working on a office computer with 4 GB of of memory. After closing all non-essential processes, I was able to open up the file — only to find 60,00 rows — many of which contained classified ads that were stored with stories, incomplete records, and other corrupted gobbledygook.
Sorting by story size, and filtering my some key words, allowed me to filter out 20,000 bogus records — bringing the Excel file down to about 40k. Upon further inspection, nearly all of these remaining stories contained duplicate headlines. A typical duplicate set contained 4 records, one of which being the whole correct story.
40k records, with 4 duplicates per unique story leaves 10k unique stories — more stories than would be feasible for one individual to sort through in a reasonable amount of time.
The eventual solution? I created SQL insert scripts from the Excel file and imported them into a MySql database. Then I whipped up a PHP application that allows searching by article title, author, and body content. Duplicates are displayed to the user, along with a Reddit-esque voting mechanism for voting correct stories up, and bad ones down. As prior authors search for their old work, they improve the system — “crowd sourcing” the work or filtering out bad data. To date, 7,810 records have been voted up or down — 18%.
Thanks, College Publisher!
Oh, and as far as importing the stories into WordPress? I guess we can tackle that in another few years when the archives have been filtered.