NEurope
bob

Help with data collection problem

Hello,

 

This may come across as a bit strange, but my girlfriend needs some help with the data collection for her PhD, and it seems I'm not man enough to help her.

 

Basically what she wants to do is collect the details of some 32,000 members of a fansite, so she can analyse the breakdown of the members.

 

This brings two problems.

 

1) The first is getting the information off the page. I thought you could save the HTML file and then write some code to extract the info you wanted (she only wants age, country and gender). But I'm not sure what you would use to do this.

 

2) The second problem is that the members are listed 30ish people to a page, and there are 1000 pages. Is there any way you could download all 1000 pages in one go, so you don't have to download 1000 HTML files manually?

 

Any help towards this would be much appreciated!


You could just contact the admin and ask for a CSV export from the database, if that's possible?

 

I could easily generate a list of those three bits of info for the forum members on this site just by exporting the fields from the database.


That might be possible, I'll ask her to try and email the admin.

 

How might it be possible otherwise though? If you had the HTML file, what would be the best way to filter through it and extract all the entries marked 'age' etc?


Depends how well marked up they are. Can you tell the difference between an age figure and just another arbitrary number, like a date? Do you have an example page?


Well here is part of the code used on the page:

 

<div class="avatarContainer">

<a href="/en/user/?id=151548149" class="avatar avatarSmall"><img src="http://www.sdcdn.com/avatars/88/878/305/878305047.jpg" width="44" height="44" alt="DivaElenasuper" /></a> </div>

<div class="nick">

<a href="/en/user/?id=151548149" class="user">DivaElenasuper</a> </div>

<div class="aslContainer">

<span class="asl bidiLevel"><span class="avatarAge">17</span><span class="bullet">•</span><span class="ic ic-girl " title="Girl"></span><span class="bullet">•</span><span class="ic ic-flag gr " title="Greece"></span><span class="bullet">•</span><span class="ic ic-addfriend " title="Add as friend" rel="151548149"></span> </span> </div>

<div class="comment">21 hours ago</div>

<div class="actions">

</div>

 

 

So the info she'd like to take out would be the age (17), the country (Greece) and the gender (girl). And then do that 32,000 times and collate the info into a database.

 

I think she's going to try and ask the admin, but I'm still interested as to how one would do it, and what tools you would use. :geek:
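
For anyone curious about the "how", here is a minimal sketch using only Python's standard-library `html.parser`. It assumes every member block looks like the snippet above: an `aslContainer` div holding an `avatarAge` span, a gender icon span and an `ic-flag` span. The `ic-boy` class is a guess (only `ic-girl` appears in the sample), and `MemberParser` / `parse_members` are names made up for this example.

```python
from html.parser import HTMLParser


class MemberParser(HTMLParser):
    """Collect age, gender and country from each 'aslContainer' div."""

    def __init__(self):
        super().__init__()
        self.members = []      # finished records
        self._current = None   # record being filled, or None outside a block
        self._in_age = False   # True while inside <span class="avatarAge">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "aslContainer" in classes:
            self._current = {"age": None, "gender": None, "country": None}
        elif self._current is not None and tag == "span":
            if "avatarAge" in classes:
                self._in_age = True
            elif "ic-flag" in classes:
                self._current["country"] = attrs.get("title")
            elif "ic-girl" in classes or "ic-boy" in classes:  # ic-boy is assumed
                self._current["gender"] = attrs.get("title")

    def handle_data(self, data):
        # Only the text inside the avatarAge span is treated as an age.
        if self._in_age and data.strip().isdigit():
            self._current["age"] = int(data)

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_age = False
        elif tag == "div" and self._current is not None:
            # Closing the aslContainer div finishes the record.
            self.members.append(self._current)
            self._current = None


def parse_members(html):
    parser = MemberParser()
    parser.feed(html)
    parser.close()
    return parser.members


# Part of the snippet posted above, used as test input.
SAMPLE = '''<div class="nick">
<a href="/en/user/?id=151548149" class="user">DivaElenasuper</a> </div>
<div class="aslContainer">
<span class="asl bidiLevel"><span class="avatarAge">17</span><span class="bullet">•</span><span class="ic ic-girl " title="Girl"></span><span class="bullet">•</span><span class="ic ic-flag gr " title="Greece"></span><span class="bullet">•</span><span class="ic ic-addfriend " title="Add as friend" rel="151548149"></span> </span> </div>
<div class="comment">21 hours ago</div>'''

print(parse_members(SAMPLE))
```

Because the age is only read from inside the `avatarAge` span, other numbers on the page (user IDs, image sizes, "21 hours ago") are ignored. If the real pages nest further divs inside `aslContainer`, the end-of-record check would need adjusting.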


There might be a simpler way, but you could run a find and replace on that data in something like Notepad++, and sort it into table cells at the same time.

 

You would need regular expressions to get rid of the unwanted stuff though.
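
As a rough sketch of the regular-expression route, the pattern below pulls the three values out of markup shaped like the sample posted earlier and joins them with tabs, so the result can be pasted into a spreadsheet as table cells. It assumes the age, gender and flag spans always appear in that order and are all present; a member with a missing field would throw the matching off. `ic-boy` is a guess at the other gender class, and `ROW` / `to_rows` are just names for this example.

```python
import re

# One match per member: age, then gender icon, then country flag.
ROW = re.compile(
    r'<span class="avatarAge">(?P<age>\d+)</span>'
    r'.*?<span class="ic ic-(?:girl|boy) " title="(?P<gender>[^"]*)">'
    r'.*?<span class="ic ic-flag [^"]*" title="(?P<country>[^"]*)">',
    re.DOTALL,
)


def to_rows(html):
    """Return a list of (age, gender, country) string tuples."""
    return [
        (m.group("age"), m.group("gender"), m.group("country"))
        for m in ROW.finditer(html)
    ]


# The relevant line from the sample page.
SAMPLE_ROW = (
    '<span class="asl bidiLevel"><span class="avatarAge">17</span>'
    '<span class="bullet">•</span><span class="ic ic-girl " title="Girl"></span>'
    '<span class="bullet">•</span><span class="ic ic-flag gr " title="Greece"></span>'
)

for row in to_rows(SAMPLE_ROW):
    print("\t".join(row))  # tab-separated, ready to paste into a spreadsheet
```

Regexes on HTML tend to break when the markup changes even slightly, so this is best treated as a quick one-off rather than something robust.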

 

You could probably use a Firefox extension like DownThemAll to download all the pages.
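
If a browser extension is not an option, the same bulk download can be sketched in a few lines of Python. The thread never gives the member-list address, so `URL_TEMPLATE` below is a made-up placeholder, and `page_urls`, `fetch` and `download_all` are names invented for this example. The one-second pause between requests is an assumption about what is polite to the server, not a known limit.

```python
import time
import urllib.request
from pathlib import Path

# Hypothetical URL pattern; replace with the real member-list address.
URL_TEMPLATE = "https://example.com/en/members/?page={page}"


def page_urls(template=URL_TEMPLATE, first=1, last=1000):
    """Build the list of page URLs, one per page number."""
    return [template.format(page=n) for n in range(first, last + 1)]


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def download_all(urls, out_dir, fetch=fetch, delay=1.0):
    """Save each URL as page_0001.html, page_0002.html, ... in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for n, url in enumerate(urls, start=1):
        target = out / f"page_{n:04d}.html"
        if not target.exists():        # skip pages already saved, so a rerun resumes
            target.write_bytes(fetch(url))
            time.sleep(delay)          # pause between requests to go easy on the server
        saved.append(target)
    return saved
```

Calling `download_all(page_urls(), "pages")` would then fetch all 1000 pages into a `pages` folder. It is worth checking the site's terms of use before running anything like this against it.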


If I were the admin I'd be wary about giving out a CSV containing 32,000 members' private details (email addresses etc). I certainly wouldn't want my details given out to anyone who asked for them.

 

What you would need to do is ask the administrator to go into the database and extract only the necessary details and exclude everything else. I think the best way to do this would be to write the script that does this for him or give him detailed instructions.

 

[/unhelpful]


Yeah I figured the admin would be aware of that.

 

For example, if I was going to dump tables from this forum, I would only take 3 fields. I might also exclude anyone with a flag for not displaying their DOB publicly. Otherwise it's information they have already chosen to give away anyway. Without the name of the user it's all just faceless statistics.

 

You really don't need any script for this assuming it's MySQL and has something like phpMyAdmin for managing it. You'd just go to the db, click export, select the relevant fields. You could even dump it straight into an MS Excel file.
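
If the admin would rather run a query than click through an export screen, the idea looks roughly like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for the real MySQL database, and the `members` table, its column names and the `hide_dob` flag are all hypothetical, since the actual schema is unknown. The two rows are placeholders for the demo, not real member data.

```python
import csv
import io
import sqlite3

# Stand-in database: SQLite in memory instead of the forum's real MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members ("
    "nick TEXT, email TEXT, age INTEGER, gender TEXT, country TEXT, hide_dob INTEGER)"
)
# Placeholder rows purely for the demo.
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("user_a", "a@example.com", 17, "Girl", "Greece", 0),
        ("user_b", "b@example.com", 30, "Boy", "Spain", 1),  # hides DOB, so skipped
    ],
)


def export_public_fields(conn, out):
    """Write only age, gender and country to CSV, skipping members who hide their DOB."""
    writer = csv.writer(out)
    writer.writerow(["age", "gender", "country"])
    rows = conn.execute(
        "SELECT age, gender, country FROM members WHERE hide_dob = 0 ORDER BY rowid"
    )
    writer.writerows(rows)


buffer = io.StringIO()
export_public_fields(conn, buffer)
print(buffer.getvalue())
```

Nicknames and email addresses never leave the database because they are simply not in the `SELECT` list.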


Thanks very much, that's very helpful. I'll see what I can do!

 

And, yeah, obviously they wouldn't need to give her email addresses. She doesn't even need nicknames, only public ages, countries and genders.
