Scraping Forum Data

@Illyana @NicolBolas Here you are!

This is sorted by time read. The forum only reports time read rounded to whole days if you have more than a day, whole hours if you have more than an hour, and minutes otherwise. I have converted everything to minutes, which produces oddly round numbers for most people; that is purely an artifact of the forum's rounding.

The top 3 post readers are:

  1. @Got_a_Screw_Loose
  2. @Xenon27
  3. @Anomalocaris
  4. Skip a few
  5. @TaranMayer

This took me an hour or two of messing around with some web scraping scripts. @DRow could have just hit “Export Users”. Tfw I’m not an admin.

This is the code, in case anyone cares.

import csv
from bs4 import BeautifulSoup

# Info.txt is the saved page HTML; each user is one <tr class="ember-view"> row.
with open("Info.txt", encoding="utf-8") as html_file:
    soup = BeautifulSoup(html_file, "html.parser")
rows = soup.find_all("tr", class_="ember-view")

# Minutes per unit suffix the forum uses for time read ("d", "h", "m").
MINUTES_PER_UNIT = {"d": 1440, "h": 60, "m": 1}

# Open the CSV once; newline='' avoids blank lines between rows on Windows.
with open("users.csv", mode="w", newline="", encoding="utf-8") as user_file:
    users = csv.writer(user_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    users.writerow(["Username", "Likes Received", "Likes Given", "Topics", "Replies", "Viewed", "Read", "Visits", "Time Read"])

    for numdone, group in enumerate(rows, start=1):
        # The second link in each row holds the username.
        rowdata = [group.find_all("a")[1].get_text(strip=True)]
        for num in group.find_all("span", {"class": "number"}):
            # Prefer the exact count in the title attribute; fall back to the displayed text.
            if "title" in num.attrs:
                rowdata.append(str(num.attrs["title"]).replace(",", ""))
            else:
                rowdata.append(num.get_text(strip=True))
        # Time read looks like "3d", "5h" or "< 1m"; strip spaces and "<" before parsing.
        time_read = group.find_all("span", {"class": "time-read"})[0].get_text(strip=True).replace(" ", "").replace("<", "")
        units = time_read[-1]
        if units in MINUTES_PER_UNIT:
            rowdata.append(int(time_read[:-1]) * MINUTES_PER_UNIT[units])
        else:
            rowdata.append("unknown")
        users.writerow(rowdata)
        print(numdone)  # progress counter
24 Likes

Okay…? That’s fine, but you didn’t. I’m not sure there’s a reason to be defensive here; Taran just put up some stats for everyone. You can too if you want, but don’t get defensive about it.

13 Likes

Why do the usernames of practically all the people at the bottom, with like a minute of reading time, look like their keyboard had a seizure? (Minus Kyle of course)
image

10 Likes

Lol. I noticed that too.

Cool, I guess. I even gave you all the code I used if you want to try running it yourself.

5 Likes

@Sharky_do if you can gather the data yourself and put it in a chart of your own, I’d be impressed. I have officially challenged you to do this. You have everything you need.

6 Likes

5 Likes

I will try, but tests from school keep on coming.

I’m pretty sure your school doesn’t give you tests nearly every day.

5 Likes

Wow. This is great. What did you run this on? Python?

2 Likes

That code does indeed appear to be Python, yes.

3 Likes

Well, this is mostly correct as the internets (and computing devices in general) are essentially series of tubes that move information around.

However, to be Turing Complete (Computerphile video) you also need some sort of conditional processing sprinkled with mathematical functionality for convenience.

For example, you need conditional branching to sort the data like Taran did, and simple arithmetic operations to create metrics like the recent time read per number of visits in the last 60 days.
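As a rough sketch of that last point: the snippet below computes minutes read per visit and sorts on it. The usernames, numbers, and the `minutes_per_visit` helper are all made up for illustration; none of this is real forum data or Taran's actual code.

```python
# Hypothetical rows: (username, minutes read in last 60 days, visits in last 60 days).
# Names and numbers are invented for illustration only.
recent_stats = [
    ("user_a", 1440, 48),
    ("user_b", 300, 60),
    ("user_c", 45, 0),
]

def minutes_per_visit(minutes_read, visits):
    """Arithmetic step: average minutes read per visit."""
    # Conditional branch: avoid dividing by zero for users with no visits.
    if visits == 0:
        return 0.0
    return minutes_read / visits

# Sorting is where the comparisons (branching) happen.
ranked = sorted(
    ((name, minutes_per_visit(m, v)) for name, m, v in recent_stats),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, metric in ranked:
    print(f"{name}: {metric:.1f} min/visit")
```

The zero-visit guard is the kind of conditional the metric needs before it can be trusted across the whole user list.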

It is very nice that Taran has demonstrated the skills to scrape and move the data around, as well as the knowledge of tools to do some basic processing.

However, now is the time to take the game to another level and see if the collected raw data could be turned into a new class of useful information by identifying hidden patterns.

I challenge you, @Sylvie, to come up with a formula or train a neural net that could determine if a forum user is a young VRC competitor or an adult mentor based only on the total, yearly, monthly, etc… user stats. It doesn’t have to be 100%.

By just glancing at the numbers, I have a hunch that it is doable. Are you up to the task?

19 Likes

Oh, it’s certainly doable.

You def wouldn’t need a neural net. That being said I don’t know how to build one. A simple algorithm should suffice.
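For what "a simple algorithm" could look like, here is a minimal sketch of a one-feature threshold rule (a decision stump). It is only an assumption about the kind of thing that might work, not what was actually used: the feature values, the labels, and the `best_threshold` name are all hypothetical.

```python
def best_threshold(values, is_mentor):
    """Return (threshold, accuracy) for the rule 'value > threshold means mentor'.

    Candidate thresholds are the midpoints between neighbouring sorted values.
    """
    pairs = sorted(zip(values, is_mentor))
    candidates = [(a[0] + b[0]) / 2 for a, b in zip(pairs, pairs[1:])]
    best = (None, 0.0)
    for t in candidates:
        # Count users whose prediction matches their known label.
        correct = sum((v > t) == label for v, label in pairs)
        acc = correct / len(pairs)
        if acc > best[1]:
            best = (t, acc)
    return best

# Hypothetical single stat per user, plus a hand-assigned label (True = mentor).
values = [2.0, 3.5, 4.0, 9.0, 11.0, 12.5]
labels = [False, False, True, False, True, True]

threshold, acc = best_threshold(values, labels)
print(f"threshold={threshold}, accuracy={acc:.2f}")
```

A rule like this still needs a handful of users whose role is already known, so the labels have to come from somewhere other than the stats.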

3 Likes

That being said, I’m sure there’s a python library that takes care of it…

22 Likes

eyyy its xkcd!:grin:

5 Likes

Ah yes, meaningful numbers. That’s certainly what these are.

3 Likes

@technik3k one issue I’m running into: It can separate the kids and the adults fine, except for one thing. The kids that act like adults confuse it.

What do we think about an accuracy rate like this? Positive means it thinks you’re a coach, negative means it thinks you’re a kid.

Green are actually coaches, red are actually kids.

This is ~68% accuracy.

And I can say, the mentors that are marked as kids are ones that don’t act like most of the mentors do.
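For anyone who wants to check a number like that themselves, this is a small sketch of how accuracy falls out of signed scores. The scores and labels below are invented, so the result is not the ~68% above; `evaluate` is just an illustrative helper name.

```python
# Hypothetical (score, actually_coach) pairs. Positive score = predicted coach,
# negative score = predicted kid. These values are made up for illustration.
scored = [
    (2.1, True), (0.4, True), (-0.3, True), (1.2, True),
    (-1.8, False), (0.9, False), (-0.2, False), (-2.5, False),
]

def evaluate(scored_users):
    """Return (accuracy, coaches marked as kids, kids marked as coaches)."""
    coach_as_kid = sum(1 for s, coach in scored_users if coach and s <= 0)
    kid_as_coach = sum(1 for s, coach in scored_users if not coach and s > 0)
    correct = len(scored_users) - coach_as_kid - kid_as_coach
    return correct / len(scored_users), coach_as_kid, kid_as_coach

accuracy, coach_as_kid, kid_as_coach = evaluate(scored)
print(f"accuracy={accuracy:.0%}, coaches marked as kids={coach_as_kid}, "
      f"kids marked as coaches={kid_as_coach}")
```

Splitting the misses into the two directions shows whether the errors are mostly kids that act like adults or mentors that don't act like most mentors.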

13 Likes

(@DRow can you split this into another topic?)

1 Like

I concur

this has gotten off topic

1 Like

Yes, please, preferably starting from post 367 or 374 with some light clean up.

Taran, that means that you are not looking at all the available data, that you may not have enough dimensions to sort your points in, or, if you are using a neural network, that it is not deep enough to recognize all the hidden dependencies.

There is no question that for both kids and adults there are several distinct groups with characteristic behaviors that will cluster together, with some users falling in between.


Cluster analysis - Wikipedia

However, if you select enough analysis criteria (dimensions) to sort the users by and your neural network is deep enough, then it should be able to recognize all clusters and cut through the noise.

I think there is enough raw data to make it much better. The trick is to find criteria that make more sense.
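To make the clustering idea concrete, here is a minimal k-means sketch in plain Python. k-means is just one clustering method picked for illustration; nothing in this thread says it is the right one. The two-dimensional points stand in for hypothetical per-user stats and are made up.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

# Hypothetical users described by two made-up, already-scaled stats each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Start from one point in each visible blob; real data would need a better init.
centroids, groups = kmeans(points, [points[0], points[3]])
for c, g in zip(centroids, groups):
    print(f"centroid={c}, members={len(g)}")
```

Each extra criterion is just another coordinate in the tuples, which is all "more dimensions" means here; users that fall in between the groups will still land in whichever centroid happens to be closest.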

7 Likes

C’mon guys!!!
:frowning_face: