
/sci/ - Science & Math



File: 330 KB, 567x600, Hertzsprung-Russel_StarData.png
No.9837467

Those familiar with astronomy will recognize this image - the Hertzsprung–Russell diagram for stellar classification. For those who aren't familiar with it, the HR diagram plots the brightness and temperature of observed stars and groups the stars into physically significant groups and families - giants, main sequence stars, dwarf stars, etc. What I'd like to propose as a project for /sci/ is trying to accomplish something similar with this website - a 4chan HR diagram which plots out the various boards in terms of some combination of relevant parameters (board traffic, board age, image posting, etc etc etc) to determine if it's possible to illustrate some kind of rough empirical structure of 4chan's communities and cultures. Does such a parameterization lump boards into groups and families? Does a "main sequence" of 4chan boards exist? Are there "evolutionary paths" which follow natural progressions? These are the kinds of questions I'd like to see if we can answer.

>> No.9837468
File: 22 KB, 721x961, 4chan hr first attempt.png

As a kind of rough 'proof of concept' of the idea, I acquired stats on average posts-per-day for each board over the last month, as well as image-to-post ratios, and plotted the boards with that parameterization. Already we see some interesting groupings appearing with this relatively simple approach:
1) At the top of the graph we have a close grouping of 'eye candy' boards (wallpapers, cute stuff, and all of the porn boards)
2) In the middle of the graph there is a large, amorphous blob of blue boards
3) There are two distinct 'branches' of pink boards in the lower half of the graph
- an upper branch which consists of boards like /bant/, /trash/, and /b/
- a lower branch which consists of boards like /soc/, /r9k/, and /pol/

I'd be curious to hear /sci/'s thoughts, as well as any ideas people might have for further refining this concept.

>> No.9837473

>>9837467
I like this idea!!
We could classify them on the level of brainletness of each board

>> No.9837496

Sounds pretty interesting. You could also try looking at the average words used per post, or maybe how often users will post links to other websites, and see if there's any correlation between those and the data you already collected.

>> No.9837528
File: 879 KB, 2230x1408, Screen Shot 2018-06-28 at 10.19.39 PM.png

required reading: this UN-funded study about /pol/. Good insight into researching 4chan; it will be useful, OP

https://arxiv.org/pdf/1610.03452.pdf

>> No.9837560
File: 105 KB, 409x284, the least rare pepe.png

>>9837473
Appreciate the spirit, but hoping to try for something a little more objective.

>>9837496
Yeah, that might be interesting; I'm interested in seeing what kind of correlations exist between different parameters. The biggest obstacle is data collection - finding details like board traffic info or image posting is easy with sites like 4stats and all the various archives and traffic monitors that exist... but something like figuring out the average number of words per post across 70-something boards is something I can't even fathom how to do. That's a big part of why I wanted to get /sci/'s two cents on the idea; I figured the folks here would have some insight on what parameters might really be worth looking at and how to go about collecting information on them.

>>9837528
That's fucking hysterical and I have a new reaction image now. Thank you anon.

>> No.9838118
File: 244 KB, 1416x820, 0sRB20K.jpg

>>9837560
me again from the other thread

in the past I also gathered such data as ~post-length and posts/unique-user.
Stopped because it takes too long to check all threads individually and still update the stats in a timely manner.
You can calculate the data from the public 4chan API ( https://github.com/4chan/4chan-API ), but it's going to take some time. Not an issue really, if it's supposed to be only a one time snapshot.
API limit is one request per second. So with 72 boards * ~10 pages * ~15 threads, it's going to take a few hours, but then you have all the data you would want.
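Sanity-checking that estimate (the board/page/thread counts are the rough figures from this post, not exact values):

```javascript
// Rough request-count estimate for a one-time snapshot,
// using the approximate figures quoted above.
const boards = 72;
const pagesPerBoard = 10;  // ~10 pages per board
const threadsPerPage = 15; // ~15 threads per page

const requests = boards * pagesPerBoard * threadsPerPage; // 10800 thread fetches
const hours = requests / 3600;                            // at 1 request/second

console.log(requests, hours); // 10800 requests, 3 hours
```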

>> No.9838189

>>9838118
Cool, I'll take a look at that.

>> No.9838219

>>9837560
https://4stats.io

>> No.9838449

>>9837528
>https://arxiv.org/pdf/1610.03452.pdf
It seems some of the authors have been hanging out here rather than just extracting data from bots, ref the "comfy" pepe.

>> No.9838778

>>9838118
Actually, I don't suppose you could go into a bit more detail on how to do that? I'm not really familiar with this sort of thing.

>> No.9838809

I'd like to see something about average post length as a proxy for thoughtfulness vs. traffic. Kind of similar to this one >>9837468 but different.

>> No.9838859

>>9838778
you would need some minimal programming knowledge to automate the API requests.

You would first get a list of all existing boards
https://a.4cdn.org/boards.json
then get info on existing threads (taking /sci/ as an example board here)
https://a.4cdn.org/sci/threads.json
or
https://a.4cdn.org/sci/catalog.json
take the OP post number and fetch the thread like
https://a.4cdn.org/sci/thread/9837467.json (this thread)
and that object contains all the posts, plus some extra data like "unique_ips" in the OP's object.
Put all the data in some kind of database I guess, and where you go from there is up to you.

Some boards don't have archives afaik, so if you fetch /b/ threads for example, it would probably make sense to start loading them from the back of the catalog, so threads aren't pushed off the board by the time you reach them, as they might be if you started from the front.
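The steps above can be sketched as a minimal JavaScript (Node 18+) script - the URL shapes are the ones listed above, but the function names and structure are just my guess at an implementation:

```javascript
const API = "https://a.4cdn.org";

// URL builders for the three endpoints described above
const boardsUrl = () => `${API}/boards.json`;
const catalogUrl = (board) => `${API}/${board}/catalog.json`;
const threadUrl = (board, opNo) => `${API}/${board}/thread/${opNo}.json`;

// Fetch JSON, respecting the 1-request-per-second API limit
async function fetchJson(url) {
  await new Promise((r) => setTimeout(r, 1000)); // conservative throttle
  const res = await fetch(url);
  if (!res.ok) throw new Error(`${res.status} for ${url}`);
  return res.json();
}

// Snapshot one board, walking the catalog back-to-front so threads
// aren't pruned off the board before we reach them
async function snapshotBoard(board) {
  const catalog = await fetchJson(catalogUrl(board));
  const threads = [];
  for (const page of [...catalog].reverse()) {
    for (const op of page.threads) {
      threads.push(await fetchJson(threadUrl(board, op.no)));
    }
  }
  return threads; // each entry has .posts, with extras like unique_ips on the OP
}
```

The 1-second sleep before every request is deliberately conservative; parallel fetching from multiple IPs, as discussed later in the thread, would speed this up.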

>> No.9838864
File: 22 KB, 376x349, 1530159782903.jpg

I think it would be interesting to see how pics spread. Like, if an image/meme is first posted on /g/, then how long does it take to travel over to /jp/? Sort of like a web matrix of how people spread information across the different boards.

the 30 year old boomer was originally only on a couple of boards, but now it has spread as far as /o/ and /ck/.

>> No.9838888

>>9838778
>>9838859
and it might be a good idea to pay attention to the time of day at which you plan to create that data snapshot.
3 hours is a long time, and different boards have their peak activity at different times of the day.
So you have to decide whether you want to compare boards at one single point in time, or get the data at the weekday/time at which each board is normally most visited.
You can check the graph on the 4stats site for that I guess.

And even then, some boards have interesting activity patterns, like /asp/ or /qst/.
/asp/ for example is always super-active Monday and Tuesday nights, while /qst/ is the board with the highest variance in activity, dependent on time of day like no other board (dropping to as low as 10-15% of peak during off hours).
Everyone seems to meet up there at only very specific times during the day to participate in the guided threads OPs are organizing.

>> No.9839000

>>9838888
If we're talking about comparing things like 'post length' and 'image ratios' and shit, then it probably isn't super critical what time the data is taken. You'd have a large enough sample size that any collection of posts you look at can probably be taken as indicative of average posting trends, as far as those kinds of things are concerned. Likewise, sites like 4stats take averages over several weeks, so things like one or two big nights of activity surrounding a big announcement or event should get washed out in the statistical averaging.

>> No.9839167

>>9839000
that could be interesting by itself.
Compare post quality on peak vs off-time. I would bet that on all boards during the daily peak you get a lot more shitposts and bait attempts, while during less active times you have a better chance of finding genuine discussions with people interested in the board's topic.
But that's nothing more than a guess anyway.

>> No.9839385

>>9838864
God I hate this shitty fucking forced meme.

>> No.9839408

>>9839385
that's the thing, there must be some interesting story behind how it spread to every board. I don't think it's a bot

>> No.9839415

>>9837473
Percentage of frogposts would be an excellent proxy for this.

>> No.9840007

>>9839000
>probably isn't super critical what time the data is taken
We have some polls from >>>/g/cyb that show large variations across the time zones. I'd say a 24 h period is needed to be sure.

>> No.9840294

>>9840007
Posting trends over time - especially as you cross the timezones from the US into Asia into Europe - would be interesting to look at for different sites.

/sci/ should look at expanding this thread idea into just a general 4chan statistics project, I'm sure there's a lot of really interesting and surprising insights we could gain into this site by looking at different posting trends.

If anyone else has ideas, suggest them here or, if you're proactive, try and put some data together.

>> No.9840468
File: 35 KB, 669x661, Demography2.png [View same] [iqdb] [saucenao] [google]
9840468

>>9840294
One thing is for sure, neither the boards nor the topics are homogeneous. Pic. very much related.

>> No.9841964

>>9840468
Wow, that's a lot more PhDs than I suspected.

>> No.9841986

it's an anonymous website full of trolls who love to answer questions falsely

good luck lad

>> No.9842013

Wow an actually interesting thread on /sci/ I never thought I'd live to see the day

>> No.9842028

>>9840294
>If anyone else has ideas, suggest them here or, if you're proactive, try and put some data together.
Looking at flag rates on boards like /pol/ and /int/ could be interesting - seeing whether concentrations of posters in the Americas, Asia, and Europe follow expected trends based on timezones, or are more unexpected.

>> No.9842038

>>9842013
I know, right?

Also post your IQ

and write a book about how you hate CS

dont forget to mention elon musk and black science man

>> No.9842142

>>9841964
That was my first thought too. Then again, 4chan has a lot of autists so why not? There are several of my colleagues I strongly suspect, many have a PhD.

>> No.9842145

>>9842142
>There are several of my colleagues I strongly suspect, many have a PhD.
how can you not know? Ask them and laugh inside that you have the same job but without a ph.d

>> No.9842269

>>9842145
Given that I too have a PhD I am not sure about that laughing part.

>> No.9842274

>>9842142
>There are several of my colleagues I strongly suspect, many have a PhD.
Wait, so you suspect that they have PhDs or you suspect that they post on 4chan?

>> No.9842304

>>9842269
well then if they have the same job as you with an undergraduate degree and you're the one with a ph.d, that sounds bad

>> No.9842453
File: 49 KB, 1645x995, snapshot.png

>>9838778
I created a snapshot of the first 3 pages for each board, which should be enough to at least experiment with some of the properties and see if there are some obvious patterns maybe.

It's 130MB of JSON, but in that human readable format at least it should be easy to work with.

https://drive.google.com/file/d/15zki-GD9j8c6gjQoUwDGApN8wqh3iRut/view

>> No.9842597

>>9842453
This is some interesting data, but going through it by hand is gonna be a bitch. I'm not familiar with programming in Java, but I am familiar with using Mathematica for modelling and data analysis, and it can make URL requests very easily.

If I get some free time this week I'll try and code up something that can (hopefully) automate data collection and analysis for the site. Should be a pretty straightforward flowchart:
>Select board to analyze
>Mathematica calls board data API to get thread data
>Extract a list of all of the OP numbers for existing threads
>Go down the list starting with the ones closest to getting bumped off
>Mathematica calls thread data API for each number
>Extract the quantities you're interested in
>Repeat until the thread list is exhausted
>Export the data points for that board to an external file
>Repeat with each board

150 threads per board, say 10 seconds to process each thread - that's about 25 minutes to collect a full sample for each board. It'd take a while, but it'd be a helluva lot easier than doing the analysis by hand.
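The 'extract the quantities you're interested in' step from the flowchart above, sketched in JavaScript rather than Mathematica - the field names (posts, filename, unique_ips) come from the API's thread JSON, but the chosen set of quantities is just an example, not a definitive list:

```javascript
// Pull a few board-comparison quantities out of one thread object,
// as returned by the thread endpoint. The selection here is a guess
// at a useful minimal set, not anyone's actual extraction code.
function extractStats(thread) {
  const posts = thread.posts;
  const op = posts[0];
  const withImages = posts.filter((p) => p.filename).length;
  return {
    op: op.no,                             // OP post number
    posts: posts.length,                   // posts currently in the thread
    imageRatio: withImages / posts.length, // fraction of posts with a file
    uniqueIps: op.unique_ips,              // only present on the OP's object
  };
}
```

Exporting one array of these objects per board would then cover the last two steps of the flowchart.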

>> No.9842607

>>9837467
Why is temperature plotted from high to low?

>> No.9842616

>>9842607
what will change if it was plotted from low to high

>> No.9842621

>>9842597
there is no scenario where you would analyze anything manually by hand.
Any programming language should be able to parse data in JSON format. After all, that's what the 4chan API sends back to you as well.

And processing a board will be way faster than that. Pretty sure the limiting factor will always be the API limit of 1 request per second.

>> No.9842664

>>9842621
Oh I know it'll probably be more efficient than that, but years of data analysis have taught me to always be a little pessimistic when making estimates... that way you're always pleasantly surprised when it processes more quickly. Realistically, if it's only truly limited by the request cap, it could well finish a board in 5-10 minutes and you could do the whole site in a day or two once you get all the bugs worked out.

>>9842607
Stars cool as they age - plotting the HR diagram with temperature decreasing from left to right turns it into a map of stellar evolution.

>> No.9843166

>>9837467
This is actually interesting; it's a shame this wasn't done several years ago, when 4chan's boards were much more distinct culture-wise. But this should still be worthwhile.

>>9839408
It's called a forced meme for a reason; the same retard or group of retards posts the same image again and again across various boards in a vain attempt to make it catch on.
In this case it's a bunch of spergs from nu-/v/ trying to spread their cancer everywhere.

>> No.9843201

>>9843166
>its a shame this wasn't done several years ago when 4chan's boards were much more distinct culture wise
The ones that people actually care about are all still pretty distinct.

>> No.9843221

>>9843201
maybe /v/, /tv/, /pol/, and /r9k/ are very clustered together on the diagram, and /b/ is off with the porn boards

>> No.9843338
File: 34 KB, 721x961, hr pink - labels.png

>>9843221
Interestingly, with the limited info we've got so far, there seem to be three distinct branches among the pink boards (excluding the massive 'porn regime'): one leading to /gif/ and /trash/, one leading to /bant/ and /b/, and one leading to /soc/, /r9k/, and /pol/. It'll be interesting to see if similar branches appear with more advanced versions of the plotting.

Update on the programming - I've got the basic importing and API calling stuff working. I can pick a board, have it collect the numbers for all the active threads, and start calling the API for the thread details with minimal effort. Extracting the data from the thread is a little more challenging. Extracting the readily labeled info like unique IPs, replies, images, etc. is easy, but I haven't thought of a good way to do word/character counting, since there's so much extraneous information about each post I don't want counted. Will keep updating this week, but I've got a research group meeting first thing in the morning and real work that I need to get done this week, so it may be a while.

>> No.9843406

>>9843338
can't wait to see what the blue boards yield, although i doubt we could glean many specifics about a board's culture or how that changes over time.

>> No.9843449

>>9842145
I think he meant suspect of having autism

>> No.9843452

>>9843338
Wouldn't the extraneous info cancel out since it's effectively the same for every post?

>> No.9843462
File: 21 KB, 721x961, hr blue.png

>>9843406
blue boards are much harder to see any kind of patterns in. With the exception of the three at the top that are part of that same 'eye candy' cluster, almost all the blue boards sit in roughly the same band, just smeared out through that area.

image-to-post ratio vs posts-per-day only tells us a little bit: we can conclude that boards posting above 50% images to posts are *almost* exclusively pornographic; we can see that there may be some trends among pink boards; and almost all the blue boards exist in a big band between about 15-35%, with only a dozen or so boards deviating from it... but that's about all we can get from that.

we need more data, more parameters to look at:
how do post lengths and word counts compare between boards? are posts short and sweet? long but simple? long and thought out?
how evenly spread out are users? is it a lot of people in a lot of different discussions, or are only a handful of threads attracting a lot of posters?
etc etc etc

i think the more details we can get, the more patterns and trends we'll see develop

>> No.9843469

>>9843452
yes, but it's not going to scale uniformly because you get basically the same amount of extra info (html formatting, fonts, post numbers, image info, etc) for each post. so 20 extremely short posts might have the same total character count as five or six multi-paragraph posts, which is a problem if we're going to look at using word counts/post lengths to characterize board activity.

i did a bit more work on it and i've come up with a string pattern code that seems to purge the bulk of the extra stuff while leaving most of the actual post content intact. it's not perfect, but it should work well enough. but i've seriously got to call it a night for now.

>> No.9843475

>>9843462
Very cool anon, but uh, it makes me wonder if we are just as bad as companies that collect people’s data...

>> No.9843586

>>9842274
I know what their degrees are and I just suspect they are autists. I have no idea if they post on 4chan. That comma should have made it clear.

>>9842304
Am I really that ambiguous? In my line of work it is normal to have a PhD; where I work it is about 25%, while in some countries it would be the majority. And it is a field where a PhD really does help, and importantly, a PhD is also appreciated. People are skilled, though as I mentioned there are a lot with high-functioning autism.

>> No.9843600

>>9837467
>calling our sun "Sun"
Its name is Sol you idiot.

>> No.9843849
File: 33 KB, 1353x709, analysis_imageRatio_postLength.png

That site you are using to visualize the stats is really neat. Just found it today as well.
I took the data of the first 3 pages I gathered yesterday and compared some of the stats.

Simple, but maybe obvious example here.
The more images people are posting, the shorter their posts are on average.

"Post Length" is the # of characters in a post with HTML tags and post-number quotes (>>123456) removed, so it only counts actual written and quoted text.

>> No.9843940
File: 65 KB, 525x481, thumbs up.png

>>9843849
>Simple, but maybe obvious example here.
Don't undersell it, that's actually a really interesting result and not one that's necessarily intuitive. One might expect that there would be at least a few examples of boards in all four extremes:
>1) high image, long post
>2) high image, short post
>3) low image, short post
>4) low image, long post
Yet we see two big 'forbidden zones' in this graph, in what would be regions 1 and 3. With the exception of /f/ (a board that doesn't allow image posting) there really aren't many examples of boards where posters say little and contribute little (and, in fact, if you were to substitute flashes for images on /f/, it would probably push them into the linear regime with the other boards, removing that lone exception). And on the other side of things we see a MASSIVE empty space where one might expect to find boards that say a lot but also post lots of content.

Your plot's important because it demonstrates that there's a very strong correlation (an almost linear one at that) between how much content a board posts vs how much of an actual conversation it's probably having. More images tend to mean less conversation - that's a really neat result! Well done! I'd also be interested in hearing what approach you used for removing tags and post numbers while still being able to isolate and count each post.


This is kind of why I made this thread - not just to discuss the HR diagram idea, but the general concept of applying rigorous, scientific analysis to 4chan activity, behavior, trends, etc. There's a lot of really cool stuff we could look at and analyze, from plots like this, to looking at how posting trends change throughout the day, or the 'meme diffusion' concept >>9838864 suggested, etc etc etc.

>> No.9843942

>>9838864
I think you can find its origin if you look up "30 year old at the gym" on the /fit/ archive

>> No.9843980

>>9843849
Interesting stuff. I mainly hang out in >>>/g/cyb and >>>/sci/ and my impression is that some threads (like /cyb/) tend to run until autosage.

So why not add error bars with max/min and standard deviations?

>> No.9843994

>>9843849
I really like that /trash/ falls in almost the exact middle of the distribution, since presumably it has content that originated on multiple different boards and averages out

>> No.9844047

>>9843849
That's really cool - nice job.

>>9838864
Is it possible to create an image with some kind of 'tag' or 'tracker' that could be easily identified from searches even if the image is cropped or altered? You could make a meme and post it on a board and then track it through archives and see how it spreads and changes over time.
>Day 1 - posted Test Meme #27 on /b/
>Day 2 - TM27 has been reposted 42 times, but has yet to cross the board barrier
>Day 3 - a poster redrew TM27 with a twiddly little mustache, dubbed TM27-a
>Day 4 - 132 variants of TM27 have been reposted 8700 times across 37 boards

>> No.9844082
File: 56 KB, 1453x200, regex.png

>>9843994
>>9843980
it's not necessarily 'that' accurate though, since it's only the first 3 pages, meaning mostly fairly active threads with above-average reply counts are represented.
>why not add error bars with max/min and standard deviations?
I would need to look up how to do those things properly. I am far from being an expert when it comes to data visualization or statistics in general.

>>9843940
>I'd also be interested in hearing what approach you used for removing tags and post numbers while still being able to isolate and count each post.
I used regex to replace/remove certain text. Everything in JavaScript, since that's what I know best.

https://pastebin.com/QXSGJbrf (line 50-54)
This is an improved version of what I used for the image - the image doesn't show /p/, for example, because I hadn't removed the EXIF text yet.

The resulting JSON looks like this
https://pastebin.com/XGkfLGh6

>> No.9844097

>>9843475
Nah, none of this is personal info and it's all purely academic in interest.

>> No.9844138

>>9844082
>I would need to look up how to do those things properly. I am far from being an expert when it comes to data visualization or statistics in general.
For error bar analysis the max, min and mean is a good start. You can also use hinges to indicate how even the distribution is.
http://www.statisticshowto.com/upper-hinge-lower-hinge/

>> No.9844205

>>9844138
Alternatively, if you're already calculating means, computing a standard deviation is pretty easy to do from there. There's probably a built-in function for it in whatever you're using.
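For instance, minimal versions of the mean, (population) standard deviation, and an interpolating percentile in JavaScript, if you'd rather not rely on a library:

```javascript
// Arithmetic mean of a sample
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Population standard deviation (divide by n, not n-1)
const stddev = (xs) => {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / xs.length);
};

// p in [0, 1], linear interpolation between sorted samples
const percentile = (xs, p) => {
  const s = [...xs].sort((a, b) => a - b);
  const i = (s.length - 1) * p;
  const lo = Math.floor(i);
  const hi = Math.ceil(i);
  return s[lo] + (s[hi] - s[lo]) * (i - lo);
};
```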

>> No.9844216

>>9843600
It has no set name. People call it many different things, just like Earth.

>> No.9844255

>>9844216
I think you mean the 13th colony

>> No.9844669
File: 44 KB, 1448x768, postLength_middleThirdBar_averageDot.png

>>9843980
not sure if I did something wrong or the distribution is just fucked up across all boards.
The dot is the average post length and the bar stretches from the 34th to the 66th percentile.

I can only assume this is because of the mixing of large numbers of posts with no text with some outliers that are literal textwalls

>> No.9844688

>>9844669
How are you calculating the error bars?

>> No.9844709
File: 41 KB, 1421x755, postLength_middleThirdBar_medianDot.png

>>9844688
>How are you calculating the error bars?
I realized I am an idiot the moment I finished reading that. Mixing averages and percentiles...
Don't know why, but I did (average - 34th percentile) for the bottom error.

This one is now 33rd---50th---67th; still hard to make out anything useful in that mess

>> No.9845054
File: 9 KB, 523x686, hr automation test.png

OP here with an update - I've gotten a good chunk of the automation sorted out. Individual thread info is gonna take a while, but I've got most of the broad board statistics stuff partially working. The pic from >>9837468 took like an hour to assemble going through the object data by hand. Pic related was processed in about 40 seconds so that's a massive fucking improvement.

As you can see there are still some kinks to work out - there seem to be some issues with identifying worksafe and NSFW boards from the API data, so that's got to get fixed, and I don't have it doing any kind of statistics or normalization yet. But it's definitely progress and we see that the automated results get roughly the same shape as the manually assembled plot so that's a good sign.

>> No.9845945

>>9845054
Wow, that is a big improvement. I'm sure doing stuff like thread statistics will take a lot more time (0.5 seconds, 150 threads, 72 boards = 1.5 hours) but at the very least this means it should be possible to look at how the broader board statistics change throughout the day or in response to events and things.

It'd be interesting to see which boards are stable and which move around on the images-per-reply vs average-posts-per-day plots... who knows maybe there are even specific transition paths boards will follow as they go between different regimes.

>> No.9846024

How do you parse the site?

>> No.9846150
File: 158 KB, 1153x905, snap2.png

Creating a full snapshot right now with all pages included this time.
If everything works out as intended, I will post the data in a bit.

>> No.9846364
File: 3.27 MB, 240x320, 1475147064143.gif

>>9844709
Cool stuff. As for usefulness, there are a few things that leap out of the diagram.
- first off, image-heavy boards tend to be very brief, most postings just an image, with a few posts of significant length. Looking at it, the left part of the error bar overlaps an average length close to zero.
- as image ratio decreases, so does the variance. My hypothesis is that the short ones are reaction images, but the longer ones are when reaction images are less suitable
- averages are skewed to the left. My hypothesis 2 is that heavily left-skewed boards have a lot of poo posters.
- boards like /diy/ are VERY slow and centered. Uninteresting posts sink slowly while interesting posts live on for a long time.

So from your excellent graphs we present a few hypotheses that can be tested (with a lot of work).

>> No.9846527
File: 205 KB, 739x799, analysis_result.png

>>9846150
about 45 minutes to fetch everything with this 6-in-parallel setup, with requests from different IPs.

Full snapshot of all threads from all boards:
https://drive.google.com/file/d/1FGMRwhTOEmfM2xS2VPoRKSYrEYKfAO6D/view?usp=sharing

some data after processing the threads (with some extra data I took from 4stats like "avgPostsPerDay"):
https://pastebin.com/YqePgRMz
(does this look about right?)

The shoddy work-in-progress code I used to fetch and analyze the thread data:
https://github.com/Nocory/4chan-detailed-statistics

>>9846364
hm, the full snapshot should be more helpful now to find any interesting correlations
>boards like /diy/
yeah, the smaller boards can vary wildly in their activity from day to day.


>>9846024
the public 4chan API

>> No.9846773
File: 166 KB, 1030x219, 4chan data sample.jpg

Programming update - I have most of the data collection code worked out and have it collecting everything from average users per thread to post lengths to average reply numbers, and deviations for everything collected on a per-board basis (so stuff like Top PPM and average posts per day aren't going to have error bars).

As stated in >>9845054, I can do basic board stats in about 40 seconds, but because of the API call limit (0.5 seconds) it takes a LOT longer to do thread analysis on individual boards. Running a test case of 10 boards took about 25 minutes, so extrapolating you're looking at about 3 hours for a full sweep and there's not really any getting around that without setting up some kind of parallel data collection with proxies and stuff like >>9846527 (which is super impressive by the way, well done)

I want to try running at least one full sweep, threads and all, just to see what kind of data I can get, but I think I'm going to focus more on seeing what I can get out of the shorter sweeps - they can be done faster and still yield a lot of data. I may look at something like >>9845945 suggested - looking at how different plots evolve over time and whether there are stable and unstable boards or fixed paths that boards take to transition from one part of a plot to another.

>> No.9846823

>>9837528
this paper is actually informative and entertaining. thanks based anon

>> No.9846858
File: 202 KB, 1462x740, Screen Shot 2018-07-03 at 7.45.39 PM.png

>>9846823

>we followed standard ethical guidelines
>the rest of this paper features language likely to be upsetting.

>Although deeper analysis of these differences is beyond the scope of this paper, we highlight that, for some of the countries, the "rare flag" meme may be responsible for receiving more replies. I.e., users will respond to a post by an uncommonly seen flag. For other countries, e.g., Turkey or Israel, it might be the case that these are either of particular interest to /pol/, or are quite adept at trolling /pol/ into replies (we note that our dataset covers the 2016 Turkish coup attempt and /pol/ has a love/hate relationship with Israel).

>So-called janitors, volunteers periodically recruited from the user base, can prune posts and threads, as well as recommend users to be banned by more “senior” 4chan employees. Generally speaking, although janitors are not well respected by 4chan users

>We find that 12% of /pol/ posts contain hateful terms, which is substantially higher than in /sp/ (6.3%) and /int/ (7.3%).

they sure had fun with this research

>> No.9846860

>>9837468
I've been wanting to create a 4chan archiver for research for so long. When I learn how to properly program webcrawlers and get a few spare TBs of disk space, I will get all the stats and graphs we nerds need. Unless someone comes and does it first, sure

>> No.9846864

>>9846860
>a few spare TBs of disk space
You're going to need a LOT more than that.

>> No.9846867
File: 65 KB, 540x868, correlation.png

>>9846773
you did one request every 0.5 seconds without any problems?
The API docs said not to make more than one request per second, and I even set it to 1.1s when I fetched them, just to be on the safe side and not error out.

I've wondered before what the actual limit is and what kind of API usage the admins are OK with (actual traffic and request limits).
It would be interesting to know how the various archive sites go about fetching the data, especially when an archive covers several boards and maybe even saves images.
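For reference, the conservative 1.1s spacing described here can be packaged as a small throttling helper. This is just a sketch; the clock and sleep functions are injectable so the pacing logic can be tested without real waiting:

```python
import time

def throttled(items, min_interval=1.1, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per `min_interval` seconds -
    e.g. board names or thread URLs that are about to be fetched."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

Usage would be something like `for board in throttled(board_list): fetch(board)`, which keeps successive requests at least 1.1s apart regardless of how fast each fetch returns.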

on a side note
I tried checking for correlation between different board properties and ended up with this
https://pastebin.com/42rsWSdm
most of it is fairly obvious again, but there are also some interesting starting points, like
>[ 'imageRatio-avgPostsByPoster', '0.67' ]
the more image focused a board, the more often a user posts in the same thread repeatedly (image dumping threads)
with outliers like /mlp/, /vg/, and /qst/, where people post repeatedly without uploading many images at the same time

>>9846858
>We find that 12% of /pol/ posts contain hateful terms, which is substantially higher than in /sp/ (6.3%) and /int/ (7.3%).
text analysis of posts would be interesting to do as well on the thread data
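For anyone wanting to reproduce a table like the pastebin above, a plain Pearson correlation over per-board value pairs is all it takes; a sketch below, with made-up board numbers purely for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# hypothetical per-board pairs: (image ratio, avg posts by poster)
boards = {"/wg/": (0.95, 6.0), "/a/": (0.30, 2.1), "/sci/": (0.15, 1.8)}
xs, ys = zip(*boards.values())
r = pearson(xs, ys)  # one entry of an imageRatio-avgPostsByPoster style table
```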

>> No.9846869

>>9846864
Yeah, I planned on just saving a few hours' worth of complete 4chan history - it's not like I could afford to go much further than that without spending thousands of dollars on storage. Maybe do the research on those hours, dump everything (saving the data/most used words/most evolved pictures/trends and flavors of the day), and fetch shit again to make new research. We could play with some Neural Network and make it a piece of shit, or discover trendy pastas or whatever

>> No.9846897

>>9846867
The API's GitHub page says the limit is 2 requests per second from a single IP, but you can also do batches of 20 (don't know what the cooldown is for that though)

>the more image focused a board, the more often a user posts in the same thread repeatedly (image dumping threads)
Makes sense - most threads that involve lots and lots of image posting are going to be things like porn dumps, rekt/ylyl/reaction threads, storytimes, etc etc where one person or a few people post a lot of content and people lurk

>> No.9847564

>>9846858
That table would be very sensitive to the time of day of the snapshot. National holidays would also skew things.

You could do a time of day combined with day of week analysis. The sky really is the limit here.

>> No.9847758

>>9846897
are you mixing things up, or are we just talking about different APIs?

2 requests per second and 20 burst is the limit I set for the 4stats.io API.
The 4chan.org API on the other hand has a 1 request per second limit.

>batches of 20 (don't know what the cooldown is for that though
Basically think of it as some kind of overheat meter, that refuses requests if it goes above 20, but cools at a rate of 2 per second :p
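That "overheat meter" is a textbook token bucket. A minimal model of it (capacity 20, cooling at 2 per second per the description above, with an injected clock so the behavior is testable):

```python
class TokenBucket:
    """Token bucket matching the described limits: bursts of up to 20
    requests, refilling ("cooling") at 2 tokens per second."""

    def __init__(self, capacity=20, refill_per_sec=2.0, now=0.0):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # add tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```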

>I can do basic board stats in about 40 seconds
are you sending one request per board to 4stats, when assembling the data?
The /allBoardStats endpoint should give you everything you need in a single one.
https://api.4stats.io/allBoardStats
>but because of the API call limit (0.5 seconds) it takes a LOT longer to do thread analysis on individual boards.
So are you using the 0.5s delay for both 4stats and 4chan API requests?

>> No.9847894

>>9847758
Yes, I did have those request limits mixed up. I've been calling both with a half-second delay between requests - which, since each board takes about 2.5 minutes to complete, means the 4chan API is only delivering on the requests every 1 second regardless of my delay. Do you have a link to the docs for the 4chan API?

Calling the 4stats allBoardStats endpoint gets you a ton of data, and you can analyze it all and produce plots and stuff in under a minute, easily (see >>9845054), but as far as I can tell there's only certain types of data collected under that. You can get stuff out of it like posting averages, image reply rates, etc, but getting more detailed info like average post length or users per thread appears to require looking at individual threads, which is why the process takes so goddamn long.

I'm sure there's a better way, but I haven't found one yet (at least not one that's within my programming ability).

>> No.9847915

>>9843849
>The IQ axis
>/co/ is even below /hr/

LOL

>> No.9848042
File: 21 KB, 826x201, size.png [View same] [iqdb] [saucenao] [google]
9848042

>>9847894
4chan API docs are here https://github.com/4chan/4chan-API

>but as far as I can tell there's only certain types of data collected under that
>getting more detailed info like average post length or users per thread appears to require looking at individual threads
yeah, it's only the data that can be calculated from the catalog alone. It would be great if at least 'unique_ips' was part of the catalog as well.
The site is supposed to provide (semi-)live stats, and taking 3 hours to update would be way too long. Currently one full board update cycle is 5 minutes.

Though I store some data long-term to calculate daily peak posts/minute and to use the data for the timeline graph at the bottom of the page.
From that it would be possible to calculate some other interesting things, like volatility of a board or the time of day it's usually most active.
How to load historic data is also detailed in the API docs if you are interested.
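As one example of what that stored history enables, "volatility" could be defined as simply as the coefficient of variation of the posts-per-minute series (my own definition here as a sketch, not anything 4stats actually computes):

```python
from statistics import mean, pstdev

def volatility(ppm_history):
    """Standard deviation of a board's posts-per-minute history relative
    to its mean: near 0 for steady boards, large for bursty ones."""
    m = mean(ppm_history)
    return pstdev(ppm_history) / m if m else 0.0
```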

>I'm sure there's a better way
I mean, there isn't really any alternative to just going through all boards and fetching every thread.
The best thing to do would probably be some kind of centralized server setup that continuously creates board snapshots and makes the raw and extracted metadata available for download.
That way you, I, or someone else wouldn't need to waste hours loading the same data again.

The size of the raw snapshot for me is 352MB, the extracted metadata (without full-text comments) is only 9.1MB, and the result of the analysis of the metadata ends up being only 100KB.
So having a way to just directly load 1.2MB of zipped metadata would be a huge time and space improvement.
And even if someone wanted to do analysis of all post comments (let's say you wanted to track the spread of the word "boomer" across different boards), that would only be another 25MB zipped download.

>> No.9848113

>>9848042
I'm running my first attempt at a full sweep and already I've run into a few issues with the higher traffic boards - threads are disappearing faster than they can be requested (even reversing the thread order to try to snag the threads at the bottom before they disappear isn't foolproof), so when all is said and done and the sweep finishes in another hour or so, this will probably be a lot of incomplete data.

Like I said - long term I think I'm going to focus on *just* the boardstats data. It doesn't have statistics on users, replies, etc but it still has potentially useful statistics and it doesn't take three hours to collect data from it.

>> No.9848154

>>9848113
same thing for me.
Starting from the last thread of the last page, I only got 135/150 /b/ and 192/202 /pol/ threads, for example.
My guess is that most of the missing ones are due to manual mod deletion though.

>> No.9848176

>>9848154
That's true, I hadn't even thought about thread deletions - that could be a major issue, especially on boards with a lot of active mods and janitors. The problem really comes down to the call limit for the API - 1 second between threads means you're looking at 2 1/2 minutes (minimum) per board... most boards tend to update every 30 or 60 seconds, so that's 2-5 board updates, plus any deletions taking place... you could be losing dozens of threads on high traffic boards.

My biggest concerns with this at the moment are (a) the possibility that this will skew results significantly and (b) Mathematica will automatically disable a command if it receives null or unexecutable inputs too many times (a feature designed to stop the program from getting stuck in broken loops and the like), so there's a concern that if too many threads are removed it may just stop collecting data for the board altogether, which means a big waste of time for no results. I've already gotten several error messages about it trying to do statistical calculations on null data sets, which means it's not collecting data on certain threads or possibly even whole boards.

>> No.9848253

>>9848176
Speak of the devil - the sweep just finished. Full run time was about 2:40, a little shorter than expected, but that may be because some boards terminated prematurely: I wasn't able to get details for about a dozen boards including /int/, /trv/, /n/, /x/, /ic/, etc, so yeah, I don't think doing full site sweeps like this is a very viable option.

That said - a few highlights from the results of the sweep:

1) There seems to be no real correlation between board activity (Avg PPD) and the number of users per thread; nearly all of the boards (with only a handful of exceptions) tend to average about 20-60 posters per thread. The standard deviation for posters per thread is similarly narrow, with almost every board having a deviation of about +/- 30-40 posters around its respective mean.

2) Average replies per thread seem to follow a sort of bell curve on a log scale. Low traffic boards don't get many replies and high traffic boards probably have short attention spans and threads are quickly abandoned once they get to more than a couple dozen posts. The sweet spot seems to be boards with a few thousand posts per day, many of which get up into the 100s for thread length. These threads also have the highest variance in thread length. The exception is /vg/, which is not only extremely high traffic, but also has nearly all of its threads reaching capacity.

3) It should come as no surprise, but the porn boards have the highest image-per-user rate of the site. /c/ posts an average of 4.1 images per user, and most of the other porn boards range from 2-3 IPU. /lit/, /sci/, /fit/, and /adv/ have some of the lowest IPU counts of the site.

4) /qst/ has the highest average post length per user on the site by a WIDE margin, coming in at about 1050 characters per user. The two runners-up are /mlp/ and /tg/ at about 660 and 340 CPU respectively. Not surprisingly, the porn boards are also at the bottom of the CPU counts, with many of them at sub-100 characters per user

>> No.9848257

>>9848176
can't you just check if the response is valid before passing it on to be analyzed?
It's also worth it imo to add some retry functionality. Very rarely the API doesn't respond in time in my experience, but then immediately sends the data just fine on a second attempt.
>(a) the possibility that this will skew results significantly
Threads that get deleted probably didn't have time to get very big in the first place
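A sketch of that validate-and-retry pattern (the opener is injectable for testing; `None` signals a thread that is really gone, e.g. deleted mid-sweep, so the caller can skip it instead of feeding nulls to the analysis):

```python
import json
import time
import urllib.request

def fetch_json(url, retries=2, delay=1.1, opener=urllib.request.urlopen):
    """Fetch and parse a JSON endpoint, retrying on network errors or
    unparseable payloads; returns None once every attempt has failed."""
    for attempt in range(retries + 1):
        try:
            with opener(url) as resp:
                return json.loads(resp.read())
        except Exception:
            if attempt < retries:
                time.sleep(delay)
    return None
```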

>> No.9848279

>>9848257
It's doable, I'm just not sure it's worth the time or effort - as >>9848154 pointed out, some of these boards will lose dozens of threads in the time it takes to sweep through them, and if you're taking time to do a second call every time one fails to register, that time adds up fast - we're looking at 2 1/2 to 3 hours to sweep the site, and adding double-checks could easily push that into 3 1/2 hour territory or higher. Being able to look at detailed statistics on users, images, replies, and post lengths for threads on an individual board is great... but does it justify the ~15000% increase in runtime?

>Threads that get deleted probably didn't have time to get very big in the first place
Or they're big threads that got autobumped before there was time to analyze them. There's no way to tell.

>> No.9848661

>>9837467
You dumb nigger, everyone knows that 4chan is /pol/ + hobby boards, with the mentally ill and pedos congregating around /b/. This is why I hate /sci/, it's all glorified priests in white lab coats

>> No.9848772
File: 42 KB, 1483x833, nigger_jew_occurence.png [View same] [iqdb] [saucenao] [google]
9848772

I have successfully determined that boards that talk about "jew"s also mention "nigger"s more frequently.
Interesting to note that, at the extremes, /b/, /an/, /gif/ and /k/ tend to talk more about "nigger"s, while /biz/ and /lit/ seem to focus more on "jew"s.

>> No.9848873

>>9848772
log scales make more sense for this sort of stuff imo

>> No.9848932

>>9848661
t. '16er

>> No.9849476

bump

>> No.9849572
File: 181 KB, 339x359, comfy.png [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9837560
>That's fucking hysterical and I have a new reaction image now. Thank you anon.
Me too

>> No.9849714 [DELETED] 
File: 1.25 MB, 1390x1083, west_europe_board_distribution.png [View same] [iqdb] [saucenao] [google]
[ERROR]

>Boards line up with the UK and France
>Nearly all of the porn is in the UK
>/b/, /pol/, and /v/ are on the German border
What does it mean?

>> No.9849730

>>9848772
this is fucking hilarious

>> No.9849740

>>9848772
I'd just graph N:J ratio against total frequency. I'd also like to see the most unique word for each board, not that I'm sure how you'd figure that out. Something like [math]\max_w \left( \frac{\mathrm{count}_{w,\,\text{board}}^2}{\mathrm{count}_{w,\,\text{total}} \cdot \mathrm{count}_{\text{all words}}} \right)[/math] should work.
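One reading of that formula in code, treating the "all words" count as the board's total word count (which makes it TF-IDF in spirit); the counts below are invented for illustration:

```python
from collections import Counter

def most_unique_word(board_counts, total_counts):
    """Rank words by count_board^2 / (count_total * board_word_count)
    and return the highest scorer for the board."""
    board_size = sum(board_counts.values())

    def score(w):
        return board_counts[w] ** 2 / (total_counts[w] * board_size)

    return max(board_counts, key=score)

# hypothetical counts: "sneed" is rare site-wide but concentrated on one board
board = Counter({"the": 1000, "sneed": 50})
total = Counter({"the": 100000, "sneed": 60})
```

The squaring in the numerator is what rewards concentration: a word that is common everywhere scores low even if the board uses it a lot.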

>> No.9849832

>>9838219
also it is accessible in https://4stats.moe/

>> No.9849836

>>9844047
Images can be tracked even if slightly altered via their perceptual hash, which is how reverse-image search engines work. See https://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html for an example.
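The simplest variant from that article is the average hash. Here's a pure-Python sketch operating on an already-grayscale 2D pixel array (a real pipeline would do the downscaling with something like PIL's `Image.resize` first):

```python
def average_hash(gray, size=8):
    """Average hash ("aHash"): shrink a grayscale image to size x size via
    nearest-neighbor sampling, then emit one bit per cell (1 if brighter
    than the mean). Near-duplicate images land a few bits apart."""
    h, w = len(gray), len(gray[0])
    cells = [gray[y * h // size][x * w // size]
             for y in range(size) for x in range(size)]
    avg = sum(cells) / len(cells)
    return sum(1 << i for i, v in enumerate(cells) if v > avg)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Comparing `hamming(h1, h2)` against a small threshold (the article suggests ~5 out of 64 bits) is what makes this robust to recompression and minor edits.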

>> No.9849882

>>9849740
That may prove a bit more challenging than you think - counting words isn't the same as counting numbers; you have to take case sensitivity, spelling errors, etc. into consideration.

>> No.9849947

>>9848772
like >>9848873 said - recast this on a log/log plot

>> No.9850425

>>9848873
>>9849947
Except that there's one point that appears to be at (0, 0). I wouldn't log that.
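An exact zero does break a plain log axis, but mapping counts through log(1+x) (or using matplotlib's 'symlog' scale) keeps zero-count boards plottable while still compressing the spread:

```python
import math

def log1p_scale(values):
    """Map counts through log(1 + x): zeros stay at 0, ordering is
    preserved, and orders-of-magnitude gaps are compressed for plotting."""
    return [math.log1p(v) for v in values]
```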

>> No.9850449

>>9850425
I refuse to believe there's a single board on 4chan without at least one mention of either word... if there is, it's a momentary fluke.

>> No.9850490
File: 69 KB, 1483x832, jn.png [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9849947
>>9848873
here is some data with log scale, though it's not the exact same dataset. This one now is from ~30 minutes ago

The previous one showed how much of a user's posts consisted of the word on average.
This one charts the ratio of posts that include either word at least once.
(with possible false positives when someone is talking about "jewels")
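Whole-word matching kills the "jewels" class of false positives; a sketch of the post-ratio metric with that fix:

```python
import re

def mention_ratio(posts, word):
    """Fraction of posts containing `word` as a whole word,
    case-insensitively - so "jewels" no longer counts as "jew"."""
    pat = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(1 for p in posts if pat.search(p)) / len(posts) if posts else 0.0
```

The trade-off: strict `\b` boundaries also stop counting variants like "jewish", so a real keyword list probably wants explicit alternations per word.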

>> No.9850502

>>9850490
>with possible false positives when someone talking about "jewels"
haha, that would explain /cgl/
and maybe /tg/, /his/ and /lit/

>> No.9850550
File: 181 KB, 1307x846, boomerPosts.png [View same] [iqdb] [saucenao] [google]
[ERROR]

/biz/ and /lit/ surprisingly taking the lead in percentage of posts with mentions of "boomer"

also of all the boards only /f/ and /r/ had no mention of "jew"

/c/ was the only board with zero occurrence of the word "nigger"

>> No.9850597

>>9850550
you mean in threads right now or does it go through the archives too?
there's no way no one on /c/ ever told someone their waifu takes bbc

>> No.9850605

>>9850597
yeah, just the current snapshot of threads that are in the catalog right now

>> No.9850609

>>9850597
>>9850605
we would have a more comprehensive data set if samples were taken over multiple days

>> No.9850727
File: 90 KB, 686x732, snapper3.png [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9850609
yeah, that would be ideal. Some kind of history where you could take the average over the last day or compare day/night and weekday/weekend stats.

I set up a server to continuously generate snapshots, though right now it only stores the latest one and doesn't keep any older data.
A full sweep of all boards also doesn't take too long (80-90 minutes), since a lot of threads on the slower boards don't change between catalog checks and therefore don't need to be requested again.
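That skip-unchanged trick presumably keys off the per-thread `last_modified` field in the catalog. A sketch of the bookkeeping (the field names follow the 4chan threads.json layout, but treat the helper itself as hypothetical):

```python
def changed_threads(catalog, seen):
    """Return thread numbers whose last_modified moved since the previous
    sweep; `seen` maps thread no -> last_modified and is updated in place.
    Assumes catalog pages carry a per-thread last_modified field, as the
    threads.json endpoint does."""
    changed = []
    for page in catalog:
        for t in page["threads"]:
            if seen.get(t["no"]) != t["last_modified"]:
                changed.append(t["no"])
                seen[t["no"]] = t["last_modified"]
    return changed
```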

I added some work-in-progress API endpoints to 4stats.io
https://api.4stats.io/snapshotMetaAnalysis
https://api.4stats.io/snapshotTextAnalysis
that data will automatically be updated every time the server finishes processing another board

Probably going to try to add it to the site, so people can play around with it and compare different stats themselves and see if they find something interesting.

>> No.9850785

>>9850727
Including this kind of meta data in the API endpoint is EXTREMELY helpful! Even having just basic statistics on average reply lengths, OP lengths, poster counts, etc opens up so many possibilities to compare, and having it done through the site itself means that work that would have taken hours or days now takes minutes.

Bless you based 4stat anon.

>> No.9850958

I think it would be interesting to look at meme'd etiquette (if you can call it that). For instance, the prevalence of "reddit spacing"

>> No.9851111

>>9850958
Or a search for copypastas.

>> No.9851115

>>9851111
or trigger words? for instance, if sneed is mentioned in a /tv/ thread, that should increase the chance of the thread being deleted.

>> No.9851144

>>9850958
eh, most people complaining about "reddit spacing" are just complaining about how pre-2010 forum posts were sometimes formatted

>> No.9851358

>>9851115
Or just what words get the most replies on each board. Make an OP-generating machine learning algorithm to optimise the most replies per thread.

>> No.9851363
File: 5 KB, 251x240, 1326504610415.jpg [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9851358
>Make an OP-generating machine learning algorithm to optimise the most replies per thread.

>> No.9851370

>>9851358
a smart algorithm would post a baity but shit OP and then five posts later post one of those roll for a waifu derail images. instant 300 replies

>> No.9851580

>>9848772
>/gif/ being this high
cracked me up

>> No.9851763

>>9850449
It would be a typical 4chan thing to keep the language clean just to blow up some statistics.

>> No.9852002

>>9851763
>inb4 /pol/ goes completely clean for a day just to fuck with us

>> No.9852253

>>9852002
They would blow their communal gasket trying but they would do it.

>> No.9852407

Bump
this thread is great

>> No.9853100

>>9851580
surprised /lit/'s as far as it is

>> No.9853108
File: 404 KB, 1272x1800, the tumult of niggers.jpg [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9853100
its probably due in part to this

>> No.9853582
File: 191 KB, 1243x507, comments.png [View same] [iqdb] [saucenao] [google]
9853582

>>9850727
working on getting the history functioning, but it's a bit more complicated than expected, as these things usually are.
Storing the metadata is no problem (~8MB per snapshot), but I was wondering whether it would be worth it to also keep thread comments for a while, so it's always possible to go back and check for certain new words or properties without first waiting a few days for new data to be gathered.
Storing comments is going to take a lot more space though (~90MB per snapshot).
Of course it would be possible to save space by de-duplicating the comment data, since many comments exist in multiple successive snapshots - you just store a comment once and then keep a reference to it in each snapshot.
Though I really like the concept of having full snapshots with all comments in human-readable form, where you can just copy one file and immediately have everything, instead of querying a database to generate the comment snapshot from the stored post-number references.
Assuming a full snapshot of meta+text data is 100MB, with ~1 snapshot per hour, that would end up being
100MB * 24 * 7 = 16.8GB for one week, which is quite a bit and could easily reach the limit of the disk, as everything is just running on a small 25GB SSD virtual server.
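The de-duplication idea is content-addressed storage in miniature: store each comment body once under its hash, keep per-snapshot reference lists, and materialize a human-readable snapshot on demand. A sketch:

```python
import hashlib

class CommentStore:
    """De-duplicated comment storage: each unique comment body is kept
    once, keyed by hash; a snapshot is just a list of hashes."""

    def __init__(self):
        self.bodies = {}

    def add_snapshot(self, comments):
        """Store a snapshot's comments, returning its reference list."""
        refs = []
        for c in comments:
            h = hashlib.sha1(c.encode("utf-8")).hexdigest()
            self.bodies.setdefault(h, c)
            refs.append(h)
        return refs

    def materialize(self, refs):
        """Rebuild a full human-readable snapshot from stored references."""
        return [self.bodies[h] for h in refs]
```

Since hourly snapshots of a slow board overlap almost completely, the per-snapshot cost collapses to the reference list plus whatever comments are actually new.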
>>9853100
/lit/ also had a /boomercore/ thread at the time I checked the comments for >>9850550
which alone made up a good chunk of the mentions.
The data is only from a single point in time and varies quite a bit between checks. I wouldn't read too much into it until it's an actual average over at least a few days.

>> No.9853675
File: 1.00 MB, 2000x2000, 1530732809931.png [View same] [iqdb] [saucenao] [google]
9853675

>>9837467

>> No.9853698

>>9853675
[citation needed]

>> No.9853721

>>9853675
That graph is impossible

>> No.9853763

>>9853582
Trying to save full snapshots for text analysis is admirable, but as with any potential rabbit hole you might go down I find it's important to ask yourself three basic questions:

1) How much extra work does this involve?
2) What can you get out of this that you can't get out of other, simpler methods?
3) Do the results justify that extra work?

I might be wrong, but it sounds like this is a LOT of extra work, storage, etc for something with only a few applications, many of which can be accomplished with less effort and space just by summarizing the data into posting statistics (post length, use of keywords, etc) as opposed to saving a complete snapshot of every post on the site. That might be oversimplifying or misrepresenting things, and if it is, then I think it's important to lay out what else can be done with this data that can't be done another way, and whether it's worth it.

>> No.9853803
File: 1.71 MB, 374x219, 1489305990504.gif [View same] [iqdb] [saucenao] [google]
9853803

Not him, nevertheless:
>>9853763
>1) How much extra work does this involve?
Probably a bit, but not too overwhelming

>2) What can you get out of this that you can't get out of other, simpler methods?
This is /sci/: hard facts and measurements with error bars trump opinions any day.

>3) Do the results justify that extra work?
Yes.

Stats-anon is doing a good job and I salute his efforts. Data science is a major buzzword these days and his work is a lot better than a lot of the other gunk I see passed off as "science." It is also an effort that involves /sci/ and provides an excellent opportunity to think, not just about the results but also about the methods and the implications, the conclusions and also the pitfalls, about who we are in here and how much variation there is.

>> No.9853939

>>9853803
Don't get me wrong, I'm not saying this whole thing isn't a worthwhile endeavor - I'm talking specifically about the problems posed by archiving and analyzing full text snapshots of the site - as multiple people working on this have pointed out it takes GB of storage and hours of processing to do real, in-depth analysis of stuff.

As with any experimental method, or theoretical or computational analysis, you get to a crossroads where you have to ask yourself "am I trying to do too much work for too little payoff?" You don't do laser-induced fluorescence to analyze a plasma if a Langmuir probe will do well enough, and you don't fill up a 25 GB SSD server with endless text data if just storing summarized data points is enough.

>> No.9854489

bump

>> No.9854610

>>9853939
>Don't get me wrong, I'm not saying this whole thing isn't a worthwhile endeavor - I'm talking specifically about the problems posed by archiving and analyzing full text snapshots of the site - as multiple people working on this have pointed out it takes GB of storage and hours of processing to do real, in-depth analysis of stuff.
GB storage was hard 20 years ago, but now disks are in the TB range. And hours of processing - well, that's only while prototyping. I cannot see this being a problem.

>As with any experimental method, or theoretical or computational analysis you get to a crossroads where you have to ask yourself "am I trying to do too much work for too little payoff.
Sometimes plain exploration is the goal in itself. As I believe it is here.

>You don't do laser induced fluorescence to analyze a plasma if a langmuir probe will do well enough, and you don't fill up a 25 GB SSD server with endless text data if just storing summarized data points is enough.
I disagree with the comparison - the tech you describe is mature. In the early days it would have made sense, if only to check that the two methods gave comparable results. In fact, the use of multiple methods is a sign of healthy scepticism.

And all this gives us a good discussion and more insight.

>> No.9854943

>>9853582 again
turns out that processing a week of comments isn't actually that trivial.
With some boards like /v/ or /pol/ there may be ~850,000 comments to go through.
I've only tested it briefly, but the most time-consuming part seems to be fetching the items from the disk/database.
It still looks doable though, even if it takes a moment.

There is also a tricky issue in how to best go about analyzing a board's comments.
Just using comments that were posted during a certain time window may not work for slower boards, as it wouldn't be enough to get a good picture (lots of boards get less than 1k posts a day).
But simply taking what's currently visible also doesn't seem ideal, as fast boards change way too quickly to yield results with any consistency.
A good compromise may be to take what's currently visible on a board + comments from the last week
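That compromise - currently visible posts plus anything seen in the last week, de-duplicated by post number - could look like this (the `no`/`time` field names follow the 4chan API's post objects, but the helper itself is hypothetical):

```python
def analysis_window(visible, recent, now, max_age=7 * 24 * 3600):
    """Union of currently visible posts and recent history, de-duplicated
    by post number; `visible` and `recent` are lists of post dicts with
    `no` (post number) and `time` (UNIX seconds) fields."""
    posts = {p["no"]: p for p in recent if now - p["time"] <= max_age}
    posts.update({p["no"]: p for p in visible})  # visible copy wins on overlap
    return sorted(posts.values(), key=lambda p: p["no"])
```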

>> No.9855097

Working on processing about half a dozen different figures right now based on the new metadata JSON >>9850727

The plots range from image replies vs posting activity, to replies per thread vs OP length, image responses vs text responses, etc. I'll try to get the framing and labeling and shit done in the next half hour. If anyone has any requests or suggestions post them and I'll get to them if I have time.

>> No.9855137
File: 30 KB, 640x1033, 4chanhr-imagesvsppd.png [View same] [iqdb] [saucenao] [google]
9855137

Dumping. Some of these show some interesting trends, others are pretty obvious, but still nice to confirm. Starting with an updated version of the OP figure.

>> No.9855138
File: 28 KB, 960x733, 4chanhr-rwivsrwt.png [View same] [iqdb] [saucenao] [google]
9855138

>> No.9855140
File: 28 KB, 640x992, 4chanhr-rptvsppt.png [View same] [iqdb] [saucenao] [google]
9855140

>> No.9855145
File: 28 KB, 640x995, 4chanhr-rptvsopl.png [View same] [iqdb] [saucenao] [google]
9855145

>> No.9855147
File: 31 KB, 640x1023, 4chanhr-aplvsppd.png [View same] [iqdb] [saucenao] [google]
9855147

>> No.9855148
File: 27 KB, 640x985, 4chanhr-aplvsppp.png [View same] [iqdb] [saucenao] [google]
9855148

>> No.9855173

>>9851358
>in the near future:
>12 SQT threads

>> No.9855275

>>9855137
>>9855138
>>9855140
>>9855145
>>9855147
>>9855148
Man, the porn/lewd boards stick out like a sore thumb in basically every plot.

>> No.9855632

>>9855137
Excellent stuff, anon.

>> No.9855825

Coolest thread in all 4chan

>> No.9855827

This thread is golden; any self-respecting mod would encourage it by making it a sticky, along with subsequent threads like this.

It's a shame that this board has been largely unmoderated since 2014

>> No.9856031

>>9855137
>>9855148
These two are particularly interesting in that there actually seem to be large-scale structures - isolated groups, large bands with branches extending out

>> No.9856225

made some progress, but the code is a total work-in-progress clusterfuck until I find time to clean it up.
I hope that everything is accurate, but can't guarantee it 100%. Will just go back tomorrow and see if everything still looks alright, especially the day-averages.

https://api.4stats.io/textAnalysisLastDay
https://api.4stats.io/metaAnalysisLastSnapshot
https://api.4stats.io/metaAnalysisLastDay

"textAnalysisLastDay" is the result of all currently visible posts + any other posts from the last 24 hours that the server saw (even if they are no longer on the board)
"metaAnalysisLastSnapshot" is just the last snapshot result
"metaAnalysisLastDay" is the average of the individual snapshot results from the last day

>> No.9856255
File: 150 KB, 500x478, a5bcb1b5ceadb86013877ad7956179cf40b9a49809aa014236999595742eb1ac.jpg [View same] [iqdb] [saucenao] [google]
9856255

OP was not a faggot indeed

>> No.9856377

>>9853675
>/mu/
>114

>> No.9856395
File: 99 KB, 424x420, 1530309839938.png [View same] [iqdb] [saucenao] [google]
9856395

now also as .csv files instead of json

https://api.4stats.io/csv/textAnalysisLastDay
https://api.4stats.io/csv/metaAnalysisLastSnapshot
https://api.4stats.io/csv/metaAnalysisLastDay

>> No.9856436
File: 45 KB, 1290x747, reddit_boomer.png [View same] [iqdb] [saucenao] [google]
9856436

no reason this has to be a scatter plot, but anyway

/tv/ is far ahead of every other board when it comes to the ratio of posts mentioning "reddit", at ~1.1% of posts
/biz/ has almost 1.7% of all current posts mentioning "boomer", with /fit/ following close behind

>> No.9856469

>>9856436
What's the point size based on?

>> No.9856473

>>9856469
average posts per day for the board

>> No.9856479

>>9856473
Ooh, nice, that's an interesting way of incorporating a third axis into a 2D plot, I might try that on some of mine.
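In matplotlib that third axis is just the `s=` argument of `scatter` (marker area in points²); a sketch with invented board numbers:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# hypothetical (x%, y%, posts-per-day) triples per board
boards = {"/tv/": (1.1, 0.4, 80000), "/biz/": (0.3, 1.7, 30000), "/sci/": (0.2, 0.3, 9000)}
xs, ys, ppd = zip(*boards.values())
sizes = [p / 200 for p in ppd]  # scale activity down to a readable marker area

fig, ax = plt.subplots()
ax.scatter(xs, ys, s=sizes, alpha=0.5)
for name, (x, y, _) in boards.items():
    ax.annotate(name, (x, y))
ax.set_xlabel('% of posts mentioning "reddit"')
ax.set_ylabel('% of posts mentioning "boomer"')
fig.savefig("bubble.png")
```

Since `s` is an area, scaling it linearly in posts-per-day exaggerates big boards less than scaling the radius would; pick the divisor to taste.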

>> No.9856505

>>9856436
You might want to add in synonyms like "pleddit."

>> No.9856574

>>9837528
>un funded
>/pol/
your tax-dollars ladies and gentlemen.

>> No.9856629

>>9856574
Missing from the "analysis" is the question of how much is hate and how much is basement edginess. After all, for all the words slung around, we have a lot of posts from Israel in the /watch/ generals over in /g/, showing berets and Hebrew keyboards - people who don't seem to be scared away.

>> No.9856718

>>9856629
yeah, "what a nigger" has surpassed hate and is now in the realm of "what a jerk", at least on 4chan & other internet communities

>> No.9856737

>>9856718
>"what a nigger" has surpassed hate and is now in the realm of "what a jerk", at least on 4chan
it's been like that since the early days; we've even had "nigger" and "faggot" wordfiltered at times due to their constant, near-universal use as a generic insult or pejorative

>> No.9856813

>>9856505
I guess I could just try "eddit" instead. That should catch everything.

>> No.9857125

>>9856813
>plebbit

>> No.9857255

>>9856436
I think a logarithmic plot might be better suited for this

>> No.9857284

anyone have ideas in terms of how they want all these charts presented?
I've got a pirated copies of Photoshop and illustrator that i'm dying to use

>> No.9857309

>>9857284
1) go to your library and check out everything by Ed Tufte
2) read
3) do what he says to do in the books

>> No.9858341

>>9857255
seconding - anytime you're looking at a difference of an order or two of magnitude along an axis you should consider a log scale

>> No.9858894
File: 79 KB, 1315x828, chart.png [View same] [iqdb] [saucenao] [google]
9858894

Spent some time working on a page to interact with the stats directly on the site.
It automatically loads the latest available data from the snapshot API.

All work in progress and still needs visual polish, but works quite alright already.
Play around with it, if you want.

https://4stats.io/snapshotAnalysisWorkInProgress

>> No.9859022

>>9858894
now this is podracing

>> No.9859156

>>9858894
God damn dude, that's fucking impressive.

>> No.9859242

>>9855137
>>9855138
>>9855140
>>9855145
>>9855147
>>9855148
>>9856436
>>9858894
This is fantastic work guys!

>> No.9860022

>>9857284
Almost any plotting tool should be fine - I think it's more a matter of working out what parameters give us the best "description" of 4chan board activity.

>> No.9860059

>>9858894
this is beyond great work dude

>> No.9860281
File: 55 KB, 768x768, eeyore.png [View same] [iqdb] [saucenao] [google]
9860281

If you compare overall posting activity to posts per user, it looks like Eeyore. Is this our equivalent of Ellis' quantum penguin diagrams?

>> No.9860376
File: 52 KB, 789x777, s4s.png [View same] [iqdb] [saucenao] [google]
9860376

>>9851763
>>9852002
or [s4s] doing the opposite
>>>/s4s/6892903

it's not really affecting the % of post-mentions, but it's visible in the text ratio

>> No.9860512

>>9860376
That's pretty funny actually.

>> No.9860525
File: 46 KB, 1600x1200, 4chan stats board mention.png [View same] [iqdb] [saucenao] [google]
9860525

>>9837467

>> No.9860532

>>9860525
Cool. Interesting thought though, you could try organizing the boards by some other statistic (ex. posts per day or average post length) instead of alphabetically and see if there are any contours.

>> No.9860575

>>9860525
Hey OP, do text and post ratios for mentions of each board. There might be interesting correlations between mentions of /pol/ and mentions of /b/ for instance

>> No.9860596

>>9860525
what is up with the linking of /m/ on /soc/?

>> No.9860608
File: 42 KB, 562x437, 1526056722812.jpg [View same] [iqdb] [saucenao] [google]
9860608

>>9860525
>/r/
>>>/r/eddit
lmao

>> No.9860679
File: 332 KB, 1426x757, scilab.png [View same] [iqdb] [saucenao] [google]
9860679

>>9860532
This was made by another /sci/ anon a few years ago. It should be trivial to recreate and improve on it though. The difficulty seems to be in collecting data.

>>9860575
Interestingly enough, as I recall the anon originally posted this for that purpose. Specifically it was to point out how all the boards consistently tell /pol/esmokers to go back to their containment board. It was a /sci/ meta thread talking about all the /pol/ spammers and Moot participated as well (and for the next several days he had /pol/ playing some cuckold training videos and shit nonstop).

>>9860596
It's actually caused by all the lonely males desperately responding to 'a/s/l' requests.

>>9860608
lol, board mentions are a pretty good containment board detection heuristic.

>>9856225
It's interesting to see people putting effort into this sort of thing. Too bad it looks like there are /pol/esmoker elements behind the wheel given the keywords chosen.

>>9848042
>>9847894
Have you two considered using something like AWS for the data collection (to deal with the requests-per-IP limit)? Alternatively, 4chan actually goes out of its way for special cases (e.g. board archival sites), so another option would be asking the 4chan admins if they'd be willing to provide a better way to obtain the data so that we don't have to hammer their API.

Perhaps at some point I'll join in depending how this progresses. I still remember the early days when /sci/ was young and we wanted to have some interesting projects of our own.

>> No.9860774

>>9860679
I'm starting a new job next week, but once I've settled into the new rhythm I might try to implement this - it could be an interesting little side project.

>> No.9860810

>>9837467
It should include (maybe with colours) the level of shittiness. For example:
Black: No originality and toxic as fuck (/b/)
White: Lots of fun/original/interesting content. (Perhaps one of the topic-specific boards? I don't know all the boards, so I wouldn't be able to say.)

>> No.9860826

>>9860679
>Too bad it looks like there are /pol/esmoker elements behind the wheel given the keywords chosen.
I just picked some quick words that the average person might think are common on 4chan as a whole, but as the stats show, some of them are really just found on /pol/ and not much outside it.
There was no intention to create some kind of edginess ranking, but I think I'll make a new list of words tomorrow.

>something like AWS for the data collection (to deal with the requests per IP limit)?
hm, maybe
Currently I use 3 small VPS for everything.
1 for the live stats (loading catalogs during fixed 5 minute cycles), 1 for making the snapshots (60-80 minutes for all boards+threads at 1 request/second) and 1 for the API server, so that clients don't directly connect to the stats gatherers.
> Alternatively, 4chan actually goes out of its way for special cases
Is there any info on this somewhere?
It would of course be neat to not be strictly limited by the 1 request/second rule or have some other options in addition.

>> No.9860831

>>9848772
It makes sense that /bant/ is more or less in the middle. For the last few months /bant/ users have been in a "defensive war" against /pol/. /pol/ posts racist bait threads very frequently, which usually get the usual "back to /pol/" reply from banters. At least this has led to some OC created by /bant/ users.

>> No.9860850

>>9860810
Devise a way for us to measure "shittiness" quantitatively and we'll look into it.

>> No.9860853

>>9837467
After reading the whole thread, I believe I can happily say that 4chan actually has some very smart people in it. Congratulations, really. Thanks for giving me back hope in this site and in its potential to do fantastic stuff.

>> No.9860870

>>9860850
I am not the guy that posted that, but the only answer I could think of would be to run a poll asking the users of each board whether they think their board is good at creating original content or being fun/interesting. The problem would be deciding how to keep trolling from affecting the results. Perhaps by using bait questions such as "is OP a faggot?" or other memes (troll-bait, basically). Those who say yes in the example (or fall for the obvious troll-bait) would just be considered trolls trying to skew the results rather than people taking it seriously.

It's the only idea I could think of.

>> No.9860886
File: 3.63 MB, 3218x6418, pol plays capture the flag.jpg [View same] [iqdb] [saucenao] [google]
9860886

>>9860853
4chan can come up with some fucking insane demonstrations of intelligence and creativity when it finds the right project - it's just that the "right project" is whatever sounds like it'll be fun, interesting, or challenging, not necessarily what'll actually be the most productive or worthwhile.

Sometimes that means /sci/ breaking out graduate level physics to determine whether you can cook a steak via orbital reentry. Sometimes that means /co/ learning cryptography to decipher codes in a cartoon show. Sometimes that means /pol/ using airplane contrails and fucking stars to triangulate the position of a flag they're trying to steal. This is 4chan: we're creatures of whim.

>> No.9860935
File: 121 KB, 1472x858, sentiment_thank.png [View same] [iqdb] [saucenao] [google]
9860935

>>9860810
the closest thing for now is a sentiment analysis on the comment text, though it's far from an accurate representation.
There are certainly better ways to check for that.

>> No.9860946

>>9860886
>/sci/ breaking out graduate level physics to determine whether you can cook a steak via orbital reentry
sauce

>> No.9860954

>>9860946
Can I get a screencap of that? It sounds very interesting. Hopefully when I finish my studies I can help with the next fun thing /sci/ does.

>> No.9860985

>>9856737
I never saw faggot getting filtered but I definitely remember roody poo, and 10 years ago peanut butter and jelly got filtered to peanut butter and niggers, iirc. It's been so long it's hard to remember.

>> No.9860993

>>9860985
it was around the same time iirc that "faggot" was word filtered

> It's been so long it's hard to remember
I know, I was starting high school around the time these things happened

>> No.9861009
File: 548 KB, 1276x1048, 1524426258700.png [View same] [iqdb] [saucenao] [google]
9861009

>>9860993
Well, I found a list - much longer than I expected.

https://www.lurkmore.com/view/4chan_Wordfilters

We do have a current one sitewide, from the butthurt /g/ mod getting booty blasted over this image (and other things) in the /mkg/.
Since you're an oldfag too, here's a filterproof so᠌yboy that you can copy to a text file, works with so᠌yboard as well.
*snicker*

>> No.9861014

>>9861009
Only cancerous containment board users use that term.

>> No.9861016

>>9860946
>>9860954
https://yuki.la/sci/5209640

Almost 200 posts over a 24 hour period. Graduate level might be a bit of an exaggeration, but you had anons legitimately trying to estimate frictional heating on reentry, the effects of rotation, whether to jettison it frozen or raw, etc. IIRC the guy who does XKCD picked up on it and did a whole "What If?" entry working out the problem.

>> No.9861017
File: 928 KB, 500x244, 1367488036816.gif [View same] [iqdb] [saucenao] [google]
9861017

>>9861014

>> No.9861019
File: 80 KB, 499x434, wordfilter assmad.png [View same] [iqdb] [saucenao] [google]
9861019

>>9861009
>lurkmore was still up
wew lad

I actually like that word filter though, it produced a lot of butthurt on boards with srsfags

>> No.9861020
File: 94 KB, 1342x279, electionfags and wordfilters.png [View same] [iqdb] [saucenao] [google]
9861020

>>9861009

>> No.9861025
File: 20 KB, 616x106, Capture.jpg [View same] [iqdb] [saucenao] [google]
9861025

>>9861019

>> No.9861028

>>9861025
lol

>> No.9861030
File: 29 KB, 316x368, don't lie.jpg [View same] [iqdb] [saucenao] [google]
9861030

>>9861017

>> No.9861034

>>9860886
Right now, it's /biz/ hyping up a severely undervalued cryptocurrency

Screencap this

>> No.9861554

What exactly do sentimentScore and sentimentComparative represent?

>> No.9861633

>>9853675
https://cdn2.desu-usergeneratedcontent.xyz/qa/image/1521/38/1521388673182.png

>> No.9861636

>>9851363
>>9851358
Have you guys never been on /v/? Any thread not made by a filter-evading-bot dies with fewer than ten replies. Locate a blemish, now that the dust has settled, whats her name? >enemies can open doors

>> No.9861657
File: 35 KB, 786x788, images vs text.png [View same] [iqdb] [saucenao] [google]
9861657

Best graph so far - it really shows the negative correlation between posts with images and posts with text, and the slide from discussion to image dump

>> No.9861717
File: 26 KB, 770x293, sentiment.png [View same] [iqdb] [saucenao] [google]
9861717

>>9861554
it's just something I stumbled upon while building the analyzer and used, because I didn't really have many other criteria to check yet.
The sentiment analyzer categorizes the individual words into positive/neutral/negative and also scores them from -5 to +5.
'sentimentScore' is the total score of all words divided by the # of comments analyzed on that board
'sentimentComparative' is the average of how much of a comment consists of positive/negative words (scoring whole posts from -5 to +5 this time).
https://github.com/thisandagain/sentiment#how-it-works

It was just an experiment, but I see now that it's much better suited for something like news articles or blog posts instead.
With 4chan posts, it easily misreads things, as seen in the 3rd comment in the pic, where it interprets 'no problem' as 2 negative words.
>>9861633 might be a good thing to check for actually. Is Flesch-Kincaid generally accurate for all kinds of text, even one-sentence comments?
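For anyone curious, the two metrics could be sketched like this. The five-word lexicon here is a made-up stand-in for the real AFINN word list, and the comments are invented - but note how "no problem" comes out negative, exactly the misread described above:

```python
# Toy AFINN-style lexicon (the real list scores thousands of words)
LEXICON = {"good": 3, "great": 3, "bad": -3, "problem": -2, "no": -1}

def analyze(comment):
    words = comment.lower().split()
    score = sum(LEXICON.get(w, 0) for w in words)
    # comparative: the score normalized by comment length
    comparative = score / len(words) if words else 0.0
    return score, comparative

comments = ["great work anon", "no problem", "bad thread"]
results = [analyze(c) for c in comments]

# board-level aggregates, per the descriptions above
sentiment_score = sum(s for s, _ in results) / len(comments)
sentiment_comparative = sum(c for _, c in results) / len(comments)
```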

>> No.9861735

>>9861717
>it interprets 'no problem' as 2 negative words.

It seems like you can use a custom corpus of words so it might be possible to get it to interpret phrases like "no problem" as a positive

>> No.9861767

>>9861735
yeah but doing it manually you will never reach the end of covering those special cases. I think it's simply not made for analyzing dialogue-like language.

>> No.9861795

>>9861767
That's true. There does seem to be some research on applying sentiment analysis to dialogue though, which hopefully would yield better results with informal text. I found this paper which I'm skimming now, but seems like it could be applicable:
https://pdfs.semanticscholar.org/0d4f/c603e54dd864dbb0ac1deaa116c67c14f7bd.pdf

>> No.9861815

>>9861657
I dunno if that shows it quite so accurately - almost the entire first two-thirds of that is porn, lewd, and wallpaper threads, and they seem to be throwing almost every single figure off because they're inherently the boards with the most images and least text, yet a relatively average amount of traffic. We keep getting these big porn clusters in every single figure - we may have to consider excluding some of these boards if we want a more accurate picture of the rest of 4chan.

>> No.9861888

>>9860376
my post :)

>> No.9861898

>>9861815
>we may have to consider excluding some of these boards if we want to get a more accurate picture of the rest of 4chan.

If this is needed, I would exclude /b/. It's the one with the most traffic and, for the most part, everything is porn. It could mess with the rest of the data, and (personal opinion) /b/ is considered by most long-time anons to be unsalvageable.

If it comes to that, this is what I would choose to do, but should the situation arise I will let the experts decide (I am just lurking a bit here since this is way too complicated for me, but very interesting).

>> No.9861904

>>9861898
And yet, ironically, /b/ is rarely in the extremes on these graphs - it's boards like /c/, /cm/, /s/, /aco/, /y/, /e/, etc. that are always way out in the fringes for image ratios, posts per poster, etc.

>> No.9861917
File: 171 KB, 475x614, 1509473574162.png [View same] [iqdb] [saucenao] [google]
9861917

>>9839415
Yes, frog = impeccable intellect

>> No.9861925

>>9861917
Tell that to /bant/. Or many of the other boards.

>> No.9861941
File: 1012 KB, 1000x1000, 1514405991603.png [View same] [iqdb] [saucenao] [google]
9861941

Here's a completely unreadable circular relationship graph I made a while ago, with data collected over a period of 2 hours or so. It would be great to have an interactive version of it, but I could not find an easy/lazy way to do it.
Here's the raw data: https://pastebin.com/raw/PVRDP59q

>>9840294
>or, if you're proactive, trying and put some data together.
You can find a literal metric shit ton of slightly outdated data here https://archive.org/download/archive-moe-database-201506
I personally don't have enough HDD space left to do anything with it.

>> No.9862026

>>9861735
Similarly, "pretty bad" is counted as two negatives rather than one positive and one negative.

>> No.9862171
File: 489 KB, 480x262, 1529423607244.gif [View same] [iqdb] [saucenao] [google]
9862171

>>9838118
>I was accurately able to intuit the most heavily trafficked board without knowledge of the statistics in order to best casually understand the zeitgeist.
Neat

>> No.9862179

>>9849836
I was going to mention this.
It's actually not difficult to track images, though the many variations would certainly cause a headache.

>> No.9862419

>>9861941
Is this by board mentions or board crosslinks?

>> No.9862420

>>9862419
Crosslinks only

>> No.9862436

>>9862420
kay, yeah I figured that would be way easier to search for

>> No.9863018
File: 172 KB, 633x314, on whom the pale moon gleams.png [View same] [iqdb] [saucenao] [google]
9863018

>>9837467
>and determine if it's possible to illustrate some kind of rough empirical structure of 4chan's communities and cultures.

asking for 4chans help in datamining operations, huh?

>> No.9863053

>>9863018
Nope, just a neat project.

>> No.9863062

>>9863053
I don't believe you.

>> No.9863069

>>9863062
That's nice.

>> No.9863076

>>9863069
>That's nice.

Thanks, I thought the same thing.

>> No.9863691

>>9863018
Oh get fucked, this is the first interesting project that's been on /sci/ in months.

>> No.9863719
File: 78 KB, 1425x311, regex.png [View same] [iqdb] [saucenao] [google]
9863719

Even just counting the characters someone typed in is tricky to get right.
I am removing and replacing
>/p/ EXIF data
>post-number quotes (quotelinks and deadlinks)
>linebreaks (<br> gets replaced with a single character to count pressing enter)
>any remaining HTML-tags, but not their content, like /g/ [code] brackets, that end up as <pre></pre> HTML-tags
>converting HTML-entities to normal characters ("&gt;" becomes ">")
>any whitespace from start and end of post

Now, I am not a /sci/ regular and only noticed the [math] and [eqn] tags recently.
Do these even work? They don't do anything for me when I try them in the preview.
And do any other boards have similar special content that can get inserted into a post?
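The cleanup steps listed above could be sketched roughly like this in Python. The regexes are illustrative approximations of the API's comment HTML, not the exact markup, and the /p/ EXIF step is omitted:

```python
import html
import re

def clean_comment(raw):
    # post-number quotes: quotelinks and deadlinks, dropped entirely
    text = re.sub(r'<a [^>]*class="quotelink"[^>]*>.*?</a>', "", raw)
    text = re.sub(r'<span class="deadlink">.*?</span>', "", text)
    # <br> replaced with a single character (counts as pressing enter)
    text = re.sub(r"<br\s*/?>", "\n", text)
    # drop any remaining HTML tags, but keep their content
    text = re.sub(r"<[^>]+>", "", text)
    # HTML entities back to normal characters ("&gt;" becomes ">")
    text = html.unescape(text)
    # whitespace from start and end of post
    return text.strip()
```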

>> No.9863831
File: 243 KB, 3600x1300, sci latex guide.png [View same] [iqdb] [saucenao] [google]
9863831

>>9863719
>Now I am not a /sci/ regular and noticed the [math] and [eqn] tags just recently.
>Do these even work? They don't do anything for me, when trying them in the preview.
It's latex format, you can type out equations and junk. There's a guide for it

>> No.9863864

>>9863831
I checked it again. uBlock prevented the required script from loading.

>> No.9863882 [DELETED] 

>>9863831
test
{e}^{i \pi} + 1 = 0
[eqn]{e}^{i \pi} + 1 = 0[/eqn]

>> No.9864544
File: 28 KB, 356x629, threadAge.png [View same] [iqdb] [saucenao] [google]
9864544

started checking for oldest thread age (excluding stickies)

Also rewrote the whole text analysis part. I would ultimately like to make it so anyone can pick their own words to check for and not the server having a pre-defined list.
I just don't know how to go about it regarding search performance with full-text search on this kind of scale.
(visible + last day content is ~1,200,000 comments with ~139,000,000 characters, if I did everything right with this)

>> No.9864888

>>9864544
interdasting

>> No.9864971
File: 49 KB, 800x480, 35930078_200159300632581_7458529735379779584_n.jpg [View same] [iqdb] [saucenao] [google]
9864971

This is a cool thread

>> No.9866267

>>9837467
I hope you will submit the paper and put that UN thing to shame.

>> No.9866564

still around OP?
Anything new on your end?

>> No.9867466

>>9853939
>>9853939
/g/ here. I sure hope anon isn't storing the comments as actual text. That would burn a lot of disk space unnecessarily.

tl;dr

If you're analysing words, then give each word a number (32 bits should more than cover 4chan :) rather than use off the shelf compression. That way, you have symbols you can count with no effort at all.

>> No.9867572

>>9860525
could someone arrange this as a directed graph?
If only connections with, say, 150% of the average mention rate are shown, it should be possible to find clusters

>> No.9867608
File: 5 KB, 365x241, 1512330007216.png [View same] [iqdb] [saucenao] [google]
9867608

>>9867466
afraid that won't work though.
Let's say you want to check for the occurrence of a certain word like "boomer", then you would miss out on any variations like "boomers", "boomerfolio", "boomerpost" and so on.

Space isn't the problem right now. It's rather finding a way to check ~ a day of it in the most efficient way.
Best solution I have so far is to tokenize comments into words, save the token count in a map and then look through all keys to see if part of it matches the search-word.

Though maybe it's not actually worth the effort or even a good idea in the first place.
4chan is a pretty nice and organic place.
Even while working on it, I get the feeling that doing in-depth analysis kind of removes the soul of it all.
The extent of the live stats at least is only to show board activity, but fully analyzing content and categorizing boards is quite different and maybe not that fun in the end.
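The tokenize-and-map idea described above could look something like this (the comments here are made up for illustration):

```python
from collections import Counter

# Tokenize every comment once and keep counts in a map; searching then
# substring-matches against the token keys instead of re-scanning all
# comment text, so variations like "boomerpost" are caught too.
comments = ["boomer thread", "another boomerpost", "zoomers rise up"]
token_counts = Counter(tok for c in comments for tok in c.lower().split())

def count_matches(word):
    return sum(n for tok, n in token_counts.items() if word in tok)
```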

>> No.9869265

saving

>> No.9869324
File: 55 KB, 787x253, r9k.png [View same] [iqdb] [saucenao] [google]
9869324

>>9867608
>4chan is a pretty nice and organic place.

>> No.9869431

>>9838864
>>9839385
>>9839408
a month or so before 30yo boomer really blew up, there was a compilation image of 30yo boomer threads on at least 12 boards trying to force the meme

>> No.9869742

>>9869431
I've never heard of that meme. I guess that means I don't browse cancer boards.

>> No.9869770

>>9869742
>that 30 y.o. boomer that doesn't keep up with the memes

>> No.9870208

>>9869742
wow you're cool too bad you're on /g/ right now

>> No.9870214

>>9870208
>too bad you're on /g/ right now
you do know that this is /sci/, right?

>> No.9870231

>>9867608
>Even while working on it, I get the feeling, that doing in depth analysis kind of removes the soul of it all.
The soul comes from the people, so unless your computer is powered by nameless horrors the soul should be pretty much safe.

>> No.9870535

>>9870208
Kill yourself

>>9869770
gb2 your containment board

>> No.9870557

>>9870535
*snap*

>> No.9870575

>>9870535
t. self appointed /sci/ guardian

>> No.9871702

The time it takes for an unbumped thread to get archived would be an interesting variable. Relatedly, so would the average catalog position from which threads get bumped. And if you track bumping, you could also measure the ratio of sage posts.

>> No.9872105

>>9837467
It looks like you're remote viewing an alien there, wave

>> No.9872175

>>9871702
>The time it takes for an unbumped thread to get archived
that's pretty much the same as rate of new threads with some exceptions, where a board doesn't hold the usual 150 threads
>the average position in the catalog at which threads are bumped from
I don't think you can reasonably check that. Maybe for very slow boards, but otherwise you would have to fetch the catalog so frequently for it to work.

>> No.9872592

>>9848772
>/lit/ is the only one who says "jew" more than "nigger"

>> No.9872594

>>9856436
weird, I thought "boomer" was a /tv/ thing

>> No.9872699

>>9867608
An interesting take that I think could go in a lot of directions is looking at reply-chain graphs, particularly in connection with questions regarding analysis of longer post length, etc.
Go through a thread and add onto each post's entry: replies_to, replied_by
>> No.9872702

>>9872175
>that's pretty much the same as rate of new threads with some exceptions, where a board doesn't hold the usual 150 threads
Not exactly. The speed at which threads are pushed down is a combination of the rate of thread creation and the rate at which the threads below them are bumped. The expected time to be pushed from position x to position x+1 would be
[math]\displaystyle t(x) = \frac{1}{C + f(x) B}[/math]
where C is the thread creation rate, B is the thread bumping rate, and f(x) is the fraction of bumps that occur in threads below position x. At the bottom of the catalog, f(x) goes to zero and thread creation dominates, whereas at the top of the catalog, f(x) approaches one and bumping dominates. So by measuring the time it takes for threads to reach various positions in the catalog, you can calculate the rate of bumping and the distribution of where the bumping takes place.
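Plugging made-up rates into the formula gives a feel for the two regimes (C, B, and the linear f(x) here are all illustrative, not measured):

```python
def t(x, C, B, f):
    # expected time to be pushed from catalog position x to x+1:
    # t(x) = 1 / (C + f(x) * B)
    return 1.0 / (C + f(x) * B)

# Illustrative rates: 2 new threads/min, 30 bumps/min, and a toy f(x)
# that falls off linearly over a 150-thread catalog
C, B = 2.0, 30.0
f = lambda x: 1.0 - x / 150.0

t_top = t(0, C, B, f)       # f ~ 1 at the top: bumping dominates
t_bottom = t(150, C, B, f)  # f ~ 0 at the bottom: creation dominates
```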

>> No.9872736

>>9872699
I can give you a prediction for this already - plotting number of post replies vs post length will give you a bimodal distribution - one short peak for mid-to-long posts that make really good points and contribute to the discussion, and one fucking huge peak for short, inane posts that happened to get repeating digits.

>> No.9872747

>>9872736
It would be interesting to see which boards care about dubs and which do not.

>> No.9872759

Read about half of the thread, but I don't have the patience to check for this: have you tried coming up with as many parameters as you can, doing a principal component analysis on the data, discarding the highly correlated parameters, and simply applying a few clustering algorithms to the boards to get some sort of similarity index between different boards? Or maybe running some text analysis on posts across boards to identify which boards certain groups of people tend to visit? (e.g. there'd be a cluster for /sci/, /g/, /lit/, /diy/, etc., one for /hm/, /cm/, /y/, /fit/, etc., and so on)

>> No.9872773

>>9872702
this is interesting, nice approach

>> No.9872927

>>9846869
>>9846860
there are already archiving websites which I'm sure would be happy to satisfy your request for data; 4plebs comes to mind

>> No.9872974

>>9844047
It is possible to embed steganographic data within an image by fiddling slightly with its RGB values, which is certainly an option for you. There's a tool online somewhere for it.
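A toy illustration of the idea - one hidden bit per channel value in the least-significant bit, with a flat list of ints standing in for a decoded image (real tools like steghide are far more sophisticated):

```python
def embed(pixels, bits):
    # clear each value's LSB and set it to the payload bit
    hidden = [(p & ~1) | b for p, b in zip(pixels, bits)]
    return hidden + pixels[len(bits):]

def extract(pixels, n):
    # read back the LSB of the first n values
    return [p & 1 for p in pixels[:n]]
```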

>> No.9873154

>>9872974
>steganographic data
I've had quite enough of your dinosaur mumbo jumbo, sir!

>> No.9873187

>>9872974
This wouldn't survive JPEG compression.

>> No.9873210

>>9873187
Correct me if I'm wrong, but I don't think 4chan uses lossy compression. Regardless, the tool is fun to play around with
http://steghide.sourceforge.net/

>> No.9873872

>>9861657
Well obviously.
You can't not post text without posting an image.

>> No.9873896

>trying a bit of this for my cozy general, in R
jesus god kill me I hate webshit
I've been struggling way too long with parsing this bullshit, finally resorting to some "xpathapply" gibberish, where it should be so much easier to extract what I want from the quotelink/etc tags, then get rid of the tags and save the text comment
of course also the parsing library I was using shits the bed on pure strings, so I have to ask isHtml(string) or isXML(string), I forgot which because I just ended up so disgusted.

The web was a mistake.

>> No.9873900

>>9873896
Are you trying to scrape the page instead of just using the JSON API?

>> No.9874007

>>9873900
No, getting the catalog JSON, leading to the thread JSON, leading to comments in HTML form.
I'm sure I'm the weak link in the process; I don't have much experience with it. I'll probably start the comment-parsing section over from scratch and come up with better tools/structures for that part. I essentially just want to extract the reply-chain information in some form, plus the pure text, for practicing some analysis - maybe things like "samefag likelihood" estimation, post generation including attachments, etc.

Does anyone know how the post cooldown is calculated? It is longer for posts with images, but it also seems to change periodically, though I could be mistaken.

>> No.9874010

>>9874007
It's different from board to board. There's a list at
https://a.4cdn.org/boards.json

>> No.9874479

>>9872736
>>9872747
I'd like to see a plot of replies vs repeating digits. I'd bet there's an exponential increase as you get better and better gets.

>> No.9874483

>>9872759
That's basically the overall goal of the project - determining critical parameters and trying to create a 'map' of the boards in that parameter space.

There've been quite a few attempts at that so far, e.g. the set starting with >>9855137, but one of the problems is that a lot of different parameters are correlated with each other, and determining which parameters are the *true* independent ones is difficult in those cases. For example, in >>9855138 there's a clear correlation between text vs image posting, but which is the actual independent parameter in this case? Or is it neither, and both depend on other board characteristics?

>> No.9875026
File: 116 KB, 1597x867, text_analysis.png [View same] [iqdb] [saucenao] [google]
9875026

Finished modifying the text analysis, so words can be chosen on the fly instead of coming from a predefined list.
Though it's only checking the most recent snapshot of each board for now (usually not older than 90 minutes).
It's not going through any history, just what's currently visible, so results could vary quite a bit from hour to hour.

'text_ratio' is the fraction of all characters taken up by the string
'posts_ratio' is the ratio of posts containing that string at least once
try it if you want
https://4stats.io/snapshotAnalysisWorkInProgress
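As a toy recomputation of those two ratios (the comment strings and search term here are made up):

```python
comments = ["boomer boomer", "hello there", "ok boomer"]
term = "boomer"

total_chars = sum(len(c) for c in comments)
term_chars = sum(c.count(term) * len(term) for c in comments)

text_ratio = term_chars / total_chars                 # share of all characters
posts_ratio = sum(term in c for c in comments) / len(comments)  # share of posts
```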

>> No.9875048

>>9875026
I might be retarded, but I hit "Analyze", it said all good, but it's not doing anything.

>> No.9875051

>>9875048
does it not add the entry to the text analysis list?
If so which browser are you using?

>> No.9875065

>>9875051
It does add it, but no points appear. Text is "hello"
Happens in both firefox and chrome.
Oh fuck, I realized now I'm supposed to click the x,y boxes. I'd say I'm an idiot but I think there's not really enough indication that that's what I need to do. Works in both firefox and chrome.

>> No.9875073
File: 35 KB, 806x779, 1531866487.png [View same] [iqdb] [saucenao] [google]
9875073

hello

>> No.9875079

>>9875065
yeah the UI is something quickly hacked together to have something to visualize the data.
If I ever integrate it into the site properly, then that would need an overhaul for sure.
>>9875073
Search without the quotes though.
Unless you really want to search for "hello" with quotation marks on each end

>> No.9875085
File: 59 KB, 803x787, 1531866764.png [View same] [iqdb] [saucenao] [google]
9875085

>>9875079
Ahhh, got it. Well, besides that initial confusion, it works very well. Being able to keep the analysis list items around for later comparison is really nice.
Really good work anon, very nice tool.

>> No.9875125

>>9875079
how is postsPerPoster calculated?

>> No.9875129

>>9875125
oh I think it must be a WIP, it doesn't give visual indication of being clicked (though it is apparently some value)

>> No.9875263

>>9875125
>how is postsPerPoster calculated?
it's the average number of posts of unique IPs per thread.
basically (repliesPerThread_mean / postersPerThread_mean)
>>9875129
>it doesn't give visual indication of being clicked
oh, there was an issue, because /vip/ doesn't give info about unique IPs per thread.
I just set it to 0 now for that board.

>> No.9875433
File: 70 KB, 380x349, boomer.png [View same] [iqdb] [saucenao] [google]
9875433

>>9872594
started on /fit/

>> No.9875623
File: 13 KB, 559x527, 1525184296550.png [View same] [iqdb] [saucenao] [google]
9875623

>>9845054
thanks /pol/

>> No.9875699
File: 57 KB, 784x779, file.png [View same] [iqdb] [saucenao] [google]
9875699

Why does /sci/ have so many reddit links?

>> No.9875710
File: 126 KB, 1159x819, file.png [View same] [iqdb] [saucenao] [google]
9875710

Places where Lee spams or that talk about him. There are a suspicious number of zeros. You'd probably need a longer data period to say anything interesting.

>> No.9875773

>>9875710
what? why isn't /v/ #1?

>> No.9875782

>>9875699
because we're not stupid

>> No.9876098

>>9875026
Could you get it to accept slashes so that it can search for terms like "/b/" and "/pol/"?

>> No.9876110
File: 135 KB, 320x363, hatedAnon.png [View same] [iqdb] [saucenao] [google]
9876110

>>9875699
some people on sci are unironically
>I fucking love science
>based black science guy
>Women in Stem, it's current year

>> No.9876791

>>9876098
yeah, didn't think of that.
Takes a bit longer though, because many of those may be in board crosslinks ( >>>/pol/123456 ), and so far I just removed all board/postlinks so they wouldn't add to the character count of the post.
I guess this is a special case where the /board/ part should be kept in.
The only difference this makes is that posts whose entire content is a single crosslink won't be counted as no-text replies anymore, since the board name remains - but I guess that's maybe only a handful of posts across the whole site.
>>9875710
>You'd probably need a longer data period to say anything interesting.
true, only checking visible content isn't ideal.
visible + last day would be better, but I have to see if I can make it work with the memory available.

>> No.9877238
File: 778 KB, 750x422, Popsci Internet Defence Force.png [View same] [iqdb] [saucenao] [google]
9877238

>>9875699
popsci plebs

>> No.9877373

>>9853675
/co/ and /lit/ are my two favorite boards.

>> No.9877390

Consider only analyzing posts that have replies. Many posts are very low-quality spam.
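A minimal sketch of that filter, assuming each post is a dict with an "id" and a "comment" field (the field names are made up):

```python
import re

# A post counts as replied-to if any other post in the thread quotes
# its number with >>id; everything else is dropped.
def posts_with_replies(posts):
    quoted = set()
    for p in posts:
        quoted.update(int(n) for n in re.findall(r'>>(\d+)', p["comment"]))
    return [p for p in posts if p["id"] in quoted]
```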

>> No.9877489

>>9853675
No /diy/?

>> No.9877592
File: 114 KB, 1588x782, boardsearch.png

>>9876098
I switched everything over to a new server that should be able to handle it. Slashes for searching board mentions also work now.
It's now looking through visible + last-day posts, which should give much more stable results for the faster boards.
Maybe give it a day to regenerate the comment cache though, since right now it only covers visible + roughly the last hour.
I'm always worried that I made a retarded mistake somewhere and some of the results are total bullshit, but nothing really weird stands out so far.

Don't know why I am wasting that much time on this.
Could have played some CS:GO instead.

>> No.9878445

>>9877489
That diagram is a total scam.

>> No.9878618

>>9878445
Figured as much; collecting that data in the first place would be impossible (unless you're hiroshimoot).

>> No.9878636

>>9877592
>Don't know why I am wasting that much time on this.
>Could have played some CS:GO instead.
Because science.

>> No.9878746

>>9878618
It just goes to show that a fancy presentation and name will convince most people. Perhaps you know this one?
https://brainosoph.wordpress.com/2014/11/13/the-ghost-of-stronzo-bestiale-and-other-fake-scientific-authors/

>> No.9879106

In the last day, based on post ratios (excludes /qa/ and /vip/):
/adv/ mentioned /r9k/, /soc/ more than other boards
/asp/ mentioned /sp/ more than other boards
/bant/ mentioned /an/, /po/, [s4s] more than other boards
/cm/ mentioned /y/ more than other boards
/d/ mentioned /aco/ more than other boards
/diy/ mentioned /out/ more than other boards
/e/ mentioned /h/ more than other boards
/f/ mentioned /a/ more than other boards
/fa/ mentioned /cgl/, /fit/ more than other boards
/gd/ mentioned /3/, /t/, /wsr/ more than other boards
/h/ mentioned /d/, /e/ more than other boards
/i/ mentioned /ic/, /qst/, /trash/ more than other boards
/ic/ mentioned /gd/ more than other boards
/lit/ mentioned /his/, /tv/ more than other boards
/m/ mentioned /wsg/ more than other boards
/n/ mentioned /o/, /toy/ more than other boards
/news/ mentioned /pol/ more than other boards
/out/ mentioned /adv/, /asp/, /fa/, /k/ more than other boards
/po/ mentioned /diy/, /i/, /n/, /tg/, /trv/ more than other boards
/r/ mentioned /b/ more than other boards
/r9k/ mentioned /lgbt/ more than other boards
/s/ mentioned /gif/, /hc/ more than other boards
[s4s] mentioned /mu/, /s/ more than other boards
/sci/ mentioned /bant/, /c/, /f/, /lit/, /mlp/, /x/ more than other boards
/t/ mentioned /r/, /vr/ more than other boards
/trv/ mentioned /ck/, /int/, /p/ more than other boards
/v/ mentioned /u/ more than other boards
/vr/ mentioned /v/, /vp/ more than other boards
/w/ mentioned /wg/ more than other boards
/wg/ mentioned /hr/, /w/ more than other boards
/wsr/ mentioned /biz/, /co/, /g/, /jp/, /m/, /sci/, /vg/ more than other boards
/y/ mentioned /cm/, /hm/ more than other boards
/news/ didn't get mentions outside of /vip/ and /qa/. What a betafag.

It might not be accurate since it's only a single day's worth of mentions, but you can already see the "board families" and which boards share a common userbase, like /r/ and /t/, /y/ and /cm/, /d/ and /aco/, /h/ and /d/, /ic/ and /gd/, /trv/ and /int/.
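For the curious, the "mentioned X more than other boards" stat could be computed with something like this sketch (the data layout is an assumption: mentions[src][dst] holds raw mention counts, posts[src] the board's post count for the ratio):

```python
# For each target board, find the source board with the highest
# mentions-per-post rate; normalizing by post count keeps fast boards
# from dominating just by volume.
def top_mentioners(mentions, posts):
    best = {}  # dst board -> (src board, highest rate so far)
    for src, row in mentions.items():
        for dst, count in row.items():
            rate = count / posts[src]
            if dst not in best or rate > best[dst][1]:
                best[dst] = (src, rate)
    return {dst: src for dst, (src, _) in best.items()}
```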

>> No.9879144

>>9879106
why not /tg/?

>> No.9879645

>>9879106
this is fucking cool

>> No.9880003

>>9878636
I am not even a /sci/ anon.
Saw OP posting about the diagram idea in another thread (can't remember which board), and it seemed like an interesting potential use case for the data I already had here, but it sure eats time.

>> No.9880393

new bread when?

>> No.9880494

>>9880393
Soon, I hope. This is a most excellent thread.

>> No.9881214

Do we have any time-of-day or day-of-week stats yet?

>> No.9881263

>>9881214
I have almost a year of data now for the rate of posts and threads made on each board (at daily, hourly, and 5-minute intervals).
It looks like this: https://api.4stats.io/history/day/biz
In this case each array is
[start time as unix ms, duration in ms, posts during the duration, posts/minute during the duration]

But no metadata is generated from that yet, like the time of day a board is most active or day-of-week patterns.
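If anyone wants to poke at those arrays, here's a sketch using hardcoded sample entries instead of a live fetch (the values are made up; only the array layout above is taken as given):

```python
# Each history entry is:
# [start time unix ms, duration ms, posts, posts/minute]
sample = [
    [1530000000000, 86400000, 14400, 10.0],  # one full day, 14400 posts
    [1530086400000, 86400000,  7200,  5.0],
]

def posts_per_minute(entry):
    start_ms, duration_ms, posts, _ = entry
    return posts / (duration_ms / 60000.0)  # duration converted to minutes

# Recompute the fourth field from the first three as a sanity check.
rates = [posts_per_minute(e) for e in sample]
```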