
/sci/ - Science & Math



File: 330 KB, 567x600, Hertzsprung-Russel_StarData.png
No.9837467

Those familiar with astronomy will recognize this image - the Hertzsprung–Russell diagram for stellar classification. For those who aren't familiar with it, the HR diagram plots the brightness and temperature of observed stars and groups the stars into physically significant groups and families - giants, main sequence stars, dwarf stars, etc. What I'd like to propose as a project for /sci/ is trying to accomplish something similar with this website - a 4chan HR diagram which plots out the various boards in terms of some combination of relevant parameters (board traffic, board age, image posting, etc etc etc) to determine if it's possible to illustrate some kind of rough empirical structure of 4chan's communities and cultures. Does such a parameterization lump boards into groups and families? Does a "main sequence" of 4chan boards exist? Are there "evolutionary paths" which follow natural progressions? These are the kinds of questions I'd like to see if we can answer.

>> No.9837468
File: 22 KB, 721x961, 4chan hr first attempt.png

As a kind of rough 'proof of concept' of the idea, I acquired stats on average posts-per-day for each board over the last month, as well as image-to-post ratios, and plotted the boards with that parameterization. Already we see some interesting groupings appearing with this relatively simple approach:
1) At the top of the graph we have a close grouping of 'eye candy' boards (wallpapers, cute stuff, and all of the porn boards)
2) In the middle of the graph there is a large, amorphous blob of blue boards
3) There are two distinct 'branches' of pink boards in the lower half of the graph
- an upper branch which consists of boards like /bant/, /trash/, and /b/
- a lower branch which consists of boards like /soc/, /r9k/, and /pol/

I'd be curious to hear /sci/'s thoughts, as well as any ideas people might have for further refining this concept.

>> No.9837473

>>9837467
I like this idea!!
We could classify them on the level of brainletness of each board

>> No.9837496

Sounds pretty interesting. You could also try looking at the average words used per post, or maybe how often users will post links to other websites, and see if there's any correlation between those and the data you already collected.

>> No.9837528
File: 879 KB, 2230x1408, Screen Shot 2018-06-28 at 10.19.39 PM.png

required reading: this UN-funded study about /pol/. Good insight into researching 4chan; it will be useful, OP

https://arxiv.org/pdf/1610.03452.pdf

>> No.9837560
File: 105 KB, 409x284, the least rare pepe.png

>>9837473
Appreciate the spirit, but hoping to try for something a little more objective.

>>9837496
Yeah, that might be interesting; I'm interested in seeing what kind of correlations exist between different parameters. The biggest obstacle is data collection - finding details like board traffic info or image posting is easy with sites like 4stats and all the various archives and traffic monitors that exist... but something like figuring out the average number of words per post across 70-something boards is something I can't even fathom how to do. That's a big part of why I wanted to get /sci/'s two cents on the idea; I figured the folks here would have some insight on what parameters might really be worth looking at and how to go about collecting information on them.

>>9837528
That's fucking hysterical and I have a new reaction image now. Thank you anon.

>> No.9838118
File: 244 KB, 1416x820, 0sRB20K.jpg

>>9837560
me again from the other thread

in the past I also gathered such data as ~post-length and posts/unique-user.
Stopped because it takes too long to check all threads individually and still update the stats in a timely manner.
You can calculate the data from the public 4chan API ( https://github.com/4chan/4chan-API ), but it's going to take some time. Not an issue really, if it's supposed to be only a one time snapshot.
API limit is one request per second. So with 72 boards * ~10 pages * ~15 threads, it's going to take a few hours, but then you have all the data you would want.
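Sanity-checking that estimate (the board/page/thread counts are the rough figures from this post, not exact values):

```javascript
// Rough request-count estimate for a one-time snapshot,
// using the approximate figures quoted above.
const boards = 72;
const pagesPerBoard = 10;  // ~10 pages per board
const threadsPerPage = 15; // ~15 threads per page

const requests = boards * pagesPerBoard * threadsPerPage; // 10800 thread fetches
const hours = requests / 3600;                            // at 1 request/second

console.log(requests, hours); // 10800 requests, 3 hours
```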

>> No.9838189

>>9838118
Cool, I'll take a look at that.

>> No.9838219

>>9837560
https://4stats.io

>> No.9838449

>>9837528
>https://arxiv.org/pdf/1610.03452.pdf
It seems some of the authors have been hanging out here rather than just extracting data from bots, ref the "comfy" pepe.

>> No.9838778

>>9838118
Actually, I don't suppose you could go into a bit more detail on how to do that? I'm not really familiar with this sort of thing.

>> No.9838809

I'd like to see something about average post length as a proxy for thoughtfulness vs. traffic. Kind of similar to this one >>9837468 but different.

>> No.9838859

>>9838778
you would need some minimal programming knowledge to automate the API requests.

You would first get a list of all existing boards
https://a.4cdn.org/boards.json
then get info on existing threads (taking /sci/ as an example board here)
https://a.4cdn.org/sci/threads.json
or
https://a.4cdn.org/sci/catalog.json
take the OP post number and fetch the thread like
https://a.4cdn.org/sci/thread/9837467.json (this thread)
and that object contains all the posts, plus some extra data like "unique_ips" in the OP's object.
Put all the data in some kind of database I guess, and where you go from there is up to you.

Some boards don't have archives afaik, so if you fetch /b/ threads for example, it would probably make sense to start loading them from the back of the catalog, so threads aren't pushed off the board by the time you reach them, as they might be if you started from the front.
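The steps above can be sketched as a minimal JavaScript (Node 18+) script - the URL shapes are the ones listed above, but the function names and structure are just my guess at an implementation:

```javascript
const API = "https://a.4cdn.org";

// URL builders for the three endpoints described above
const boardsUrl = () => `${API}/boards.json`;
const catalogUrl = (board) => `${API}/${board}/catalog.json`;
const threadUrl = (board, opNo) => `${API}/${board}/thread/${opNo}.json`;

// Fetch JSON, respecting the 1-request-per-second API limit
async function fetchJson(url) {
  await new Promise((r) => setTimeout(r, 1000)); // conservative throttle
  const res = await fetch(url);
  if (!res.ok) throw new Error(`${res.status} for ${url}`);
  return res.json();
}

// Snapshot one board, walking the catalog back-to-front so threads
// aren't pruned off the board before we reach them
async function snapshotBoard(board) {
  const catalog = await fetchJson(catalogUrl(board));
  const threads = [];
  for (const page of [...catalog].reverse()) {
    for (const op of page.threads) {
      threads.push(await fetchJson(threadUrl(board, op.no)));
    }
  }
  return threads; // each entry has .posts, with extras like unique_ips on the OP
}
```

The 1-second sleep before every request is deliberately conservative; parallel fetching from multiple IPs, as discussed later in the thread, would speed this up.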

>> No.9838864
File: 22 KB, 376x349, 1530159782903.jpg

I think it would be interesting to see how pics spread. Like, if an image/meme is first posted on /g/, then how long does it take to travel over to /jp/? Sort of like a web matrix of how people spread information across the different boards.

the 30 year old boomer was originally only on a couple of boards, but now it has spread as far as /o/ and /ck/.

>> No.9838888

>>9838778
>>9838859
and it might be a good idea to pay attention to the time of day at which you plan to create that data snapshot.
3 hours is a long time, and different boards have their peak activity at different times of the day.
So you have to decide whether you want to compare boards at one single point in time, or get the data at the weekday/time at which each board is normally most visited.
You can check the graph on the 4stats site for that I guess.

And even then, some boards have interesting activity patterns, like /asp/ or /qst/.
/asp/ for example is always super-active Monday and Tuesday nights, while /qst/ is the board with the highest variance in activity, dependent on time of day like no other board (dropping to as low as 10-15% of peak during off hours).
Everyone seems to meet up there at only very specific times during the day to participate in the guided threads OPs are organizing.

>> No.9839000

>>9838888
If we're talking about comparing things like 'post length' and 'image ratios' and shit, then it probably isn't super critical what time the data is taken. You'd have a large enough sample size that any collection of posts you look at can probably be taken as indicative of average posting trends, as far as those kinds of things are concerned. Likewise, sites like 4stats take averages over several weeks, so things like one or two big nights of activity surrounding a big announcement or event should get washed out in the statistical averaging.

>> No.9839167

>>9839000
that could be interesting by itself.
Compare post quality on peak vs off-time. I would bet that on all boards during the daily peak you get a lot more shitposts and bait attempts, while during less active times you have a better chance of finding genuine discussions with people interested in the board's topic.
But that's nothing more than a guess anyway.

>> No.9839385

>>9838864
God I hate this shitty fucking forced meme.

>> No.9839408

>>9839385
that's the thing, there must be some interesting story behind how it spread to every board. I don't think it's a bot

>> No.9839415

>>9837473
Percentage of frogposts would be an excellent proxy for this.

>> No.9840007

>>9839000
>probably isn't super critical what time the data is taken
We have some polls from >>>/g/cyb that show large variations across the time zones. I'd say a 24 h period is needed to be sure.

>> No.9840294

>>9840007
Posting trends over time - especially as you cross the timezones from the US into Asia into Europe - would be interesting to look at for different sites.

/sci/ should look at expanding this thread idea into just a general 4chan statistics project, I'm sure there's a lot of really interesting and surprising insights we could gain into this site by looking at different posting trends.

If anyone else has ideas, suggest them here or, if you're proactive, try and put some data together.

>> No.9840468
File: 35 KB, 669x661, Demography2.png [View same] [iqdb] [saucenao] [google]
9840468

>>9840294
One thing is for sure, neither the boards nor the topics are homogeneous. Pic. very much related.

>> No.9841964

>>9840468
Wow, that's a lot more PhDs than I suspected.

>> No.9841986

it's an anonymous website full of trolls who love to answer questions falsely

good luck lad

>> No.9842013

Wow an actually interesting thread on /sci/ I never thought I'd live to see the day

>> No.9842028

>>9840294
>If anyone else has ideas, suggest them here or, if you're proactive, try and put some data together.
Looking at flag rates on boards like /pol/ and /int/ could be interesting - seeing whether concentrations of posters in the Americas, Asia, and Europe follow expected trends based on timezones, or are more unexpected.

>> No.9842038

>>9842013
I know, right?

Also post your IQ

and write a book about how you hate CS

dont forget to mention elon musk and black science man

>> No.9842142

>>9841964
That was my first thought too. Then again, 4chan has a lot of autists so why not? There are several of my colleagues I strongly suspect, many have a PhD.

>> No.9842145

>>9842142
>There are several of my colleagues I strongly suspect, many have a PhD.
how can you not know? Ask them and laugh inside that you have the same job but without a ph.d

>> No.9842269

>>9842145
Given that I too have a PhD I am not sure about that laughing part.

>> No.9842274

>>9842142
>There are several of my colleagues I strongly suspect, many have a PhD.
Wait, so you suspect that they have PhDs or you suspect that they post on 4chan?

>> No.9842304

>>9842269
well then if they have the same job as you with an undergraduate degree and you're the one with a ph.d, that sounds bad

>> No.9842453
File: 49 KB, 1645x995, snapshot.png

>>9838778
I created a snapshot of the first 3 pages for each board, which should be enough to at least experiment with some of the properties and see if there are some obvious patterns maybe.

It's 130MB of JSON, but in that human readable format at least it should be easy to work with.

https://drive.google.com/file/d/15zki-GD9j8c6gjQoUwDGApN8wqh3iRut/view

>> No.9842597

>>9842453
This is some interesting data, but going through it by hand is gonna be a bitch. I'm not familiar with programming in Java, but I am familiar with using Mathematica for modelling and data analysis, and it can make URL requests very easily.

If I get some free time this week I'll try and code up something that can (hopefully) automate data collection and analysis for the site. Should be a pretty straightforward flowchart:
>Select board to analyze
>Mathematica calls board data API to get thread data
>Extract a list of all of the OP numbers for existing threads
>Go down the list starting with the ones closest to getting bumped off
>Mathematica calls thread data API for each number
>Extract the quantities you're interested in
>Repeat until the thread list is exhausted
>Export the data points for that board to an external file
>Repeat with each board

150 threads per board, say 10 seconds to process each thread - that's about 25 minutes to collect a full sample for each board. It'd take a while, but it'd be a helluva lot easier than doing the analysis by hand.
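The 'extract the quantities you're interested in' step from the flowchart above, sketched in JavaScript rather than Mathematica - the field names (posts, filename, unique_ips) come from the API's thread JSON, but the chosen set of quantities is just an example, not a definitive list:

```javascript
// Pull a few board-comparison quantities out of one thread object,
// as returned by the thread endpoint. The selection here is a guess
// at a useful minimal set, not anyone's actual extraction code.
function extractStats(thread) {
  const posts = thread.posts;
  const op = posts[0];
  const withImages = posts.filter((p) => p.filename).length;
  return {
    op: op.no,                             // OP post number
    posts: posts.length,                   // posts currently in the thread
    imageRatio: withImages / posts.length, // fraction of posts with a file
    uniqueIps: op.unique_ips,              // only present on the OP's object
  };
}
```

Exporting one array of these objects per board would then cover the last two steps of the flowchart.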

>> No.9842607

>>9837467
Why is temperature plotted from high to low?

>> No.9842616

>>9842607
what will change if it was plotted from low to high

>> No.9842621

>>9842597
there is no scenario where you would analyze anything manually by hand.
Any programming language should be able to parse data in JSON format. After all, that's what the 4chan API sends back to you as well.

And processing a board will be way faster than that. Pretty sure the limiting factor will always be the API limit of 1 request per second.

>> No.9842664

>>9842621
Oh I know it'll probably be more efficient than that, but years of data analysis have taught me to always be a little pessimistic when making estimates... that way you're always pleasantly surprised when it processes more quickly. Realistically, if it's only truly limited by the request cap, it could well finish a board in 5-10 minutes and you could do the whole site in a day or two once you get all the bugs worked out.

>>9842607
Stars cool as they age - plotting the HR diagram with temperature decreasing from left to right turns it into a map of stellar evolution.

>> No.9843166

>>9837467
This is actually interesting; it's a shame this wasn't done several years ago, when 4chan's boards were much more distinct culture-wise. But this should still be worthwhile.

>>9839408
It's called a forced meme for a reason; the same retard or group of retards posts the same image again and again across various boards in a vain attempt to make it catch on.
In this case it's a bunch of spergs from nu-/v/ trying to spread their cancer everywhere.

>> No.9843201

>>9843166
>its a shame this wasn't done several years ago when 4chan's boards were much more distinct culture wise
The ones that people actually care about are all still pretty distinct.

>> No.9843221

>>9843201
maybe /v/, /tv/, /pol/, and /r9k/ are very clustered together on the diagram, and /b/ is off with the porn boards

>> No.9843338
File: 34 KB, 721x961, hr pink - labels.png

>>9843221
Interestingly, with the limited info we've got so far, there seem to be three distinct branches among the pink boards (excluding the massive 'porn regime'): one leading to /gif/ and /trash/, one leading to /bant/ and /b/, and one leading to /soc/, /r9k/, and /pol/. It'll be interesting to see if similar branches appear with more advanced versions of the plotting.

Update on the programming - I've got the basic importing and API calling stuff working. I can pick a board, have it collect the numbers for all the active threads, and start calling the API for the thread details with minimal effort. Extracting the data from the thread is a little more challenging. Extracting the readily labeled info like unique IPs, replies, images, etc. is easy, but I haven't thought of a good way to do word/character counting, since there's so much extraneous information about each post I don't want counted. Will keep updating this week, but I've got a research group meeting first thing in the morning and real work that I need to get done this week, so it may be a while.

>> No.9843406

>>9843338
can't wait to see what the blue boards yield, although i doubt we could glean many specifics about a board's culture or how that changes over time.

>> No.9843449

>>9842145
I think he meant suspect of having autism

>> No.9843452

>>9843338
Wouldn't the extraneous info cancel out since it's effectively the same for every post?

>> No.9843462
File: 21 KB, 721x961, hr blue.png

>>9843406
blue boards are much harder to see any kind of patterns in. With the exception of the three at the top that are part of that same 'eye candy' cluster, almost all the blue boards sit in roughly the same band, just smeared out through that area.

image-to-post ratio vs posts-per-day only tells us a little bit: we can conclude that boards posting above 50% images to posts are *almost* exclusively pornographic; we can see that there may be some trends among pink boards; and almost all the blue boards exist in a big band between about 15-35%, with only a dozen or so boards deviating from it... but that's about all we can get from that.

we need more data, more parameters to look at:
how do post lengths and word counts compare between boards? are posts short and sweet? long but simple? long and thought out?
how evenly spread out are users? is it a lot of people in a lot of different discussions, or are only a handful of threads attracting a lot of posters?
etc etc etc

i think the more details we can get, the more patterns and trends we'll see develop

>> No.9843469

>>9843452
yes, but it's not going to scale uniformly because you get basically the same amount of extra info (html formatting, fonts, post numbers, image info, etc) for each post. so 20 extremely short posts might have the same total character count as five or six multi-paragraph posts, which is a problem if we're going to look at using word counts/post lengths to characterize board activity.

i did a bit more work on it and i've come up with a string pattern code that seems to purge the bulk of the extra stuff while leaving most of the actual post content intact. it's not perfect, but it should work well enough. but i've seriously got to call it a night for now.

>> No.9843475

>>9843462
Very cool anon, but uh, it makes me wonder if we are just as bad as companies that collect people’s data...

>> No.9843586

>>9842274
I know what their degrees are and I just suspect they are autists. I have no idea if they post on 4chan. That comma should have made it clear.

>>9842304
Am I really that ambiguous? In my line of work it is normal to have a PhD; where I work it is about 25%, while in some countries it would be the majority. And it is a field where a PhD really does help, and importantly, a PhD is also appreciated. People are skilled, though as I mentioned there are a lot with high-functioning autism.

>> No.9843600

>>9837467
>calling our sun "Sun"
Its name is Sol you idiot.

>> No.9843849
File: 33 KB, 1353x709, analysis_imageRatio_postLength.png

That site you are using to visualize the stats is really neat. Just found it today as well.
I took the data of the first 3 pages I gathered yesterday and compared some of the stats.

Simple, but maybe obvious example here.
The more images people are posting, the shorter their posts are on average.

"Post Length" is the # of characters in a post with HTML tags and post-number quotes (>>123456) removed, so it only counts actual written and quoted text.

>> No.9843940
File: 65 KB, 525x481, thumbs up.png

>>9843849
>Simple, but maybe obvious example here.
Don't undersell it, that's actually a really interesting result and not one that's necessarily intuitive. One might expect that there would be at least a few examples of boards in all four extremes:
>1) high image, long post
>2) high image, short post
>3) low image, short post
>4) low image, long post
Yet we see two big 'forbidden zones' in this graph, in what would be regions 1 and 3. With the exception of /f/ (a board that doesn't allow image posting) there really aren't many examples of boards where posters say little and contribute little (and, in fact, if you were to substitute flashes for images on /f/, it would probably push them into the linear regime with the other boards, removing that lone exception). And on the other side of things we see a MASSIVE empty space where one might expect to find boards that say a lot but also post lots of content.

Your plot's important because it demonstrates that there's a very strong correlation (an almost linear one at that) between how much content a board posts vs how much of an actual conversation it's probably having. More images tend to mean less conversation - that's a really neat result! Well done! I'd also be interested in hearing what approach you used for removing tags and post numbers while still being able to isolate and count each post.


This is kind of why I made this thread - not just to discuss the HR diagram idea, but the general concept of applying rigorous, scientific analysis to 4chan activity, behavior, trends, etc. There's a lot of really cool stuff we could look at and analyze, from plots like this, to looking at how posting trends change throughout the day, or the 'meme diffusion' concept >>9838864 suggested, etc etc etc.

>> No.9843942

>>9838864
I think you can find its origin if you look up "30 year old at the gym" on the /fit/ archive

>> No.9843980

>>9843849
Interesting stuff. I mainly hang out in >>>/g/cyb and >>>/sci/ and my impression is that some threads (like /cyb/) tend to run until autosage.

So why not add error bars with max/min and standard deviations?

>> No.9843994

>>9843849
I really like that /trash/ falls in almost the exact middle of the distribution, since presumably it has content that originated on multiple different boards and averages out

>> No.9844047

>>9843849
That's really cool - nice job.

>>9838864
Is it possible to create an image with some kind of 'tag' or 'tracker' that could be easily identified from searches even if the image is cropped or altered? You could make a meme and post it on a board and then track it through archives and see how it spreads and changes over time.
>Day 1 - posted Test Meme #27 on /b/
>Day 2 - TM27 has been reposted 42 times, but has yet to cross the board barrier
>Day 3 - a poster redrew TM27 with a twiddly little mustache, dubbed TM27-a
>Day 4 - 132 variants of TM27 have been reposted 8700 times across 37 boards

>> No.9844082
File: 56 KB, 1453x200, regex.png

>>9843994
>>9843980
it's not necessarily 'that' accurate though, since it's only the first 3 pages, meaning mostly fairly active threads with above-average reply counts are represented.
>why not add error bars with max/min and standard deviations?
I would need to look up how to do those things properly. I am far from being an expert when it comes to data visualization or statistics in general.

>>9843940
>I'd also be interested in hearing what approach you used for removing tags and post numbers while still being able to isolate and count each post.
I used regex to replace/remove certain text. Everything in JavaScript, since that's what I know best.

https://pastebin.com/QXSGJbrf (line 50-54)
This is an improved version of what I used for the image - the image doesn't show /p/, for example, because I hadn't removed the EXIF text yet.

The resulting JSON looks like this
https://pastebin.com/XGkfLGh6

>> No.9844097

>>9843475
Nah, none of this is personal info and it's all purely academic in interest.

>> No.9844138

>>9844082
>I would need to look up how to do those things properly. I am far from being an expert when it comes to data visualization or statistics in general.
For error bar analysis the max, min and mean is a good start. You can also use hinges to indicate how even the distribution is.
http://www.statisticshowto.com/upper-hinge-lower-hinge/

>> No.9844205

>>9844138
Alternatively, if you're already calculating means, computing a standard deviation is pretty easy to do from there. There's probably a built-in function for it in whatever you're using.
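For instance, minimal versions of the mean, (population) standard deviation, and an interpolating percentile in JavaScript, if you'd rather not rely on a library:

```javascript
// Arithmetic mean of a sample
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Population standard deviation (divide by n, not n-1)
const stddev = (xs) => {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / xs.length);
};

// p in [0, 1], linear interpolation between sorted samples
const percentile = (xs, p) => {
  const s = [...xs].sort((a, b) => a - b);
  const i = (s.length - 1) * p;
  const lo = Math.floor(i);
  const hi = Math.ceil(i);
  return s[lo] + (s[hi] - s[lo]) * (i - lo);
};
```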

>> No.9844216

>>9843600
It has no set name. People call it many different things, just like Earth.

>> No.9844255

>>9844216
I think you mean the 13th colony

>> No.9844669
File: 44 KB, 1448x768, postLength_middleThirdBar_averageDot.png

>>9843980
not sure if I did something wrong or the distribution is just fucked up across all boards.
The dot is the average post length and the bar stretches from the 34th to the 66th percentile.

I can only assume this is because of the mixing of large numbers of posts with no text with some outliers that are literal textwalls

>> No.9844688

>>9844669
How are you calculating the error bars?

>> No.9844709
File: 41 KB, 1421x755, postLength_middleThirdBar_medianDot.png

>>9844688
>How are you calculating the error bars?
I realized I am an idiot the moment I finished reading that. Mixing averages and percentiles...
Don't know why, but I did (average - 34th percentile) for the bottom error.

This one is now 33rd---50th---67th; still hard to make out anything useful in that mess

>> No.9845054
File: 9 KB, 523x686, hr automation test.png

OP here with an update - I've gotten a good chunk of the automation sorted out. Individual thread info is gonna take a while, but I've got most of the broad board statistics stuff partially working. The pic from >>9837468 took like an hour to assemble going through the object data by hand. Pic related was processed in about 40 seconds so that's a massive fucking improvement.

As you can see there are still some kinks to work out - there seem to be some issues with identifying worksafe and NSFW boards from the API data, so that's got to get fixed, and I don't have it doing any kind of statistics or normalization yet. But it's definitely progress and we see that the automated results get roughly the same shape as the manually assembled plot so that's a good sign.

>> No.9845945

>>9845054
Wow, that is a big improvement. I'm sure doing stuff like thread statistics will take a lot more time (0.5 seconds, 150 threads, 72 boards = 1.5 hours) but at the very least this means it should be possible to look at how the broader board statistics change throughout the day or in response to events and things.

It'd be interesting to see which boards are stable and which move around on the images-per-reply vs average-posts-per-day plots... who knows maybe there are even specific transition paths boards will follow as they go between different regimes.

>> No.9846024

How do you parse the site?

>> No.9846150
File: 158 KB, 1153x905, snap2.png

Creating a full snapshot right now with all pages included this time.
If everything works out as intended, I will post the data in a bit.

>> No.9846364
File: 3.27 MB, 240x320, 1475147064143.gif

>>9844709
Cool stuff. As for usefulness, there are a few things that leap out of the diagram.
- first off, image-heavy boards tend to be very brief, most postings just an image, with a few posts of significant length. Looking at it, the left part of the error bar overlaps an average length close to zero.
- as image ratio decreases, so does the variance. My hypothesis is that the short ones are reaction images, but the longer ones are when reaction images are less suitable
- averages are skewed to the left. My hypothesis 2 is that heavily left-skewed boards have a lot of poo posters.
- boards like /diy/ are VERY slow and centered. Uninteresting posts sink slowly while interesting posts live on for a long time.

So from your excellent graphs we present a few hypotheses that can be tested (with a lot of work).

>> No.9846527
File: 205 KB, 739x799, analysis_result.png

>>9846150
about 45 minutes to fetch everything with this 6-in-parallel setup, with requests from different IPs.

Full snapshot of all threads from all boards:
https://drive.google.com/file/d/1FGMRwhTOEmfM2xS2VPoRKSYrEYKfAO6D/view?usp=sharing

some data after processing the threads (with some extra data I took from 4stats like "avgPostsPerDay"):
https://pastebin.com/YqePgRMz
(does this look about right?)

The shoddy work-in-progress code I used to fetch and analyze the thread data:
https://github.com/Nocory/4chan-detailed-statistics

>>9846364
hm, the full snapshot should be more helpful now to find any interesting correlations
>boards like /diy/
yeah, the smaller boards can vary wildly in their activity from day to day.


>>9846024
the public 4chan API

>> No.9846773
File: 166 KB, 1030x219, 4chan data sample.jpg

Programming update - I have most of the data collection code worked out and have it collecting everything from average users per thread to post lengths to average reply numbers, and deviations for everything collected on a per-board basis (so stuff like Top PPM and average posts per day aren't going to have error bars).

As stated in >>9845054, I can do basic board stats in about 40 seconds, but because of the API call limit (0.5 seconds) it takes a LOT longer to do thread analysis on individual boards. Running a test case of 10 boards took about 25 minutes, so extrapolating you're looking at about 3 hours for a full sweep and there's not really any getting around that without setting up some kind of parallel data collection with proxies and stuff like >>9846527 (which is super impressive by the way, well done)

I want to try running at least one full sweep, threads and all, just to see what kind of data I can get, but I think I'm going to focus more on seeing what I can get out of the shorter sweeps - they can be done faster and still yield a lot of data. I may look at something like >>9845945 suggested - looking at how different plots evolve over time and whether there are stable and unstable boards or fixed paths that boards take to transition from one part of a plot to another.

>> No.9846823

>>9837528
this paper is actually informative and entertaining. thanks based anon

>> No.9846858
File: 202 KB, 1462x740, Screen Shot 2018-07-03 at 7.45.39 PM.png

>>9846823

>we followed standard ethical guidelines
>the rest of this paper features language likely to be upsetting.

>Although deeper analysis of these differences is beyond the scope of this paper, we highlight that, for some of the countries, the "rare flag" meme may be responsible for receiving more replies. I.e., users will respond to a post by an uncommonly seen flag. For other countries, e.g., Turkey or Israel, it might be the case that these are either of particular interest to /pol/, or are quite adept at trolling /pol/ into replies (we note that our dataset covers the 2016 Turkish coup attempt and /pol/ has a love/hate relationship with Israel).

>So-called janitors, volunteers periodically recruited from the user base, can prune posts and threads, as well as recommend users to be banned by more “senior” 4chan employees. Generally speaking, although janitors are not well respected by 4chan users

>We find that 12% of /pol/ posts contain hateful terms, which is substantially higher than in /sp/ (6.3%) and /int/ (7.3%).

they sure had fun with this research

>> No.9846860

>>9837468
I've been wanting to create a 4chan archiver for research for so long. When I learn how to properly program webcrawlers and get a few spare TBs of disk space, I will get all the stats and graphs we nerds need. Unless someone comes and does it first, sure

>> No.9846864

>>9846860
>a few spare TBs of disk space
You're going to need a LOT more than that.

>> No.9846867
File: 65 KB, 540x868, correlation.png

>>9846773
you did one request every 0.5 seconds without any problems?
The API docs said not to make more than one request per second, and I even set it to 1.1s when I fetched them, just to be on the safe side and not error out.

I've wondered before what the actual limit is and what kind of API usage the admins are OK with (actual traffic and request limits).
It would be interesting to know how the various archive sites go about fetching the data, especially when an archive covers several boards and maybe even saves images.
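For reference, the conservative 1.1s spacing described here can be packaged as a small throttling helper. This is just a sketch; the clock and sleep functions are injectable so the pacing logic can be tested without real waiting:

```python
import time

def throttled(items, min_interval=1.1, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than one per `min_interval` seconds -
    e.g. board names or thread URLs that are about to be fetched."""
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

Usage would be something like `for board in throttled(board_list): fetch(board)`, which keeps successive requests at least 1.1s apart regardless of how fast each fetch returns.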

on a side note
I tried checking for correlation between different board properties and ended up with this
https://pastebin.com/42rsWSdm
most of it is fairly obvious again, but there are also some interesting starting points, like
>[ 'imageRatio-avgPostsByPoster', '0.67' ]
the more image focused a board, the more often a user posts in the same thread repeatedly (image dumping threads)
with outliers like /mlp/, /vg/, and /qst/, where people post repeatedly without uploading many images at the same time

>>9846858
>We find that 12% of /pol/ posts contain hateful terms, which is substantially higher than in /sp/ (6.3%) and /int/ (7.3%).
text analysis of posts would be interesting to do as well on the thread data
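For anyone wanting to reproduce a table like the pastebin above, a plain Pearson correlation over per-board value pairs is all it takes; a sketch below, with made-up board numbers purely for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# hypothetical per-board pairs: (image ratio, avg posts by poster)
boards = {"/wg/": (0.95, 6.0), "/a/": (0.30, 2.1), "/sci/": (0.15, 1.8)}
xs, ys = zip(*boards.values())
r = pearson(xs, ys)  # one entry of an imageRatio-avgPostsByPoster style table
```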

>> No.9846869

>>9846864
Yeah, I planned on just saving a few hours' worth of complete 4chan history - it's not like I could afford to go much further than that without spending thousands of dollars on storage. Maybe do the research on those hours, dump everything (saving the data/most used words/most evolved pictures/trends and flavors of the day), and fetch shit again to make new research. We could play with some Neural Network and make it a piece of shit, or discover trendy pastas or whatever

>> No.9846897

>>9846867
The API's GitHub page says the limit is 2 requests per second from a single IP, but you can also do batches of 20 (don't know what the cooldown is for that though)

>the more image focused a board, the more often a user posts in the same thread repeatedly (image dumping threads)
Makes sense - most threads that involve lots and lots of image posting are going to be things like porn dumps, rekt/ylyl/reaction threads, storytimes, etc etc where one person or a few people post a lot of content and people lurk

>> No.9847564

>>9846858
That table would be very sensitive to the time of day of the snapshot. National holidays would also skew things.

You could do a time of day combined with day of week analysis. The sky really is the limit here.

>> No.9847758

>>9846897
are you mixing things up, or are we just talking about different APIs?

2 requests per second and 20 burst is the limit I set for the 4stats.io API.
The 4chan.org API on the other hand has a 1 request per second limit.

>batches of 20 (don't know what the cooldown is for that though
Basically think of it as some kind of overheat meter, that refuses requests if it goes above 20, but cools at a rate of 2 per second :p
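That "overheat meter" is a textbook token bucket. A minimal model of it (capacity 20, cooling at 2 per second per the description above, with an injected clock so the behavior is testable):

```python
class TokenBucket:
    """Token bucket matching the described limits: bursts of up to 20
    requests, refilling ("cooling") at 2 tokens per second."""

    def __init__(self, capacity=20, refill_per_sec=2.0, now=0.0):
        self.capacity = capacity
        self.refill = refill_per_sec
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # add tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```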

>I can do basic board stats in about 40 seconds
are you sending one request per board to 4stats, when assembling the data?
The /allBoardStats endpoint should give you everything you need in a single one.
https://api.4stats.io/allBoardStats
>but because of the API call limit (0.5 seconds) it takes a LOT longer to do thread analysis on individual boards.
So are you using the 0.5s delay for both 4stats and 4chan API requests?

>> No.9847894

>>9847758
Yes, I did have those request limits mixed up. I've been calling both with a half-second delay between requests - which, since each board takes about 2.5 minutes to complete, means the 4chan API is only delivering on the requests every 1 second regardless of my delay. Do you have a link to the docs for the 4chan API?

Calling the 4stats allBoardStats endpoint gets you a ton of data, and you can analyze it all and produce plots and stuff in under a minute, easily (see >>9845054), but as far as I can tell there's only certain types of data collected under that. You can get stuff out of it like posting averages, image reply rates, etc, but getting more detailed info like average post length or users per thread appears to require looking at individual threads, which is why the process takes so goddamn long.

I'm sure there's a better way, but I haven't found one yet (at least not one that's within my programming ability).

>> No.9847915

>>9843849
>The IQ axis
>/co/ is even below /hr/

LOL

>> No.9848042
File: 21 KB, 826x201, size.png [View same] [iqdb] [saucenao] [google]
9848042

>>9847894
4chan API docs are here https://github.com/4chan/4chan-API

>but as far as I can tell there's only certain types of data collected under that
>getting more detailed info like average post length or users per thread appears to require looking at individual threads
yeah, it's only the data that can be calculated from the catalog alone. It would be great if at least 'unique_ips' was part of the catalog as well.
The site is supposed to provide (semi-)live stats, and taking 3 hours to update would be way too long. Currently one full board update cycle is 5 minutes.

Though I store some data long-term to calculate daily peak posts/minute and to use the data for the timeline graph at the bottom of the page.
From that it would be possible to calculate some other interesting things, like volatility of a board or the time of day it's usually most active.
How to load historic data is also detailed in the API docs if you are interested.
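As one example of what that stored history enables, "volatility" could be defined as simply as the coefficient of variation of the posts-per-minute series (my own definition here as a sketch, not anything 4stats actually computes):

```python
from statistics import mean, pstdev

def volatility(ppm_history):
    """Standard deviation of a board's posts-per-minute history relative
    to its mean: near 0 for steady boards, large for bursty ones."""
    m = mean(ppm_history)
    return pstdev(ppm_history) / m if m else 0.0
```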

>I'm sure there's a better way
I mean, there isn't really any alternative to just going through all boards and fetching every thread.
The best thing to do would probably be some kind of centralized server setup that continuously creates board snapshots and makes the raw and extracted metadata available for download.
That way you, I, or someone else wouldn't need to waste hours loading the same data again.

The size of the raw snapshot for me is 352MB, the extracted metadata (without full-text comments) is only 9.1MB, and the result of the analysis of the metadata ends up being only 100KB.
So having a way to just directly load 1.2MB of zipped metadata would be a huge time and space improvement.
And even if someone wanted to do analysis of all post comments (let's say you wanted to track the spread of the word "boomer" across different boards), that would only be another 25MB zipped download.

>> No.9848113

>>9848042
I'm running my first attempt at a full sweep and already I've run into a few issues with the higher traffic boards - threads are disappearing faster than they can be requested (even reversing the thread order to try to snag the threads at the bottom before they disappear isn't foolproof), so when all is said and done and the sweep finishes in another hour or so, this will probably be a lot of incomplete data.

Like I said - long term I think I'm going to focus on *just* the boardstats data. It doesn't have statistics on users, replies, etc but it still has potentially useful statistics and it doesn't take three hours to collect data from it.

>> No.9848154

>>9848113
same thing for me.
Starting from the last thread of the last page, I only got 135/150 /b/ and 192/202 /pol/ threads, for example.
My guess is that most of the missing ones are due to manual mod deletion though.

>> No.9848176

>>9848154
That's true, I hadn't even thought about thread deletions - that could be a major issue, especially on boards with a lot of active mods and janitors. The problem really comes down to the call limit for the API - 1 second between threads means you're looking at 2 1/2 minutes (minimum) per board... most boards tend to update every 30 or 60 seconds, so that's 2-5 board updates, plus any deletions taking place... you could be losing dozens of threads on high traffic boards.

My biggest concerns with this at the moment are (a) the possibility that this will skew results significantly and (b) Mathematica will automatically disable a command if it receives null or unexecutable inputs too many times (a feature designed to stop the program from getting stuck in broken loops and the like), so there's a concern that if too many threads are removed it may just stop collecting data for the board altogether, which means a big waste of time for no results. I've already gotten several error messages about it trying to do statistical calculations on null data sets, which means it's not collecting data on certain threads or possibly even whole boards.

>> No.9848253

>>9848176
Speak of the devil - the sweep just finished. Full run time was about 2:40, a little shorter than expected, but that may be because some boards terminated prematurely: I wasn't able to get details for about a dozen boards including /int/, /trv/, /n/, /x/, /ic/, etc, so yeah, I don't think doing full site sweeps like this is a very viable option.

That said - a few highlights from the results of the sweep:

1) There seems to be no real correlation between board activity (Avg PPD) and the number of users per thread; nearly all of the boards (with only a handful of exceptions) tend to average about 20-60 posters per thread. The standard deviation for posters per thread is similarly narrow, with almost every board having a deviation of about +/- 30-40 posters around its respective mean.

2) Average replies per thread seem to follow a sort of bell curve on a log scale. Low traffic boards don't get many replies and high traffic boards probably have short attention spans and threads are quickly abandoned once they get to more than a couple dozen posts. The sweet spot seems to be boards with a few thousand posts per day, many of which get up into the 100s for thread length. These threads also have the highest variance in thread length. The exception is /vg/, which is not only extremely high traffic, but also has nearly all of its threads reaching capacity.

3) It should come as no surprise, but the porn boards have the highest image-per-user rate of the site. /c/ posts an average of 4.1 images per user, and most of the other porn boards range from 2-3 IPU. /lit/, /sci/, /fit/, and /adv/ have some of the lowest IPU counts of the site.

4) /qst/ has the highest average post length per user on the site by a WIDE margin, coming in at about 1050 characters per user. The two runners-up are /mlp/ and /tg/ at about 660 and 340 CPU respectively. Not surprisingly, the porn boards are also at the bottom of the CPU counts, with many of them at sub-100 characters per user

>> No.9848257

>>9848176
can't you just check if the response is valid before passing it on to be analyzed?
It's also worth it imo to add some retry functionality. Very rarely the API doesn't respond in time in my experience, but then immediately sends the data just fine on a second attempt.
>(a) the possibility that this will skew results significantly
Threads that get deleted probably didn't have time to get very big in the first place
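A sketch of that validate-and-retry pattern (the opener is injectable for testing; `None` signals a thread that is really gone, e.g. deleted mid-sweep, so the caller can skip it instead of feeding nulls to the analysis):

```python
import json
import time
import urllib.request

def fetch_json(url, retries=2, delay=1.1, opener=urllib.request.urlopen):
    """Fetch and parse a JSON endpoint, retrying on network errors or
    unparseable payloads; returns None once every attempt has failed."""
    for attempt in range(retries + 1):
        try:
            with opener(url) as resp:
                return json.loads(resp.read())
        except Exception:
            if attempt < retries:
                time.sleep(delay)
    return None
```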

>> No.9848279

>>9848257
It's doable, I'm just not sure it's worth the time or effort - as >>9848154 pointed out, some of these boards will lose dozens of threads in the time it takes to sweep through them, and if you're taking time to do a second call every time one fails to register, that time adds up fast - we're looking at 2 1/2 to 3 hours to sweep the site, and adding double-checks could easily push that into 3 1/2 hour territory or higher. Being able to look at detailed statistics on users, images, replies, and post lengths for threads on an individual board is great... but does it justify the ~15000% increase in runtime?

>Threads that get deleted probably didn't have time to get very big in the first place
Or they're big threads that got autobumped before there was time to analyze them. There's no way to tell.

>> No.9848661

>>9837467
You dumb nigger, everyone knows that 4chan is /pol/ + hobby boards, with the mentally ill and pedos congregating around /b/. This is why I hate /sci/, it's all glorified priests in white lab coats

>> No.9848772
File: 42 KB, 1483x833, nigger_jew_occurence.png [View same] [iqdb] [saucenao] [google]
9848772

I have successfully determined that boards that talk about "jew"s also mention "nigger"s more frequently.
Interesting to note that, at the extremes, /b/, /an/, /gif/ and /k/ tend to talk more about "nigger"s, while /biz/ and /lit/ seem to focus more on "jew"s.

>> No.9848873

>>9848772
log scales make more sense for this sort of stuff imo

>> No.9848932

>>9848661
t. '16er

>> No.9849476

bump

>> No.9849572
File: 181 KB, 339x359, comfy.png [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9837560
>That's fucking hysterical and I have a new reaction image now. Thank you anon.
Me too

>> No.9849714 [DELETED] 
File: 1.25 MB, 1390x1083, west_europe_board_distribution.png [View same] [iqdb] [saucenao] [google]
[ERROR]

>Boards line up with the UK and France
>Nearly all of the porn is in the UK
>/b/, /pol/, and /v/ are on the German border
What does it mean?

>> No.9849730

>>9848772
this is fucking hilarious

>> No.9849740

>>9848772
I'd just graph N:J ratio against total frequency. I'd also like to see the most unique word for each board, not that I'm sure how you'd figure that out. Something like [math]\max_w \left( \frac{\mathrm{count}_{w,\,\text{board}}^2}{\mathrm{count}_{w,\,\text{total}} \cdot \mathrm{count}_{\text{all words}}} \right)[/math] should work.
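One reading of that formula in code, treating the "all words" count as the board's total word count (which makes it TF-IDF in spirit); the counts below are invented for illustration:

```python
from collections import Counter

def most_unique_word(board_counts, total_counts):
    """Rank words by count_board^2 / (count_total * board_word_count)
    and return the highest scorer for the board."""
    board_size = sum(board_counts.values())

    def score(w):
        return board_counts[w] ** 2 / (total_counts[w] * board_size)

    return max(board_counts, key=score)

# hypothetical counts: "sneed" is rare site-wide but concentrated on one board
board = Counter({"the": 1000, "sneed": 50})
total = Counter({"the": 100000, "sneed": 60})
```

The squaring in the numerator is what rewards concentration: a word that is common everywhere scores low even if the board uses it a lot.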

>> No.9849832

>>9838219
also it is accessible in https://4stats.moe/

>> No.9849836

>>9844047
Images can be tracked even if slightly altered via their perceptual hash, which is how reverse-image search engines work. See https://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html for an example.
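The simplest variant from that article is the average hash. Here's a pure-Python sketch operating on an already-grayscale 2D pixel array (a real pipeline would do the downscaling with something like PIL's `Image.resize` first):

```python
def average_hash(gray, size=8):
    """Average hash ("aHash"): shrink a grayscale image to size x size via
    nearest-neighbor sampling, then emit one bit per cell (1 if brighter
    than the mean). Near-duplicate images land a few bits apart."""
    h, w = len(gray), len(gray[0])
    cells = [gray[y * h // size][x * w // size]
             for y in range(size) for x in range(size)]
    avg = sum(cells) / len(cells)
    return sum(1 << i for i, v in enumerate(cells) if v > avg)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Comparing `hamming(h1, h2)` against a small threshold (the article suggests ~5 out of 64 bits) is what makes this robust to recompression and minor edits.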

>> No.9849882

>>9849740
That may prove a bit more challenging than you think - counting words isn't the same as counting numbers; you have to take case sensitivity, spelling errors, etc. into consideration.

>> No.9849947

>>9848772
like >>9848873 said - recast this on a log/log plot

>> No.9850425

>>9848873
>>9849947
Except that there's one point that appears to be at (0, 0). I wouldn't log that.
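An exact zero does break a plain log axis, but mapping counts through log(1+x) (or using matplotlib's 'symlog' scale) keeps zero-count boards plottable while still compressing the spread:

```python
import math

def log1p_scale(values):
    """Map counts through log(1 + x): zeros stay at 0, ordering is
    preserved, and orders-of-magnitude gaps are compressed for plotting."""
    return [math.log1p(v) for v in values]
```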

>> No.9850449

>>9850425
I refuse to believe there's a single board on 4chan without at least one mention of either word... if there is, it's a momentary fluke.

>> No.9850490
File: 69 KB, 1483x832, jn.png [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9849947
>>9848873
here is some data with log scale, though it's not the exact same dataset. This one now is from ~30 minutes ago

The previous one showed how much of a user's posts consisted of the word on average.
This one charts the ratio of posts that include either word at least once.
(with possible false positives when someone is talking about "jewels")
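Whole-word matching kills the "jewels" class of false positives; a sketch of the post-ratio metric with that fix:

```python
import re

def mention_ratio(posts, word):
    """Fraction of posts containing `word` as a whole word,
    case-insensitively - so "jewels" no longer counts as "jew"."""
    pat = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(1 for p in posts if pat.search(p)) / len(posts) if posts else 0.0
```

The trade-off: strict `\b` boundaries also stop counting variants like "jewish", so a real keyword list probably wants explicit alternations per word.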

>> No.9850502

>>9850490
>with possible false positives when someone talking about "jewels"
haha, that would explain /cgl/
and maybe /tg/, /his/ and /lit/

>> No.9850550
File: 181 KB, 1307x846, boomerPosts.png [View same] [iqdb] [saucenao] [google]
[ERROR]

/biz/ and /lit/ surprisingly taking the lead in percentage of posts with mentions of "boomer"

also of all the boards only /f/ and /r/ had no mention of "jew"

/c/ was the only board with zero occurrence of the word "nigger"

>> No.9850597

>>9850550
you mean in threads right now or does it go through the archives too?
there's no way no one on /c/ ever told someone their waifu takes bbc

>> No.9850605

>>9850597
yeah, just the current snapshot of threads that are in the catalog right now

>> No.9850609

>>9850597
>>9850605
we would have a more comprehensive data set if samples were taken over multiple days

>> No.9850727
File: 90 KB, 686x732, snapper3.png [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9850609
yeah, that would be ideal. Some kind of history where you could take the average over the last day or compare day/night and weekday/weekend stats.

I set up a server to continuously generate snapshots, though right now it only stores the latest one and doesn't keep any older data.
A full sweep of all boards also doesn't take too long (80-90 minutes), since a lot of threads on the slower boards don't change between catalog checks and therefore don't need to be requested again.
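That skip-unchanged trick presumably keys off the per-thread `last_modified` field in the catalog. A sketch of the bookkeeping (the field names follow the 4chan threads.json layout, but treat the helper itself as hypothetical):

```python
def changed_threads(catalog, seen):
    """Return thread numbers whose last_modified moved since the previous
    sweep; `seen` maps thread no -> last_modified and is updated in place.
    Assumes catalog pages carry a per-thread last_modified field, as the
    threads.json endpoint does."""
    changed = []
    for page in catalog:
        for t in page["threads"]:
            if seen.get(t["no"]) != t["last_modified"]:
                changed.append(t["no"])
                seen[t["no"]] = t["last_modified"]
    return changed
```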

I added some work-in-progress API endpoints to 4stats.io
https://api.4stats.io/snapshotMetaAnalysis
https://api.4stats.io/snapshotTextAnalysis
that data will automatically be updated every time the server finishes processing another board

Probably going to try to add it to the site, so people can play around with it and compare different stats themselves and see if they find something interesting.

>> No.9850785

>>9850727
Including this kind of meta data in the API endpoint is EXTREMELY helpful! Even having just basic statistics on average reply lengths, OP lengths, poster counts, etc opens up so many possibilities to compare, and having it done through the site itself means that work that would have taken hours or days now takes minutes.

Bless you based 4stat anon.

>> No.9850958

I think it would be interesting to look at meme'd etiquette (if you can call it that). For instance, the prevalence of "reddit spacing"

>> No.9851111

>>9850958
Or a search for copypastas.

>> No.9851115

>>9851111
or trigger words? for instance, if sneed is mentioned in a /tv/ thread, that should increase the chance of the thread being deleted.

>> No.9851144

>>9850958
eh, most people complaining about "reddit spacing" are just complaining about how pre-2010 forum posts were sometimes formatted

>> No.9851358

>>9851115
Or just what words get the most replies on each board. Make an OP-generating machine learning algorithm to optimise the most replies per thread.

>> No.9851363
File: 5 KB, 251x240, 1326504610415.jpg [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9851358
>Make an OP-generating machine learning algorithm to optimise the most replies per thread.

>> No.9851370

>>9851358
a smart algorithm would post a baity but shit OP and then five posts later post one of those roll for a waifu derail images. instant 300 replies

>> No.9851580

>>9848772
>/gif/ being this high
cracked me up

>> No.9851763

>>9850449
It would be a typical 4chan thing to keep the language clean just to blow up some statistics.

>> No.9852002

>>9851763
>inb4 /pol/ goes completely clean for a day just to fuck with us

>> No.9852253

>>9852002
They would blow their communal gasket trying but they would do it.

>> No.9852407

Bump
this thread is great

>> No.9853100

>>9851580
surprised /lit/'s as far as it is

>> No.9853108
File: 404 KB, 1272x1800, the tumult of niggers.jpg [View same] [iqdb] [saucenao] [google]
[ERROR]

>>9853100
its probably due in part to this

>> No.9853582
File: 191 KB, 1243x507, comments.png [View same] [iqdb] [saucenao] [google]
9853582

>>9850727
working on getting the history functioning, but it's a bit more complicated than expected, as these things usually are.
Storing the metadata is no problem (~8MB per snapshot), but I was wondering whether it would be worth it to also keep thread comments for a while, so it's always possible to go back and check for certain new words or properties without first waiting a few days for new data to be gathered.
Storing comments is going to take a lot more space though (~90MB per snapshot).
Of course it would be possible to save space by de-duplicating the comment data, since many comments exist in multiple successive snapshots - you just store a comment once and then keep a reference to it in each snapshot.
Though I really like the concept of having full snapshots with all comments in human-readable form, where you can just copy one file and immediately have everything, instead of querying a database to generate the comment snapshot from the stored post-number references.
Assuming a full snapshot of meta+text data is 100MB, with ~1 snapshot per hour, that would end up being
100MB * 24 * 7 = 16.8GB for one week, which is quite a bit and could easily reach the limit of the disk, as everything is just running on a small 25GB SSD virtual server.
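The de-duplication idea is content-addressed storage in miniature: store each comment body once under its hash, keep per-snapshot reference lists, and materialize a human-readable snapshot on demand. A sketch:

```python
import hashlib

class CommentStore:
    """De-duplicated comment storage: each unique comment body is kept
    once, keyed by hash; a snapshot is just a list of hashes."""

    def __init__(self):
        self.bodies = {}

    def add_snapshot(self, comments):
        """Store a snapshot's comments, returning its reference list."""
        refs = []
        for c in comments:
            h = hashlib.sha1(c.encode("utf-8")).hexdigest()
            self.bodies.setdefault(h, c)
            refs.append(h)
        return refs

    def materialize(self, refs):
        """Rebuild a full human-readable snapshot from stored references."""
        return [self.bodies[h] for h in refs]
```

Since hourly snapshots of a slow board overlap almost completely, the per-snapshot cost collapses to the reference list plus whatever comments are actually new.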
>>9853100
/lit/ also had a /boomercore/ thread at the time I checked the comments for >>9850550
which alone made up a good chunk of the mentions.
The data is only from a single point in time and varies quite a bit between checks. I wouldn't read too much into it until it's an actual average over at least a few days.

>> No.9853675
File: 1.00 MB, 2000x2000, 1530732809931.png [View same] [iqdb] [saucenao] [google]
9853675

>>9837467

>> No.9853698

>>9853675
[citation needed]

>> No.9853721

>>9853675
That graph is impossible

>> No.9853763

>>9853582
Trying to save full snapshots for text analysis is admirable, but as with any potential rabbit hole you might go down I find it's important to ask yourself three basic questions:

1) How much extra work does this involve?
2) What can you get out of this that you can't get out of other, simpler methods?
3) Do the results justify that extra work?

I might be wrong, but it sounds like this is a LOT of extra work, storage, etc for something with only a few applications, many of which can be accomplished with less effort and space just by summarizing the data into posting statistics (post length, use of keywords, etc) as opposed to saving a complete snapshot of every post on the site. That might be oversimplifying or misrepresenting things, and if it is, then I think it's important to lay out what else can be done with this data that can't be done another way, and whether it's worth it.

>> No.9853803
File: 1.71 MB, 374x219, 1489305990504.gif [View same] [iqdb] [saucenao] [google]
9853803

Not him, nevertheless:
>>9853763
>1) How much extra work does this involve?
Probably a bit, but not too overwhelming

>2) What can you get out of this that you can't get out of other, simpler methods?
This is /sci/: hard facts and measurements with error bars trump opinions any day.

>3) Do the results justify that extra work?
Yes.

Stats-anon is doing a good job and I salute his efforts. Data science is a major buzzword these days and his work is a lot better than a lot of the other gunk I see passed off as "science." It is also an effort that involves /sci/ and provides an excellent opportunity to think, not just about the results but also about the methods and the implications, the conclusions and also the pitfalls, about who we are in here and how much variation there is.

>> No.9853939

>>9853803
Don't get me wrong, I'm not saying this whole thing isn't a worthwhile endeavor - I'm talking specifically about the problems posed by archiving and analyzing full text snapshots of the site - as multiple people working on this have pointed out it takes GB of storage and hours of processing to do real, in-depth analysis of stuff.

As with any experimental method, or theoretical or computational analysis, you get to a crossroads where you have to ask yourself "am I trying to do too much work for too little payoff?" You don't do laser-induced fluorescence to analyze a plasma if a Langmuir probe will do well enough, and you don't fill up a 25 GB SSD server with endless text data if just storing summarized data points is enough.

>> No.9854489

bump

>> No.9854610

>>9853939
>Don't get me wrong, I'm not saying this whole thing isn't a worthwhile endeavor - I'm talking specifically about the problems posed by archiving and analyzing full text snapshots of the site - as multiple people working on this have pointed out it takes GB of storage and hours of processing to do real, in-depth analysis of stuff.
GB storage was hard 20 years ago, but now disks are in the TB range. And hours of processing - well, that's only while prototyping. I cannot see this being a problem.

>As with any experimental method, or theoretical or computational analysis you get to a crossroads where you have to ask yourself "am I trying to do too much work for too little payoff.
Sometimes plain exploration is the goal in itself. As I believe it is here.

>You don't do laser induced fluorescence to analyze a plasma if a langmuir probe will do well enough, and you don't fill up a 25 GB SSD server with endless text data if just storing summarized data points is enough.
I disagree with the comparison - the tech you describe is mature. In the early days it would have made sense, if only to check that the two methods gave comparable results. In fact, the use of multiple methods is a sign of healthy scepticism.

And all this gives us a good discussion and more insight.

>> No.9854943

>>9853582 again
turns out that processing a week of comments isn't actually that trivial.
With some boards like /v/ or /pol/ there may be ~850,000 comments to go through.
I've only tested it briefly, but the most time-consuming part seems to be fetching the items from the disk/database.
It still looks doable though, even if it takes a moment.

There is also a tricky issue in how to best go about analyzing a board's comments.
Just using comments that were posted during a certain time window may not work for slower boards, as it wouldn't be enough to get a good picture (lots of boards get less than 1k posts a day).
But simply taking what's currently visible also doesn't seem ideal, as fast boards change way too quickly to yield results with any consistency.
A good compromise may be to take what's currently visible on a board + comments from the last week
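That compromise - currently visible posts plus anything seen in the last week, de-duplicated by post number - could look like this (the `no`/`time` field names follow the 4chan API's post objects, but the helper itself is hypothetical):

```python
def analysis_window(visible, recent, now, max_age=7 * 24 * 3600):
    """Union of currently visible posts and recent history, de-duplicated
    by post number; `visible` and `recent` are lists of post dicts with
    `no` (post number) and `time` (UNIX seconds) fields."""
    posts = {p["no"]: p for p in recent if now - p["time"] <= max_age}
    posts.update({p["no"]: p for p in visible})  # visible copy wins on overlap
    return sorted(posts.values(), key=lambda p: p["no"])
```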

>> No.9855097

Working on processing about half a dozen different figures right now based on the new metadata JSON >>9850727

The plots range from image replies vs posting activity, to replies per thread vs OP length, image responses vs text responses, etc. I'll try to get the framing and labeling and shit done in the next half hour. If anyone has any requests or suggestions post them and I'll get to them if I have time.

>> No.9855137
File: 30 KB, 640x1033, 4chanhr-imagesvsppd.png [View same] [iqdb] [saucenao] [google]
9855137

Dumping. Some of these show some interesting trends, others are pretty obvious, but still nice to confirm. Starting with an updated version of the OP figure.

>> No.9855138
File: 28 KB, 960x733, 4chanhr-rwivsrwt.png [View same] [iqdb] [saucenao] [google]
9855138

>> No.9855140
File: 28 KB, 640x992, 4chanhr-rptvsppt.png [View same] [iqdb] [saucenao] [google]
9855140

>> No.9855145
File: 28 KB, 640x995, 4chanhr-rptvsopl.png [View same] [iqdb] [saucenao] [google]
9855145

>> No.9855147
File: 31 KB, 640x1023, 4chanhr-aplvsppd.png [View same] [iqdb] [saucenao] [google]
9855147

>> No.9855148
File: 27 KB, 640x985, 4chanhr-aplvsppp.png [View same] [iqdb] [saucenao] [google]
9855148

>> No.9855173

>>9851358
>in the near future:
>12 SQT threads

>> No.9855275

>>9855137
>>9855138
>>9855140
>>9855145
>>9855147
>>9855148
Man, the porn/lewd boards stick out like a sore thumb in basically every plot.

>> No.9855632

>>9855137
Excellent stuff, anon.

>> No.9855825

Coolest thread in all 4chan

>> No.9855827

This thread is golden; any self-respecting mod would encourage it by making it a sticky, along with subsequent threads like this.

It's a shame that this board has been largely unmoderated since 2014

>> No.9856031

>>9855137
>>9855148
These two are particularly interesting in that there actually seem to be large-scale structures - isolated groups, large bands with branches extending out

>> No.9856225

made some progress, but the code is a total work-in-progress clusterfuck until I find time to clean it up.
I hope that everything is accurate, but can't guarantee it 100%. Will just go back tomorrow and see if everything still looks alright, especially the day-averages.

https://api.4stats.io/textAnalysisLastDay
https://api.4stats.io/metaAnalysisLastSnapshot
https://api.4stats.io/metaAnalysisLastDay

"textAnalysisLastDay" is the result of all currently visible posts + any other posts from the last 24 hours that the server saw (even if they are no longer on the board)
"metaAnalysisLastSnapshot" is just the last snapshot result
"metaAnalysisLastDay" is the average of the individual snapshot results from the last day

>> No.9856255
File: 150 KB, 500x478, a5bcb1b5ceadb86013877ad7956179cf40b9a49809aa014236999595742eb1ac.jpg [View same] [iqdb] [saucenao] [google]
9856255

OP was not a faggot indeed

>> No.9856377

>>9853675
>/mu/
>114

>> No.9856395
File: 99 KB, 424x420, 1530309839938.png [View same] [iqdb] [saucenao] [google]
9856395

now also as .csv files instead of json

https://api.4stats.io/csv/textAnalysisLastDay
https://api.4stats.io/csv/metaAnalysisLastSnapshot
https://api.4stats.io/csv/metaAnalysisLastDay

>> No.9856436
File: 45 KB, 1290x747, reddit_boomer.png [View same] [iqdb] [saucenao] [google]
9856436

no reason this has to be a scatter plot, but anyway

/tv/ is far ahead of every other board when it comes to the ratio of posts mentioning "reddit", at ~1.1% of posts
/biz/ has almost 1.7% of all current posts mentioning "boomer", with /fit/ following close behind

>> No.9856469

>>9856436
What's the point size based on?

>> No.9856473

>>9856469
average posts per day for the board

>> No.9856479

>>9856473
Ooh, nice, that's an interesting way of incorporating a third axis into a 2D plot, I might try that on some of mine.
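In matplotlib that third axis is just the `s=` argument of `scatter` (marker area in points²); a sketch with invented board numbers:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# hypothetical (x%, y%, posts-per-day) triples per board
boards = {"/tv/": (1.1, 0.4, 80000), "/biz/": (0.3, 1.7, 30000), "/sci/": (0.2, 0.3, 9000)}
xs, ys, ppd = zip(*boards.values())
sizes = [p / 200 for p in ppd]  # scale activity down to a readable marker area

fig, ax = plt.subplots()
ax.scatter(xs, ys, s=sizes, alpha=0.5)
for name, (x, y, _) in boards.items():
    ax.annotate(name, (x, y))
ax.set_xlabel('% of posts mentioning "reddit"')
ax.set_ylabel('% of posts mentioning "boomer"')
fig.savefig("bubble.png")
```

Since `s` is an area, scaling it linearly in posts-per-day exaggerates big boards less than scaling the radius would; pick the divisor to taste.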

>> No.9856505

>>9856436
You might want to add in synonyms like "pleddit."

>> No.9856574

>>9837528
>un funded
>/pol/
your tax-dollars ladies and gentlemen.

>> No.9856629

>>9856574
Missing from the "analysis" is the question of how much is hate and how much is basement edginess. After all, for all the words slung around, we have a lot of posts from Israel in the /watch/ generals over in /g/, showing berets and Hebrew keyboards - people who don't seem to be scared away.

>> No.9856718

>>9856629
yeah, "what a nigger" has surpassed hate and is now in the realm of "what a jerk", at least on 4chan & other internet communities

>> No.9856737

>>9856718
>"what a nigger" has surpassed hate and is now in the realm of "what a jerk", at least on 4chan
it's been like that since the early days; we've even had "nigger" and "faggot" wordfiltered at times due to their constant, near-universal use as a generic insult or pejorative

>> No.9856813

>>9856505
I guess I could just try "eddit" instead. That should catch everything.

>> No.9857125

>>9856813
>plebbit

>> No.9857255

>>9856436
I think a logarithmic plot might be better suited for this

>> No.9857284

anyone have ideas in terms of how they want all these charts presented?
I've got a pirated copies of Photoshop and illustrator that i'm dying to use

>> No.9857309

>>9857284
1) go to your library and check out everything by Ed Tufte
2) read
3) do what he says to do in the books

>> No.9858341

>>9857255
seconding - anytime you're looking at a difference of an order or two of magnitude along an axis you should consider a log scale

>> No.9858894
File: 79 KB, 1315x828, chart.png [View same] [iqdb] [saucenao] [google]
9858894

Spent some time working on a page to interact with the stats directly on the site.
It automatically loads the latest available data from the snapshot API.

All work in progress and still needs visual polish, but works quite alright already.
Play around with it, if you want.

https://4stats.io/snapshotAnalysisWorkInProgress

>> No.9859022

>>9858894
now this is podracing

>> No.9859156

>>9858894
God damn dude, that's fucking impressive.

>> No.9859242

>>9855137
>>9855138
>>9855140
>>9855145
>>9855147
>>9855148
>>9856436
>>9858894
This is fantastic work guys!

>> No.9860022

>>9857284
Almost any plotting tool should be fine - I think it's more a matter of working out what parameters give us the best "description" of 4chan board activity.

>> No.9860059

>>9858894
this is beyond great work dude

>> No.9860281
File: 55 KB, 768x768, eeyore.png [View same] [iqdb] [saucenao] [google]
9860281

If you compare overall posting activity to posts per user, it looks like Eeyore. Is this our equivalent of Ellis' quantum penguin diagrams?

>> No.9860376
File: 52 KB, 789x777, s4s.png [View same] [iqdb] [saucenao] [google]
9860376

>>9851763
>>9852002
or [s4s] doing the opposite
>>>/s4s/6892903

it's not really affecting the % of post-mentions, but it's visible in the text ratio

>> No.9860512

>>9860376
That's pretty funny actually.

>> No.9860525
File: 46 KB, 1600x1200, 4chan stats board mention.png [View same] [iqdb] [saucenao] [google]
9860525

>>9837467

>> No.9860532

>>9860525
Cool. Interesting thought though, you could try organizing the boards by some other statistic (ex. posts per day or average post length) instead of alphabetically and see if there are any contours.

>> No.9860575

>>9860525
Hey OP, do text and post ratios for mentions of each board. There might be interesting correlations between mentions of /pol/ and mentions of /b/ for instance

>> No.9860596

>>9860525
what is up with the linking of /m/ on /soc/?

>> No.9860608
File: 42 KB, 562x437, 1526056722812.jpg [View same] [iqdb] [saucenao] [google]
9860608

>>9860525
>/r/
>>>/r/eddit
lmao

>> No.9860679
File: 332 KB, 1426x757, scilab.png [View same] [iqdb] [saucenao] [google]
9860679

>>9860532
This was made by another /sci/ anon a few years ago. It should be trivial to recreate and improve on it though. The difficulty seems to be in collecting data.

>>9860575
Interestingly enough, as I recall the anon originally posted this for that purpose. Specifically it was to point out how all the boards consistently tell /pol/esmokers to go back to their containment board. It was a /sci/ meta thread talking about all the /pol/ spammers and Moot participated as well (and for the next several days he had /pol/ playing some cuckold training videos and shit nonstop).

>>9860596
It's actually caused by all the lonely males desperately responding to 'a/s/l' requests.

>>9860608
lol, board mentions are a pretty good containment board detection heuristic.

>>9856225
It's interesting to see people putting effort into this sort of thing. Too bad it looks like there are /pol/esmoker elements behind the wheel given the keywords chosen.

>>9848042
>>9847894
Have you two considered using something like AWS for the data collection (to deal with the requests-per-IP limit)? Alternatively, 4chan actually goes out of its way for special cases (e.g. board archival sites), so another option would be asking the 4chan admins if they'd be willing to provide a better way to obtain the data so that we don't have to hammer their API.

Perhaps at some point I'll join in depending how this progresses. I still remember the early days when /sci/ was young and we wanted to have some interesting projects of our own.

>> No.9860774

>>9860679
I'm starting a new job next week, but once I've settled into the new rhythm I might try to implement this - it could be an interesting little side project.

>> No.9860810

>>9837467
It should include (maybe with colours) the level of shittiness. For example:
Black: No originality and toxic as fuck (/b/)
White: Lots of fun/original/interesting content. (Perhaps one of the topic-specific boards? I don't know all the boards, so I wouldn't be able to say.)

>> No.9860826

>>9860679
>Too bad it looks like there are /pol/esmoker elements behind the wheel given the keywords chosen.
I just picked some quick words that the average person might think are common on 4chan as a whole, but as the stats show, some of them are really just found on /pol/ and not much outside it.
There was no intention to create some kind of edginess ranking, but I think I'll make a new list of words tomorrow.

>something like AWS for the data collection (to deal with the requests per IP limit)?
hm, maybe
Currently I use 3 small VPS for everything.
1 for the live stats (loading catalogs during fixed 5 minute cycles), 1 for making the snapshots (60-80 minutes for all boards+threads at 1 request/second) and 1 for the API server, so that clients don't directly connect to the stats gatherers.
> Alternatively, 4chan actually goes out of its way for special cases
Is there any info on this somewhere?
It would of course be neat to not be strictly limited by the 1 request/second rule or have some other options in addition.

>> No.9860831

>>9848772
It makes sense that /bant/ is more or less in the middle. For the last few months /bant/ users have been in a "defensive war" against /pol/. /pol/ posts racist bait threads very frequently, which usually get the usual "back to /pol/" reply from banters. At least this has led to some OC created by /bant/ users.

>> No.9860850

>>9860810
Devise a way for us to measure "shittiness" quantitatively and we'll look into it.

>> No.9860853

>>9837467
After reading the whole thread, I believe I can happily say that 4chan actually has some very smart people in it. Congratulations, really. Thanks for giving me back hope in this site and in its potential to do fantastic stuff.

>> No.9860870

>>9860850
I am not the guy that posted that, but the only answer I could think of would be to run a poll asking the users of each board whether they think their board is good at creating original content or being fun/interesting. The problem would be deciding how to keep trolling from affecting the results. Perhaps by using bait questions such as "is OP a faggot?" or other memes (troll-bait, basically). Those who say yes in the example (or fall for the obvious troll-bait) would just be considered trolls trying to skew the results rather than people taking it seriously.

It's the only idea I could think of.

>> No.9860886
File: 3.63 MB, 3218x6418, pol plays capture the flag.jpg [View same] [iqdb] [saucenao] [google]
9860886

>>9860853
4chan can come up with some fucking insane demonstrations of intelligence and creativity when it finds the right project - it's just that the "right project" is whatever sounds like it'll be fun, interesting, or challenging, not necessarily what'll actually be the most productive or worthwhile.

Sometimes that means /sci/ breaking out graduate level physics to determine whether you can cook a steak via orbital reentry. Sometimes that means /co/ learning cryptography to decipher codes in a cartoon show. Sometimes that means /pol/ using airplane contrails and fucking stars to triangulate the position of a flag they're trying to steal. This is 4chan: we're creatures of whim.

>> No.9860935
File: 121 KB, 1472x858, sentiment_thank.png [View same] [iqdb] [saucenao] [google]
9860935

>>9860810
the closest thing for now is a sentiment analysis on the comment text, though it's far from an accurate representation.
There are certainly better ways to check for that.

>> No.9860946

>>9860886
>/sci/ breaking out graduate level physics to determine whether you can cook a steak via orbital reentry
sauce

>> No.9860954

>>9860946
Can I get a screencap of that? It sounds very interesting. Hopefully when I finish my studies I can help with the next fun thing /sci/ does.

>> No.9860985

>>9856737
I never saw faggot getting filtered but I definitely remember roody poo, and 10 years ago peanut butter and jelly got filtered to peanut butter and niggers, iirc. It's been so long it's hard to remember.

>> No.9860993

>>9860985
it was around the same time iirc that "faggot" was word filtered

> It's been so long it's hard to remember
I know, I was starting high school around the time these things happened

>> No.9861009
File: 548 KB, 1276x1048, 1524426258700.png [View same] [iqdb] [saucenao] [google]
9861009

>>9860993
Well, I found a list - much longer than I expected.

https://www.lurkmore.com/view/4chan_Wordfilters

We do have a current one sitewide, from the butthurt /g/ mod getting booty blasted over this image (and other things) in the /mkg/.
Since you're an oldfag too, here's a filterproof so᠌yboy that you can copy to a text file, works with so᠌yboard as well.
*snicker*

>> No.9861014

>>9861009
Only cancerous containment board users use that term.

>> No.9861016

>>9860946
>>9860954
https://yuki.la/sci/5209640

Almost 200 posts over a 24 hour period. Graduate level might be a bit of an exaggeration, but you had anons legitimately trying to estimate frictional heating on reentry, the effects of rotation, whether to jettison it frozen or raw, etc. IIRC the guy who does XKCD picked up on it and did a whole "What If?" entry working out the problem.

>> No.9861017
File: 928 KB, 500x244, 1367488036816.gif [View same] [iqdb] [saucenao] [google]
9861017

>>9861014

>> No.9861019
File: 80 KB, 499x434, wordfilter assmad.png [View same] [iqdb] [saucenao] [google]
9861019

>>9861009
>lurkmore was still up
wew lad

I actually like that word filter though, it produced a lot of butthurt on boards with srsfags

>> No.9861020
File: 94 KB, 1342x279, electionfags and wordfilters.png [View same] [iqdb] [saucenao] [google]
9861020

>>9861009

>> No.9861025
File: 20 KB, 616x106, Capture.jpg [View same] [iqdb] [saucenao] [google]
9861025

>>9861019

>> No.9861028

>>9861025
lol

>> No.9861030
File: 29 KB, 316x368, don't lie.jpg [View same] [iqdb] [saucenao] [google]
9861030

>>9861017

>> No.9861034

>>9860886
Right now, it's /biz/ hyping up a severely undervalued cryptocurrency

Screencap this

>> No.9861554

What exactly do sentimentScore and sentimentComparative represent?

>> No.9861633

>>9853675
https://cdn2.desu-usergeneratedcontent.xyz/qa/image/1521/38/1521388673182.png

>> No.9861636

>>9851363
>>9851358
Have you guys never been on /v/? Any thread not made by a filter-evading-bot dies with fewer than ten replies. Locate a blemish, now that the dust has settled, whats her name? >enemies can open doors

>> No.9861657
File: 35 KB, 786x788, images vs text.png [View same] [iqdb] [saucenao] [google]
9861657

Best graph so far - it really shows the negative correlation between posts with images and posts with text, and the slide from discussion to image dump

>> No.9861717
File: 26 KB, 770x293, sentiment.png [View same] [iqdb] [saucenao] [google]
9861717

>>9861554
it's just something I stumbled upon while building the analyzer and used, because I didn't really have many other criteria to check yet.
The sentiment analyzer categorizes the individual words into positive/neutral/negative and also scores them from -5 to +5.
'sentimentScore' is the total score of all words divided by the # of comments analyzed on that board
'sentimentComparative' is the average of how much of a comment consists of positive/negative words (scoring whole posts from -5 to +5 this time).
https://github.com/thisandagain/sentiment#how-it-works

It was just an experiment, but I see now that it's much better suited for something like news articles or blog posts instead.
With 4chan posts, it easily misreads things, as seen in the 3rd comment in the pic, where it interprets 'no problem' as 2 negative words.
>>9861633 might be a good thing to check for actually. Is Flesch-Kincaid generally accurate for all kinds of text, even one-sentence comments?
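For anyone curious, the two metrics could be sketched like this. The five-word lexicon here is a made-up stand-in for the real AFINN word list, and the comments are invented - but note how "no problem" comes out negative, exactly the misread described above:

```python
# Toy AFINN-style lexicon (the real list scores thousands of words)
LEXICON = {"good": 3, "great": 3, "bad": -3, "problem": -2, "no": -1}

def analyze(comment):
    words = comment.lower().split()
    score = sum(LEXICON.get(w, 0) for w in words)
    # comparative: the score normalized by comment length
    comparative = score / len(words) if words else 0.0
    return score, comparative

comments = ["great work anon", "no problem", "bad thread"]
results = [analyze(c) for c in comments]

# board-level aggregates, per the descriptions above
sentiment_score = sum(s for s, _ in results) / len(comments)
sentiment_comparative = sum(c for _, c in results) / len(comments)
```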

>> No.9861735

>>9861717
>it interprets 'no problem' as 2 negative words.

It seems like you can use a custom corpus of words so it might be possible to get it to interpret phrases like "no problem" as a positive

>> No.9861767

>>9861735
yeah but doing it manually you will never reach the end of covering those special cases. I think it's simply not made for analyzing dialogue-like language.

>> No.9861795

>>9861767
That's true. There does seem to be some research on applying sentiment analysis to dialogue though, which hopefully would yield better results with informal text. I found this paper which I'm skimming now, but seems like it could be applicable:
https://pdfs.semanticscholar.org/0d4f/c603e54dd864dbb0ac1deaa116c67c14f7bd.pdf

>> No.9861815

>>9861657
I dunno if that shows it quite so accurately - almost the entire first two-thirds of that is porn, lewd, and wallpaper threads, and they seem to be throwing almost every single figure off because they're inherently the boards with the most images and least text, yet a relatively average amount of traffic. We keep getting these big porn clusters in every single figure - we may have to consider excluding some of these boards if we want a more accurate picture of the rest of 4chan.

>> No.9861888

>>9860376
my post :)

>> No.9861898

>>9861815
>we may have to consider excluding some of these boards if we want to get a more accurate picture of the rest of 4chan.

If this is needed, I would exclude /b/. It's the one with the most traffic and, for the most part, everything is porn. It could mess with the rest of the data, and (personal opinion) /b/ is considered by most long-time anons to be unsalvageable.

If it comes to that, this is what I would choose to do, but should the situation arise I will let the experts decide (I am just lurking a bit here since this is way too complicated for me, but very interesting).

>> No.9861904

>>9861898
And yet, ironically, /b/ is rarely in the extremes on these graphs - it's boards like /c/, /cm/, /s/, /aco/, /y/, /e/, etc. that are always way out in the fringes for image ratios, posts per poster, etc.

>> No.9861917
File: 171 KB, 475x614, 1509473574162.png [View same] [iqdb] [saucenao] [google]
9861917

>>9839415
Yes, frog = impeccable intellect

>> No.9861925

>>9861917
Tell that to /bant/. Or many of the other boards.

>> No.9861941
File: 1012 KB, 1000x1000, 1514405991603.png [View same] [iqdb] [saucenao] [google]
9861941

Here's a completely unreadable circular relationship graph I made a while ago, with data collected over a period of 2 hours or so. It would be great to have an interactive version of it, but I could not find an easy/lazy way to do it.
Here's the raw data: https://pastebin.com/raw/PVRDP59q

>>9840294
>or, if you're proactive, trying and put some data together.
You can find a literal metric shit ton of slightly outdated data here https://archive.org/download/archive-moe-database-201506
I personally don't have enough HDD space left to do anything with it.

>> No.9862026

>>9861735
Similarly, "pretty bad" is counted as two negatives rather than one positive and one negative.

>> No.9862171
File: 489 KB, 480x262, 1529423607244.gif [View same] [iqdb] [saucenao] [google]
9862171

>>9838118
>I was accurately able to intuit the most heavily trafficked board without knowledge of the statistics in order to best casually understand the zeitgeist.
Neat

>> No.9862179

>>9849836
I was going to mention this.
It's actually not difficult to track images, though the many variations would certainly cause a headache.

>> No.9862419

>>9861941
Is this by board mentions or board crosslinks?

>> No.9862420

>>9862419
Crosslinks only

>> No.9862436

>>9862420
kay, yeah I figured that would be way easier to search for

>> No.9863018
File: 172 KB, 633x314, on whom the pale moon gleams.png [View same] [iqdb] [saucenao] [google]
9863018

>>9837467
>and determine if it's possible to illustrate some kind of rough empirical structure of 4chan's communities and cultures.

asking for 4chans help in datamining operations, huh?

>> No.9863053

>>9863018
Nope, just a neat project.

>> No.9863062

>>9863053
I don't believe you.

>> No.9863069

>>9863062
That's nice.

>> No.9863076

>>9863069
>That's nice.

Thanks, I thought the same thing.

>> No.9863691

>>9863018
Oh get fucked, this is the first interesting project that's been on /sci/ in months.

>> No.9863719
File: 78 KB, 1425x311, regex.png [View same] [iqdb] [saucenao] [google]
9863719

Even just counting the characters someone typed in is tricky to get right.
I am removing and replacing
>/p/ EXIF data
>post-number quotes (quotelinks and deadlinks)
>linebreaks (<br> gets replaced with a single character to count pressing enter)
>any remaining HTML-tags, but not their content, like /g/ [code] brackets, that end up as <pre></pre> HTML-tags
>converting HTML-entities to normal characters ("&gt;" becomes ">")
>any whitespace from start and end of post

Now, I am not a /sci/ regular and only noticed the [math] and [eqn] tags recently.
Do these even work? They don't do anything for me when I try them in the preview.
And do any other boards have similar special content that can get inserted into a post?
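The cleanup steps listed above could be sketched roughly like this in Python. The regexes are illustrative approximations of the API's comment HTML, not the exact markup, and the /p/ EXIF step is omitted:

```python
import html
import re

def clean_comment(raw):
    # post-number quotes: quotelinks and deadlinks, dropped entirely
    text = re.sub(r'<a [^>]*class="quotelink"[^>]*>.*?</a>', "", raw)
    text = re.sub(r'<span class="deadlink">.*?</span>', "", text)
    # <br> replaced with a single character (counts as pressing enter)
    text = re.sub(r"<br\s*/?>", "\n", text)
    # drop any remaining HTML tags, but keep their content
    text = re.sub(r"<[^>]+>", "", text)
    # HTML entities back to normal characters ("&gt;" becomes ">")
    text = html.unescape(text)
    # whitespace from start and end of post
    return text.strip()
```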

>> No.9863831
File: 243 KB, 3600x1300, sci latex guide.png [View same] [iqdb] [saucenao] [google]
9863831

>>9863719
>Now I am not a /sci/ regular and noticed the [math] and [eqn] tags just recently.
>Do these even work? They don't do anything for me, when trying them in the preview.
It's latex format, you can type out equations and junk. There's a guide for it

>> No.9863864

>>9863831
I checked it again. uBlock prevented the required script from loading.

>> No.9863882 [DELETED] 

>>9863831
test
{e}^{i \pi} + 1 = 0
[eqn]{e}^{i \pi} + 1 = 0[/eqn]

>> No.9864544
File: 28 KB, 356x629, threadAge.png [View same] [iqdb] [saucenao] [google]
9864544

started checking for oldest thread age (excluding stickies)

Also rewrote the whole text analysis part. I would ultimately like to make it so anyone can pick their own words to check for and not the server having a pre-defined list.
I just don't know how to go about it regarding search performance with full-text search on this kind of scale.
(visible + last day content is ~1,200,000 comments with ~139,000,000 characters, if I did everything right with this)

>> No.9864888

>>9864544
interdasting

>> No.9864971
File: 49 KB, 800x480, 35930078_200159300632581_7458529735379779584_n.jpg [View same] [iqdb] [saucenao] [google]
9864971

This is a cool thread

>> No.9866267

>>9837467
I hope you will submit the paper and put that UN thing to shame.

>> No.9866564

still around OP?
Anything new on your end?

>> No.9867466

>>9853939
>>9853939
/g/ here. I sure hope anon isn't storing the comments as actual text. That would burn a lot of disk space unnecessarily.

tl;dr

If you're analysing words, then give each word a number (32 bits should more than cover 4chan :) rather than use off the shelf compression. That way, you have symbols you can count with no effort at all.

>> No.9867572

>>9860525
could someone arrange this as a directed graph?
If only connections with, say, 150% of the average mention rate are shown, it should be possible to find clusters

>> No.9867608
File: 5 KB, 365x241, 1512330007216.png [View same] [iqdb] [saucenao] [google]
9867608

>>9867466
afraid that won't work though.
Let's say you want to check for the occurrence of a certain word like "boomer", then you would miss out on any variations like "boomers", "boomerfolio", "boomerpost" and so on.

Space isn't the problem right now. It's rather finding a way to check ~ a day of it in the most efficient way.
Best solution I have so far is to tokenize comments into words, save the token count in a map and then look through all keys to see if part of it matches the search-word.

Though maybe it's not actually worth the effort or even a good idea in the first place.
4chan is a pretty nice and organic place.
Even while working on it, I get the feeling that doing in-depth analysis kind of removes the soul of it all.
The extent of the live stats at least is only to show board activity, but fully analyzing content and categorizing boards is quite different and maybe not that fun in the end.
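The tokenize-and-map idea described above could look something like this (the comments here are made up for illustration):

```python
from collections import Counter

# Tokenize every comment once and keep counts in a map; searching then
# substring-matches against the token keys instead of re-scanning all
# comment text, so variations like "boomerpost" are caught too.
comments = ["boomer thread", "another boomerpost", "zoomers rise up"]
token_counts = Counter(tok for c in comments for tok in c.lower().split())

def count_matches(word):
    return sum(n for tok, n in token_counts.items() if word in tok)
```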

>> No.9869265

saving

>> No.9869324
File: 55 KB, 787x253, r9k.png [View same] [iqdb] [saucenao] [google]
9869324

>>9867608
>4chan is a pretty nice and organic place.

>> No.9869431

>>9838864
>>9839385
>>9839408
a month or so before 30yo boomer really blew up, there was a compilation image of 30yo boomer threads on at least 12 boards trying to force the meme

>> No.9869742

>>9869431
I've never heard of that meme. I guess that means I don't browse cancer boards.

>> No.9869770

>>9869742
>that 30 y.o. boomer that doesn't keep up with the memes

>> No.9870208

>>9869742
wow you're cool too bad you're on /g/ right now

>> No.9870214

>>9870208
>too bad you're on /g/ right now
you do know that this is /sci/, right?

>> No.9870231

>>9867608
>Even while working on it, I get the feeling, that doing in depth analysis kind of removes the soul of it all.
The soul comes from the people, so unless your computer is powered by nameless horrors the soul should be pretty much safe.

>> No.9870535

>>9870208
Kill yourself

>>9869770
gb2 your containment board

>> No.9870557

>>9870535
*snap*

>> No.9870575

>>9870535
t. self appointed /sci/ guardian

>> No.9871702

The time it takes for an unbumped thread to get archived would be an interesting variable. Relatedly, so would the average catalog position from which threads get bumped. And if you track bumping, you could also measure the ratio of sage posts.

>> No.9872105

>>9837467
It looks like you're remote viewing an alien there, wave

>> No.9872175

>>9871702
>The time it takes for an unbumped thread to get archived
that's pretty much the same as rate of new threads with some exceptions, where a board doesn't hold the usual 150 threads
>the average position in the catalog at which threads are bumped from
I don't think you can reasonably check that. Maybe for very slow boards, but otherwise you would have to fetch the catalog so frequently for it to work.

>> No.9872592

>>9848772
>/lit/ is the only one who says "jew" more than "nigger"

>> No.9872594

>>9856436
weird, I thought "boomer" was a /tv/ thing

>> No.9872699

>>9867608
An interesting take that I think could go in a lot of directions is looking at reply-chain graphs, particularly in connection with questions regarding analysis of longer post length, etc.
Go through a thread and add onto each post's entry: replies_to, replied_by
>> No.9872702

>>9872175
>that's pretty much the same as rate of new threads with some exceptions, where a board doesn't hold the usual 150 threads
Not exactly. The speed at which threads are pushed down is a combination of the rate of thread creation and the rate at which the threads below them are bumped. The expected time to be pushed from position x to position x+1 would be
[math]\displaystyle t(x) = \frac{1}{C + f(x) B}[/math]
where C is the thread creation rate, B is the thread bumping rate, and f(x) is the fraction of bumps that occur in threads below position x. At the bottom of the catalog, f(x) goes to zero and thread creation dominates, whereas at the top of the catalog, f(x) approaches one and bumping dominates. So by measuring the time it takes for threads to reach various positions in the catalog, you can calculate the rate of bumping and the distribution of where the bumping takes place.
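Plugging made-up rates into the formula gives a feel for the two regimes (C, B, and the linear f(x) here are all illustrative, not measured):

```python
def t(x, C, B, f):
    # expected time to be pushed from catalog position x to x+1:
    # t(x) = 1 / (C + f(x) * B)
    return 1.0 / (C + f(x) * B)

# Illustrative rates: 2 new threads/min, 30 bumps/min, and a toy f(x)
# that falls off linearly over a 150-thread catalog
C, B = 2.0, 30.0
f = lambda x: 1.0 - x / 150.0

t_top = t(0, C, B, f)       # f ~ 1 at the top: bumping dominates
t_bottom = t(150, C, B, f)  # f ~ 0 at the bottom: creation dominates
```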

>> No.9872736

>>9872699
I can give you a prediction for this already - plotting number of post replies vs post length will give you a bimodal distribution - one short peak for mid-to-long posts that make really good points and contribute to the discussion, and one fucking huge peak for short, inane posts that happened to get repeating digits.

>> No.9872747

>>9872736
It would be interesting to see which boards care about dubs and which do not.

>> No.9872759

Read about half of the thread, but I don't have the patience to check for this: have you tried coming up with as many parameters as you can, doing a principal component analysis on the data, discarding the highly correlated parameters, and simply applying a few clustering algorithms to the boards to get some sort of similarity index between different boards? Or maybe running some text analysis on posts across boards to identify which boards certain groups of people tend to visit? (e.g. there'd be a cluster for /sci/, /g/, /lit/, /diy/, etc., one for /hm/, /cm/, /y/, /fit/, etc., and so on)

>> No.9872773

>>9872702
this is interesting, nice approach

>> No.9872927

>>9846869
>>9846860
there are already archiving websites which I'm sure would be happy to satisfy your request for data; 4plebs comes to mind

>> No.9872974

>>9844047
It is possible to embed steganographic data within an image by fiddling slightly with its RGB values, which is certainly an option for you. There's a tool online somewhere for it.
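A toy illustration of the idea - one hidden bit per channel value in the least-significant bit, with a flat list of ints standing in for a decoded image (real tools like steghide are far more sophisticated):

```python
def embed(pixels, bits):
    # clear each value's LSB and set it to the payload bit
    hidden = [(p & ~1) | b for p, b in zip(pixels, bits)]
    return hidden + pixels[len(bits):]

def extract(pixels, n):
    # read back the LSB of the first n values
    return [p & 1 for p in pixels[:n]]
```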

>> No.9873154

>>9872974
>steganographic data
I've had quite enough of your dinosaur mumbo jumbo, sir!

>> No.9873187

>>9872974
This wouldn't survive JPEG compression.

>> No.9873210

>>9873187
Correct me if I'm wrong, but I don't think 4chan uses lossy compression. Regardless, the tool is fun to play around with
http://steghide.sourceforge.net/

>> No.9873872

>>9861657
Well obviously.
You can't not post text without posting an image.

>> No.9873896

>trying a bit of this for my cozy general, in R
jesus god kill me I hate webshit
I've been struggling way too long with parsing this bullshit, finally resorting to some "xpathapply" gibberish, where it should be so much easier to extract what I want from the quotelink/etc tags, then get rid of the tags and save the text comment
of course also the parsing library I was using shits the bed on pure strings, so I have to ask isHtml(string) or isXML(string), I forgot which because I just ended up so disgusted.

The web was a mistake.

>> No.9873900

>>9873896
Are you trying to scrape the page instead of just using the JSON API?

>> No.9874007

>>9873900
No, getting the catalog JSON, leading to the thread JSON, leading to comments in HTML form.
I'm sure I'm the weak link in the process; I don't have much experience with it. I'll probably start the comment-parsing section over from scratch and come up with better tools/structures for that part. I essentially just want to extract the reply-chain information in some form, plus the pure text, for practicing some analysis - maybe things like "samefag likelihood" estimation, post generation including attachments, etc.

Does anyone know how the post cooldown is calculated? It is longer for posts with images, but it also seems to change periodically, though I could be mistaken.

>> No.9874010

>>9874007
It's different from board to board. There's a list at
https://a.4cdn.org/boards.json

>> No.9874479

>>9872736
>>9872747
I'd like to see a plot of replies vs repeating digits. I'd bet there's an exponential increase as you get better and better gets.

>> No.9874483

>>9872759
That's basically the overall goal of the project - determining critical parameters and trying to create a 'map' of the boards in that parameter space.

There've been quite a few attempts at that so far, e.g. the set starting with >>9855137, but one of the problems is that a lot of different parameters are correlated with each other, and determining which parameters are the *true* independent ones is difficult in those cases. For example, in >>9855138 there's a clear correlation between text vs image posting, but which is the actual independent parameter in this case? Or is it neither, and both depend on other board characteristics?

>> No.9875026
File: 116 KB, 1597x867, text_analysis.png [View same] [iqdb] [saucenao] [google]
9875026

Finished modifying the text analysis, so words can be chosen on the fly instead of coming from a predefined list.
Though it's only checking the most recent snapshot of each board for now (usually not older than 90 minutes).
It's not going through any history, just what's currently visible, so results could vary quite a bit from hour to hour.

'text_ratio' is the fraction of all characters taken up by the string
'posts_ratio' is the ratio of posts containing that string at least once
try it if you want
https://4stats.io/snapshotAnalysisWorkInProgress
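As a toy recomputation of those two ratios (the comment strings and search term here are made up):

```python
comments = ["boomer boomer", "hello there", "ok boomer"]
term = "boomer"

total_chars = sum(len(c) for c in comments)
term_chars = sum(c.count(term) * len(term) for c in comments)

text_ratio = term_chars / total_chars                 # share of all characters
posts_ratio = sum(term in c for c in comments) / len(comments)  # share of posts
```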

>> No.9875048

>>9875026
I might be retarded, but I hit "Analyze", it said all good, but it's not doing anything.

>> No.9875051

>>9875048
does it not add the entry to the text analysis list?
If so which browser are you using?

>> No.9875065

>>9875051
It does add it, but no points appear. Text is "hello"
Happens in both firefox and chrome.
Oh fuck, I realized now I'm supposed to click the x,y boxes. I'd say I'm an idiot but I think there's not really enough indication that that's what I need to do. Works in both firefox and chrome.

>> No.9875073
File: 35 KB, 806x779, 1531866487.png [View same] [iqdb] [saucenao] [google]
9875073

hello

>> No.9875079

>>9875065
yeah the UI is something quickly hacked together to have something to visualize the data.
If I ever integrate it into the site properly, then that would need an overhaul for sure.
>>9875073
Search without the quotes though.
Unless you really want to search for "hello" with quotation marks on each end

>> No.9875085
File: 59 KB, 803x787, 1531866764.png [View same] [iqdb] [saucenao] [google]
9875085

>>9875079
Ahhh, got it. Well, besides that initial confusion, it works very well. Being able to keep the analysis list items around for later comparison is really nice.
Really good work anon, very nice tool.

>> No.9875125

>>9875079
how is postsPerPoster calculated?

>> No.9875129

>>9875125
oh I think it must be a WIP, it doesn't give visual indication of being clicked (though it is apparently some value)

>> No.9875263

>>9875125
>how is postsPerPoster calculated?
it's the average number of posts of unique IPs per thread.
basically (repliesPerThread_mean / postersPerThread_mean)
>>9875129
>it doesn't give visual indication of being clicked
oh, there was an issue, because /vip/ doesn't give info about unique IPs per thread.
I just set it to 0 now for that board.

>> No.9875433
File: 70 KB, 380x349, boomer.png [View same] [iqdb] [saucenao] [google]
9875433

>>9872594
started on /fit/

>> No.9875623
File: 13 KB, 559x527, 1525184296550.png [View same] [iqdb] [saucenao] [google]
9875623

>>9845054
thanks /pol/

>> No.9875699
File: 57 KB, 784x779, file.png [View same] [iqdb] [saucenao] [google]
9875699

Why does /sci/ have so many reddit links?

>> No.9875710
File: 126 KB, 1159x819, file.png [View same] [iqdb] [saucenao] [google]
9875710

Places where Lee spams or that talk about him. There are a suspicious number of zeros. You'd probably need a longer data period to say anything interesting.

>> No.9875773

>>9875710
what? why isn't /v/ #1?

>> No.9875782

>>9875699
because we're not stupid

>> No.9876098

>>9875026
Could you get it to accept slashes so that it can search for terms like "/b/" and "/pol/"?

>> No.9876110
File: 135 KB, 320x363, hatedAnon.png [View same] [iqdb] [saucenao] [google]
9876110

>>9875699
some people on sci are unironically
>I fucking love science
>based black science guy
>Women in Stem, it's current year

>> No.9876791

>>9876098
yeah, didn't think of that.
Takes a bit longer though, because many of those may be in board crosslinks ( >>>/pol/123456 ), and so far I just removed all board/postlinks so they wouldn't add to the character count of the post.
I guess this is a special case where the /board/ part should be kept in.
The only difference this makes is that posts whose entire content is a single crosslink won't be counted as no-text replies anymore, since the board name remains - but I guess that's maybe only a handful of posts across the whole site.
>>9875710
>You'd probably need a longer data period to say anything interesting.
true, only checking visible content isn't ideal.
visible + last day would be better, but I have to see if I can make it work with the memory available.

>> No.9877238
File: 778 KB, 750x422, Popsci Internet Defence Force.png [View same] [iqdb] [saucenao] [google]
9877238

>>9875699
popsci plebs

>> No.9877373

>>9853675
/co/ and /lit/ are my two favorite boards.

>> No.9877390

Consider only analyzing posts that have replies. Many posts are very low-quality spam.
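A minimal sketch of that filter, assuming each post is a dict with an "id" and a "comment" field (the field names are made up):

```python
import re

# A post counts as replied-to if any other post in the thread quotes
# its number with >>id; everything else is dropped.
def posts_with_replies(posts):
    quoted = set()
    for p in posts:
        quoted.update(int(n) for n in re.findall(r'>>(\d+)', p["comment"]))
    return [p for p in posts if p["id"] in quoted]
```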

>> No.9877489

>>9853675
No /diy/?

>> No.9877592
File: 114 KB, 1588x782, boardsearch.png

>>9876098
I switched everything over to a new server that should be able to handle it. Slashes for searching board mentions also work now.
It's now looking through visible + last-day posts, which should give much more stable results for the faster boards.
Maybe give it a day to regenerate the comment cache though, since right now it only covers visible + roughly the last hour.
I'm always worried that I made a retarded mistake somewhere and some of the results are total bullshit, but nothing really weird stands out so far.

Don't know why I am wasting that much time on this.
Could have played some CS:GO instead.

>> No.9878445

>>9877489
That diagram is a total scam.

>> No.9878618

>>9878445
Figured as much; collecting that data in the first place would be impossible (unless you're hiroshimoot).

>> No.9878636

>>9877592
>Don't know why I am wasting that much time on this.
>Could have played some CS:GO instead.
Because science.

>> No.9878746

>>9878618
It just goes to show that a fancy presentation and name will convince most people. Perhaps you know this one?
https://brainosoph.wordpress.com/2014/11/13/the-ghost-of-stronzo-bestiale-and-other-fake-scientific-authors/

>> No.9879106

In the last day, based on post ratios (excludes /qa/ and /vip/):
/adv/ mentioned /r9k/, /soc/ more than other boards
/asp/ mentioned /sp/ more than other boards
/bant/ mentioned /an/, /po/, [s4s] more than other boards
/cm/ mentioned /y/ more than other boards
/d/ mentioned /aco/ more than other boards
/diy/ mentioned /out/ more than other boards
/e/ mentioned /h/ more than other boards
/f/ mentioned /a/ more than other boards
/fa/ mentioned /cgl/, /fit/ more than other boards
/gd/ mentioned /3/, /t/, /wsr/ more than other boards
/h/ mentioned /d/, /e/ more than other boards
/i/ mentioned /ic/, /qst/, /trash/ more than other boards
/ic/ mentioned /gd/ more than other boards
/lit/ mentioned /his/, /tv/ more than other boards
/m/ mentioned /wsg/ more than other boards
/n/ mentioned /o/, /toy/ more than other boards
/news/ mentioned /pol/ more than other boards
/out/ mentioned /adv/, /asp/, /fa/, /k/ more than other boards
/po/ mentioned /diy/, /i/, /n/, /tg/, /trv/ more than other boards
/r/ mentioned /b/ more than other boards
/r9k/ mentioned /lgbt/ more than other boards
/s/ mentioned /gif/, /hc/ more than other boards
[s4s] mentioned /mu/, /s/ more than other boards
/sci/ mentioned /bant/, /c/, /f/, /lit/, /mlp/, /x/ more than other boards
/t/ mentioned /r/, /vr/ more than other boards
/trv/ mentioned /ck/, /int/, /p/ more than other boards
/v/ mentioned /u/ more than other boards
/vr/ mentioned /v/, /vp/ more than other boards
/w/ mentioned /wg/ more than other boards
/wg/ mentioned /hr/, /w/ more than other boards
/wsr/ mentioned /biz/, /co/, /g/, /jp/, /m/, /sci/, /vg/ more than other boards
/y/ mentioned /cm/, /hm/ more than other boards
/news/ didn't get mentions outside of /vip/ and /qa/. What a betafag.

It might not be accurate since it's only a single day's worth of mentions, but you can already see the "board families" and which boards share a common userbase, like /r/ and /t/, /y/ and /cm/, /d/ and /aco/, /h/ and /d/, /ic/ and /gd/, /trv/ and /int/.
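For the curious, the "mentioned X more than other boards" stat could be computed with something like this sketch (the data layout is an assumption: mentions[src][dst] holds raw mention counts, posts[src] the board's post count for the ratio):

```python
# For each target board, find the source board with the highest
# mentions-per-post rate; normalizing by post count keeps fast boards
# from dominating just by volume.
def top_mentioners(mentions, posts):
    best = {}  # dst board -> (src board, highest rate so far)
    for src, row in mentions.items():
        for dst, count in row.items():
            rate = count / posts[src]
            if dst not in best or rate > best[dst][1]:
                best[dst] = (src, rate)
    return {dst: src for dst, (src, _) in best.items()}
```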

>> No.9879144

>>9879106
why not /tg/?

>> No.9879645

>>9879106
this is fucking cool

>> No.9880003

>>9878636
I am not even a /sci/ anon.
Saw OP posting about the diagram idea in another thread (can't remember which board), and it seemed like an interesting potential use case for the data I already had here, but it sure eats time.

>> No.9880393

new bread when?

>> No.9880494

>>9880393
Soon, I hope. This is a most excellent thread.

>> No.9881214

Do we have any time-of-day or day-of-week stats yet?

>> No.9881263

>>9881214
I have almost a year of data now for the rate of posts and threads made on each board (at daily, hourly, and 5-minute intervals).
It looks like this: https://api.4stats.io/history/day/biz
In this case each array is
[start time as unix ms, duration in ms, posts during the duration, posts/minute during the duration]

But no metadata is generated from that yet, like the time of day a board is most active or day-of-week patterns.
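If anyone wants to poke at those arrays, here's a sketch using hardcoded sample entries instead of a live fetch (the values are made up; only the array layout above is taken as given):

```python
# Each history entry is:
# [start time unix ms, duration ms, posts, posts/minute]
sample = [
    [1530000000000, 86400000, 14400, 10.0],  # one full day, 14400 posts
    [1530086400000, 86400000,  7200,  5.0],
]

def posts_per_minute(entry):
    start_ms, duration_ms, posts, _ = entry
    return posts / (duration_ms / 60000.0)  # duration converted to minutes

# Recompute the fourth field from the first three as a sanity check.
rates = [posts_per_minute(e) for e in sample]
```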