Forum::BOINCstats Bug Report::Boincstats not updated today

hampsteadpete: BAM!ID: 102775; Joined: 2011-07-20; Posts: 2; Credits: 55,416,649; World-rank: 17,544

2011-08-22 20:37:10

My "stats" page: http://boincstats.com/stats/boinc_user_graph.php?pr=bo&id=152f5138269efc0f7e13e948d977135c

Did not update today for some reason.

Pete Soderman

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9461; Credits: 353,172,950; World-rank: 5,099

2011-08-22 21:03:38

It is updated actually, on the database server that is. But for some reason (which I'm trying to find out now), copying the update to the webservers has already taken over five hours instead of the 30 minutes it usually takes.

Please do not PM, IM or email me for support (they will go unread/ignored). Use the forum for support.

hampsteadpete: BAM!ID: 102775; Joined: 2011-07-20; Posts: 2; Credits: 55,416,649; World-rank: 17,544

2011-08-22 21:21:23

Thank you sir! It was the first day I started using a GPU & the numbers were really wild compared to normal. I thought that might have had something to do with it. Thanks for your reply.

Pete

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9461; Credits: 353,172,950; World-rank: 5,099

2011-08-22 21:45:14

@All: the update is coming, just very slow.

The server causing the slowdown is the same server which was dead-slow before. I thought I had solved that by replacing some hard drives, and for a couple of weeks it was all working fine, until today.

At the moment there isn't much more I can do than watch it go slow, so I will get some sleep. If things haven't improved by tomorrow, it's back to one web-server again.

magic8192: BAM!ID: 101241; Joined: 2011-06-05; Posts: 2; Credits: 1,453,325,083; World-rank: 1,835

2011-08-23 18:01:56

Thanks for the info.

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9461; Credits: 353,172,950; World-rank: 5,099

2011-08-23 18:51:09
last modified: 2011-08-23 18:53:33

I wasn't able to solve the problem. And I have no more ideas left. Tomorrow I will stop using the slow server and retire it. It's the second time it's been acting up and I did a complete reinstall and replaced slow drives, so something must be wrong with the hardware (probably the RAID controller). Unfortunately, this means that the website will be hosted from one server again which will not improve speed.

Just in case some Linux guru is out there, I believe this is the problem (larger numbers):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               2.81 14220.57  107.11  205.90 24898.44 115412.99   448.26    12.30   39.28   1.50  46.85
sdb               2.95     0.43  114.91   14.32  7713.79  2192.88    76.66     1.04    8.08   0.45   5.81
sdc               5.38     2.93    1.43    0.07    61.54    23.97    57.15     0.01    6.43   4.06   0.61
scd0              0.00     0.00    0.00    0.00     0.00     0.00     8.00     0.00   28.84  28.84   0.00

This started yesterday. From the logs I can clearly see disk I/O is the problem. A simple chown command takes over a second to complete. Now if there was a drive failure in the RAID I could understand it. BTW: the chown commands are slow on sdb as well.

Christos Despotakis: BAM!ID: 69381; Joined: 2009-05-18; Posts: 17; Credits: 767,610,023; World-rank: 2,917

2011-08-23 19:40:03

Well, iostat is a useful tool. I suppose you are fully aware of what each value means, but for everyone else, here they are:

rrqm/s : The number of read requests merged per second that were queued to the hard disk
wrqm/s : The number of write requests merged per second that were queued to the hard disk
r/s : The number of read requests per second
w/s : The number of write requests per second
rsec/s : The number of sectors read from the hard disk per second
wsec/s : The number of sectors written to the hard disk per second
avgrq-sz : The average size (in sectors) of the requests that were issued to the device.
avgqu-sz : The average queue length of the requests that were issued to the device
await : The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
svctm : The average service time (in milliseconds) for I/O requests that were issued to the device
%util : Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

The last three are the most important. So you are right about I/O problems. But without the actual hardware setup I cannot be sure what the actual problem is. What kind of R.A.I.D. do you use? How many disks? etc. Try running a S.M.A.R.T. test on your disks. (Check here & here for instructions). I hope it is just a failing H.D.D. rather than the R.A.I.D. controller.
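
To make the columns above a bit more concrete, here is a rough Python sketch that plugs the sda row from Willy's iostat output into the usual relationships between the fields. It assumes 512-byte sectors for rsec/s and wsec/s, and it approximates %util as (r/s + w/s) x svctm. The function and key names are made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: rsec/s and wsec/s count 512-byte sectors


def summarize(r_s, w_s, rsec_s, wsec_s, svctm_ms):
    """Derive throughput, request size and utilization from one iostat row."""
    iops = r_s + w_s
    return {
        "read_mib_s": rsec_s * SECTOR_BYTES / 2**20,
        "write_mib_s": wsec_s * SECTOR_BYTES / 2**20,
        # Average request size in sectors, comparable to avgrq-sz.
        "avg_request_sectors": (rsec_s + wsec_s) / iops if iops else 0.0,
        # svctm is ms per request, so iops * svctm is busy-ms per second.
        "util_percent": iops * svctm_ms / 1000 * 100,
    }


# Values copied from the sda row: r/s, w/s, rsec/s, wsec/s, svctm.
sda = summarize(107.11, 205.90, 24898.44, 115412.99, 1.50)
for name, value in sda.items():
    print(f"{name}: {value:.2f}")
```

For the sda row this works out to roughly 56 MiB/s of writes, and the derived request size and utilization land close to the avgrq-sz and %util values that iostat itself reported.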

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9461; Credits: 353,172,950; World-rank: 5,099

2011-08-23 20:38:22
last modified: 2011-08-23 20:42:12

It's run off a four-channel LSI hardware RAID controller (if you need the exact type I'd need to look it up) with an Intel 12x SAS expander.
sda is a RAID10, 4x SEAGATE ST373455SS (SAS, 15k RPM), operating system, website files and swap
sdb is a RAID10, 4x INTEL SSDSA2CW080G3 (SATA, SSD), database
sdc is a single drive for backups, SEAGATE ST3750525AS (SATA 7k2)

The ST373455SS are reused from the "old" database server and were working fine there. The SSDSA2CW080G3 are also used in the other webserver and again working fine there.

The slowness is actually occurring on sdb, which makes me think that every filesystem action goes through / (sda) regardless of the device the action is actually targeted at.

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9461; Credits: 353,172,950; World-rank: 5,099

2011-08-23 20:52:19

Can't test the SMART status, probably because they are RAID drives, but according to the megacli tool, all drives are fine.

Christos Despotakis: BAM!ID: 69381; Joined: 2009-05-18; Posts: 17; Credits: 767,610,023; World-rank: 2,917

2011-08-23 22:47:22

You say that the swap is on sda, and the I/O should occur on sdb (the database). The only reasonable way for the db to cause such a queue depth on sda is through the swap. And for the swap to be used, there must be either not enough physical memory or a db setting that caches the I/O. Please check your swap usage and see if we get something there.

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9461; Credits: 353,172,950; World-rank: 5,099

2011-08-24 05:24:45

I was wrong, swap is on sdc.

willy@www3:~# free
             total       used       free     shared    buffers     cached
Mem:      24682360   17409832    7272528          0     209156    4816616
-/+ buffers/cache:   12384060   12298300
Swap:     47850488    3242240   44608248
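
For anyone reading along, a small Python sketch of how the numbers in that free output fit together. It assumes the classic layout shown above with values in KiB; the parser is a hypothetical helper written for this example, not part of any tool.

```python
FREE_OUTPUT = """\
             total       used       free     shared    buffers     cached
Mem:      24682360   17409832    7272528          0     209156    4816616
-/+ buffers/cache:   12384060   12298300
Swap:     47850488    3242240   44608248
"""


def parse_free(text):
    """Map each labelled row of `free` output to its list of integer values."""
    rows = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the header row has no label
        label, _, rest = line.partition(":")
        rows[label.strip()] = [int(value) for value in rest.split()]
    return rows


rows = parse_free(FREE_OUTPUT)
total, used, free_kib, shared, buffers, cached = rows["Mem"]
swap_total, swap_used, swap_free = rows["Swap"]

# Memory held by processes, i.e. "used" minus what the kernel can reclaim.
app_used = used - buffers - cached
swap_percent = 100 * swap_used / swap_total

print(f"used by processes: {app_used} KiB ({app_used / 2**20:.2f} GiB)")
print(f"swap in use: {swap_percent:.2f}%")
```

The computed figure matches the "-/+ buffers/cache" row, and swap in use comes out a little under 7%. Keep in mind this is a snapshot of how much swap is occupied, not of how much swapping is happening at the moment.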

Christos Despotakis: BAM!ID: 69381; Joined: 2009-05-18; Posts: 17; Credits: 767,610,023; World-rank: 2,917

2011-08-24 10:34:46

You should check what causes the sda I/O. Using iotop and lsof will give you a better understanding of what is going on.

Use

iotop -o

to pinpoint which process / thread is the culprit.

Then use

lsof -p <PID>

to see which files are opened by the process.

Knowing what is written, where and by whom hopefully will help us narrow down the problem.
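
Once lsof has produced a list of file paths, one way to see which disk each file actually lives on is to look at its device number. The Python sketch below is only an illustration with made-up function names. It relies on the usual Linux numbering for SCSI-style disks (sda is normally major 8, minor 0, and sdb is major 8, minor 16, with partitions counting up from there), which may not hold with LVM or other layered setups.

```python
import os
import sys


def device_of(path):
    """Return the (major, minor) number of the device holding `path`."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


def group_by_device(paths):
    """Group file paths by the device they are stored on."""
    groups = {}
    for path in paths:
        try:
            device = device_of(path)
        except OSError:
            continue  # the file vanished or is not accessible
        groups.setdefault(device, []).append(path)
    return groups


if __name__ == "__main__":
    # Pass the paths reported by lsof as command-line arguments.
    for (major, minor), files in sorted(group_by_device(sys.argv[1:]).items()):
        print(f"device {major}:{minor}")
        for name in files:
            print(f"  {name}")
```

The major:minor pairs can be compared against the output of ls -l /dev/sd* to see which disk or partition each group belongs to.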

Keep up the good work.

[BOINCstats] Willy: Forum moderator - Administrator - Developer - Tester - Translator; BAM!ID: 1; Joined: 2006-01-09; Posts: 9461; Credits: 353,172,950; World-rank: 5,099

2011-08-24 10:51:43
last modified: 2011-08-24 10:59:07

I used iotop -o and it shows an ever-changing number of mysql processes (up to 20), Apache drops by every now and then, and the ones that strike me are kjournald and flush-8:0. They run very often and are at 99.99% IO. What I read about them is that they are needed by the OS.

Edit: This all looks exactly the same as the other webserver which has a much slower sda (RAID10 on 7k2 rpm 2.5" notebook disks) and which doesn't suffer from this problem. It does have another RAID controller and doesn't use an expander.

Christos Despotakis: BAM!ID: 69381; Joined: 2009-05-18; Posts: 17; Credits: 767,610,023; World-rank: 2,917

2011-08-24 12:22:26

BOINCstats Willy wrote:

kjournald and flush-8:0. They run very often and are at 99.99% IO. What I read about them is that they are needed by the OS.

kjournald is the journaling daemon of ext3. The journal "keeps track" of all the disk operations, and during high I/O activity it's normal to see this. And yes, it's basically more I/O work for your disk.
flush is the writeback from cache to disk, so more I/O there.

You can tweak them a bit, but I doubt this is your problem.

For kjournald you can either add the commit=num_secs mount option to each ext3 filesystem or change the journal mode to "writeback". num_secs is 5 by default, but setting it to something like 300 or 600 (5 and 10 minutes respectively) should be just fine. I can give you a walkthrough for editing /etc/fstab if you need one. You can also increase the journal size with tune2fs.
To tune flush to 60 seconds (default is 5), add vm.dirty_writeback_centisecs=6000 to /etc/sysctl.conf.
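
As a rough illustration of the two settings above, this Python sketch only builds the lines that would go into /etc/sysctl.conf and /etc/fstab; it does not touch either file. The fstab entry (device /dev/sdb1 mounted on /var/lib/mysql) is a hypothetical example, not Willy's actual layout, and the helper names are invented here.

```python
def writeback_centisecs(seconds):
    """Convert seconds to the hundredths of a second used by the sysctl."""
    return int(round(seconds * 100))


def add_commit_option(fstab_line, commit_secs):
    """Return the fstab line with commit=<secs> set in the options column."""
    fields = fstab_line.split()
    if len(fields) < 4:
        raise ValueError("expected at least device, mount point, type, options")
    # Drop any existing commit= option before adding the new one.
    options = [opt for opt in fields[3].split(",") if not opt.startswith("commit=")]
    options.append(f"commit={commit_secs}")
    fields[3] = ",".join(options)
    return " ".join(fields)


sysctl_line = f"vm.dirty_writeback_centisecs={writeback_centisecs(60)}"
# Hypothetical ext3 entry, for illustration only.
fstab_line = add_commit_option("/dev/sdb1 /var/lib/mysql ext3 defaults 0 2", 300)

print(sysctl_line)
print(fstab_line)
```

A longer commit interval means more recent writes can be lost if the machine crashes, so it is a trade-off rather than a free speed-up.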

Using lsof -p <PID>, check that all db I/O goes to sdb.

txyankee: BAM!ID: 65094; Joined: 2009-02-03; Posts: 56; Credits: 19,602,865; World-rank: 23,823

2011-08-26 10:13:03

My stats have not changed in over 24 hours, even though it says they have. I run on WCG 24-7.

Crystal Pellet: BAM!ID: 64136; Joined: 1970-01-01; Posts: 0; Credits: 0; World-rank: 0

2011-08-26 16:04:44

txyankee wrote:

My stats have not changed in over 24 hours, even though it says they have. I run on WCG 24-7.

knreed - World Community Grid Tech wrote:

Yes. We turned off the periodic scripts that we use to update various things in addition to some of the validators. However, the scripts are running again and the BOINC XML stats have been updated within the past couple of hours.

TPCBF: BAM!ID: 94441; Joined: 2010-12-21; Posts: 190; Credits: 0; World-rank: 0

2011-08-26 18:39:09

Crystal Pellet wrote:

txyankee wrote:

My stats have not changed in over 24 hours, even though it says they have. I run on WCG 24-7.

knreed - World Community Grid Tech wrote:

That's all fine and dandy (what you quoted is the response from knreed to my question in the WCG forum), but BOINCStats hasn't run an incremental update for more than 6 hours now.

Last daily update:
2011-08-25 17:33:24 GMT
1 day 01:02:29 ago
Last incremental update:
2011-08-26 11:56:36 GMT
06:39:17 ago

So even if the WCG issue is resolved, it looks like Willy still has more issues with this web server at least...

Ralf

Crystal Pellet: BAM!ID: 64136; Joined: 1970-01-01; Posts: 0; Credits: 0; World-rank: 0

2011-08-26 19:38:00

TPCBF wrote:

Crystal Pellet wrote:

txyankee wrote:

My stats have not changed in over 24 hours, even though it says they have. I run on WCG 24-7.

knreed - World Community Grid Tech wrote:

That's all fine and dandy (what you quoted is the response from knreed to my question in the WCG forum), but BOINCStats hasn't run an incremental update for more than 6 hours now.

Last daily update:
2011-08-25 17:33:24 GMT
1 day 01:02:29 ago
Last incremental update:
2011-08-26 11:56:36 GMT
06:39:17 ago

So even if the WCG issue is resolved, it looks like Willy still has more issues with this web server at least...

Ralf

1. If you can read, Ralf: the fact that it is a quote from Kevin is stated directly above the quote: "knreed - World Community Grid Tech wrote".

2. The big blue WCG with all their money and full time employees dedicated to BOINC itself isn't even able to run their own statistics more often than twice a day for users and for teams only once a day, so why complain here with 1 volunteer webmaster living from donations?

TPCBF: BAM!ID: 94441; Joined: 2010-12-21; Posts: 190; Credits: 0; World-rank: 0

2011-08-26 20:29:18

Crystal Pellet wrote:

2. The big blue WCG with all their money and full time employees dedicated to BOINC itself isn't even able to run their own statistics more often than twice a day for users and for teams only once a day,

And that is a problem because?

so why complain here with 1 volunteer webmaster living from donations?

Well, maybe you should read: I wasn't complaining at all, just stating the fact that the problems with the update are not only with WCG, which in turn had a problem with their own server the other day...

Ralf

txyankee: BAM!ID: 65094; Joined: 2009-02-03; Posts: 56; Credits: 19,602,865; World-rank: 23,823

2011-08-27 10:27:41

Seems to be up and running again. Thank you for your help and time.
