Prometheus (range_)query steady-state tolerance verification probes#10
Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>
…ed correctly by chaostoolkit-lib Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>
Lawouach
left a comment
Hi @sneJ-
I wanted to apologise for forgetting about this PR!
I do like what it tries to add but there are a few things I would like to simply discuss with you before we can decide to merge. If that's okay?
Namely, as I understand it, you were using the environment (or globals()) to store state, is that right? Do you think that is necessary? It's not very clean IMO, but I'm sure I'm missing something.
Would you be able to comment by any chance?
  from logzero import logger

- __version__ = '0.3.0'
+ __version__ = '0.3.1'
For future reference, we usually do not change the version in the PR.
    If no threshold is given it throws an exception.
    """
    if threshold is None:
        raise Exception("No threshold given")
That should be ActivityFailed instead of Exception.
    if threshold is None:
        raise Exception("No threshold given")
    logger.info("threshold: %d" % (threshold,))
    print(value)
trailing print should be removed
            rtn = False

    if rtn:
        logger.info("Probe: ok, all values are below the given threshold")
We don't usually log at INFO level unless it helps reading the experiment's flow.
    values.
    """
    if threshold_variable:
        if os.getenv("%s-%s" % (threshold_variable_prefix,
Why look in the env rather than in the configuration?
Hi, that's a good point. I'll have to check how to change it, though.
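For illustration, a minimal sketch of what a configuration-based lookup could look like instead of `os.getenv`. It assumes the probe declares a `configuration` parameter and receives the experiment's configuration mapping through it; the helper name `resolve_threshold` and the default prefix `"prometheus"` are hypothetical and not part of this PR.

```python
from typing import Any, Dict, Optional


def resolve_threshold(threshold_variable: str,
                      configuration: Optional[Dict[str, Any]] = None,
                      prefix: str = "prometheus") -> float:
    """
    Look up a threshold stored under '<prefix>-<threshold_variable>' in the
    configuration mapping rather than in the process environment.
    """
    # Same key layout as the "%s-%s" pattern used in the diff above.
    key = "%s-%s" % (prefix, threshold_variable)
    config = configuration or {}
    if key not in config:
        raise KeyError("No threshold configured under '%s'" % key)
    # Configuration values are often strings, so normalise to float.
    return float(config[key])
```

With this shape the threshold travels with the experiment's configuration block, so nothing has to be exported into the environment of the process running the experiment.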

def set_result_as_threshold_variable(threshold_variable: str,
                                     resize: int = 100,
                                     value: dict) -> bool:
We prefer the typing module:
from typing import Dict
                                     resize: int = 100,
                                     value: dict) -> bool:
    """
    Saves the passed Prometheus query value in an environment
It is a bit odd to store state in the environment.
-     if ("%s-%s" % (threshold_variable_prefix, threshold_variable))
-             in globals:
+     if ("%s-%s" % (threshold_variable_prefix, threshold_variable))\
+             in globals():
It's not very clean to use globals() that way.
      if threshold is None:
-         raise Exception("No threshold given")
+         logger.error("Probe: No threshold given")
+         raise ActivityFailed()
Good! Though a message in the exception could be useful :)
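As a sketch of that suggestion, the check could carry the message in the exception itself. The helper name `require_threshold` is hypothetical; `ActivityFailed` is imported from `chaoslib.exceptions`, with a local stand-in only so the snippet also runs where chaostoolkit-lib is not installed.

```python
try:
    from chaoslib.exceptions import ActivityFailed
except ImportError:
    # Stand-in used only when chaostoolkit-lib is not available.
    class ActivityFailed(Exception):
        pass


def require_threshold(threshold=None):
    """Return the threshold, or fail the activity with an explicit reason."""
    if threshold is None:
        # The message ends up wherever the exception is reported, so a
        # separate logger.error() call is not strictly needed.
        raise ActivityFailed("No threshold given")
    return threshold
```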

Hi @Lawouach, your comments are valid. Thanks. Regarding storing the threshold in a global variable vs. storing it in the (experiment's?) configuration: in my use case I have experiments where I need to measure the utilization of a distributed system (e.g. queries per second), store that utilization as a temporary reference value, then take down one part of the distributed system, wait a certain amount of time until it recovers, and then evaluate again whether the utilization of the recovered system is similar to the temporary reference. As we have different hardware sizes that impact the utilization, it is easier to have a single experiment that can be used for every hardware size instead of multiple experiments that differ only in the fixed threshold. I agree that using global variables as storage isn't optimal. If there is a way to set these values in the configuration from the action code, I'm happy to change it.
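To make "similar to the temporary reference" concrete, here is a minimal sketch of a relative comparison. It is an assumption about what such a check could look like, not the code in this PR; the function name `is_similar` and the `tolerance_pct` parameter are hypothetical.

```python
def is_similar(reference: float, current: float,
               tolerance_pct: float = 10.0) -> bool:
    """
    Return True when `current` deviates from `reference` by at most
    `tolerance_pct` percent of the reference value.
    """
    if reference == 0:
        # A relative deviation is undefined for a zero reference;
        # only an exact match counts as similar in that case.
        return current == 0
    deviation_pct = abs(current - reference) / abs(reference) * 100.0
    return deviation_pct <= tolerance_pct
```

Because the comparison is relative, the same experiment works for small and large hardware sizes: 950 against a reference of 1000 queries per second passes a 10% tolerance, and so does 95 against a reference of 100.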

Hi @sneJ-, your use case makes total sense. I think I would approach it rather differently, though. You might be familiar with the concept of controls in the toolkit. You could create one that stores the output of a probe/action and injects it into the arguments of a later action. https://docs.chaostoolkit.org/reference/extending/create-control-extension/ The idea of controls is that they provide a mechanism by which you can expand on the toolkit's behavior without changing the core or its specification. Here are some examples:
While these two don't show that you can modify the experiment itself, that's allowed and supported.
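A rough sketch of such a control module, under stated assumptions: the hook names `after_activity_control` and `before_activity_control` follow the control extension documentation linked above, while the activity names `measure-baseline` and `compare-to-baseline`, the key `reference_value`, and the injected argument name `threshold` are hypothetical. It also assumes that the same `configuration` mapping is handed to every hook, which should be verified against chaostoolkit-lib before relying on it.

```python
from typing import Any, Dict, Optional

# Hypothetical names, chosen only for this illustration.
REFERENCE_KEY = "reference_value"
BASELINE_ACTIVITY = "measure-baseline"
COMPARE_ACTIVITY = "compare-to-baseline"


def after_activity_control(context: Dict[str, Any], state: Dict[str, Any],
                           configuration: Optional[Dict[str, Any]] = None,
                           secrets: Optional[Dict[str, Any]] = None,
                           **kwargs) -> None:
    """Remember the output of the baseline probe once it has run."""
    if configuration is None:
        return
    if context.get("name") == BASELINE_ACTIVITY:
        # `state` is the activity's run record; its output is the probe result.
        configuration[REFERENCE_KEY] = state.get("output")


def before_activity_control(context: Dict[str, Any],
                            configuration: Optional[Dict[str, Any]] = None,
                            secrets: Optional[Dict[str, Any]] = None,
                            **kwargs) -> None:
    """Inject the remembered value into the arguments of the later activity."""
    if not configuration or REFERENCE_KEY not in configuration:
        return
    if context.get("name") == COMPARE_ACTIVITY:
        provider = context.setdefault("provider", {})
        arguments = provider.setdefault("arguments", {})
        arguments["threshold"] = configuration[REFERENCE_KEY]
```

The experiment would then declare this module as a control, and the comparing probe would receive the measured baseline as an ordinary argument, without touching the environment or `globals()`.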

Hi, thanks. These are very good references. I'll have a look and fix my code and experiments accordingly once I'm not that busy anymore.

I would be happy to help when you do get the time. Ping me on slack or here :)

[THIS IS NOT A CONTRIBUTION] Hi, is this going to get merged?

Hard to tell. I agree with the usefulness of the feature. I had forgotten about them and actually redone them :/ @sneJ- could you let us know if you could squash this PR by any chance? I could review it :)
Added steady-state tolerance probes [1] to verify Prometheus (range_)query results against int or float thresholds.
Also added test cases to verify the steady-state tolerance probes' functionality.
[1] https://docs.chaostoolkit.org/reference/tutorials/tolerance/#advanced-scenarios
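For readers unfamiliar with the idea, a minimal sketch of a tolerance probe of this kind follows. It is not the code from this PR: the function names are hypothetical, and it assumes the probe's output, passed in as `value`, is the JSON body of the Prometheus HTTP API (`data.result` holding `value` pairs for instant queries and `values` lists for range queries).

```python
from typing import Any, Dict, List, Union

Number = Union[int, float]


def extract_samples(value: Dict[str, Any]) -> List[float]:
    """Collect every sample from an instant (vector) or range (matrix) result."""
    samples = []
    for series in value.get("data", {}).get("result", []):
        if "value" in series:
            # Instant query: a single [timestamp, "sample"] pair.
            samples.append(float(series["value"][1]))
        for _timestamp, raw in series.get("values", []):
            # Range query: a list of [timestamp, "sample"] pairs.
            samples.append(float(raw))
    return samples


def all_values_below(value: Dict[str, Any], threshold: Number) -> bool:
    """Tolerance check: True only if every returned sample is below the threshold."""
    samples = extract_samples(value)
    # An empty result is treated as a failed tolerance rather than a pass.
    return bool(samples) and all(sample < threshold for sample in samples)
```

Treating an empty result as a failure is a design choice in this sketch: a query that returns no series usually means the metric is missing, which should not count as a healthy steady state.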