Prometheus (range_)query steady-state tolerance verification probes by sneJ- · Pull Request #10 · chaostoolkit-incubator/chaostoolkit-prometheus

sneJ- · 2019-08-09T15:07:47Z

Added steady-state tolerance probes [1] to verify Prometheus (range_)query results against int or float thresholds.

query_results_lower_than_threshold checks if the Prometheus results are below a given threshold
query_results_higher_than_threshold checks if Prometheus results are higher than a given threshold
query_result_degradation checks if the average Prometheus query results deviate between the first run of the steady-state-probe and the second run.
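
For readers unfamiliar with tolerance probes, here is a minimal sketch of how such threshold checks could look. It assumes the steady-state probe's output is the raw Prometheus HTTP API response and that it reaches the tolerance function as a `value` argument. The helper names `extract_values`, `results_lower_than` and `results_higher_than` are illustrative and are not the functions added by this PR.

```python
from typing import Any, Dict, List


def extract_values(value: Dict[str, Any]) -> List[float]:
    """Collect every sample of a Prometheus HTTP API response as floats."""
    samples: List[float] = []
    for series in value.get("data", {}).get("result", []):
        if "value" in series:
            # Instant query: one [timestamp, "value"] pair per series.
            samples.append(float(series["value"][1]))
        for _, raw in series.get("values", []):
            # Range query: a list of [timestamp, "value"] pairs per series.
            samples.append(float(raw))
    return samples


def results_lower_than(value: Dict[str, Any], threshold: float) -> bool:
    """True only if there is at least one sample and all are below threshold."""
    samples = extract_values(value)
    return bool(samples) and all(s < threshold for s in samples)


def results_higher_than(value: Dict[str, Any], threshold: float) -> bool:
    """True only if there is at least one sample and all are above threshold."""
    samples = extract_values(value)
    return bool(samples) and all(s > threshold for s in samples)
```

Treating an empty result as a failure is a design choice of this sketch: a query that returns nothing says nothing about the steady state.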

Also added test cases to verify the steady-state tolerance probes' functionality.

[1] https://docs.chaostoolkit.org/reference/tutorials/tolerance/#advanced-scenarios

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>


Lawouach

Hi @sneJ-

I wanted to apologise for forgetting about this PR!

I do like what it tries to add but there are a few things I would like to simply discuss with you before we can decide to merge. If that's okay?

Namely, as I understand it, you are using the environment (or globals()) to store state, is that right? Do you think that is necessary? It's not very clean IMO, but I'm sure I'm missing something.

Would you be able to comment by any chance?

Lawouach · 2019-10-03T11:53:17Z

 from logzero import logger

-__version__ = '0.3.0'
+__version__ = '0.3.1'


For future reference, we usually do not change the version in the PR.

Lawouach · 2019-10-03T11:53:36Z

+    If no threshold is given it throws an exception.
+    """
+    if threshold is None:
+        raise Exception("No threshold given")


That should be ActivityFailed instead of Exception.

Lawouach · 2019-10-03T11:53:47Z

+    if threshold is None:
+        raise Exception("No threshold given")
+    logger.info("threshold: %d" % (threshold,))
+    print(value)


The trailing print should be removed.

Lawouach · 2019-10-03T11:54:45Z

+            rtn = False
+
+    if rtn:
+        logger.info("Probe: ok, all values are below the given threshold")


We don't usually log at INFO level unless it helps reading the experiment's flow.

Lawouach · 2019-10-03T11:55:20Z

+    values.
+    """
+    if threshold_variable:
+        if os.getenv("%s-%s" % (threshold_variable_prefix,


Why look in the env rather than in the configuration?

Hi, that's a good point. I'll have to check how to change it, though.
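
As a rough sketch of the configuration-based alternative: the function below reads a threshold from the experiment's configuration instead of `os.getenv`. It assumes chaoslib passes the configuration to any activity whose signature declares a `configuration` parameter. The function name `resolve_threshold` and the key `qps_threshold` are hypothetical.

```python
from typing import Any, Dict, Optional

# Stand-in for chaoslib.types.Configuration so the sketch runs on its own.
Configuration = Dict[str, Any]


def resolve_threshold(threshold_variable: str,
                      configuration: Optional[Configuration] = None) -> float:
    """Look up a named threshold in the experiment configuration."""
    configuration = configuration or {}
    if threshold_variable not in configuration:
        # A real probe would probably raise ActivityFailed here instead.
        raise ValueError(
            "No threshold named '%s' in the configuration" % threshold_variable)
    return float(configuration[threshold_variable])


# Usage with a hypothetical configuration block:
print(resolve_threshold("qps_threshold", {"qps_threshold": "120"}))
```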

Lawouach · 2019-10-03T11:56:27Z

+
+def set_result_as_threshold_variable(threshold_variable: str,
+                                     resize: int = 100,
+                                     value: dict) -> bool:


We prefer the typing module:

from typing import Dict
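
A sketch of what the signature could look like with the typing module. Note that in the quoted diff, `value` has no default yet follows `resize`, which does; Python rejects that ordering as a syntax error, so this sketch moves `value` forward. The new parameter order is an assumption, and the body is a placeholder because the quoted diff does not show it.

```python
from typing import Any, Dict


def set_result_as_threshold_variable(threshold_variable: str,
                                     value: Dict[str, Any],
                                     resize: int = 100) -> bool:
    """Signature sketch only; the real body is not part of the quoted diff."""
    # Parameters without defaults must come before those with defaults.
    raise NotImplementedError("body not shown in the quoted diff")
```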

Lawouach · 2019-10-03T11:56:49Z

+                                     resize: int = 100,
+                                     value: dict) -> bool:
+    """
+    Saves the passed Prometheus query value in an environment


It is a bit odd to store state in the environment.

Lawouach · 2019-10-03T11:59:01Z

-        if ("%s-%s" % (threshold_variable_prefix, threshold_variable))
-        in globals:
+        if ("%s-%s" % (threshold_variable_prefix, threshold_variable))\
+          in globals():


It's not very clean to use globals() that way.

Lawouach · 2019-10-03T11:59:49Z

    if threshold is None:
-        raise Exception("No threshold given")
+        logger.error("Probe: No threshold given")
+        raise ActivityFailed()


Good! Though a message in the exception could be useful :)
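
A minimal sketch of the suggestion: put the message in the exception itself so it travels with the failure. The helper name `require_threshold` is hypothetical, and the fallback class exists only so the snippet runs without chaostoolkit-lib installed.

```python
from typing import Optional

try:
    from chaoslib.exceptions import ActivityFailed
except ImportError:
    # Stand-in so the sketch runs without chaostoolkit-lib installed.
    class ActivityFailed(Exception):
        pass


def require_threshold(threshold: Optional[float]) -> float:
    """Return the threshold, or fail the activity with a readable message."""
    if threshold is None:
        raise ActivityFailed("No threshold given")
    return threshold
```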

sneJ- · 2019-10-08T18:17:39Z

Hi @Lawouach your comments are valid. Thanks.

Regarding storing the threshold in a global variable vs. storing it in the (experiment's?) configuration:

In my use-case I have experiments where I need to measure the utilization of a distributed system (e.g. queries per second), store that utilization as a temporary reference value, then take down one part of the distributed system, wait a certain amount of time until it recovers, and then check whether the utilization of the recovered system is similar to the temporary reference.

As we have different hardware sizes that impact the utilization it is easier to have only one experiment that can be used for every hardware size instead of having multiple experiments that only differ in the fixed threshold.

I agree using global variables as storage isn't optimal. If there is a way to set these values in the configuration from the action code I'm happy to change it.
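
To make the "similar to the temporary reference" check concrete, here is a small sketch that compares the average of a baseline run with the average of a later run. The function name `within_tolerance` and the 10% default are assumptions for illustration, not values taken from the PR.

```python
from statistics import mean
from typing import Sequence


def within_tolerance(baseline: Sequence[float],
                     current: Sequence[float],
                     max_deviation_percent: float = 10.0) -> bool:
    """Compare the average of two runs against a relative tolerance."""
    reference = mean(baseline)
    if reference == 0:
        # A relative deviation is undefined for a zero baseline.
        return mean(current) == 0
    deviation = abs(mean(current) - reference) / abs(reference) * 100
    return deviation <= max_deviation_percent


# Baseline averages 100 qps, the recovered system averages 98 qps.
print(within_tolerance([100, 110, 90], [95, 100, 99]))
```

Because the comparison is relative, the same experiment can serve different hardware sizes without a fixed threshold.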

Lawouach · 2019-10-15T19:14:50Z

Hi @sneJ-, your use-case makes total sense.

I think I would approach it rather differently, though. You might be familiar with the concept of controls in the toolkit. You could create one that stores the output of a probe/action and injects it into the arguments of the next activity.

https://docs.chaostoolkit.org/reference/extending/create-control-extension/

The idea of a control is that they provide a mechanism by which you can expand on the toolkit's behavior without changing the core or its specification.

Here are some examples:

While these two don't show you can modify the experiment itself, that's allowed and supported.
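
A rough sketch of such a control, under stated assumptions: the hook names `before_activity_control` and `after_activity_control` follow the toolkit's control interface, the activity's result is assumed to be available under `state["output"]`, and the `@output:` marker is a hypothetical convention invented for this example.

```python
from typing import Any, Dict

# Hypothetical marker: "@output:<activity name>" in an argument value means
# "replace me with the stored output of that activity".
PREFIX = "@output:"

# Outputs remembered by this control, keyed by activity name.
_outputs: Dict[str, Any] = {}


def after_activity_control(context: Dict[str, Any], state: Dict[str, Any],
                           **kwargs: Any) -> None:
    """Remember the output of each activity under its name."""
    _outputs[context["name"]] = state.get("output")


def before_activity_control(context: Dict[str, Any], **kwargs: Any) -> None:
    """Replace marked arguments with a previously stored output."""
    arguments = context.get("provider", {}).get("arguments") or {}
    for key, val in list(arguments.items()):
        if isinstance(val, str) and val.startswith(PREFIX):
            source = val[len(PREFIX):]
            if source in _outputs:
                arguments[key] = _outputs[source]
```

With this in place, a later activity could declare an argument such as `"threshold": "@output:measure-qps"` and receive the value measured earlier, keeping the state inside the control rather than in the environment.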

sneJ- · 2019-10-28T15:17:52Z

Hi, thanks. These are very good references. I'll have a look and fix my code and experiments accordingly once I'm not that busy anymore.

Lawouach · 2019-10-29T13:39:44Z

I would be happy to help when you do get the time. Ping me on slack or here :)

nunziox · 2020-05-27T17:44:54Z

[THIS IS NOT A CONTRIBUTION]

Hi,

Is this going to get merged?
This looks like a pretty important feature.
We would like to use this module, but without this feature it does not look useful.

Lawouach · 2020-05-27T18:55:41Z

Hard to tell. I agree with the usefulness of the feature. I had forgotten about them and actually redone them :/

@sneJ- could you let us know if you could squash this PR by any chance? I could review it :)

mariadb-JensRowekamp added 14 commits August 6, 2019 16:50

initial step for verification probes

43231cd

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - initial activities integration

110d810

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - packaging bugfix

56b79f8

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - first actual logic for query_results_lower_than_threshold

6782e32

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - verification probes - iteration

3d6ecee

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - bugfix

4f68fa9

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - bugfix

efe16b7

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - bugfix

1588fc6

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - further bugfix

91aa8e7

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - bugfix

05ae9cc

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - typo

44a488a

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - bugfix

2b3a61d

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - changed return from False to ActivityFailed to be evaluated correctly by chaostoolkit-lib

bbac9ee

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

DBAAS-977 - tests added

949aa2d

Signed-off-by: sneJ- <jens.rowekamp@mariadb.com>

Lawouach requested changes Oct 3, 2019


4 participants
