Welcome to the Salon Owners Podcast, Phorest FM Episode 94. Co-hosted by Killian Vigna and Zoé Bélisle-Springer, Phorest FM is a weekly show that puts forth a mix of interviews with industry thought-leaders, salon/spa marketing tips, company insights and information on attending Phorest events and webinars. A new Phorest FM episode airs every Monday morning for your enjoyment with a cup of coffee on your day off.

Phorest FM Episode 94

Just because everything we know is taking a digital turn, it doesn’t mean that technology is infallible. Sometimes, outages happen. As a consumer, it’s important to know what causes these disruptions and what you can do to protect your business when a system you rely on temporarily fails. This week on the show, Killian and Zoe sit down with John Doran, Director of Engineering at Phorest Salon Software. You’ll get a better understanding of what outages are, their possible causes and what happens behind the scenes when a system goes down.


Leave a Rating & Review: https://bit.ly/phorestfm

Transcript

Killian Vigna: Welcome to the Phorest FM Podcast Episode 94. I’m Killian Vigna.

Zoe Belisle-Springer: And I’m Zoe Belisle-Springer. This week, we’re talking about software outages. We won’t get too technical, but we’ll be focusing on the typical causes of an outage and what’s being done behind the scenes to get a product back up and running. Stay tuned for this week’s conversation, and as always, we’ll top off the show with our latest announcements and upcoming Phorest Academy webinars.

Killian Vigna: Grab yourself a cup of coffee, sit back, relax, and join us weekly for all your salon’s business and marketing needs. Good morning, Zoe.

Zoe Belisle-Springer: Good morning, Killian.

Killian Vigna: We’ve just kicked off December and you already sound sick.

Zoe Belisle-Springer: Yeah, well, I never really get sick to be honest. This is the first time this year, so it always happens when you’re about to fall on holidays and then… like you’ve been working so hard for so long and then you know you’re about to get onto holidays and then you get sick.

Killian Vigna: Every time without fail. It’s like even if you go away for the weekend, you come back Sunday evening and you’re [inaudible 00:01:09]. It’s just like you can’t do holidays right.

Zoe Belisle-Springer: Yup, yup.

Killian Vigna: This week’s one is an interesting one. I suppose for anyone that looked at the title, they were probably thinking, “What the hell?”… probably sounds a bit techy, but it’s not actually that techy. I suppose it was actually your idea to come up with this show and it was to do with anyone that uses Snapchat, anyone that uses Netflix. You’re seeing these kind of, I suppose not so much software outages, but connection errors and things like those. We just decided what better way to describe what is actually going on here and what’s going on behind the scenes than bringing in our very own Director of Engineering, John Doran, so welcome to the show, JD.

John Doran: Thanks for having me on, guys.

Killian Vigna: We’ll probably refer to him as JD a lot here, yeah. For JD, I suppose the four dreaded words that he ever hears in the office are “the system is down.”

John Doran: Yep, so absolutely terrifying. Puts chills down my back. It’s something that we take really, really seriously at Phorest, and it’s something that we try to do our best to make sure it doesn’t happen. One of the main parts of my role is to ensure stability and reliability of the system and we work day in, day out to make sure that as we scale and we’re helping more salons on the platform that we can continue to kind of cope with the traffic and make sure that they don’t have any connectivity errors as you mentioned or frustrated users and that the application’s running nice and fast for them.

Killian Vigna: See, this is the thing, though. It’s like the more and more we move into, I suppose, more technology-based systems, software… we try to move into the 21st century, you are going to experience a bit more of this though. It’s not just for one company or another company. A lot of companies will experience this and I suppose throughout this show, we’ll understand a bit more of why outages happen.

John Doran: Yeah, so, I mean, typically, you’re gonna see things like Facebook, Instagram, Snapchat, you mentioned, most big software companies, they’re essentially, they’re trying to make a lot of changes to improve the system or to improve their products. With those changes comes the risk of maybe human error or a failed kind of release and what happens then is people are affected, so the kind of big tech companies and ourselves, we do our best to make sure that as we incrementally make changes to the system or as new features are being released, that we do it really, really carefully. We get into [inaudible] that kind of works and the types of issues that we have seen in the industry.

Zoe Belisle-Springer: Well, you’ve mentioned human error there and concretely, what is an outage? Is it just when the system is completely down or can it be a certain part of the system that’s not working?

John Doran: Yeah, so it really depends on how the system was built and that will be on kind of like the directors of engineering and the CTOs and the tech teams to assemble a product in a way that it’s a little bit more modular and broken up into different pieces and those different pieces serve various pieces of functionality in the system. Maybe, for example, the Facebook Feed is one backend stack and the Messaging is another one, so you could kind of say if the Messaging team on Facebook were going and releasing an update and it broke that part of the system, it maybe wouldn’t affect the Feed or sharing. That’s just kind of one example of how tech teams are breaking up functionality.

To give you like what is an outage, essentially, that is… most software products are hosted on platform providers, so essentially, their software is running on a service that they pay to run it. Google Cloud Platform and Amazon Web Services are two huge ones. Most of the products that people listening today use would be hosted on those. Typically, when you’re releasing software, you’re putting new… essentially code or functionality onto those systems and if you don’t have the proper testing and kind of procedures in place while you’re doing that, that could lead to human error.

You’d see maybe startups taking some shortcuts, doing some stuff faster than they should be or maybe not giving enough diligence and that’s where human error can take place. That could be a misconfiguration of the application or it could be corrupting some data and that all affects how the application performs and things that could happen would be like it would lock the system up and block people from using it or it could just take the system out of service. Does that make sense?

Zoe Belisle-Springer: Yeah, it does. Absolutely.

Killian Vigna: I just wanted to take it back there. You said that things could be hosted onto Google Cloud and Amazon. I suppose for people that don’t build software, what are those things? Because some software is built on those, does that mean there’s a possibility that if that went down, multiple different software companies could be affected?

John Doran: Definitely, definitely, so we would have an example last year for Phorest where Amazon hosts storage on the internet, so the majority of images you see on the internet are hosted in an Amazon service called S3 and that’s their storage service. Essentially, last year, someone in the Amazon Data Center made a human error which brought down, I would say, half the internet in that S3-

Killian Vigna: I actually remember that. Yeah, it was like a blackout.

John Doran: Yeah, so-

Killian Vigna: We hit the millennial-

John Doran: Yeah, so it’s up to us as an engineering team and other teams to be able to react to when those things happen and make sure that the system doesn’t fall apart, so that might mean instead of going and fetching that image from Amazon Service, you display a placeholder image because you can’t reach it so the technical term is called a circuit breaker, but it’s this idea that you just switch off when a service isn’t working and you can do something else.

Yeah, it’s really important to say that the Amazon Web Services, those, it essentially outsources all of their compute power to tech companies who want to host their systems there. Definitely when something happens on their side, it has a chain reaction, so that could be like their connectivity in their data centers or maybe an issue on their side with an update. But we choose those providers because they are the best in class. They’ve invested billions of dollars in building those infrastructures. If Netflix and big companies like that can trust them, so can we. They have a really, really good record in terms of their service and how they actually run.
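The circuit-breaker idea JD describes (stop calling a service that isn’t working and do something else, such as showing a placeholder image) could be sketched roughly as below. This is a minimal illustration, not Phorest’s implementation: the `CircuitBreaker` class, the `PLACEHOLDER` file name, and the failure and cool-down thresholds are all assumptions made for the example.

```python
import time

PLACEHOLDER = "placeholder.png"  # hypothetical fallback image


class CircuitBreaker:
    """Stops calling a failing service for a while and returns a fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures in a row before switching off
        self.reset_after = reset_after    # seconds to wait before trying again
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, fallback):
        if self.opened_at is not None:
            # Circuit is open: skip the service until the cool-down has passed.
            if self.clock() - self.opened_at < self.reset_after:
                return fallback
            # Cool-down over: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0  # a success resets the count
        return result
```

A caller would wrap the image fetch, for example `breaker.call(fetch_image, PLACEHOLDER)`, where `fetch_image` is whatever function reaches the storage service. The page keeps rendering with the placeholder while the service is unreachable.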

Killian Vigna: Just on that, you’re saying that there’s a lot of companies like Amazon three-

John Doran: S3.

Killian Vigna: S3, yeah, that’s right. Is there somewhere that you could go, so let’s hypothetically say Phorest stopped working today. As a client, how do I know if it’s Phorest or if it’s Amazon?

John Doran: Typically we would, well, most services would host a status page, so you could actually maybe just do a quick google Phorest status or Facebook status and that would bring you and show you is it operational or is it up.

Killian Vigna: That’d be kind of similar to take it back to my simpler terms of if I see Facebook isn’t working properly or Instagram isn’t working, the first thing I’m gonna do is go to isitdown.com and just type in the webpage.

John Doran: Yeah, pretty much, yeah.

Killian Vigna: Okay, so you can do that then.

John Doran: Yeah, but it also kind of depends on your software provider and the level of service that they give you and the touch points. As I said, we take it really seriously, so we maybe send a text message to salon owners and we do email communications, whereas if you’re maybe one of a million users, you’d be looking to maybe get a chatbot or something like that. It kind of depends on the product you’re using and what’s going on.

Zoe Belisle-Springer: JD, what happens when the system does go down?

John Doran: Okay, so to take a step back a second, we have a bunch of monitoring in place which is essentially automated checks on the system to make sure it’s performing, so it kind of simulates what a user does. For example, maybe for us, it’s is our online bookings working correctly? Are they creating bookings for the salons? Are we able to create appointments? Are we able to send reminders? We have heartbeats checking on all of that stuff all of the time and if one of them fails, the engineering team will get an SMS. So we have an on-call rota, so depending on time of the year, we’ll have to draw a lucky straw for Christmas time, but we’ll essentially be on call and be ready to respond to that and investigate what’s going on.

In our case, if a couple of calls are coming in or if we notice that those heartbeats are failing, we’ve got a physical red button that we’ll press which will kind of alert the whole company. This is how seriously we take it when stuff is going wrong. We can’t perform what we’re set out to do in terms of the platform functionality, so we need all hands on deck to get this sorted out.
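The heartbeat monitoring described here (automated checks that simulate what a user does, with an SMS to the on-call engineer when one fails) might look something like the sketch below. The function names `run_heartbeats` and `alert_on_call`, the check names, and the `send_sms` callback are hypothetical; a real setup would plug in its own checks and its own messaging provider.

```python
from typing import Callable, Dict, List


def run_heartbeats(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed


def alert_on_call(failed: List[str], send_sms: Callable[[str], None]) -> None:
    """Send one message to whoever is on call, but only if something failed."""
    if failed:
        send_sms("Heartbeat failed: " + ", ".join(failed))
```

In practice a scheduler would call `run_heartbeats` every minute or so with checks such as "can an online booking be created?" and "can a reminder be sent?", and pass the result to `alert_on_call`.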

Killian Vigna: That red button is one of those scary things to see pop up on Slack, isn’t it?

John Doran: Yeah, our company messaging platform is Slack. It’s a bit like WhatsApp for business, so we’ll literally notify everyone in the company what’s going on. Everyone on the engineering team will get notified and then we, it’s not chaos, so we try to be really controlled about it. What we do is actually form, and most big companies will do similar, we form a war room and what we call the war room is essentially a place where only the people who are gonna be focusing on the problem sit down and work on it together because you don’t want a lot of noise and panic and pandemonium in the background.

We try to be really controlled about that and we assign people kind of different hats and those hats will be maybe the communicator, the person who’s gonna update the company and the customers what’s going on, maybe the manager, the person who’s responsible for making decisions, so maybe hard decisions, but to kind of guide the team. And then you’ve got the fixers, which are the people who are essentially gonna analyse the problem and propose solutions for it.

Killian Vigna: Do you have any average turnaround time for when something like this does happen? Like I’m in the salon, Phorest goes down, obviously, I’m going to start panicking. I’m going to start contacting Support and everything like that, but is there an average turnaround time where I could sit back and go, well hang on, this is probably only going to be five, 10, an hour?

John Doran: Yeah, so we strive to have a 99.9% uptime, so that means 10 minutes a month is what we would, if we go any more than that, we would hold ourselves really accountable and those 10 minutes even include system out-of-hours upgrades and stuff. 10 minutes would be, if we’re pushing past that, we’d be really worried. Luckily, we’re past that, I would say, in terms of where we’re at now and where most companies should be.
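As a rough check on the numbers, an uptime percentage can be converted into a monthly downtime allowance with a little arithmetic. The sketch below assumes a 30-day month, and the function names are made up for illustration. Under that assumption, 99.9% uptime works out to about 43 minutes of downtime a month, so holding yourself to 10 minutes a month is a stricter target, closer to 99.98%.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a period for a given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def uptime_percent_for(downtime_minutes: float, days: int = 30) -> float:
    """Uptime percentage implied by a given amount of downtime in a period."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


print(round(downtime_budget_minutes(99.9), 1))  # about 43.2 minutes in 30 days
print(round(uptime_percent_for(10), 3))         # about 99.977 percent
```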

Zoe Belisle-Springer: You’ve pressed the red button. You’ve got into the war room, you’ve assigned your roles, what happens after?

John Doran: Once we’re happy that we’ve got a fix in place, we would essentially do a bit of testing on our side, make sure that everything is fine and then send comms out to customers to let them know that the system is back to normal. We take a deep breath, then have a cup of coffee and then come straight back into it and really try to understand the root cause of the problem.

We try to be really transparent with our customers and the company and we would essentially write an outage report to send out to everyone in the company, any customer that wants to see it and give them a clear explanation on things. What that outage report consists of is a summary of the problem, how many customers were affected, what areas of the system were affected, so as I mentioned, most systems are broken up into different parts, so maybe it’s just online bookings or maybe it’s payment processing or appointments. We would then look at the time to resolution, so how long did it take for us to realise that the problem was in place and how long did it take for us to fix it, and then we will go into details of how we fixed the problem.

And then what’s really important for us is preventative measures, so what are we doing as a team over the next couple of weeks to make sure that this doesn’t happen again? And we would really hold ourselves accountable for getting that stuff done and making sure that happens. That will be kind of a standard one and if something really terrible happened, we would probably do like a post-mortem and what that would be is essentially we grab anyone who is a stakeholder or really affected by this.

We’d all just come into a room and really try to look at the whys and dig into those. It’s called the five whys, as in: why was there human error and why was that person responsible for that on their own and why didn’t they get someone else to help them with that? You normally get down to a root cause of what really went wrong, but it’s also really good to let people maybe vent some frustrations and really let them express how badly it affected them and it really puts the onus on us as well to take action on it.
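The outage report JD outlines (summary, customers affected, areas affected, time to notice, time to fix, fix details and preventative measures) maps naturally onto a small record type. The sketch below is one possible shape under that reading; the `OutageReport` class and its field names are assumptions for illustration, not the format Phorest actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class OutageReport:
    summary: str
    customers_affected: int
    areas_affected: List[str]      # e.g. online bookings, payment processing
    started_at: datetime           # when the problem began
    detected_at: datetime          # when the team realised there was a problem
    resolved_at: datetime          # when the fix was in place
    fix_details: str
    preventative_measures: List[str] = field(default_factory=list)

    def time_to_detect(self) -> timedelta:
        """How long it took to realise the problem was there."""
        return self.detected_at - self.started_at

    def time_to_fix(self) -> timedelta:
        """How long it took to fix once it had been noticed."""
        return self.resolved_at - self.detected_at
```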

Zoe Belisle-Springer: You mentioned something earlier in that, where you said it could be Phorest Pay or it could be online bookings. When multiple things go wrong, how do you prioritise what to fix first?

John Doran: Yeah, so we have a prioritisation list of five key things and most big tech companies would, and what we look at for those priorities is to give us a common language around what is the severity of this issue. P1 is essentially everything is down, all hands on deck. This is really serious. Most of the priorities are very serious, but P1 is essentially doomsday.

Maybe P2 is like a certain area or functionality of the system is affected or maybe, again, back to P1, it could be security or data. Those kinds of things would be top of the list and then as you go down, you’re less critical, so maybe a P5 is a very isolated piece of the system and only a couple of salons are affected. Obviously, it’s important to fix those issues for that salon, but it wouldn’t mean all hands on deck, message everybody in the company. We would just take that into a queue and we would sort it out as soon as we can.

Just wanted to mention one thing about the post-mortem. One thing I’ve been doing recently to try and preempt any issues is a pre-mortem, which is essentially get people into a room and talk about what could go wrong, what are we really, really at risk of here or what part of the system is weaker than others to try and preempt those problems and put those preventative measures in place before they even happen, so it could be we know this area of the system is a little weak, so let’s, for the next couple of months, make it better, make sure it’s solid and we’re happy with it. Just being really proactive on it is something that we try to do.
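The P1 to P5 scale gives the team a shared way to order work when several things break at once. A minimal sketch of that ordering follows. Only P1, P2 and P5 are described in the conversation, so the meanings of P3 and P4 are left open here, and the `triage` and `needs_all_hands` helpers are hypothetical names for the example.

```python
from enum import IntEnum
from typing import List, Tuple


class Priority(IntEnum):
    P1 = 1  # everything is down, or a security/data issue
    P2 = 2  # one area or piece of functionality is affected
    P3 = 3  # not described in the conversation
    P4 = 4  # not described in the conversation
    P5 = 5  # very isolated issue, only a couple of salons affected


def triage(incidents: List[Tuple[str, Priority]]) -> List[str]:
    """Return incident names, most severe first; ties keep their original order."""
    return [name for name, _ in sorted(incidents, key=lambda item: item[1])]


def needs_all_hands(priority: Priority) -> bool:
    """Assumption for this sketch: only P1 alerts the whole company."""
    return priority == Priority.P1
```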

Killian Vigna: All of these outages that could happen from human error, would you see outages for software updates?

John Doran: Around software updates, so an outage normally happens for one of two reasons: it’s an unexpected event, so like a network fault or a disk running out of space, or it’s an update, as in where somebody is releasing a change to a system that hasn’t been tested properly or there is a human error, so that’s change management. What we do to reduce risk there, what most people would do, is we have a QA Team, so they’re responsible for the quality assurance of the product, and then also we would have a lot of automation around what we do, so we would make sure a change to one piece of functionality in the system doesn’t affect any others.

And then, again, when you’re rolling out features, I mentioned earlier that you might see a different version of a screen on your Facebook App compared to your friends and that’s because it’s important to do experimentation and to make sure that the changes that you’re making to the system don’t have negative effects or don’t cause user experience issues. We would do A/B testing and gradual rollouts of software and what that may mean is us doing a specific region, maybe it’s a percentage of users in different regions and measuring is it having good effect on maybe the online booking numbers or how they’re using the system, is it making it easier? That’s the kind of stuff we would do.

Killian Vigna: This is why the users of Phorest would see a new payroll screen appear, but have the ability to switch back to an older one or the manager screen might look a bit more animated than it used to.

John Doran: Yeah, exactly, so we would kind of call it feature toggling and what that means for us is, we’re continually developing and releasing from Phorest’s side about 50 times a week, so we’re continually making changes and tweaks and we make really small changes along the way to reduce any risk with that and then we have these feature toggles I mentioned, which mean we’re building new areas, maybe a new staff list screen is a good example. Only a certain amount of our customers can see that or maybe our beta users and while they’re giving us feedback on it and catching issues that aren’t affecting anyone else, we’re addressing them and continually updating them. Once we’re happy that it’s having a really positive effect on the system, we’ll roll it out.
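Feature toggles combined with a gradual rollout can be sketched as a function that decides, per user, whether a new screen is shown. The version below is an illustration under assumptions: it hashes the feature name and user ID into a stable bucket from 0 to 99, always enables beta users, and enables everyone else only if their bucket falls under the current rollout percentage. The function names and the hashing scheme are choices made for this example, not a description of Phorest’s system.

```python
import hashlib


def rollout_bucket(feature: str, user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: str, percent: int, beta_users=frozenset()) -> bool:
    """Decide whether this user sees the feature at the current rollout percentage."""
    if user_id in beta_users:
        return True  # beta users always get the new version
    return rollout_bucket(feature, user_id) < percent
```

Because the bucket depends only on the feature name and the user ID, a user who sees the new screen at 10% keeps seeing it as the percentage is raised, and setting the percentage back to 0 switches the feature off for everyone except beta users.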

Killian Vigna: You just mentioned beta users there. Does us having more and more beta users help kind of deter anything like this, deter outages or is that different again?

John Doran: It’s a little bit different to outages. It will be more like defects or missed functionality where we would find that really important so we would use a lot of our customers as our champions to give us the feedback. They’re using the system every day. They know how they want it to work and we would use all of that to make sure that we’re moving in the right direction in stuff we do.

Killian Vigna: I just wanted to pull it back to the salon just for a little bit. I’m in a salon. I’m using Phorest. I’m using Mailchimp. I’m using Slack. I’m using all these different softwares and then one of them crashes. My instant reaction is panic, like that is what’s gonna… it’s the same for when we’re using systems in here and it happens. Is there anything at all you could do there and then? I suppose I’m not gonna help get the system up and running, but is there any strategies I can put into place in my salon for scenarios like this?

John Doran: It’s a tough question so-

Killian Vigna: You asked for a grilling.

John Doran: What we try to do is make sure you’re not in that situation, so that’s the very first thing. But if you are, I think, first of all, check… Go to the internet. Check it’s not your internet connection, maybe there’s something wrong with the router. Maybe someone pulled the wire out of it, so just make sure it’s nothing on your side first before completely panicking. Then maybe google Phorest.com/status or Instagram.com/status, whatever it is and that should give you a history of what’s going on with the system. If there’s any big red flags or anything, it’s on them. Again, it depends on the software system and the provider and how high touch they are and how important you are as a user to them.

For us, for example, if we get a call, we’ll make sure that we help you in any way we can. We would ask you maybe for as much detail as possible on the issue. Maybe it’s an error popup with a little code on it. Maybe if you could give us that code, we can easily help resolve that issue or if it’s more of a widespread issue, we’ll make sure that you’re communicated to about what’s going on.

I think, in really practical terms, have a pen and paper ready beside your till, just in case you need to run a transaction or something, and you can go back later in the system and fill it out. Yeah, that’s a tough question. Thanks.

Zoe Belisle-Springer: Maybe another tough one for you, JD, but if I’m, I don’t know, looking for new software, whether it be like internal messaging or really anything that has to do with technology, how can I make sure that this software company is going to take any issue or any outage seriously and that I’m gonna be completely covered in terms of if I need to reach out to them or anything?

John Doran: You can kind of look at it from how high touch they are to maybe throughout the sales process or onboarding. You could ask them straight up, “What’s your outage history? What sort of issues have you had and how do you reassure users when something does happen? What sort of service level do you provide to me?” I think as well, there’s a lot, users are extremely vocal. When there is issues, you will know about it. If you go to maybe some of the user forums or the communities, you could literally just ask how are things going with this software? Have you had any troubles with it, things like that.

Killian Vigna: I suppose when you go to look at the support team that they have on offer as well. Is it just email? Is it just live chat?

John Doran: Definitely.

Killian Vigna: Or do they actually take live calls? That’s us, by the way, and I suppose like JD said, we do send out an email after something happens and that’s also kind of another reason why we did this episode because there are probably people listening going, “Do I really need to know this information?” That outage email that goes out after something happens actually has a very high open rate, so a lot of our clients do actually want to know what happened, why an outage happened, what was the cause of it and what was done to fix it. That does kind of add to this episode.

And I suppose we’ll get JD to wrap up a few points, but if people do want to know a bit more about the company, if they want to hear more from our tech side, our engineering side, they can always just reach out to us, so what is it, it’s?

John Doran: For me, jdoran@phorest.com. If you have any questions or if you’ve maybe experienced anything around this, I’m happy to have a chat with you through this stuff. As I said, we hold ourselves really accountable for this. We do our best to make sure it doesn’t happen. We put a lot of effort and invest a lot in making sure our system will scale with salons’ needs and it’s something we’re really passionate about.

Killian Vigna: Do you want to give us just a quick recap again? Maybe just touch off the strategies that you could have in place and what to look for in a company.

John Doran: Yeah, definitely. As I mentioned, it’s a lot about the quality of the software that you’re buying and how fast it is as you’re clicking around, how quick they are to respond to your feedback is really important as well. Are you noticing issues and are those technology companies listening to you and maybe addressing your needs or accounting for those challenges in their roadmap and what they’re building? I think it’s really important to really be involved and try to push to be the voice of the customer for the products that you use. When issues do happen, watch out for what the communication channel is and how quick they are to respond and how accountable they do hold themselves for that stuff.

Killian Vigna: If anything ever does happen, you check, isitdown.com and the other one was?

John Doran: Phorest.com/status will have essentially a health indicator for us or status.phorest.com.

Killian Vigna: Well, listen, JD, thanks very much for joining us on the show. Hopefully, that gives some more clarity around outages and what to look for in companies that… like this stuff is going to happen. The more and more we move digital, it’s gonna happen, but it’s not necessarily always for the worst as JD said. Sometimes it’s because they’re trying to improve the product or upgrade it or it could just be human error. The most important part is how is the company you’re with addressing it and I’d just like to throw in a final note. If anyone is looking for more topics like this, JD, he was pure excited to join us on the show today and we know he’d love to talk about more stuff because you’re a regular contributor for-

John Doran: For our NothingVentured Blog, which is a lot more on the technical side of things, which is something that we’re really, really passionate about and we love talking about.

Killian Vigna: Yeah, we just love to share everything. Thanks very much for joining us today, JD.

John Doran: Thanks, guys. Cheers.

Killian Vigna: So that was JD, our Director of Engineering, shedding some light on outages and why you shouldn’t necessarily panic straightaway, but, I suppose, steps that you can take. Moving on now to the second part of the show, we have Zoe’s section and I believe you have some news about an event?

Zoe Belisle-Springer: The Salon Owners Summit which is hosted at the Convention Centre in Dublin on January 7th of the coming year is now sold out officially. But that doesn’t mean you can’t get in. If you are a Phorest client, you can join the waiting list. If you are interested in coming to the Summit and you are a Phorest client, do get onto that as soon as possible ’cause the spots will fill up quite fast. Myself and Killian will be hosting, like last year, a live episode at Inside Phorest. Inside Phorest is kind of like the HQ tour and session that happens the day after the Summit and we have a session in the morning and a session in the afternoon with the product team with Patty and JD, who was just on the show.

What the guys are gonna be talking about with people attending is kind of showing them the new features on the roadmap, trying to get feedback from it, insights, how people would like to see those features working in the future in the salon if they were to be released. And so in the middle of that, Killian and myself will be hosting a live podcast recording.

So as I said, if you are attending the Phorest Salon Owners Summit or if you are looking into attending it and you are a Phorest client, join the waiting list. And before we let you go, we actually have released a new… Well, the YouTube channel, we’ve always had a YouTube channel but we’ve completely rebranded it Phorest Studio and we’re releasing new episodes every week. Don’t miss out on those episodes. There’s a lot of marketing tips. There’s interviews, industry experts and stuff like that, so a whole lot going on there.

It’s six episodes left to the year. Now I know there’s not six weeks left, but there is a few different special episodes coming up during the Christmas period and we’ll leave it at that this week. Now we’re already planning our early 2019 episodes, so if you know someone with a story who’d be great on this show, please slide into our DMs and if you have any feedback, feel free to leave us a review on iTunes or on Stitcher. We’re always looking for suggestions on how to improve the show. Otherwise, have a wonderful week, guys, and we’ll catch you next Monday.

Killian Vigna: All the best.

Thanks for reading! #LetsGrow



Note: Phorest FM is designed to be heard, not read. We encourage you to listen to the audio, which includes emotion that may not translate to the page. Episode transcripts are produced using a third-party transcription service; some errors may remain.