Bot Nirvana | AI & Automation Podcast

Nandan Mullakara

Technology

Available Episodes

5 of 48

Alex and Doug
In this episode, we are joined by Intelligent Automation experts Doug Shannon and Alex Dixon to unravel the complex terminology dominating today's Agentic AI automation landscape. The conversation delves into how enterprises are integrating cognitive abilities into traditional automation workflows, exploring the evolution from RPA to intelligent automation and now to agentic process automation.
Key topics discussed:
- Defining agentic process automation and how it differs from traditional RPA
- Large Action Models (LAMs) and how they're transforming UI-based automation
- How process mining and task mining data are fueling the next generation of automation
- The importance of maintaining guardrails and human oversight in enterprise automation
- The convergence of application modernization and automation technologies
- The distinction between workflow-based agentic automation and goal-oriented AI agents
- Real-world examples of automation implementation in call centers and other business processes
- The promising future of combining LLMs and LAMs to reimagine how work happens
More information and Links:
Connect with Alex: linkedin.com/in/alexanderrdixon/
Connect with Doug: linkedin.com/in/doug-shannon/
Visit Nandan on the web at nandan.info
--------
25:43
Manish Ballal
Manish Ballal is a GTM and Sales leader with over a decade of experience in the automation space. He is currently leading Generative AI initiatives at Amazon Web Services (AWS). He brings a wealth of experience from both large global technology companies and startups. Previously, he held leadership roles at major GSIs and had a significant tenure at Automation Anywhere.
In this episode, we discuss:
- Automation evolution
- Enterprise deployments
- Specific use cases
- Challenges with security, AI agents
- Process-first approach
- Vertical Agents
More information and Links:
Connect with Manish: Linkedin.com/in/manishballal/
Visit Nandan on the web at nandan.info
--------
26:08
Agentic Process Automation (APA)
In this episode, we explore Agentic Process Automation (APA), a paradigm that could revolutionize digital automation by harnessing the power of AI agents. The discussion focuses on the ProAgent system as an example of APA. APA introduces a new paradigm where AI-driven agents can analyze, decide, and execute complex tasks with minimal human intervention. We'll unpack the groundbreaking Automation concept which showcases the true potential of AI agents through its innovative approach to workflow construction and execution. Key Topics Covered Introduction to Agentic Process Automation (APA) Comparison between traditional Robotic Process Automation (RPA) and APA ProAgent: A prime example of APA implementation Key innovations of ProAgent: Agentic workflow construction Agentic workflow execution Types of agents in ProAgent: Data agents Control agents Case study: Using ProAgent with Google Sheets for business line management Potential impacts and implications of APA on work and decision-making Future developments and considerations for APA technology This episode was generated using Google Notebook LM, drawing insights from the paper "ProAgent: From Robotic Process Automation to Agentic Process Automation" Stay ahead in your AI journey with Bot Nirvana AI Mastermind. Podcast Transcript All right, everyone. Buckle up, because today's deep dive is going to be a wild ride through the future of automation. We're talking way beyond those basic schedule this kind of tasks. Yeah, we're diving headfirst into the realm where AI takes the wheel and handles the thinking for us. Oh, yeah, the thinking part. Yeah. If you could give your computer a really complex task, something that needs analysis, decision-making, maybe even a dash of creativity, that's what we're talking about. And right now, your typical automation tools, they would hit a wall. Hard. They're great at following those rigid step-by-step instructions. Like robots. Exactly. 
But when it comes to anything that requires actual brain power. Still got to do it ourselves. Well, that's where this research paper we're diving into today comes in. It's all about something called agentic process automation, or APA for short. And let me tell you, this stuff has the potential to completely change the game. OK, for those of us who haven't dedicated our lives to the art of automation, give us the lowdown. What is APA, and why is it such a big deal? Think about your current automation workhorse RPA, robotic process automation. It's like that super reliable assistant who never complains but needs very specific instructions for every single step. Right. Amazing at those repetitive tasks, but needs you to hold their hand through every decision point. Exactly. Now, imagine that same assistant, but with a secret weapon, an AI sidekick whispering genius solutions in their ear. OK, now you're talking. That's APA in a nutshell. We're giving RPA a massive intelligence boost. So instead of just blindly following pre-programmed rules, we're talking about automation that can actually think. You got it. APA introduces the idea of agents, which are basically AI helpers embedded directly into the workflow. These agents can analyze data, make judgment calls based on that analysis, and even generate things like reports, all without a human meticulously laying out each step. So it's not just about automating tasks anymore. It's about automating the intelligence behind those tasks. You're catching on quickly. And this paper focuses on a system called ProAgent as a prime example of APA in action. All right, lay it on us. What is ProAgent? So ProAgent really highlights the potential of APA with two key innovations-- agentic workflow construction and agentic workflow execution. OK, so those are some pretty hefty terms. Can you break those down for us? Let's start with how ProAgent constructs workflows. What makes it so revolutionary? 
Well, with your traditional RPA, you're stuck painstakingly designing every single step of the process. It's like writing a super detailed manual for a robot. Right, like you don't want the robot to deviate at all. Exactly. But ProAgent flips the script. Instead of you having to lay out every tiny detail-- It can just, like, figure it out. You give it high-level instructions, and the LLM-- that's the AI engine-- actually builds the workflow for you. Wait, so it's like you're telling it what you want to achieve, and it figures out the how. Think of it like having an AI assistant who understands your goals and can translate those goals into a functional workflow. OK, that is seriously cool. And then, agentic workflow execution-- that's where those agents we talked about come in, right? They're the ones actually doing the heavy lifting. You got it. ProAgent uses two types of agents-- data agents and control agents. They work together like specialized teams within your automated workflow. OK, I'm really curious about these specialist teams now. Let's start with the data agents. What's their area of expertise? Data agents are the masterminds behind complex data processing. We're not talking simple copying and pasting here. Imagine you need a report summarizing key trends from a massive spreadsheet. Yeah, that sounds fun. A data agent can analyze that data, extract the important bits, and generate a report for you, all within the automated workflow. OK, so if the data agents are the analysts, are the control agents like the project managers making sure it all comes together? That's a great analogy. Control agents handle the dynamic aspects of the workflow-- those if-this-then-that scenarios. They can assess a situation and choose the best course of action just like a human would. Wow, so they're not just following a predetermined path. They're making decisions on the fly. This is light years beyond basic automation. It really is. 
And to really illustrate this, the researchers use a really interesting case study with Google Sheets. Imagine you're a manager, and you've got this spreadsheet with hundreds of different business lines. Hundreds of business lines. I can already feel the headache coming on. Right, and each one might have unique needs. Some need detailed reports emailed out. Others might just need a quick update on Slack. Traditionally, you'd need a human to look at each one, figure out the best way to handle it. Oh, for sure. You'd need a whole team just to manage that. But in this case study, ProAgent uses a control agent to do the reading and the decision making. So it's not just matching keywords or something. It's actually understanding the context of each business line. You got it. The control agent can actually analyze the description of a business line and say, OK, this one seems more business-to-customer, so it needs this kind of report. That's pretty impressive. So the control agent is like the conductor of an orchestra, making sure everything flows smoothly, and each instrument plays its part at the right time. But what about the actual report writing? That's where those data agents step in, right? Exactly. Let's say the control agent flags a business line that requires a super detailed performance report. The data agent swoops in, pulls the relevant data points from the spreadsheet, crunches the numbers, and even adds in some insightful summaries. Hold on. It can actually generate insights. Like, it's not just spitting out numbers. It can analyze the data and tell me what's important. That's the really exciting part. This paper shows that ProAgent can tap into the power of LLMs to move beyond just simple reporting. We're talking about identifying trends, comparing performance across different business lines. It could probably even make suggestions based on the data, right? Exactly. This is about real data-driven insights. 
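The division of labor described in this case study can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ProAgent's actual code: the names `BusinessLine`, `control_agent`, `data_agent`, and `run_workflow` are hypothetical, the keyword check is a stand-in for the LLM judgment call a real control agent would make, and the rule that consumer-facing lines get the emailed report is assumed for the example.

```python
from dataclasses import dataclass


@dataclass
class BusinessLine:
    name: str
    description: str
    revenue: list  # revenue figures, oldest first


def control_agent(line):
    """Decide how to handle a business line.

    ASSUMPTION: a real control agent would ask an LLM to read the
    description; this keyword check only stands in for that call.
    """
    text = line.description.lower()
    if "consumer" in text or "retail" in text:
        return "email_report"
    return "slack_update"


def data_agent(line):
    """Turn the raw numbers into a one-line summary (stand-in for LLM analysis)."""
    first, last = line.revenue[0], line.revenue[-1]
    trend = "up" if last > first else "down" if last < first else "flat"
    return f"{line.name}: revenue {trend} ({first} -> {last})"


def run_workflow(lines):
    """Fixed workflow skeleton; the agents fill in the judgment and the content."""
    outbox = []
    for line in lines:
        if control_agent(line) == "email_report":
            outbox.append(("email", data_agent(line)))
        else:
            outbox.append(("slack", f"{line.name}: no detailed report needed"))
    return outbox


SAMPLE_LINES = [
    BusinessLine("Snacks", "Retail snacks sold to consumers", [100, 120]),
    BusinessLine("Cloud APIs", "Enterprise platform services", [300, 280]),
]

for channel, message in run_workflow(SAMPLE_LINES):
    print(channel, "|", message)
```

The loop and the two channels are ordinary scripted automation; only the two agent functions would need a model behind them, which is the split between workflow and intelligence the hosts describe.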
OK, now I'm really seeing how this could be a game changer. Even for someone like me, who doesn't necessarily geek out over all the automation jargon, this has huge implications. It absolutely does. Think about all those tasks in your work day that could be handled by a system like ProAgent. Those things that eat up your time because they involve, you know, gathering information from different places, making judgment calls. It's like those tasks that, you know, could theoretically be automated, but they require that extra bit of human touch. Precisely. APA has the potential to bridge that gap. Imagine you could be freeing up all this mental bandwidth. All that time you'd normally spend on these tedious tasks, you could be focusing on the strategic stuff, the creative stuff, the work that really needs your unique human perspective. It's like having an army of AI assistants working tirelessly behind the scenes, handling all the heavy lifting so you can focus on the big picture. And it's not just about productivity. It's about reducing that feeling of information overload. APA could help us sift through all the noise, analyze data more effectively, and ultimately make better, more informed decisions. This all sounds incredibly promising, but where do we go from here? What's next for APA and ProAgent? That's the million dollar question. What's so exciting about this research is that it's really just the tip of the iceberg. As LLMs continue to evolve, we can expect to see even more sophisticated versions of APA capable of handling increasingly complex tasks. So we could be talking about even more autonomy, even more intelligence, baked into these systems. What kind of impact could that have on the way we work and live? Imagine a world where personalized automation is the norm. Systems like ProAgent could learn your specific preferences, anticipate your needs. Essentially, become an extension of your own expertise. That's amazing. 
We're talking about a whole new level of human-AI collaboration, where technology augments our abilities instead of replacing them. This feels like a pivotal moment in the evolution of automation. It really does. And while the possibilities are incredibly exciting, it also raises some important considerations about the future of work, how we navigate this evolving landscape. Yeah, it's fascinating to think about. As we're unlocking these new levels of automation, it really makes you wonder, what does work even look like in a future where AI can handle so much of what we do today? Yeah, it's a question we'll all be wrestling with in the coming years, for sure. On the one hand, it's incredibly exciting to think about all the possibilities, right? A world with less drudgery, more time to focus on the things that truly inspire us. But like you said, there are always two sides to every coin. Absolutely. As with any really transformative technology, we need to be mindful of the potential challenges. For example, as APA becomes more and more sophisticated, how do we ensure transparency in the decision making? If an AI is calling the shots, how do we understand its reasoning? Oh, that's such a good point. It's one thing to trust an AI with scheduling emails. But when we're talking about tasks that have real world consequences, transparency becomes absolutely crucial. We need to be able to see how these systems are arriving at their conclusions. Exactly. And beyond just transparency, there's the question of accountability. If an AI makes a mistake, who's responsible? Is it the developers who created the system, the users who deployed it? These are some seriously complex questions. It really highlights how we're entering this new era, where ethics and technology are becoming so intertwined. As APA and other AI-driven systems become more prevalent in our lives, it's more important than ever to have open and honest conversations about the implications. 100%. 
And it's not just about having these conversations among technologists and policymakers. It's about bringing everyone to the table. Exactly. Because at the end of the day, these technologies are going to impact all of us, right? They will. It's about demystifying AI, making these conversations accessible, and deciding together what role we want these technologies to play. It's not about letting AI dictate the future. It's about using these incredible tools to help us build the future that we want. Well said. I couldn't agree more. Well, on that note, for our listeners, I hope this deep dive has sparked your curiosity about agentic process automation and given you plenty to ponder as we venture into this exciting new frontier of, well, everything. It's been a pleasure exploring these ideas with you. And as always, thank you for joining us on the deep dive. We'll see you next time for another deep dive into the world of cutting-edge technology and its impact on our lives. Thank you for joining the Bot Nirvana podcast. We'd appreciate it if you could leave a review on iTunes or wherever you're consuming your podcasts. Catch the show notes on botnirvana.org. While you are there, feel free to explore more free digital automation resources and more. See you next time.
--------
11:06
OCR 2.0
In this podcast, we dive into the new concept of OCR 2.0 - the future of OCR with LLMs. We explore how this new approach addresses the limitations of traditional OCR by introducing a unified, versatile system capable of understanding various visual languages. We discuss the innovative GOT (General OCR Theory) model, which utilizes a smaller, more efficient language model. The podcast highlights GOT's impressive performance across multiple benchmarks, its ability to handle real-world challenges, and its capacity to preserve complex document structures. We also examine the potential implications of OCR 2.0 for future human-computer interactions and visual information processing across diverse fields. Key Points Traditional OCR vs. OCR 2.0 Current OCR limitations (multi-step process, prone to errors) OCR 2.0: A unified, end-to-end approach Principles of OCR 2.0 End-to-end processing Low cost and accessibility Versatility in recognizing various visual languages GOT (General OCR Theory) Model Uses a smaller, more efficient language model (Qwen) Trained on diverse visual languages (text, math formulas, sheet music, etc.) Training Innovations Data engines for different visual languages E.g. LaTeX for mathematical formulas Performance and Capabilities State-of-the-art results on standard OCR benchmarks Outperforms larger models in some tests Handles real-world challenges (blurry images, odd angles, different lighting) Advanced Features Formatted document OCR (preserving structure and layout) Fine-grained OCR (precise text selection) Generalization to untrained languages This episode was generated using Google Notebook LM, drawing insights from the paper "General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model". Stay ahead in your AI journey with Bot Nirvana AI Mastermind. Podcast Transcript: All right, so we're diving into the future of OCR today. Really interesting stuff. 
Yeah, and you know how sometimes you just scan a document, you just want the text, you don't really think twice about it. Right, right. But this paper, "General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model". Catchy title. I know, right? But it's not just the title, they're proposing this whole new way of thinking about OCR. OCR 2.0 as they call it. Exactly, it's not just about text anymore. Yeah, it's really about understanding any kind of visual information, like humans do. So much bigger. It's a really ambitious goal. Okay, so before we get ahead of ourselves, let's back up for a second. Okay. How does traditional OCR even work? Like when you and I scan a document, what's actually going on? Well, it's kind of like, imagine an assembly line, right? First, the system has to figure out where on the page the actual text is. Find it. Right, isolate it. Then it crops those bits out. Okay. And then it tries to recognize the individual letters and words. So it's like a multi-step? Yeah, it's a whole process. And we've all been there, right? When one of those steps goes wrong. Oh, tell me about it. And you get that OCR output that's just… Gibberish, total gibberish. The worst. And the paper really digs into this. They're saying that whole assembly line approach, it's not just prone to errors, it's just clunky. Yeah, very inefficient. Like different fonts can throw it off. Right. Different languages, forget it. Oh yeah, if it's not basic printed text, OCR 1.0 really struggles. It's like it doesn't understand the context. Yeah, exactly. It's treating information like it's just a bunch of isolated letters, instead of seeing the bigger picture, you know, the relationships between them. It doesn't get the human element of it. It's missing that human touch, that understanding of how we visually organize information. And that's a problem. A big one. Especially now, when we're just like drowning in visual information everywhere you look. 
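The assembly line described above (find the text, crop it, recognize it) can be mimicked with a toy example. This is a minimal sketch under loud assumptions: the "page" is a grid of characters instead of pixels, and `detect_text_region`, `crop`, `recognize`, and `shifted_detector` are hypothetical helpers written for this illustration, not part of any OCR library. The point is only to show how a mistake in the first stage flows through the later ones.

```python
from dataclasses import dataclass


@dataclass
class Box:
    top: int
    left: int
    bottom: int  # exclusive
    right: int   # exclusive


# A toy "page": spaces are blank paper, letters are ink.
PAGE = [
    "          ",
    "  HELLO   ",
    "  WORLD   ",
    "          ",
]


def detect_text_region(page):
    """Stage 1: find the tightest box around everything that is not blank."""
    rows = [r for r, line in enumerate(page) if line.strip()]
    cols = [c for line in page for c, ch in enumerate(line) if ch != " "]
    return Box(min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def crop(page, box):
    """Stage 2: cut the detected region out of the page."""
    return [line[box.left:box.right] for line in page[box.top:box.bottom]]


def recognize(region):
    """Stage 3: 'read' the cropped region line by line."""
    return " ".join(line.strip() for line in region)


def ocr_pipeline(page, detector=detect_text_region):
    """Chain the three stages; each one trusts the output of the one before."""
    return recognize(crop(page, detector(page)))


def shifted_detector(page):
    """A faulty stage 1: the box is off by two columns."""
    good = detect_text_region(page)
    return Box(good.top, good.left + 2, good.bottom, good.right + 2)


print(ocr_pipeline(PAGE))                    # HELLO WORLD
print(ocr_pipeline(PAGE, shifted_detector))  # LLO RLD
```

Nothing in the later stages can notice or repair the bad box, which is the weakness the end-to-end approach is meant to remove.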
It's true, we need something way more powerful than what we have now. We need a serious upgrade. Enter OCR 2.0. That's what they're proposing, yeah. So what's the magic formula? What makes it so different from what we're used to? Well, the paper lays out three main principles for OCR 2.0. Okay. First, it has to be end to end. It needs to be… End to end. Low cost, accessible. Got it. And most importantly, it needs to be versatile. Versatile, that's a good one. So okay, let's break it down. End to end: does that mean ditching that whole assembly line thing we were talking about? Exactly, yeah. Instead of all those separate steps, OCR 2.0, they're saying it should be one unified model. Okay. One model that can handle the entire process. So much simpler. And much more efficient. Okay, that makes sense. And easier to use, which is key. And then low cost, I mean. Oh, absolutely. That's got to be a priority. We want this to be accessible to everyone, not just… Sure. You know. Right, not just companies with tons of resources. Exactly. And the researchers were really clever about this. Yeah. They actually chose to use a smaller, more efficient language model. Oh, really? Yeah, it's called Qwen, and… Instead of one of the massive ones that's been in the news. Exactly. And they proved that you don't need this giant energy-guzzling model to get really impressive results with OCR. So efficient and powerful. I like it. That's the goal. But versatile. That's the part that always gets me thinking because… It's where things get really interesting. Yeah, we're not even just talking about recognizing text anymore. No, it's about recognizing any kind of… Visual information. Visual information that humans create, right? Yeah. Like, think about it. Math formulas, diagrams, even something like sheet music. Hold on. Sheet music. Like actually reading music. Yeah. And it's a really good example of how different this is. Okay. 
Because music, it's not just about recognizing the notes themselves. Right. It's about understanding the timing, the rhythm. So, a language. How those symbols all relate to each other. It's a whole system. That's wild. Okay, so how do they even begin to teach a machine to do that? Well, they got really creative with the training data. Okay. Instead of just feeding it like raw text and images, they built these data engines to teach GOT different visual languages. Data engines. That sounds intense. Yeah, it's basically like, imagine for the sheet music they used, let me see, it's called Humdrum **kern. Okay. And essentially what that does is it turns musical notation into code. Oh, interesting. So GOT learned to connect those visual symbols to their actual musical meaning. So it's learning the language. Exactly. That's incredible, but sheet music's just one example, right? What other kind of crazy stuff did they throw at this thing? Oh, they really tried everything. Math formulas, those are always fun. I bet. Molecular formulas, even simple geometric shapes, squares and circles. Really? Yeah, they used all sorts of tricks to represent these visual elements as code. So GOT could understand it. Exactly. Like for the math formulas, they used a language called LaTeX. Have you heard of that one? Yeah, yeah, that's how a lot of scientists and mathematicians, they use that to write equations. Exactly. It's how they write it so computers can understand it. It's like the code of math. Exactly. And so by training GOT on LaTeX, they weren't just teaching it to recognize what a formula looks like. Right, right. They were teaching it the underlying structure, like the grammar of math itself. Okay, now that is really cool. Yeah, and they found that GOT could actually generalize this knowledge. It could even recognize elements of formulas that it had never seen before. No way. 
It was like it was starting to understand the language of math, which is pretty incredible when you think about it. Yeah, that's wild. Okay, so we've got this model. It can recognize text. It can recognize all these other complex visual languages. We're getting somewhere. But how does it actually perform? Like does it actually live up to the hype? So this is it, huh? We've got this super OCR model that's been trained on everything but the kitchen sink. Time to put it to the test. They put it through the wringer. Yeah. What did they even start with? Well, the classics, right? Plain document OCR, PDFs, articles, that kind of thing. Basic but important. Exactly. And they tested it in both English and Chinese just to see how well-rounded it was. And, drumroll, how did it do? Crushed it. Absolutely crushed it. No way. State-of-the-art performance on all the standard document OCR benchmarks. That's amazing. Oh, and here's the really interesting part. It actually outperformed some much larger, more complex models in their tests. So it's efficient and it's powerful. That's a winning combo. Exactly. It shows you don't always have to go bigger to get better results. Okay, that's awesome. But what about real-world stuff? You know, the messy stuff. Oh, they thought of that. Like trying to read a sign with a weird font or a crumpled-up napkin with handwriting on it? Yep. All that. They have these data sets specifically designed to trip up OCR systems with blurry images, weird angles, different lighting. The stuff nightmares are made of. Right. And GOT handled it all like a champ. It was really impressive. Okay, so this isn't just some theoretical thing. It actually works. It's the real deal. I'm sold. But there was another thing they mentioned, something about formatted document OCR. What is that exactly? That's where things get really elegant. With formatted documents, it's not just about recognizing the words. Right. It's about understanding the structure of a document. 
Okay, like the headings and bullet points? Exactly. Tables, the whole nine yards. It's about preserving the way information is organized. So it's like imagine being able to convert a complex PDF into a perfectly formatted Word doc automatically. Precisely. That's the dream, right? It would save me so many hours of my life. Oh, tell me about it. No more reformatting everything manually. Did GOT actually manage to do that? It did. And it wasn't just a fluke. The researchers found that GOT was consistently able to preserve document structure, which really shows that this OCR 2.0 approach, it can understand information hierarchy in a way that we just haven't seen before. That's a game changer. Okay, before I forget, we got to talk about that fine-grained OCR thing they mentioned. Yes, that's where it gets really precise. It sounds like you have microscopic control over the text. Like you're telling it exactly what to read. Yeah. It's like having a laser pointer for text. You can say, read the text in that green box over there, or read the text between these coordinates on the image. That is wild. And how accurate is it when you get that specific? It was surprisingly accurate, even at that level of granularity. That's amazing. And they didn't even have to specifically train it for every little thing. Well, that's the best part. They actually found that GOT could sometimes recognize text in languages they hadn't even trained it on. What? Are you serious? Yeah. It's because it had encountered similar characters in different contexts, so it was able to make educated guesses. So it's learning. It's actually learning. Exactly. It's not just pattern matching anymore. It's actually generalizing its knowledge. Okay, so big picture here. Is OCR 2.0 the real deal, or is this just hype? I think the results speak for themselves. This isn't just a minor upgrade. This is a fundamental shift in how we think about extracting meaning from images. 
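To make the "read the text between these coordinates" idea concrete, here is a small Python sketch of region-restricted reading. It is not how GOT does it internally; the transcript only says the model can be pointed at a box or a set of coordinates. The sketch assumes a hypothetical list of already-recognized words with pixel bounding boxes (`Word`, `WORDS`) and a hypothetical helper `read_region` that keeps only the words lying fully inside the requested rectangle.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x0: int  # left edge
    y0: int  # top edge
    x1: int  # right edge
    y1: int  # bottom edge


def read_region(words, x0, y0, x1, y1):
    """Return the text of all words fully inside the rectangle, in reading order."""
    inside = [
        w for w in words
        if w.x0 >= x0 and w.y0 >= y0 and w.x1 <= x1 and w.y1 <= y1
    ]
    # Reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (w.y0, w.x0))
    return " ".join(w.text for w in inside)


# ASSUMPTION: made-up word boxes for a small receipt-like image.
WORDS = [
    Word("Invoice", 10, 10, 80, 30),
    Word("$42.00", 70, 100, 130, 120),
    Word("Total:", 10, 100, 60, 120),
    Word("Thanks", 10, 200, 70, 220),
]

print(read_region(WORDS, 0, 90, 200, 130))  # Total: $42.00
```

With a model like the one discussed, the rectangle would be part of the request and the model would read only that area of the image; the filter here just shows what the caller is asking for.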
GOT proves that this OCR 2.0 approach, it's not just a pipe dream. It has incredible potential to change everything. Yeah, it really feels like we're moving beyond just digitizing stuff. You know, it's like machines are actually starting to understand what they're seeing. Exactly. It's a whole new era of human-computer interaction. And if GOT can already handle sheet music and geometric shapes and complex document formatting, I mean, the possibilities are, it's kind of mind-blowing. It really makes you wonder what other fields are on the verge of their own 2.0 transformations. That's a great question, one to ponder. But for now, this has been an incredible deep dive into the future of OCR. Thanks for joining me. And until next time, keep those minds curious.
--------
11:06
JP Morgenthal
JP Morgenthal (JP) is a seasoned expert in applied AI and automation. With over 20 years of experience as a Chief Technology Officer (CTO) and Solution Architect, JP has been a driving force behind digital transformation for Fortune 1000 companies. His expertise spans IT architecture, cloud strategies, and large-scale system implementations. Currently, JP is the Vice President of Solution Engineering at CafeX Communications, following prominent roles as CTO of Automation Anywhere and App Services at DXC. In this episode, we delve into the convergence of various automation technologies like RPA, BPM, iPaaS, and AI. JP shares insights on the influence of new AI advancements, including Large Language Models (LLMs) and AI agents, and explores the future trends in intelligent automation. Join us as we unpack these topics, offering a glimpse into how these innovations reshape the technological landscape. More information and Links: More about JP Morgenthal: https://jpmorgenthal.com/ Connect with JP Morgenthal: linkedin.com/in/jpmorgenthal/ Visit Nandan on the web at nandan.info
--------
28:40

About Bot Nirvana | AI & Automation Podcast

Bot Nirvana is a podcast on all things Intelligent Automation. We cover RPA, AI, Process Intelligence, Process Mining, and a host of other tools and techniques for intelligent automation.
